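Pinyin first-letter query problem: no results when the second character's pinyin first letter is the same as the first character's final (vowel part) #293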

Open
Jiangtao976 opened this issue Oct 7, 2023 · 1 comment

{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "default_pipeline": "biz_timestamp_pipeline",
    "analysis": {
      "analyzer": {
        "pinyin_analyzer": {
          "tokenizer": "my_pinyin"
        }
      },
      "tokenizer": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_separate_first_letter": true,
          "keep_full_pinyin": true,
          "keep_joined_full_pinyin": false,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true,
          "ignore_pinyin_offset": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "vendorName": {
        "type": "text",
        "analyzer": "pinyin_analyzer",
        "search_analyzer": "pinyin_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
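
A side note on the settings above: as far as I recall, the plugin's README warns that `remove_duplicated_term` can influence position-related queries, and `match_phrase` is one of those. The fragment below is the same `my_pinyin` block with only that option switched off. It belongs under `settings.analysis.tokenizer`, and it is something to try, not a confirmed fix for this issue. Documents would need to be reindexed before the indexed tokens change.

```json
{
  "my_pinyin": {
    "type": "pinyin",
    "keep_separate_first_letter": true,
    "keep_full_pinyin": true,
    "keep_joined_full_pinyin": false,
    "keep_original": true,
    "limit_first_letter_length": 16,
    "lowercase": true,
    "remove_duplicated_term": false,
    "ignore_pinyin_offset": false
  }
}
```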

Example 1:
Chinese text: 刘德华阿里巴巴
Analysis result:
{
  "tokens": [
    { "token": "l", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 },
    { "token": "liu", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 },
    { "token": "刘德华阿里巴巴", "start_offset": 0, "end_offset": 7, "type": "word", "position": 0 },
    { "token": "ldhalbb", "start_offset": 0, "end_offset": 7, "type": "word", "position": 0 },
    { "token": "d", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1 },
    { "token": "de", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1 },
    { "token": "h", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2 },
    { "token": "hua", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2 },
    { "token": "a", "start_offset": 3, "end_offset": 4, "type": "word", "position": 3 },
    { "token": "li", "start_offset": 4, "end_offset": 5, "type": "word", "position": 4 },
    { "token": "b", "start_offset": 5, "end_offset": 6, "type": "word", "position": 5 },
    { "token": "ba", "start_offset": 5, "end_offset": 6, "type": "word", "position": 5 }
  ]
}

Query:
{
  "query": {
    "match_phrase": {
      "vendorName": {
        "query": "ldha"
      }
    }
  }
}

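As you can see, the analysis result contains the first letters ldha, but the query returns no results. My guess is that the first letter "a" of "阿" cannot be matched because it is affected by the "a" in "华" (hua).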
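
One way to test that guess: because `search_analyzer` is also `pinyin_analyzer`, the query string `ldha` goes through the pinyin tokenizer too, and `match_phrase` only matches when the query's tokens line up with indexed tokens at consecutive positions. The request body below asks the Analyze API how the query string itself is tokenized. It would be sent to `GET /<index>/_analyze`, where `<index>` is a placeholder because the issue does not give the index name.

```json
{
  "analyzer": "pinyin_analyzer",
  "text": "ldha"
}
```

If the response is not `l`, `d`, `h`, `a` at consecutive positions (for example if `ha` came back as one token, which is only a guess here), the phrase cannot line up with the indexed tokens at positions 0 to 3.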

Example 2:
Chinese text: 深圳健安医药有限公司
{
  "tokens": [
    { "token": "s", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 },
    { "token": "shen", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 },
    { "token": "深圳健安医药有限公司", "start_offset": 0, "end_offset": 10, "type": "word", "position": 0 },
    { "token": "szjayyyxgs", "start_offset": 0, "end_offset": 10, "type": "word", "position": 0 },
    { "token": "z", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1 },
    { "token": "zhen", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1 },
    { "token": "j", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2 },
    { "token": "jian", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2 },
    { "token": "a", "start_offset": 3, "end_offset": 4, "type": "word", "position": 3 },
    { "token": "an", "start_offset": 3, "end_offset": 4, "type": "word", "position": 3 },
    { "token": "y", "start_offset": 4, "end_offset": 5, "type": "word", "position": 4 },
    { "token": "yi", "start_offset": 4, "end_offset": 5, "type": "word", "position": 4 },
    { "token": "yao", "start_offset": 5, "end_offset": 6, "type": "word", "position": 5 },
    { "token": "you", "start_offset": 6, "end_offset": 7, "type": "word", "position": 6 },
    { "token": "x", "start_offset": 7, "end_offset": 8, "type": "word", "position": 7 },
    { "token": "xian", "start_offset": 7, "end_offset": 8, "type": "word", "position": 7 },
    { "token": "g", "start_offset": 8, "end_offset": 9, "type": "word", "position": 8 },
    { "token": "gong", "start_offset": 8, "end_offset": 9, "type": "word", "position": 8 },
    { "token": "si", "start_offset": 9, "end_offset": 10, "type": "word", "position": 9 }
  ]
}

Query:
{
  "query": {
    "match_phrase": {
      "vendorName": {
        "query": "szja"
      }
    }
  }
}

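As you can see, the analysis result contains the first letters szja, but the query returns no results. My guess is that the first letter "a" of "安" cannot be matched because it is affected by the "a" in "健" (jian).

The same happens with other Chinese text. For example, 深圳恩 cannot be found with sze either; the first letter "e" of "恩" seems to be affected by the "e" in "深" (shen).

I have tried many parameter combinations and none of them solves this. Can anyone help?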
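
Since the analysis output in both examples contains a joined first-letter term (`ldhalbb`, `szjayyyxgs`), one thing that could be tried is a term-level `prefix` query. It is not analyzed, so the value is compared directly against the indexed terms. A sketch, untested against this index:

```json
{
  "query": {
    "prefix": {
      "vendorName": {
        "value": "ldha"
      }
    }
  }
}
```

This only covers abbreviations that start at the beginning of the name (`ldha`, but not `dha`), and the value has to be lowercase already because nothing lowercases it at search time.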

xiaoshi2013 commented Mar 2, 2024

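> Query:
> {
>   "query": {
>     "match_phrase": {
>       "vendorName": {
>         "query": "ldha"
>       }
>     }
>   }
> }
>
> As you can see, the analysis result contains the first letters ldha, but the query returns no results. My guess is that the first letter "a" of "阿" cannot be matched because it is affected by the "a" in "华" (hua).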

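The analysis result does not contain ldha as a single term, so the query cannot match. Change the query to liudehua and it will find the document.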
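
Written out as a request body, that suggestion is the original query with the full pinyin substituted. Whether it matches depends on the analyzer splitting `liudehua` into `liu`, `de`, `hua` at consecutive positions, which is assumed here and not verified:

```json
{
  "query": {
    "match_phrase": {
      "vendorName": {
        "query": "liudehua"
      }
    }
  }
}
```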
