Tokenization fails when certain katakana words follow (株) #14

jnory · 2019-05-20T12:18:46Z

Hi,

First, thank you for this great product.

I recently noticed that this plugin fails to tokenize text in some contexts.

For example, the following texts fail to tokenize:

(株)サイゼリヤ
(株)アクセス
(株)ライン
(株)ライト

It seems that the plugin fails when a katakana word appears immediately after (株), although not every katakana word triggers the failure.
Could you look into this issue?

Thanks in advance,

Reproduction Procedure

[Step 1] Create Dockerfile

FROM elasticsearch:6.5.1
  
RUN elasticsearch-plugin install -b org.codelibs:elasticsearch-analysis-kuromoji-neologd:6.5.1

[Step 2] Build docker image

docker build -t es6 .

[Step 3] Run Elasticsearch

docker run -p 9200:9200 es6

[Step 4] Query with this text

% curl -s -H 'Content-Type:application/json' -XPOST http://localhost:9200/_analyze -d '{"tokenizer": "kuromoji_neologd_tokenizer", "text": "(株)サイゼリヤ", "explain": true}' | jq .
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "kuromoji_neologd_tokenizer",
      "tokens": []
    },
    "tokenfilters": []
  }
}
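
To check all four sample strings in one pass instead of repeating the curl call, something like the following sketch may help. It assumes the container from Step 3 is listening on http://localhost:9200 and that the response has the same "explain" shape as the output above. The function names (build_analyze_body, tokenizer_tokens, analyze, report) are made up for this example and are not part of the plugin or of Elasticsearch.

```python
import json
import urllib.request

# Sample inputs copied from the report above.
SAMPLES = ["(株)サイゼリヤ", "(株)アクセス", "(株)ライン", "(株)ライト"]


def build_analyze_body(text, tokenizer="kuromoji_neologd_tokenizer"):
    """Return the UTF-8 encoded JSON body for an _analyze request with explain enabled."""
    payload = {"tokenizer": tokenizer, "text": text, "explain": True}
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def tokenizer_tokens(response):
    """Extract the tokenizer's token list from an explain-style _analyze response."""
    return response.get("detail", {}).get("tokenizer", {}).get("tokens", [])


def analyze(text, base_url="http://localhost:9200"):
    """POST one text to the _analyze endpoint (base_url is an assumption) and parse the reply."""
    request = urllib.request.Request(
        base_url + "/_analyze",
        data=build_analyze_body(text),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read().decode("utf-8"))


def report(samples=SAMPLES, base_url="http://localhost:9200"):
    """Print the number of tokens per sample; an empty list reproduces the failure."""
    for sample in samples:
        tokens = tokenizer_tokens(analyze(sample, base_url))
        status = "EMPTY" if not tokens else "ok"
        print(f"{sample}\t{len(tokens)} token(s)\t{status}")
```

Calling report() with the container running prints one line per sample. Any line marked EMPTY corresponds to the empty "tokens" array shown in Step 4.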
