Why deacc? #19

emillundhcodemill · 2018-02-01T09:39:48Z

I see that the parameter deacc is set to True for all languages when tokenize is called from clean_text_by_word and tokenize_by_word. This means that in Nordic languages, characters with diacritics such as å, ä, ö are reduced to a, a, o. This drastically reduces the quality of keyword extraction for Swedish, which makes heavy use of these characters and treats them as letters in their own right (for example, snö = snow, sno = twist; äga = own, aga = spank).
AFAICS, it's easy to set deacc=False and obtain much nicer results for Swedish; it should probably be done on a per-language basis (I suggest False for Danish, Finnish, German, Norwegian, and Swedish).
I'll be happy to file a PR, but I'd like to hear whether you have any comments first.
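
To make the collision concrete, here is a minimal sketch of what deaccenting does to the Swedish examples above. The `deaccent` helper is a hypothetical stand-in written with the standard library only; it is an assumption about the general technique (decompose, drop combining marks), not the library's actual implementation.

```python
import unicodedata


def deaccent(text):
    """Hypothetical stand-in for deacc=True: drop combining marks."""
    # NFD splits "ö" into "o" plus a combining diaeresis (category Mn).
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


# Word pairs taken from the comment above: distinct words that merge.
pairs = [("snö", "sno"), ("äga", "aga")]
for accented, plain in pairs:
    collides = deaccent(accented) == plain
    print(accented, "->", deaccent(accented), "| collides:", collides)
```

After this step both members of each pair are the same token, so anything computed downstream can no longer tell them apart.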


fbarrios · 2018-02-02T22:28:20Z

Hey @emillundhcodemill, thanks for your contributions!

I actually don't recall why deaccenting is always set to True. In our mother tongue (Spanish), a few words can be written with or without the accent (solo, for instance). In most cases, though, the accent mark is used consistently to show intonation or to differentiate meaning between one-syllable words that are spelled the same.

I would suggest defaulting deacc to False in all cases and making it a parameter of the keywords method. What do you think @fedelopez77?

fedelopez77 · 2018-02-03T14:22:27Z

I totally agree with @fbarrios. Honestly, we don't remember why we left it set to True.

@emillundhcodemill if you could just set it to False by default, and add it as an optional parameter in the keywords method, we would be more than happy to merge your contribution.

Thanks in advance!
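
For reference, the requested shape (deacc off by default, passed down as an optional parameter) could look roughly like the sketch below. All names here (`deaccent`, `tokenize`, `candidate_words`) are hypothetical and the bodies are simplified placeholders; this is not the library's code, only an illustration of threading the flag through.

```python
import re
import unicodedata


def deaccent(text):
    """Hypothetical helper: strip combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def tokenize(text, deacc=False):
    """Lowercase word tokens; accents are kept unless deacc=True."""
    # Normalize to NFC so accented letters are single word characters.
    text = unicodedata.normalize("NFC", text)
    if deacc:
        text = deaccent(text)
    return re.findall(r"\w+", text.lower())


def candidate_words(text, deacc=False):
    """Placeholder for a keywords-style entry point that forwards deacc."""
    return sorted(set(tokenize(text, deacc=deacc)))


print(candidate_words("Snö och sno"))              # accents preserved
print(candidate_words("Snö och sno", deacc=True))  # opt-in deaccenting
```

With the default, snö and sno stay separate candidates; callers who want the old behaviour can still pass deacc=True.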

emillundhcodemill · 2018-02-06T10:10:04Z

OK, done. #22

fbarrios mentioned this issue Feb 6, 2018

In preprocess, introduce global bool DEACCENT which is set depending … #20

Closed
