TextRank throws an error when using a "»" character #4

BenParizek · 2017-01-23T23:42:03Z

I am processing a block of text with TextRank and it is throwing an error. The text is in French. The language is detected correctly. The part of the text that seems to be throwing the error is:

... «les derniers jours de guerre» ...

TextRank returns the following, with the final raquo being encoded incorrectly:

accord historique,Colombie,jours,guerre�,derniers

It appears the invalid character gets introduced in the DefaultEvents::get_words method:

public function get_words($text)
{
    $words = preg_split('/(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/', $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
    return array_values(array_filter(array_map('trim', $words)));
}

The text appears fine before the preg_split method is called and gets encoded incorrectly in the $words variable afterwards.

I've tried adding the guillemets to the French stopwords and replacing preg_split with mb_split; neither attempt worked, or each introduced other issues. It's worth noting that the opening guillemet («) seems to be processed fine. It's the closing guillemet (») that seems to cause the issue.


crodas · 2017-01-24T03:58:11Z

@BenParizek I think I know where the problem is. Let me do some tests locally; I will add some PHPUnit tests as well.

If you are in a rush, I would change preg_split to use the u (UTF-8) modifier, or perhaps use mb_split instead:

public function get_words($text)
{
    $words = preg_split('/(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u', $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
    return array_values(array_filter(array_map('trim', $words)));
}
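The effect of the u modifier can be checked in isolation with a small script. This is only a sketch: the function names split_words and is_valid_utf8 and the sample sentence are made up for illustration, and only the regular expression is taken from get_words above. The likely explanation, assuming PCRE's documented byte-oriented behaviour without u, is that "»" is stored as the two bytes 0xC2 0xBB; the second byte on its own looks like punctuation to \p{P} and is split off, leaving a dangling 0xC2 on the end of "guerre". In "«" the 0xC2 byte comes first and is not punctuation, which would explain why the opening guillemet survives.

```php
<?php
// Regex body copied from get_words(); everything else here is illustrative.
const SPLIT_PATTERN = '(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))';

// Hypothetical stand-in for get_words() that lets us vary the modifiers.
function split_words(string $text, string $modifiers): array
{
    $flags = PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE;
    $words = preg_split('/' . SPLIT_PATTERN . '/' . $modifiers, $text, -1, $flags);
    return array_values(array_filter(array_map('trim', $words)));
}

// preg_match() with the u modifier returns false for a subject that is not valid UTF-8.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// The guillemets are written as explicit UTF-8 byte escapes so the result
// does not depend on how this file itself is saved.
$text = "\xC2\xABles derniers jours de guerre\xC2\xBB fin";

$withoutU = split_words($text, '');   // byte-oriented matching
$withU    = split_words($text, 'u');  // UTF-8 aware matching

foreach (['without u' => $withoutU, 'with u' => $withU] as $label => $tokens) {
    $invalid = count($tokens) - count(array_filter($tokens, 'is_valid_utf8'));
    printf("%s: %d tokens, %d not valid UTF-8\n", $label, count($tokens), $invalid);
}
```

Because PREG_SPLIT_DELIM_CAPTURE is set, the punctuation itself is also returned as tokens, so with the u modifier the guillemets should come back as separate, intact entries rather than as stray bytes.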

fawzib · 2018-01-06T05:17:25Z

Any plans to push a new version?

BenParizek · 2018-01-06T06:18:57Z

@fawzib I think the change was pushed on the develop branch: 073b902

It would be nice to see this package released with a version number instead of just needing to require dev-master.

BenParizek changed the title ~~Text rank throws an error when using a "»" character~~ TextRank throws an error when using a "»" character Jan 23, 2017
