TextRank throws an error when using a "»" character #4

Open
BenParizek opened this issue Jan 23, 2017 · 3 comments

@BenParizek

I am processing a block of text with TextRank and it is throwing an error. The text is in French. The language is detected correctly. The part of the text that seems to be throwing the error is:

... «les derniers jours de guerre» ...

TextRank returns the following, with the final raquo being encoded incorrectly:

accord historique,Colombie,jours,guerre�,derniers

It appears the invalid character gets introduced in the DefaultEvents::get_words method:

public function get_words($text)
{
    $words = preg_split('/(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/', $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
    return array_values(array_filter(array_map('trim', $words)));
}

The text appears fine before preg_split is called; afterwards, the value stored in $words is encoded incorrectly.

I've tried adding the » character to the French stopwords and replacing preg_split with mb_split. Neither attempt worked, or each introduced other issues. It's worth noting that the opening « seems to be processed fine; it's the closing » that seems to cause the issue.

BenParizek changed the title from 'Text rank throws an error when using a "»" character' to 'TextRank throws an error when using a "»" character' on Jan 23, 2017
@crodas
Owner

crodas commented Jan 24, 2017

@BenParizek I think I know where the problem is. Let me run some tests locally, and I will add some PHPUnit tests as well.

If you are in a rush, I would change preg_split to use the u (UTF-8) modifier, or perhaps use mb_split instead:

public function get_words($text)
{
    $words = preg_split('/(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u', $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
    return array_values(array_filter(array_map('trim', $words)));
}
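
A minimal sketch of why the u modifier matters here, assuming the input text is UTF-8 encoded. Without u, PCRE reads the subject byte by byte, so the second byte of » (0xBB) can be treated as punctuation on its own while the first byte (0xC2) stays attached to the preceding word. The helper names split_words and is_valid_utf8 and the sample string are hypothetical and not part of TextRank; only the pattern is taken from get_words above.

```php
<?php
// Hypothetical helper: same pattern as get_words(), with the "u" modifier optional.
function split_words(string $text, bool $utf8): array
{
    $pattern = '/(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/' . ($utf8 ? 'u' : '');
    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
}

// Hypothetical helper: an empty pattern with "u" only matches valid UTF-8 subjects.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// "\u{BB}" is the closing guillemet, written as an escape so the example
// does not depend on the encoding of this source file (PHP 7+).
$text = "guerre\u{BB} et";

foreach (['byte mode' => false, 'UTF-8 mode' => true] as $label => $utf8) {
    foreach (split_words($text, $utf8) as $token) {
        // Print each token as hex so broken byte sequences are visible.
        printf("%s: %s valid=%s\n", $label, bin2hex($token), is_valid_utf8($token) ? 'yes' : 'no');
    }
}
```

In byte mode the first token should end in a lone 0xC2 byte, which is what renders as the replacement character in the reported output. This may also explain why the opening « survives: its first byte (0xC2) is not punctuation in byte mode, so the delimiter match stops before it and both bytes stay together.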

@fawzib

fawzib commented Jan 6, 2018

any plans to push a new version?

@BenParizek
Author

@fawzib I think the change was pushed on the develop branch: 073b902

It would be nice to see this package released with a version number instead of just needing to require dev-master.
