I am processing a block of text with TextRank and it is throwing an error. The text is in French. The language is detected correctly. The part of the text that seems to be throwing the error is:
... «les derniers jours de guerre» ...
TextRank returns the following, with the final raquo being encoded incorrectly:
accord historique,Colombie,jours,guerre�,derniers
It appears the invalid character gets introduced in the DefaultEvents::get_words method:
public function get_words($text)
{
    $words = preg_split('/(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/', $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
    return array_values(array_filter(array_map('trim', $words)));
}
The text appears fine before the preg_split method is called and gets encoded incorrectly in the $words variable afterwards.
I've tried adding the raquos to the French stopwords and replacing preg_split with mb_split. Neither attempt appears to work, or each has other issues. It's worth noting that the opening raquo («) seems to get processed fine. It's the final raquo (») that seems to cause the issue.
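One possible explanation, offered as an assumption rather than something confirmed in this report: the pattern has no /u modifier, so PCRE works on single bytes instead of UTF-8 code points. In UTF-8, "»" is the two bytes C2 BB. If the byte BB on its own is treated as punctuation and stripped, the leading C2 is left attached to "guerre", which would produce the invalid character shown above. The sketch below copies the pattern from DefaultEvents::get_words and adds /u. The function name get_words_utf8 is hypothetical and is not part of TextRank.

```php
<?php
// Sketch only: get_words_utf8() is a hypothetical name, not part of TextRank.
// It reuses the pattern from DefaultEvents::get_words and adds the /u modifier
// so that PCRE matches whole UTF-8 code points instead of single bytes.
function get_words_utf8(string $text): array
{
    $words = preg_split(
        '/(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u',
        $text,
        -1,
        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
    );

    // Same post-processing as the original method.
    return array_values(array_filter(array_map('trim', $words)));
}

$words = get_words_utf8('accord «les derniers jours de guerre» historique');

// preg_match('//u', ...) returns 1 only when the subject is well-formed UTF-8,
// so this reports whether any token was cut in the middle of a character.
foreach ($words as $word) {
    echo $word, ' => ', preg_match('//u', $word) === 1 ? 'valid' : 'invalid', PHP_EOL;
}
```

With /u the guillemets come back as separate tokens because of PREG_SPLIT_DELIM_CAPTURE, so adding them to the French stopwords may still be needed to keep them out of the keywords.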
BenParizek changed the title from "Text rank throws an error when using a "»" character" to "TextRank throws an error when using a "»" character" on Jan 23, 2017