TextRank throws an error when using a "»" character #4

@BenParizek

I am processing a block of text with TextRank and it is throwing an error. The text is in French. The language is detected correctly. The part of the text that seems to be throwing the error is:

... «les derniers jours de guerre» ...

TextRank returns the following keywords, with the closing guillemet (», raquo) encoded incorrectly:

accord historique,Colombie,jours,guerre�,derniers

It appears the invalid character gets introduced in the DefaultEvents::get_words method:

public function get_words($text)
{
    $words = preg_split('/(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/', $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
    return array_values(array_filter(array_map('trim', $words)));
}

The text looks fine before preg_split is called; afterwards, the value in $words is incorrectly encoded.

I've tried adding the guillemets to the French stopwords and replacing preg_split with mb_split; both attempts either do not work or cause other issues. It's worth noting that the opening guillemet («) seems to be processed fine. Only the closing one (») seems to cause the issue.
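
One possible explanation, offered as a sketch rather than a confirmed diagnosis: the pattern has no `u` modifier, so PCRE matches byte by byte. Assuming the input is UTF-8, `»` is the two bytes 0xC2 0xBB. The trailing byte can match `\p{P}` on its own and be split away, leaving a stray 0xC2 attached to `guerre`. That would also fit the opening guillemet surviving, since its first byte (0xC2) is not punctuation and the split stops before it. The snippet below runs the same pattern with and without `u`. The names `WORD_SPLIT`, `split_words`, and `is_valid_utf8` are hypothetical and not part of TextRank.

```php
<?php
// Pattern body copied from the get_words() method quoted above.
const WORD_SPLIT = '(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))';

// Hypothetical helper: same logic as get_words(), with optional pattern modifiers.
function split_words(string $text, string $modifiers = ''): array
{
    $flags = PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE;
    $words = preg_split('/' . WORD_SPLIT . '/' . $modifiers, $text, -1, $flags);
    return array_values(array_filter(array_map('trim', $words)));
}

// Hypothetical helper: an empty pattern with /u fails to match on malformed UTF-8.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// Sample input (assumption: UTF-8 text with a closing guillemet, U+00BB).
$text = "les derniers jours de guerre\u{00BB} ont";

$bytes = split_words($text);      // no modifier: the subject is treated as raw bytes
$utf8  = split_words($text, 'u'); // "u": pattern and subject are treated as UTF-8

var_dump(is_valid_utf8(implode(' ', $bytes)));
var_dump(is_valid_utf8(implode(' ', $utf8)));
```

If this is the cause, adding `u` to the pattern in `get_words` may be enough to keep the words intact. Note that with `u` the guillemets come back as separate tokens, so they may still need to be handled by the stopword list.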
