Normalize German umlaut #125

michelole · 2019-12-03T19:37:23Z

Normalize German umlauts ("ä" -> "ae", etc.) before applying filtering rules, preferably at training time, to make the models (a bit) denser.
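
A minimal sketch of what such a normalization step could look like, assuming plain Python strings. The function name `normalize_umlauts` and the mapping table are hypothetical and not taken from the codebase; only "ä" -> "ae" is spelled out above, and the remaining entries (including the capitalisation of the uppercase digraphs) are assumptions.

```python
import unicodedata

# Hypothetical mapping: the issue only names "ä" -> "ae"; the other
# entries and the "Ae"/"Oe"/"Ue" capitalisation are assumptions.
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def normalize_umlauts(text: str) -> str:
    """Replace German umlauts with their two-letter ASCII digraphs."""
    # Compose first so that "u" + combining diaeresis is treated like "ü".
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(UMLAUT_MAP)
```

Applying the same function to the training data and to incoming text before the filtering rules run would keep both sides consistent. All-caps abbreviations may need separate handling, since "Ü" becomes "Ue" here rather than "UE".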

Branched from #87.

A German stemmer (#123) would fix it.

michelole · 2019-12-03T19:50:10Z

This does not seem to be a real problem in our training set (e.g. "ueber*", which would expand both ÜZ and ÜLT, occurs only twice), so I am downgrading the priority.

michelole · 2020-08-19T14:47:05Z

Consider using unidecode instead of the old transliterate_to_seven_bit function.
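
One caveat worth checking before switching: generic 7-bit transliteration usually drops the diacritic instead of producing the German digraph, and as far as I know unidecode also returns "a" for "ä". The sketch below uses only the standard library to illustrate that behaviour. `to_seven_bit` is a hypothetical stand-in and is neither unidecode nor the project's `transliterate_to_seven_bit`.

```python
import unicodedata


def to_seven_bit(text: str) -> str:
    """Hypothetical stand-in for a generic 7-bit transliteration."""
    # NFKD splits "ü" into "u" plus a combining diaeresis; encoding to
    # ASCII with errors="ignore" then drops the combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

With this approach "über" becomes "uber", not "ueber", so any explicit "ä" -> "ae" style mapping would have to run before the generic fallback.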

michelole added the P1 Higher priority issues, a SHOULD label Dec 3, 2019

michelole added P2 High priority issues, a COULD and removed P1 Higher priority issues, a SHOULD labels Dec 3, 2019
