normalize_extracts: inserting spaces around embedded dashes is not always appropriate #650

Open
@mmartin9684-sil

Description

It's not always appropriate to normalize a word with embedded punctuation by inserting spaces before and after the embedded punctuation character.

A couple of counterexamples from target sentences in recent XRI datasets:

Original sentence: Niri poki pule kua-kua dabe edieng wao pihak ruha ihi partai nbe tenama tule hu'a gu mege wai.
Normalized sentence: Niri poki pule kua - kua dabe edieng wao pihak ruha ihi partai nbe tenama tule hu'a gu mege wai.

Original sentence: Ge a bi? nulu-waleng nu tenama dia wai dabe soro hulu mata nbe
Normalized sentence: Ge a bi? nulu - waleng nu tenama dia wai dabe soro hulu mata nbe

Normalizing the word 'kua-kua' to 'kua - kua', or the word 'nulu-waleng' to 'nulu - waleng', is not correct: in both cases the dash is embedded in a single word.
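
One possible direction, sketched under assumptions: leave a dash untouched when it has a letter or digit on both sides, and pad it with spaces otherwise. The function name `pad_free_dashes` is hypothetical and is not the existing `normalize_extracts` code. Whether this rule is right for every language and script in the datasets is an open question.

```python
import re

_DASH = re.compile(r"-")
_MULTI_SPACE = re.compile(r"\s+")


def pad_free_dashes(sentence: str) -> str:
    """Pad dashes with spaces unless they sit between two alphanumeric characters.

    Hypothetical helper for illustration only; not the project's implementation.
    """

    def _pad(match: re.Match) -> str:
        i = match.start()
        prev_ch = sentence[i - 1] if i > 0 else ""
        next_ch = sentence[i + 1] if i + 1 < len(sentence) else ""
        # A dash flanked by letters/digits is treated as word-internal and kept.
        if prev_ch.isalnum() and next_ch.isalnum():
            return "-"
        # Any other dash is treated as free-standing punctuation.
        return " - "

    padded = _DASH.sub(_pad, sentence)
    # Collapse any doubled spaces introduced by the padding.
    return _MULTI_SPACE.sub(" ", padded).strip()


print(pad_free_dashes("Niri poki pule kua-kua dabe edieng wao"))
print(pad_free_dashes("nulu-waleng nu tenama -dia wai"))
```

With this rule, 'kua-kua' and 'nulu-waleng' pass through unchanged, while a dash that is not enclosed by word characters is still separated out.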

Metadata

Labels

invalid (This doesn't seem right)
pipeline 2: extract (Issue related to extracting parallel corpora)
pipeline 3: preprocess (Issue related to preprocessing.)
