handling XML entities #3

@jtrmal

Hi, if the content of kwlist.xml is

  <kw kwid="KW">
    <kwtext>&lt;WORD&gt; word second_word</kwtext>
  </kw>

and the RTTM file is

 LEXEME utt 1 0 0.39 <WORD> <NA> <NA> <NA>
 LEXEME utt 1 0.39 0.15 word <NA> <NA> <NA>
 NON-LEX utt 1 0.54 0.05 <eps> <NA> <NA> <NA>
 LEXEME utt 1 0.59 0.17 second_word <NA> <NA> <NA>

then the alignment procedure does not match the keyword against these RTTM tokens (there is no entry in alignment.csv).
However, when I manually edit the RTTM file to contain this

 LEXEME utt 1 0 0.39 &lt;WORD&gt; <NA> <NA> <NA>
 LEXEME utt 1 0.39 0.15 word <NA> <NA> <NA>
 NON-LEX utt 1 0.54 0.05 <eps> <NA> <NA> <NA>
 LEXEME utt 1 0.59 0.17 second_word <NA> <NA> <NA>

the mapping will be created as expected.

I would expect the XML entities (apos, lt, gt, quot and amp) to be decoded/normalized before matching, because the XML specification forces characters such as < to be written in their "encoded" form, i.e. how these strings appear in kwlist.xml is not up to the user.
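
To illustrate the normalization I have in mind, here is a minimal Python sketch. It is not the tool's actual code: the helper names (`normalize_token`, `kwlist_tokens`, `rttm_lexemes`) are hypothetical, and it assumes the token is the sixth whitespace-separated field of an RTTM line, as in the examples above. A conforming XML parser already decodes the entities in kwlist.xml, so the sketch applies the same decoding to RTTM tokens, which makes the raw and the hand-escaped RTTM compare equal to the keyword.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape

# unescape() decodes &lt;, &gt; and &amp; by default; the remaining two
# predefined XML entities are passed explicitly.
EXTRA_ENTITIES = {"&apos;": "'", "&quot;": '"'}


def normalize_token(token):
    """Decode the five predefined XML entities in a single token (hypothetical helper)."""
    return unescape(token, EXTRA_ENTITIES)


def kwlist_tokens(xml_fragment):
    """Return the keyword tokens; the XML parser decodes entities on its own."""
    return ET.fromstring(xml_fragment).findtext("kwtext").split()


def rttm_lexemes(lines):
    """Return normalized tokens of LEXEME lines (token assumed to be field 6)."""
    tokens = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "LEXEME":
            tokens.append(normalize_token(fields[5]))
    return tokens


KWLIST = '<kw kwid="KW"><kwtext>&lt;WORD&gt; word second_word</kwtext></kw>'

RTTM_RAW = [
    "LEXEME utt 1 0 0.39 <WORD> <NA> <NA> <NA>",
    "LEXEME utt 1 0.39 0.15 word <NA> <NA> <NA>",
    "NON-LEX utt 1 0.54 0.05 <eps> <NA> <NA> <NA>",
    "LEXEME utt 1 0.59 0.17 second_word <NA> <NA> <NA>",
]

# Same file, but with the first token escaped by hand.
RTTM_ESCAPED = ["LEXEME utt 1 0 0.39 &lt;WORD&gt; <NA> <NA> <NA>"] + RTTM_RAW[1:]

if __name__ == "__main__":
    print(kwlist_tokens(KWLIST) == rttm_lexemes(RTTM_RAW))
    print(kwlist_tokens(KWLIST) == rttm_lexemes(RTTM_ESCAPED))
```

With this kind of normalization on both sides, the keyword and the RTTM tokens would match whether or not the RTTM file uses the escaped form.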
