Skip to content

Conversation

benjaminking
Copy link
Collaborator

@benjaminking benjaminking commented Sep 5, 2025

This PR implements quotation mark denormalization using Machine.py. This required a fair bit of refactoring because of the requirement that the quote conventions of the source and target projects be able to be automatically detected, if the user does not specify them. This means that information about the corpora used for training and translation needs to be passed to translate_usfm(). Incidentally, some of the command line script parameters had to be modified to support this, since a USFM file with no context doesn't have enough information to have the source/target quote convention detected. I've also made some changes to have SILNLP's postprocessing better match Serval's postprocessing.


This change is Reviewable

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Enkidu93 reviewed 10 of 10 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @mshannon-sil)


silnlp/common/paratext.py line 433 at r1 (raw file):

    project_dir = get_project_dir(project)
    settings = FileParatextProjectSettingsParser(project_dir).parse()
    book_id = book_number_to_id(book_number)

Could you shorten this method by calling return get_book_path(project, book_number_to_id(book_number)) or just do that conversion in the calling code if it's infrequent?


silnlp/common/paratext.py line 617 at r1 (raw file):

# This is a placeholder until the ParatextProjectQuoteConventionDetector is released in machine.py

Does this comment mean that all of this code can be removed if we release machine.py? It's not at all difficult to do that if you need us to - then you could avoid the private method issue above too.

I'll just go ahead and begin the release process and let you know when it is released.


silnlp/nmt/translate.py line 9 at r1 (raw file):

from typing import Iterable, Optional, Tuple, Union

from flake8 import LOG

Is this needed?

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @benjaminking and @mshannon-sil)


silnlp/nmt/postprocess.py line 5 at r1 (raw file):

import re
from pathlib import Path
from typing import Dict, List, Optional

I think Dict is unused.


silnlp/nmt/postprocess.py line 147 at r1 (raw file):

        if src_sentence.ref.to_relaxed() != draft_sentence.ref.to_relaxed():
            LOGGER.warning(
                f"Can't process {draft_metadata.source_path} and {draft_path}: Mismatched ref, {src_ref} != {draft_ref}. Files must have the exact same USFM structure"

I think draft_path is supposed to be draft_metdata.draft_path - a couple similar issues in these log messages.

Copy link
Collaborator Author

@benjaminking benjaminking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 8 of 10 files reviewed, 5 unresolved discussions (waiting on @Enkidu93 and @mshannon-sil)


silnlp/common/paratext.py line 617 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Does this comment mean that all of this code can be removed if we release machine.py? It's not at all difficult to do that if you need us to - then you could avoid the private method issue above too.

I'll just go ahead and begin the release process and let you know when it is released.

Yes, I believe I could revert all the changes to paratext.py if FileParatextProjectQuoteConventionDetector is available.


silnlp/nmt/postprocess.py line 5 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I think Dict is unused.

Done.


silnlp/nmt/postprocess.py line 147 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I think draft_path is supposed to be draft_metdata.draft_path - a couple similar issues in these log messages.

Done.


silnlp/nmt/translate.py line 9 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Is this needed?

Nope. One of those annoying VSCode automatic imports when you mistype something.

Copy link
Collaborator

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mshannon-sil reviewed 2 of 10 files at r1, all commit messages.
Reviewable status: 8 of 10 files reviewed, 9 unresolved discussions (waiting on @benjaminking and @Enkidu93)


silnlp/common/postprocesser.py line 21 at r2 (raw file):

    parse_usfm,
)
from machine.punctuation_analysis import (

Reminder that we need to update sil-machine in pyproject.toml to the latest version and update the lock file with poetry lock --no-update for this to work (I know that it hadn't been released when this was originally written).


silnlp/common/postprocesser.py line 105 at r2 (raw file):

        usfm: str,
        rows: List[UpdateUsfmRow],
        remarks: List[str] = [],

I learned this recently, but it's not recommended to have mutable default arguments in Python. Default values are only evaluated once, so if the default is used and then modified in a function call, if the function is called another time, the default remains modified. It's better to set the default to None, and then initialize the argument at the beginning of the method if the argument is None. It would be good to go through your code and modify all the places where mutable default arguments ([], {}, etc.) are being used.


silnlp/common/postprocesser.py line 160 at r2 (raw file):

                    "The source project name must be explicitly provided or be present in translate_config.yml, since an explicit source quote convention name was not provided."
                )
            if selected_training_books is None:

If {} is passed in for selected_training_books, this will not catch it.


silnlp/common/postprocesser.py line 233 at r2 (raw file):

                ]
            )
            + "."

I think as written if every chapter is skipped it results in Quotation marks in the following chapters have been automatically denormalized after translation: . Doesn't matter too much so this comment isn't blocking but thought I would note it.


silnlp/common/postprocesser.py line 287 at r2 (raw file):

            if option in POSTPROCESS_SUFFIX_CHARS and self._config[option] != default:
                if isinstance(default, str):
                    suffix += POSTPROCESS_SUFFIX_CHARS[option][self._config[option]]

Should the if else statements here be the other way around? I would think that if the default is a string, then wouldn't POSTPROCESS_SUFFIX_CHARS[option] return a string not a dict?

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Enkidu93 reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @benjaminking and @ddaspit)


silnlp/common/paratext.py line 617 at r1 (raw file):

Previously, benjaminking (Ben King) wrote…

Yes, I believe I could revert all the changes to paratext.py if FileParatextProjectQuoteConventionDetector is available.

I went ahead and pushed the machine.py PR through and updated the library. Unfortunately, it looks like the functionality already in this PR includes detecting the quote convention using only the selected_books. This isn't something that's available through the helper class made in Ben's machine.py PR. I'm not sure what we should do here: Either we should update machine.py & Machine to allow for this and then potentially update Serval's behavior to match; or we should remove this filtering functionality from this PR.

@ddaspit @benjaminking if you happen to see this and have a clear answer, I'm happy to try and make updates to this PR this week so that it gets in as soon as possible; otherwise, this is a blocker for me to be able to wrap up this PR.

Copy link
Collaborator

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mshannon-sil reviewed 6 of 10 files at r1, 2 of 2 files at r2.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @benjaminking and @ddaspit)


silnlp/nmt/config.py line 18 at r2 (raw file):

from tqdm import tqdm

from silnlp.common.translator import TranslationGroup

This import is being done with silnlp.common.translator, while other imports are using ..common.corpus. Looking at other files like silnlp.nmt.hugging_face_config, it looks like ..common.translator would be preferable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants