Update lex tools #753


Open · wants to merge 9 commits into master

Conversation

bhartmoore
Collaborator

@bhartmoore bhartmoore commented Jun 11, 2025

Add functionality to support "Novel words" investigations for Catapult Reloaded and Crossbow work. Updates include:

  • Order source/target words by occurrences in given book(s) and print a count of occurrences
  • Output a list of verses where "novel" words occur, as well as the words that occur in those verses with their counts

This change is Reviewable

@bhartmoore bhartmoore requested a review from mshannon-sil June 11, 2025 18:50
Collaborator

@mshannon-sil mshannon-sil left a comment


Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @bhartmoore)


silnlp/common/find_words.py line 32 at r1 (raw file):

                    src_word_counts.append(entry)
            else:
                print("Error: word counts are missing. Please run count_words.py with the --count flag set.")

I think it would be more appropriate to raise an exception here, unless there's a strong reason you need it to complete successfully. You could do raise ValueError("Word counts are missing. Please run count_words.py with the --count flag set.").
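The suggested change can be sketched as below. This is illustrative only: the function name and the tab-separated word/count format are assumptions, since the actual layout of the count file and the surrounding code in find_words.py are not shown in this thread.

```python
def load_word_counts(lines):
    """Parse word/count pairs; fail loudly when counts are absent.

    The tab-separated "word<TAB>count" format is assumed for illustration;
    the real file layout written by count_words.py is not shown here.
    """
    src_word_counts = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            src_word_counts.append((parts[0], int(parts[1])))
        else:
            # Raise instead of printing, per the review suggestion, so the
            # caller cannot silently continue with missing counts.
            raise ValueError(
                "Word counts are missing. Please run count_words.py with the --count flag set."
            )
    return src_word_counts
```

Raising here also lets a caller catch the error and decide how to recover, which a print statement does not allow.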


silnlp/common/find_words.py line 44 at r1 (raw file):

    with (lex_path / "src.txt").open("r", encoding = "utf8") as src_data_file:
            with(vref_filename).open("r", encoding = "utf8") as ref_file:
                word_list = list(enumerate(words))

It doesn't seem that word_list[0] is used in any of the following code, only word_list[1], so I don't think you need to use enumerate, unless it's there for debugging purposes.


silnlp/common/find_words.py line 61 at r1 (raw file):

                            word_text.append(word[1])
                            word_num.append(src_word_dict[word[1]])
                    result.append([verse[0].rstrip('\n'), word_count, word_num, word_text])

What does word_count here mean? It seems that it's the sum of the counts for all the words that are in the word list, appear in this verse, and haven't been seen yet. But I'm not sure what this metric means or is used for, so I want to check if my understanding is correct and this is the desired functionality.
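If that reading is correct, the metric can be sketched as follows. Names, the whitespace tokenization, and the data shapes are illustrative stand-ins, not the actual find_words.py code: each verse is scored by the summed training-data counts of list words that appear in it and have not been seen in an earlier verse.

```python
def score_verses(verses, word_counts):
    """Score (ref, text) pairs by counts of not-yet-seen list words.

    `word_counts` maps word -> occurrence count in the training data.
    Each word contributes to at most one verse (the first it appears in).
    """
    seen = set()
    results = []
    for ref, text in verses:
        # Whitespace split stands in for the real tokenizer (assumption).
        verse_words = [w for w in text.split() if w in word_counts and w not in seen]
        seen.update(verse_words)
        score = sum(word_counts[w] for w in verse_words)
        results.append((ref, score, verse_words))
    return results
```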


silnlp/common/compare_lex.py line 10 at r1 (raw file):

from machine.tokenization import LatinWordTokenizer

# Latin Tokenizer from machine library

Are these comments still needed?


silnlp/common/compare_lex.py line 100 at r1 (raw file):

        # Write source words missing from the alternate source file
        #with (lex_path1 / "unmatched_src_words.txt").open("w", encoding="utf8") as output_file:

Are these comments also still needed?

@bhartmoore
Copy link
Collaborator Author

@mshannon-sil Thanks for these thoughtful comments! I'll answer a couple things here, but I'd like to clean this up a bit more before committing it. That means I'll hold off until after June 22. (I had not originally intended to commit this to silnlp as it was just to facilitate tests we were doing, and it's been modified haphazardly for various purposes.)

  • find_words.py Line 32: Thank you; I've changed this to raise an exception.
  • find_words.py Line 44: Enumeration removed; I think the structure of the underlying list changed at some point and this was a relic.
  • find_words.py Line 61: You're correct, and this is probably something I should rethink for general relevance. This dates from when we were looking for the verses with maximum impact, i.e. those containing instances of the words that occur most often in the training data, and I only wanted to count each word once. Later I added word counts to the word list for a different test, so I think I can simplify and rethink the functionality.
  • compare_lex.py Line 10: Yes, I've been switching it back and forth for different applications. I need to keep both in as options and add an argument to select.
  • compare_lex.py Line 100: Comments removed, thanks!
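The planned tokenizer-selection argument (compare_lex.py Line 10) could look roughly like this. The flag name, the "whitespace" alternative, and the lazy import are all assumptions for illustration; the actual second tokenizer option in compare_lex.py is not shown in this thread.

```python
import argparse

def make_tokenize(name):
    """Return a text -> token-list function for the chosen tokenizer."""
    if name == "latin":
        # Imported lazily so the script runs without sil-machine installed
        # when the fallback tokenizer is selected.
        from machine.tokenization import LatinWordTokenizer

        tokenizer = LatinWordTokenizer()
        # tokenize() is assumed to yield token strings.
        return lambda text: list(tokenizer.tokenize(text))
    # Simple whitespace fallback (assumed alternative).
    return lambda text: text.split()

parser = argparse.ArgumentParser(description="Compare lexicons.")
parser.add_argument(
    "--tokenizer",
    choices=["latin", "whitespace"],
    default="latin",
    help="Tokenizer to use when splitting lexicon entries.",
)
```

Usage would then be e.g. `tokenize = make_tokenize(parser.parse_args().tokenizer)`.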
