Update lex tools #753


Open · wants to merge 9 commits into master

Conversation

bhartmoore
Collaborator

@bhartmoore bhartmoore commented Jun 11, 2025

Add functionality to support "Novel words" investigations for Catapult Reloaded and Crossbow work. Updates include:

  • Order source/target words by occurrences in given book(s) and print a count of occurrences
  • Output a list of verses where "novel" words occur, as well as the words that occur in those verses with their counts

This change is Reviewable

@bhartmoore bhartmoore requested a review from mshannon-sil June 11, 2025 18:50
Collaborator

@mshannon-sil mshannon-sil left a comment


Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @bhartmoore)


silnlp/common/find_words.py line 32 at r1 (raw file):

                    src_word_counts.append(entry)
            else:
                print("Error: word counts are missing. Please run count_words.py with the --count flag set.")

I think it would be more appropriate to raise an exception here, unless there's a strong reason you need it to complete successfully. You could do raise ValueError("Word counts are missing. Please run count_words.py with the --count flag set.").
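The suggested change can be sketched as below. This is illustrative only: the function name and the tab-separated word/count format are assumptions, since the actual layout of the count file and the surrounding code in find_words.py are not shown in this thread.

```python
def load_word_counts(lines):
    """Parse word/count pairs; fail loudly when counts are absent.

    The tab-separated "word<TAB>count" format is assumed for illustration;
    the real file layout written by count_words.py is not shown here.
    """
    src_word_counts = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            src_word_counts.append((parts[0], int(parts[1])))
        else:
            # Raise instead of printing, per the review suggestion, so the
            # caller cannot silently continue with missing counts.
            raise ValueError(
                "Word counts are missing. Please run count_words.py with the --count flag set."
            )
    return src_word_counts
```

Raising here also lets a caller catch the error and decide how to recover, which a print statement does not allow.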


silnlp/common/find_words.py line 44 at r1 (raw file):

    with (lex_path / "src.txt").open("r", encoding = "utf8") as src_data_file:
            with(vref_filename).open("r", encoding = "utf8") as ref_file:
                word_list = list(enumerate(words))

It doesn't seem that word_list[0] is used in any of the following code, only word_list[1], so I don't think you need to use enumerate, unless it's there for debugging purposes.


silnlp/common/find_words.py line 61 at r1 (raw file):

                            word_text.append(word[1])
                            word_num.append(src_word_dict[word[1]])
                    result.append([verse[0].rstrip('\n'), word_count, word_num, word_text])

What does word_count here mean? It seems that it's the sum of the counts for all the words that are in the word list, appear in this verse, and haven't been seen yet. But I'm not sure what this metric means or is used for, so I want to check if my understanding is correct and this is the desired functionality.
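If that reading is correct, the metric can be sketched as follows. Names, the whitespace tokenization, and the data shapes are illustrative stand-ins, not the actual find_words.py code: each verse is scored by the summed training-data counts of list words that appear in it and have not been seen in an earlier verse.

```python
def score_verses(verses, word_counts):
    """Score (ref, text) pairs by counts of not-yet-seen list words.

    `word_counts` maps word -> occurrence count in the training data.
    Each word contributes to at most one verse (the first it appears in).
    """
    seen = set()
    results = []
    for ref, text in verses:
        # Whitespace split stands in for the real tokenizer (assumption).
        verse_words = [w for w in text.split() if w in word_counts and w not in seen]
        seen.update(verse_words)
        score = sum(word_counts[w] for w in verse_words)
        results.append((ref, score, verse_words))
    return results
```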


silnlp/common/compare_lex.py line 10 at r1 (raw file):

from machine.tokenization import LatinWordTokenizer

# Latin Tokenizer from machine library

Are these comments still needed?


silnlp/common/compare_lex.py line 100 at r1 (raw file):

        # Write source words missing from the alternate source file
        #with (lex_path1 / "unmatched_src_words.txt").open("w", encoding="utf8") as output_file:

Are these comments also still needed?

@bhartmoore
Copy link
Collaborator Author

@mshannon-sil Thanks for these thoughtful comments! I'll answer a couple things here, but I'd like to clean this up a bit more before committing it. That means I'll hold off until after June 22. (I had not originally intended to commit this to silnlp as it was just to facilitate tests we were doing, and it's been modified haphazardly for various purposes.)

  • find_words.py Line 32: Thank you; I've changed this to raise an exception.
  • find_words.py Line 44: Enumeration removed; I think the structure of the underlying list changed at some point and this was a relic.
  • find_words.py Line 61: You're correct, and this is probably something I should rethink for general relevance. This dates from when we were looking for the verses with maximum impact, i.e. those containing instances of the words that occur most often in the training data, and I only wanted to count each word once. Later I added word counts to the word list for a different test, so I think I can simplify and rethink the functionality.
  • compare_lex.py Line 10: Yes, I've been switching it back and forth for different applications. I need to keep both in as options and add an argument to select.
  • compare_lex.py Line 100: Comments removed, thanks!
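The planned tokenizer-selection argument (compare_lex.py Line 10) could look roughly like this. The flag name, the "whitespace" alternative, and the lazy import are all assumptions for illustration; the actual second tokenizer option in compare_lex.py is not shown in this thread.

```python
import argparse

def make_tokenize(name):
    """Return a text -> token-list function for the chosen tokenizer."""
    if name == "latin":
        # Imported lazily so the script runs without sil-machine installed
        # when the fallback tokenizer is selected.
        from machine.tokenization import LatinWordTokenizer

        tokenizer = LatinWordTokenizer()
        # tokenize() is assumed to yield token strings.
        return lambda text: list(tokenizer.tokenize(text))
    # Simple whitespace fallback (assumed alternative).
    return lambda text: text.split()

parser = argparse.ArgumentParser(description="Compare lexicons.")
parser.add_argument(
    "--tokenizer",
    choices=["latin", "whitespace"],
    default="latin",
    help="Tokenizer to use when splitting lexicon entries.",
)
```

Usage would then be e.g. `tokenize = make_tokenize(parser.parse_args().tokenizer)`.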
