-
-
Notifications
You must be signed in to change notification settings - Fork 6
Update lex tools #753
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Update lex tools #753
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @bhartmoore)
silnlp/common/find_words.py
line 32 at r1 (raw file):
src_word_counts.append(entry) else: print("Error: word counts are missing. Please run count_words.py with the --count flag set.")
I think it would be more appropriate to raise an exception here, unless there's a strong reason you need it to complete successfully. You could do raise ValueError("Word counts are missing. Please run count_words.py with the --count flag set.")
.
silnlp/common/find_words.py
line 44 at r1 (raw file):
with (lex_path / "src.txt").open("r", encoding = "utf8") as src_data_file: with(vref_filename).open("r", encoding = "utf8") as ref_file: word_list = list(enumerate(words))
It doesn't seem that word_list[0] is used in any of the following code, only word_list[1], so I don't think you need to use enumerate
, unless it's there for debugging purposes.
silnlp/common/find_words.py
line 61 at r1 (raw file):
word_text.append(word[1]) word_num.append(src_word_dict[word[1]]) result.append([verse[0].rstrip('\n'), word_count, word_num, word_text])
What does word_count
here mean? It seems that it's the sum of the counts for all the words that are in the word list, appear in this verse, and haven't been seen yet. But I'm not sure what this metric means or is used for, so I want to check if my understanding is correct and this is the desired functionality.
silnlp/common/compare_lex.py
line 10 at r1 (raw file):
from machine.tokenization import LatinWordTokenizer # Latin Tokenizer from machine library
Are these comments still needed?
silnlp/common/compare_lex.py
line 100 at r1 (raw file):
# Write source words missing from the alternate source file #with (lex_path1 / "unmatched_src_words.txt").open("w", encoding="utf8") as output_file:
Are these comments also still needed?
@mshannon-sil Thanks for these thoughtful comments! I'll answer a couple things here, but I'd like to clean this up a bit more before committing it. That means I'll hold off until after June 22. (I had not originally intended to commit this to silnlp as it was just to facilitate tests we were doing, and it's been modified haphazardly for various purposes.)
|
Add functionality to support "Novel words" investigations for Catapult Reloaded and Crossbow work. Updates include:
This change is