Thousand Language Morphology Database

Tools for building very wide (but shallow) language models. This project scrapes bible.com for all languages and matches up verses with those from a parsed Greek New Testament -- to produce vocabulary and grammar forms for a large number of languages.

(All author information is redacted until the end of 2022, when I'll put it back. I just had a paper desk-rejected for an anonymity failure caused by this README.)

My claim is that p-adic metrics are better suited than Euclidean metrics to various machine learning tasks in natural language processing.

One of the tasks where p-adics are likely to be helpful is grammar morphology: deducing, from a very small number of examples, how a language changes the forms of words to represent their roles in a sentence.

Building from scratch

These instructions are not complete yet.

  1. git clone git@github.com:OpenText-org/GNT_annotation_v1.0.git in some directory.

  2. Create a PostgreSQL database with UTF-8 encoding -- createdb --encoding=utf8 --template=template0 thousand_language

  3. Create a user with write permissions on all the tables in the database (see the sketch below).
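
For example, in psql as a superuser -- a minimal sketch, assuming the role name and password that go into db.conf in the next step; the pipeline scripts create the tables themselves, so this user will own them:

-- run as the postgres superuser
create user gntwriter with password 'whateverpasswordyouused';
grant all privileges on database thousand_language to gntwriter;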

  4. Create a file db.conf with the connection details:

[database]
dbname=thousand_language
user=gntwriter
password=whateverpasswordyouused
host=localhost
port=5432
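
To check that these credentials work before starting anything long-running, you can test the connection with psql (a quick sanity check, not part of the pipeline):

psql "host=localhost port=5432 dbname=thousand_language user=gntwriter password=whateverpasswordyouused" -c 'select 1;'
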
  5. pip3 install --user -r requirements.txt

  6. Run parse_verses.py. If necessary, add --verbose, --opentext-location or --database-config (an example invocation is below).
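
For example -- the path passed to --opentext-location is an assumption; point it at wherever you cloned GNT_annotation_v1.0 in step 1:

python3 parse_verses.py --verbose --opentext-location ../GNT_annotation_v1.0 --database-config db.conf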

  7. Run the datacleaning.sql file using psql (see below).
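
For example, with the host, user and database from db.conf:

psql -h localhost -U gntwriter -d thousand_language -f datacleaning.sql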

  8. Run fetch_verses.py -- this takes a few weeks to complete.

  9. Run extract_vocab.py -- this also takes a few weeks to complete (see the note below about running these long jobs detached).
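
Since each of these runs for weeks, it is worth running them detached from your terminal. One way to do it (nohup is just one option; tmux or screen work too):

nohup python3 fetch_verses.py > fetch_verses.log 2>&1 &
tail -f fetch_verses.log
# ...then the same pattern for extract_vocab.py once fetch_verses.py has finished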

  10. Load the Wikidata language codes with \copy wikidata_iso639_codes from 'enrichment/language-codes.csv' and then run refresh materialized view wikidata_iso639_codes; (spelled out below).
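
In psql, that looks like the following; the with (format csv) option is an assumption about the file format -- add header to the options if the CSV has a header row:

\copy wikidata_iso639_codes from 'enrichment/language-codes.csv' with (format csv)
refresh materialized view wikidata_iso639_codes;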

  11. Run wikidata/cache-wikidata.py

  12. Run refresh materialized view wikidata_parental_tree;

  13. Load canonical-english.sql

  14. Load the translation data into the database with save_translations.py, and then refresh the dependent materialized views: vocabulary_sizes, vocabulary_sizes_crosstab, lemma_translation_counts and vocabulary_extraction_correlations (statements below).
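
Spelled out as SQL:

refresh materialized view vocabulary_sizes;
refresh materialized view vocabulary_sizes_crosstab;
refresh materialized view lemma_translation_counts;
refresh materialized view vocabulary_extraction_correlations;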

  15. Run ./make_vocab_lists.py

  16. Run ./make_leaftop.py

  17. Hire some translators to check the content in leaftop/

  18. Load their results with load_assessment.py

  19. Run ./make_leaftop.py again, but send the --output to the directory containing the clone of github.com:solresol/leaftop.git (see below).
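
For example -- the relative path is an assumption about where the leaftop clone lives:

./make_leaftop.py --output ../leaftop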

  20. Run ./make_explorer.py (again, use --output to put the output into a suitable subdirectory of the leaftop repo, e.g. leaftop-explorer).

  21. Run pairings.py (maybe run it several times with --tokenisation-method bigram/trigram/uni_token etc., as below). It took about 48 hours to run.
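
For example, one run per tokenisation method (the method names come from the step above):

./pairings.py --tokenisation-method bigram
./pairings.py --tokenisation-method trigram
./pairings.py --tokenisation-method uni_token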

  22. Refresh the next group of materialized views: translation_exploration, likely_valid_vocabulary_extractions, summary_of_vocabulary_extractions and confidence_vs_reality (statements below).
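
As SQL:

refresh materialized view translation_exploration;
refresh materialized view likely_valid_vocabulary_extractions;
refresh materialized view summary_of_vocabulary_extractions;
refresh materialized view confidence_vs_reality;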

  23. Make a release of LEAFTOP: zip the data, docs, evaluations and leaftop-explorer files from the leaftop repo (one possible command is below).
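
One way to build the archive, assuming those four names are top-level directories of the leaftop repo (the archive name is arbitrary):

cd ../leaftop
zip -r leaftop-release.zip data docs evaluations leaftop-explorer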

  24. Run refresh materialized view singular_plural_similarity_and_signals;
