Thousand Language Morphology Database

Tools for building very wide (but shallow) language models. This project scrapes bible.com for all languages and matches up verses with those from a parsed Greek New Testament -- to produce vocabulary and grammar forms for a large number of languages.

(All author information is redacted until the end of 2022, when I'll put it back. I just had a paper desk-rejected for an anonymity failure caused by this README.)

My claim is that p-adic metrics are better suited than Euclidean metrics to various machine learning tasks in natural language processing.

One of the tasks where p-adics are likely to be helpful is grammar morphology: deducing, from a very small number of examples, how a language changes the forms of words to represent their roles in a sentence.

Building from scratch

These instructions are not complete yet.

  1. git clone git@github.com:OpenText-org/GNT_annotation_v1.0.git in some directory.

  2. Create a PostgreSQL database with UTF-8 encoding -- createdb --encoding=utf8 --template=template0 thousand_language

  3. Create a user with write permissions on all the tables in the database (see the sketch below).
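
For example, in psql as a superuser -- a minimal sketch, assuming the role name and password that go into db.conf in the next step; the pipeline scripts create the tables themselves, so this user will own them:

-- run as the postgres superuser
create user gntwriter with password 'whateverpasswordyouused';
grant all privileges on database thousand_language to gntwriter;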

  4. Create a file db.conf with the connection details:

[database]
dbname=thousand_language
user=gntwriter
password=whateverpasswordyouused
host=localhost
port=5432
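
To check that these credentials work before starting anything long-running, you can test the connection with psql (a quick sanity check, not part of the pipeline):

psql "host=localhost port=5432 dbname=thousand_language user=gntwriter password=whateverpasswordyouused" -c 'select 1;'
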
  5. pip3 install --user -r requirements.txt

  6. Run parse_verses.py. If necessary, add --verbose, --opentext-location or --database-config (an example invocation is below).
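
For example -- the path passed to --opentext-location is an assumption; point it at wherever you cloned GNT_annotation_v1.0 in step 1:

python3 parse_verses.py --verbose --opentext-location ../GNT_annotation_v1.0 --database-config db.conf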

  7. Run the datacleaning.sql file using psql (see below).
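
For example, with the host, user and database from db.conf:

psql -h localhost -U gntwriter -d thousand_language -f datacleaning.sql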

  8. Run fetch_verses.py -- this takes a few weeks to complete.

  9. Run extract_vocab.py -- this also takes a few weeks to complete (see the note below about running these long jobs detached).
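
Since each of these runs for weeks, it is worth running them detached from your terminal. One way to do it (nohup is just one option; tmux or screen work too):

nohup python3 fetch_verses.py > fetch_verses.log 2>&1 &
tail -f fetch_verses.log
# ...then the same pattern for extract_vocab.py once fetch_verses.py has finished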

  10. Load the Wikidata language codes with \copy wikidata_iso639_codes from 'enrichment/language-codes.csv' and then run refresh materialized view wikidata_iso639_codes; (spelled out below).
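
In psql, that looks like the following; the with (format csv) option is an assumption about the file format -- add header to the options if the CSV has a header row:

\copy wikidata_iso639_codes from 'enrichment/language-codes.csv' with (format csv)
refresh materialized view wikidata_iso639_codes;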

  11. Run wikidata/cache-wikidata.py

  12. Run refresh materialized view wikidata_parental_tree;

  13. Load canonical-english.sql

  14. Load the translation data into the database with save_translations.py, and then refresh the dependent materialized views: vocabulary_sizes, vocabulary_sizes_crosstab, lemma_translation_counts and vocabulary_extraction_correlations (statements below).
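
Spelled out as SQL:

refresh materialized view vocabulary_sizes;
refresh materialized view vocabulary_sizes_crosstab;
refresh materialized view lemma_translation_counts;
refresh materialized view vocabulary_extraction_correlations;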

  15. Run ./make_vocab_lists.py

  16. Run ./make_leaftop.py

  17. Hire some translators to check the content in leaftop/

  18. Load their results with load_assessment.py

  19. Run ./make_leaftop.py again, but send the --output to the directory containing the clone of github.com:solresol/leaftop.git (see below).
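
For example -- the relative path is an assumption about where the leaftop clone lives:

./make_leaftop.py --output ../leaftop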

  20. Run ./make_explorer.py (again, use --output to put the output into a suitable subdirectory of the leaftop repo, e.g. leaftop-explorer).

  21. Run pairings.py (maybe run it several times with --tokenisation-method bigram/trigram/uni_token etc., as below). It took about 48 hours to run.
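
For example, one run per tokenisation method (the method names come from the step above):

./pairings.py --tokenisation-method bigram
./pairings.py --tokenisation-method trigram
./pairings.py --tokenisation-method uni_token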

  22. Refresh the next group of materialized views: translation_exploration, likely_valid_vocabulary_extractions, summary_of_vocabulary_extractions and confidence_vs_reality (statements below).
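
As SQL:

refresh materialized view translation_exploration;
refresh materialized view likely_valid_vocabulary_extractions;
refresh materialized view summary_of_vocabulary_extractions;
refresh materialized view confidence_vs_reality;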

  23. Make a release of LEAFTOP: zip the data, docs, evaluations and leaftop-explorer files from the leaftop repo (one possible command is below).
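
One way to build the archive, assuming those four names are top-level directories of the leaftop repo (the archive name is arbitrary):

cd ../leaftop
zip -r leaftop-release.zip data docs evaluations leaftop-explorer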

  24. Run refresh materialized view singular_plural_similarity_and_signals;
