(All author information is redacted until the end of 2022, when I'll put it back again. I just had a paper desk-rejected for an anonymity failure caused by this README.)
My claim is that p-adic metrics are ideally suited to various natural language processing machine learning tasks -- better than Euclidean metrics.
One of the tasks where p-adics are likely to be helpful is in grammar morphology: deducing from a very small number of examples how the language changes the forms of words to represent their roles in a sentence.
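For readers unfamiliar with p-adic metrics, here is a minimal sketch of the textbook definition on integers. It is not code from this repository, and it does not show how the project maps words to numbers; the function names `p_adic_valuation` and `p_adic_distance` are made up for this illustration.

```python
from fractions import Fraction


def p_adic_valuation(n: int, p: int) -> int:
    """Return the largest k such that p**k divides n (n must be non-zero)."""
    if n == 0:
        raise ValueError("the p-adic valuation of 0 is infinite")
    n = abs(n)
    k = 0
    while n % p == 0:
        n //= p
        k += 1
    return k


def p_adic_distance(a: int, b: int, p: int) -> Fraction:
    """Return |a - b|_p = p**(-v_p(a - b)), or 0 when a == b."""
    if a == b:
        return Fraction(0)
    return Fraction(1, p ** p_adic_valuation(a - b, p))


if __name__ == "__main__":
    # 1 and 10 differ by 9 = 3**2, so they are close in the 3-adic metric
    # even though they are 9 apart in the usual (Euclidean) sense.
    print(p_adic_distance(1, 10, 3))
    # 1 and 2 differ by 1, which 3 does not divide, so they are far apart.
    print(p_adic_distance(1, 2, 3))
```

The point of the example is that "closeness" depends on divisibility by p rather than on magnitude, and that the distance satisfies the strong triangle inequality d(a, c) <= max(d(a, b), d(b, c)).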
This is not complete yet.
- `git clone git@github.com:OpenText-org/GNT_annotation_v1.0.git` in some directory.
- Create a PostgreSQL database with utf8 encoding -- `createdb --encoding=utf8 --template=template0 thousand_language`
- Create a user with write permissions to all the tables in the database.
- Create a file `db.conf` containing:

      [database]
      dbname=thousand_language
      user=gntwriter
      password=whateverpasswordyouused
      host=localhost
      port=5432
- `pip3 install --user -r requirements.txt`
- Run `parse_verses.py`. If necessary, add `--verbose`, `--opentext-location` or `--database-config`.
- Run the `datacleaning.sql` file using `psql`.
- Run `fetch_verses.py` -- this takes a few weeks to complete.
- Run `extract_vocab.py` -- this takes a few weeks to complete.
- Load wikidata codes (`\copy f'wikidata_iso639_codes' from 'enrichment/language-codes.csv'`) and run `refresh materialized view wikidata_iso639_codes`.
- Run `wikidata/cache-wikidata.py`
- Run `refresh materialized view wikidata_parental_tree;`
- Load `canonical-english.sql`
- Load the translation data into the database with `save_translations.py` and then run `refresh materialized view vocabulary_sizes`, `refresh materialized view vocabulary_sizes_crosstab`, `refresh materialized view lemma_translation_counts` and `refresh materialized view vocabulary_extraction_correlations`.
- Run `./make_vocab_lists.py`
- Run `./make_leaftop.py`
- Hire some translators to check the content in `leaftop/`
- Load their results with `load_assessment.py`
- Run `./make_leaftop.py` again, but send the `--output` to the directory which has the clone of `github.com:solresol/leaftop.git`
- Run `./make_explorer.py` (again, use `--output` to put the output into a suitable subdirectory (e.g. `leaftop-explorer`) of the leaftop repo).
- Run `pairings.py` (maybe run it several times with `pairings --tokenisation-method bigram/trigram/uni_token` etc.). It took about 48 hours to run.
- Run `refresh materialized view translation_exploration`, `refresh materialized view likely_valid_vocabulary_extractions`, `refresh materialized view summary_of_vocabulary_extractions` and `refresh materialized view confidence_vs_reality`.
- Make a release of LEAFTOP. Zip the data, docs, evaluations and leaftop-explorer files from the leaftop repo.
- `refresh materialized view singular_plural_similarity_and_signals`
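As a rough aid for the database steps above, the sketch below reads a `db.conf` in the format shown earlier and prints the batch of refresh statements that follows the `pairings.py` step. It uses only the Python standard library. How the project's own scripts parse `db.conf` is not shown in this README, so the helper names (`read_db_settings`, `to_dsn`, `refresh_statements`) are hypothetical, and the idea of turning the settings into a libpq-style connection string is an assumption.

```python
import configparser

# View names copied from the step that follows pairings.py, in listed order.
POST_PAIRINGS_VIEWS = [
    "translation_exploration",
    "likely_valid_vocabulary_extractions",
    "summary_of_vocabulary_extractions",
    "confidence_vs_reality",
]


def read_db_settings(path="db.conf"):
    """Return the [database] section of the config file as a plain dict."""
    parser = configparser.ConfigParser()
    if not parser.read(path):  # read() returns the list of files it parsed
        raise FileNotFoundError(path)
    return dict(parser["database"])


def to_dsn(settings):
    """Build a keyword=value connection string (assumes values have no spaces)."""
    return " ".join(f"{key}={value}" for key, value in settings.items())


def refresh_statements(views=POST_PAIRINGS_VIEWS):
    """Return one 'refresh materialized view' statement per view name."""
    return [f"refresh materialized view {view};" for view in views]


if __name__ == "__main__":
    # Print the statements so they can be pasted into psql.
    for statement in refresh_statements():
        print(statement)
```

The statements are printed rather than executed so that nothing touches the database by accident; they can be pasted into `psql` once the preceding step has finished.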