Source
The now-defunct tokipona.net site had a corpus of texts (appearing e.g. in the dropdown of http://tokipona.net/tp/DisplayText.aspx) that appears to be archived here: https://github.com/matthewdeanmartin/tokipona.parser/tree/master/BasicTypes/Tp/Corpus. This corpus was expanded into 'nltk-tp', fixing some Unicode issues in the archive and adding the uncompiled 'janKipoCorpus' (from a now-inaccessible Google Drive).
The list of files is based on https://github.com/davidar/nltk-tp/tree/master/Corpus, but I'll find the original source of each text where possible.
Since this 'source' is actually a compilation, many of these works will already be in Lapo. In this case I'll just update any missing metadata and won't duplicate the files.
Exclusions
The sections of the corpus labeled 'janKipoCorpus' or 'janKipoCompiled' appear to primarily be discussions, which are hard to source, attribute, and split into distinct files. Thus I have chosen to ignore these texts entirely.
I have also ignored the 'YahooMailingList' section; this is a source (again mostly of discussion) that I'd like to handle comprehensively later.
Files
Zompist Phrasebook, discarded since it doesn't make sense as a monolingual toki pona text (?)
Menu, discarded as it's just a website directory
Yves Prudhomme Menu, discarded as it's just a website directory
Blog comments, discarded since it's just some basic chatter in toki pona
fish, couldn't locate original source
guestbook, discarded since it's just short sentences
janKipoCollected
janKipoCorpus
YahooMailingList