Automatic rap lyric generation tool
- invoke
- tensorflow
- juman
- kytea
- chainer
- Crawl the lyrics (getlyrics.py requires BeautifulSoup; install it with `pip install beautifulsoup4`):
python getlyrics.py -v > output.tsv
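getlyrics.py itself is not shown here, but the kind of extraction it presumably performs can be sketched with the standard library alone. The tag and class names below (a `div` with class `lyrics`) are assumptions for illustration, not the actual lyrics site's markup:

```python
from html.parser import HTMLParser

# Minimal sketch of lyric extraction from a page. The real getlyrics.py
# uses BeautifulSoup against the actual site's structure; the "lyrics"
# class name here is an invented placeholder.
class LyricsExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_lyrics = False
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "lyrics") in attrs:
            self.in_lyrics = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_lyrics = False

    def handle_data(self, data):
        if self.in_lyrics and data.strip():
            self.lines.append(data.strip())

def extract_lyrics(html):
    parser = LyricsExtractor()
    parser.feed(html)
    return parser.lines

sample = '<html><body><div class="lyrics">line one<br>line two</div></body></html>'
print("\t".join(extract_lyrics(sample)))
```

Each crawled song would then be written as one tab-separated row of output.tsv, matching the redirect in the command above.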
- Extract the lyrics archive, then run the following command to obtain the file data/juman_input.txt:
python preprocess.py -crawl data/lyrics_shonan_s27_raw.tsv
- Feed the cleaned crawled corpus to juman:
juman < data/juman_input.txt > data/juman_out.txt
- Process the juman output file:
python preprocess.py -juman data/juman_out.txt
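The internals of preprocess.py are not shown here, but JUMAN's output format is stable: one morpheme per line (surface form, reading, lemma, POS, ...), with the 代表表記 inside the trailing quoted feature string and `EOS` between sentences. A minimal sketch of parsing it, with an illustrative sample line:

```python
import re

# Illustrative JUMAN output: one morpheme line, then the EOS sentence marker.
sample = '''走る はしる 走る 動詞 2 * 0 子音動詞ラ行 10 基本形 2 "代表表記:走る/はしる"
EOS
'''

def parse_juman(text):
    """Collect surface form -> 代表表記 pairs from JUMAN output.

    This is a sketch of the kind of processing preprocess.py -juman
    presumably does; the real script also builds the corpus files.
    """
    vocab = {}
    for line in text.splitlines():
        # Skip sentence boundaries and alternative analyses ("@" lines).
        if line == "EOS" or line.startswith("@"):
            continue
        surface = line.split(" ")[0]
        m = re.search(r'代表表記:(\S+?)"', line)
        if m:
            vocab[surface] = m.group(1)
    return vocab

print(parse_juman(sample))
```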
The preprocessing step is now finished. You will have three files in the data/ folder:
- string_corpus.txt: a string corpus file for LSTM training (one sentence per line); each song is separated from the previous one by one empty line
- hiragana_corpus.txt: a hiragana corpus file for FFNN training (one sentence per line); each song is separated from the previous one by one empty line
- daihyou_vocab.p: a vocabulary file (keys correspond to surface forms, values to 代表表記); this is used to look up the embeddings during the LSTM training
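How daihyou_vocab.p is consumed can be sketched as follows. The dictionary entries below are invented examples; the real file maps every surface form in the corpus to its 代表表記:

```python
import pickle

# Hypothetical contents: the real daihyou_vocab.p maps surface forms to
# 代表表記 (representative notations); these two entries are illustrative.
vocab = {"行く": "行く/いく", "いく": "行く/いく"}

with open("daihyou_vocab.p", "wb") as f:
    pickle.dump(vocab, f)

# During LSTM training, surface forms are normalized through this table
# before the embedding lookup, so spelling variants share one embedding.
with open("daihyou_vocab.p", "rb") as f:
    daihyou = pickle.load(f)

def normalize(surface):
    # Fall back to the surface form for out-of-vocabulary terms.
    return daihyou.get(surface, surface)

print(normalize("いく"))
```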
- Training
inv train model
- Testing
inv test model
- Training
Run the command below in the chainer_model directory:
python train_lstm_lm.py (--gpu 0)
Use a GPU for training (this code is very slow on CPU).
- Generating lines
python generate_seq.py --model trained_model -O output_file N 10000
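generate_seq.py's internals are not shown here, but samplers over a language model's output distribution typically draw from a temperature-scaled softmax. A self-contained sketch of that step (the function name and temperature handling are illustrative, not generate_seq.py's actual code):

```python
import math
import random

def sample_from_logits(logits, temperature=1.0, rng=random):
    """Draw one token index from temperature-scaled softmax probabilities.

    Lower temperature concentrates mass on the top logits (safer lines);
    higher temperature flattens the distribution (more surprising lines).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):        # inverse-CDF sampling
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

random.seed(0)
print(sample_from_logits([2.0, 0.5, 0.1]))
```

Repeating this draw until an end-of-line token appears yields one candidate line; the command above collects 10000 of them.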
- Make a term-rhyme table using data/string_corpus.txt and data/hiragana_corpus.txt:
python features/make_term_vowel_table.py -v --unknown-terms <path-to-unknown-terms:optional> > <path-to-output-table>
- data/term_vowel_table.csv: term-to-vowel table (each row has term,vowels)
- data/unknown_terms.txt: terms that did not have a hiragana form in data/hiragana_corpus.txt. Currently they are filtered out from the table above.
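The vowel column of such a table can be derived from a term's hiragana form, since every kana row shares one vowel. A sketch under that assumption (make_term_vowel_table.py may handle edge cases like ん or long-vowel marks differently):

```python
# Group kana by their row vowel; the groupings below are the standard rows,
# including voiced/semi-voiced and small variants.
VOWEL_OF = {}
for vowel, kana in [
    ("a", "あかがさざただなはばぱまやらわゃぁ"),
    ("i", "いきぎしじちぢにひびぴみりぃ"),
    ("u", "うくぐすずつづぬふぶぷむゆるゅっぅ"),
    ("e", "えけげせぜてでねへべぺめれぇ"),
    ("o", "おこごそぞとどのほぼぽもよろをょぉ"),
]:
    for ch in kana:
        VOWEL_OF[ch] = vowel

def vowels(hiragana_term):
    # Characters without an entry (e.g. "ん", "ー") are skipped here;
    # the real table builder may treat them differently.
    return "".join(VOWEL_OF[ch] for ch in hiragana_term if ch in VOWEL_OF)

print(vowels("さくら"))
```

Two terms with the same vowel string (here "aua") rhyme under this scheme, which is what the table is used for downstream.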
python NextLine.py -f data/sample_nextline_prediction_candidates.txt
After processing, you will have the result in test_lyrics.txt.
Note: You may need to comment out the lines below in NextLine.py
if __name__ == "__main__":
...
temp.pop(0)
temp.pop(-1)
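NextLine.py's scoring is not shown here, but ranking candidate next lines by how well their endings rhyme with the previous line can be sketched as below. The tiny vowel table is a stub standing in for data/term_vowel_table.csv, and the scoring rule (length of the common vowel suffix) is an illustrative choice, not necessarily the model's actual feature set:

```python
# Stub vowel lookup; the real system reads data/term_vowel_table.csv.
VOWELS = {"か": "a", "さ": "a", "き": "i", "し": "i", "く": "u", "よ": "o"}

def vowel_seq(line):
    return "".join(VOWELS.get(ch, "") for ch in line)

def rhyme_score(prev, cand):
    # Length of the common vowel suffix shared by the two line endings.
    a, b = vowel_seq(prev), vowel_seq(cand)
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def best_next_line(prev, candidates):
    # Pick the candidate whose ending rhymes best with the previous line.
    return max(candidates, key=lambda c: rhyme_score(prev, c))

print(best_next_line("かさ", ["きし", "さか", "くよ"]))
```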