A small LSTM neural network that predicts whether a string is a person's name.
The dataset is built from DBpedia and from combinations of first names and last names.
- Python 3.6
- Numpy 1.12.1
- Pandas 0.20.2
- Scikit-Learn 0.18.2
- Seaborn 0.8*
- Matplotlib 2.0.2*
- Theano 0.9
- TensorFlow 1.2.1
- Jupyter Notebooks*
*Only needed to rerun the Jupyter notebooks.
Miniconda is recommended for installing the packages.
Details on how to extract the data and build the dataset can be found in Preprocessing.ipynb. Alternatively, the already processed dataset, which contains 6 million samples, can be downloaded from data/full_names.tar.gz.
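For readers who download the archive instead of rerunning the notebook, unpacking it could look roughly like the sketch below. Only the path data/full_names.tar.gz comes from this README; the function name and the destination directory are hypothetical, and nothing is assumed about the names of the files inside the archive.

```python
import tarfile
from pathlib import Path


def unpack_dataset(archive_path, dest_dir):
    """Extract a .tar.gz archive and return the names of its members."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive_path), "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(path=str(dest))
    return names


if __name__ == "__main__":
    # Archive path taken from this README; "data/extracted" is an arbitrary choice.
    archive = Path("data/full_names.tar.gz")
    if archive.exists():
        for name in unpack_dataset(archive, "data/extracted"):
            print(name)
```

Listing the member names first is a cheap way to see what the archive contains before loading 6 million samples into memory.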
Training can be found in Training.ipynb. Training on the whole dataset took almost 3 hours on a Google Compute Engine instance with a Tesla K80 GPU. For convenience, the already trained model can be found in models/.
The model achieved 96% accuracy.
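The actual preprocessing and architecture live in the notebooks and are not described in this README. As a rough illustration of how strings are commonly prepared for an LSTM, the sketch below maps characters to integer indices and pads to a fixed length. The character-level encoding, the vocabulary, the maximum length, and every name in the block are assumptions for illustration, not the author's method.

```python
import string

# Hypothetical vocabulary: index 0 is padding, index 1 is "unknown character".
ALPHABET = string.ascii_lowercase + " .-'"
CHAR_TO_INDEX = {ch: i + 2 for i, ch in enumerate(ALPHABET)}
PAD, UNKNOWN = 0, 1
MAX_LEN = 20  # assumed fixed sequence length


def encode(text, max_len=MAX_LEN):
    """Turn a string into a fixed-length list of integer character indices."""
    # Lowercase, truncate to max_len, and map unseen characters to UNKNOWN.
    indices = [CHAR_TO_INDEX.get(ch, UNKNOWN) for ch in text.lower()[:max_len]]
    # Right-pad with PAD so every sequence has the same length.
    return indices + [PAD] * (max_len - len(indices))


print(encode("Ada Lovelace", max_len=14))
```

Fixed-length integer sequences like these are what an embedding layer followed by an LSTM typically consumes; the real model may differ.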
Prediction can be done using Predicting.ipynb; some samples were added to test the model manually.
Predicting whether strings contain a person's name can also be done from the command line. The program takes a tab-separated file (input.tsv) as input and generates an output file in the same format but with the predicted label; optionally, probabilities can be added as well.
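The column layout of input.tsv is not specified in this README. The sketch below assumes a hypothetical two-column layout (text, label) and uses a stand-in scoring function in place of the trained model, only to illustrate the read, relabel, and write flow; it is not the code in NamePredictor.py, and the 0.5 threshold is likewise an assumption.

```python
import csv


def relabel_tsv(input_path, output_path, score, add_probabilities=False):
    """Copy a TSV file, replacing the label column with a predicted label.

    Assumes each row is (text, label); `score` maps text to a probability
    in [0, 1] and stands in for the trained model.
    """
    with open(input_path, newline="", encoding="utf-8") as src, \
            open(output_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        for row in reader:
            if not row:
                continue  # skip blank lines
            text = row[0]
            probability = score(text)
            label = 1 if probability >= 0.5 else 0  # assumed threshold
            out_row = [text, label]
            if add_probabilities:
                out_row.append("{:.4f}".format(probability))
            writer.writerow(out_row)


def toy_score(text):
    """Placeholder heuristic, not the LSTM: two capitalized words score high."""
    words = text.split()
    is_name_like = len(words) == 2 and all(w[:1].isupper() for w in words)
    return 0.9 if is_name_like else 0.1
```

Calling `relabel_tsv("input.tsv", "output.tsv", toy_score, add_probabilities=True)` would write one extra column per row, mirroring what the optional probabilities output is described as doing.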
Usage:
python NamePredictor.py [model] [input] [output]

Example:
python NamePredictor.py models/model.h5 input.tsv output.tsv

To add verbosity:
python NamePredictor.py models/model.h5 input.tsv output.tsv --verbosity

To add probabilities to the output:
python NamePredictor.py models/model.h5 input.tsv output.tsv --probabilities

Developed by Omar Contreras [email protected]