This project focuses on building a statistical language model for the Balochi language using N-gram models. It is capable of performing next-word prediction and analyzing the frequency of common N-grams (bigrams, trigrams, quadgrams) from a given Balochi corpus.
- `balochi.txt`: The cleaned Balochi language corpus.
- `next_word_prediction.py`: Python script for building the N-gram model and performing prediction.
- `Balochi_NextWordPrediction.pkl`: Trained Lidstone N-gram model saved using `dill`.
- `README.md`: Description and documentation of the project.
- Tokenization and N-gram generation (up to 4-grams)
- Text cleaning and padding
- Model training using Lidstone/Laplace/Kneser-Ney smoothing
- Perplexity evaluation on test set
- Most frequent bigrams, trigrams, and quadgrams
- Visualization using Plotly
- Model serialization with `dill` (a training sketch covering these steps follows this list)
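Below is a minimal sketch of how such a pipeline can be put together with `nltk.lm`. The whitespace tokenization, the Lidstone gamma of 0.1, and the single held-out sentence are illustrative assumptions, not the project's exact settings.

```python
# Minimal Lidstone 4-gram training sketch (assumed settings, not the project's exact ones).
import dill
from nltk.lm import Lidstone
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

# Split the corpus on the Balochi full stop "۔" and tokenize on whitespace (assumption).
with open('balochi.txt', encoding='utf-8') as f:
    sentences = [s.split() for s in f.read().split('۔') if s.strip()]

train_sents, test_sent = sentences[:-1], sentences[-1]

# Padded training n-grams (unigrams up to 4-grams) plus the flat vocabulary stream.
train_data, vocab = padded_everygram_pipeline(4, train_sents)

# Lidstone-smoothed 4-gram model; gamma = 0.1 is an illustrative value.
model = Lidstone(0.1, 4)
model.fit(train_data, vocab)

# Perplexity on one held-out sentence, evaluated over its padded 4-grams.
test_ngrams = list(ngrams(pad_both_ends(test_sent, n=4), 4))
print('Perplexity:', model.perplexity(test_ngrams))

# Serialize the trained model with dill.
with open('Balochi_NextWordPrediction.pkl', 'wb') as f:
    dill.dump(model, f)
```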
git clone https://github.com/SharanBashir/Next_WordPrediction_Balochi_NLP.git
cd Next_WordPrediction_Balochi_NLP
python next_word_prediction.py

After training, enter any word when prompted to see the top next-word predictions:
input_text = input('Enter a word: ').split()
print(sorted(model.counts[input_text].most_common(3)))

The dataset used is a manually cleaned Balochi text corpus, with sentences separated by the "۔" punctuation mark.
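To query the saved model later without retraining, it can be reloaded with `dill`. A minimal sketch is shown below; the file name follows the repository layout above, and the rest is illustrative rather than the script's exact code.

```python
# Load the serialized model and print the three most frequent continuations.
import dill

with open('Balochi_NextWordPrediction.pkl', 'rb') as f:
    model = dill.load(f)

context = input('Enter a word: ').split()
# model.counts[context] is a frequency distribution of words seen after this context.
for word, count in model.counts[context].most_common(3):
    print(word, count)
```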
- `nltk` (language modeling)
- `pandas`, `numpy` (data handling)
- `plotly.express` (data visualization)
- `dill` (model saving)
- `collections.Counter`, `itertools.chain` (n-gram counting)
- Trained model vocabulary: ~10,000 tokens
- Perplexity evaluated on test data
- Top 10 bigrams, trigrams, and quadgrams visualized using interactive bar charts (see the sketch below)
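A minimal sketch of such a frequency chart, assuming the whitespace-tokenized `sentences` from the training sketch above; variable names and chart styling are illustrative.

```python
# Count bigrams across all tokenized sentences and plot the ten most frequent ones.
from collections import Counter
from itertools import chain
from nltk.util import ngrams
import plotly.express as px

bigram_counts = Counter(chain.from_iterable(ngrams(sent, 2) for sent in sentences))
top10 = bigram_counts.most_common(10)

fig = px.bar(
    x=[' '.join(bigram) for bigram, _ in top10],
    y=[count for _, count in top10],
    labels={'x': 'Bigram', 'y': 'Frequency'},
    title='Top 10 Balochi Bigrams',
)
fig.show()
```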
Sharan Bashir, Final Year PG Research Project: Balochi Language Processing
This project is licensed under the MIT License. See the LICENSE file for details.