This project applies natural language processing (NLP) techniques to classify text as "disaster" or "not disaster" using a Naive Bayes classifier with TF-IDF vectorization. Trained on historical labeled data, the model predicts whether a new input sentence refers to a disaster.
- Model: Naive Bayes classifier
- Vectorization: TF-IDF for feature extraction
- Task: Binary classification of disaster-related sentences
- Libraries: scikit-learn, pandas, numpy
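Under these assumptions, the core setup can be sketched as a scikit-learn pipeline. The toy sentences and labels below are illustrative placeholders, not the project's actual dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; the real project trains on historical data.
texts = [
    "Forest fire near the town",
    "Earthquake shakes the city, buildings collapsed",
    "I love this sunny weather",
    "Just watched a great movie tonight",
]
labels = [1, 1, 0, 0]  # 1 = disaster, 0 = not disaster

# TF-IDF turns each sentence into a weighted term-frequency vector;
# Multinomial Naive Bayes is a standard fit for such count-like features.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["Massive earthquake reported downtown"])[0])
```

Chaining the vectorizer and classifier in one pipeline keeps the vocabulary learned during `fit` consistent at prediction time.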
The project is organized into the following phases:
- Task: Clean and preprocess the text data, including tokenization, removing stopwords, and vectorization using TF-IDF.
- Deliverable: Preprocessed dataset ready for model training.
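One way to realize this step with scikit-learn, where `TfidfVectorizer` handles tokenization, lowercasing, stopword removal, and TF-IDF weighting in a single object (the parameter choices and sample sentences here are illustrative assumptions, not the project's exact settings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "Flooding has closed the main highway",
    "The flooding is getting worse downtown",
    "Nothing beats a quiet evening at home",
]

# Tokenize, lowercase, drop English stopwords, and apply TF-IDF weighting.
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
X = vectorizer.fit_transform(texts)

print(X.shape)                         # (n_documents, vocabulary_size)
print(sorted(vectorizer.vocabulary_))  # stopwords like "the" are gone
```

The result is a sparse document-term matrix ready to feed into the classifier.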
- Task: Train a Naive Bayes classifier using the preprocessed data.
- Deliverable: A trained model capable of classifying sentences as "Disaster" or "Not Disaster."
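A minimal training sketch under those assumptions; the split ratio and toy corpus are placeholders, and the vectorizer is fit on the training split only so no test-set information leaks into the vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

texts = ["Wildfire spreading fast", "Hurricane warning issued",
         "Storm damaged several homes", "Flood waters rising quickly",
         "Enjoying coffee with friends", "New album drops on Friday",
         "Great game last night", "Trying a new pasta recipe"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = disaster, 0 = not disaster

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# Fit the vectorizer on training text only, then train the classifier.
vectorizer = TfidfVectorizer()
clf = MultinomialNB()
clf.fit(vectorizer.fit_transform(X_train), y_train)

# Transform (not fit) the held-out text with the same vocabulary.
preds = clf.predict(vectorizer.transform(X_test))
print(preds)
```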
- Task: Evaluate the model using accuracy, precision, recall, and F1 score.
- Deliverable: Performance metrics that indicate the effectiveness of the model.
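These four metrics are all available in `sklearn.metrics`. The label vectors below are hypothetical, purely to show the calls and what each score measures:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical ground truth and model predictions (1 = disaster).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))   # fraction of correct predictions
print(precision_score(y_true, y_pred))  # of predicted disasters, how many were real
print(recall_score(y_true, y_pred))     # of real disasters, how many were caught
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```

With 3 true positives, 1 false positive, and 1 false negative, all four scores come out to 0.75 here.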
- Task: Fine-tune the model, optimize hyperparameters, and test it on new data.
- Deliverable: An optimized model ready for deployment.
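Hyperparameter tuning can be sketched with `GridSearchCV` over a combined pipeline. The grid below (n-gram range and the Naive Bayes smoothing parameter `alpha`) and the tiny corpus are illustrative assumptions, not the project's actual search space:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = ["Wildfire spreading fast", "Hurricane warning issued",
         "Storm damaged several homes", "Flood waters rising quickly",
         "Enjoying coffee with friends", "New album drops on Friday",
         "Great game last night", "Trying a new pasta recipe"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. uni+bigrams
    "nb__alpha": [0.1, 0.5, 1.0],            # Laplace smoothing strength
}

# Cross-validated search over all grid combinations, scored by F1.
search = GridSearchCV(pipe, param_grid, cv=2, scoring="f1")
search.fit(texts, labels)
print(search.best_params_)
```

`search.best_estimator_` is then the refit, optimized model that could be saved for deployment.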
This project successfully applies NLP techniques to classify sentences as disaster-related or not. The model achieves an accuracy of 81.10% with balanced precision and recall. Future improvements could include exploring other algorithms and enhancing the feature engineering process.