This project focuses on sentiment analysis using two different techniques:
- Unsupervised approach: Learns sentiment patterns from raw text data without using sentiment labels.
- Discriminative approach: Uses labeled data to directly classify the sentiment of a given text.
The dataset is divided into three parts:
- `train.csv`: Contains `sentiment`, `text`, and `data_id` columns. `data_id` is simply a dataset identifier and can be ignored during training.
- `valid.csv`: Same structure as `train.csv` and used for model validation.
- `test.csv`: Contains `text` and `data_id`, along with two columns to be predicted:
  - `out_label_model_1`: Predictions from the unsupervised model
  - `out_label_model_2`: Predictions from the discriminative model
Both models are trained and evaluated using the training and validation data, then saved locally:
- Unsupervised model is saved to: `model/2400570/model_unsup`
- Discriminative model is saved to: `model/2400570/model_dis`
Once trained, the models are reused for predictions on the test set, so there is no need to retrain. The `test.csv` file is updated with predicted sentiment labels for both approaches.
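The reuse-instead-of-retrain behaviour described above can be sketched as a small helper. This is a minimal illustration rather than the notebook's actual code: `load_or_train`, `train_fn`, and the use of `pickle` on a single file are all assumptions, and the real models may well be saved in a framework-specific format.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_train(model_path: Path, train_fn: Callable[[], Any]) -> Any:
    """Return the saved model if one exists; otherwise train, save, and return it.

    Hypothetical helper: pickle stands in for whatever serialisation
    the notebook really uses.
    """
    if model_path.exists():
        # A saved model is present, so training is skipped entirely.
        with model_path.open("rb") as fh:
            return pickle.load(fh)

    # No saved model yet: train once and persist the result for later runs.
    model = train_fn()
    model_path.parent.mkdir(parents=True, exist_ok=True)
    with model_path.open("wb") as fh:
        pickle.dump(model, fh)
    return model
```

With this pattern, deleting the saved file is what forces a retrain on the next run.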
The dataset files are located in the `data/20/` folder:
- `train.csv`: Contains `sentiment`, `text`, `data_id`
- `valid.csv`: Contains `sentiment`, `text`, `data_id`
- `test.csv`: Contains `text`, `data_id`, `out_label_model_1`, `out_label_model_2`
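Before training, it can help to confirm that each file really has the columns listed above. The sketch below reads only the header row of each CSV; the function name `missing_columns` and the UTF-8 encoding are assumptions, not something the project specifies.

```python
import csv
from pathlib import Path

# Column layout as described in this README.
EXPECTED_COLUMNS = {
    "train.csv": {"sentiment", "text", "data_id"},
    "valid.csv": {"sentiment", "text", "data_id"},
    "test.csv": {"text", "data_id", "out_label_model_1", "out_label_model_2"},
}


def missing_columns(data_dir: Path) -> dict[str, set[str]]:
    """Map each dataset file to the expected columns absent from its header."""
    problems = {}
    for filename, expected in EXPECTED_COLUMNS.items():
        # Assumption: files are UTF-8 encoded with a header in the first row.
        with (data_dir / filename).open(newline="", encoding="utf-8") as fh:
            header = next(csv.reader(fh), [])
        missing = expected - set(header)
        if missing:
            problems[filename] = missing
    return problems
```

An empty result from `missing_columns(Path("data/20"))` means all three files match the described layout.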
Predictions are required for:
- `out_label_model_1`: Using the unsupervised approach
- `out_label_model_2`: Using the discriminative approach
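Filling the two prediction columns could look roughly like the following. This is a sketch using only the standard library; `write_predictions` is a hypothetical name, the label values are whatever the two models emit, and the notebook itself may do this differently (for instance with pandas).

```python
import csv
from pathlib import Path


def write_predictions(test_path: Path, preds_unsup: list[str], preds_dis: list[str]) -> None:
    """Fill out_label_model_1 / out_label_model_2 in test.csv, row by row."""
    with test_path.open(newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)

    if not (len(rows) == len(preds_unsup) == len(preds_dis)):
        raise ValueError("need exactly one prediction per test row for each model")

    # Add the output columns if the file does not have them yet.
    for column in ("out_label_model_1", "out_label_model_2"):
        if column not in fieldnames:
            fieldnames.append(column)

    # Predictions are matched to rows by position, so row order must be preserved.
    for row, label_1, label_2 in zip(rows, preds_unsup, preds_dis):
        row["out_label_model_1"] = label_1
        row["out_label_model_2"] = label_2

    with test_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

The length check guards against silently misaligned labels, which positional matching would otherwise allow.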
- Unsupervised model: `model/2400570/model_unsup`
- Discriminative model: `model/2400570/model_dis`
- Python Version: 3.11.11
pandas==2.2.2
numpy==1.26.4
seaborn==0.13.2
matplotlib==3.8.0
tensorflow==2.17.1
nltk==3.9.1
scikit-learn==1.6.0
spacy==3.7.5
gensim==4.3.3
torch==2.3.0
sentence-transformers==2.7.0
bertopic==0.16.0
hdbscan==0.8.33
imbalanced-learn==0.12.2
pip install pandas==2.2.2 numpy==1.26.4 seaborn==0.13.2 matplotlib==3.8.0 tensorflow==2.17.1 nltk==3.9.1 scikit-learn==1.6.0 spacy==3.7.5 gensim==4.3.3 torch==2.3.0 sentence-transformers==2.7.0 bertopic==0.16.0 hdbscan==0.8.33 imbalanced-learn==0.12.2
The following NLTK resources are required for text preprocessing. Run once before training:
import nltk
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("cmudict")
nltk.download("punkt")
nltk.download("punkt_tab")
- Ensure Python 3.11.11 is installed.
- Install all required libraries listed above.
- Create a project folder (e.g., `SentimentAnalysis/`).
- Place the following inside the folder:
  - `data/` folder with dataset files
  - `model/` folder
  - `code.ipynb` file
- Adjust file paths in the code to correctly point to the dataset and model save locations.
- Open and run `code.ipynb`.
- Models will be trained and saved to their respective paths:
  - Unsupervised: `model/2400570/model_unsup`
  - Discriminative: `model/2400570/model_dis`
- Once saved, the trained models will automatically be used during the testing phase; retraining is not necessary unless the models are deleted.
- The `test.csv` file will be updated with the predicted sentiment labels:
  - `out_label_model_1`: from the unsupervised model
  - `out_label_model_2`: from the discriminative model
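For the path-adjustment step, one option is to define every location once near the top of the notebook. The layout below is a sketch assembled from the folder names in this README; `PROJECT_ROOT`, `DATA_DIR`, `MODEL_DIR`, and `PATHS` are hypothetical names, and the root should be changed to wherever the project folder actually lives.

```python
from pathlib import Path

# Hypothetical project root; change this to wherever the folder was created.
PROJECT_ROOT = Path("SentimentAnalysis")

DATA_DIR = PROJECT_ROOT / "data" / "20"
MODEL_DIR = PROJECT_ROOT / "model" / "2400570"

# Single place to look up every file the notebook reads or writes.
PATHS = {
    "train": DATA_DIR / "train.csv",
    "valid": DATA_DIR / "valid.csv",
    "test": DATA_DIR / "test.csv",
    "model_unsup": MODEL_DIR / "model_unsup",
    "model_dis": MODEL_DIR / "model_dis",
}
```

Keeping the paths in one dictionary means moving the project only requires editing `PROJECT_ROOT`.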
- Updated `test.csv` file with predictions from both models.
- Trained models saved in the `model/2400570/` directory.
- `Presentation_sm23788.pdf`: Project presentation slides.
- `Report_sm23788.pdf`: A detailed report explaining:
  - Model performance
  - Evaluation metrics
  - Rationale for model selection
  - Comparisons with state-of-the-art (SOTA) techniques