This project focuses on building and evaluating multiple models for multi-label text classification of Persian news articles. It uses a diverse dataset of documents labeled with categories such as politics, social issues, and culture, and implements both traditional machine learning models and a Long Short-Term Memory (LSTM) network to handle sequential text data effectively.
The dataset consists of a comprehensive collection of Persian news articles, organized into a hierarchical structure of categories and subcategories. The dataset can be accessed and downloaded from the following Google Drive link:
Steps to download and set up the dataset:
- Click on the link to navigate to Google Drive.
- Download the required dataset files.
- Place the downloaded files into the root directory of the project, ensuring they are named correctly as per the scripts' configuration.
- Loading Data and Categorizing Documents: Organize documents into categories based on the directory structure.
- Display Dataset Information: Provide statistics about the dataset including document counts and category distribution.
- Text Preprocessing: Normalize, tokenize, and lemmatize the text data, removing stopwords and punctuation (see the hazm sketch after this list).
- Identify Key Terms with TF-IDF: Apply TF-IDF vectorization to highlight key terms that characterize each category.
- Feature Extraction: Use TF-IDF and Word2Vec for extracting text features.
- Model Training and Evaluation: Train and evaluate multiple machine learning models and an LSTM network.
- Performance Visualization: Visualize the performance of models through precision, recall, and confusion matrices.
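A minimal sketch of the preprocessing step, assuming hazm's Normalizer, Lemmatizer, word_tokenize, and stopwords_list utilities; the exact cleaning rules (e.g., which punctuation marks are stripped) are illustrative assumptions, not the project's precise configuration:

```python
import string
from hazm import Normalizer, Lemmatizer, word_tokenize, stopwords_list

normalizer = Normalizer()
lemmatizer = Lemmatizer()
stopwords = set(stopwords_list())
# Assumed punctuation set: ASCII punctuation plus common Persian marks.
punctuation = set(string.punctuation) | {"،", "؛", "؟", "«", "»"}

def preprocess(text):
    """Normalize, tokenize, and lemmatize a Persian document,
    dropping stopword and punctuation tokens."""
    normalized = normalizer.normalize(text)
    tokens = word_tokenize(normalized)
    return [
        lemmatizer.lemmatize(tok)
        for tok in tokens
        if tok not in stopwords and tok not in punctuation
    ]
```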
- Dependencies:
- Python 3.x
- Libraries: numpy, pandas, scikit-learn, TensorFlow, Keras, matplotlib, seaborn, gensim, hazm (for Persian text processing)
- Dataset Setup:
- Store your dataset with a directory structure where each subfolder represents a category.
- Set the `root_directory` variable in the script to the location of your dataset (a loading sketch follows below).
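The following is a sketch of how documents could be loaded and categorized from such a directory layout. It assumes plain-text UTF-8 files and uses each top-level subfolder name as the category label; the `root_directory` placeholder path is an assumption you should replace with your own.

```python
import os

root_directory = "path/to/dataset"  # set this to your dataset location

def load_documents(root_directory):
    """Walk the dataset tree and collect (text, category) pairs,
    using each top-level subfolder name as the category label."""
    texts, labels = [], []
    for category in sorted(os.listdir(root_directory)):
        category_dir = os.path.join(root_directory, category)
        if not os.path.isdir(category_dir):
            continue
        for dirpath, _, filenames in os.walk(category_dir):
            for name in filenames:
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8") as f:
                    texts.append(f.read())
                labels.append(category)
    return texts, labels

texts, labels = load_documents(root_directory)
print(f"Loaded {len(texts)} documents across {len(set(labels))} categories")
```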
Execute the script section by section:
- Data Loading and Preprocessing: Load and categorize documents; display dataset statistics.
- Text Preprocessing and Feature Extraction: Clean text data; extract features using TF-IDF and Word2Vec (a sketch covering feature extraction and model training follows this list).
- Model Training and Evaluation: Train and evaluate Naive Bayes, Random Forest, Logistic Regression, SVM, MLP, and LSTM models.
- Results Interpretation: Compare model performances and analyze the impact of different feature extraction techniques.
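As a hedged illustration of the feature-extraction and training steps, the sketch below builds TF-IDF features and averaged Word2Vec document vectors, then trains and scores one classical model. It reuses `texts`, `labels`, and `preprocess()` from the earlier sketches; the hyperparameters, the choice of Logistic Regression, and the simplification to a single category label per document are assumptions, not the project's exact setup.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Token lists from the preprocessing sketch; re-joined strings for TF-IDF.
tokenized = [preprocess(doc) for doc in texts]
joined = [" ".join(tokens) for tokens in tokenized]

# TF-IDF features.
tfidf = TfidfVectorizer(max_features=20000)
X_tfidf = tfidf.fit_transform(joined)

# Word2Vec features: average the word vectors of each document.
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=2, workers=4)

def doc_vector(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X_w2v = np.vstack([doc_vector(tokens) for tokens in tokenized])

# Train and evaluate one classical model on the TF-IDF features.
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, labels, test_size=0.2, random_state=42, stratify=labels)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```

The same split and report can be repeated with `X_w2v` in place of `X_tfidf`, or with the other classifiers listed below, to compare feature sets and models.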
Models implemented:
- Naive Bayes
- Random Forest
- Logistic Regression
- Support Vector Machine (SVM)
- Multi-Layer Perceptron (MLP)
- Long Short-Term Memory Network (LSTM)
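The LSTM is the only sequence model in the list above; a minimal Keras sketch is shown below, reusing `joined` and `labels` from the previous sketch. The vocabulary size, sequence length, layer sizes, epoch count, and the single-label softmax head are illustrative assumptions rather than the project's actual architecture (for a true multi-label setup, a sigmoid output with binary cross-entropy over multi-hot labels would be used instead).

```python
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_WORDS, MAX_LEN = 20000, 300  # assumed vocabulary size and sequence length

# Turn the preprocessed documents into padded integer sequences.
keras_tokenizer = Tokenizer(num_words=MAX_WORDS)
keras_tokenizer.fit_on_texts(joined)
sequences = keras_tokenizer.texts_to_sequences(joined)
X_seq = pad_sequences(sequences, maxlen=MAX_LEN)

# Encode category names as integer class ids.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)
num_classes = len(encoder.classes_)

model = Sequential([
    Embedding(input_dim=MAX_WORDS, output_dim=128),
    LSTM(128),
    Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_seq, y, validation_split=0.2, epochs=5, batch_size=64)
```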
This project illustrates the application of various machine learning and deep learning models to multi-label text classification, providing insights into which models and feature extraction techniques perform best.