This project focuses on building and evaluating multiple models for multi-label text classification of Persian news articles. It uses a diverse dataset of documents labeled with categories such as politics, social issues, and culture, and implements both traditional machine learning models and a Long Short-Term Memory (LSTM) network to handle sequential text data effectively.
The dataset consists of a comprehensive collection of Persian news articles, organized into a hierarchical structure of categories and subcategories. The dataset can be accessed and downloaded from the following Google Drive link:
Steps to download and set up the dataset:
- Click on the link to navigate to Google Drive.
- Download the required dataset files.
- Place the downloaded files into the root directory of the project, ensuring they are named correctly as per the scripts' configuration.
- Loading Data and Categorizing Documents: Organize documents into categories based on the directory structure.
- Display Dataset Information: Provide statistics about the dataset including document counts and category distribution.
- Text Preprocessing: Normalize, tokenize, and lemmatize the text data, removing stopwords and punctuation (see the hazm sketch after this list).
- Identify Key Terms with TF-IDF: Apply TF-IDF vectorization to highlight key terms that characterize each category.
- Feature Extraction: Use TF-IDF and Word2Vec for extracting text features.
- Model Training and Evaluation: Train and evaluate multiple machine learning models and an LSTM network.
- Performance Visualization: Visualize the performance of models through precision, recall, and confusion matrices.
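A minimal sketch of the preprocessing step, assuming hazm's Normalizer, Lemmatizer, word_tokenize, and stopwords_list utilities; the exact cleaning rules (e.g., which punctuation marks are stripped) are illustrative assumptions, not the project's precise configuration:

```python
import string
from hazm import Normalizer, Lemmatizer, word_tokenize, stopwords_list

normalizer = Normalizer()
lemmatizer = Lemmatizer()
stopwords = set(stopwords_list())
# Assumed punctuation set: ASCII punctuation plus common Persian marks.
punctuation = set(string.punctuation) | {"،", "؛", "؟", "«", "»"}

def preprocess(text):
    """Normalize, tokenize, and lemmatize a Persian document,
    dropping stopword and punctuation tokens."""
    normalized = normalizer.normalize(text)
    tokens = word_tokenize(normalized)
    return [
        lemmatizer.lemmatize(tok)
        for tok in tokens
        if tok not in stopwords and tok not in punctuation
    ]
```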
- Dependencies:
- Python 3.x
- Libraries: numpy, pandas, scikit-learn, TensorFlow, Keras, matplotlib, seaborn, gensim, hazm (for Persian text processing)
- Dataset Setup:
- Store your dataset with a directory structure where each subfolder represents a category.
- Set the `root_directory` variable in the script to the location of your dataset (a loading sketch follows below).
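The following is a sketch of how documents could be loaded and categorized from such a directory layout. It assumes plain-text UTF-8 files and uses each top-level subfolder name as the category label; the `root_directory` placeholder path is an assumption you should replace with your own.

```python
import os

root_directory = "path/to/dataset"  # set this to your dataset location

def load_documents(root_directory):
    """Walk the dataset tree and collect (text, category) pairs,
    using each top-level subfolder name as the category label."""
    texts, labels = [], []
    for category in sorted(os.listdir(root_directory)):
        category_dir = os.path.join(root_directory, category)
        if not os.path.isdir(category_dir):
            continue
        for dirpath, _, filenames in os.walk(category_dir):
            for name in filenames:
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8") as f:
                    texts.append(f.read())
                labels.append(category)
    return texts, labels

texts, labels = load_documents(root_directory)
print(f"Loaded {len(texts)} documents across {len(set(labels))} categories")
```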
Execute the script section by section:
- Data Loading and Preprocessing: Load and categorize documents; display dataset statistics.
- Text Preprocessing and Feature Extraction: Clean text data; extract features using TF-IDF and Word2Vec (a sketch covering feature extraction and model training follows this list).
- Model Training and Evaluation: Train and evaluate Naive Bayes, Random Forest, Logistic Regression, SVM, MLP, and LSTM models.
- Results Interpretation: Compare model performances and analyze the impact of different feature extraction techniques.
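As a hedged illustration of the feature-extraction and training steps, the sketch below builds TF-IDF features and averaged Word2Vec document vectors, then trains and scores one classical model. It reuses `texts`, `labels`, and `preprocess()` from the earlier sketches; the hyperparameters, the choice of Logistic Regression, and the simplification to a single category label per document are assumptions, not the project's exact setup.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Token lists from the preprocessing sketch; re-joined strings for TF-IDF.
tokenized = [preprocess(doc) for doc in texts]
joined = [" ".join(tokens) for tokens in tokenized]

# TF-IDF features.
tfidf = TfidfVectorizer(max_features=20000)
X_tfidf = tfidf.fit_transform(joined)

# Word2Vec features: average the word vectors of each document.
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=2, workers=4)

def doc_vector(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X_w2v = np.vstack([doc_vector(tokens) for tokens in tokenized])

# Train and evaluate one classical model on the TF-IDF features.
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, labels, test_size=0.2, random_state=42, stratify=labels)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```

The same split and report can be repeated with `X_w2v` in place of `X_tfidf`, or with the other classifiers listed below, to compare feature sets and models.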
Models implemented:
- Naive Bayes
- Random Forest
- Logistic Regression
- Support Vector Machine (SVM)
- Multi-Layer Perceptron (MLP)
- Long Short-Term Memory Network (LSTM)
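The LSTM is the only sequence model in the list above; a minimal Keras sketch is shown below, reusing `joined` and `labels` from the previous sketch. The vocabulary size, sequence length, layer sizes, epoch count, and the single-label softmax head are illustrative assumptions rather than the project's actual architecture (for a true multi-label setup, a sigmoid output with binary cross-entropy over multi-hot labels would be used instead).

```python
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_WORDS, MAX_LEN = 20000, 300  # assumed vocabulary size and sequence length

# Turn the preprocessed documents into padded integer sequences.
keras_tokenizer = Tokenizer(num_words=MAX_WORDS)
keras_tokenizer.fit_on_texts(joined)
sequences = keras_tokenizer.texts_to_sequences(joined)
X_seq = pad_sequences(sequences, maxlen=MAX_LEN)

# Encode category names as integer class ids.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)
num_classes = len(encoder.classes_)

model = Sequential([
    Embedding(input_dim=MAX_WORDS, output_dim=128),
    LSTM(128),
    Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_seq, y, validation_split=0.2, epochs=5, batch_size=64)
```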
This project illustrates the application of various machine learning and deep learning models to multi-label text classification, providing insights into which models and feature extraction techniques perform best.