This repository contains the code and Persian report for the Divar platform's Health Hackathon, which aimed to detect harassment in user chats using NLP techniques. Divar is an online classified-ads and e-commerce platform in Iran with 30 million monthly active users. The chat data used for this project belongs to Divar.
In this hackathon, I placed 6th out of 50 competitors with a model accuracy of 90%.
The notebook is also available on Google Colab: 
- Setup Environment: Preparing the necessary environment for the project.
- Load and Read Data: Loading the chat data for analysis.
- Data Preprocessing and EDA:
  - Extract Messages: Extracting messages from the dataset.
  - Post Title WordCloud: Visualizing frequent words in post titles.
  - Post Categories: Analyzing the categories of posts.
  - Normalize Texts: Normalizing text data for consistency.
  - Word Tokenize: Tokenizing words for further analysis.
  - Swear Extraction: Identifying swear words in the dataset.
  - Remove Stopwords: Eliminating common stopwords.
  - Stemming & Lemmatization: Reducing words to their base forms.
  - Check #Words Distribution: Analyzing the distribution of word counts.
  - Choosing Initial Features: Selecting initial features for the model.
  - Split Data: Splitting the data into training and testing sets.
  - TF-IDF Vectorizer: Converting text data into TF-IDF features.
  - Feature Selection with Chi-Squared: Selecting relevant features using the Chi-Squared test.
- Model Data:
  - Machine Learning Algorithm Selection and Initialization: Choosing and initializing the appropriate machine learning algorithms.
  - Tune Model with Hyper-Parameters: Fine-tuning the model with hyperparameter optimization.
  - Final Model and Competition Submission: Finalizing the model and preparing for competition submission.
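
To give a feel for the Chi-Squared feature-selection step listed above, here is a minimal standard-library sketch. It is not the notebook's code, which may well rely on a library implementation. It scores each term by the chi-squared statistic of its 2x2 presence/absence versus label contingency table. The function names (`chi2_term_scores`, `select_top_k`) and the English toy messages are hypothetical stand-ins; the real Divar chat data is Persian and is not reproduced here.

```python
from collections import Counter


def chi2_term_scores(docs, labels):
    """Chi-squared score per term from a 2x2 table: term presence vs. binary label.

    docs   -- list of token lists (already normalized and tokenized)
    labels -- list of 0/1 labels (1 = harassment in this toy example)
    """
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos

    # Document frequency of each term inside each class.
    df_pos, df_neg = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        (df_pos if label else df_neg).update(set(tokens))

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # term present, label 1
        b = df_neg[term]   # term present, label 0
        c = n_pos - a      # term absent,  label 1
        d = n_neg - b      # term absent,  label 0
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A term that occurs in every document carries no signal: score 0.
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy data, for illustration only.
docs = [
    ["you", "idiot", "seller"],
    ["stupid", "idiot"],
    ["is", "the", "phone", "available"],
    ["what", "is", "the", "price"],
]
labels = [1, 1, 0, 0]

scores = chi2_term_scores(docs, labels)
top_terms = select_top_k(scores, 3)
```

Note that the statistic is symmetric: a term that appears only in harmless messages scores as high as one that appears only in abusive ones, since both separate the classes equally well. In a full pipeline the selected terms would restrict the TF-IDF vocabulary before model training.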