Duplicate Question Pair Detection

Overview

This project tackles the problem of detecting duplicate questions on Quora, based on a competition hosted on Kaggle. The primary goal is to identify whether two questions have the same meaning, which helps reduce redundant content on the platform. Using a combination of natural language processing (NLP) techniques and machine learning algorithms, the model achieves an accuracy of approximately 80%.

Demo

You can try out the live demo of the project here.

Technologies Used

Python
Pandas
NumPy
Scikit-learn
NLTK / spaCy
Streamlit

Dataset

The dataset used for this project is sourced from Kaggle, containing pairs of questions and labels indicating whether they are duplicates. You can access the dataset here.

Features

Natural Language Processing: Tokenization, stemming, and lemmatization to preprocess text data.
Machine Learning: Various models tested for optimal performance in detecting duplicates.
Interactive Web App: Users can input question pairs and get real-time feedback on their similarity.
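
As a rough illustration of the preprocessing step, the sketch below tokenizes two questions and computes a simple shared-word ratio. It uses only the Python standard library; the function names `tokenize` and `jaccard_overlap` are hypothetical and are not taken from this repository. Stemming and lemmatization are left out here and would normally be handled with NLTK or spaCy.

```python
import re


def tokenize(question: str) -> list[str]:
    """Lowercase a question and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", question.lower())


def jaccard_overlap(q1: str, q2: str) -> float:
    """Share of distinct tokens that appear in both questions (0.0 to 1.0)."""
    a, b = set(tokenize(q1)), set(tokenize(q2))
    if not a and not b:
        # Two empty questions share nothing meaningful.
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Example questions invented for illustration.
    print(tokenize("How do I learn Python?"))
    print(jaccard_overlap("How do I learn Python?", "How can I learn Python?"))
```

A ratio like this is only one possible hand-crafted feature; on its own it cannot tell apart questions that share words but differ in meaning.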

Model

The project utilizes a variety of machine learning techniques, including:

TF-IDF Vectorization: To convert text data into numerical format.
Logistic Regression: As the baseline model for comparison.
Random Forest and XGBoost: Advanced models to improve accuracy.
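
The TF-IDF and Logistic Regression baseline listed above could look roughly like the following. This is a minimal sketch using scikit-learn, not the code from this repository: the toy questions and labels are invented for illustration, and the `featurize` helper and the choice to concatenate the two question vectors side by side are assumptions.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data invented for illustration (not from the Kaggle dataset).
q1 = [
    "How do I learn Python?",
    "What is the capital of France?",
    "How can I lose weight fast?",
    "What is machine learning?",
]
q2 = [
    "What is the best way to learn Python?",
    "How do I cook pasta?",
    "What is the quickest way to lose weight?",
    "Who won the world cup in 2018?",
]
labels = [1, 0, 1, 0]  # 1 = duplicate, 0 = not duplicate

# Fit a single vocabulary over both question columns.
vectorizer = TfidfVectorizer()
vectorizer.fit(q1 + q2)


def featurize(first, second):
    """Hypothetical helper: stack the TF-IDF vectors of both questions."""
    return hstack(
        [vectorizer.transform(first), vectorizer.transform(second)]
    ).tocsr()


X = featurize(q1, q2)
model = LogisticRegression(max_iter=1000)
model.fit(X, labels)

# Probability that a new pair is a duplicate (column 1 is class 1).
proba = model.predict_proba(
    featurize(["How do I learn Python?"], ["What is the best way to learn Python?"])
)[0, 1]
print(f"Estimated duplicate probability: {proba:.2f}")
```

With only four training pairs the printed probability is not meaningful; the point is the shape of the pipeline. Random Forest or XGBoost could be swapped in for `LogisticRegression` on the same feature matrix.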

Installation

To run this project locally, follow these steps:

Clone the repository:
```
git clone https://github.com/mohitkumhar/duplicate-question-pair-detection.git
```

Navigate to the project directory:
```
cd duplicate-question-pair-detection
```
Install the required packages:
```
pip install -r requirements.txt
```

Usage

To start the Streamlit application, run the following command in your terminal:

```
streamlit run app.py
```

This will open a new tab in your web browser with the application interface, where you can input question pairs for duplicate detection.

Contributing

Contributions are welcome! If you have suggestions for improvements or new features, please fork the repository and submit a pull request.
