Game Reviews & Sales Analysis

Project Overview

This project explores the relationship between video game reviews, sentiment, and sales performance.
We combine text analysis (NLP) and machine learning models to answer:

Do player sentiments in reviews predict game sales?
How do genre and platform affect performance?
Which machine learning models work best for sentiment prediction and sales prediction?

Dataset

We combined two sources:

Steam reviews dataset (dataset.csv) → review text per game. (Available at: https://www.kaggle.com/datasets/andrewmvd/steam-reviews)
VG Sales dataset (vgsales.csv) → sales numbers, genre, platform, year. (Available at: https://www.kaggle.com/datasets/gregorut/videogamesales)

After preprocessing, we built a unified final_df with these columns:

game, genre, platform, review_text, sentiment, sales, year, sentiment_score

Steps & Techniques

1. Data Cleaning

Removed missing values and duplicates.
Cleaned review text: lowercasing, punctuation removal, stopword removal, lemmatization.
Fixed formatting issues (spacing between words).
Removed "Unknown" genres from analysis.
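The text-cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's code: the function names and the tiny stopword set are hypothetical, and lemmatization is left out because it needs NLTK's WordNet data to be downloaded first.

```python
import string

# Hypothetical, deliberately tiny stopword set; a real run would use a full list
# (for example NLTK's English stopwords).
STOPWORDS = {"a", "an", "the", "is", "it", "this", "and", "of", "to", "i"}

# Map every punctuation character to a space so words are not glued together.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_review(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, drop stopwords."""
    text = text.lower().translate(_PUNCT_TABLE)
    tokens = text.split()  # split() also collapses runs of whitespace
    return " ".join(t for t in tokens if t not in STOPWORDS)


def clean_reviews(reviews):
    """Clean a list of reviews, skipping missing/empty entries and duplicates."""
    seen = set()
    cleaned = []
    for review in reviews:
        if review is None:
            continue
        text = clean_review(review)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

For example, clean_review("This is  a GREAT game!!!") returns "great game", and two reviews that differ only in case or punctuation collapse to a single entry.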

2. Exploratory Data Analysis (EDA)

Distribution of genres and platforms.
Sentiment distribution (positive, negative, neutral).
Sales distribution by genre/platform.
ANOVA + Regression checks to test statistical significance.
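As a rough illustration of the ANOVA check, the one-way F statistic compares the variance between group means with the variance inside each group. The sketch below computes it by hand; the function name and the numbers are made up for illustration and are not project data. In practice scipy.stats.f_oneway returns the same statistic together with a p-value.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Illustrative sales figures for three hypothetical sentiment groups.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F value suggests that at least one group mean differs from the others; whether it is significant depends on the degrees of freedom (k - 1 and n - k).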

3. SQL Integration

Data stored in SQLite database (gaming_reviews.db).
Queried sales by genre/platform and sentiment distribution.
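The queries described above might look like the following sketch. It uses an in-memory database instead of gaming_reviews.db, and the table name reviews and the three rows are assumptions made for illustration; only the column names come from final_df.

```python
import sqlite3

# In-memory database for the sketch; the project stored its data in gaming_reviews.db.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews (game TEXT, genre TEXT, platform TEXT, sentiment TEXT, sales REAL)"
)

# Made-up rows, not project data.
rows = [
    ("Game A", "Action", "PC", "positive", 1.5),
    ("Game B", "Action", "PS4", "negative", 0.5),
    ("Game C", "Puzzle", "PC", "positive", 0.25),
]
conn.executemany("INSERT INTO reviews VALUES (?, ?, ?, ?, ?)", rows)

# Total sales per genre, highest first.
sales_by_genre = conn.execute(
    "SELECT genre, SUM(sales) AS total_sales "
    "FROM reviews GROUP BY genre ORDER BY total_sales DESC"
).fetchall()

# Number of reviews per sentiment label.
sentiment_counts = conn.execute(
    "SELECT sentiment, COUNT(*) FROM reviews GROUP BY sentiment ORDER BY sentiment"
).fetchall()

conn.close()
```

Swapping genre for platform in the first query gives the per-platform totals.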

4. Hypothesis Testing

H1: Games with positive sentiment have higher sales.
H2: Genre & platform explain sales differences better than reviews alone.
H3: Review sentiment predicts future success.

5. Regression Analysis

Linear regression → Sentiment vs. Sales.
Extended regression → Sales ~ sentiment + genre + platform.
Result: genre and platform were stronger predictors than sentiment alone.
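To show what the extended regression does mechanically, the sketch below fits sales on a sentiment score plus a one-hot genre indicator using ordinary least squares in NumPy. The data is synthetic and noise-free, so the fit simply recovers the coefficients used to generate it; none of these numbers are project results. With statsmodels, the equivalent model could be written as a formula such as sales ~ sentiment_score + C(genre) + C(platform).

```python
import numpy as np

# Synthetic data: six hypothetical games with a sentiment score and a genre.
sentiment = np.array([0.1, 0.9, 0.4, 0.7, 0.2, 0.8])
genre = np.array(["Action", "Action", "Puzzle", "Puzzle", "Action", "Puzzle"])

# One-hot encode genre with "Puzzle" as the baseline category.
is_action = (genre == "Action").astype(float)

# Generate sales from known coefficients so the fit can be checked.
sales = 1.0 + 0.5 * sentiment + 2.0 * is_action

# Design matrix: intercept, sentiment, genre indicator.
X = np.column_stack([np.ones_like(sentiment), sentiment, is_action])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
intercept, b_sentiment, b_action = coef
```

Here the genre coefficient (2.0) is four times the sentiment coefficient (0.5) by construction, which mirrors the kind of comparison the extended regression makes on real data.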

6. Classification (Sentiment Prediction)

Transformed reviews into TF-IDF features.
Models tested: Logistic Regression, Random Forest, XGBoost.
Best: Random Forest → Accuracy 91%.
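A minimal sketch of the TF-IDF classification setup, using scikit-learn. It uses Logistic Regression, one of the models listed above, rather than the best-performing Random Forest, because it is deterministic on a toy example. The six training reviews are invented for illustration; the project's hyperparameters and train/test split are not stated in this README and are not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy reviews, already cleaned; not project data.
train_texts = [
    "great fun game",
    "love this game",
    "amazing great story",
    "terrible boring game",
    "awful buggy mess",
    "boring awful story",
]
train_labels = ["positive"] * 3 + ["negative"] * 3

# The pipeline turns text into TF-IDF vectors, then fits the classifier on them.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

predictions = model.predict(["great amazing fun", "awful boring mess"])
```

Replacing LogisticRegression() with RandomForestClassifier(random_state=0) from sklearn.ensemble would give a Random Forest version of the same pipeline.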

7. Regression (Sales Prediction)

Target: game sales (numeric).
Features: review TF-IDF + sentiment score + genre/platform.
Models tested:
- Random Forest Regressor
- Gradient Boosting
- Ridge & Lasso Regression
Best: Random Forest (R² = 0.46, RMSE ≈ 0.87).
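For reference, the two metrics reported above can be computed as in the sketch below. The function names and sample values are illustrative only. An R² of 0.46 means the model accounts for roughly 46% of the variance in sales on the evaluation data, and RMSE is expressed in the same units as the sales column.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r_squared(y_true, y_pred):
    """Share of variance in y_true explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative values, not project results.
y_true = [1, 2, 3, 4]
y_pred = [1, 2, 3, 6]
example_rmse = rmse(y_true, y_pred)
example_r2 = r_squared(y_true, y_pred)
```

In scikit-learn the same quantities are available through sklearn.metrics.r2_score and sklearn.metrics.mean_squared_error.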

8. Model Insights

Sentiment classification → high accuracy (91% with Random Forest).
Sales regression → moderate predictive power (R² = 0.46).
Genre & platform more influential than raw reviews.
Missing real-world factors (marketing, hype, brand) likely limit predictive power.

Visualizations

Top 10 game genres (bar plot).
Sentiment distribution.
ANOVA plots for sales by sentiment.
Feature importance plots for sales models (words, genres, platforms).

Results Summary

Sentiment classification worked well (best accuracy: 91%).
Sales regression was limited (R² = 0.46).
Business context (marketing, release timing, brand loyalty) is needed to explain sales more fully.

Next Steps

Incorporate external factors (marketing spend, social media hype).
Try deep learning models (LSTMs, transformers) with more compute.
Build a dashboard (Streamlit) for interactive exploration.

Tech Stack

Python (pandas, numpy, matplotlib, seaborn, scikit-learn, statsmodels)
NLP (NLTK, scikit-learn TF-IDF)
SQL (SQLite)
ML Models: Logistic Regression, Random Forest, XGBoost, Ridge, Lasso, Gradient Boosting

levgirgin/sentiment-analysis-sales-prediction
