levgirgin/sentiment-analysis-sales-prediction

Game Reviews & Sales Analysis

Project Overview

This project explores the relationship between video game reviews, sentiment, and sales performance.
We combine text analysis (NLP) and machine learning models to answer:

  • Do player sentiments in reviews predict game sales?
  • How do genre and platform affect performance?
  • Which machine learning models work best for sentiment prediction and sales prediction?

Dataset

We worked with two combined sources:

After preprocessing, we built a unified final_df with:

game, genre, platform, review_text, sentiment, sales, year, sentiment_score

Steps & Techniques

1. Data Cleaning

  • Removed missing values and duplicates.
  • Cleaned review text: lowercasing, punctuation removal, stopword removal, lemmatization.
  • Fixed formatting issues (spacing between words).
  • Removed "Unknown" genres from analysis.
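The cleaning steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the project's actual code: the `STOPWORDS` set is a tiny stand-in for NLTK's stopword list, lemmatization is left out (it needs NLTK's WordNet data), and the helper names `clean_review` and `drop_missing_and_duplicates` are hypothetical.

```python
import string

# Tiny illustrative stopword set (an assumption); the project uses NLTK.
STOPWORDS = {"the", "a", "an", "is", "it", "this", "and", "of", "to", "was"}

# Map every punctuation character to a space so "great,fun" stays two words.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_review(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, drop stopwords."""
    text = text.lower().translate(_PUNCT_TO_SPACE)
    tokens = text.split()  # split() also collapses repeated whitespace
    return " ".join(t for t in tokens if t not in STOPWORDS)


def drop_missing_and_duplicates(rows):
    """Keep rows with review text and a known genre, one copy of each."""
    seen = set()
    kept = []
    for row in rows:
        if not row.get("review_text"):
            continue  # missing review text
        if row.get("genre") in (None, "", "Unknown"):
            continue  # missing or "Unknown" genre
        key = (row.get("game"), row["review_text"])
        if key in seen:
            continue  # exact duplicate review for the same game
        seen.add(key)
        kept.append(row)
    return kept
```

With pandas, the same row filtering is usually done with `dropna()` and `drop_duplicates()` on the DataFrame.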

2. Exploratory Data Analysis (EDA)

  • Distribution of genres and platforms.
  • Sentiment distribution (positive, negative, neutral).
  • Sales distribution by genre/platform.
  • ANOVA and regression checks to test whether group differences in sales are statistically significant.
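For readers unfamiliar with ANOVA, the sketch below computes the one-way F statistic by hand: the ratio of between-group to within-group variance. It is only an illustration of the idea; the function name `one_way_anova_f` is hypothetical and it does not return a p-value. In practice `scipy.stats.f_oneway` reports both the F statistic and the p-value.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of samples.

    Each element of `groups` is a list of numeric observations, for
    example the sales of all games that share one sentiment label.
    """
    k = len(groups)                   # number of groups
    n = sum(len(g) for g in groups)   # total number of observations
    grand_mean = mean(x for g in groups for x in g)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F value means the group means differ by more than the spread inside the groups would suggest.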

3. SQL Integration

  • Data stored in SQLite database (gaming_reviews.db).
  • Queried sales by genre/platform and sentiment distribution.
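The kind of query described above might look like the sketch below. It uses an in-memory database so it runs anywhere; the table name `reviews`, the column subset, and the three toy rows are assumptions for illustration and are not taken from gaming_reviews.db.

```python
import sqlite3

# In-memory stand-in; the project stores its data in gaming_reviews.db.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews ("
    "game TEXT, genre TEXT, platform TEXT, sentiment TEXT, sales REAL)"
)

# Toy rows, made up purely so the queries return something.
toy_rows = [
    ("Game A", "Action", "PS4", "positive", 2.0),
    ("Game B", "Action", "PC", "negative", 1.0),
    ("Game C", "RPG", "PC", "positive", 0.5),
]
conn.executemany("INSERT INTO reviews VALUES (?, ?, ?, ?, ?)", toy_rows)

# Average sales and row count per genre, highest average first.
sales_by_genre = conn.execute(
    "SELECT genre, AVG(sales) AS avg_sales, COUNT(*) AS n "
    "FROM reviews GROUP BY genre ORDER BY avg_sales DESC"
).fetchall()

# Number of reviews per sentiment label.
sentiment_counts = conn.execute(
    "SELECT sentiment, COUNT(*) FROM reviews "
    "GROUP BY sentiment ORDER BY sentiment"
).fetchall()

conn.close()
```

Grouping by `platform` instead of `genre` follows the same pattern.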

4. Hypothesis Testing

  • H1: Games with positive sentiment have higher sales.
  • H2: Genre & platform explain sales differences better than reviews alone.
  • H3: Review sentiment predicts future success.

5. Regression Analysis

  • Linear regression → Sentiment vs. Sales.
  • Extended regression → Sales ~ sentiment + genre + platform.
  • Result: genre and platform were stronger predictors of sales than sentiment alone.
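A rough sketch of this comparison: fit ordinary least squares once with sentiment only and once with sentiment plus one-hot encoded genre and platform, then compare R². The numbers below are made-up toy data, not the project's dataset or results, and the helpers `r_squared` and `one_hot` are hypothetical names. With statsmodels, the extended model is typically written as a formula such as `sales ~ sentiment_score + C(genre) + C(platform)`.

```python
import numpy as np


def r_squared(features, y):
    """Fit OLS with an intercept and return the in-sample R^2."""
    X = np.column_stack([np.ones(len(y)), features])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)


def one_hot(labels):
    """One-hot encode labels, dropping the first level as the baseline."""
    levels = sorted(set(labels))[1:]
    return np.array([[float(v == lvl) for lvl in levels] for v in labels])


# Toy data (assumed values, for illustration only).
sentiment = np.array([0.9, 0.2, 0.7, 0.1, 0.8, 0.3]).reshape(-1, 1)
genre = ["Action", "Action", "RPG", "RPG", "Sports", "Sports"]
platform = ["PC", "PS4", "PC", "PS4", "PC", "PS4"]
sales = np.array([2.1, 1.8, 0.9, 0.6, 1.4, 1.2])

r2_simple = r_squared(sentiment, sales)
r2_extended = r_squared(
    np.column_stack([sentiment, one_hot(genre), one_hot(platform)]), sales
)
```

In-sample R² can only go up when predictors are added, so a fair comparison should also look at adjusted R² or held-out data.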

6. Classification (Sentiment Prediction)

  • Transformed reviews into TF-IDF features.
  • Models tested: Logistic Regression, Random Forest, XGBoost.
  • Best: Random Forest → Accuracy 91%.
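The classification setup can be sketched as a scikit-learn pipeline that turns text into TF-IDF features and feeds them to a classifier. This example uses Logistic Regression, one of the models listed above, because it is quick to fit; the toy reviews and labels are invented for illustration, and the 91% figure cannot be reproduced from them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (assumed, not from the project's dataset).
train_texts = [
    "great fun game",
    "loved it great story",
    "terrible boring game",
    "awful boring waste",
]
train_labels = ["positive", "positive", "negative", "negative"]

# TF-IDF vectorization followed by a linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

predictions = model.predict(["great fun", "boring awful"])
```

Swapping in `RandomForestClassifier` from `sklearn.ensemble` in place of `LogisticRegression` keeps the rest of the pipeline unchanged.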

7. Regression (Sales Prediction)

  • Target: game sales (numeric).
  • Features: review TF-IDF + sentiment score + genre/platform.
  • Models tested:
    • Random Forest Regressor
    • Gradient Boosting
    • Ridge & Lasso Regression
  • Best: Random Forest (R² = 0.46, RMSE ≈ 0.87).
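To make the two reported metrics concrete, the sketch below computes them from scratch. R² is the share of variance in the target explained by the predictions, so 0.46 means a bit under half; RMSE is the typical prediction error in the same units as the sales column. The function names are illustrative; scikit-learn provides `r2_score` and `mean_squared_error` for the same purpose.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A model that always predicts the mean of the target scores R² = 0, and a perfect model scores R² = 1 with RMSE = 0.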

8. Model Insights

  • Sentiment classification → high accuracy.
  • Sales regression → moderate predictive power.
  • Genre & platform more influential than raw reviews.
  • Missing real-world factors (marketing, hype, brand) likely limit predictive power.

Visualizations

  • Top 10 game genres (bar plot).
  • Sentiment distribution.
  • ANOVA plots for sales by sentiment.
  • Feature importance plots for sales models (words, genres, platforms).

Results Summary

  • Sentiment classification worked well (91% accuracy with Random Forest).
  • Sales regression had limited predictive power (R² ≈ 0.46).
  • Business context (marketing, release timing, brand loyalty) is needed to explain sales more fully.

Next Steps

  • Incorporate external factors (marketing spend, social media hype).
  • Try deep learning models (LSTMs, transformers) on larger compute resources.
  • Build a dashboard (Streamlit) for interactive exploration.

Tech Stack

  • Python (pandas, numpy, matplotlib, seaborn, scikit-learn, statsmodels)
  • NLP (NLTK, scikit-learn TF-IDF)
  • SQL (SQLite)
  • ML Models: Logistic Regression, Random Forest, XGBoost, Ridge, Lasso, Gradient Boosting
