This project explores the relationship between video game reviews, sentiment, and sales performance.
We combine text analysis (NLP) and machine learning models to answer:
- Do player sentiments in reviews predict game sales?
- How do genre and platform affect performance?
- Which machine learning models work best for sentiment prediction and sales prediction?
We combined two data sources:
- Steam reviews dataset (`dataset.csv`) → game reviews and text. (Available at https://www.kaggle.com/datasets/andrewmvd/steam-reviews)
- VG Sales dataset (`vgsales.csv`) → sales numbers, genre, platform, year. (Available at https://www.kaggle.com/datasets/gregorut/videogamesales)
After preprocessing, we built a unified `final_df` with the following columns:
| game | genre | platform | review_text | sentiment | sales | year | sentiment_score |
|---|---|---|---|---|---|---|---|
- Removed missing values and duplicates.
- Cleaned review text: lowercasing, punctuation removal, stopword removal, lemmatization.
- Fixed formatting issues (spacing between words).
- Removed `"Unknown"` genres from analysis.
- Distribution of genres and platforms.
- Sentiment distribution (positive, negative, neutral).
- Sales distribution by genre/platform.
- ANOVA + Regression checks to test statistical significance.
- Data stored in SQLite database (`gaming_reviews.db`).
- Queried sales by genre/platform and sentiment distribution.
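
The kind of queries described above might look like the following sketch. It uses an in-memory database instead of `gaming_reviews.db`, and the table name `reviews`, its schema, and the rows are assumptions made for illustration only.

```python
import sqlite3

# In-memory database for the example; the project stores data in gaming_reviews.db.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews (game TEXT, genre TEXT, platform TEXT, sentiment TEXT, sales REAL)"
)
# Toy rows (made-up values), one row per game.
rows = [
    ("Game A", "Action", "PC", "positive", 1.5),
    ("Game B", "Action", "PS4", "negative", 0.5),
    ("Game C", "Puzzle", "PC", "positive", 0.3),
]
conn.executemany("INSERT INTO reviews VALUES (?, ?, ?, ?, ?)", rows)

# Total sales per genre, highest first.
sales_by_genre = conn.execute(
    "SELECT genre, SUM(sales) FROM reviews GROUP BY genre ORDER BY SUM(sales) DESC"
).fetchall()

# Number of rows per sentiment label.
sentiment_counts = conn.execute(
    "SELECT sentiment, COUNT(*) FROM reviews GROUP BY sentiment ORDER BY sentiment"
).fetchall()

print(sales_by_genre)
print(sentiment_counts)
conn.close()
```

If the real table holds one row per review rather than per game, sales would need to be deduplicated per game before summing.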
- H1: Games with positive sentiment have higher sales.
- H2: Genre & platform explain sales differences better than reviews alone.
- H3: Review sentiment predicts future success.
- Linear regression → Sentiment vs. Sales.
- Extended regression → Sales ~ sentiment + genre + platform.
- Result: genre/platform stronger predictors than sentiment alone.
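
The comparison between the simple and extended regressions can be sketched with the statsmodels formula API. The numbers are made-up toy values, and the column names are assumed to follow `final_df`. Because R² never decreases when predictors are added, adjusted R² is the fairer figure to compare.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Toy data (made-up values) standing in for final_df.
df = pd.DataFrame({
    "sales": [1.4, 1.1, 0.9, 1.6, 0.5, 0.4, 0.7, 0.6, 0.2, 0.3, 0.25, 0.35],
    "sentiment_score": [0.6, 0.2, -0.1, 0.8, 0.5, -0.3, 0.4, 0.1, 0.3, -0.2, 0.7, 0.0],
    "genre": ["Action"] * 4 + ["Sports"] * 4 + ["Puzzle"] * 4,
    "platform": ["PC", "PS4"] * 6,
})

# Simple model: sentiment only.
simple = ols("sales ~ sentiment_score", data=df).fit()
# Extended model: sentiment plus genre and platform as categorical predictors.
extended = ols("sales ~ sentiment_score + C(genre) + C(platform)", data=df).fit()

print(f"simple   adj. R^2: {simple.rsquared_adj:.3f}")
print(f"extended adj. R^2: {extended.rsquared_adj:.3f}")
```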
- Transformed reviews into TF-IDF features.
- Models tested: Logistic Regression, Random Forest, XGBoost.
- Best: Random Forest → Accuracy 91%.
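
A minimal sketch of a TF-IDF plus Random Forest sentiment classifier in scikit-learn is given here. The reviews and labels are invented toy examples, and the hyperparameters are assumptions rather than the settings behind the reported 91% accuracy.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy training data (made up); the project trains on cleaned Steam reviews.
reviews = [
    "great game really fun",
    "fun story great graphics",
    "love this game amazing",
    "amazing fun great",
    "terrible game boring",
    "boring buggy waste money",
    "hate this game awful",
    "awful terrible buggy",
]
labels = ["positive"] * 4 + ["negative"] * 4

# The pipeline turns raw text into TF-IDF vectors and feeds them to the classifier.
clf = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
clf.fit(reviews, labels)

pred = clf.predict(["great fun game"])
print(pred[0])
```

Swapping `RandomForestClassifier` for `LogisticRegression` in the same pipeline would give a like-for-like comparison between models. Accuracy should be measured on a held-out split, not on the training data.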
- Target: game sales (numeric).
- Features: review TF-IDF + sentiment score + genre/platform.
- Models tested:
- Random Forest Regressor
- Gradient Boosting
- Ridge & Lasso Regression
- Best: Random Forest (R² = 0.46, RMSE ≈ 0.87).
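
For reference, R² and RMSE for a Random Forest regressor can be computed as sketched here. The features are synthetic random numbers standing in for the real TF-IDF, sentiment, and genre/platform features, so the printed scores say nothing about the reported results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 5 numeric features and a noisy target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(size=200)

# Hold out 25% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)                               # share of variance explained
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))   # error in the target's units
print(f"R^2 = {r2:.2f}, RMSE = {rmse:.2f}")
```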
- Sentiment classification → high accuracy (91% with Random Forest).
- Sales regression → moderate predictive power (R² = 0.46).
- Genre & platform more influential than raw reviews.
- Missing real-world factors (marketing, hype, brand) likely limit predictive power.
- Top 10 game genres (bar plot).
- Sentiment distribution.
- ANOVA plots for sales by sentiment.
- Feature importance plots for sales models (words, genres, platforms).
- Sentiment classification worked well (91% accuracy with Random Forest).
- Sales regression was limited (R² ≈ 0.46).
- Business context (marketing, release timing, brand loyalty) needed to explain sales more fully.
- Incorporate external factors (marketing spend, social media hype).
- Try deep learning models (LSTMs, transformers) on larger compute resources.
- Build a dashboard (Streamlit) for interactive exploration.
- Python (pandas, numpy, matplotlib, seaborn, scikit-learn, statsmodels)
- NLP (NLTK, scikit-learn TF-IDF)
- SQL (SQLite)
- ML Models: Logistic Regression, Random Forest, XGBoost, Ridge, Lasso, Gradient Boosting