This project, SoccerSense, is a comprehensive soccer analytics platform that integrates multiple data sources, including structured CSV datasets, unstructured video data, and semi-structured JSON files. The primary goal is to address challenges in soccer analytics by providing automated data ingestion, advanced AI-driven analysis, and real-time insights for coaches, analysts, and scouts.

The project combines structured, semi-structured, and unstructured data from the following sources:
- Transfermarkt: https://www.transfermarkt.co.uk/
- YouTube: https://www.youtube.com/
- Temporal Landing Zone: stores raw data, including:
  - CSVs (match/player stats)
  - JSON (YouTube comments)
  - MP4 (match videos)
- Persistent Landing Zone: cleaned data is written using Delta Lake to support:
  - ACID-compliant transactional storage
  - Metadata management and schema enforcement
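As a rough illustration of the Temporal Landing Zone described above, the sketch below copies raw files into per-format folders without modifying them. The folder names (`csv`, `json`, `video`) and the function `land_raw_files` are hypothetical and are not taken from the repository.

```python
import shutil
from pathlib import Path

# Hypothetical mapping from file extension to a sub-folder of the landing zone.
ZONE_BY_SUFFIX = {".csv": "csv", ".json": "json", ".mp4": "video"}


def land_raw_files(source_dir: Path, landing_dir: Path) -> dict[str, list[str]]:
    """Copy raw files into per-format folders, leaving their contents untouched."""
    landed: dict[str, list[str]] = {name: [] for name in ZONE_BY_SUFFIX.values()}
    for path in sorted(source_dir.iterdir()):
        folder = ZONE_BY_SUFFIX.get(path.suffix.lower())
        if folder is None or not path.is_file():
            continue  # skip formats the landing zone does not expect
        target = landing_dir / folder
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target / path.name)  # copy2 also preserves timestamps
        landed[folder].append(path.name)
    return landed
```

Keeping the raw copies unchanged means the cleaning step that feeds the Persistent Landing Zone can be re-run from the same inputs at any time.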
- PySpark is used for:
  - Large-scale data cleaning and consistency checks
  - Video metadata processing
  - Parsing and structuring YouTube comments
- Data is stored in DuckDB for in-memory analytics
- KPIs (Key Performance Indicators) such as player performance and win rates are computed using PySpark
- Results are saved in Parquet format for downstream consumption
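The comment-parsing step listed above could look roughly like the following. This sketch uses plain Python rather than PySpark, and it assumes the raw JSON has the shape returned by the YouTube Data API v3 `commentThreads.list` endpoint; the project's stored files may differ. The function name, the output column names, and the sample payload are illustrative only.

```python
def flatten_comment_threads(payload: dict) -> list[dict]:
    """Turn a commentThreads-style response into flat, table-ready rows."""
    rows = []
    for item in payload.get("items", []):
        top = item.get("snippet", {}).get("topLevelComment", {})
        snippet = top.get("snippet", {})
        text = snippet.get("textDisplay", "").strip()
        if not text:
            continue  # drop empty comments during structuring
        rows.append(
            {
                "comment_id": top.get("id"),
                "video_id": snippet.get("videoId"),
                "author": snippet.get("authorDisplayName"),
                "text": text,
                "likes": int(snippet.get("likeCount", 0)),
                "published_at": snippet.get("publishedAt"),
            }
        )
    return rows


# Made-up payload for illustration; not real YouTube data.
SAMPLE = {
    "items": [
        {
            "snippet": {
                "topLevelComment": {
                    "id": "c1",
                    "snippet": {
                        "videoId": "v1",
                        "authorDisplayName": "fan_01",
                        "textDisplay": "What a goal!",
                        "likeCount": 12,
                        "publishedAt": "2024-01-01T12:00:00Z",
                    },
                }
            }
        },
        {"snippet": {"topLevelComment": {"id": "c2", "snippet": {"textDisplay": "  "}}}},
    ]
}

rows = flatten_comment_threads(SAMPLE)
```

Each row is a flat dictionary, so the result can be turned into a DataFrame and passed on to later stages without further nesting.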
A Streamlit web application provides:
- Video Detection Module: uses YOLOv8 to detect and track players and ball movement
- Sentiment Analysis Module: applies VADER to classify YouTube comments by emotional tone
- KPI Dashboard Module: built with Streamlit + DuckDB, enabling:
  - CSV upload & table preview
  - Custom SQL querying
  - Dynamic visualization using Matplotlib
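To make the sentiment step more concrete, here is a minimal sketch of labelling comments with VADER. The ±0.05 cutoffs on the compound score are the thresholds conventionally suggested in the VADER documentation; whether this project uses the same cutoffs is an assumption, and the helper names are hypothetical.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def classify_comments(comments: list[str]) -> list[tuple[str, str]]:
    """Score each comment with VADER and attach a label."""
    # Imported lazily so label_from_compound works without vaderSentiment installed.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return [
        (text, label_from_compound(analyzer.polarity_scores(text)["compound"]))
        for text in comments
    ]
```

With `vaderSentiment` installed, `classify_comments(["Great match!", "Terrible refereeing."])` returns each comment paired with its label.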
https://drive.google.com/drive/folders/1uwOWW3hrIYRXy-AmMu_Qvts5WWMsAZGZ?usp=share_link
To run the code for this project, you will need to install several Python packages.
- PySpark - For distributed data processing.
- Delta Lake - To enable ACID transactions and schema enforcement with Spark.
- yt-dlp - For downloading YouTube videos.
- Google API Client Library - For interacting with YouTube Data API.
- Kaggle API - For downloading datasets from Kaggle.
- DuckDB - An in-process SQL OLAP database optimized for analytics.
- PyTorch - For deep learning.
- ultralytics (YOLO) - For loading pretrained YOLO models (e.g., YOLOv5, YOLOv8).
- Streamlit - For building interactive data apps and dashboards.
- VADER Sentiment (vaderSentiment) - A lexicon-based sentiment analysis tool.
# Core data processing and storage
pip install pyspark
pip install delta-spark
# YouTube and API interaction
pip install yt-dlp
pip install google-api-python-client
# External data access
pip install kaggle
# Additional analytics tools
pip install torch
pip install duckdb
pip install streamlit
pip install vaderSentiment
# YOLO pretrained models (via Ultralytics)
pip install ultralytics
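One way to confirm that the installation succeeded is a small check like the one below. It is a sketch, not part of the repository: the function name is hypothetical, and the import names are the ones these packages normally expose (for example, `delta-spark` is imported as `delta`, and PyTorch is published on PyPI as `torch`).

```python
from importlib.util import find_spec

# pip package name -> top-level module it is normally imported as
REQUIRED_MODULES = {
    "pyspark": "pyspark",
    "delta-spark": "delta",
    "yt-dlp": "yt_dlp",
    "google-api-python-client": "googleapiclient",
    "kaggle": "kaggle",
    "torch": "torch",
    "duckdb": "duckdb",
    "streamlit": "streamlit",
    "vaderSentiment": "vaderSentiment",
    "ultralytics": "ultralytics",
}


def missing_packages(modules: dict[str, str] = REQUIRED_MODULES) -> list[str]:
    """Return pip names whose module cannot be found, without importing them."""
    return [pip_name for pip_name, module in modules.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: pip install " + " ".join(missing))
    else:
        print("All required packages are available.")
```

`find_spec` only locates a module and does not execute it, so the check stays fast and avoids import-time side effects.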
git clone https://github.com/woshimajintao/SoccerSense.git
cd SoccerSense/P2/Final_APP/Consumption\ Zone
Make sure all the required packages are installed as listed above.
Start the main Streamlit app:
streamlit run main.py
This will launch the application in your browser.
Jintao Ma - Big Data Management and Analytics Master Program
This project is licensed under the MIT License - see the LICENSE file for details.




