Predicting sleep efficiency from daily activity patterns using real wearable device data, demonstrating ML capabilities for health technology applications such as the Oura Ring.
This project analyzes FitBit fitness tracker data to predict sleep efficiency from daytime activity patterns. Built to showcase capabilities relevant to wearable health technology companies like Oura, Whoop, and Fitbit.
- ✅ Real wearable data from 20 FitBit users over 31 days
- ✅ Comprehensive feature engineering (64 features including lags, rolling averages, baselines)
- ✅ User-based train/test split (prevents data leakage)
- ✅ Explainable AI (SHAP analysis)
- ✅ Production-ready code structure
- ✅ Complete pipeline (EDA → Features → Training → Visualization)
| Metric | Random Forest | XGBoost |
|---|---|---|
| R² Score | -0.26 | -0.36 |
| MAE | 0.093 | 0.098 |
| RMSE | 0.150 | 0.156 |
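Note: Model performance is limited by the small dataset (140 samples after feature engineering, of which 28 are test samples). The negative R² values mean that both models predict held-out users worse than a constant equal to the mean sleep efficiency would. The project demonstrates the complete ML pipeline and methodology rather than high predictive accuracy.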
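As a reference for reading the table, the three metrics can be computed in a few lines. This is a minimal NumPy sketch on made-up numbers, not the project's evaluation code; the helper name `regression_metrics` is hypothetical. It also illustrates how R² drops below zero once predictions are worse than always predicting the mean.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R^2, MAE and RMSE for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of a mean-only predictor
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(residuals)),
        "rmse": np.sqrt(np.mean(residuals ** 2)),
    }

# Illustrative sleep-efficiency values only (not project data).
y_true = [0.90, 0.92, 0.94, 0.96]
mean_baseline = [0.93, 0.93, 0.93, 0.93]   # always predict the mean
poor_model = [0.96, 0.94, 0.92, 0.90]      # trends in the wrong direction

baseline_scores = regression_metrics(y_true, mean_baseline)
poor_scores = regression_metrics(y_true, poor_model)
print(baseline_scores["r2"], poor_scores["r2"])
```

The mean-only baseline scores an R² of about zero, while the second set of predictions scores well below zero even though its absolute errors are only a few percentage points.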
- Average daily steps: 7,939 steps
- Average sleep duration: 6.56 hours
- Average sleep efficiency: 91.46%
- Strongest predictor: TotalMinutesAsleep (correlation: 0.305)
- Users analyzed: 20 users with both activity and sleep data
- Time period: 31 days
fitbit_sleep_analysis/
├── src/
│   ├── 01_eda.py                      # Exploratory data analysis
│   ├── 02_feature_engineering.py      # Feature creation (64 features)
│   ├── 03_train_models.py             # Model training (RF + XGBoost)
│   └── 04_create_visualizations.py    # Results visualization
├── data/
│   ├── raw/                           # Raw FitBit CSV files
│   └── processed/                     # Cleaned & feature-engineered data
├── models/                            # Trained models (gitignored)
├── outputs/                           # Visualizations (gitignored)
├── README.md
├── requirements.txt
├── run_pipeline.sh                    # SLURM batch script for Puhti
└── .gitignore
- Python 3.10+
- Required packages (see requirements.txt)
# Clone repository
git clone https://github.com/mdkarimuddin/fitbit_sleep_analysis.git
cd fitbit_sleep_analysis
# Install dependencies
pip install -r requirements.txt

Option 1: Run on Puhti (HPC)

sbatch run_pipeline.sh

Option 2: Run locally step by step

# Step 1: EDA
python src/01_eda.py
# Step 2: Feature Engineering
python src/02_feature_engineering.py
# Step 3: Model Training
python src/03_train_models.py
# Step 4: Visualizations
python src/04_create_visualizations.py

The FitBit dataset can be downloaded from Kaggle:

kaggle datasets download -d arashnic/fitbit
unzip fitbit.zip -d data/raw/

- Source: FitBit Fitness Tracker Data (Kaggle)
- Users: 20 with complete activity + sleep data
- Duration: 31 days (April-May 2016)
- Metrics: Steps, distance, calories, active minutes, sleep duration, sleep efficiency
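To make the dataset description concrete, the sketch below joins daily activity with daily sleep records and derives a sleep efficiency target. Several things here are assumptions rather than details confirmed by this README: the column names (`ActivityDate`, `SleepDay`, `TotalTimeInBed`), the date formats, and the definition of sleep efficiency as minutes asleep divided by time in bed. Small in-memory frames with made-up values stand in for the raw CSV files.

```python
import pandas as pd

# Tiny stand-ins for the daily activity and daily sleep tables (values are made up).
activity = pd.DataFrame({
    "Id": [1, 1, 2],
    "ActivityDate": ["4/12/2016", "4/13/2016", "4/12/2016"],
    "TotalSteps": [8000, 10500, 4000],
})
sleep = pd.DataFrame({
    "Id": [1, 2],
    "SleepDay": ["4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"],
    "TotalMinutesAsleep": [400, 360],
    "TotalTimeInBed": [440, 400],
})

# Parse both date columns to midnight timestamps so they share a join key.
activity["date"] = pd.to_datetime(activity["ActivityDate"], format="%m/%d/%Y")
sleep["date"] = pd.to_datetime(
    sleep["SleepDay"], format="%m/%d/%Y %I:%M:%S %p"
).dt.normalize()

# Keep only user-days that have both activity and sleep records.
merged = activity.merge(
    sleep[["Id", "date", "TotalMinutesAsleep", "TotalTimeInBed"]],
    on=["Id", "date"],
    how="inner",
)

# Assumed target definition: fraction of time in bed spent asleep.
merged["sleep_efficiency"] = merged["TotalMinutesAsleep"] / merged["TotalTimeInBed"]
print(merged[["Id", "date", "TotalSteps", "sleep_efficiency"]])
```

The inner join drops days without a sleep record, which is one reason the usable sample can be much smaller than users multiplied by days.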
- Lagged features (1, 2, 3 days prior)
- Rolling averages (3-day and 7-day windows)
- User baselines (personalization)
- Deviations from baseline (activity/rest indicators)
- Training load (acute vs chronic workload)
- Temporal features (day of week, weekend, cyclical encoding)
- Sleep debt (cumulative sleep deviation)
- Activity intensity score (weighted combination)
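Several of the feature families above can be sketched with pandas group operations. The example below is illustrative only and is not taken from `02_feature_engineering.py`: the function name `add_activity_features`, the output column names, and the choice of a 3-day over 7-day ratio for training load are all assumptions. It assumes one row per user per consecutive day, with columns `Id`, `date`, and an activity column.

```python
import pandas as pd

def add_activity_features(df, col="TotalSteps"):
    """Add per-user lag, rolling, baseline and load features for one column."""
    df = df.sort_values(["Id", "date"]).copy()
    by_user = df.groupby("Id")[col]

    # Lagged values: the same user's activity 1, 2 and 3 rows (days) earlier.
    for lag in (1, 2, 3):
        df[f"{col}_lag{lag}"] = by_user.shift(lag)

    # Rolling means over 3- and 7-day windows, computed within each user.
    for window in (3, 7):
        df[f"{col}_roll{window}"] = by_user.transform(
            lambda s, w=window: s.rolling(w, min_periods=1).mean()
        )

    # User baseline and deviation from it. Note: this mean uses all of the
    # user's rows; a strict forecasting setup would use past days only.
    df[f"{col}_baseline"] = by_user.transform("mean")
    df[f"{col}_dev"] = df[col] - df[f"{col}_baseline"]

    # Acute vs chronic load, here the 3-day mean over the 7-day mean (assumed).
    df[f"{col}_acr"] = df[f"{col}_roll3"] / df[f"{col}_roll7"]
    return df

# Made-up data: two users, four consecutive days each.
demo = pd.DataFrame({
    "Id": [1] * 4 + [2] * 4,
    "date": list(pd.date_range("2016-04-12", periods=4)) * 2,
    "TotalSteps": [1000, 2000, 3000, 4000, 500, 500, 500, 500],
})
features = add_activity_features(demo)
print(features.filter(like="TotalSteps").head(4))
```

Grouping by `Id` before shifting or rolling keeps one user's history from leaking into another user's first rows, which show up as missing lag values instead.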
- Algorithms: Random Forest Regressor, XGBoost Regressor
- Validation: 5-fold cross-validation
- Train/Test: User-based split (80/20) to prevent leakage
- Metrics: R², MAE, RMSE
- Explainability: SHAP analysis for feature importance
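The user-based split and grouped cross-validation listed above can be sketched with scikit-learn's group-aware splitters. This is one possible setup under stated assumptions, not the code in `03_train_models.py`: the data is synthetic, the array sizes and hyperparameters are arbitrary, and combining `GroupShuffleSplit` with `GroupKFold` is an illustrative choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, GroupShuffleSplit, cross_val_score

# Synthetic stand-in data: 20 users with 7 rows each.
rng = np.random.default_rng(0)
n_users, days = 20, 7
groups = np.repeat(np.arange(n_users), days)      # user id for every row
X = rng.normal(size=(n_users * days, 5))          # placeholder features
y = rng.uniform(0.8, 1.0, size=n_users * days)    # placeholder sleep efficiency

# Hold out 20% of the users (not 20% of the rows) as the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# 5-fold CV on the training users, keeping each user inside a single fold.
model = RandomForestRegressor(n_estimators=50, random_state=42)
cv_scores = cross_val_score(
    model,
    X[train_idx],
    y[train_idx],
    groups=groups[train_idx],
    cv=GroupKFold(n_splits=5),
    scoring="neg_mean_absolute_error",
)

print("test users:", sorted(set(groups[test_idx])))
print("CV MAE per fold:", -cv_scores)
```

Splitting by user means the test score reflects performance on people the model has never seen, which is a harder and more realistic target than a random row-level split.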
The project generates comprehensive visualizations:
- Distribution plots (steps, calories, sleep duration, efficiency)
- Correlation matrix (activity vs sleep metrics)
- Activity vs sleep scatter plots
- Time series patterns (individual users)
- Day of week patterns
- Feature importance (top 20 features)
- Predictions vs actual (scatter plot)
- Error distribution
- SHAP summary plot
- User-level predictions (time series)
This project demonstrates:
- ✅ Real wearable data processing (FitBit → generalizable to Oura)
- ✅ Time-series feature engineering (multi-day patterns, trends)
- ✅ Personalization (user baselines and adaptations)
- ✅ Predictive modeling (forecasting sleep from activity)
- ✅ Explainable AI (SHAP for interpretability)
- ✅ Production mindset (proper validation, no data leakage)
- ✅ HPC deployment (SLURM batch processing on Puhti)
- Python 3.10+
- pandas, numpy - Data processing
- scikit-learn - ML, preprocessing
- XGBoost - Gradient boosting
- SHAP - Explainability
- matplotlib, seaborn - Visualization
- SLURM - HPC job scheduling
Rolling averages of activity over 3-7 days capture trends better than single-day metrics.
User-specific baselines and deviations from them improve predictions compared with population-level features alone.
Day of week and cyclical encoding help capture weekly patterns in activity and sleep.
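A common way to build such temporal features is to map the day of week onto a circle with sine and cosine, so that Sunday and Monday end up as neighbours. The sketch below is a minimal illustration; the column names `day_of_week`, `is_weekend`, `dow_sin`, and `dow_cos` are hypothetical and may differ from the project's.

```python
import numpy as np
import pandas as pd

# One week of dates, starting on a Monday.
temporal = pd.DataFrame({"date": pd.date_range("2016-04-11", periods=7, freq="D")})

temporal["day_of_week"] = temporal["date"].dt.dayofweek        # Monday=0 ... Sunday=6
temporal["is_weekend"] = (temporal["day_of_week"] >= 5).astype(int)

# Cyclical encoding: place the 7 days evenly around the unit circle.
angle = 2 * np.pi * temporal["day_of_week"] / 7
temporal["dow_sin"] = np.sin(angle)
temporal["dow_cos"] = np.cos(angle)

print(temporal)
```

With the raw integer encoding, Sunday (6) and Monday (0) look six units apart; in the sine/cosine space every pair of adjacent days is the same distance apart.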
The small dataset (140 samples) limits model performance. With more data, performance would likely improve.
- Incorporate heart rate data (available for 14 users)
- Multi-target prediction (sleep duration + efficiency simultaneously)
- LSTM for better temporal modeling
- Uncertainty quantification
- Real-time inference API
- Web dashboard (Streamlit)
- Hyperparameter optimization
- Ensemble methods
Md Karim Uddin, PhD
PhD Veterinary Medicine | MEng Big Data Analytics
Postdoctoral Researcher, University of Helsinki
- GitHub: @mdkarimuddin
- LinkedIn: Md Karim Uddin, PhD
MIT License
- Data: FitBit Fitness Tracker Data via Kaggle
- Inspired by Oura Ring's approach to sleep tracking
- Built on Puhti supercomputer (CSC Finland)
---

**⭐ Star this repo if you found it useful!**

Built to demonstrate capabilities for wearable health technology roles.