A machine learning project that predicts food delivery duration using historical order data. This project implements and compares multiple regression models to estimate delivery times based on various features such as store characteristics, order details, and dasher availability.
Accurate delivery time prediction is critical for food delivery platforms to set customer expectations and optimize logistics. This project analyzes historical delivery data to build predictive models that estimate the total delivery duration from order creation to delivery completion.
The dataset (dataset/historical_data.csv) contains historical delivery records with the following attributes:
| Feature | Description |
|---|---|
| market_id | Identifier for the market/region |
| created_at | Timestamp when the order was created |
| actual_delivery_time | Timestamp when the order was delivered |
| store_id | Unique identifier for the store |
| store_primary_category | Primary category of the store (e.g., american, mexican) |
| order_protocol | Protocol used for the order |
| total_items | Total number of items in the order |
| subtotal | Order subtotal amount |
| num_distinct_items | Number of distinct items ordered |
| min_item_price | Minimum item price in the order |
| max_item_price | Maximum item price in the order |
| total_onshift_dashers | Total dashers currently on shift |
| total_busy_dashers | Total dashers currently busy |
| total_outstanding_orders | Total outstanding orders in the area |
| estimated_order_place_duration | Estimated time to place the order |
| estimated_store_to_consumer_driving_duration | Estimated driving time from store to consumer |
The following features are engineered from the raw data:
- dasher_availability_ratio: Ratio of busy dashers to total on-shift dashers
- non_prep_duration: Combined order placement and driving duration estimates
- price_range: Difference between max and min item prices
- avg_item_price: Average price per item
- distinct_items_ratio: Ratio of distinct items to total items
- hour: Hour of order creation
- day_of_week: Day of the week (0-6)
- is_weekend: Binary indicator for weekend orders
- is_lunch_rush: Binary indicator for lunch hours (11 AM - 2 PM)
- is_dinner_rush: Binary indicator for dinner hours (5 PM - 8 PM)
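The engineered features listed above could be derived roughly as follows. This is a minimal sketch, not the notebook's actual code: the function name `engineer_features` is hypothetical, `avg_item_price` is assumed to be `subtotal / total_items`, the rush windows are assumed to cover hours 11-13 and 17-19 (that is, up to but not including 2 PM and 8 PM), and `day_of_week` is assumed to follow the pandas convention of Monday = 0.

```python
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the engineered columns added."""
    out = df.copy()
    created = pd.to_datetime(out["created_at"])

    # Non-positive denominators become NaN so the ratios never turn into inf.
    onshift = out["total_onshift_dashers"].where(out["total_onshift_dashers"] > 0)
    items = out["total_items"].where(out["total_items"] > 0)

    out["dasher_availability_ratio"] = out["total_busy_dashers"] / onshift
    out["non_prep_duration"] = (
        out["estimated_order_place_duration"]
        + out["estimated_store_to_consumer_driving_duration"]
    )
    out["price_range"] = out["max_item_price"] - out["min_item_price"]
    out["avg_item_price"] = out["subtotal"] / items  # assumed definition
    out["distinct_items_ratio"] = out["num_distinct_items"] / items

    out["hour"] = created.dt.hour
    out["day_of_week"] = created.dt.dayofweek  # Monday=0 ... Sunday=6
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)
    out["is_lunch_rush"] = out["hour"].between(11, 13).astype(int)   # assumed window
    out["is_dinner_rush"] = out["hour"].between(17, 19).astype(int)  # assumed window
    return out
```

Returning NaN for a zero denominator leaves the decision about dropping or imputing those rows to a later cleaning step.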
Running the notebook requires:
- Python 3.8+
- pip

Install the dependencies:

    pip install pandas numpy matplotlib seaborn scikit-learn xgboost lightgbm scipy

- Clone the repository:

      git clone https://github.com/AR10129/Delivery-Duration-Prediction.git
      cd Delivery-Duration-Prediction

- Open and run notebook.ipynb in Jupyter Notebook or VS Code with the Jupyter extension.

The notebook will:
- Load and preprocess the data
- Engineer features
- Train multiple models
- Evaluate and compare model performance
- Perform hyperparameter tuning on the best model
The preprocessing pipeline applies the following steps:
- Data Cleaning: Filter out invalid delivery durations (negative or exceeding 2 hours)
- Feature Validation: Remove records with inconsistent values
- Log Transformation: Apply log transformation to skewed numerical features
- Encoding: One-hot encode categorical variables (market_id, order_protocol, store_category)
- Train-Test Split: Chronological 80-20 split to preserve temporal ordering
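The transformation, encoding, and splitting steps above might look like the sketch below. The function name `preprocess` and its parameters are hypothetical, `np.log1p` is assumed as the log transform (it tolerates zeros but expects non-negative inputs), and the list of skewed columns is left to the caller because the README does not name them.

```python
import numpy as np
import pandas as pd


def preprocess(df, skewed_cols, categorical_cols, time_col="created_at", test_frac=0.2):
    """Log-transform, one-hot encode, then split chronologically."""
    # Sort by order creation time so the split respects temporal ordering.
    out = df.sort_values(time_col).reset_index(drop=True)

    # log1p(x) = log(1 + x); assumes the skewed columns are non-negative.
    for col in skewed_cols:
        out[col] = np.log1p(out[col])

    # One indicator column per category value.
    out = pd.get_dummies(out, columns=categorical_cols, dtype=int)

    # Earliest orders go to training, the most recent test_frac to testing.
    n_test = int(round(len(out) * test_frac))
    split_at = len(out) - n_test
    return out.iloc[:split_at], out.iloc[split_at:]
```

A chronological split avoids evaluating on orders that precede the training data, which a random split would allow.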
The target variable is actual_total_delivery_duration, calculated as the difference in seconds between actual_delivery_time and created_at.
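As an illustration, the target and the duration filter from the data cleaning step could be computed as in this sketch. The function name `add_target` and the `max_seconds` parameter are hypothetical, and treating a duration of exactly zero as valid is an assumption.

```python
import pandas as pd


def add_target(df, max_seconds=2 * 60 * 60):
    """Compute delivery duration in seconds and drop invalid rows."""
    out = df.copy()
    created = pd.to_datetime(out["created_at"])
    delivered = pd.to_datetime(out["actual_delivery_time"])
    out["actual_total_delivery_duration"] = (delivered - created).dt.total_seconds()

    # Keep durations between 0 and 2 hours. Rows with a missing timestamp
    # produce NaN, which fails the check and is dropped as well.
    valid = out["actual_total_delivery_duration"].between(0, max_seconds)
    return out[valid].reset_index(drop=True)
```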
The following regression models are implemented and compared:
| Model | Description |
|---|---|
| Linear Regression | Baseline linear model |
| Ridge Regression | L2 regularized linear regression |
| Lasso Regression | L1 regularized linear regression |
| Random Forest | Ensemble of decision trees |
| XGBoost | Gradient boosting with regularization |
| LightGBM | Light gradient boosting machine |
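A comparison loop over these models might be organized as below. This is a sketch under assumptions: the helper names `build_models` and `compare_models` are hypothetical, all hyperparameters shown are illustrative defaults rather than the notebook's settings, and the boosting libraries are imported only on request so the linear and Random Forest baselines run without them.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score


def build_models(include_boosting=True, random_state=42):
    """Return a dict of name -> unfitted regressor."""
    models = {
        "Linear Regression": LinearRegression(),
        "Ridge Regression": Ridge(alpha=1.0),
        "Lasso Regression": Lasso(alpha=1.0),
        "Random Forest": RandomForestRegressor(n_estimators=100, random_state=random_state),
    }
    if include_boosting:
        # Imported lazily: these require the xgboost and lightgbm packages.
        from lightgbm import LGBMRegressor
        from xgboost import XGBRegressor

        models["XGBoost"] = XGBRegressor(random_state=random_state)
        models["LightGBM"] = LGBMRegressor(random_state=random_state)
    return models


def compare_models(models, X_train, y_train, X_test, y_test):
    """Fit each model; return {name: (MAE in minutes, R2)} for targets in seconds."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        mae_minutes = mean_absolute_error(y_test, pred) / 60.0
        results[name] = (mae_minutes, r2_score(y_test, pred))
    return results
```

Calling `compare_models(build_models(), X_train, y_train, X_test, y_test)` on the chronological split yields one row per model for a comparison table or bar chart.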
Randomized search cross-validation is performed on XGBoost with the following parameter space:
| Parameter | Values |
|---|---|
| n_estimators | [100, 150, 200, 250, 300] |
| learning_rate | [0.01, 0.03, 0.05, 0.07, 0.1, 0.15] |
| max_depth | [3, 5, 7, 10, 12, 15] |
| min_child_weight | [1, 2, 3, 4, 5] |
| subsample | [0.6, 0.7, 0.8, 0.9, 1.0] |
| colsample_bytree | [0.6, 0.7, 0.8, 0.9, 1.0] |
| gamma | [0, 0.1, 0.2, 0.3, 0.4] |
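The search over this parameter space could be set up as in the following sketch. The parameter values come from the list above; everything else is an assumption made for illustration: the helper name `make_search`, the number of sampled candidates (`n_iter=20`), the scoring metric, and the use of `TimeSeriesSplit` for cross-validation folds (chosen here only because it is consistent with a chronological train-test split).

```python
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV, TimeSeriesSplit

param_distributions = {
    "n_estimators": [100, 150, 200, 250, 300],
    "learning_rate": [0.01, 0.03, 0.05, 0.07, 0.1, 0.15],
    "max_depth": [3, 5, 7, 10, 12, 15],
    "min_child_weight": [1, 2, 3, 4, 5],
    "subsample": [0.6, 0.7, 0.8, 0.9, 1.0],
    "colsample_bytree": [0.6, 0.7, 0.8, 0.9, 1.0],
    "gamma": [0, 0.1, 0.2, 0.3, 0.4],
}

# Number of distinct combinations an exhaustive grid search would have to try.
n_combinations = len(ParameterGrid(param_distributions))


def make_search(n_iter=20, random_state=42):
    """Build an unfitted randomized search over the XGBoost parameter space."""
    from xgboost import XGBRegressor  # imported lazily; requires xgboost

    return RandomizedSearchCV(
        estimator=XGBRegressor(random_state=random_state),
        param_distributions=param_distributions,
        n_iter=n_iter,                       # assumed budget
        scoring="neg_mean_absolute_error",   # assumed metric
        cv=TimeSeriesSplit(n_splits=3),      # assumed CV scheme
        random_state=random_state,
        n_jobs=-1,
    )
```

After `search = make_search()` and `search.fit(X_train, y_train)`, the selected configuration is available as `search.best_params_`. The full grid contains 112,500 combinations, which is why random sampling is the practical choice here.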
- MAE (Mean Absolute Error): Average absolute difference between predicted and actual values
- RMSE (Root Mean Squared Error): Square root of average squared differences
- R-squared (R2): Proportion of variance explained by the model
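For reference, the three metrics can be written out directly with NumPy. This is a sketch with a hypothetical function name, `regression_metrics`; scikit-learn's `mean_absolute_error` and `r2_score` compute the same quantities.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))

    # R2 = 1 - (residual sum of squares / total sum of squares).
    # Undefined when y_true is constant, since the denominator is zero.
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}
```

With the target in seconds, dividing `mae` and `rmse` by 60 gives the values in minutes.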
Models are compared using MAE (in minutes) and R2 score. Visualizations include:
- Model comparison bar charts
- Predicted vs Actual scatter plots
- Residual analysis plots
- Feature importance rankings
- Q-Q plots for residual normality assessment
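The data behind a residual Q-Q plot can be obtained with `scipy.stats.probplot`, as in this sketch. The function name `qq_summary` is hypothetical, and comparing residuals against a normal distribution is the assumed diagnostic.

```python
import numpy as np
from scipy import stats


def qq_summary(y_true, y_pred):
    """Return theoretical quantiles, sorted residuals, and the Q-Q correlation."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r)).
    (theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
        residuals, dist="norm"
    )
    return theoretical_q, ordered_resid, r
```

Plotting `ordered_resid` against `theoretical_q` gives the Q-Q plot; points that bend away from a straight line in the tails indicate residuals with heavier or lighter tails than a normal distribution, and `r` close to 1 indicates a close fit.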
Delivery-Duration-Prediction/
├── README.MD
├── notebook.ipynb
└── dataset/
    └── historical_data.csv