Delivery Duration Prediction

A machine learning project that predicts food delivery duration using historical order data. This project implements and compares multiple regression models to estimate delivery times based on various features such as store characteristics, order details, and dasher availability.

Overview

Accurate delivery time prediction is critical for food delivery platforms to set customer expectations and optimize logistics. This project analyzes historical delivery data to build predictive models that estimate the total delivery duration from order creation to delivery completion.

Dataset

The dataset (dataset/historical_data.csv) contains historical delivery records with the following attributes:

| Feature | Description |
| --- | --- |
| `market_id` | Identifier for the market/region |
| `created_at` | Timestamp when the order was created |
| `actual_delivery_time` | Timestamp when the order was delivered |
| `store_id` | Unique identifier for the store |
| `store_primary_category` | Primary category of the store (e.g., american, mexican) |
| `order_protocol` | Protocol used for the order |
| `total_items` | Total number of items in the order |
| `subtotal` | Order subtotal amount |
| `num_distinct_items` | Number of distinct items ordered |
| `min_item_price` | Minimum item price in the order |
| `max_item_price` | Maximum item price in the order |
| `total_onshift_dashers` | Total dashers currently on shift |
| `total_busy_dashers` | Total dashers currently busy |
| `total_outstanding_orders` | Total outstanding orders in the area |
| `estimated_order_place_duration` | Estimated time to place the order |
| `estimated_store_to_consumer_driving_duration` | Estimated driving time from store to consumer |

Features

Engineered Features

The following features are engineered from the raw data:

dasher_availability_ratio: Ratio of busy dashers to total on-shift dashers
non_prep_duration: Combined order placement and driving duration estimates
price_range: Difference between max and min item prices
avg_item_price: Average price per item
distinct_items_ratio: Ratio of distinct items to total items
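
The engineered features above could be derived with pandas roughly as follows. This is a minimal sketch, not the notebook's code: the function name `add_engineered_features` is hypothetical, `avg_item_price` is assumed to mean `subtotal / total_items`, and zero denominators are assumed to map to NaN rather than infinity.

```python
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the five engineered columns added."""
    out = df.copy()
    # Assumption: a zero denominator yields NaN instead of inf.
    onshift = out["total_onshift_dashers"].where(out["total_onshift_dashers"] > 0)
    items = out["total_items"].where(out["total_items"] > 0)

    out["dasher_availability_ratio"] = out["total_busy_dashers"] / onshift
    out["non_prep_duration"] = (
        out["estimated_order_place_duration"]
        + out["estimated_store_to_consumer_driving_duration"]
    )
    out["price_range"] = out["max_item_price"] - out["min_item_price"]
    # Assumption: "average price per item" is subtotal divided by item count.
    out["avg_item_price"] = out["subtotal"] / items
    out["distinct_items_ratio"] = out["num_distinct_items"] / items
    return out


# Two made-up orders; the second has no dashers on shift.
sample = pd.DataFrame(
    {
        "total_onshift_dashers": [10, 0],
        "total_busy_dashers": [5, 0],
        "total_items": [4, 2],
        "subtotal": [2000, 900],
        "num_distinct_items": [2, 2],
        "min_item_price": [300, 400],
        "max_item_price": [800, 500],
        "estimated_order_place_duration": [446, 446],
        "estimated_store_to_consumer_driving_duration": [600, 300],
    }
)
features = add_engineered_features(sample)
```

Leaving the undefined ratios as NaN keeps the choice of imputation (or row removal) explicit in a later step.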

Temporal Features

hour: Hour of order creation
day_of_week: Day of the week (0-6)
is_weekend: Binary indicator for weekend orders
is_lunch_rush: Binary indicator for lunch hours (11 AM - 2 PM)
is_dinner_rush: Binary indicator for dinner hours (5 PM - 8 PM)
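
A possible way to build these temporal flags from `created_at` is sketched below. The function name is hypothetical, and two details are assumptions: timestamps are treated as already being in local time, and the rush windows are taken as half-open (hours 11 to 13 for lunch, 17 to 19 for dinner), since the exact boundaries are not specified here.

```python
import pandas as pd


def add_temporal_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with hour, weekday and rush-hour indicators."""
    out = df.copy()
    created = pd.to_datetime(out["created_at"])
    out["hour"] = created.dt.hour
    out["day_of_week"] = created.dt.dayofweek  # Monday=0 ... Sunday=6
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)
    # Assumption: 11 AM - 2 PM covers hours 11, 12, 13 (2 PM itself excluded).
    out["is_lunch_rush"] = out["hour"].between(11, 13).astype(int)
    # Assumption: 5 PM - 8 PM covers hours 17, 18, 19 (8 PM itself excluded).
    out["is_dinner_rush"] = out["hour"].between(17, 19).astype(int)
    return out


# A Friday lunch order and a Saturday dinner order (made-up timestamps).
orders = pd.DataFrame({"created_at": ["2015-02-06 12:30:00", "2015-02-07 18:05:00"]})
temporal = add_temporal_features(orders)
```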

Installation

Prerequisites

Python 3.8+
pip

Dependencies

pip install pandas numpy matplotlib seaborn scikit-learn xgboost lightgbm scipy

Usage

Clone the repository:

git clone https://github.com/AR10129/Delivery-Duration-Prediction.git
cd Delivery-Duration-Prediction

Open and run notebook.ipynb in Jupyter Notebook or VS Code with the Jupyter extension.
The notebook will:
- Load and preprocess the data
- Engineer features
- Train multiple models
- Evaluate and compare model performance
- Perform hyperparameter tuning on the best model

Methodology

Data Preprocessing

Data Cleaning: Filter out invalid delivery durations (negative or exceeding 2 hours)
Feature Validation: Remove records with inconsistent values
Log Transformation: Apply log transformation to skewed numerical features
Encoding: One-hot encode categorical variables (market_id, order_protocol, store_primary_category)
Train-Test Split: Chronological 80-20 split to preserve temporal ordering
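
The preprocessing steps above might look like the following in pandas. This is a sketch under stated assumptions: the helper names are hypothetical, the choice of which columns count as "skewed" is left to the caller, `log1p` is used as the log transform (it assumes non-negative inputs), and the feature validation step is omitted because its rules are not described here.

```python
import numpy as np
import pandas as pd

MAX_DURATION_S = 2 * 60 * 60  # the 2-hour cap from the cleaning step, in seconds


def preprocess(df, skewed_cols, categorical_cols):
    """Filter invalid durations, log-transform and one-hot encode."""
    duration = df["actual_total_delivery_duration"]
    out = df[(duration >= 0) & (duration <= MAX_DURATION_S)].copy()
    for col in skewed_cols:
        # Assumption: log1p (log(1 + x)) as the log transform.
        out[col] = np.log1p(out[col])
    return pd.get_dummies(out, columns=categorical_cols)


def chronological_split(df, time_col="created_at", train_frac=0.8):
    """Oldest train_frac of rows for training, the most recent rows for testing."""
    ordered = df.sort_values(time_col)
    cut = int(len(ordered) * train_frac)
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Made-up rows: one negative duration and one above the 2-hour cap.
raw = pd.DataFrame(
    {
        "created_at": pd.to_datetime(
            [
                "2015-02-01 10:00", "2015-02-02 11:00", "2015-02-03 12:00",
                "2015-02-04 13:00", "2015-02-05 14:00", "2015-02-06 15:00",
                "2015-02-07 16:00",
            ]
        ),
        "market_id": [1, 2, 1, 2, 1, 2, 1],
        "subtotal": [1000, 2500, 1800, 3200, 900, 4100, 1500],
        "actual_total_delivery_duration": [1800, -50, 2400, 9000, 3000, 1500, 2700],
    }
)
clean = preprocess(raw, skewed_cols=["subtotal"], categorical_cols=["market_id"])
train, test = chronological_split(clean)
```

Splitting by time rather than at random means every test order is newer than every training order, which mirrors how the model would be used on future deliveries.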

Target Variable

The target variable is actual_total_delivery_duration, calculated as the difference in seconds between actual_delivery_time and created_at.
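
As an illustration, the target could be computed like this. The function name `add_target` is hypothetical and the two rows are made up; rows with a missing delivery timestamp simply end up with a NaN target in this sketch.

```python
import pandas as pd


def add_target(df: pd.DataFrame) -> pd.DataFrame:
    """Add actual_total_delivery_duration (seconds) to a copy of df."""
    out = df.copy()
    created = pd.to_datetime(out["created_at"])
    delivered = pd.to_datetime(out["actual_delivery_time"])
    out["actual_total_delivery_duration"] = (delivered - created).dt.total_seconds()
    return out


deliveries = pd.DataFrame(
    {
        "created_at": ["2015-02-06 22:24:17", "2015-02-06 22:30:00"],
        "actual_delivery_time": ["2015-02-06 23:27:16", None],
    }
)
with_target = add_target(deliveries)
```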

Models

The following regression models are implemented and compared:

| Model | Description |
| --- | --- |
| Linear Regression | Baseline linear model |
| Ridge Regression | L2 regularized linear regression |
| Lasso Regression | L1 regularized linear regression |
| Random Forest | Ensemble of decision trees |
| XGBoost | Gradient boosting with regularization |
| LightGBM | Light gradient boosting machine |
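
A comparison loop over such models could be sketched as below. It is illustrative only: the data is synthetic (`make_regression`), the hyperparameters are placeholder assumptions, and only the scikit-learn models are instantiated. `xgboost.XGBRegressor` and `lightgbm.LGBMRegressor` follow the same fit/predict interface and could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score


def compare_models(models, X_train, y_train, X_test, y_test):
    """Fit each model and collect MAE and R2 on the held-out set."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = {
            "mae": mean_absolute_error(y_test, pred),
            "r2": r2_score(y_test, pred),
        }
    return results


# Placeholder settings, not the notebook's actual configuration.
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=0.1),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Synthetic stand-in for the delivery data; the last 20% of rows act as the test set.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
results = compare_models(models, X[:160], y[:160], X[160:], y[160:])
```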

Hyperparameter Tuning

Randomized search cross-validation is performed on XGBoost with the following parameter space:

n_estimators: [100, 150, 200, 250, 300]
learning_rate: [0.01, 0.03, 0.05, 0.07, 0.1, 0.15]
max_depth: [3, 5, 7, 10, 12, 15]
min_child_weight: [1, 2, 3, 4, 5]
subsample: [0.6, 0.7, 0.8, 0.9, 1.0]
colsample_bytree: [0.6, 0.7, 0.8, 0.9, 1.0]
gamma: [0, 0.1, 0.2, 0.3, 0.4]
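
The parameter space above contains 112,500 combinations, so sampling a subset is far cheaper than an exhaustive grid. One way to wire it into scikit-learn's `RandomizedSearchCV` is sketched here. The helper names, `n_iter=20`, the MAE scoring, and the `TimeSeriesSplit` folds are all assumptions chosen for illustration, not settings taken from the notebook.

```python
import math

from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

PARAM_SPACE = {
    "n_estimators": [100, 150, 200, 250, 300],
    "learning_rate": [0.01, 0.03, 0.05, 0.07, 0.1, 0.15],
    "max_depth": [3, 5, 7, 10, 12, 15],
    "min_child_weight": [1, 2, 3, 4, 5],
    "subsample": [0.6, 0.7, 0.8, 0.9, 1.0],
    "colsample_bytree": [0.6, 0.7, 0.8, 0.9, 1.0],
    "gamma": [0, 0.1, 0.2, 0.3, 0.4],
}

# Size of the full grid that an exhaustive search would have to cover.
n_combinations = math.prod(len(values) for values in PARAM_SPACE.values())


def make_search(estimator, n_iter=20, random_state=42):
    """Build (but do not fit) a randomized search over PARAM_SPACE."""
    return RandomizedSearchCV(
        estimator=estimator,
        param_distributions=PARAM_SPACE,
        n_iter=n_iter,  # assumption: number of sampled configurations
        scoring="neg_mean_absolute_error",  # assumption: optimise MAE
        cv=TimeSeriesSplit(n_splits=3),  # assumption: folds that respect time order
        random_state=random_state,
        n_jobs=-1,
    )


def tune_xgboost(X_train, y_train):
    """Fit the search with an XGBoost regressor; returns best params and CV MAE."""
    from xgboost import XGBRegressor  # imported lazily so the rest runs without xgboost

    search = make_search(XGBRegressor(objective="reg:squarederror"))
    search.fit(X_train, y_train)
    return search.best_params_, -search.best_score_
```

Calling `tune_xgboost(X_train, y_train)` on chronologically sorted training data would return the best sampled configuration together with its cross-validated MAE.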

Results

Evaluation Metrics

MAE (Mean Absolute Error): Average absolute difference between predicted and actual values
RMSE (Root Mean Squared Error): Square root of average squared differences
R-squared (R2): Proportion of variance explained by the model
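
For reference, the three metrics can be computed directly with NumPy as in the sketch below. The function name and the sample values are made up for illustration; `mae_minutes` assumes the predictions are in seconds, matching the target definition.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE and R2 for predictions expressed in seconds."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    ss_res = np.sum(err ** 2)  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2, "mae_minutes": mae / 60.0}


# Three hypothetical deliveries: actual vs. predicted duration in seconds.
metrics = regression_metrics([600, 1200, 1800], [660, 1140, 1800])
# MAE = 40 s, RMSE = sqrt(2400) ≈ 48.99 s, R2 = 0.99
```

RMSE is never smaller than MAE because squaring weights large errors more heavily, so a wide gap between the two suggests a few badly mispredicted deliveries.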

Model Comparison

Models are compared using MAE (reported in minutes, converted from the target's seconds) and R2 score. Visualizations include:

Model comparison bar charts
Predicted vs Actual scatter plots
Residual analysis plots
Feature importance rankings
Q-Q plots for residual normality assessment

Project Structure

Delivery-Duration-Prediction/
├── README.MD
├── notebook.ipynb
└── dataset/
    └── historical_data.csv
