Delivery Duration Prediction

A machine learning project that predicts food delivery duration using historical order data. This project implements and compares multiple regression models to estimate delivery times based on various features such as store characteristics, order details, and dasher availability.

Overview

Accurate delivery time prediction is critical for food delivery platforms to set customer expectations and optimize logistics. This project analyzes historical delivery data to build predictive models that estimate the total delivery duration from order creation to delivery completion.

Dataset

The dataset (dataset/historical_data.csv) contains historical delivery records with the following attributes:

| Feature | Description |
| --- | --- |
| market_id | Identifier for the market/region |
| created_at | Timestamp when the order was created |
| actual_delivery_time | Timestamp when the order was delivered |
| store_id | Unique identifier for the store |
| store_primary_category | Primary category of the store (e.g., american, mexican) |
| order_protocol | Protocol used for the order |
| total_items | Total number of items in the order |
| subtotal | Order subtotal amount |
| num_distinct_items | Number of distinct items ordered |
| min_item_price | Minimum item price in the order |
| max_item_price | Maximum item price in the order |
| total_onshift_dashers | Total dashers currently on shift |
| total_busy_dashers | Total dashers currently busy |
| total_outstanding_orders | Total outstanding orders in the area |
| estimated_order_place_duration | Estimated time to place the order |
| estimated_store_to_consumer_driving_duration | Estimated driving time from store to consumer |

Features

Engineered Features

The following features are engineered from the raw data:

  • dasher_availability_ratio: Ratio of busy dashers to total on-shift dashers
  • non_prep_duration: Combined order placement and driving duration estimates
  • price_range: Difference between max and min item prices
  • avg_item_price: Average price per item
  • distinct_items_ratio: Ratio of distinct items to total items
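
The engineered features above could be derived roughly as follows. This is a minimal sketch, not the notebook's actual code: the function name `add_engineered_features` is hypothetical, `avg_item_price` is assumed to be `subtotal / total_items`, and treating zero denominators as missing values is also an assumption.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the five engineered columns added."""
    out = df.copy()

    # Assumption: a zero denominator is treated as missing to avoid division by zero.
    onshift = out["total_onshift_dashers"].replace(0, np.nan)
    items = out["total_items"].replace(0, np.nan)

    # Busy dashers relative to dashers on shift.
    out["dasher_availability_ratio"] = out["total_busy_dashers"] / onshift
    # Order placement estimate plus driving estimate.
    out["non_prep_duration"] = (
        out["estimated_order_place_duration"]
        + out["estimated_store_to_consumer_driving_duration"]
    )
    out["price_range"] = out["max_item_price"] - out["min_item_price"]
    # Assumption: "average price per item" means subtotal divided by item count.
    out["avg_item_price"] = out["subtotal"] / items
    out["distinct_items_ratio"] = out["num_distinct_items"] / items
    return out


# Two made-up orders; the second has no dashers on shift.
sample = pd.DataFrame({
    "total_onshift_dashers": [10, 0],
    "total_busy_dashers": [5, 0],
    "estimated_order_place_duration": [446, 251],
    "estimated_store_to_consumer_driving_duration": [861, 690],
    "max_item_price": [1239, 900],
    "min_item_price": [557, 900],
    "subtotal": [3441, 1900],
    "total_items": [4, 1],
    "num_distinct_items": [4, 1],
})
features = add_engineered_features(sample)
```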

Temporal Features

  • hour: Hour of order creation
  • day_of_week: Day of the week (0-6)
  • is_weekend: Binary indicator for weekend orders
  • is_lunch_rush: Binary indicator for lunch hours (11 AM - 2 PM)
  • is_dinner_rush: Binary indicator for dinner hours (5 PM - 8 PM)
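
A sketch of how these temporal features might be derived from created_at with pandas. The function name `add_temporal_features` is hypothetical, and two details are assumptions: the rush windows are treated as half-open hour ranges (11:00 to 13:59 and 17:00 to 19:59), and the timestamps are used as stored, without any timezone conversion.

```python
import pandas as pd


def add_temporal_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with hour, weekday, and rush-hour indicators."""
    out = df.copy()
    created = pd.to_datetime(out["created_at"])

    out["hour"] = created.dt.hour
    out["day_of_week"] = created.dt.dayofweek  # Monday=0 ... Sunday=6
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)
    # Assumption: "11 AM - 2 PM" covers hours 11, 12, and 13.
    out["is_lunch_rush"] = out["hour"].between(11, 13).astype(int)
    # Assumption: "5 PM - 8 PM" covers hours 17, 18, and 19.
    out["is_dinner_rush"] = out["hour"].between(17, 19).astype(int)
    return out


# Made-up timestamps: a Friday lunch, a Saturday dinner, a Sunday morning.
orders = pd.DataFrame({
    "created_at": [
        "2015-02-06 12:30:00",
        "2015-02-07 18:05:00",
        "2015-02-08 09:00:00",
    ]
})
temporal = add_temporal_features(orders)
```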

Installation

Prerequisites

  • Python 3.8+
  • pip

Dependencies

pip install pandas numpy matplotlib seaborn scikit-learn xgboost lightgbm scipy

Usage

  1. Clone the repository:
git clone https://github.com/AR10129/Delivery-Duration-Prediction.git
cd Delivery-Duration-Prediction
  2. Open and run notebook.ipynb in Jupyter Notebook or VS Code with the Jupyter extension.

  3. The notebook will:

    • Load and preprocess the data
    • Engineer features
    • Train multiple models
    • Evaluate and compare model performance
    • Perform hyperparameter tuning on the best model

Methodology

Data Preprocessing

  1. Data Cleaning: Filter out invalid delivery durations (negative or exceeding 2 hours)
  2. Feature Validation: Remove records with inconsistent values
  3. Log Transformation: Apply log transformation to skewed numerical features
  4. Encoding: One-hot encode categorical variables (market_id, order_protocol, store_primary_category)
  5. Train-Test Split: Chronological 80-20 split to preserve temporal ordering
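
Steps 1, 3, 4, and 5 can be sketched as below. This is an illustration under assumptions rather than the notebook's code: the helper names `preprocess` and `chronological_split` are hypothetical, `np.log1p` stands in for the unspecified log transformation, and step 2 is omitted because the README does not say which inconsistencies are checked.

```python
import numpy as np
import pandas as pd

MAX_DURATION_SECONDS = 2 * 60 * 60  # two-hour cutoff from step 1


def preprocess(df, skewed_cols, categorical_cols):
    """Filter invalid durations, log-transform, and one-hot encode."""
    # Step 1: drop negative durations and durations above two hours.
    valid = df["actual_total_delivery_duration"].between(0, MAX_DURATION_SECONDS)
    out = df[valid].copy()
    # Step 3: log1p (an assumed choice) keeps zero values finite.
    for col in skewed_cols:
        out[col] = np.log1p(out[col])
    # Step 4: one-hot encode the categorical columns.
    return pd.get_dummies(out, columns=categorical_cols)


def chronological_split(df, time_col="created_at", train_fraction=0.8):
    """Step 5: the earliest 80% of orders train, the latest 20% test."""
    ordered = df.sort_values(time_col)
    cutoff = int(len(ordered) * train_fraction)
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Twelve made-up orders; two have invalid durations (-60 s and 9000 s).
raw = pd.DataFrame({
    "created_at": pd.date_range("2015-02-01", periods=12, freq="D"),
    "market_id": [1, 2, 3] * 4,
    "subtotal": [1000 + 100 * i for i in range(12)],
    "actual_total_delivery_duration":
        [-60, 1800, 2400, 9000] + [2000 + 100 * i for i in range(8)],
})
clean = preprocess(raw, skewed_cols=["subtotal"], categorical_cols=["market_id"])
train, test = chronological_split(clean)
```

Splitting by time rather than at random means every test order was created after every training order, which mirrors how the model would be used on future orders.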

Target Variable

The target variable is actual_total_delivery_duration, calculated as the difference in seconds between actual_delivery_time and created_at.
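
The calculation above can be sketched with pandas as follows. The function name `add_delivery_duration` is hypothetical, and the single sample row is made up for illustration.

```python
import pandas as pd


def add_delivery_duration(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the target column, in seconds."""
    out = df.copy()
    created = pd.to_datetime(out["created_at"])
    delivered = pd.to_datetime(out["actual_delivery_time"])
    # Timedelta between delivery and creation, converted to seconds.
    out["actual_total_delivery_duration"] = (delivered - created).dt.total_seconds()
    return out


# One made-up order delivered 1 h 2 min 59 s after it was created.
orders = pd.DataFrame({
    "created_at": ["2015-02-06 22:24:17"],
    "actual_delivery_time": ["2015-02-06 23:27:16"],
})
with_target = add_delivery_duration(orders)
duration_seconds = with_target.loc[0, "actual_total_delivery_duration"]
```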

Models

The following regression models are implemented and compared:

| Model | Description |
| --- | --- |
| Linear Regression | Baseline linear model |
| Ridge Regression | L2 regularized linear regression |
| Lasso Regression | L1 regularized linear regression |
| Random Forest | Ensemble of decision trees |
| XGBoost | Gradient boosting with regularization |
| LightGBM | Light gradient boosting machine |
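
A comparison loop over such models might look like the sketch below. It uses synthetic data and only the four scikit-learn models so that it runs without extra packages. The hyperparameter values shown are placeholders, not the notebook's settings. `xgboost.XGBRegressor` and `lightgbm.LGBMRegressor` expose the same `fit`/`predict` interface and could be added to the dictionary in the same way.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the real features; the target is in seconds.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1800 + 300 * X[:, 0] - 120 * X[:, 1] + rng.normal(scale=30, size=200)

# First 80% of rows train, last 20% test (rows play the role of time order).
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# Placeholder hyperparameters, chosen only for illustration.
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

mae_minutes = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # Convert the error from seconds to minutes for reporting.
    mae_minutes[name] = mean_absolute_error(y_test, predictions) / 60
```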

Hyperparameter Tuning

Randomized search cross-validation is performed on XGBoost with the following parameter space:

  • n_estimators: [100, 150, 200, 250, 300]
  • learning_rate: [0.01, 0.03, 0.05, 0.07, 0.1, 0.15]
  • max_depth: [3, 5, 7, 10, 12, 15]
  • min_child_weight: [1, 2, 3, 4, 5]
  • subsample: [0.6, 0.7, 0.8, 0.9, 1.0]
  • colsample_bytree: [0.6, 0.7, 0.8, 0.9, 1.0]
  • gamma: [0, 0.1, 0.2, 0.3, 0.4]
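
To illustrate what randomized search does with this space, the sketch below draws a few candidate configurations using scikit-learn's `ParameterSampler`, the same sampler that `RandomizedSearchCV` uses internally. The number of draws (`n_iter=5`) and the seed are assumptions chosen for the example, not values taken from the notebook.

```python
import math

from sklearn.model_selection import ParameterSampler

# Parameter space as listed above.
param_space = {
    "n_estimators": [100, 150, 200, 250, 300],
    "learning_rate": [0.01, 0.03, 0.05, 0.07, 0.1, 0.15],
    "max_depth": [3, 5, 7, 10, 12, 15],
    "min_child_weight": [1, 2, 3, 4, 5],
    "subsample": [0.6, 0.7, 0.8, 0.9, 1.0],
    "colsample_bytree": [0.6, 0.7, 0.8, 0.9, 1.0],
    "gamma": [0, 0.1, 0.2, 0.3, 0.4],
}

# An exhaustive grid would need this many fits per cross-validation fold.
grid_size = math.prod(len(values) for values in param_space.values())

# Randomized search evaluates only a small random subset of the grid.
# n_iter and random_state are illustrative assumptions.
candidates = list(ParameterSampler(param_space, n_iter=5, random_state=42))

for candidate in candidates:
    print(candidate)
```

The full grid contains 112,500 combinations, which is why sampling a fixed number of candidates is far cheaper than a grid search. In a real run, the same dictionary would be passed as `param_distributions` to `RandomizedSearchCV` together with an XGBoost regressor.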

Results

Evaluation Metrics

  • MAE (Mean Absolute Error): Average absolute difference between predicted and actual values
  • RMSE (Root Mean Squared Error): Square root of average squared differences
  • R-squared (R2): Proportion of variance explained by the model
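
The three metrics can be computed directly from their definitions. The sketch below uses NumPy only. The function name `regression_metrics` and the sample values are made up for illustration.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R-squared for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    # R-squared: 1 minus residual sum of squares over total sum of squares.
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Made-up durations in seconds: errors are -60, 120, and 0.
metrics = regression_metrics([1800, 2400, 3000], [1860, 2280, 3000])
mae_in_minutes = metrics["mae"] / 60
```

Because RMSE squares each error before averaging, the single 120-second miss pulls RMSE (about 77 seconds) above MAE (60 seconds) in this example.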

Model Comparison

Models are compared using MAE (in minutes) and R2 score. Visualizations include:

  • Model comparison bar charts
  • Predicted vs Actual scatter plots
  • Residual analysis plots
  • Feature importance rankings
  • Q-Q plots for residual normality assessment

Project Structure

Delivery-Duration-Prediction/
├── README.MD
├── notebook.ipynb
└── dataset/
    └── historical_data.csv
