Cricket Warehouse

An ELT pipeline to build data warehouse for ball-by-ball cricket match data, designed for analytics and modeling.

The project ingests raw match JSON from Cricsheet, normalizes the data into relational tables in PostgreSQL, and builds analytical models using dbt.

The pipeline follows an ELT architecture, where raw data is first loaded into the warehouse and transformations are performed using dbt.

Overview

Cricsheet data is nested and complex. Each match file contains hierarchical JSON describing:

match info
- match dates and format
- teams and players
- outcomes and events
innings
- deliveries

This project builds a reproducible data pipeline that transforms this raw data into a structured warehouse suitable for querying and analytics.

Pipeline stages:

Fetch raw match data from Cricsheet.
Ingest JSON files into normalized source tables
Track ingestion state using file hashes for incremental updates
Transform and model data using dbt
Expose analytical tables for queries and downstream models

Installation

Prerequisites

Python 3.10+
PostgreSQL 14+
Git
Docker (optional)

There are two ways to run the pipeline: locally or in a docker container, orchestrated with Airflow.

Running Locally

1. Clone the repository

git clone https://github.com/shsiddhant/cricket-warehouse.git
cd cricket-warehouse

2. Create and activate a virtual environment

Using `uv`

uv venv .venv --seed
source .venv/bin/activate
uv sync

Using `pip`

python -m venv .venv
source .venv/bin/activate
pip install .

3. Use CLI to run the pipeline manually.

See below on how to use the CLI to run the pipeline.

Run in Airflow

The full pipeline can be run using Docker Compose.

1. Clone the repository

git clone https://github.com/shsiddhant/cricket-warehouse.git
cd cricket-warehouse

2. Start services

docker compose up --build

This will start:

PostgreSQL (Data Warehouse)
Airflow Scheduler
Airflow Webserver (UI)

2. Access Airflow UI

You can open the UI at localhost:8080.

Default credentials:

username: admin
password: admin

You can set credentials in a .env file. Look at the .env.example provided in the repo for variable names.

3. Add a Connection

Before you can run the DAG, you must setup a postgres connection with credentials present in the .env files.

4. Run the DAG

Now you can run the Airflow DAG.

Edit the Cricsheet URL as per your needs, and start the DAG. Note that you must use the correct connection id as set in the previous step.

5. Access the Warehouse

You can access the warehouse for queries by running the psql in the postgres container:

docker exec -it cricket-warehouse-postgres-1 psql -U admin -d cricket_warehouse

Data Source

Cricsheet

Source: https://cricsheet.org/

Cricsheet provides structured cricket match data across multiple formats and leagues.

Raw dataset characteristics:

Format: JSON
Granularity: Ball-by-ball
File structure: One file per match

Architecture

flowchart LR

subgraph Airflow DAG
    A[Cricsheet] -->|Fetch and Extract JSON| B[Local Cache]
    B -->|Batch Ingestion| C[Source Tables: PostgreSQL]

C -->|dbt| D[Staging Layer: Normalized JSONB]

D -->|dbt| E[Intermediate Layer: Relational Tables]

E -->|dbt| F[Marts: Analytics Tables]

end

The entire pipeline, including ingestion and transformation, is orchestrated within an Airflow DAG.

Key design decisions:

JSON ingestion is partially flattened during loading.
Ingestion is incremental, tracked using file hashes.
Transformations are implemented as dbt models
venue metadata is managed via seed tables.

Orchestration (Airflow)

The entire ELT pipeline is orchestrated using Apache Airflow running in Docker.

Airflow runs with a LocalExecutor for parallel task execution
The pipeline is implemented as a DAG (cricket_etl) covering both ingestion and transformation stages
Tasks include:
- Fetching and extracting JSON data from Cricsheet
- Batch ingestion into PostgreSQL source tables
- Running dbt transformations within the warehouse

dbt transformations are executed using Astronomer Cosmos, which integrates dbt directly into Airflow DAGs.

The DAG is structured into task groups aligned with dbt layers:

Staging
Intermediate
Marts

This provides clear visibility into pipeline stages and dependencies within the Airflow UI.

The pipeline is fully automated end-to-end, requiring no manual dbt execution.

Database Model

The warehouse stores cricket match data in normalized relational tables.

Core models

erDiagram
    int_matches {
        integer match_id PK
        text venue_id FK
        date start_date
        text format
        text event_name
        text winner
        text player_of_match
    }

    int_deliveries {
        integer match_id FK
        integer innings_number
        text team
        integer over_number
        integer ball_in_over
        integer runs
        text batter
        text bowler
        text player_out
    }

    int_innings {
        integer match_id FK
        integer innings_number
        text team
        bigint runs_scored
        bigint wickets_lost
    }

    int_match_players {
        text match_player_id PK
        integer match_id FK
        text player_id
        text player_name
        text team
    }

    int_match_teams {
        text match_team_id PK
        integer match_id FK
        text team
        text opponent
        boolean won_match
    }

    int_players {
        text player_id PK
        text team_id FK
        text player_name
        text team
    }

    int_teams {
        text team_id PK
        text team
        text format
    }

    %% Core Relationships
    int_matches ||--o{ int_deliveries : "recorded in"
    int_matches ||--o{ int_innings : "summarizes"
    int_matches ||--o{ int_match_players : "participated by"
    int_matches ||--o{ int_match_teams : "contested by"
    int_teams ||--o{ int_players : "belongs to"

Layers

There are three layers of dbt DAG:

Staging Models
- stg_cricsheet__match_info: Stage match info JSONB into normalized table.
- stg_cricsheet__deliveries: Stage deliveries JSONB into normalized table.
Intermediate Models
- int_venues: Each row represents a match venue.
- int_matches: Each row represents a match, with columns such as
- int_deliveries: Each row represents a unique match delivery.
- int_innings: Each row represents a match innings, with columns such as
- int_teams: Each row represents a unique (team, format) pair.
- int_match_teams: Junction table for represent many to many relationship between matches and teams.
- int_players: Each row represents a unique (player, team, format) tuple.
- int_match_players: Junction table for represent many to many relationship between matches and players.
Marts
- fct_batting_order: Batting order of each match innings.
- fct_dismissed_players: Dismissed players in each match innings.
- fct_deliveries_sequence: Sequence of deliveries in each innings.
- fct_batting_scorecard: Batting scorecard of each innings.
- fct_bowling_scorecard: Bowling scorecard of each innings.

Quick Test Dataset

If you want to quickly test the pipeline, you can ingest a small subset of matches from Cricsheet tournament archives.

Examples:

Run the pipeline with those URLs to be able to run example queries below.

Example Analytical Queries

Once the warehouse is built, you can use analytical marts to answer many questions such as:

Top Runs Scorers - ICC Women's World Cup 2025

WITH stats AS (

SELECT

    COUNT(*) AS innings,
    bs.player_name AS batter,
    SUM(bs.runs) AS runs,
    SUM(bs.balls) AS balls,
    SUM(CASE WHEN bs.is_dismissed THEN 1 ELSE 0 END) AS dismissals

FROM fct_batting_scorecard bs
JOIN int_matches m USING (match_id)
WHERE
    m.event_name = 'ICC Women''s World Cup' AND
    EXTRACT( YEAR FROM m.start_date) = 2025
GROUP BY bs.player_name
)

SELECT

    batter,
    innings,
    innings - dismissals AS not_outs,
    runs,
    ROUND(runs / NULLIF(dismissals, 0), 2) AS average,
    ROUND(100.0 * runs / NULLIF(balls, 0), 2) AS strike_rate

FROM stats
ORDER BY runs DESC
LIMIT 10

Output:

     batter      | innings | not_outs | runs | average | strike_rate
-----------------+---------+----------+------+---------+-------------
 L Wolvaardt     |       9 |        1 |  571 |   71.38 |       98.79
 S Mandhana      |       9 |        1 |  434 |   54.25 |       99.09
 A Gardner       |       5 |        1 |  328 |   82.00 |      130.16
 Pratika Rawal   |       6 |        0 |  308 |   51.33 |       77.78
 P Litchfield    |       7 |        1 |  304 |   50.67 |      112.18
 AJ Healy        |       5 |        1 |  299 |   74.75 |      125.10
 JI Rodrigues    |       7 |        2 |  292 |   58.40 |      101.04
 SFM Devine      |       5 |        0 |  289 |   57.80 |       85.25
 HC Knight       |       7 |        1 |  288 |   48.00 |       85.71
 NR Sciver-Brunt |       6 |        0 |  262 |   43.67 |       85.34
(10 rows)

Top Bowlers by Dot Balls Bowled - IPL 2025

WITH stats AS (

SELECT

    COUNT(*) AS innings,
    bs.bowler,
    SUM(bs.runs) AS runs,
    SUM(bs.balls) AS balls,
    SUM(bs.wickets) AS wickets,
    SUM(bs.dots) AS dots

FROM fct_bowling_scorecard bs
JOIN int_matches m USING (match_id)
WHERE
    m.event_name = 'Indian Premier League' AND
    EXTRACT( YEAR FROM m.start_date) = 2025
GROUP BY bs.bowler
)

SELECT

    bowler,
    innings,
    wickets,
    DIV(balls, 6)::text || '.' || MOD(balls, 6) AS overs,
    ROUND(runs / NULLIF(wickets, 0), 2) AS average,
    ROUND(6 * runs / NULLIF(balls, 0), 2) AS economy,
    dots,
    ROUND(100 * dots / NULLIF(balls, 0), 2) AS dot_ball_pct

FROM stats
ORDER BY dots DESC, dot_ball_pct DESC
LIMIT 10;

Output

      bowler       | innings | wickets | overs | average | economy | dots | dot_ball_pct
-------------------+---------+---------+-------+---------+---------+------+--------------
 Mohammed Siraj    |      15 |      16 | 57.0  |   32.94 |    9.25 |  151 |        44.15
 M Prasidh Krishna |      15 |      25 | 59.0  |   19.52 |    8.27 |  146 |        41.24
 KK Ahmed          |      14 |      15 | 46.4  |   29.80 |    9.58 |  137 |        48.93
 Arshdeep Singh    |      16 |      21 | 58.2  |   24.67 |    8.88 |  137 |        39.14
 JJ Bumrah         |      12 |      18 | 47.2  |   17.56 |    6.68 |  128 |        45.07
 TA Boult          |      16 |      22 | 57.4  |   23.50 |    8.97 |  127 |        36.71
 B Kumar           |      14 |      17 | 52.0  |   28.41 |    9.29 |  123 |        39.42
 JR Hazlewood      |      12 |      22 | 44.0  |   17.55 |    8.77 |  120 |        45.45
 PJ Cummins        |      14 |      16 | 49.4  |   28.13 |    9.06 |  118 |        39.60
 CV Varun          |      13 |      17 | 50.0  |   22.53 |    7.66 |  117 |        39.00
(10 rows)

CLI

The project includes a CLI for managing the ingestion pipeline.

cricwh --help
Usage: cricwh [OPTIONS] COMMAND [ARGS]...                                                                                                                                    
╭─ Options ───────────────────────────────────────────────────────────────╮
│ --install-completion       Install completion for the current shell. │
│ --show-completion          Show completion for the current shell, to │
│                            copy it or customize the installation.    │
│ --help                     Show this message and exit.               │
╰──────────────────────────────────────────────────────────────────────────╯
╭─ Commands───────────────────────────────────────────────────────────────╮
│ fetch      Fetch data from Cricsheet.                                │
│ configure  Configure cricket-warehouse.                              │
│ init       Initialize source tables and seeds.                       │
│ ingest     Ingest JSON files into source tables.                     │
│ update     Update venue city seed.                                   │
╰──────────────────────────────────────────────────────────────────────────╯

Configuration

A config file is provided to manage PostgreSQL database credentials. On first run, cricwh initializes an example config. The config file may be found at:

Operating System	Location
Linux/Unix	`~/.config/cricketwarehouse/config.yaml`
macOS	`~/Library/Preferences/cricketwarehouse/config.yaml`
Windows	`C:\Users\<username>\AppData\Local\cricketwarehouse\cricketwarehouse/config.yaml`

You can edit the configuration using the configure command:

cricwh configure [--init-config-file]

You can reset the config file using the --init-config-file flag in the configure command.

Logs

Detailed logs are written during each command. The log file may be found at:

Operating System	Location
Linux/Unix	`~/.local/share/cricketwarehouse/cricwh.log`
macOS	`~/Library/Application Support/cricketwarehouse/cricwh.log`
Windows	`C:\Users\<username>\AppData\Local\cricketwarehouse\cricketwarehouse/cricwh.log`

CLI Workflow

Assuming you've configured your database, a typical workflow goes as follows:

Initialize source tables (only on first run)
```
cricwh init
```
Fetch and extract raw match data.
```
cricwh fetch [URL] [ZIP FILE PATH]
```
Ingest match data into source tables.
```
cricwh ingest
```

Run dbt models

dbt deps --project-dir dbt/
dbt build --project-dir dbt/

Tools and Libraries

Tool	Purpose
Python	Data ingestion and CLI tooling
Apache Airflow	Pipeline Orchestration
dbt	Data modeling and transformations
PostgreSQL	Data warehouse
Astronomer Cosmos	ELT Pipeline Orchestration
psycopg2	PostgreSQL database interface

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
assets		assets
dags		dags
dbt		dbt
src/cricketwarehouse		src/cricketwarehouse
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
docker-compose.yaml		docker-compose.yaml
init-db.sh		init-db.sh
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
ruff.toml		ruff.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

Cricket Warehouse

Overview

Installation

Prerequisites

Running Locally

1. Clone the repository

2. Create and activate a virtual environment

Using uv

Using pip

3. Use CLI to run the pipeline manually.

Run in Airflow

1. Clone the repository

2. Start services

2. Access Airflow UI

3. Add a Connection

4. Run the DAG

5. Access the Warehouse

Data Source

Cricsheet

Architecture

Orchestration (Airflow)

Database Model

Core models

Layers

Quick Test Dataset

Example Analytical Queries

Top Runs Scorers - ICC Women's World Cup 2025

Top Bowlers by Dot Balls Bowled - IPL 2025

CLI

Configuration

Logs

CLI Workflow

Tools and Libraries

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Using `uv`

Using `pip`

Packages