An ELT pipeline to build data warehouse for ball-by-ball cricket match data, designed for analytics and modeling.
The project ingests raw match JSON from Cricsheet, normalizes the data into relational tables in PostgreSQL, and builds analytical models using dbt.
The pipeline follows an ELT architecture, where raw data is first loaded into the warehouse and transformations are performed using dbt.
Cricsheet data is nested and complex. Each match file contains hierarchical JSON describing:
- match info
- match dates and format
- teams and players
- outcomes and events
- innings
- deliveries
This project builds a reproducible data pipeline that transforms this raw data into a structured warehouse suitable for querying and analytics.
Pipeline stages:
- Fetch raw match data from Cricsheet.
- Ingest JSON files into normalized source tables
- Track ingestion state using file hashes for incremental updates
- Transform and model data using dbt
- Expose analytical tables for queries and downstream models
- Python 3.10+
- PostgreSQL 14+
- Git
- Docker (optional)
There are two ways to run the pipeline: locally or in a docker container, orchestrated with Airflow.
git clone https://github.com/shsiddhant/cricket-warehouse.git
cd cricket-warehouseuv venv .venv --seed
source .venv/bin/activate
uv syncpython -m venv .venv
source .venv/bin/activate
pip install .See below on how to use the CLI to run the pipeline.
The full pipeline can be run using Docker Compose.
git clone https://github.com/shsiddhant/cricket-warehouse.git
cd cricket-warehousedocker compose up --buildThis will start:
- PostgreSQL (Data Warehouse)
- Airflow Scheduler
- Airflow Webserver (UI)
You can open the UI at localhost:8080.
Default credentials:
- username: admin
- password: admin
You can set credentials in a .env file. Look at the .env.example provided in the repo for variable names.
Before you can run the DAG, you must setup a postgres connection with credentials present in the .env files.
Now you can run the Airflow DAG.
Edit the Cricsheet URL as per your needs, and start the DAG. Note that you must use the correct connection id as set in the previous step.
You can access the warehouse for queries by running the psql in the postgres container:
docker exec -it cricket-warehouse-postgres-1 psql -U admin -d cricket_warehouse
Source: https://cricsheet.org/
Cricsheet provides structured cricket match data across multiple formats and leagues.
Raw dataset characteristics:
- Format: JSON
- Granularity: Ball-by-ball
- File structure: One file per match
flowchart LR
subgraph Airflow DAG
A[Cricsheet] -->|Fetch and Extract JSON| B[Local Cache]
B -->|Batch Ingestion| C[Source Tables: PostgreSQL]
C -->|dbt| D[Staging Layer: Normalized JSONB]
D -->|dbt| E[Intermediate Layer: Relational Tables]
E -->|dbt| F[Marts: Analytics Tables]
end
The entire pipeline, including ingestion and transformation, is orchestrated within an Airflow DAG.
Key design decisions:
- JSON ingestion is partially flattened during loading.
- Ingestion is incremental, tracked using file hashes.
- Transformations are implemented as dbt models
- venue metadata is managed via seed tables.
The entire ELT pipeline is orchestrated using Apache Airflow running in Docker.
- Airflow runs with a
LocalExecutorfor parallel task execution - The pipeline is implemented as a DAG (
cricket_etl) covering both ingestion and transformation stages - Tasks include:
- Fetching and extracting JSON data from Cricsheet
- Batch ingestion into PostgreSQL source tables
- Running dbt transformations within the warehouse
dbt transformations are executed using Astronomer Cosmos, which integrates dbt directly into Airflow DAGs.
The DAG is structured into task groups aligned with dbt layers:
- Staging
- Intermediate
- Marts
This provides clear visibility into pipeline stages and dependencies within the Airflow UI.
The pipeline is fully automated end-to-end, requiring no manual dbt execution.
The warehouse stores cricket match data in normalized relational tables.
erDiagram
int_matches {
integer match_id PK
text venue_id FK
date start_date
text format
text event_name
text winner
text player_of_match
}
int_deliveries {
integer match_id FK
integer innings_number
text team
integer over_number
integer ball_in_over
integer runs
text batter
text bowler
text player_out
}
int_innings {
integer match_id FK
integer innings_number
text team
bigint runs_scored
bigint wickets_lost
}
int_match_players {
text match_player_id PK
integer match_id FK
text player_id
text player_name
text team
}
int_match_teams {
text match_team_id PK
integer match_id FK
text team
text opponent
boolean won_match
}
int_players {
text player_id PK
text team_id FK
text player_name
text team
}
int_teams {
text team_id PK
text team
text format
}
%% Core Relationships
int_matches ||--o{ int_deliveries : "recorded in"
int_matches ||--o{ int_innings : "summarizes"
int_matches ||--o{ int_match_players : "participated by"
int_matches ||--o{ int_match_teams : "contested by"
int_teams ||--o{ int_players : "belongs to"
There are three layers of dbt DAG:
-
Staging Models
stg_cricsheet__match_info: Stage match info JSONB into normalized table.stg_cricsheet__deliveries: Stage deliveries JSONB into normalized table.
-
Intermediate Models
int_venues: Each row represents a match venue.int_matches: Each row represents a match, with columns such asint_deliveries: Each row represents a unique match delivery.int_innings: Each row represents a match innings, with columns such asint_teams: Each row represents a unique (team, format) pair.int_match_teams: Junction table for represent many to many relationship between matches and teams.int_players: Each row represents a unique (player, team, format) tuple.int_match_players: Junction table for represent many to many relationship between matches and players.
-
Marts
fct_batting_order: Batting order of each match innings.fct_dismissed_players: Dismissed players in each match innings.fct_deliveries_sequence: Sequence of deliveries in each innings.fct_batting_scorecard: Batting scorecard of each innings.fct_bowling_scorecard: Bowling scorecard of each innings.
If you want to quickly test the pipeline, you can ingest a small subset of matches from Cricsheet tournament archives.
Examples:
Run the pipeline with those URLs to be able to run example queries below.
Once the warehouse is built, you can use analytical marts to answer many questions such as:
WITH stats AS (
SELECT
COUNT(*) AS innings,
bs.player_name AS batter,
SUM(bs.runs) AS runs,
SUM(bs.balls) AS balls,
SUM(CASE WHEN bs.is_dismissed THEN 1 ELSE 0 END) AS dismissals
FROM fct_batting_scorecard bs
JOIN int_matches m USING (match_id)
WHERE
m.event_name = 'ICC Women''s World Cup' AND
EXTRACT( YEAR FROM m.start_date) = 2025
GROUP BY bs.player_name
)
SELECT
batter,
innings,
innings - dismissals AS not_outs,
runs,
ROUND(runs / NULLIF(dismissals, 0), 2) AS average,
ROUND(100.0 * runs / NULLIF(balls, 0), 2) AS strike_rate
FROM stats
ORDER BY runs DESC
LIMIT 10
Output:
batter | innings | not_outs | runs | average | strike_rate
-----------------+---------+----------+------+---------+-------------
L Wolvaardt | 9 | 1 | 571 | 71.38 | 98.79
S Mandhana | 9 | 1 | 434 | 54.25 | 99.09
A Gardner | 5 | 1 | 328 | 82.00 | 130.16
Pratika Rawal | 6 | 0 | 308 | 51.33 | 77.78
P Litchfield | 7 | 1 | 304 | 50.67 | 112.18
AJ Healy | 5 | 1 | 299 | 74.75 | 125.10
JI Rodrigues | 7 | 2 | 292 | 58.40 | 101.04
SFM Devine | 5 | 0 | 289 | 57.80 | 85.25
HC Knight | 7 | 1 | 288 | 48.00 | 85.71
NR Sciver-Brunt | 6 | 0 | 262 | 43.67 | 85.34
(10 rows)
WITH stats AS (
SELECT
COUNT(*) AS innings,
bs.bowler,
SUM(bs.runs) AS runs,
SUM(bs.balls) AS balls,
SUM(bs.wickets) AS wickets,
SUM(bs.dots) AS dots
FROM fct_bowling_scorecard bs
JOIN int_matches m USING (match_id)
WHERE
m.event_name = 'Indian Premier League' AND
EXTRACT( YEAR FROM m.start_date) = 2025
GROUP BY bs.bowler
)
SELECT
bowler,
innings,
wickets,
DIV(balls, 6)::text || '.' || MOD(balls, 6) AS overs,
ROUND(runs / NULLIF(wickets, 0), 2) AS average,
ROUND(6 * runs / NULLIF(balls, 0), 2) AS economy,
dots,
ROUND(100 * dots / NULLIF(balls, 0), 2) AS dot_ball_pct
FROM stats
ORDER BY dots DESC, dot_ball_pct DESC
LIMIT 10;Output
bowler | innings | wickets | overs | average | economy | dots | dot_ball_pct
-------------------+---------+---------+-------+---------+---------+------+--------------
Mohammed Siraj | 15 | 16 | 57.0 | 32.94 | 9.25 | 151 | 44.15
M Prasidh Krishna | 15 | 25 | 59.0 | 19.52 | 8.27 | 146 | 41.24
KK Ahmed | 14 | 15 | 46.4 | 29.80 | 9.58 | 137 | 48.93
Arshdeep Singh | 16 | 21 | 58.2 | 24.67 | 8.88 | 137 | 39.14
JJ Bumrah | 12 | 18 | 47.2 | 17.56 | 6.68 | 128 | 45.07
TA Boult | 16 | 22 | 57.4 | 23.50 | 8.97 | 127 | 36.71
B Kumar | 14 | 17 | 52.0 | 28.41 | 9.29 | 123 | 39.42
JR Hazlewood | 12 | 22 | 44.0 | 17.55 | 8.77 | 120 | 45.45
PJ Cummins | 14 | 16 | 49.4 | 28.13 | 9.06 | 118 | 39.60
CV Varun | 13 | 17 | 50.0 | 22.53 | 7.66 | 117 | 39.00
(10 rows)
The project includes a CLI for managing the ingestion pipeline.
cricwh --help
Usage: cricwh [OPTIONS] COMMAND [ARGS]...
╭─ Options ───────────────────────────────────────────────────────────────╮
│ --install-completion Install completion for the current shell. │
│ --show-completion Show completion for the current shell, to │
│ copy it or customize the installation. │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────╯
╭─ Commands───────────────────────────────────────────────────────────────╮
│ fetch Fetch data from Cricsheet. │
│ configure Configure cricket-warehouse. │
│ init Initialize source tables and seeds. │
│ ingest Ingest JSON files into source tables. │
│ update Update venue city seed. │
╰──────────────────────────────────────────────────────────────────────────╯A config file is provided to manage PostgreSQL database credentials. On first run, cricwh initializes an example config. The config file may be found at:
| Operating System | Location |
|---|---|
| Linux/Unix | ~/.config/cricketwarehouse/config.yaml |
| macOS | ~/Library/Preferences/cricketwarehouse/config.yaml |
| Windows | C:\Users\<username>\AppData\Local\cricketwarehouse\cricketwarehouse/config.yaml |
You can edit the configuration using the configure command:
cricwh configure [--init-config-file]You can reset the config file using the --init-config-file flag in the configure command.
Detailed logs are written during each command. The log file may be found at:
| Operating System | Location |
|---|---|
| Linux/Unix | ~/.local/share/cricketwarehouse/cricwh.log |
| macOS | ~/Library/Application Support/cricketwarehouse/cricwh.log |
| Windows | C:\Users\<username>\AppData\Local\cricketwarehouse\cricketwarehouse/cricwh.log |
Assuming you've configured your database, a typical workflow goes as follows:
-
Initialize source tables (only on first run)
cricwh init
-
Fetch and extract raw match data.
cricwh fetch [URL] [ZIP FILE PATH]
-
Ingest match data into source tables.
cricwh ingest
-
Run dbt models
dbt deps --project-dir dbt/ dbt build --project-dir dbt/
| Tool | Purpose |
|---|---|
| Python | Data ingestion and CLI tooling |
| Apache Airflow | Pipeline Orchestration |
| dbt | Data modeling and transformations |
| PostgreSQL | Data warehouse |
| Astronomer Cosmos | ELT Pipeline Orchestration |
| psycopg2 | PostgreSQL database interface |


