This repository contains the PDF and a small code sample (rough draft) for my submission to the Shell Corp take-home assignment.
NOTE: I have called this the Shell Corp take-home (rather than using the actual company name) so that it's less likely that future candidates find it by accident, especially if the intent is to reuse this assignment for new hires. Virgin asked me to do something similar when I interviewed with them some years ago; I am guessing that some people look for answers to take-home assignments online during active interview processes.
The architecture is described in the PDF `proposal.pdf`, which contains images of the current and proposed architecture diagrams along with a description of the current data pipeline and the proposed changes. The images can also be found separately in the `diagrams` folder (`.png` and `.svg` versions).
A sample project demonstrating the integration of DuckDB, DLT, and SQLMesh. DuckDB is used as the data warehouse, DLT loads and validates dummy (semi-real) football transfer data, and SQLMesh handles data transformations with built-in testing and validation.
I have skipped the step of converting the CDC logs into tables representing their active state, as it would probably take too long within the time frame. I will happily explain it in the interview, though.
- Install the UV package manager (if not already installed)
- Run `make init` to set up the Python environment
- Run `make dlt` to load sample data into DuckDB (a minimal sketch of this step appears after the note below)
- Run `make sqlmesh-plan` to execute the SQLMesh transformations
NOTE: Don't be surprised if the `make init` command doesn't work. The original version of the Makefile was written for Mac, and adapting it for Windows caused some hiccups.
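For context, here is a minimal sketch of what the `make dlt` step does under the hood. The names (`football_transfers`, `raw`) are illustrative, not the actual loader code, which lives in `src/loader/` and validates records with Patito before loading.

```python
import dlt


@dlt.resource(name="transfers", write_disposition="replace")
def transfers():
    # Dummy (semi-real) football transfer records are yielded here.
    yield {
        "player_name": "Jude Bellingham",
        "age": 21,
        "market_value_eur": 180_000_000.0,
    }


pipeline = dlt.pipeline(
    pipeline_name="football_transfers",  # hypothetical pipeline name
    destination="duckdb",
    dataset_name="raw",  # hypothetical dataset/schema name
)
print(pipeline.run(transfers()))  # writes to a local DuckDB database file
```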
- `database/`: Contains the DuckDB database file
- `src/`
  - `loader/`: DLT scripts for loading data into DuckDB
  - `sqlmesh/`: SQLMesh configuration and models
    - `audits/`: SQLMesh audits
    - `models/`: SQLMesh transformation models
    - `tests/`: SQLMesh tests
    - `config.py`: SQLMesh configuration
  - `utils/`: Shared utility functions
- `tests/`: Unit tests for loader and utility functions
- `make init`: Set up the Python environment and install dependencies
- `make clean`: Remove Python cache files and build artifacts
- `make upgrade-python-deps`: Upgrade all Python dependencies

- `make test`: Run all unit tests with coverage report and verbose logging
- `make mypy`: Run static type checking

- `make dlt`: Load sample football transfer data into DuckDB
- `make sqlmesh-plan`: Execute SQLMesh transformations (only applies changes if models or data have changed)
- `make sqlmesh-run`: Run models based on their configured schedules
- `make sqlmesh-restate`: Force a rerun of specific model transformations regardless of changes
- `make sqlmesh-test`: Run SQLMesh model unit tests
- `make sqlmesh-audit`: Run data quality audits/checks
SQLMesh uses an intelligent execution model to determine when to run transformations:
- **Plan vs Run**:
  - `sqlmesh plan` is used for deploying changes to models and synchronizing environments
  - `sqlmesh run` is used for scheduled execution of models based on their cron parameters
  - Use `plan` during development and deployment
  - Use `run` for production scheduled execution
- **Plan Behavior**:
  - `sqlmesh plan` only applies changes when:
    - Model definitions have changed
    - New data is available
    - Models haven't been run for their scheduled interval
  - Running `plan` multiple times without changes will not trigger re-execution
- **Run Behavior**:
  - `sqlmesh run` checks each model's cron schedule
  - Only executes models whose scheduled interval has elapsed since the last run
  - Does not re-execute models that have run within their interval
  - Typically executed on a schedule (e.g., via crontab) at least as frequently as your shortest model interval
  - Example: if models run every 5 minutes, schedule `sqlmesh run` every 5 minutes
- **Cron Scheduling**:
  - Models use the `cron '*/5 * * * *'` configuration (runs every 5 minutes); see the model sketch after this list
  - SQLMesh tracks the last successful run time
  - A model will only run if:
    - It hasn't been run in the last 5 minutes, OR
    - You explicitly force a rerun with `sqlmesh-restate`
  - The schedule is based on exact intervals, not calendar days
  - Forward-only runs don't mark the interval as complete
- **Restate Command**:
  - Use `make sqlmesh-restate` to force model re-execution
  - Useful for:
    - Testing changes during development
    - Fixing data quality issues
    - Backfilling historical data
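For reference, here is a minimal sketch of a model configured with the 5-minute cron described above, using SQLMesh's documented Python `@model` decorator. The model name, columns, and returned data are illustrative, not the actual models in this repo (which may well be SQL models).

```python
import typing as t
from datetime import datetime

import pandas as pd
from sqlmesh import ExecutionContext, model


@model(
    "transfers.latest_market_values",  # hypothetical model name
    cron="*/5 * * * *",  # run every 5 minutes, per the scheduling notes above
    columns={"player_name": "text", "market_value_eur": "double"},
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs: t.Any,
) -> pd.DataFrame:
    # Illustrative payload; a real model would query upstream tables via `context`.
    return pd.DataFrame(
        {"player_name": ["Jude Bellingham"], "market_value_eur": [180_000_000.0]}
    )
```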
The DLT loader includes comprehensive validation:
- Strict schema enforcement using Patito models
- Type validation for all fields
- Business rule validation (e.g., valid age ranges, market values)
- Automated data quality checks during ingestion
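As an illustration of the Patito validation style, here is a hypothetical sketch of a schema with type and business-rule constraints; it is not the actual model used in `src/loader/`.

```python
import patito as pt
import polars as pl


class Transfer(pt.Model):
    """Hypothetical schema for a football transfer record."""

    player_name: str
    age: int = pt.Field(ge=15, le=50)  # business rule: plausible player age range
    market_value_eur: float = pt.Field(ge=0)  # market value cannot be negative


df = pl.DataFrame(
    {
        "player_name": ["Jude Bellingham"],
        "age": [21],
        "market_value_eur": [180_000_000.0],
    }
)

# Raises a Patito validation error if types or business rules are violated.
Transfer.validate(df)
```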
SQLMesh models include:
- Data quality audits (e.g., value ranges, date validity)
- Schema validation
- Automated testing using SQLMesh's testing framework
- Data quality assertions in transformations
For simplicity, the SQLMesh models are implemented as full refreshes rather than incremental processing. Converting them to incremental processing would involve adding timestamp/version columns to the source data, which doesn't make sense for this example. It would definitely be done in a real project with a larger dataset and more frequent updates.
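For illustration, an incremental version might look roughly like the sketch below, assuming SQLMesh's documented `INCREMENTAL_BY_TIME_RANGE` model kind, a hypothetical `updated_at` timestamp column added to the source, and a hypothetical `raw.transfers` source table.

```python
import typing as t
from datetime import datetime

import pandas as pd
from sqlmesh import ExecutionContext, model
from sqlmesh.core.model import ModelKindName


@model(
    "transfers.incremental_example",  # hypothetical model name
    kind=dict(
        name=ModelKindName.INCREMENTAL_BY_TIME_RANGE,
        time_column="updated_at",  # hypothetical timestamp column on the source
    ),
    columns={"player_name": "text", "updated_at": "timestamp"},
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs: t.Any,
) -> pd.DataFrame:
    # Only rows in the [start, end) interval are (re)processed on each run,
    # instead of rebuilding the whole table.
    return context.fetchdf(
        "SELECT player_name, updated_at FROM raw.transfers "
        f"WHERE updated_at >= '{start}' AND updated_at < '{end}'"
    )
```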
The project demonstrates comprehensive testing at multiple levels:
- Unit tests for Python code using pytest
- Data validation during ingestion using Patito
- Model tests using SQLMesh's testing framework
- Data quality audits for transformed data
This ensures data quality and transformation accuracy throughout the pipeline.
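To give a flavour of the pytest level, here is a self-contained hypothetical example; `parse_fee` is illustrative and not a function from this repo.

```python
import pytest


def parse_fee(raw: str) -> int:
    """Parse a transfer fee string like '€180m' into euros."""
    if not raw.startswith("€") or not raw.endswith("m"):
        raise ValueError(f"unrecognised fee format: {raw!r}")
    return int(float(raw[1:-1]) * 1_000_000)


def test_parse_fee_handles_millions():
    assert parse_fee("€103.5m") == 103_500_000


def test_parse_fee_rejects_unknown_formats():
    with pytest.raises(ValueError):
        parse_fee("free transfer")
```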