Analytics Engineering Pipeline for an EdTech Startup (Databricks + Delta Lake)

Project Overview

EdTech startups need reliable analytics to understand:

  • learner sign-ups vs paid conversions
  • cohort performance
  • session attendance
  • revenue trends over time

This project simulates a realistic analytics engineering pipeline for an EdTech bootcamp platform using Databricks Free Edition, with a focus on:

  • batch processing
  • clean data modeling
  • business-ready analytics tables

Objective

Build an end-to-end analytics pipeline that:

  • ingests raw operational data
  • transforms it into clean, modeled datasets
  • produces analytics tables suitable for dashboards and decision-making

High-Level Architecture (Medallion Pattern)

Raw Files (CSV / JSON)
        ↓
Bronze Delta Tables (raw ingestion)
        ↓
Silver Delta Tables (cleaned & modeled)
        ↓
Gold Tables (analytics & metrics)

Although Unity Catalog is not available in Databricks Free Edition, the Medallion Architecture is still implemented, using Delta tables registered in the workspace metastore.
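Without Unity Catalog there is no catalog.schema.table hierarchy, so in the sketches below the layer is encoded in the table name. The names are illustrative, not the repository's actual table names.

# Illustrative layer-to-table mapping in the workspace metastore
# (two-level names only; here, the default database).
#   bronze_users         -> raw ingestion of users.csv
#   silver_users         -> cleaned, deduplicated users
#   gold_cohort_funnel   -> pre-aggregated cohort funnel metrics
spark.sql("SHOW TABLES").show()   # lists the layer tables side by side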


Environment & Constraints

Databricks Environment

  • Databricks Free Edition
  • Workspace metastore (no Unity Catalog)
  • Batch processing only
  • Delta Lake tables

What Is Intentionally Not Used

  • Unity Catalog (not available in Free Edition)
  • Streaming pipelines
  • Production orchestration tools

These constraints are documented intentionally and discussed as trade-offs rather than limitations.


Tech Stack

Area               Choice
Platform           Databricks Free Edition
Storage            Delta Lake
Processing         Apache Spark
Languages          PySpark, SQL
Version Control    GitHub
Documentation      Markdown

Delta Lake is used for reliability, schema enforcement, and data versioning.
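Two of those features, shown for illustration against a hypothetical silver_users table in a Databricks notebook (where spark is predefined):

# Reliability: audit log of every commit to the table
spark.sql("DESCRIBE HISTORY silver_users").show(truncate=False)

# Data versioning: time travel back to the first committed version
first_version = spark.sql("SELECT * FROM silver_users VERSION AS OF 0")

# Schema enforcement: appending a DataFrame whose columns do not match the
# table schema raises an error instead of silently corrupting the table.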


Dataset – EdTech Bootcamp Scenario

The dataset represents a simplified but realistic EdTech workflow:

  • users sign up
  • users enroll in cohorts
  • some users pay
  • some users attend sessions
  • some users drop off

Raw Entities

  • users
  • courses
  • cohorts
  • enrollments
  • payments
  • sessions_attended

The dataset is synthetic, but structurally representative of real operational data, designed to surface common analytics challenges such as joins, deduplication, and funnel analysis.


Layer Responsibilities

Bronze Layer – Raw Ingestion

Purpose

  • Preserve source data as close to raw form as possible
  • Enable reprocessing if downstream logic changes

Characteristics

  • Explicit schema definitions
  • Minimal transformations
  • Ingestion timestamp added
  • No joins or business logic
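A minimal sketch of a bronze ingestion notebook for users.csv. The schema and column names are assumptions for illustration, not the repository's actual definitions.

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Explicit schema: no inference, so malformed data surfaces early
users_schema = StructType([
    StructField("user_id", StringType(), False),
    StructField("email", StringType(), True),
    StructField("signup_ts", TimestampType(), True),
])

bronze_users = (
    spark.read
    .schema(users_schema)
    .option("header", "true")
    .csv("dbfs:/FileStore/data/users.csv")
    .withColumn("_ingested_at", F.current_timestamp())  # ingestion timestamp
)

# No joins or business logic at this layer; persist as-is to a Delta table.
bronze_users.write.format("delta").mode("overwrite").saveAsTable("bronze_users")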

Silver Layer – Clean & Modeled Data

Purpose

  • Create reliable, analytics-ready datasets
  • Serve as the single source of truth for analytics

Transformations

  • Type casting and normalization
  • Deduplication using business keys
  • Referential integrity across entities
  • Separation of fact and dimension tables
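A sketch of the silver-layer pattern for the users entity, reusing the hypothetical bronze_users table from the sketch above, with user_id assumed as the business key:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

bronze_users = spark.table("bronze_users")

# Deduplication: keep only the most recently ingested row per business key
latest_per_user = Window.partitionBy("user_id").orderBy(F.col("_ingested_at").desc())

silver_users = (
    bronze_users
    .withColumn("_rn", F.row_number().over(latest_per_user))
    .filter(F.col("_rn") == 1)
    .drop("_rn")
    .withColumn("email", F.lower(F.trim("email")))       # normalization
    .withColumn("signup_date", F.to_date("signup_ts"))   # type casting
)

silver_users.write.format("delta").mode("overwrite").saveAsTable("silver_users")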

Gold Layer – Business Metrics

Purpose

  • Provide business-facing metrics and KPIs
  • Optimize data for analytical consumption

Characteristics

  • Pre-aggregated metrics
  • SQL-first design
  • No complex transformation logic
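A sketch of a gold table in the SQL-first style: a pre-aggregated sign-up-vs-paid funnel per cohort. The silver table and column names (silver_enrollments, silver_payments, cohort_id, amount) are illustrative assumptions.

spark.sql("""
    CREATE OR REPLACE TABLE gold_cohort_funnel AS
    SELECT
        e.cohort_id,
        COUNT(DISTINCT e.user_id)  AS enrolled_users,
        COUNT(DISTINCT p.user_id)  AS paying_users,
        ROUND(COUNT(DISTINCT p.user_id) / COUNT(DISTINCT e.user_id), 3) AS conversion_rate,
        SUM(p.amount)              AS total_revenue
    FROM silver_enrollments e
    LEFT JOIN silver_payments p
      ON e.user_id = p.user_id AND e.cohort_id = p.cohort_id
    GROUP BY e.cohort_id
""")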

Data Modeling Approach

The project follows a dimensional modeling approach commonly used in analytics engineering.

  • Dimensions describe entities such as users, courses, and cohorts
  • Fact tables capture events and transactions such as enrollments, payments, and attendance

Each fact table has a clearly defined grain, documented in the transformation logic and code comments.
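As an example of what that documentation and a matching check can look like (payment_id and silver_payments are hypothetical names):

from pyspark.sql import functions as F

# Grain (assumed for illustration): one row in fact_payments per payment_id
fact_payments = spark.table("silver_payments")

# Guard the declared grain: the business key must be unique at that grain
duplicate_keys = (
    fact_payments
    .groupBy("payment_id")
    .count()
    .filter(F.col("count") > 1)
)
assert duplicate_keys.count() == 0, "Grain violated: duplicate payment_id rows"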


Code and Data Storage Decisions

Code

All code is version-controlled in GitHub, including:

  • notebooks
  • PySpark transformation logic
  • SQL analytics queries
  • documentation

Databricks Repos is used to sync the workspace with GitHub, ensuring changes are tracked and reviewable.


Data

Datasets are deliberately not stored in GitHub.

  • raw files are uploaded to DBFS
  • processed data is stored as Delta tables

This avoids Git LFS issues and follows standard data platform practices, where data versioning is handled by the storage layer rather than source control.


How Raw Data Is Loaded

Raw CSV / JSON files are uploaded manually into DBFS using the Databricks UI.

Example Directory Structure

dbfs:/FileStore/data/
├── users.csv
├── courses.csv
├── cohorts.csv
├── enrollments.csv
├── payments.csv
└── sessions_attended.csv
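A quick way to confirm the upload from a notebook (dbutils is available in Databricks notebooks):

# List the uploaded raw files and their sizes in bytes
for f in dbutils.fs.ls("dbfs:/FileStore/data/"):
    print(f.name, f.size)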

Repository Structure

├── notebooks/
│   ├── bronze_ingestion/
│   ├── silver_transformations/
│   └── gold_metrics/
├── data/
│   └── sample/              # tiny example files only
├── docs/
│   └── architecture.md
└── README.md

What This Project Demonstrates

  • Designing analytics pipelines under real-world constraints
  • Applying Medallion Architecture without Unity Catalog
  • Using Delta Lake for reliable batch analytics
  • Separating transformation logic from analytics consumption
  • Communicating trade-offs clearly and transparently

How to Run (High Level)

  1. Upload raw data files to DBFS
  2. Run Bronze ingestion notebooks
  3. Run Silver transformation notebooks
  4. Query Gold tables using SQL

Future Improvements

If moved to a paid Databricks environment:

  • enable Unity Catalog for governance
  • introduce streaming ingestion
  • add automated workflows
  • integrate dbt for transformations
  • implement formal data quality checks
