Systems project that turns bounded GH Archive event streams into reproducible operational signals and manager-facing decision support.
It focuses on ingestion guardrails, schema-drift detection, parse quarantine logic, and matched-window marts so engineering and data leaders can separate platform behavior changes from data-pipeline noise.
Build a bounded event-stream review workflow that demonstrates:
- bounded event-stream ingestion discipline
- parse health and quarantine monitoring
- schema-drift and event-mix change detection
- repo concentration and ops-signal modeling
- manager-facing recommendations backed by reproducible outputs
- GH Archive is live and usable through https://data.gharchive.org/.
- Hour URLs use non-zero-padded hours for `0` through `9`.
- Example confirmed paths:
  - https://data.gharchive.org/2024-10-01-0.json.gz
  - https://data.gharchive.org/2025-10-01-12.json.gz
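The non-zero-padded hour convention is easy to get wrong; a minimal sketch of URL construction (the `gharchive_hour_urls` helper name is illustrative, not one of this project's scripts):

```python
from datetime import datetime, timedelta

def gharchive_hour_urls(start: str, hours: int) -> list[str]:
    """Build consecutive GH Archive hourly URLs from a start hour like '2026-03-10-0'.

    Hours in GH Archive paths are NOT zero-padded: ...-0.json.gz through ...-9.json.gz.
    """
    first = datetime.strptime(start, "%Y-%m-%d-%H")
    return [
        # t.hour is an int, so it renders without a leading zero.
        f"https://data.gharchive.org/{t:%Y-%m-%d}-{t.hour}.json.gz"
        for t in (first + timedelta(hours=i) for i in range(hours))
    ]
```

For the current window, `gharchive_hour_urls("2026-03-10-0", 8)` yields the eight hourly URLs ending in `-0.json.gz` through `-7.json.gz`.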
- Pull bounded hourly files only.
- Use `HEAD` preflight before any direct download when possible.
- Reject any single file above `250 MB` unless explicitly approved.
- Treat local workloads above `1 GB` as cloud-first.
- Persist raw downloads plus normalized CSV and Parquet snapshots locally.
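The `HEAD`-preflight-plus-size-cap guardrail can be sketched as follows; `within_guardrail` and `head_preflight` are hypothetical names, not the actual functions inside `scripts/head_preflight.py`:

```python
import urllib.request

MAX_FILE_BYTES = 250 * 1024 * 1024  # 250 MB single-file guardrail from above

def within_guardrail(content_length: int, max_bytes: int = MAX_FILE_BYTES) -> bool:
    """Pass only when the server reported a size and it is at or under the cap."""
    return 0 <= content_length <= max_bytes

def head_preflight(url: str, max_bytes: int = MAX_FILE_BYTES) -> dict:
    """Issue a HEAD request and record size plus pass/fail before any download."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=30) as resp:
        size = int(resp.headers.get("Content-Length", "-1"))
    return {"url": url, "bytes": size, "ok": within_guardrail(size, max_bytes)}
```

A missing `Content-Length` header maps to `-1` and fails the check, so unknown-size files are rejected rather than downloaded blind.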
- Current window: `2026-03-10` hours `0` through `7` UTC, labeled `2025_2026`
- Baseline window: `2024-03-10` hours `0` through `7` UTC, labeled `2023_2024`
- This paired `8h` shape preserves matched-hour comparison while staying inside the local bandwidth guardrail.
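The matched-hour pairing behind the two windows can be sketched as follows (dates and labels come from the window definitions above; the helper itself is illustrative):

```python
# Window days as defined above; each window covers hours 0 through 7 UTC.
CURRENT_DAY, BASELINE_DAY = "2026-03-10", "2024-03-10"
HOURS = range(8)

def matched_hour_files():
    """Yield (current, baseline) hourly file names for the same UTC hour,
    so every current-window hour compares against its baseline counterpart."""
    for h in HOURS:
        yield (f"{CURRENT_DAY}-{h}.json.gz", f"{BASELINE_DAY}-{h}.json.gz")
```

Pairing by hour-of-day keeps diurnal activity patterns aligned, so volume differences reflect platform change rather than time-of-day mix.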
- `scripts/`: preflight, download, normalize, and DuckDB model execution
- `models/staging`: typed staging models
- `models/marts`: comparison and ops-signal marts
- `tests/`: unit and SQL checks
- `app/`: Streamlit review app
- `docs/`: project brief and incident memo framing
- `data/`: raw and curated local artifacts
- `outputs/`: manifests and snapshots
1. Preflight the current window:

   ```shell
   python scripts/head_preflight.py \
     --start-hour 2026-03-10-0 \
     --hours 8 \
     --window-label 2025_2026
   ```

2. Download the current window:

   ```shell
   python scripts/ingest_gharchive_hourly.py \
     --manifest-path outputs/manifests/head_preflight_2025_2026_2026-03-10_8h.csv
   ```

3. Normalize the current window:

   ```shell
   python scripts/normalize_events.py \
     --download-manifest-path outputs/manifests/download_manifest_2025_2026_2026-03-10_8h.csv
   ```

4. Repeat steps 1 through 3 for the baseline window:

   ```shell
   python scripts/head_preflight.py \
     --start-hour 2024-03-10-0 \
     --hours 8 \
     --window-label 2023_2024
   ```

5. Run the DuckDB model layer across the latest snapshots from both windows:

   ```shell
   python scripts/run_duckdb_models.py
   ```

6. Review the operational narrative:

   ```shell
   streamlit run app/streamlit_app.py
   ```
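The normalize step splits each hourly file into parsed rows and quarantined lines. A minimal sketch of that parse-quarantine idea (the real `scripts/normalize_events.py` is manifest-driven and writes CSV/Parquet snapshots; this helper and its field list are illustrative):

```python
import gzip
import json

def normalize_hour(path: str):
    """Parse one gzipped GH Archive hourly file of JSON lines.

    Returns (parsed, quarantined): good events become typed rows,
    unparseable lines are quarantined with their line number and error
    so parse health can be monitored instead of silently dropped.
    """
    parsed, quarantined = [], []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                event = json.loads(line)
                parsed.append({
                    "id": event.get("id"),
                    "type": event.get("type"),
                    "repo": (event.get("repo") or {}).get("name"),
                    "created_at": event.get("created_at"),
                })
            except json.JSONDecodeError as exc:
                quarantined.append({"line": lineno, "error": str(exc)})
    return parsed, quarantined
```

Keeping the quarantine list alongside the parsed rows is what lets `mart_parse_health_summary` compare quarantine rates across windows.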
- `mart_parse_health_summary`: did parse success or quarantine rates deteriorate?
- `mart_schema_drift_summary`: which fields changed presence or observed types across windows?
- `mart_event_type_shift`: which event types materially shifted in mix or volume?
- `mart_repo_concentration`: did a small repo set dominate the current window?
- `mart_ops_signals`: which shifts deserve monitoring or manager-facing follow-up?
A data platform, analytics, operations, or health-tech-adjacent team could use this pattern to turn noisy event files into reviewable tables, catch data quality issues early, and support cautious operational decisions without overstating what the signals mean.