
RobGonWin/gh-archive-systems-ops-analytics


GH Archive Systems Ops Signals

Systems project that turns bounded GH Archive event streams into reproducible operational signals and manager-facing decision support.

It focuses on ingestion guardrails, schema-drift detection, parse quarantine logic, and matched-window marts so engineering and data leaders can separate platform behavior changes from data-pipeline noise.

Screenshots

GH Archive Event Ops Mart Preview 1

GH Archive Event Ops Mart Preview 2

Objective

Build a bounded event-stream review workflow that demonstrates:

  • disciplined ingestion of bounded hourly event files
  • parse health and quarantine monitoring
  • schema-drift and event-mix change detection
  • repo concentration and ops-signal modeling
  • manager-facing recommendations backed by reproducible outputs
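The parse health and quarantine idea in the list above can be sketched as follows. This is a minimal illustration, not the code in scripts/normalize_events.py: the function names, the quarantine record layout, and the rule that an event must be a JSON object with a "type" key are all assumptions made for the example.

```python
import json


def parse_with_quarantine(lines):
    """Split newline-delimited JSON into parsed events and quarantined rows.

    Hypothetical helper: the quarantine rules here are illustrative assumptions.
    """
    parsed, quarantined = [], []
    for line_number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # blank lines are skipped, not counted as failures
        try:
            event = json.loads(text)
        except json.JSONDecodeError as exc:
            quarantined.append(
                {"line_number": line_number, "reason": f"invalid_json: {exc.msg}", "raw": text}
            )
            continue
        if not isinstance(event, dict) or "type" not in event:
            quarantined.append(
                {"line_number": line_number, "reason": "missing_type", "raw": text}
            )
            continue
        parsed.append(event)
    return parsed, quarantined


def quarantine_rate(parsed, quarantined):
    """Share of non-blank lines that ended up in quarantine."""
    total = len(parsed) + len(quarantined)
    return len(quarantined) / total if total else 0.0
```

Keeping the raw text and a reason code on each quarantined row makes it possible to review failures later instead of silently dropping them.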

Verified dataset behavior

  • GH Archive is live and usable through https://data.gharchive.org/.
  • The hour in each URL is not zero-padded: hours 0 through 9 appear as a single digit (for example -0, not -00).
  • Example confirmed paths:
    • https://data.gharchive.org/2024-10-01-0.json.gz
    • https://data.gharchive.org/2025-10-01-12.json.gz
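A small sketch of how the hourly URLs above could be generated. The helper names hour_url and window_urls are hypothetical and are not taken from the project's scripts; only the base URL and the non-zero-padded hour come from the behavior listed above.

```python
from datetime import datetime, timedelta

BASE_URL = "https://data.gharchive.org"


def hour_url(ts: datetime) -> str:
    """Build the GH Archive URL for one UTC hour.

    The date part is zero-padded, but the hour is written without padding.
    """
    return f"{BASE_URL}/{ts:%Y-%m-%d}-{ts.hour}.json.gz"


def window_urls(start: datetime, hours: int) -> list[str]:
    """Return the URLs for a bounded run of consecutive hours."""
    return [hour_url(start + timedelta(hours=offset)) for offset in range(hours)]


if __name__ == "__main__":
    for url in window_urls(datetime(2024, 10, 1, 0), 3):
        print(url)
```

Using ts.hour directly, rather than a %H format code, is what avoids the zero padding.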

Bounded scope and bandwidth guardrails

  • Pull bounded hourly files only.
  • Run an HTTP HEAD preflight before any direct download when possible.
  • Reject any single file larger than 250 MB unless explicitly approved.
  • Treat any workload above 1 GB as cloud-first rather than running it locally.
  • Persist raw downloads plus normalized CSV and Parquet snapshots locally.
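The HEAD preflight and the 250 MB rule above could look roughly like this. It is a sketch under stated assumptions, not the contents of scripts/head_preflight.py: the function names, the decision labels, and the choice to read 250 MB as 250,000,000 bytes are all illustrative.

```python
import urllib.request
from typing import Optional

# Assumption: "250 MB" is interpreted as decimal megabytes.
MAX_FILE_BYTES = 250 * 1000 * 1000


def head_content_length(url: str, timeout: float = 30.0) -> Optional[int]:
    """Send an HTTP HEAD request and return Content-Length, if the server reports it."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        value = response.headers.get("Content-Length")
    return int(value) if value is not None else None


def preflight_decision(size_bytes: Optional[int], approved: bool = False) -> str:
    """Apply the size guardrail to a reported file size.

    Returns one of the hypothetical labels "unknown_size", "reject", or "download".
    """
    if size_bytes is None:
        return "unknown_size"  # no size reported; leave the call to a human
    if size_bytes > MAX_FILE_BYTES and not approved:
        return "reject"
    return "download"
```

Separating the network call from the decision keeps the guardrail itself testable without touching the network.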

First completed comparison slice

  • Current window: 2026-03-10 hours 0 through 7 UTC, labeled 2025_2026
  • Baseline window: 2024-03-10 hours 0 through 7 UTC, labeled 2023_2024
  • Pairing two 8-hour windows over the same UTC hours preserves the matched-hour comparison while staying inside the local bandwidth guardrail.
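To make the matched-hour idea concrete, here is a minimal sketch that pairs each hour of the current window with the same UTC hour of the baseline window. The function names are hypothetical; the start-hour strings follow the YYYY-MM-DD-H form used by the --start-hour flag in the workflow.

```python
from datetime import datetime, timedelta


def parse_start_hour(value: str) -> datetime:
    """Parse a start hour such as "2026-03-10-0" (the hour may be one or two digits)."""
    return datetime.strptime(value, "%Y-%m-%d-%H")


def matched_hours(current_start: str, baseline_start: str, hours: int):
    """Pair each current-window hour with the baseline hour at the same offset.

    Raises ValueError if the two windows do not line up on hour of day.
    """
    current = parse_start_hour(current_start)
    baseline = parse_start_hour(baseline_start)
    pairs = []
    for offset in range(hours):
        cur = current + timedelta(hours=offset)
        base = baseline + timedelta(hours=offset)
        if cur.hour != base.hour:
            raise ValueError(f"hour mismatch at offset {offset}: {cur.hour} vs {base.hour}")
        pairs.append((cur, base))
    return pairs


if __name__ == "__main__":
    for cur, base in matched_hours("2026-03-10-0", "2024-03-10-0", 8):
        print(cur.isoformat(), "<->", base.isoformat())
```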

Project structure

  • scripts/: preflight, download, normalize, and DuckDB model execution
  • models/staging: typed staging models
  • models/marts: comparison and ops-signal marts
  • tests/: unit and SQL checks
  • app/: Streamlit review app
  • docs/: project brief and incident memo framing
  • data/: raw and curated local artifacts
  • outputs/: manifests and snapshots

Workflow

  1. Preflight the current window:
    python scripts/head_preflight.py \
      --start-hour 2026-03-10-0 \
      --hours 8 \
      --window-label 2025_2026
  2. Download the current window:
    python scripts/ingest_gharchive_hourly.py \
      --manifest-path outputs/manifests/head_preflight_2025_2026_2026-03-10_8h.csv
  3. Normalize the current window:
    python scripts/normalize_events.py \
      --download-manifest-path outputs/manifests/download_manifest_2025_2026_2026-03-10_8h.csv
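  4. Repeat steps 1 through 3 for the baseline window, starting with the preflight below, then pass the resulting baseline manifests to the download and normalize scripts as in steps 2 and 3: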
    python scripts/head_preflight.py \
      --start-hour 2024-03-10-0 \
      --hours 8 \
      --window-label 2023_2024
  5. Run the DuckDB model layer across the latest snapshots from both windows:
    python scripts/run_duckdb_models.py
  6. Review the operational narrative:
    streamlit run app/streamlit_app.py

What the marts answer

  • mart_parse_health_summary: did parse success or quarantine rates deteriorate?
  • mart_schema_drift_summary: which fields changed presence or observed types across windows?
  • mart_event_type_shift: which event types materially shifted in mix or volume?
  • mart_repo_concentration: did a small repo set dominate the current window?
  • mart_ops_signals: which shifts deserve monitoring or manager-facing follow-up?
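As an illustration of the kind of question mart_event_type_shift answers, the sketch below compares event-type shares between two windows in plain Python. It is not the project's DuckDB model: the function names and output columns are assumptions, and the real mart may define "material" shifts differently.

```python
from collections import Counter


def event_mix(event_types):
    """Return each event type's share of the total event count."""
    counts = Counter(event_types)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()} if total else {}


def mix_shift(current_types, baseline_types):
    """Compare event-type shares across windows, largest absolute change first."""
    current = event_mix(current_types)
    baseline = event_mix(baseline_types)
    rows = []
    for name in sorted(set(current) | set(baseline)):
        cur_share = current.get(name, 0.0)
        base_share = baseline.get(name, 0.0)
        rows.append(
            {
                "event_type": name,
                "baseline_share": base_share,
                "current_share": cur_share,
                "share_delta": cur_share - base_share,
            }
        )
    return sorted(rows, key=lambda row: abs(row["share_delta"]), reverse=True)


if __name__ == "__main__":
    current = ["PushEvent"] * 3 + ["WatchEvent"]
    baseline = ["PushEvent"] * 2 + ["WatchEvent"] * 2
    for row in mix_shift(current, baseline):
        print(row)
```

Comparing shares rather than raw counts keeps the result meaningful even when the two windows differ in total volume.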

Why A Team Would Care

A data platform, analytics, operations, or health-tech-adjacent team could use this pattern to turn noisy event files into reviewable tables, catch data quality issues early, and support cautious operational decisions without overstating what the signals mean.
