Systems project that turns bounded GH Archive event streams into reproducible operational signals and manager-facing decision support.
It focuses on ingestion guardrails, schema-drift detection, parse quarantine logic, and matched-window marts so engineering and data leaders can separate platform behavior changes from data-pipeline noise.
Build a bounded event-stream review workflow that demonstrates:
- bounded event-stream ingestion discipline
- parse health and quarantine monitoring
- schema-drift and event-mix change detection
- repo concentration and ops-signal modeling
- manager-facing recommendations backed by reproducible outputs
- GH Archive is live and usable through https://data.gharchive.org/.
- Hour URLs use non-zero-padded hours for `0` through `9`.
- Example confirmed paths:
  - https://data.gharchive.org/2024-10-01-0.json.gz
  - https://data.gharchive.org/2025-10-01-12.json.gz
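The non-zero-padded hour convention is easy to get wrong; a minimal sketch of URL construction (the `gharchive_hour_urls` helper name is illustrative, not one of this project's scripts):

```python
from datetime import datetime, timedelta

def gharchive_hour_urls(start: str, hours: int) -> list[str]:
    """Build consecutive GH Archive hourly URLs from a start hour like '2026-03-10-0'.

    Hours in GH Archive paths are NOT zero-padded: ...-0.json.gz through ...-9.json.gz.
    """
    first = datetime.strptime(start, "%Y-%m-%d-%H")
    return [
        # t.hour is an int, so it renders without a leading zero.
        f"https://data.gharchive.org/{t:%Y-%m-%d}-{t.hour}.json.gz"
        for t in (first + timedelta(hours=i) for i in range(hours))
    ]
```

For the current window, `gharchive_hour_urls("2026-03-10-0", 8)` yields the eight hourly URLs ending in `-0.json.gz` through `-7.json.gz`.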
- Pull bounded hourly files only.
- Use `HEAD` preflight before any direct download when possible.
- Reject any single file above `250 MB` unless explicitly approved.
- Treat local workloads above `1 GB` as cloud-first.
- Persist raw downloads plus normalized CSV and Parquet snapshots locally.
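The `HEAD`-preflight-plus-size-cap guardrail can be sketched as follows; `within_guardrail` and `head_preflight` are hypothetical names, not the actual functions inside `scripts/head_preflight.py`:

```python
import urllib.request

MAX_FILE_BYTES = 250 * 1024 * 1024  # 250 MB single-file guardrail from above

def within_guardrail(content_length: int, max_bytes: int = MAX_FILE_BYTES) -> bool:
    """Pass only when the server reported a size and it is at or under the cap."""
    return 0 <= content_length <= max_bytes

def head_preflight(url: str, max_bytes: int = MAX_FILE_BYTES) -> dict:
    """Issue a HEAD request and record size plus pass/fail before any download."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=30) as resp:
        size = int(resp.headers.get("Content-Length", "-1"))
    return {"url": url, "bytes": size, "ok": within_guardrail(size, max_bytes)}
```

A missing `Content-Length` header maps to `-1` and fails the check, so unknown-size files are rejected rather than downloaded blind.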
- Current window: `2026-03-10` hours `0` through `7` UTC, labeled `2025_2026`
- Baseline window: `2024-03-10` hours `0` through `7` UTC, labeled `2023_2024`
- This paired `8h` shape preserves matched-hour comparison while staying inside the local bandwidth guardrail.
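The matched-hour pairing behind the two windows can be sketched as follows (dates and labels come from the window definitions above; the helper itself is illustrative):

```python
# Window days as defined above; each window covers hours 0 through 7 UTC.
CURRENT_DAY, BASELINE_DAY = "2026-03-10", "2024-03-10"
HOURS = range(8)

def matched_hour_files():
    """Yield (current, baseline) hourly file names for the same UTC hour,
    so every current-window hour compares against its baseline counterpart."""
    for h in HOURS:
        yield (f"{CURRENT_DAY}-{h}.json.gz", f"{BASELINE_DAY}-{h}.json.gz")
```

Pairing by hour-of-day keeps diurnal activity patterns aligned, so volume differences reflect platform change rather than time-of-day mix.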
- `scripts/`: preflight, download, normalize, and DuckDB model execution
- `models/staging`: typed staging models
- `models/marts`: comparison and ops-signal marts
- `tests/`: unit and SQL checks
- `app/`: Streamlit review app
- `docs/`: project brief and incident memo framing
- `data/`: raw and curated local artifacts
- `outputs/`: manifests and snapshots
1. Preflight the current window:

   ```shell
   python scripts/head_preflight.py \
     --start-hour 2026-03-10-0 \
     --hours 8 \
     --window-label 2025_2026
   ```

2. Download the current window:

   ```shell
   python scripts/ingest_gharchive_hourly.py \
     --manifest-path outputs/manifests/head_preflight_2025_2026_2026-03-10_8h.csv
   ```

3. Normalize the current window:

   ```shell
   python scripts/normalize_events.py \
     --download-manifest-path outputs/manifests/download_manifest_2025_2026_2026-03-10_8h.csv
   ```

4. Repeat steps 1 through 3 for the baseline window:

   ```shell
   python scripts/head_preflight.py \
     --start-hour 2024-03-10-0 \
     --hours 8 \
     --window-label 2023_2024
   ```

5. Run the DuckDB model layer across the latest snapshots from both windows:

   ```shell
   python scripts/run_duckdb_models.py
   ```

6. Review the operational narrative:

   ```shell
   streamlit run app/streamlit_app.py
   ```
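The normalize step splits each hourly file into parsed rows and quarantined lines. A minimal sketch of that parse-quarantine idea (the real `scripts/normalize_events.py` is manifest-driven and writes CSV/Parquet snapshots; this helper and its field list are illustrative):

```python
import gzip
import json

def normalize_hour(path: str):
    """Parse one gzipped GH Archive hourly file of JSON lines.

    Returns (parsed, quarantined): good events become typed rows,
    unparseable lines are quarantined with their line number and error
    so parse health can be monitored instead of silently dropped.
    """
    parsed, quarantined = [], []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                event = json.loads(line)
                parsed.append({
                    "id": event.get("id"),
                    "type": event.get("type"),
                    "repo": (event.get("repo") or {}).get("name"),
                    "created_at": event.get("created_at"),
                })
            except json.JSONDecodeError as exc:
                quarantined.append({"line": lineno, "error": str(exc)})
    return parsed, quarantined
```

Keeping the quarantine list alongside the parsed rows is what lets `mart_parse_health_summary` compare quarantine rates across windows.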
- `mart_parse_health_summary`: did parse success or quarantine rates deteriorate?
- `mart_schema_drift_summary`: which fields changed presence or observed types across windows?
- `mart_event_type_shift`: which event types materially shifted in mix or volume?
- `mart_repo_concentration`: did a small repo set dominate the current window?
- `mart_ops_signals`: which shifts deserve monitoring or manager-facing follow-up?
A data platform, analytics, operations, or health-tech-adjacent team could use this pattern to turn noisy event files into reviewable tables, catch data quality issues early, and support cautious operational decisions without overstating what the signals mean.