feat(ingest): offline geo-resolution engine + gazetteers (PR1, no lens)#126

Merged: k2jac9 merged 1 commit into main from feat/geo-resolve-engine on Jun 21, 2026

@k2jac9 k2jac9 commented Jun 21, 2026

What

A reusable, 100%-offline geo-resolution engine that turns whatever geo a record carries — a coordinate, an intersection id/string, an address, or a ward/neighbourhood — into (lat, lng, quality) using only local committed gazetteers (no geocoding API, no CDN). Makes the "ingest any city's open data" claim real for the messy-geo datasets we currently skip (e.g. 311).

Scope: engine + committed gazetteers + tests. No consumer/lens wires it into the kernel, so it is behaviour-neutral by construction — the golden numbers (J $323,222 → $105,050) are byte-identical.

Engine — src/urbanos/risk/ingest/georesolve.py

  • Cascade: measured → intersection-id → intersection-string → address → area-centroid → unresolved. First in-bbox hit wins; every coord (input and gazetteer-stored) is validated via in_toronto_bbox, so an out-of-region coord falls through instead of becoming a (0,0) pin.
  • Confidence + fall-out: each tier emits confidence ∈ [0,1] (1.0 measured/exact, the difflib ratio for fuzzy, 0.30 for area); a candidate below max(min_confidence, tier_gate) falls out to the next tier.
  • Built from existing primitives (normalize_address, in_toronto_bbox, _STREET_TYPES/_DIRECTIONS, _find_col/_clean_address/_coord/_read_rows) + stdlib difflib. normalize_address is untouched; the new canonical_street() helper reuses its constants. No new dependency.
  • Offline-safe: resolve() never raises; load_gazetteers() returns an empty result (not an error) when files are absent; gazetteers are cached per process, and reset_gazetteer_cache() clears the cache.
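
The cascade and confidence gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in georesolve.py: the bounding box values, the toy gazetteer entries, the tier gate of 0.85, and every name here (`resolve_sketch`, `Candidate`, `in_bbox`, the `tier_*` functions) are assumptions made for the example. Only three of the tiers are shown.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

# Assumed bounding box for illustration; the real in_toronto_bbox may differ.
BBOX = (43.58, 43.86, -79.64, -79.11)  # lat_min, lat_max, lng_min, lng_max

# Hypothetical toy gazetteers standing in for the committed slices.
INTERSECTIONS = {"king st w / bay st": (43.648700, -79.380500)}
AREAS = {"ward 10": (43.645000, -79.390000)}


@dataclass(frozen=True)
class Candidate:
    lat: float
    lng: float
    quality: str
    confidence: float


def in_bbox(lat: float, lng: float) -> bool:
    lat_min, lat_max, lng_min, lng_max = BBOX
    return lat_min <= lat <= lat_max and lng_min <= lng <= lng_max


def tier_measured(rec: dict) -> Optional[Candidate]:
    try:
        lat, lng = float(rec["lat"]), float(rec["lng"])
    except (KeyError, TypeError, ValueError):
        return None
    return Candidate(lat, lng, "measured", 1.0)


def tier_intersection_string(rec: dict) -> Optional[Candidate]:
    text = str(rec.get("intersection") or "").strip().lower()
    if not text:
        return None
    best_key, best_ratio = None, 0.0
    for key in sorted(INTERSECTIONS):  # sorted so ties resolve deterministically
        ratio = SequenceMatcher(None, text, key).ratio()
        if ratio > best_ratio:
            best_key, best_ratio = key, ratio
    if best_key is None:
        return None
    lat, lng = INTERSECTIONS[best_key]
    # The fuzzy tier reports the difflib ratio as its confidence.
    return Candidate(lat, lng, "intersection-string", best_ratio)


def tier_area(rec: dict) -> Optional[Candidate]:
    hit = AREAS.get(str(rec.get("ward") or "").strip().lower())
    return Candidate(hit[0], hit[1], "area-centroid", 0.30) if hit else None


# (tier function, tier gate) in cascade order; the gate values are assumptions.
TIERS = [(tier_measured, 1.0), (tier_intersection_string, 0.85), (tier_area, 0.0)]


def resolve_sketch(rec: dict, min_confidence: float = 0.0) -> Optional[Candidate]:
    """Return the first in-bbox candidate that clears its gate, else None."""
    for tier, gate in TIERS:
        try:
            cand = tier(rec)
        except Exception:
            cand = None  # never raise: a failing tier just falls through
        if cand is None:
            continue
        if not in_bbox(cand.lat, cand.lng):
            continue  # e.g. a (0, 0) input falls through to later tiers
        if cand.confidence < max(min_confidence, gate):
            continue  # below the gate: fall out to the next tier
        return cand
    return None  # unresolved
```

For instance, `resolve_sketch({"lat": 0, "lng": 0, "ward": "Ward 10"})` rejects the out-of-region coordinate and lands on the area-centroid tier, while the same record with `min_confidence=0.5` stays unresolved because the area tier only reports 0.30.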

Gazetteers — scripts/fetch_gazetteers.py → demo_data/

Downtown Centreline intersections (~4000, 6-dp), city ward + neighbourhood centroids (pure-Python vertex mean). Offline-safe; raw GeoJSON never committed.
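
As an illustration of what a pure-Python vertex mean could look like for a GeoJSON polygon, here is a small sketch. The function name `vertex_mean_centroid` is hypothetical, and the choices to use only exterior rings, to drop the repeated closing vertex, and to round to 6 decimal places are assumptions; the fetch script may handle these differently.

```python
def vertex_mean_centroid(geometry: dict) -> tuple[float, float]:
    """Return (lat, lng) as the unweighted mean of exterior-ring vertices.

    This is not an area-weighted centroid; it is the cheap approximation
    that a plain vertex mean gives.
    """
    if geometry["type"] == "Polygon":
        polygons = [geometry["coordinates"]]
    elif geometry["type"] == "MultiPolygon":
        polygons = geometry["coordinates"]
    else:
        raise ValueError(f"unsupported geometry type: {geometry['type']}")

    lats: list[float] = []
    lngs: list[float] = []
    for polygon in polygons:
        ring = polygon[0]  # exterior ring only (assumption)
        # GeoJSON rings repeat the first vertex at the end; drop the repeat
        # so that vertex is not counted twice.
        if len(ring) > 1 and ring[0] == ring[-1]:
            ring = ring[:-1]
        for lng, lat, *_ in ring:  # GeoJSON positions are [lng, lat, ...]
            lats.append(lat)
            lngs.append(lng)

    if not lats:
        raise ValueError("geometry has no vertices")
    return round(sum(lats) / len(lats), 6), round(sum(lngs) / len(lngs), 6)
```

A vertex mean is biased toward densely digitised stretches of a boundary, which is usually acceptable for a low-confidence area fallback.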

Validation — scripts/validate_gazetteers.py

Holdout harness re-resolves each intersection from its description alone, measures metre error vs truth (5 m tolerance, matching the tests), writes a fallout report capped at 60 rows (over-tolerance collisions first). Real run: 66.2% within 5 m, p50/p95 = 0 m. Reporting only — never fails the build.
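
The metre-error measurement and summary statistics could be computed along these lines. This is a sketch under stated assumptions: `haversine_m`, `percentile`, and `summarize` are hypothetical names, the nearest-rank percentile definition is one choice among several, and whether the real harness takes percentiles over resolved rows only while counting the within-tolerance share against all rows is not stated in this description.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in metres


def haversine_m(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in metres between two (lat, lng) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def percentile(sorted_vals: list[float], q: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    if not sorted_vals:
        raise ValueError("no values")
    rank = max(1, math.ceil(q / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def summarize(errors_m: list[float], total: int, tolerance_m: float = 5.0) -> dict:
    """Summarize holdout errors.

    errors_m holds one error per resolved row; total is the number of rows
    attempted, so unresolved rows lower within_pct (an assumption).
    """
    ordered = sorted(errors_m)
    within = sum(1 for e in ordered if e <= tolerance_m)
    return {
        "within_pct": round(100 * within / total, 1) if total else 0.0,
        "p50_m": percentile(ordered, 50) if ordered else None,
        "p95_m": percentile(ordered, 95) if ordered else None,
    }
```

Because the function only reports numbers and never raises on a poor score, it matches the reporting-only role described above.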

Tests / verification

  • tests/test_georesolve.py: 21 tests (cascade priority, accuracy, fuzzy bound, confidence/fall-out, bbox rejection, offline degrade, determinism, committed-slice guard, cache parity).
  • Full suite: 684 passed, 1 skipped locally; golden numbers byte-identical.
  • make demo-data regenerates the slices + report.

A reusable, 100%-offline resolver that turns whatever geo a record has
(coord / intersection id or string / address / ward·neighbourhood) into
(lat, lng, quality) using only local committed gazetteers — no geocoding
API, no CDN. Makes the "ingest any city's open data" claim real for the
messy-geo datasets we currently skip (e.g. 311).

Engine (src/urbanos/risk/ingest/georesolve.py):
- Cascade: measured -> intersection-id -> intersection-string ->
  address -> area-centroid -> unresolved. First in-bbox hit wins; every
  coord (input and gazetteer-stored) is validated via in_toronto_bbox, so
  an out-of-region coord falls through instead of becoming a (0,0) pin.
- Confidence + fall-out: each tier emits confidence in [0,1] (1.0 measured/
  exact, the difflib ratio for fuzzy, 0.30 for area); a candidate below
  max(min_confidence, tier_gate) falls out to the next tier.
- Built from existing primitives (normalize_address, in_toronto_bbox,
  _STREET_TYPES/_DIRECTIONS, _find_col/_clean_address/_coord/_read_rows)
  plus stdlib difflib. normalize_address is left untouched; the new
  canonical_street() helper reuses its constants. No new third-party dep.
- Offline-safe: resolve() never raises; load_gazetteers() is empty (not an
  error) when files are absent; process cache + reset_gazetteer_cache().

Gazetteers (scripts/fetch_gazetteers.py -> demo_data/): downtown Centreline
intersections (~4000, 6-dp), city ward + neighbourhood centroids (pure-Python
vertex mean). Offline-safe; raw GeoJSON never committed.

Validation (scripts/validate_gazetteers.py): holdout harness re-resolves each
intersection from its description alone, measures metre error vs truth (5 m
tolerance, matching the tests), and writes a fallout report capped at 60 rows
(over-tolerance collisions first). Real run: 66.2% within 5 m, p50/p95 = 0 m.
Reporting only — never fails the build.

No consumer/lens wires the resolver into the kernel, so it is behaviour-
neutral by construction: the golden numbers (J $323,222 -> $105,050) are
byte-identical. Adds tests/test_georesolve.py (21 tests). make demo-data
regenerates the slices + report.
@k2jac9 k2jac9 merged commit 8e86adf into main Jun 21, 2026
4 checks passed
@k2jac9 k2jac9 deleted the feat/geo-resolve-engine branch June 21, 2026 01:18
k2jac9 added a commit that referenced this pull request Jun 21, 2026