feat(ingest): offline geo-resolution engine + gazetteers (PR1, no lens) #126
Merged
A reusable, 100%-offline resolver that turns whatever geo a record has (coord / intersection id or string / address / ward·neighbourhood) into (lat, lng, quality) using only local committed gazetteers: no geocoding API, no CDN. Makes the "ingest any city's open data" claim real for the messy-geo datasets we currently skip (e.g. 311).

Engine (src/urbanos/risk/ingest/georesolve.py):
- Cascade: measured -> intersection-id -> intersection-string -> address -> area-centroid -> unresolved. First in-bbox hit wins; every coord (input and gazetteer-stored) is validated via in_toronto_bbox, so an out-of-region coord falls through instead of becoming a (0,0) pin.
- Confidence + fall-out: each tier emits confidence in [0,1] (1.0 measured/exact, the difflib ratio for fuzzy, 0.30 for area); a candidate below max(min_confidence, tier_gate) falls out to the next tier.
- Built from existing primitives (normalize_address, in_toronto_bbox, _STREET_TYPES/_DIRECTIONS, _find_col/_clean_address/_coord/_read_rows) plus stdlib difflib. normalize_address is left untouched; the new canonical_street() helper reuses its constants. No new third-party dep.
- Offline-safe: resolve() never raises; load_gazetteers() is empty (not an error) when files are absent; process cache + reset_gazetteer_cache().

Gazetteers (scripts/fetch_gazetteers.py -> demo_data/): downtown Centreline intersections (~4000, 6-dp), city ward + neighbourhood centroids (pure-Python vertex mean). Offline-safe; raw GeoJSON never committed.

Validation (scripts/validate_gazetteers.py): holdout harness re-resolves each intersection from its description alone, measures metre error vs truth (5 m tolerance, matching the tests), and writes a fallout report capped at 60 rows (over-tolerance collisions first). Real run: 66.2% within 5 m, p50/p95 = 0 m. Reporting only; never fails the build.

No consumer/lens wires the resolver into the kernel, so it is behaviour-neutral by construction: the golden numbers (J $323,222 -> $105,050) are byte-identical. Adds tests/test_georesolve.py (21 tests). make demo-data regenerates the slices + report.
What
A reusable, 100%-offline geo-resolution engine that turns whatever geo a record carries (a coordinate, an intersection id/string, an address, or a ward/neighbourhood) into `(lat, lng, quality)` using only local committed gazetteers (no geocoding API, no CDN). This makes the "ingest any city's open data" claim real for the messy-geo datasets we currently skip (e.g. 311).

Scope: engine + committed gazetteers + tests. No consumer/lens wires it into the kernel, so it is behaviour-neutral by construction: the golden numbers (J $323,222 → $105,050) are byte-identical.
Engine: `src/urbanos/risk/ingest/georesolve.py`
- Cascade: measured → intersection-id → intersection-string → address → area-centroid → unresolved. First in-bbox hit wins; every coord (input and gazetteer-stored) is validated via `in_toronto_bbox`, so an out-of-region coord falls through instead of becoming a `(0,0)` pin.
- Confidence + fall-out: each tier emits `confidence ∈ [0,1]` (1.0 measured/exact, the difflib ratio for fuzzy, 0.30 for area); a candidate below `max(min_confidence, tier_gate)` falls out to the next tier.
- Built from existing primitives (`normalize_address`, `in_toronto_bbox`, `_STREET_TYPES`/`_DIRECTIONS`, `_find_col`/`_clean_address`/`_coord`/`_read_rows`) + stdlib `difflib`. `normalize_address` is untouched; the new `canonical_street()` helper reuses its constants. No new dependency.
- Offline-safe: `resolve()` never raises; `load_gazetteers()` is empty (not an error) when files are absent; process cache + `reset_gazetteer_cache()`.

Gazetteers: `scripts/fetch_gazetteers.py` → `demo_data/`
Downtown Centreline intersections (~4000, 6-dp), city ward + neighbourhood centroids (pure-Python vertex mean). Offline-safe; raw GeoJSON never committed.
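For reviewers who want the Engine's cascade and confidence gate in concrete form, here is a minimal sketch. It is not the code in `georesolve.py`: the names `resolve_sketch`, `Candidate`, `Resolved`, the tier functions, the gazetteer dict layout, the bounding box, and the 0.85 fuzzy gate are all assumptions made for illustration. Only three of the tiers are shown, and plain `difflib.SequenceMatcher` stands in for whatever canonicalisation the real fuzzy tier applies.

```python
from difflib import SequenceMatcher
from typing import NamedTuple, Optional

# Assumed, deliberately loose bounding box (lat_min, lat_max, lng_min, lng_max).
# The real in_toronto_bbox may use different bounds.
BBOX = (43.4, 44.0, -79.8, -78.9)


class Candidate(NamedTuple):
    lat: float
    lng: float
    confidence: float


class Resolved(NamedTuple):
    lat: Optional[float]
    lng: Optional[float]
    quality: str


def in_bbox(lat: float, lng: float) -> bool:
    lat_min, lat_max, lng_min, lng_max = BBOX
    return lat_min <= lat <= lat_max and lng_min <= lng <= lng_max


def measured_tier(record: dict, gaz: dict) -> Optional[Candidate]:
    """A coordinate carried by the record itself: confidence 1.0."""
    try:
        return Candidate(float(record["lat"]), float(record["lng"]), 1.0)
    except (KeyError, TypeError, ValueError):
        return None


def intersection_string_tier(record: dict, gaz: dict) -> Optional[Candidate]:
    """Fuzzy match on the intersection name; confidence is the difflib ratio."""
    query = str(record.get("intersection") or "").strip().lower()
    if not query:
        return None
    best = None
    for name, (lat, lng) in gaz.get("intersections", {}).items():
        ratio = SequenceMatcher(None, query, name.lower()).ratio()
        if best is None or ratio > best.confidence:
            best = Candidate(lat, lng, ratio)
    return best


def area_tier(record: dict, gaz: dict) -> Optional[Candidate]:
    """Ward/neighbourhood centroid: fixed low confidence of 0.30."""
    hit = gaz.get("areas", {}).get(record.get("ward"))
    return Candidate(hit[0], hit[1], 0.30) if hit else None


# (quality label, tier gate, tier function). Gate values are assumptions.
TIERS = [
    ("measured", 1.0, measured_tier),
    ("intersection-string", 0.85, intersection_string_tier),
    ("area-centroid", 0.0, area_tier),
]


def resolve_sketch(record: dict, gaz: dict, min_confidence: float = 0.0) -> Resolved:
    """Walk the tiers in order; the first in-bbox candidate above its gate wins."""
    for quality, tier_gate, tier in TIERS:
        try:
            cand = tier(record, gaz)
        except Exception:
            cand = None  # a failing tier falls through rather than raising
        if cand is None or not in_bbox(cand.lat, cand.lng):
            continue
        if cand.confidence < max(min_confidence, tier_gate):
            continue
        return Resolved(cand.lat, cand.lng, quality)
    return Resolved(None, None, "unresolved")
```

In this sketch a record carrying `lat=0, lng=0` plus a matching intersection string resolves at the `intersection-string` tier, because the `(0,0)` coordinate fails the bbox check and falls through. Raising `min_confidence` above 0.30 turns an area-only record into `unresolved`.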
Validation: `scripts/validate_gazetteers.py`
Holdout harness re-resolves each intersection from its description alone, measures metre error vs truth (5 m tolerance, matching the tests), and writes a fallout report capped at 60 rows (over-tolerance collisions first). Real run: 66.2% within 5 m, p50/p95 = 0 m. Reporting only: it never fails the build.
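As a rough illustration of the holdout measurement, the sketch below computes metre error with the haversine formula and summarises it as a within-tolerance share plus p50/p95. It is not taken from `validate_gazetteers.py`. The function names, the nearest-rank percentile method, and the choice to count unresolved rows against the within-tolerance share while excluding them from the percentiles are all assumptions; the PR does not say how the real harness handles these.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two WGS84 points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_rank(sorted_vals, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def summarize(pairs, tolerance_m=5.0):
    """pairs: list of (truth, resolved), each a (lat, lng) tuple; resolved may be None."""
    errors = sorted(haversine_m(*truth, *got) for truth, got in pairs if got is not None)
    within = sum(1 for e in errors if e <= tolerance_m)
    return {
        "total": len(pairs),
        "resolved": len(errors),
        # Assumption: unresolved rows count as misses in the share ...
        "within_tolerance_pct": 100.0 * within / len(pairs) if pairs else 0.0,
        # ... but are left out of the percentiles, which cover resolved rows only.
        "p50_m": nearest_rank(errors, 50) if errors else None,
        "p95_m": nearest_rank(errors, 95) if errors else None,
    }
```

Because the summary only reports numbers and never raises on a poor score, a harness built this way matches the "reporting only" behaviour described above.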
Tests / verification
- `tests/test_georesolve.py`: 21 tests (cascade priority, accuracy, fuzzy bound, confidence/fall-out, bbox rejection, offline degrade, determinism, committed-slice guard, cache parity).
- `make demo-data` regenerates the slices + report.