feat: auto-detect distance metric from CRS via SpatialCoordinateValue…#86
feat: auto-detect distance metric from CRS via SpatialCoordinateValue…#86FAbdullah17 wants to merge 22 commits intomllam:mainfrom
Conversation
…sSelector (mllam#75)Replaces all scipy.spatial.KDTree usage with a new metric-aware spatialindex class that selects euclidean or haversine automatically based onthe supplied coordinate reference system.New file: src/weather_model_graphs/spatial.py- SpatialCoordinateValuesSelector wraps sklearn.neighbors.BallTree- Supports euclidean (projected CRS) and haversine (geographic CRS)- Haversine: expects lon/lat in degrees, returns distances in metres (arc-length * 6,371,000 m); BallTree internally uses [lat_rad, lon_rad]- Factory method `for_crs(crs, coords)` reads `crs.is_geographic` to pick the metric automatically; falls back to euclidean when CRS is None- ImportError with install hint when scikit-learn is absent and haversine is requestedsrc/weather_model_graphs/create/base.py- Remove scipy.spatial import; import SpatialCoordinateValuesSelector- create_all_graph_components(): build spatial_index via for_crs(); emit UserWarning when a rectilinear mesh kind is combined with a geographic CRS (flat/flat_multiscale/hierarchical + is_geographic)- connect_nodes_across_graphs(): accept spatial_index kwarg; replace kdt.query/query_ball_point with spatial_index.k_nearest_to/with_radiussrc/weather_model_graphs/create/mesh/kinds/hierarchical.py- Remove scipy import; import SpatialCoordinateValuesSelector- create_hierarchical_multiscale_mesh_graph(): add distance_metric param; replace inter-level KDTree with SpatialCoordinateValuesSelectorsrc/weather_model_graphs/__init__.py- Export SpatialCoordinateValuesSelector at package top-levelpyproject.toml- Add [global] optional extras group: scikit-learn>=1.3.0- Install with: pip install weather-model-graphs[global]tests/test_spatial_index.py (new, 27 tests)- TestInit, TestEuclideanKNearest, TestEuclideanWithRadius- TestHaversineKNearest, TestHaversineWithRadius- TestForCrs: pyproj.CRS(EPSG:4326) and ccrs.Geodetic() → haversine; ccrs.PlateCarree() and None → euclidean- TestRectilinearGeographicWarning- TestIntegrationGraphCreation: g2m/m2g edge len values in [1 km, 2000 km] for a haversine graph built over a real lat/lon domaindocs/distance_metric_auto_detection.ipynb (new)- Walkthrough: euclidean demo → haversine demo → for_crs auto-detection table → full graph integration example → rectilinear+geographic warningBackward-compatible: all paths default to euclidean when graph_crs=None.
|
Hi @leifdenby, I've opened this PR as discussed in #75. It introduces the Could you please take a look when you have time? Thanks! |
leifdenby
left a comment
There was a problem hiding this comment.
This is look good! Nice work 🌟
I've added a few suggestions for changes and some questions too.
|
Hi @leifdenby, thank you for the detailed review and the kind words! Please find my responses to each point below. 1. 2 & 3. 4. 5 & 6. 7. 8. 9. We will begin implementing all of the above once you confirm the preferred approach for point 9. Thank you again! |
Could you add your responses directly to my comments instead of in a single comment please? :) thankyou! |
…n- hierarchical: use metric-aware distance from SpatialCoordinateValuesSelector for inter-level edge len (remove Euclidean overwrite)- spatial: return native haversine units (radians) for k-nearest and radius queries; remove Earth-radius conversion helpers- base: rename spatial_index to spatial_coord_selector for clarity- base: document Euclidean default when graph_crs is not provided- base: align connect_nodes_across_graphs API with hierarchical path by passing distance_metric string instead of selector object- base: restore _find_neighbour_node_idxs_in_source_mesh helper naming for nearest_neighbours / within_radius paths- pyproject: promote scikit-learn to core dependency, remove [global] extra- tests(spatial): update haversine assertions/docs to radians and adjust integration expectations- tests(windows): fix tempfile PermissionError by using fig.savefig(fileobj) instead of fig.savefig(path) in plotting tests
|
Hi @leifdenby , Review feedback addressedThanks again for the detailed review — I’ve now applied all requested follow-ups. ✅ 1)
|
leifdenby
left a comment
There was a problem hiding this comment.
This is getting really close!
I've added some general comments and suggestion for change. Can you address each comment in separate commits this time and then link to the commit in a reply to my comment please? It was quite hard to track where my comments were addressed so I had to re-review everything :)
Also, general point: Reading this I was thinking that it might be nicer to give distances in degrees rather than radians, what do you think? Radians are pretty hard to reason about, but everyone know that a circle is 360 degrees. Also we're assuming the lat/lon are given in degrees, so it matches the units of the provided geographical coordinates.
| G_down.add_nodes_from(G_to.nodes(data=True)) | ||
|
|
||
| # build kd tree for mesh point pos | ||
| # build spatial index for source (coarser) mesh node positions |
There was a problem hiding this comment.
This shouldn't be called "spatial index" :) wherever you create a SpatialCoordinateValuesSelector call the variable spatial_coord_selector instead
There was a problem hiding this comment.
Addressed in 7ca16dd: renamed the SpatialCoordinateValuesSelector variable to spatial_coord_selector in hierarchical.py (including corresponding usage/comment wording).
| # sphere, so the mesh node density will vary strongly with latitude. | ||
| _is_geographic = getattr(graph_crs, "is_geographic", False) | ||
| if _is_geographic and m2m_connectivity in ("flat", "flat_multiscale", "hierarchical"): | ||
| warnings.warn( |
There was a problem hiding this comment.
we use loguru for logging throughout, please replace with logger.warn using loguru as elsewhere in the codebase
There was a problem hiding this comment.
| "rectilinear (equally-spaced lon/lat) grid, but the graph CRS is " | ||
| "geographic. Equally-spaced longitude/latitude values are NOT equally " | ||
| "spaced on a sphere — mesh node density will vary with latitude. " | ||
| "Consider projecting to a suitable projected CRS (e.g. via graph_crs) " |
There was a problem hiding this comment.
I would remove these last two lines. Saying that should provide a "projected CRS" is basically just saying "don't provide a graph_crs that is geographic", but the other lines in your warning are adequate in explaining why that might be a bad idea (given that the only available mesh right now is rectilinear). Also they can't use icosahdral mesh
There was a problem hiding this comment.
Addressed in f5bb64a: trimmed the rectilinear/geographic warning text by removing the final two suggestion lines, while keeping the core explanation.
| Maximum number of neighbours to search for in `G_target` for each node in `G_source` | ||
| distance_metric : str, optional | ||
| Distance metric used for neighbour search. Supported values are | ||
| ``"euclidean"`` and ``"haversine"``. Defaults to ``"euclidean"``. |
There was a problem hiding this comment.
instead of defaulting to euclidean I think it would be safer to require that this is explicitly provided (i.e. no default)
There was a problem hiding this comment.
Addressed in e050204: connect_nodes_across_graphs now requires distance_metric explicitly (no default), and direct caller usage was updated accordingly.
tests/test_spatial_index.py
Outdated
| assert dists[1] == pytest.approx(expected_rad, rel=1e-4) | ||
|
|
||
| def test_distances_are_native_haversine_radians(self): | ||
| """For geographic coords, haversine returns unit-sphere radians.""" |
There was a problem hiding this comment.
"unit-sphere radians" -> "radians", it will be radians whatever the radius of the sphere is
There was a problem hiding this comment.
Addressed in 8da3663 : updated the wording from “unit-sphere radians” to “radians” in tests/test_spatial_index.py.
|
Thanks @leifdenby, this is really helpful feedback. Absolutely — this round I’ll address each review comment in a separate commit, and I’ll reply on each thread with the exact commit link/hash so the mapping is clear and easy to review. On the units point: degrees are definitely easier to reason about for users, but for the core haversine API I think radians are the more robust technical choice. So my plan is:
|
|
Hey @leifdenby and @FAbdullah17 ! Great work with this PR. Was taking a look at this implementation and had a few thoughts : Currently, Am I following this execution path correctly, or is there a unit conversion for If we want the interface to cleanly accept degrees as you suggested, |
@FAbdullah17 and @Raj-Taware: I see your point. How about we make so that everything external to |
leifdenby
left a comment
There was a problem hiding this comment.
Thanks for all your hard work here! Only very minor suggestions to changes to do and then this can go in 🚀
| mesh_node_distance: float, | ||
| level_refinement_factor: float, | ||
| max_num_levels: int, | ||
| distance_metric: str = "euclidean", |
There was a problem hiding this comment.
should we maybe not have a default here? As in force the user to provide xy and a distance metric? You have already helpfully in the docstring described what the options and when people should use either. Or maybe it would be better to have a bool argument called xy_is_geographic and then we construct a distance measurer based on this?
There was a problem hiding this comment.
Addressed in 96f2a99 : removed the default distance_metric in the hierarchical mesh builder so callers must pass it explicitly; kept explicit metric selection in the existing CRS-driven flow rather than introducing a parallel xy_is_geographic switch.
src/weather_model_graphs/spatial.py
Outdated
| distances = raw_dists.flatten() | ||
| return indices, distances | ||
|
|
||
| def with_radius( |
There was a problem hiding this comment.
I would rename this as within_radius
There was a problem hiding this comment.
Addressed in 854e062 : renamed with_radius to within_radius in SpatialCoordinateValuesSelector and updated call sites/tests accordingly.
src/weather_model_graphs/spatial.py
Outdated
| Query point (same coordinate convention as for :meth:`k_nearest_to`). | ||
| radius : float | ||
| Search radius. For ``"euclidean"`` this is in the same units as | ||
| *coords*; for ``"haversine"`` this is in **radians**. |
There was a problem hiding this comment.
as I mentioned in my comment I would make this argument be in degrees
src/weather_model_graphs/spatial.py
Outdated
| tree_point, r=radius, return_distance=True | ||
| ) | ||
| indices = raw_idxs[0] | ||
| distances = raw_dists[0] |
There was a problem hiding this comment.
should convert to degrees here if using the haversine distance since the kdballtree query will return distance in radians
There was a problem hiding this comment.
Addressed in bce4d8c : converted haversine within_radius returned distances from radians to degrees (BallTree still computes internally in radians) and added test coverage for degree-valued outputs.
src/weather_model_graphs/spatial.py
Outdated
| tree_point = self._prepare_query_point(point) | ||
| raw_dists, raw_idxs = self._tree.query(tree_point, k=k) | ||
| indices = raw_idxs.flatten() | ||
| distances = raw_dists.flatten() |
There was a problem hiding this comment.
convert to degrees when using haversine distance
There was a problem hiding this comment.
Addressed in 8dbfee1 : updated within_radius so the haversine radius argument is now interpreted in degrees at the API boundary (with internal conversion to radians for BallTree), and updated the related tests accordingly.
There was a problem hiding this comment.
the raw_dists need to be converted back to degrees here so that everything external to SpatialCoordinateValuesSelector is in degrees as you've done for within_radius
There was a problem hiding this comment.
Addressed in 8c1a610 : converted haversine k_nearest_to returned distances from radians to degrees so it matches the external degree-based API used by within_radius
@leifdenby I agree, that's good approach. I'll update
This keeps geographic workflows intuitive while preserving the correct radian‑based computations under the hood. |
leifdenby
left a comment
There was a problem hiding this comment.
One minor thing to fix :) Also, remember to add a changelog entry
| The number of levels in the hierarchical mesh graph. | ||
| distance_metric : {'euclidean', 'haversine'} | ||
| Distance metric used when computing inter-level nearest-neighbour edges | ||
| and storing edge ``"len"`` attributes. Pass ``'haversine'`` when *xy* |
There was a problem hiding this comment.
sorry for only picking up on this now, but I think we should write coords rather than xy since xy may give suggestion that we are expecting to get Cartesian grid coordinates, but these routines work on both Cartesian and geographical coordinates.
There was a problem hiding this comment.
actually, let's keep this for now, we can fix this when we merge in #81
There was a problem hiding this comment.
Agreed – let's keep it as is for now and address naming consistently in #81.
src/weather_model_graphs/spatial.py
Outdated
| tree_point = self._prepare_query_point(point) | ||
| raw_dists, raw_idxs = self._tree.query(tree_point, k=k) | ||
| indices = raw_idxs.flatten() | ||
| distances = raw_dists.flatten() |
There was a problem hiding this comment.
the raw_dists need to be converted back to degrees here so that everything external to SpatialCoordinateValuesSelector is in degrees as you've done for within_radius
tests/test_spatial_index.py
Outdated
| sel = SpatialCoordinateValuesSelector("haversine", simple_geo_coords) | ||
| idxs, dists = sel.k_nearest_to([0.0, 0.0], k=1) | ||
| assert idxs[0] == 0 | ||
| assert dists[0] == pytest.approx(0.0, abs=1e-3) |
There was a problem hiding this comment.
can you change this test so we expect the value to be non-zero? That way you can check that the value is the expected value in degrees. I.e. query with [-10.0, 0.0] and check the distance in is 10.0 (since the distances should be returned degrees)
There was a problem hiding this comment.
Addressed in 0d75615 : updated the haversine k_nearest_to test to use a non-zero query ([-10.0, 0.0]) and assert a 10.0 degree distance, so the test directly verifies degree-based distance outputs.
tests/test_spatial_index.py
Outdated
| sel = SpatialCoordinateValuesSelector("haversine", simple_geo_coords) | ||
| idxs, dists = sel.k_nearest_to([0.0, 0.0], k=2) | ||
| # nearest is self (0 rad), second is [10, 0] = deg2rad(10) | ||
| expected_rad = np.deg2rad(10.0) |
There was a problem hiding this comment.
compare with k_nearest_to() you can see that there isn't need to convert from radians to degrees there. These methods should be symmetric, both should return distances in degrees
There was a problem hiding this comment.
Addressed in cbd09e9: aligned remaining haversine k_nearest_to/wrap-around/integration expectations and wording to degree-based outputs for symmetry with within_radius.
Adressed in 97f0e94: add changelog entry for CRS-aware degree-based metric handling |
leifdenby
left a comment
There was a problem hiding this comment.
You are a 🌟 ! Just one last very small suggestion to change the test, using any is a bit too lenient here, let's check the actual values
tests/test_spatial_index.py
Outdated
| idxs, dists = sel.within_radius([0.0, 0.0], radius=radius_deg) | ||
| assert 0 in idxs # self | ||
| assert 1 in idxs # 10° lon away | ||
| assert any(d == pytest.approx(10.0, rel=1e-4) for d in dists) |
There was a problem hiding this comment.
rather than any here can't we just assert with numpy.test.almost_equals how far away we expect the two points to lie i.e. [0.0, 10.0] presumably within-radius will return the points by distance, no? Otherwise we could maybe sort the indexes and distances by distance here in the test?
There was a problem hiding this comment.
Addressed in ee70506: replaced the any(...) check with a deterministic assertion by sorting returned neighbours by distance in the test, then asserting expected indices [0, 1] and distances [0.0, 10.0] using NumPy test helpers
|
@FAbdullah17 could you have a look at fixing the failing tests please and I will do another review? |
@leifdenby I'v fixed the failing tests and hope so all checks will pass now, adressed in 44b2cd7 |
|
there are still two failing tests @FAbdullah17 :) |
Adressed in 949a71d: Re-checked CI failures – root cause was missing |
Describe your changes
Replaces all
scipy.spatial.KDTreeusage with a new metric-aware spatial index class that automatically selects the correct distance metric (euclidean or haversine) based on the supplied coordinate reference system (CRS). This is the core implementation for #75.New file:
src/weather_model_graphs/spatial.pySpatialCoordinateValuesSelectorwrapssklearn.neighbors.BallTreeto provide fast k‑nearest‑neighbour and radius queries."euclidean"(for projected CRS) and"haversine"(for geographic CRS).[latitude_rad, longitude_rad]and the"haversine"metric.for_crs(crs, coords)readscrs.is_geographicto select the metric automatically; falls back to"euclidean"whencrsisNone."haversine"is requested butscikit-learnis not installed, anImportErrorwith a clear installation hint is raised.Modified:
src/weather_model_graphs/create/base.pyimport scipy.spatial; added import ofSpatialCoordinateValuesSelector.create_all_graph_components():spatial_indexviaSpatialCoordinateValuesSelector.for_crs().UserWarningwhen a rectilinear mesh kind (flat,flat_multiscale,hierarchical) is combined with a geographic CRS (lat/lon). This alerts users that equally‑spaced lon/lat mesh nodes are not equally spaced on a sphere.connect_nodes_across_graphs():spatial_index: SpatialCoordinateValuesSelector | None.kdt.query/kdt.query_ball_pointwithspatial_index.k_nearest_to()/spatial_index.with_radius()."len"attributes are now taken directly from the tree’s returned distances, guaranteeing correctness for both metrics.Modified:
src/weather_model_graphs/create/mesh/kinds/hierarchical.pyimport scipy; added import ofSpatialCoordinateValuesSelector.create_hierarchical_multiscale_mesh_graph():distance_metric: strparameter (default"euclidean").SpatialCoordinateValuesSelectorbuilt from the coarser‑level nodes; the metric is passed down fromcreate_all_graph_components.Modified:
src/weather_model_graphs/__init__.pySpatialCoordinateValuesSelectorat package top‑level for easy access.Modified:
pyproject.toml[global]that installsscikit-learn>=1.3.0.Users can now install with
pip install weather-model-graphs[global]when they need haversine support for global/geographic graphs. The core package remains lightweight for projected‑only use.New tests:
tests/test_spatial_index.py(27 tests)TestInit– valid/invalid metric, coordinate dtype.TestEuclideanKNearest/TestEuclideanWithRadius– basic Euclidean queries, known distances, sorting, radius behaviour.TestHaversineKNearest/TestHaversineWithRadius– haversine distances in metres, known values (e.g. 10° along equator ≈ 1 111 945 m), radius inclusive/exclusive.TestForCrs– verifies thatfor_crsreturns haversine for geographic CRS (pyproj.CRS("EPSG:4326"),ccrs.Geodetic()) and euclidean for projected CRS (ccrs.LambertConformal(), etc.).TestRectilinearGeographicWarning– ensures the warning is raised when a rectilinear mesh is built with a geographic CRS, and absent for projected CRS.TestIntegrationGraphCreation– smoke‑test building a full graph with a geographic CRS; checks that g2m/m2g edge lengths are in metres and lie in a physically reasonable range (1 km – 2000 km for a ~10° domain).New documentation:
docs/distance_metric_auto_detection.ipynbfor_crsfactory method applied to several CRS types.create_all_graph_componentswith a geographic CRS.Backward compatibility
"euclidean"when no CRS is supplied (graph_crs=None), so existing scripts continue to work unchanged.spatial_indexargument inconnect_nodes_across_graphs()is optional; callers that do not provide it fall back to a Euclidean selector, preserving the original behaviour.Motivation:
Graphs for global or sparse irregular data need the haversine metric to correctly measure distances on the sphere. Previously the code used only Euclidean distances, which are incorrect for lon/lat coordinates. This change introduces automatic metric selection based on the CRS, making it easy to build correct graphs for any coordinate system.
New dependencies:
scikit-learnis now an optional dependency (only needed for haversine). Installed viapip install weather-model-graphs[global].Issue Link
Closes #75
Type of change
Checklist before requesting a review
pullwith--rebaseoption if possible).Checklist for reviewers
Each PR comes with its own improvements and flaws. The reviewer should check the following:
Author checklist after completed review
reflecting type of change (add section where missing):
Checklist for assignee