Skip to content

New defaults for concat, merge, combine_* #10062

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 49 commits into
base: main
Choose a base branch
from

Conversation

jsignell
Copy link
Contributor

@jsignell jsignell commented Feb 19, 2025

Replaces #10051

  • Towards Stricter defaults for concat, combine, open_mfdataset, merge #8778
    • after a sufficiently long deprecation cycle there should be another PR that:
      • removes the option (but doesn't throw if people are still setting it)
      • removes all the FutureWarnings
    • after even more time one last PR that:
      • encodes the new defaults right into the function signature
      • removes the extra warnings that throw on failure
  • Closes Merging coordinates computes array values #9481
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

This PR attempts to throw warnings if and only if the output would change with the new kwarg defaults. To exercise the new default I am toggling the option back and forth and running the existing test suite.

With new kwarg values

  • Run all the tests with use_new_combine_kwarg_defaults=True -> 78 failed
  • Change to use_new_combine_kwarg_defaults=False and run the last failed -> 76 failed, 2 passed

Those 2 are missed alarms. In these two cases the behavior will be different when xarray switched to using new default kwarg values and there is no warning. But I am not totally sure whether we need to handle them because they are tests for conditions that used to raise an error and with the new defaults they do not.

FAILED xarray/tests/test_concat.py::TestConcatDataset::test_concat_do_not_promote - Failed: DID NOT RAISE <class 'ValueError'>
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_error - Failed: DID NOT RAISE <class 'xarray.core.merge.MergeError'>

With old kwarg values

  • Run all the tests with use_new_combine_kwarg_defaults=False -> 119 failed
  • Change to use_new_combine_kwarg_defaults=True and run the last failed -> 76 failed, 43 passed

Those 44 are potentially false alarms - they could also be tests that just don't assert that the result exactly matches some fixed expectation - but mostly they are false alarms.

Here is a list of them
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[outer-minimal-nested-t] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[outer-minimal-by_coords-None] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[inner-minimal-nested-t] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[inner-minimal-by_coords-None] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[left-minimal-nested-t] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[left-minimal-by_coords-None] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[right-minimal-nested-t] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[right-minimal-by_coords-None] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_dataset_combine_attrs[drop] - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_dataset_combine_attrs[override] - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_dataset_combine_attrs[no_conflicts] - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_dataset_combine_attrs[identical] - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_dataset_combine_attrs[drop_conflicts] - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_dataset_attr_by_coords - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_dataarray_attr_by_coords - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_common_coord_when_datavars_minimal - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_invalid_data_vars_value_should_fail - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_backends.py::TestDask::test_open_mfdataset_concat_dim_none - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::TestDask::test_open_mfdataset_concat_dim_default_none - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::test_h5netcdf_storage_options - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_combine.py::TestNestedCombine::test_auto_combine_2d - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_combine.py::TestNestedCombine::test_auto_combine_2d_combine_attrs_kwarg - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_combine.py::TestCombineDatasetsbyCoords::test_infer_order_from_coords - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_combine.py::TestCombineDatasetsbyCoords::test_combine_by_coords_no_concat - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_concat.py::TestConcatDataset::test_concat_coords_kwarg[dim1-minimal] - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_concat.py::TestConcatDataset::test_concat_coords_kwarg[dim1-all] - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_concat.py::TestConcatDataset::test_concat_errors - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_concat.py::TestConcatDataArray::test_concat_avoids_index_auto_creation - FutureWarning: In a future version of xarray the default value for coords will change from coords='different' to coords='minimal'. This is likely to lead to different results when multiple datasets have m...
FAILED xarray/tests/test_dask.py::test_map_blocks_da_ds_with_template[obj1] - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_dask.py::test_map_blocks_errors_bad_template[obj1] - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_dataset.py::TestDataset::test_dataset_math_auto_align - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_datasets - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_arrays_attrs_variables[no_conflicts-attrs10-attrs20-expected_attrs0-False] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_arrays_attrs_variables[no_conflicts-attrs11-attrs21-expected_attrs1-False] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_arrays_attrs_variables[no_conflicts-attrs12-attrs22-expected_attrs2-False] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_arrays_attrs_variables[no_conflicts-attrs13-attrs23-expected_attrs3-True] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_arrays_attrs_variables[drop-attrs14-attrs24-expected_attrs4-False] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_arrays_attrs_variables[identical-attrs15-attrs25-expected_attrs5-False] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_arrays_attrs_variables[identical-attrs16-attrs26-expected_attrs6-True] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_arrays_attrs_variables[override-attrs17-attrs27-expected_attrs7-False] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_arrays_attrs_variables[drop_conflicts-attrs18-attrs28-expected_attrs8-False] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_arrays_attrs_variables[<lambda>-attrs19-attrs29-expected_attrs9-False] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_no_conflicts_preserve_attrs - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...

About half of them are triggered by my attempt to catch cases where different datasets have matching overlapping variables and with compat='no_conflicts' you might get different results than with compat='override'. There might be a better way, but we want to be careful to avoid calling compute.

Updating tests

After investigating the impact on existing tests, I updated the tests to make sure they still test what they were meant to test by passing in any required kwargs and I added new tests that explicitly ensure that for a bunch of different inputs, using old defaults throws a warning OR output is the same with new and old defaults.

Notes

  • I used sentinel values to indicate a kwarg value that the user has not explicitly set. The benefit of using this approach is that it makes it more obvious that people should not be setting these kwargs to None in their own code. Also with this approach it is possible to encode both the old and the new default value in the function signature.
  • When users have opted in to the new default values there is logic that tries to catch any errors that might be related to the new default kwargs and adds that context to the error message. In case people don't pay attention to the warnings this will at least give them some indication that things have changed and explain new errors that crop up. This only applies when there is already an error, so they don't catch places where the code succeeds but the result is different than before.

coords="different",
compat="equals",
join="outer",
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hard-coded these to the old defaults since there is no way for the user to set them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this approach. These options result in confusing groupby behaviour (#2145) but we can tackle that later

@jsignell jsignell force-pushed the concat_default_kwargs branch from 0e65034 to 5461a9f Compare February 24, 2025 20:07
@jsignell
Copy link
Contributor Author

The last test file that I need to work on is test_concat.py

Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on!

I'd like to avoid adding the extra option. Can you remind us of what that would be useful please?

coords="different",
compat="equals",
join="outer",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this approach. These options result in confusing groupby behaviour (#2145) but we can tackle that later

assert_identical(ds2, actual)

actual = ds2.merge(ds1)
actual = ds2.merge(ds1, compat="override")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this is dangerous, the dimensionality of 'x' in the output depends on the order of merging.

I intended compat="override" to be used when "x" is the same in all datasets, not merely 'compatible' as is here. It seems like we would want to at minimum, assert that ndim is same for all x in all datasets.

Opened #10094

compat,
positions,
dim=dim,
data_vars="all",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is unsettable - I hardcoded it to the old default.

@jsignell
Copy link
Contributor Author

jsignell commented Mar 4, 2025

I'd like to avoid adding the extra option. Can you remind us of what that would be useful please?

The option is part of the deprecation process. Basically how it works is in this PR the default values for these kwargs do not change. BUT you get FutureWarnings if the new default values would lead to a different result or an error. The use_new_combine_kwarg_defaults (this absolutely does not have to be the final name) allows you to opt into the new defaults prematurely. Setting that option to True lets you encounter any issues that might arise and it turns off the FutureWarnings.

This is how I'm thinking about the plan after this PR lands:

  1. after a sufficiently long deprecation cycle there should be another PR that:
    • removes the option (but (probably) doesn't throw if people are still setting it)
    • removes all the FutureWarnings
  2. after even more time one last PR that:
    • encodes the new defaults right into the function signature
    • removes the extra context about new defaults that is included in some error messages

I guess you could get rid of the option the tradeoff is that to get rid of the FutureWarnings people may have to hardcode to the new defaults. But there aren't too many false alarms (I need to update the PR description UPDATED) so most of the time when there is a FutureWarning it does indicate that hardcoding the kwarg value is a good idea.

@jsignell jsignell marked this pull request as ready for review March 4, 2025 21:01
@jsignell
Copy link
Contributor Author

jsignell commented Mar 4, 2025

This is ready for review! The failing test looks like it is also failing on main.

@jsignell
Copy link
Contributor Author

jsignell commented Mar 7, 2025

Wellll maybe the unit tests will pass now. I'll fix mypy and the doctests next week.

@jsignell
Copy link
Contributor Author

The mypy failures now match those in other PRs - I opened #10110 to track

Copy link
Collaborator

@keewis keewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed a few things, and I'll have to revisit the ignored warnings in the units tests at some point, but for now these can stay (I'll fix in a separate PR). Or maybe we should just add a module fixture that opts into the new behavior?

The flaky CI passed after rerunning, so we're all good there.

* main: (94 commits)
  Restructure API documentation into organized sections (pydata#10540)
  Fix convert calendar with different dimension name (pydata#10544)
  Fix kerchunk error in docs build (pydata#10545)
  Call super().__init__() in st.SearchStrategy subclasses (pydata#10543)
  CI: update actions location (pydata#10529)
  meta: Fix labeler topic-documentation (pydata#10524)
  drop_vars: use emit_user_level_warning (pydata#10528)
  Pydap flaky tests (pydata#10525)
  Remove test skip (pydata#10523)
  Improve warning message and tests for timedelta decoding (pydata#10508)
  Add dev whats-new (pydata#10522)
  Add release notes for v2025.07.1 (pydata#10520)
  Updates for Zarr 3 Dtypes (pydata#10456)
  Update Custom Indexes section in api.rst (pydata#10517)
  Expression not assigned (pydata#10507)
  Raise if `Index.create_variables` returns more variables than passed in through `set_xindex` (pydata#10503)
  Allow setting (or skipping) new indexes in open_dataset (pydata#8051)
  doc: add Pandas(Multi)Index to api reference (pydata#10515)
  Update pre-commit hooks (pydata#10510)
  Update pre-commit ruff legacy alias (pydata#10511)
  ...
@@ -494,7 +501,7 @@ def rectify_dim_order(self, data: Dataset, dataset) -> Dataset:
)
def test_concat_simple(self, data: Dataset, dim, coords) -> None:
datasets = [g for _, g in data.groupby(dim, squeeze=False)]
assert_identical(data, concat(datasets, dim, coords=coords))
assert_identical(data, concat(datasets, dim, coords=coords, compat="equals"))
Copy link
Contributor

@dcherian dcherian Jul 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: I don't understand why this one is needed

@@ -1384,6 +1419,117 @@ def test_concat_index_not_same_dim() -> None:
concat([ds1, ds2], dim="x")


class TestNewDefaults:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: didn't review this.

@dcherian dcherian force-pushed the concat_default_kwargs branch from 0b780b2 to 5aa01e5 Compare July 18, 2025 23:01
@dcherian
Copy link
Contributor

Thanks Julia! I powered through test_concat and test_merge. I'll need another 2 hours to do test_backends and test_combine. But hopefully we can get this in for the next release.

It all looks good so far, I managed to reduce the false positive warnings on merge, so that's a good win!

@dcherian dcherian force-pushed the concat_default_kwargs branch from 5aa01e5 to 0b99b85 Compare July 18, 2025 23:19
assert_identical(expected, actual)


class TestNewDefaults:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did not review this

@dcherian dcherian force-pushed the concat_default_kwargs branch from b683b37 to 6e0e0eb Compare July 19, 2025 04:26
except AlignmentError as e:
e.add_note(
"If you are intending to concatenate datasets, please specify the concatenation dimension explicitly. "
"Using merge to concatenate is quite inefficient."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So happy we can disallow this nonsense!

@dcherian dcherian force-pushed the concat_default_kwargs branch from 6e0e0eb to f63ed66 Compare July 19, 2025 04:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Merging coordinates computes array values
6 participants