Open mfdataset enhancement #9955

Open · wants to merge 15 commits into base: main
24 changes: 4 additions & 20 deletions doc/whats-new.rst
@@ -24,6 +24,8 @@ New Features

- Added `scipy-stubs <https://github.com/scipy/scipy-stubs>`_ to the ``xarray[types]`` dependencies.
By `Joren Hammudoglu <https://github.com/jorenham>`_.
- Added an ``errors`` argument to :py:func:`open_mfdataset` to better handle invalid files.
(:issue:`6736`, :pull:`9955`). By `Pratiman Patel <https://github.com/pratiman-91>`_.

Breaking changes
~~~~~~~~~~~~~~~~
@@ -246,26 +248,8 @@ eventually be deprecated.

New Features
~~~~~~~~~~~~
- Relax nanosecond resolution restriction in CF time coding and permit
:py:class:`numpy.datetime64` or :py:class:`numpy.timedelta64` dtype arrays
with ``"s"``, ``"ms"``, ``"us"``, or ``"ns"`` resolution throughout xarray
(:issue:`7493`, :pull:`9618`, :pull:`9977`, :pull:`9966`, :pull:`9999`). By
`Kai Mühlbauer <https://github.com/kmuehlbauer>`_ and `Spencer Clark
<https://github.com/spencerkclark>`_.
- Enable the ``compute=False`` option in :py:meth:`DataTree.to_zarr`. (:pull:`9958`).
By `Sam Levang <https://github.com/slevang>`_.
- Improve the error message raised when no key is matching the available variables in a dataset. (:pull:`9943`)
By `Jimmy Westling <https://github.com/illviljan>`_.
- Added a ``time_unit`` argument to :py:meth:`CFTimeIndex.to_datetimeindex`.
Note that in a future version of xarray,
:py:meth:`CFTimeIndex.to_datetimeindex` will return a microsecond-resolution
:py:class:`pandas.DatetimeIndex` instead of a nanosecond-resolution
:py:class:`pandas.DatetimeIndex` (:pull:`9965`). By `Spencer Clark
<https://github.com/spencerkclark>`_ and `Kai Mühlbauer
<https://github.com/kmuehlbauer>`_.
- Adds shards to the list of valid_encodings in the zarr backend, so that
sharded Zarr V3s can be written (:issue:`9947`, :pull:`9948`).
By `Jacob Prince_Bieker <https://github.com/jacobbieker>`_
- Add new ``errors`` arg to :py:meth:`open_mfdataset` to better handle invalid files.
(:issue:`6736`, :pull:`9955`). By `Pratiman Patel <https://github.com/pratiman-91>`_.

Deprecations
~~~~~~~~~~~~
74 changes: 72 additions & 2 deletions xarray/backends/api.py
@@ -1,6 +1,7 @@
from __future__ import annotations

import os
import warnings
from collections.abc import (
Callable,
Hashable,
@@ -61,6 +62,7 @@
from xarray.core.types import (
CombineAttrsOptions,
CompatOptions,
ErrorOptionsWithWarn,
JoinOptions,
NestedSequence,
ReadBuffer,
@@ -1389,6 +1391,38 @@ def open_groups(
return groups


def _remove_path(paths, path_to_remove) -> list:
"""
Recursively removes specific paths from a nested or flat list.

Parameters
----------
paths: list
The path list (nested or not) from which to remove paths.
path_to_remove: str or list
The path(s) to be removed; membership is tested with ``in``.

Returns
-------
list
A new list with specified paths removed.
"""
# Initialize an empty list to store the result
result = []

for item in paths:
if isinstance(item, list):
# If the current item is a list, recursively call _remove_path on it
nested_result = _remove_path(item, path_to_remove)
if nested_result: # Only add non-empty lists to avoid adding empty lists
result.append(nested_result)
elif item not in path_to_remove:
# Add the item to the result if it is not in the set of elements to remove
result.append(item)

return result


def open_mfdataset(
paths: str
| os.PathLike
@@ -1414,6 +1448,7 @@ def open_mfdataset(
join: JoinOptions = "outer",
attrs_file: str | os.PathLike | None = None,
combine_attrs: CombineAttrsOptions = "override",
errors: ErrorOptionsWithWarn = "raise",
**kwargs,
) -> Dataset:
"""Open multiple files as a single dataset.
@@ -1540,7 +1575,12 @@ def open_mfdataset(

If a callable, it must expect a sequence of ``attrs`` dicts and a context object
as its only parameters.
**kwargs : optional
errors : {'raise', 'warn', 'ignore'}, default 'raise'
- If 'raise', an invalid dataset will raise an exception.
- If 'warn', a warning will be issued for each invalid dataset, which is then skipped.
- If 'ignore', invalid datasets will be silently skipped.

**kwargs : optional
Additional arguments passed on to :py:func:`xarray.open_dataset`. For an
overview of some of the possible options, see the documentation of
:py:func:`xarray.open_dataset`
@@ -1632,7 +1672,32 @@ def open_mfdataset(
open_ = open_dataset
getattr_ = getattr

datasets = [open_(p, **open_kwargs) for p in paths1d]
if errors not in ("raise", "warn", "ignore"):
raise ValueError(
f"'errors' must be 'raise', 'warn' or 'ignore', got '{errors}'"
)

datasets = []
remove_paths = False
for p in paths1d:
try:
ds = open_(p, **open_kwargs)
datasets.append(ds)
except Exception:
# remove invalid paths
if combine == "nested":
paths = _remove_path(paths, p)
remove_paths = True
if errors == "raise":
raise
elif errors == "warn":
warnings.warn(
f"Could not open {p}. Ignoring.", UserWarning, stacklevel=2
)
continue
else:
continue

closers = [getattr_(ds, "_close") for ds in datasets]
if preprocess is not None:
datasets = [preprocess(ds) for ds in datasets]
@@ -1645,6 +1710,11 @@ def open_mfdataset(
# Combine all datasets, closing them in case of a ValueError
try:
if combine == "nested":
# Create new ids and paths based on removed items
if remove_paths:
combined_ids_paths = _infer_concat_order_from_positions(paths)
ids = list(combined_ids_paths.keys())

# Combined nested list by successive concat and merge operations
# along each dimension, using structure given by "ids"
combined = _nested_combine(
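Setting xarray specifics aside, the open loop in the ``api.py`` diff above implements a common raise/warn/ignore error policy. A self-contained sketch of just that policy (the names ``open_with_policy`` and ``fake_opener`` are illustrative, not part of the PR):

```python
import warnings


def open_with_policy(paths, opener, errors="raise"):
    """Open each path with `opener`, applying the `errors` policy on failure."""
    if errors not in ("raise", "warn", "ignore"):
        raise ValueError(
            f"'errors' must be 'raise', 'warn' or 'ignore', got '{errors}'"
        )
    opened, skipped = [], []
    for p in paths:
        try:
            opened.append(opener(p))
        except Exception:
            if errors == "raise":
                raise
            if errors == "warn":
                warnings.warn(f"Could not open {p}. Ignoring.", UserWarning)
            skipped.append(p)  # track pruned paths, as _remove_path does
    return opened, skipped


def fake_opener(path):
    # Hypothetical opener that fails for one path.
    if "bad" in path:
        raise OSError(f"cannot open {path}")
    return path


print(open_with_policy(["a.nc", "bad.nc", "b.nc"], fake_opener, errors="ignore"))
# (['a.nc', 'b.nc'], ['bad.nc'])
```

The key design point mirrored from the PR: validation of the ``errors`` value happens up front, and ``raise`` re-raises the original exception so the default behaviour is unchanged.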
62 changes: 62 additions & 0 deletions xarray/tests/test_backends.py
@@ -4978,6 +4978,68 @@
) as actual:
assert_identical(original, actual)

def test_open_mfdataset_with_ignore(self) -> None:
original = Dataset({"foo": ("x", np.random.randn(10))})
with create_tmp_files(2) as (tmp1, tmp2):
ds1 = original.isel(x=slice(5))
ds2 = original.isel(x=slice(5, 10))
ds1.to_netcdf(tmp1)
ds2.to_netcdf(tmp2)
with open_mfdataset(
[tmp1, "non-existent-file.nc", tmp2],
concat_dim="x",
combine="nested",
errors="ignore",
) as actual:
assert_identical(original, actual)

def test_open_mfdataset_with_warn(self) -> None:
original = Dataset({"foo": ("x", np.random.randn(10))})
with pytest.warns(UserWarning, match="Ignoring."):
with create_tmp_files(2) as (tmp1, tmp2):
ds1 = original.isel(x=slice(5))
ds2 = original.isel(x=slice(5, 10))
ds1.to_netcdf(tmp1)
ds2.to_netcdf(tmp2)
with open_mfdataset(
[tmp1, "non-existent-file.nc", tmp2],
concat_dim="x",
combine="nested",
errors="warn",
) as actual:
assert_identical(original, actual)

def test_open_mfdataset_2d_with_ignore(self) -> None:
original = Dataset({"foo": (["x", "y"], np.random.randn(10, 8))})
with create_tmp_files(4) as (tmp1, tmp2, tmp3, tmp4):
original.isel(x=slice(5), y=slice(4)).to_netcdf(tmp1)
original.isel(x=slice(5, 10), y=slice(4)).to_netcdf(tmp2)
original.isel(x=slice(5), y=slice(4, 8)).to_netcdf(tmp3)
original.isel(x=slice(5, 10), y=slice(4, 8)).to_netcdf(tmp4)
with open_mfdataset(
[[tmp1, tmp2], ["non-existent-file.nc", tmp3, tmp4]],
combine="nested",
concat_dim=["y", "x"],
errors="ignore",
) as actual:
assert_identical(original, actual)

def test_open_mfdataset_2d_with_warn(self) -> None:
original = Dataset({"foo": (["x", "y"], np.random.randn(10, 8))})
with pytest.warns(UserWarning, match="Ignoring."):
with create_tmp_files(4) as (tmp1, tmp2, tmp3, tmp4):
original.isel(x=slice(5), y=slice(4)).to_netcdf(tmp1)
original.isel(x=slice(5, 10), y=slice(4)).to_netcdf(tmp2)
original.isel(x=slice(5), y=slice(4, 8)).to_netcdf(tmp3)
original.isel(x=slice(5, 10), y=slice(4, 8)).to_netcdf(tmp4)
with open_mfdataset(
[[tmp1, tmp2, "non-existent-file.nc"], [tmp3, tmp4]],
combine="nested",
concat_dim=["y", "x"],
errors="warn",
) as actual:
assert_identical(original, actual)

def test_attrs_mfdataset(self) -> None:
original = Dataset({"foo": ("x", np.random.randn(10))})
with create_tmp_file() as tmp1:
@@ -5362,7 +5424,7 @@
yield actual, expected

def test_cmp_local_file(self) -> None:
with self.create_datasets() as (actual, expected):

assert_equal(actual, expected)

# global attributes should be global attributes on the dataset
@@ -5398,7 +5460,7 @@

def test_compatible_to_netcdf(self) -> None:
# make sure it can be saved as a netcdf
with self.create_datasets() as (actual, expected):

with create_tmp_file() as tmp_file:
actual.to_netcdf(tmp_file)
with open_dataset(tmp_file) as actual2:
@@ -5407,7 +5469,7 @@

@requires_dask
def test_dask(self) -> None:
with self.create_datasets(chunks={"j": 2}) as (actual, expected):

assert_equal(actual, expected)


