-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
map_over_datasets throws error on nodes without datasets #9693
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
This was an intentional change, because a special case to skip empty nodes felt surprsing to me. On the other hand, maybe it does make sense to skip nodes without datasets specifically for a method that maps over datasets (but not for a method that maps over nodes). So I'm open to changing this. The other option would be to add a new keyword argument to For what it's worth, the canonical way to write this today would be something like: def add_resistance_try(dtree):
def calculate_resistance(ds):
if not ds:
return None
ds_new = ds.copy()
ds_new['resistance'] = ds_new['potential']/ds_new['current']
return ds_new
dtree = dtree.map_over_datasets(calculate_resistance)
return dtree |
Thanks for raising this @dhruvbalwada ! I would be in favor of changing this. It came up before for users and I'm not surprised it has come up almost immediately again. I think it's reasonable for "map over datasets" to not map over a node where there is no dataset by default. The subtleties are with inherited variables and attrs. There are multiple issues on the old repo discussing this. |
I like this idea with default |
I can see uses-cases for both But I think we should not add that To assist users with that task xarray could provide the same functionality the OP is looking for using a simple decorator, (Update: now tested, finally): import functools
def skip_empty_nodes(func):
@functools.wraps(func)
def _func(ds, *args, **kwargs):
if not ds:
return ds
return func(ds, *args, **kwargs)
return _func
def add_resistance_try(dtree):
@skip_empty_nodes
def calculate_resistance(ds):
ds_new = ds.copy()
ds_new['resistance'] = ds_new['potential']/ds_new['current']
return ds_new
dtree = dtree.map_over_datasets(calculate_resistance)
return dtree
voltages = add_resistance_try(voltages) Anyway, if the kwarg-solution is preferred, I'm opting for |
I don't think we need extensive helper functions or options in For cases where users need control, they can just iterate over |
Fine with that, too. Are Datasets with only attrs considered empty? |
There are a few different edge cases:
The original I'm not sure whether or not to call the mapped over function for nodes that only define coordinates. Certainly I would not blindly copy coordinates from otherwise empty nodes onto the result, because those coordinates may no longer be relevant on the result. |
Thanks @shoyer for the details. Good to see that there are solutions for many use-cases already built-in or available via external helper functions. I'm diverting a bit from the issue now. I've had to do this kind of wrapping to feed kwargs to my mapping function. What is the canonical way to feed kwargs to map_over_datasets? I should open a separate issue for that. |
You can pass in a helper function or use |
shouldn't that be |
I opened #10042 - happy to get comments |
@shoyer I want to discuss the coordinates question more because it's holding up @mathause in #10042 (comment).
Should the heuristic here be to follow the behaviour of In [1]: import xarray as xr
In [2]: ds = xr.Dataset(coords={'t': [1]}, attrs={'foo': 'bar'})
In [3]: ds
Out[3]:
<xarray.Dataset> Size: 8B
Dimensions: (t: 1)
Coordinates:
* t (t) int64 8B 1
Data variables:
*empty*
Attributes:
foo: bar
In [4]: ds.map(lambda x: x)
Out[4]:
<xarray.Dataset> Size: 0B
Dimensions: ()
Data variables:
*empty*
Following |
Thanks @TomNicholas - that sounds reasonable and would change the code in #10042 as: - if node_tree_args[0].has_data:
+ if node_tree_args[0]._data_variables:
res = func(node_tree_args, kwargs)
|
Currently we do that but I'm trying to remember what the original motivation was... @shoyer @owenlittlejohns @flamingbear @keewis do you remember?
We could alternatively change the definition of |
Given the ambiguity, I'm honestly not sure we should have The main issue with skipping nodes without data variables is that all coordinates defined on such nodes will get pushed down into leaf datasets. But maybe that's OK for a helper function... |
Yes I agree actually, I would be totally fine with getting rid of
Ugh. I forgot about that. That's annoying because it means this property won't hold for any tree with coordinates defined on non-leaf nodes: dt.map_over_datasets(lambda ds: ds) == dt Is there any way we can do this that would preserve that property always? |
This is preserved in the current implementation (i.e., when also applying import xarray as xr
tree = xr.DataTree.from_dict(
{
"/": xr.Dataset(coords={"x": [1, 2]}),
"/first": xr.Dataset({"a": ("x", [1, 2])}),
}
)
tree.map_over_datasets(lambda x: x)["first"]
There seems to be no way to automatically remove coords from children currently. |
map_over_datasets
-- a way to compute over datatrees -- currently seems to try an operate even on nodes which contain no datasets, and consequently raises an error.This seems to be a new issue, and was not a problem when this function was called
map_over_subtree
, which was part of the experimental datatree versions.An example to reproduce this problem is below:
Calling
voltages = add_resistance_only_do(voltages)
raises the error:This can be easily resolved by putting try statements in (e.g.
voltages = add_resistance_try(voltages)
), but we know that Yoda would not recommend try (right @TomNicholas).Can this be built in as a default feature of
map_over_datasets
? as many examples of datatree will have nodes without datasets.The text was updated successfully, but these errors were encountered: