map_over_datasets throws error on nodes without datasets #9693

dhruvbalwada · 2024-10-29T11:22:25Z

map_over_datasets -- a way to compute over datatrees -- currently seems to try to operate even on nodes which contain no datasets, and consequently raises an error.
This seems to be a new issue, and was not a problem when this function was called map_over_subtree, which was part of the experimental datatree versions.

An example to reproduce this problem is below:

## Generate datatree, using example from documentation
import numpy as np
import xarray as xr


def time_stamps(n_samples, T):
    """Create an array of evenly-spaced time stamps"""
    return xr.DataArray(
        data=np.linspace(0, 2 * np.pi * T, n_samples), dims=["time"]
    )


def signal_generator(t, f, A, phase):
    """Generate an example electrical-like waveform"""
    return A * np.sin(f * t.data + phase)


time_stamps1 = time_stamps(n_samples=15, T=1.5)

time_stamps2 = time_stamps(n_samples=10, T=1.0)

voltages = xr.DataTree.from_dict(
    {
        "/oscilloscope1": xr.Dataset(
            {
                "potential": (
                    "time",
                    signal_generator(time_stamps1, f=2, A=1.2, phase=0.5),
                ),
                "current": (
                    "time",
                    signal_generator(time_stamps1, f=2, A=1.2, phase=1),
                ),
            },
            coords={"time": time_stamps1},
        ),
        "/oscilloscope2": xr.Dataset(
            {
                "potential": (
                    "time",
                    signal_generator(time_stamps2, f=1.6, A=1.6, phase=0.2),
                ),
                "current": (
                    "time",
                    signal_generator(time_stamps2, f=1.6, A=1.6, phase=0.7),
                ),
            },
            coords={"time": time_stamps2},
        ),
    }
)

## Write some function to add resistance
def add_resistance_only_do(dtree): 
    def calculate_resistance(ds):
        ds_new = ds.copy()
        
        ds_new['resistance'] = ds_new['potential']/ds_new['current']
        return ds_new 
        
    dtree = dtree.map_over_datasets(calculate_resistance)
    
    return dtree
    
def add_resistance_try(dtree): 
    def calculate_resistance(ds):
        ds_new = ds.copy()
        try:
            ds_new['resistance'] = ds_new['potential']/ds_new['current']
            return ds_new 
        except KeyError:  # node has no 'potential'/'current' variables
            return ds_new

    dtree = dtree.map_over_datasets(calculate_resistance)
    
    return dtree

Calling voltages = add_resistance_only_do(voltages) raises the error:

KeyError: "No variable named 'potential'. Variables on the dataset include []"
Raised whilst mapping function over node with path '.'

This can be easily resolved by putting try statements in (e.g. voltages = add_resistance_try(voltages)), but we know that Yoda would not recommend try (right @TomNicholas).

Can this be built in as a default feature of map_over_datasets? Many datatrees will have nodes without datasets.


shoyer · 2024-10-29T18:26:36Z

This was an intentional change, because a special case to skip empty nodes felt surprising to me.

On the other hand, maybe it does make sense to skip nodes without datasets specifically for a method that maps over datasets (but not for a method that maps over nodes). So I'm open to changing this. The other option would be to add a new keyword argument to map_over_datasets for controlling this, something like skip_empty_nodes=True.

For what it's worth, the canonical way to write this today would be something like:

def add_resistance_try(dtree): 
    def calculate_resistance(ds):
        if not ds:
            return None
        ds_new = ds.copy()
        ds_new['resistance'] = ds_new['potential']/ds_new['current']
        return ds_new 

    dtree = dtree.map_over_datasets(calculate_resistance)
    return dtree

TomNicholas · 2024-10-29T18:53:59Z

Thanks for raising this @dhruvbalwada !

I would be in favor of changing this. It came up before for users and I'm not surprised it has come up almost immediately again.

I think it's reasonable for "map over datasets" to not map over a node where there is no dataset by default. The subtleties are with inherited variables and attrs. There are multiple issues on the old repo discussing this.

dcherian · 2024-10-29T20:04:04Z

> The other option would be to add a new keyword argument to map_over_datasets for controlling this, something like skip_empty_nodes=True.

I like this idea with default False. With deep hierarchies it can be easy to miss that a node might be unexpectedly empty. So it'd be good to force users to opt in.

kmuehlbauer · 2024-10-30T10:39:03Z

I can see use cases for both skip_empty_nodes=False and skip_empty_nodes=True, so we won't make all users happy with either default.

But I think we should not add that skip_empty_nodes kwarg at all. Instead we could encourage users to work with solutions along the lines of @shoyer's suggestion above. In more complex scenarios users will need such solutions anyway, since their functions might only work on dedicated nodes: tree layouts can differ significantly, and nodes won't be equivalent in terms of their content.

To assist users with that task, xarray could provide the same functionality the OP is looking for using a simple decorator (update: now tested, finally):

import functools
def skip_empty_nodes(func):
    @functools.wraps(func)
    def _func(ds, *args, **kwargs):
        if not ds:
            return ds
        return func(ds, *args, **kwargs)
    return _func

def add_resistance_try(dtree):
    @skip_empty_nodes
    def calculate_resistance(ds):
        ds_new = ds.copy()
        ds_new['resistance'] = ds_new['potential']/ds_new['current']
        return ds_new 

    dtree = dtree.map_over_datasets(calculate_resistance)
    return dtree
    
    
voltages = add_resistance_try(voltages)

Anyway, if the kwarg-solution is preferred, I'm opting for skip_empty_nodes=False.

shoyer · 2024-10-30T16:55:48Z

I don't think we need extensive helper functions or options in map_over_datasets. It's a convenience function, which is why I'm OK skipping empty nodes by default.

For cases where users need control, they can just iterate over DataTree.subtree_with_keys or xarray.group_subtrees() themselves.

kmuehlbauer · 2024-10-30T16:59:13Z

Fine with that, too. Are Datasets with only attrs considered empty?

shoyer · 2024-10-30T17:34:01Z

> Fine with that, too. Are Datasets with only attrs considered empty?

There are a few different edge cases:

- Only attrs
- Only coordinates/attrs

The original map_over_subtree had special logic to propagate forward attributes only for empty nodes, without calling the mapped over function. That seems reasonable to me.

I'm not sure whether or not to call the mapped over function for nodes that only define coordinates. Certainly I would not blindly copy coordinates from otherwise empty nodes onto the result, because those coordinates may no longer be relevant on the result.
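
To make these edge cases concrete, here is a small sketch that tells them apart on plain Dataset objects. The helper classify and its labels are made up for illustration. It relies on the fact that a Dataset's truthiness depends only on its data variables, so an `if not ds` check treats attrs-only and coordinates-only datasets the same as completely empty ones.

```python
import xarray as xr


def classify(ds):
    """Illustrative helper: name the kind of content a Dataset holds."""
    if len(ds.data_vars) > 0:
        return "data"
    if len(ds.coords) > 0:
        return "coords-only"
    if ds.attrs:
        return "attrs-only"
    return "empty"


empty = xr.Dataset()
attrs_only = xr.Dataset(attrs={"foo": "bar"})
coords_only = xr.Dataset(coords={"t": [1]}, attrs={"foo": "bar"})
with_data = xr.Dataset({"a": ("t", [1])})

for ds in (empty, attrs_only, coords_only, with_data):
    # bool(ds) is False for everything except the dataset with data variables.
    print(classify(ds), bool(ds))
```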

kmuehlbauer · 2024-10-30T18:05:09Z

Thanks @shoyer for the details. Good to see that there are solutions for many use-cases already built-in or available via external helper functions.

I'm diverting a bit from the issue now. I've had to do this kind of wrapping to feed kwargs to my mapping function. What is the canonical way to feed kwargs to map_over_datasets? I should open a separate issue for that.

shoyer · 2024-10-30T18:27:59Z

> I'm diverting a bit from the issue now. I've had to do this kind of wrapping to feed kwargs to my mapping function. What is the canonical way to feed kwargs to map_over_datasets? I should open a separate issue for that.

You can pass in a helper function or use functools.partial. We could also add a kwargs argument like xarray.apply_ufunc.

keewis · 2024-10-30T18:45:40Z

> or use functools.wraps

shouldn't that be functools.partial?

mathause · 2025-02-10T23:12:16Z

I opened #10042 - happy to get comments

TomNicholas · 2025-03-14T16:01:13Z

@shoyer I want to discuss the coordinates question more because it's holding up @mathause in #10042 (comment).

> I'm not sure whether or not to call the mapped over function for nodes that only define coordinates.

Should the heuristic here be to follow the behaviour of Dataset.map? That maps only over data variables, and drops all coordinates when applied to a Dataset containing only coordinate variables:

In [1]: import xarray as xr

In [2]: ds = xr.Dataset(coords={'t': [1]}, attrs={'foo': 'bar'})

In [3]: ds
Out[3]: 
<xarray.Dataset> Size: 8B
Dimensions:  (t: 1)
Coordinates:
  * t        (t) int64 8B 1
Data variables:
    *empty*
Attributes:
    foo:      bar

In [4]: ds.map(lambda x: x)
Out[4]: 
<xarray.Dataset> Size: 0B
Dimensions:  ()
Data variables:
    *empty*

> Certainly I would not blindly copy coordinates from otherwise empty nodes onto the result, because those coordinates may no longer be relevant on the result.

Following Dataset.map would avoid blindly copying coordinates from otherwise empty nodes onto the result. That seems reasonable to me. (Note that Dataset.map does drop attrs in this case.)

mathause · 2025-03-17T11:09:11Z

Thanks @TomNicholas - that sounds reasonable and would change the code in #10042 as:

- if node_tree_args[0].has_data:
+ if node_tree_args[0]._data_variables:
      res = func(node_tree_args, kwargs)

Do we still copy over ds which only contain attrs?
And what do we do with ds which contain coords and attrs?
- replace by an empty ds
- only copy the attrs over
- something else?

TomNicholas · 2025-03-17T18:15:51Z

> Do we still copy over ds which only contain attrs?

Currently we do that but I'm trying to remember what the original motivation was... @shoyer @owenlittlejohns @flamingbear @keewis do you remember?

> would change the code in #10042

We could alternatively change the definition of has_data, which would be another breaking change. But if we are going to break it anyway, perhaps it is better to do the breaks in one go.

shoyer · 2025-03-17T19:10:13Z

> We could alternatively change the definition of has_data, which would be another breaking change. But if we are going to break it anyway, perhaps it is better to do the breaks in one go.

Given the ambiguity, I'm honestly not sure we should have a has_data attribute. Writing if node.data_variables is more descriptive and just as succinct.

The main issue with skipping nodes without data variables is that all coordinates defined on such nodes will get pushed down into leaf datasets. But maybe that's OK for a helper function...

TomNicholas · 2025-03-17T19:26:17Z

> Given the ambiguity, I'm honestly not sure we should have a has_data attribute.

Yes I agree actually, I would be totally fine with getting rid of has_data.

> The main issue with skipping nodes without data variables is that all coordinates defined on such nodes will get pushed down into leaf datasets.

Ugh. I forgot about that. That's annoying because it means this property won't hold for any tree with coordinates defined on non-leaf nodes:

dt.map_over_datasets(lambda ds: ds) == dt

Is there any way we can do this that would preserve that property always?

mathause · 2025-03-24T14:32:43Z

This is preserved in the current implementation (i.e., when also applying func to nodes with only coords):

import xarray as xr

tree = xr.DataTree.from_dict(
    {
        "/": xr.Dataset(coords={"x": [1, 2]}),
        "/first": xr.Dataset({"a": ("x", [1, 2])}),
    }
)
tree.map_over_datasets(lambda x: x)["first"]

<xarray.DataTree 'first'>
Group: /first
    Dimensions:  (x: 2)
    Inherited coordinates:
      * x        (x) int64 16B 1 2
    Data variables:
        a        (x) int64 16B 1 2

There seems to be no way to automatically remove coords from children currently.

dhruvbalwada added the needs triage label Oct 29, 2024

TomNicholas added topic-DataTree and removed needs triage labels Oct 29, 2024

melonora mentioned this issue Nov 4, 2024: No attribute map_over_subtree #9710 (Closed)

veni-vidi-vici-dormivi mentioned this issue Nov 8, 2024: add Datatree utility functions MESMER-group/mesmer#556 (Merged)

This was referenced Jan 24, 2025: upstream dev issues hidden by pinned xarray MESMER-group/mesmer#601 (Closed); allow kwargs in map_over_datasets? #10009 (Closed)

mathause linked a pull request Feb 10, 2025 that will close this issue: map_over_datasets: skip empty nodes #10042 (Open)
