Changes from 28 commits
32 commits:
13557aa
add test to test time slicing
matschreiner Feb 12, 2025
d3802f2
add warning if using step
matschreiner Feb 12, 2025
d7924dc
warn about ignoring step_size
matschreiner Feb 12, 2025
868558f
update naming
matschreiner Feb 12, 2025
7d72542
test allow timedelta
matschreiner Feb 12, 2025
824d45d
remove tests for checking time is in data and timestep is correct
matschreiner Feb 12, 2025
ff249e9
it is specified in the docstring that it has to be a dict
matschreiner Feb 13, 2025
01df7c3
check selection and warn
matschreiner Feb 13, 2025
ede3b9a
xarray handles this check
matschreiner Feb 13, 2025
570bc65
is always range
matschreiner Feb 13, 2025
4dcba28
better warning
matschreiner Feb 13, 2025
b23e426
make range accept timedelta and datetime
matschreiner Feb 13, 2025
d19b304
add test to test range
matschreiner Feb 13, 2025
b38191c
simplify casts
matschreiner Feb 13, 2025
a615ae7
parametrize step_sizes
matschreiner Feb 14, 2025
87760f9
to timestamp if coordinate is time
matschreiner Feb 14, 2025
747986a
add tests instantiating from string end_points and datetime endpoints
matschreiner Feb 14, 2025
da5c70c
move slicing to input data and create string yaml with string datetime
matschreiner Feb 14, 2025
0f1466f
remove redundant test
matschreiner Feb 14, 2025
3cd8933
better names in tests
matschreiner Feb 14, 2025
381bc64
better warning
matschreiner Feb 14, 2025
85b5c27
remove test that doesn't test anything
matschreiner Feb 14, 2025
5e217d1
remove obvious comment
matschreiner Feb 14, 2025
96f5d93
remove import
matschreiner Feb 14, 2025
7fc6f37
remove pickle
matschreiner Feb 14, 2025
7b2845b
resolve conflict and merge main
matschreiner Feb 14, 2025
f5fd875
remove unused imports
matschreiner Feb 14, 2025
218c738
allow for none in step
matschreiner Feb 14, 2025
e993334
remove height levels
matschreiner Feb 14, 2025
779530c
improve test name
matschreiner Feb 16, 2025
4d52059
improve testname
matschreiner Feb 16, 2025
3542f8b
improve testname
matschreiner Feb 16, 2025
Binary file added height_levels.pkl
Binary file not shown.
7 changes: 4 additions & 3 deletions mllam_data_prep/config.py
@@ -1,4 +1,5 @@
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Any, Dict, List, Optional, Union

import dataclass_wizard
@@ -72,9 +73,9 @@ class Range:
then the entire range will be selected.
"""

start: Union[str, int, float]
end: Union[str, int, float]
step: Optional[Union[str, int, float]] = None
start: Union[str, int, float, datetime]
Member

ah, I'm an idiot. I am not sure we need to support str here since I added that for ISO 8601 formatted strings. I thought that yaml doesn't natively support datetime serialisation, but I misremembered; that is json! So, if people write ISO 8601 formatted strings then I think yaml should always turn them into datetime.datetime objects if we define that as the type here. That also means we don't have to handle turning strings into datetime/timedelta objects in the code, nice!

Member

which I think means we should remove str here. Do you agree @matschreiner and @observingClouds ?

Contributor

I imagine some folks might still write times as strings, but not allowing it in the first place would probably be more robust, so I'd be happy to drop str support

Member
@leifdenby leifdenby Feb 17, 2025

Hmm, I'm confused. Our current config schema supports the following:

schema_version: v0.6.0
dataset_version: v0.1.0

output:
  variables:
    static: [grid_index, static_feature]
    state: [time, grid_index, state_feature]
    forcing: [time, grid_index, forcing_feature]
  coord_ranges:
    time:
      start: 1990-09-03T00:00
      end: 1990-09-09T00:00
      step: PT3H
...

What I am suggesting is that if we replace the str type with datetime in the config dataclasses then the dataclass serialisation (from dataclass-wizard) will handle turning the start and end fields into datetime objects, rather than us having to do it. In terms of what is in the config yaml-files they would remain unchanged though? So people can keep defining the start/end times as they already are. What was previously interpreted as a string will now simply be interpreted as a datetime serialised as an ISO 8601 string, no?

Contributor
@observingClouds observingClouds Feb 17, 2025

I was just experimenting together with @matschreiner and 1990-09-03T00:00 is not treated as a valid ISO format: it is not converted to datetime and remains a string. However, 1990-09-03T00:00:00 is converted.

There seems to be an issue with a roundtrip though, as mdp.Config.to_yaml()/to_dict/to_json all serialize datetimes back to strings... this might be an upstream issue in dataclass-wizard.
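A minimal sketch of the behaviour described here, assuming PyYAML is the loader underneath (its implicit YAML 1.1 timestamp resolver requires a seconds component, so the minute-precision form stays a plain string):

```python
import yaml  # PyYAML, assumed here to be the YAML loader in use

# Minute precision: YAML 1.1's implicit timestamp regex requires seconds,
# so this value loads as a plain string.
minute_precision = yaml.safe_load("start: 1990-09-03T00:00")["start"]
print(type(minute_precision))  # <class 'str'>

# Second precision matches the implicit !!timestamp tag and is
# converted to a datetime.datetime object on load.
second_precision = yaml.safe_load("start: 1990-09-03T00:00:00")["start"]
print(type(second_precision))  # <class 'datetime.datetime'>
```

Under that assumption the minute-precision form never reaches the datetime machinery as a timestamp at all, which would explain the observation above.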

Member

Hmm that is strange. Omitting the seconds should be valid, https://en.wikipedia.org/wiki/ISO_8601#Times

Either the seconds, or the minutes and seconds, may be omitted from the basic or extended time formats for greater brevity but decreased precision; the resulting reduced precision time formats are:[26]
T[hh][mm] in basic format or T[hh]:[mm] in extended format, when seconds are omitted.
T[hh], when both seconds and minutes are omitted.

Maybe this is an upstream bug, or YAML requires the seconds to be included...

Member

In dataclass-wizard docs, https://dataclass-wizard.readthedocs.io/en/latest/overview.html#supported-types

For date, time, and datetime types, string values are de-serialized using the builtin fromisoformat() method; for datetime and time types, a suffix of “Z” appearing in the string is first replaced with “+00:00”, which represents UTC time. JSON values for datetime and date annotated types appearing as numbers will be de-serialized using the builtin fromtimestamp() method.

All these types are serialized back to JSON using the builtin isoformat() method. For datetime and time types, there is one noteworthy addition: the suffix “+00:00” is replaced with “Z”, which is a common abbreviation for UTC time.

this seems ok, no? It does mean the round trip wouldn't result in exactly the same yaml, because the seconds will be added by the isoformat() call, see https://docs.python.org/3/library/datetime.html#datetime.datetime.isoformat
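The stdlib behaviour being discussed can be checked directly; a small standard-library-only sketch:

```python
from datetime import datetime

# fromisoformat() accepts minute-precision ISO 8601 strings (Python >= 3.7).
dt = datetime.fromisoformat("1990-09-03T00:00")
print(dt)  # 1990-09-03 00:00:00

# isoformat() always emits the seconds component, so the serialised form
# gains ":00" relative to the original string: lossless in value, but not
# byte-identical on a round trip.
print(dt.isoformat())  # 1990-09-03T00:00:00
```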

end: Union[str, int, float, datetime]
step: Union[str, int, float, timedelta, None] = None


@dataclass
117 changes: 42 additions & 75 deletions mllam_data_prep/ops/selection.py
@@ -1,32 +1,19 @@
import datetime
import warnings

import numpy as np
import pandas as pd

from ..config import Range

def to_timestamp(s):
if isinstance(s, str):
return pd.Timestamp(s)
return s

def _normalize_slice_startstop(s):
if isinstance(s, pd.Timestamp):
return s
elif isinstance(s, str):
try:
return pd.Timestamp(s)
except ValueError:
return s
else:
return s


def _normalize_slice_step(s):
if isinstance(s, pd.Timedelta):
return s
elif isinstance(s, str):
try:
return pd.to_timedelta(s)
except ValueError:
return s
else:
return s
def to_timedelta(s):
if isinstance(s, str):
return np.timedelta64(pd.to_timedelta(s))
return s


def select_by_kwargs(ds, **coord_ranges):
@@ -56,64 +43,44 @@ def select_by_kwargs(ds, **coord_ranges):
"""

for coord, selection in coord_ranges.items():
if coord not in ds.coords:
raise ValueError(f"Coordinate {coord} not found in dataset")
if isinstance(selection, Range):
if selection.start is None and selection.end is None:
raise ValueError(
f"Selection for coordinate {coord} must have either 'start' and 'end' given"
)
sel_start = _normalize_slice_startstop(selection.start)
sel_end = _normalize_slice_startstop(selection.end)
sel_step = _normalize_slice_step(selection.step)

assert sel_start != sel_end, "Start and end cannot be the same"

# we don't select with the step size for now, but simply check (below) that
# the step size in the data is the same as the requested step size
ds = ds.sel({coord: slice(sel_start, sel_end)})

if coord == "time":
check_point_in_dataset(coord, sel_start, ds)
check_point_in_dataset(coord, sel_end, ds)
if sel_step is not None:
check_step(sel_step, coord, ds)

assert (
len(ds[coord]) > 0
), f"You have selected an empty range {sel_start}:{sel_end} for coordinate {coord}"

elif isinstance(selection, list):
ds = ds.sel({coord: selection})
else:
raise NotImplementedError(
f"Selection for coordinate {coord} must be a list or a dict"
)
sel_start = selection.start
sel_end = selection.end
sel_step = selection.step

if coord == "time":
sel_start = to_timestamp(selection.start)
sel_end = to_timestamp(selection.end)
sel_step = get_time_step(sel_step, ds)
Comment on lines +50 to +53
Contributor

I agree with @leifdenby that it would be good to only accept datetimes, and then this rather boilerplate code would not be necessary


assert sel_start != sel_end, "Start and end cannot be the same"

Comment on lines +55 to +56
Contributor

ds.sel({'time': slice(dt.datetime(2023,1,1), dt.datetime(2023,1,1))}) is a valid expression and returns data from dt.datetime(2023,1,1)

check_selection(ds, coord, sel_start, sel_end)
ds = ds.sel({coord: slice(sel_start, sel_end, sel_step)})

assert (
len(ds[coord]) > 0
), f"You have selected an empty range {sel_start}:{sel_end} for coordinate {coord}"

return ds


def check_point_in_dataset(coord, point, ds):
"""
check that the requested point is in the data.
"""
if point is not None and point not in ds[coord].values:
def get_time_step(sel_step, ds):
if sel_step is None:
return None

dataset_timedelta = ds.time[1] - ds.time[0]
sel_timedelta = to_timedelta(sel_step)
step = sel_timedelta / dataset_timedelta
if step % 1 != 0:
raise ValueError(
f"Provided value for coordinate {coord} ({point}) is not in the data."
f"The chosen stepsize {sel_step} is not a multiple of the stepsize in the dataset {dataset_timedelta}"
)

return int(step)
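The divisibility check in get_time_step can be illustrated with a stdlib-only sketch (the helper name steps_between is hypothetical; the real code divides pandas/numpy timedeltas, but the arithmetic is the same for datetime.timedelta):

```python
from datetime import timedelta


def steps_between(sel_step: timedelta, dataset_step: timedelta) -> int:
    """Return how many dataset steps make up one requested step.

    Raises ValueError if the requested step is not an exact multiple
    of the dataset's step, mirroring the check in get_time_step.
    """
    ratio = sel_step / dataset_step  # timedelta / timedelta -> float
    if ratio % 1 != 0:
        raise ValueError(
            f"step {sel_step} is not a multiple of dataset step {dataset_step}"
        )
    return int(ratio)


# PT9H on 3-hourly data selects every third timestep
print(steps_between(timedelta(hours=9), timedelta(hours=3)))  # 3
```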

def check_step(sel_step, coord, ds):
"""
check that the step requested is exactly what the data has
"""
all_steps = ds[coord].diff(dim=coord).values
first_step = all_steps[0].astype("timedelta64[s]").astype(datetime.timedelta)

if not all(all_steps[0] == all_steps):
raise ValueError(
f"Step size for coordinate {coord} is not constant: {all_steps}"
)
if sel_step != first_step:
raise ValueError(
f"Step size for coordinate {coord} is not the same as requested: {first_step} != {sel_step}"
def check_selection(ds, coord, sel_start, sel_end):
if ds[coord].values.min() < sel_start or ds[coord].values.max() > sel_end:
warnings.warn(
f"\nChosen slice exceeds the range of {coord} in the dataset.\n Dataset span: [ {ds[coord].values.min()} : {ds[coord].values.max()} ]\n Chosen slice: [ {sel_start} : {sel_end} ]\n"
Comment on lines 81 to +85
Contributor

This check assumes that the coordinates are sorted in ascending order, however, this is not always the case.

Example:

import xarray as xr
import numpy as np

# Create geographic coordinates
latitude = np.array([10, 5, 0, -5, -10])  # Latitude in descending order
longitude = np.array([30, 35, 40])  # Longitude in ascending order

# Create some sample data
data = np.random.rand(len(latitude), len(longitude))  # 2D data with latitudes and longitudes

# Create an xarray Dataset with geographic coordinates
ds = xr.Dataset(
    {
        'temperature': (('latitude', 'longitude'), data),  # 2D data with latitude and longitude dimensions
        'precipitation': (('latitude', 'longitude'), data * 0.1),  # Example precipitation data
    },
    coords={
        'latitude': latitude,
        'longitude': longitude,
    }
)

ds.sel({'latitude': slice(8,-5)}) # this slice is within bounds but would raise the above warning

It should also be sel_start <= ds[coord].values.min().

Nevertheless, I would not do this test at all. The user should know what a valid range is.

)
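A sketch of an order-agnostic bounds check along the lines suggested in this thread (plain Python for illustration; the real check would operate on ds[coord].values, and the helper name is hypothetical):

```python
def selection_within_bounds(coord_values, sel_start, sel_end) -> bool:
    """Check a requested slice against the data's extent, regardless of
    whether the coordinate is stored ascending or descending."""
    lo, hi = min(coord_values), max(coord_values)
    sel_lo, sel_hi = min(sel_start, sel_end), max(sel_start, sel_end)
    return lo <= sel_lo and sel_hi <= hi


# Descending latitudes, as in the example above
latitudes = [10, 5, 0, -5, -10]
print(selection_within_bounds(latitudes, 8, -5))   # True: inside the data span
print(selection_within_bounds(latitudes, 15, -5))  # False: 15 exceeds the max
```

Comparing min/max of both the data and the requested slice avoids the false warning for descending coordinates.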
9 changes: 4 additions & 5 deletions tests/resources/sliced_example.danra.yaml
@@ -4,11 +4,6 @@ dataset_version: v0.1.0
output:
variables:
state: [time, grid_index, state_feature]
coord_ranges:
time:
start: 1990-09-03T00:00
end: 1990-09-09T00:00
step: PT3H
chunking:
time: 1
splitting:
@@ -58,5 +53,9 @@ inputs:
y:
start: -50000
end: -40000
time:
start: 1990-09-03T00:00
end: 1990-09-09T00:00
step: PT3H

target_output_variable: state
61 changes: 61 additions & 0 deletions tests/resources/sliced_example_with_datetime_strings.danra.yaml
@@ -0,0 +1,61 @@
schema_version: v0.6.0
dataset_version: v0.1.0

output:
variables:
state: [time, grid_index, state_feature]
chunking:
time: 1
splitting:
dim: time
splits:
train:
start: 1990-09-03T00:00
end: 1990-09-06T00:00
compute_statistics:
ops: [mean, std, diff_mean, diff_std]
dims: [grid_index, time]
val:
start: 1990-09-06T00:00
end: 1990-09-07T00:00
test:
start: 1990-09-07T00:00
end: 1990-09-09T00:00

inputs:
danra_height_levels:
path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/height_levels.zarr
dims: [time, x, y, altitude]
variables:
u:
altitude:
values: [100,]
units: m
v:
altitude:
values: [100, ]
units: m
dim_mapping:
time:
method: rename
dim: time
state_feature:
method: stack_variables_by_var_name
dims: [altitude]
name_format: "{var_name}{altitude}m"
grid_index:
method: stack
dims: [x, y]
coord_ranges:
x:
start: -50000
end: -40000
y:
start: -50000
end: -40000
time:
start: "1990-09-03T00:00"
end: "1990-09-09T00:00"
step: "PT3H"

target_output_variable: state
21 changes: 21 additions & 0 deletions tests/test_config.py
@@ -1,7 +1,10 @@
import datetime

import pytest
from dataclass_wizard.errors import MissingFields, UnknownJSONKey

import mllam_data_prep as mdp
from mllam_data_prep import config

INVALID_EXTRA_FIELDS_CONFIG_YAML = """
schema_version: v0.1.0
@@ -110,6 +113,16 @@ def test_get_config_issues():
"""


def test_can_load_config_with_datetime_object_in_time_range():
fp = "tests/resources/sliced_example.danra.yaml"
mdp.Config.from_yaml_file(fp)


def test_can_load_config_with_datetime_string_in_time_range():
fp = "tests/resources/sliced_example_with_datetime_strings.danra.yaml"
mdp.Config.from_yaml_file(fp)


def test_get_config_nested():
config = mdp.Config.from_yaml(VALID_EXAMPLE_CONFIG_YAML)

@@ -121,6 +134,14 @@ def test_get_config_nested():
input_config.foobarfield


def test_that_range_accepts_datetime():
start = datetime.datetime(1990, 9, 3, 0, 0)
end = datetime.datetime(1990, 9, 4, 0, 0)
step = "PT3H"

config.Range(start=start, end=end, step=step)


def test_config_roundtrip():
original_config = mdp.Config.from_yaml(VALID_EXAMPLE_CONFIG_YAML)
roundtrip_config_dict = mdp.Config.from_dict(original_config.to_dict())
86 changes: 0 additions & 86 deletions tests/test_from_config.py
@@ -112,91 +112,6 @@ def test_merging_static_and_surface_analysis():
mdp.create_dataset_zarr(fp_config=fp_config)


@pytest.mark.parametrize("source_data_contains_time_range", [True, False])
@pytest.mark.parametrize(
"time_stepsize",
[testdata.DT_ANALYSIS, testdata.DT_ANALYSIS * 2, testdata.DT_ANALYSIS / 2],
)
def test_time_selection(source_data_contains_time_range, time_stepsize):
Member

how come you are getting rid of this test? Don't you like it 😆

"""
Check that time selection works as expected, so that when source
data doesn't contain the time range specified in the config and exception
is raised, and otherwise that the correct timesteps are in the output
"""

tmpdir = tempfile.TemporaryDirectory()
datasets = testdata.create_data_collection(
data_kinds=["surface_analysis", "static"], fp_root=tmpdir.name
)

t_start_dataset = testdata.T_START
t_end_dataset = t_start_dataset + (testdata.NT_ANALYSIS - 1) * testdata.DT_ANALYSIS

if source_data_contains_time_range:
t_start_config = t_start_dataset
t_end_config = t_end_dataset
else:
t_start_config = t_start_dataset - testdata.DT_ANALYSIS
t_end_config = t_end_dataset + testdata.DT_ANALYSIS

config = dict(
schema_version=testdata.SCHEMA_VERSION,
dataset_version="v0.1.0",
output=dict(
variables=dict(
static=["grid_index", "feature"],
state=["time", "grid_index", "feature"],
forcing=["time", "grid_index", "feature"],
),
coord_ranges=dict(
time=dict(
start=t_start_config.isoformat(),
end=t_end_config.isoformat(),
step=isodate.duration_isoformat(time_stepsize),
)
),
),
inputs=dict(
danra_surface=dict(
path=datasets["surface_analysis"],
dims=["analysis_time", "x", "y"],
variables=testdata.DEFAULT_SURFACE_ANALYSIS_VARS,
dim_mapping=dict(
time=dict(
method="rename",
dim="analysis_time",
),
grid_index=dict(
method="stack",
dims=["x", "y"],
),
feature=dict(
method="stack_variables_by_var_name",
name_format="{var_name}",
),
),
target_output_variable="forcing",
),
),
)

# write yaml config to file
fn_config = "config.yaml"
fp_config = Path(tmpdir.name) / fn_config
with open(fp_config, "w") as f:
yaml.dump(config, f)

# run the main function
if source_data_contains_time_range and time_stepsize == testdata.DT_ANALYSIS:
mdp.create_dataset_zarr(fp_config=fp_config)
else:
print(
f"Expecting ValueError for source_data_contains_time_range={source_data_contains_time_range} and time_stepsize={time_stepsize}"
)
with pytest.raises(ValueError):
mdp.create_dataset_zarr(fp_config=fp_config)


@pytest.mark.parametrize("use_common_feature_var_name", [True, False])
def test_feature_collision(use_common_feature_var_name):
"""
@@ -360,7 +275,6 @@ def test_config_revision_examples(fp_example):
"""
tmpdir = tempfile.TemporaryDirectory()

# copy example to tempdir
fp_config_copy = Path(tmpdir.name) / fp_example.name
shutil.copy(fp_example, fp_config_copy)
