diff --git a/docs/high-level.ipynb b/docs/high-level.ipynb
index b372768..a3517a5 100644
--- a/docs/high-level.ipynb
+++ b/docs/high-level.ipynb
@@ -63,7 +63,7 @@
     "xbeam_ds"
    ],
    "outputs": [],
-   "execution_count": 2
+   "execution_count": 1
   },
   {
    "metadata": {
@@ -83,7 +83,7 @@
     "xarray_ds.chunk(chunks).to_zarr('example_data.zarr', mode='w')"
    ],
    "outputs": [],
-   "execution_count": 3
+   "execution_count": 2
   },
   {
    "metadata": {
@@ -186,7 +186,7 @@
     "xarray.open_zarr('example_climatology.zarr')"
    ],
    "outputs": [],
-   "execution_count": 6
+   "execution_count": 3
   },
   {
    "metadata": {
@@ -215,7 +215,7 @@
     "xarray.open_zarr('example_regrid.zarr')"
    ],
    "outputs": [],
-   "execution_count": 7
+   "execution_count": 4
   },
   {
    "metadata": {
@@ -245,7 +245,7 @@
     "  print(f'{type(e).__name__}: {e}')"
    ],
    "outputs": [],
-   "execution_count": 8
+   "execution_count": 5
   },
   {
    "metadata": {
@@ -262,11 +262,22 @@
    },
    "cell_type": "code",
    "source": [
-    "ds_beam = xbeam.Dataset.from_zarr('example_data.zarr')\n",
-    "ds_beam.map_blocks(lambda ds: ds.compute(), template=ds_beam.template)"
+    "ds_beam = xbeam.Dataset.from_zarr('example_data.zarr')\n",
+    "ds_beam.map_blocks(\n",
+    "    lambda ds: ds.compute(), template=ds_beam.template\n",
+    ")"
    ],
    "outputs": [],
-   "execution_count": 9
+   "execution_count": 6
   },
+  {
+   "metadata": {
+    "id": "-U4t0kKIkDvb"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "## Interfacing with low-level transforms"
+   ]
+  },
   {
    "metadata": {
@@ -274,7 +285,52 @@
    },
    "cell_type": "markdown",
    "source": [
-    "Sometimes, your computation doesn't fit into the ``map_blocks`` paradigm because you don't want to create `xarray.Dataset` objects. For these cases, you can switch to the lower-level Xarray-Beam [data model](data-model), and use raw Beam operations:"
+    "`Dataset` is a thin wrapper around Xarray-Beam transformations, so you can always drop into the lower-level Xarray-Beam [data model](data-model) and use raw Beam operations. This is especially useful for reading or writing data.\n",
+    "\n",
+    "```{warning}\n",
+    "The `Dataset` constructor currently performs **no validation** on its inputs!\n",
+    "```\n",
+    "\n",
+    "For example, here's how you could manually recreate a `Dataset`, using the common pattern of evaluating a single example in-memory to create a template with {py:func}`~xarray_beam.make_template` and {py:func}`~xarray_beam.replace_template_dims`:"
    ]
   },
+  {
+   "metadata": {
+    "id": "l9pHS1QDlMd-"
+   },
+   "cell_type": "code",
+   "source": [
+    "all_times = pd.date_range('2025-01-01', freq='1D', periods=365)\n",
+    "source_dataset = xarray.open_zarr('example_data.zarr', chunks=None)\n",
+    "\n",
+    "def load_chunk(time: pd.Timestamp) -\u003e tuple[xbeam.Key, xarray.Dataset]:\n",
+    "  key = xbeam.Key({'time': (time - all_times[0]).days})\n",
+    "  dataset = source_dataset.sel(time=[time])\n",
+    "  return key, dataset\n",
+    "\n",
+    "_, example = load_chunk(all_times[0])\n",
+    "\n",
+    "template = xbeam.make_template(example)\n",
+    "template = xbeam.replace_template_dims(template, time=all_times)\n",
+    "\n",
+    "ds_beam = xbeam.Dataset(\n",
+    "    template=template,\n",
+    "    chunks=xbeam.normalize_chunks({'time': 1}, template),\n",
+    "    split_vars=False,\n",
+    "    ptransform=(beam.Create(all_times) | beam.Map(load_chunk)),\n",
+    ")\n",
+    "ds_beam"
+   ],
+   "outputs": [],
+   "execution_count": 7
+  },
+  {
+   "metadata": {
+    "id": "1qjeY5mwlLGJ"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "You can also pull out the underlying Beam `ptransform` from a dataset to append new transformations, e.g., to write each element of the pipeline to disk as a separate file:"
+   ]
+  },
   {
@@ -288,16 +344,12 @@
     "  chunk.to_netcdf(path)\n",
     "\n",
     "with beam.Pipeline() as p:\n",
-    "  p | (\n",
-    "      xbeam.Dataset.from_zarr('example_data.zarr')\n",
-    "      .rechunk({'latitude': -1, 'longitude': -1})\n",
-    "      .ptransform\n",
-    "  ) | beam.MapTuple(to_netcdf)\n",
+    "  p | ds_beam.rechunk('50MB').ptransform | beam.MapTuple(to_netcdf)\n",
     "\n",
     "%ls *.nc"
    ],
    "outputs": [],
-   "execution_count": 10
+   "execution_count": 8
   }
  ],
  "metadata": {