|
63 | 63 | "xbeam_ds"
|
64 | 64 | ],
|
65 | 65 | "outputs": [],
|
66 |
| - "execution_count": 2 |
| 66 | + "execution_count": 1 |
67 | 67 | },
|
68 | 68 | {
|
69 | 69 | "metadata": {
|
|
83 | 83 | "xarray_ds.chunk(chunks).to_zarr('example_data.zarr', mode='w')"
|
84 | 84 | ],
|
85 | 85 | "outputs": [],
|
86 |
| - "execution_count": 3 |
| 86 | + "execution_count": 2 |
87 | 87 | },
|
88 | 88 | {
|
89 | 89 | "metadata": {
|
|
186 | 186 | "xarray.open_zarr('example_climatology.zarr')"
|
187 | 187 | ],
|
188 | 188 | "outputs": [],
|
189 |
| - "execution_count": 6 |
| 189 | + "execution_count": 3 |
190 | 190 | },
|
191 | 191 | {
|
192 | 192 | "metadata": {
|
|
215 | 215 | "xarray.open_zarr('example_regrid.zarr')"
|
216 | 216 | ],
|
217 | 217 | "outputs": [],
|
218 |
| - "execution_count": 7 |
| 218 | + "execution_count": 4 |
219 | 219 | },
|
220 | 220 | {
|
221 | 221 | "metadata": {
|
|
245 | 245 | " print(f'{type(e).__name__}: {e}')"
|
246 | 246 | ],
|
247 | 247 | "outputs": [],
|
248 |
| - "execution_count": 8 |
| 248 | + "execution_count": 5 |
249 | 249 | },
|
250 | 250 | {
|
251 | 251 | "metadata": {
|
|
262 | 262 | },
|
263 | 263 | "cell_type": "code",
|
264 | 264 | "source": [
|
265 |
| - "ds_beam = xbeam.Dataset.from_zarr('example_data.zarr')\n", |
266 |
| - "ds_beam.map_blocks(lambda ds: ds.compute(), template=ds_beam.template)" |
| 265 | + "(\n", |
| 266 | + " xbeam.Dataset.from_zarr('example_data.zarr')\n", |
| 267 | + " .map_blocks(lambda ds: ds.compute(), template=ds_beam.template)\n", |
| 268 | + ")" |
267 | 269 | ],
|
268 | 270 | "outputs": [],
|
269 |
| - "execution_count": 9 |
| 271 | + "execution_count": 6 |
| 272 | + }, |
| 273 | + { |
| 274 | + "metadata": { |
| 275 | + "id": "-U4t0kKIkDvb" |
| 276 | + }, |
| 277 | + "cell_type": "markdown", |
| 278 | + "source": [ |
| 279 | + "## Interfacing with Beam transforms" |
| 280 | + ] |
270 | 281 | },
|
271 | 282 | {
|
272 | 283 | "metadata": {
|
273 | 284 | "id": "75IG-22cKcuE"
|
274 | 285 | },
|
275 | 286 | "cell_type": "markdown",
|
276 | 287 | "source": [
|
277 |
| - "Sometimes, your computation doesn't fit into the ``map_blocks`` paradigm because you don't want to create `xarray.Dataset` objects. For these cases, you can switch to the lower-level Xarray-Beam [data model](data-model), and use raw Beam operations:" |
| 288 | + "`Dataset` is a thin wrapper around Xarray-Beam transformations, so you can always drop into the lower-level Xarray-Beam [data model](data-model) and use raw Beam operations. This is especially useful for the reading or writing data.\n", |
| 289 | + "\n", |
| 290 | + "For example, here's how you could manually recreate a `Dataset`, using the common pattern of evaluating a single example in-memory to create a template with {py:func}`~xarray_beam.make_template` and {py:func}`~xarray_beam.replace_template_dims`:" |
| 291 | + ] |
| 292 | + }, |
| 293 | + { |
| 294 | + "metadata": { |
| 295 | + "id": "l9pHS1QDlMd-" |
| 296 | + }, |
| 297 | + "cell_type": "code", |
| 298 | + "source": [ |
| 299 | + "all_times = pd.date_range('2025-01-01', freq='1D', periods=365)\n", |
| 300 | + "source_dataset = xarray.open_zarr('example_data.zarr', chunks=None)\n", |
| 301 | + "\n", |
| 302 | + "def load_chunk(time: pd.Timestamp) -\u003e tuple[xbeam.Key, xarray.Dataset]:\n", |
| 303 | + " key = xbeam.Key({'time': (time - all_times[0]).days})\n", |
| 304 | + " dataset = source_dataset.sel(time=[time])\n", |
| 305 | + " return key, dataset\n", |
| 306 | + "\n", |
| 307 | + "_, example = load_chunk(all_times[0])\n", |
| 308 | + "\n", |
| 309 | + "template = xbeam.make_template(example)\n", |
| 310 | + "template = xbeam.replace_template_dims(template, time=all_times)\n", |
| 311 | + "\n", |
| 312 | + "ds_beam = xbeam.Dataset(\n", |
| 313 | + " template=template,\n", |
| 314 | + " chunks=xbeam.normalize_chunks({'time': 1}, template),\n", |
| 315 | + " split_vars=False,\n", |
| 316 | + " ptransform=(beam.Create(all_times) | beam.Map(load_chunk)),\n", |
| 317 | + ")\n", |
| 318 | + "ds_beam" |
| 319 | + ], |
| 320 | + "outputs": [], |
| 321 | + "execution_count": 12 |
| 322 | + }, |
| 323 | + { |
| 324 | + "metadata": { |
| 325 | + "id": "1qjeY5mwlLGJ" |
| 326 | + }, |
| 327 | + "cell_type": "markdown", |
| 328 | + "source": [ |
| 329 | + "You can also pull-out the underlying Beam `ptransform` from a dataset to append new transformations, e.g., to write each element of the pipeline to disk as a separate file:" |
278 | 330 | ]
|
279 | 331 | },
|
280 | 332 | {
|
|
288 | 340 | " chunk.to_netcdf(path)\n",
|
289 | 341 | "\n",
|
290 | 342 | "with beam.Pipeline() as p:\n",
|
291 |
| - " p | (\n", |
292 |
| - " xbeam.Dataset.from_zarr('example_data.zarr')\n", |
293 |
| - " .rechunk({'latitude': -1, 'longitude': -1})\n", |
294 |
| - " .ptransform\n", |
295 |
| - " ) | beam.MapTuple(to_netcdf)\n", |
| 343 | + " p | ds_beam.rechunk('50MB').ptransform | beam.MapTuple(to_netcdf)\n", |
296 | 344 | "\n",
|
297 | 345 | "%ls *.nc"
|
298 | 346 | ],
|
299 | 347 | "outputs": [],
|
300 |
| - "execution_count": 10 |
| 348 | + "execution_count": 13 |
301 | 349 | }
|
302 | 350 | ],
|
303 | 351 | "metadata": {
|
|