
Commit e60b6ad

shoyer and Xarray-Beam authors authored and committed
Update discussion of rechunking, with a new illustration
PiperOrigin-RevId: 814436240
1 parent a3d29f8 commit e60b6ad

File tree

2 files changed

+44
-37
lines changed


docs/_static/pancake-vs-pencil.svg

Lines changed: 3 additions & 0 deletions

docs/high-level.ipynb

Lines changed: 41 additions & 37 deletions
@@ -55,15 +55,15 @@
5555
"import pandas as pd\n",
5656
"\n",
5757
"xarray_ds = xarray.Dataset(\n",
58-
" {'temperature': (('time', 'longitude', 'latitude'), np.random.randn(365, 180, 90))},\n",
58+
" {'temperature': (('time', 'longitude', 'latitude'), np.random.randn(365, 360, 180))},\n",
5959
" coords={'time': pd.date_range('2025-01-01', freq='1D', periods=365)},\n",
6060
")\n",
61-
"chunks = {'time': 100, 'longitude': 90, 'latitude': 90}\n",
61+
"chunks = {'time': '30MB', 'longitude': -1, 'latitude': -1}\n",
6262
"xbeam_ds = xbeam.Dataset.from_xarray(xarray_ds, chunks)\n",
6363
"xbeam_ds"
6464
],
6565
"outputs": [],
66-
"execution_count": 3
66+
"execution_count": 2
6767
},
6868
{
6969
"metadata": {
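The hunk above switches the time chunk spec from a fixed integer to a byte-size string (`'30MB'`). As a rough illustration of how such a target could resolve to an integer chunk length along time, given the 360 × 180 float64 spatial grid used in this notebook (plain Python; this is a sketch, not the actual `xarray_beam.normalize_chunks` implementation):

```python
# Sketch: resolve a byte-size target along 'time' into a whole number of
# time steps, with latitude and longitude kept whole. Illustrative only;
# not the real xarray_beam.normalize_chunks logic.

def resolve_time_chunk(target_bytes, spatial_size, itemsize=8):
    """Largest whole number of time steps fitting in target_bytes."""
    bytes_per_step = spatial_size * itemsize  # one full lat/lon slice
    return max(1, target_bytes // bytes_per_step)

# 360 longitudes x 180 latitudes of float64 per time step:
steps = resolve_time_chunk(30_000_000, 360 * 180)
print(steps)  # -> 57 time steps per ~30 MB chunk
```

Each time step of the full grid is 360 × 180 × 8 ≈ 0.52 MB, so a ~30 MB chunk holds 57 of them.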
@@ -83,7 +83,7 @@
8383
"xarray_ds.chunk(chunks).to_zarr('example_data.zarr', mode='w')"
8484
],
8585
"outputs": [],
86-
"execution_count": 4
86+
"execution_count": 3
8787
},
8888
{
8989
"metadata": {
@@ -102,56 +102,59 @@
102102
"\n",
103103
"All non-trivial computation happens via the embarrassingly parallel `map_blocks` method.\n",
104104
"\n",
105+
"### Chunking strategies\n",
106+
"\n",
105107
"In order for `map_blocks` to work, data needs to be appropriately chunked. Here are a few typical chunking patterns that work well for most needs:\n",
106108
"\n",
107-
"- \"Pencil\" chunks, which group together all times, and parallelize over space. These long and skinny chunks look like a box of pencils:"
109+
"- \"Pencil\" chunks, which group together all times, and parallelize over space. These long and skinny chunks look like a box of pencils.\n",
110+
"- \"Pancake\" chunks, which group together all spatial locations, and parallelize over time. These flat and wide chunks look like a stack of pancakes.\n",
111+
"- Intermediate \"compromise\" chunks, somewhere between these extremes. These optimize for worst-case behavior. "
108112
]
109113
},
110114
{
111115
"metadata": {
112-
"id": "EpOpwzZS9Rte"
113-
},
114-
"cell_type": "code",
115-
"source": [
116-
"xarray_ds.temperature.chunk({'time': -1, 'latitude': 20, 'longitude': 20}).data"
117-
],
118-
"outputs": [],
119-
"execution_count": 5
120-
},
121-
{
122-
"metadata": {
123-
"id": "74t0dh3E9kny"
116+
"id": "e7dkTphQyLzz"
124117
},
125118
"cell_type": "markdown",
126119
"source": [
127-
"- \"Pancake\" chunks, which group together all spatial locations, and parallelize over time. These flat and wide chunks look like a stack of pancakes:"
120+
"![Pancake vs Pencil chunks](./_static/pancake-vs-pencil.svg)\n"
128121
]
129122
},
130123
{
131124
"metadata": {
132-
"id": "fho4ub-69mD5"
125+
"id": "t3APp6uB9yjT"
133126
},
134-
"cell_type": "code",
127+
"cell_type": "markdown",
135128
"source": [
136-
"xarray_ds.temperature.chunk({'time': 1, 'latitude': -1, 'longitude': -1}).data"
137-
],
138-
"outputs": [],
139-
"execution_count": 6
129+
"Weather/climate datasets are typically generated and stored in \"pancake\" chunks, but \"pencil\" chunks are more suitable for most analytics queries, which require large histories of weather at small numbers of locations.\n",
130+
"\n",
131+
"Using the right chunks is *absolutely essential* for efficient operations with Xarray-Beam and Zarr. For example, reading data from a single location across all times (a \"pencil\" query) is extremely inefficient for a dataset stored in \"pancake\" chunks -- it would require loading the entire dataset from disk!\n",
132+
"\n",
133+
"[Like Dask](https://docs.dask.org/en/latest/array-best-practices.html), Xarray-Beam works best with chunks that are 10s to 100s of MB in size, large enough to amortize Python overhead but small enough that chunks fit easily into memory. Smaller chunk sizes in Zarr stores (closer to 10MB) can be convenient to increase the flexibility of chunking methods when reading the data from disk later.\n",
134+
"\n",
135+
"You can explicitly adjust chunk sizes with {py:meth}`~xarray_beam.Dataset.rechunk`. The syntax is a mapping from dimension names (or `...`, for all other dimensions) to chunk sizes, which can be any of `-1` (to indicate \"size of the full dimension\"), a positive integer, or a string indicating a target size in bytes like `'100MB'`, e.g.,\n",
136+
"- `chunks=-1`: no chunking.\n",
137+
"- `chunks='100 MB'`: chunk size of ~100 MB, without dimension-specific constraints.\n",
138+
"- `chunks={'time': 100}`: fixed size of 100 along time, preserving original chunks along other dimensions.\n",
139+
"- `chunks={'time': -1, ...: '100 MB'}`: pencil chunks of ~100 MB each.\n",
140+
"- `chunks={'latitude': -1, 'longitude': -1, ...: '100 MB'}`: pancake chunks of ~100 MB each.\n",
141+
"\n",
142+
"See {py:func}`~xarray_beam.normalize_chunks` for full details."
143+
]
140144
},
141145
{
142146
"metadata": {
143-
"id": "t3APp6uB9yjT"
147+
"id": "noNiRxGrGX3L"
144148
},
145149
"cell_type": "markdown",
146150
"source": [
147-
"Weather/climate datasets are typically generated and stored in \"pancake\" chunks, but \"pencil\" chunks are more suitable for most analytics queries, which requires large histories of weather at small numbers of locations.\n",
148151
"\n",
149-
"Using the right chunks is *absolutely essentially* for efficient operations with Xarray-Beam and Zarr. For example, reading data from a single location across all times (a \"pencil\" query) is extremely inefficient for a dataset stored in \"pancake\" chunks -- it would require loading the entire dataset from disk!\n",
150-
"\n",
151-
"Rechunking is a fundamentally an expensive operation (it requires multiple complete reads/writes of a dataset from disk), but in Xarray-Beam it's straightforward, via {py:meth}`~xarray_beam.Dataset.rechunk`.\n",
152+
"```{warning}\n",
153+
"Rechunking between very different schemes is fundamentally an expensive operation (it requires multiple complete reads/writes of a dataset from disk). If performance and flexibility are critical, it may be worth storing multiple copies of your data in different chunking formats.\n",
154+
"```\n",
152155
"\n",
153156
"```{tip}\n",
154-
"Intermediate \"compromise\" chunks can sometimes be a good idea, although if performance and flexibility are critical it may be worth storing multiple copies of your data in different formats. Using Zarr v3's sharding feature in {py:meth}`~xarray_beam.Dataset.to_zarr` to group smaller chunks into shards can also help mitigate the challenges of picking an optimal chunk size.\n",
157+
"Using Zarr v3's sharding feature in {py:meth}`~xarray_beam.Dataset.to_zarr` to group smaller chunks (~1 MB) into shards can also help mitigate the challenges of picking an optimal chunk size.\n",
155158
"```"
156159
]
157160
},
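The trade-off described in the cell above (pencil queries against pancake chunks read the whole dataset) can be checked with a quick back-of-the-envelope calculation. A minimal sketch in plain Python, no Xarray-Beam required, using the 365 × 360 × 180 float64 temperature array from this notebook; the 20 × 20 pencil tile size is an illustrative choice:

```python
# Sketch: compare per-chunk size and the cost of a single-point
# time-series query under "pencil" vs "pancake" chunk layouts.
from math import prod

SHAPE = {'time': 365, 'longitude': 360, 'latitude': 180}
ITEMSIZE = 8  # float64

def chunk_nbytes(chunk_shape):
    """Bytes occupied by a single chunk of the given shape."""
    return prod(chunk_shape.values()) * ITEMSIZE

def chunks_for_point_timeseries(chunk_shape):
    """Chunks touched by one lat/lon point across all times:
    one chunk per time-chunk, i.e. ceil(365 / time_chunk)."""
    return -(-SHAPE['time'] // chunk_shape['time'])

pencil = {'time': 365, 'longitude': 20, 'latitude': 20}
pancake = {'time': 1, 'longitude': 360, 'latitude': 180}

for name, c in [('pencil', pencil), ('pancake', pancake)]:
    n = chunks_for_point_timeseries(c)
    mb = n * chunk_nbytes(c) / 1e6
    print(f'{name}: {n} chunks read, {mb:.1f} MB')
# pencil:  1 chunk read, ~1.2 MB
# pancake: 365 chunks read, ~189.2 MB (the entire dataset)
```

Reading one point's full time series touches a single ~1.2 MB pencil chunk, but every one of the 365 pancake chunks, i.e. all ~189 MB of the array.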
@@ -175,14 +178,15 @@
175178
"with beam.Pipeline() as p:\n",
176179
" p | (\n",
177180
" xbeam.Dataset.from_zarr('example_data.zarr')\n",
178-
" .rechunk({'time': -1, 'latitude': 30, 'longitude': 30})\n",
181+
" .rechunk({'time': -1, ...: '100 MB'})\n",
179182
" .map_blocks(lambda ds: ds.groupby('time.month').mean())\n",
183+
" .rechunk('10 MB') # ensure a reasonable min chunk-size for Zarr\n",
180184
" .to_zarr('example_climatology.zarr')\n",
181185
" )\n",
182186
"xarray.open_zarr('example_climatology.zarr')"
183187
],
184188
"outputs": [],
185-
"execution_count": 7
189+
"execution_count": 6
186190
},
187191
{
188192
"metadata": {
@@ -204,14 +208,14 @@
204208
"with beam.Pipeline() as p:\n",
205209
" p | (\n",
206210
" xbeam.Dataset.from_zarr('example_data.zarr')\n",
207-
" .rechunk({'time': 10, 'latitude': -1, 'longitude': -1})\n",
211+
" .rechunk({'time': '30MB', 'latitude': -1, 'longitude': -1})\n",
208212
" .map_blocks(lambda ds: ds.coarsen(latitude=2, longitude=2).mean())\n",
209213
" .to_zarr('example_regrid.zarr')\n",
210214
" )\n",
211215
"xarray.open_zarr('example_regrid.zarr')"
212216
],
213217
"outputs": [],
214-
"execution_count": 8
218+
"execution_count": 7
215219
},
216220
{
217221
"metadata": {
@@ -241,7 +245,7 @@
241245
" print(f'{type(e).__name__}: {e}')"
242246
],
243247
"outputs": [],
244-
"execution_count": 9
248+
"execution_count": 8
245249
},
246250
{
247251
"metadata": {
@@ -262,7 +266,7 @@
262266
"ds_beam.map_blocks(lambda ds: ds.compute(), template=ds_beam.template)"
263267
],
264268
"outputs": [],
265-
"execution_count": 10
269+
"execution_count": 9
266270
},
267271
{
268272
"metadata": {
@@ -293,7 +297,7 @@
293297
"%ls *.nc"
294298
],
295299
"outputs": [],
296-
"execution_count": 14
300+
"execution_count": 10
297301
}
298302
],
299303
"metadata": {
