clemsgrs · Mar 23, 2026
diff --git a/.github/workflows/pr-test.yaml b/.github/workflows/pr-test.yaml
@@ -1,4 +1,4 @@
-name: Test WSI to embedding consistency
+name: Test suite
 
 on:
   pull_request:

diff --git a/Dockerfile b/Dockerfile
@@ -30,6 +30,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
     libtiff-dev \
     cmake \
     zlib1g-dev \
+    libnuma1 \
     curl \
     vim screen \
     zip unzip \
@@ -104,6 +105,7 @@ ENV PATH="/home/user/.local/bin:${PATH}"
 RUN apt-get update && apt-get install -y --no-install-recommends \
     libtiff-dev \
     zlib1g-dev \
+    libnuma1 \
     curl \
     vim screen \
     zip unzip \

diff --git a/Dockerfile.ci b/Dockerfile.ci
@@ -20,6 +20,7 @@ WORKDIR /opt/app
 RUN apt-get update && apt-get install -y --no-install-recommends \
     libtiff-dev \
     zlib1g-dev \
+    libnuma1 \
     curl \
     cmake \
     vim screen \
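The Dockerfile hunks above add `libnuma1` to every image, and the `gpu_decode` requirements in `docs/cli.md` list it as a prerequisite for GPU tile decoding. A quick runtime check inside a container can confirm that the library is visible to the dynamic loader. This is a minimal sketch and not part of slide2vec. `has_libnuma` is a hypothetical helper built only on the standard library's `ctypes.util.find_library`, which on Linux consults the `ldconfig` cache.

```python
import ctypes.util


def has_libnuma() -> bool:
    """Return True if the dynamic loader can locate libnuma (e.g. libnuma.so.1).

    Hypothetical helper, not part of slide2vec.
    """
    # find_library("numa") searches the standard library locations and returns
    # a name such as "libnuma.so.1", or None when the library is not installed.
    return ctypes.util.find_library("numa") is not None
```

Calling `has_libnuma()` in an image built from one of these Dockerfiles should return `True`. A `False` result suggests that `libnuma1` is missing from that image.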

diff --git a/Dockerfile.coding-agents b/Dockerfile.coding-agents
diff --git a/README.md b/README.md
@@ -21,38 +21,37 @@ pip install "slide2vec[models]"
 ## Python API
 
 ```python
-from slide2vec import Model, PreprocessingConfig
+from slide2vec import Model
+from slide2vec.utils.config import hf_login
 
-model = Model.from_pretrained("virchow2", level="tile")
-preprocessing = PreprocessingConfig(
-    target_spacing_um=0.5,
-    target_tile_size_px=224,
-    tissue_threshold=0.1,
-)
-embedded = model.embed_slide(
-    "/path/to/slide.svs",
-    preprocessing=preprocessing,
-)
+hf_login()
+
+model = Model.from_preset("virchow2")
+embedded = model.embed_slide("/path/to/slide.svs")
 
 tile_embeddings = embedded.tile_embeddings
 coordinates = embedded.coordinates
 ```
 
-By default, `ExecutionOptions()` uses all available GPUs. Set `ExecutionOptions(num_gpus=4)` when you want to cap the sharding explicitly.
-
 Use `Pipeline(...)` for manifest-driven batch processing when you want artifacts written to disk instead of only in-memory outputs:
 
 ```python
-from slide2vec import ExecutionOptions, Pipeline
+from slide2vec import ExecutionOptions, Pipeline, PreprocessingConfig
 
 pipeline = Pipeline(
     model=model,
-    preprocessing=preprocessing,
+    preprocessing=PreprocessingConfig(
+        target_spacing_um=0.5,
+        target_tile_size_px=224,
+        tissue_threshold=0.1,
+    ),
     execution=ExecutionOptions(output_dir="outputs/demo"),
 )
 result = pipeline.run(manifest_path="/path/to/slides.csv")
 ```
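The `PreprocessingConfig` values above describe tiles in physical units. The sketch below illustrates the arithmetic: a 224 px tile at 0.5 µm/px spans 112 µm, so on a slide scanned at 0.25 µm/px the same area covers 448 native pixels. It assumes tiles are located via the slide's level-0 spacing (the manifest's optional `spacing_at_level_0` column). It does not necessarily reflect how slide2vec picks pyramid levels or resizes tiles, and `level0_footprint_px` is a hypothetical name.

```python
def level0_footprint_px(
    target_tile_size_px: int,
    target_spacing_um: float,
    spacing_at_level_0: float,
) -> int:
    """Side length, in level-0 pixels, of a tile requested at target_spacing_um.

    Hypothetical helper for illustration only.
    """
    # Physical width of the requested tile, in micrometers.
    tile_width_um = target_tile_size_px * target_spacing_um
    # Number of native (level-0) pixels spanning that same physical width.
    return round(tile_width_um / spacing_at_level_0)


# 224 px at 0.5 um/px = 112 um; at 0.25 um/px native spacing that is 448 px.
print(level0_footprint_px(224, 0.5, 0.25))  # prints 448
```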
 
+By default, `ExecutionOptions()` uses all available GPUs. Set `ExecutionOptions(num_gpus=4)` to cap the number of GPUs used for sharding.
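To make the "all GPUs unless capped" rule concrete, here is a minimal sketch of resolving a GPU count and splitting manifest rows across that many workers. Several details are assumptions: that an oversized `num_gpus` is clamped rather than rejected, and that slides are assigned round-robin. slide2vec may balance work differently, for example by slide size or through a dynamic queue. `resolve_num_gpus` and `shard_round_robin` are hypothetical names.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def resolve_num_gpus(requested: Optional[int], available: int) -> int:
    """None means "use every visible GPU"; otherwise cap at what is available.

    Hypothetical helper: clamping (rather than raising) is an assumption.
    """
    if requested is None:
        return available
    if requested < 1:
        raise ValueError("num_gpus must be >= 1")
    return min(requested, available)


def shard_round_robin(items: Sequence[T], num_shards: int) -> list[list[T]]:
    """Assign item i to shard i % num_shards (hypothetical strategy)."""
    if num_shards < 1:
        raise ValueError("need at least one shard (no GPUs available?)")
    shards: list[list[T]] = [[] for _ in range(num_shards)]
    for i, item in enumerate(items):
        shards[i % num_shards].append(item)
    return shards


# Asking for 4 GPUs on a 2-GPU machine yields 2 shards.
print(shard_round_robin(["a", "b", "c", "d", "e"], resolve_num_gpus(4, 2)))
# prints [['a', 'c', 'e'], ['b', 'd']]
```

Round-robin keeps the shards within one slide of each other in count, but it ignores slide size. Size-aware or dynamic scheduling usually balances better when slides vary widely.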
+
 ### Input Manifest
 
 Manifest-driven runs use the schema below. `mask_path` and `spacing_at_level_0` are optional.
@@ -81,7 +80,7 @@ The package writes explicit artifact directories:
 
 ### Supported Models
 
-`slide2vec` currently ships preset configs for 10 tile-level models and 3 slide-level models.  
+`slide2vec` currently ships preset configs for 20 tile-level models and 3 slide-level models.  
 For the full catalog and preset names, see [`docs/models.md`](docs/models.md).
 
 ## CLI
@@ -115,4 +114,5 @@ docker run --rm -it \
 
 - [`docs/cli.md`](docs/cli.md) for the config-driven CLI guide
 - [`docs/python-api.md`](docs/python-api.md) for the detailed API reference
+- [`tutorials/api_walkthrough.ipynb`](tutorials/api_walkthrough.ipynb) for a notebook walkthrough of the API
 - [`docs/models.md`](docs/models.md) for the full supported-model catalog
diff --git a/docs/cli.md b/docs/cli.md
@@ -96,6 +96,37 @@ Common overrides:
 
 ## GPU Behavior
 
+### GPU-accelerated tile decoding (`gpu_decode`)
+
+When using the on-the-fly cucim backend (`tiling.on_the_fly: true`, `tiling.backend: cucim` or `auto`), slide2vec can decode tiles on the GPU during embedding.
+
+It is enabled by default. To set it explicitly in your config:
+
+```yaml
+tiling:
+  gpu_decode: true  # default
+```
+
+Or override from the command line:
+
+```shell
+python -m slide2vec --config-file /path/to/config.yaml tiling.gpu_decode=true
+```
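A dotted override such as `tiling.gpu_decode=true` addresses a nested key in the YAML config. The sketch below shows that mapping using only the standard library. It is an illustration rather than slide2vec's parser. The real CLI most likely delegates to its config library and may accept more value syntaxes, for example YAML-style `True`. `apply_override` is a hypothetical name.

```python
import json
from typing import Any


def apply_override(config: dict[str, Any], override: str) -> dict[str, Any]:
    """Apply an "a.b.c=value" override to a nested dict in place and return it.

    Hypothetical helper for illustration only.
    """
    key_path, sep, raw_value = override.partition("=")
    if not sep:
        raise ValueError(f"expected key=value, got {override!r}")
    try:
        # JSON literals cover true/false, numbers and null.
        value: Any = json.loads(raw_value)
    except json.JSONDecodeError:
        # Anything else (e.g. "cucim" or "auto") is kept as a plain string.
        value = raw_value
    *parents, leaf = key_path.split(".")
    node = config
    for key in parents:
        # Walk (or create) the intermediate sections, e.g. "tiling".
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


cfg = {"tiling": {"gpu_decode": True, "backend": "cucim"}}
apply_override(cfg, "tiling.gpu_decode=false")
print(cfg)  # prints {'tiling': {'gpu_decode': False, 'backend': 'cucim'}}
```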
+
+When enabled, two things happen:
+1. `ENABLE_CUSLIDE2=1` is set in the process environment before CuCIM is imported, activating NVIDIA's cuSlide2 GPU-accelerated SVS/TIFF reader.
+2. `device="cuda"` is passed to CuCIM's `read_region`, so batch JPEG decoding runs on the GPU via nvImageCodec.
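The first step only works if the variable is in the environment before CuCIM is imported. Here is a minimal sketch of that ordering guard. `enable_cuslide2` is a hypothetical helper, not slide2vec's code, and the assumption that CuCIM reads `ENABLE_CUSLIDE2` at import time follows from the ordering requirement stated above.

```python
import os
import sys
import warnings


def enable_cuslide2() -> bool:
    """Set ENABLE_CUSLIDE2=1 if CuCIM has not been imported yet.

    Hypothetical helper. Returns True if the variable should take effect.
    """
    if "cucim" in sys.modules:
        # Too late to change behavior now; it only counts if the variable was
        # already set before the import happened.
        warnings.warn("cucim already imported; setting ENABLE_CUSLIDE2 now has no effect")
        return os.environ.get("ENABLE_CUSLIDE2") == "1"
    os.environ["ENABLE_CUSLIDE2"] = "1"
    return True
```

Call it at the very top of the entry point, before any module that imports `cucim` is loaded.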
+
+This can give a significant speedup (~3.8× for batch decoding) on `.svs` and `.tif` files.
+
+**Note:** decoded pixels are currently copied back to host memory via `np.asarray` before being fed into the DataLoader. The decode step itself is faster on the GPU, but the data still round-trips through the CPU before reaching the model. A true zero-copy path would require bypassing the DataLoader entirely and is tracked in `ideas-to-explore.md`.
+
+**Requirements:** `libnuma1` must be installed and `nvImageCodec` must be available (included with `cucim-cu12`). If the installed CuCIM version does not support `device="cuda"`, slide2vec falls back silently to CPU decoding.
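The fallback described above can be sketched as "try `device="cuda"`, otherwise call the reader with its default (CPU) device". This mirrors the documented behavior rather than slide2vec's implementation. `read_region_with_fallback` is hypothetical, and the exception types an unsupported CuCIM build would raise are assumptions. Unlike the silent fallback in slide2vec, the sketch also returns which device was used, so the caller can log it.

```python
from typing import Any, Callable


def read_region_with_fallback(
    read_region: Callable[..., Any], *args: Any, **kwargs: Any
) -> tuple[Any, str]:
    """Try GPU decoding first; fall back to the reader's default CPU path.

    Hypothetical helper. Returns (region, device_actually_used).
    """
    try:
        return read_region(*args, device="cuda", **kwargs), "cuda"
    except (TypeError, ValueError, RuntimeError):
        # Older readers may not accept the `device` keyword at all (TypeError),
        # so retry without it instead of passing device="cpu".
        return read_region(*args, **kwargs), "cpu"


def old_reader(location, size, level=0):
    # Stand-in for a reader without GPU support.
    return "cpu-region"


print(read_region_with_fallback(old_reader, (0, 0), (224, 224)))
# prints ('cpu-region', 'cpu')
```

Catching broad exceptions keeps decoding alive on older installs. The trade-off is that a genuine CUDA error inside a GPU-capable reader would also be masked as a CPU fallback, so logging the returned device is worthwhile.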
+
+**Default:** `true` — disable with `tiling.gpu_decode: false` if needed.
+
+### GPU count
+
 By default, the CLI uses all available GPUs.
 
 To cap GPU usage, set: