Merged
41 changes: 34 additions & 7 deletions docs/models.md
@@ -8,7 +8,7 @@ The canonical model presets are registered in code and documented below. Use the

Preset-specific behavior lives in registry metadata and, where supported, `model.output_variant`.

## Tile-level models (17)
## Tile-level models (18)

| Preset | Model | Supported Spacing (um) | Notes |
| --- | --- | --- | --- |
@@ -20,6 +20,7 @@ Preset-specific behavior lives in registry metadata and, where supported, `model
| `h0-mini` | [H0-mini](https://huggingface.co/bioptimus/H0-mini) | `0.5` | Supports `output_variant="cls"` or `"cls_patch_mean"` |
| `hibou-b` | [Hibou-B](https://huggingface.co/histai/hibou-b) | `0.5` | |
| `hibou-l` | [Hibou-L](https://huggingface.co/histai/hibou-L) | `0.5` | |
| `lunit` | [Lunit ViT-S/8](https://huggingface.co/1aurent/vit_small_patch8_224.lunit_dino) | `0.5` | 384-dim; used as tile backbone for MOOZY |
| `midnight` | [MidNight12k](https://huggingface.co/kaiko-ai/midnight) | `0.25`, `0.5`, `1.0`, `2.0` | Alias: `kaiko-midnight` |
| `musk` | [MUSK](https://huggingface.co/xiangjx/musk) | `0.25`, `0.5`, `1.0` | Supports `output_variant="ms_aug"` (2048-dim, default) or `"cls"` (1024-dim). |
| `phikon` | [Phikon](https://huggingface.co/owkin/phikon) | `0.5` | |
@@ -30,10 +31,36 @@ Preset-specific behavior lives in registry metadata and, where supported, `model
| `virchow` | [Virchow](https://huggingface.co/paige-ai/Virchow) | `0.5` | Supports `output_variant="cls"` or `"cls_patch_mean"` |
| `virchow2` | [Virchow2](https://huggingface.co/paige-ai/Virchow2) | `0.5`, `1.0`, `2.0` | Supports `output_variant="cls"` or `"cls_patch_mean"` |

## Slide-level models (3)
## Slide-level models (4)

| Preset | Model | Tile Encoder | Supported Spacing (um) |
| --- | --- | --- | --- |
| `gigapath-slide` | [Prov-GigaPath](https://huggingface.co/prov-gigapath/prov-gigapath) | `gigapath` | `0.5` |
| `prism` | [PRISM](https://huggingface.co/paige-ai/PRISM) | `virchow` (cls_patch_mean) | `0.5` |
| `titan` | [TITAN](https://huggingface.co/MahmoodLab/TITAN) | `conchv15` | `0.5` |
| Preset | Model | Tile Encoder | Supported Spacing (um) | Notes |
| --- | --- | --- | --- | --- |
| `gigapath-slide` | [Prov-GigaPath](https://huggingface.co/prov-gigapath/prov-gigapath) | `gigapath` | `0.5` | |
| `moozy-slide` | [MOOZY](https://huggingface.co/AtlasAnalyticsLab/MOOZY) | `lunit` | `0.5` | 768-dim slide embedding; standalone slide encoder from the MOOZY stage-2 checkpoint |
| `prism` | [PRISM](https://huggingface.co/paige-ai/PRISM) | `virchow` (cls_patch_mean) | `0.5` | |
| `titan` | [TITAN](https://huggingface.co/MahmoodLab/TITAN) | `conchv15` | `0.5` | |

## Patient-level models (1)

Patient-level models aggregate multiple slide embeddings for the same patient into a single patient-level embedding. They require a `patient_id` column in the input manifest CSV (or `patient_id` keys in each slide dict when using the Python API).

| Preset | Model | Tile Encoder | Supported Spacing (um) | Notes |
| --- | --- | --- | --- | --- |
| `moozy` | [MOOZY](https://huggingface.co/AtlasAnalyticsLab/MOOZY) | `lunit` | `0.5` | 768-dim patient embedding; runs Lunit tile encoder → MOOZY slide encoder → CaseAggregator transformer |

### Patient manifest format

Add a `patient_id` column to the standard manifest CSV to group slides by patient:

```csv
sample_id,image_path,patient_id
slide_1a,/data/slide_1a.svs,patient_1
slide_1b,/data/slide_1b.svs,patient_1
slide_2a,/data/slide_2a.svs,patient_2
```

`sample_id` remains the unique slide identifier. Multiple rows may share the same `patient_id`.
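As a toy sketch of the grouping this manifest implies (a hypothetical helper using only the standard library, not part of slide2vec), slides sharing a `patient_id` collect into one group, ordered by first appearance:

```python
import csv
import io

manifest = """sample_id,image_path,patient_id
slide_1a,/data/slide_1a.svs,patient_1
slide_1b,/data/slide_1b.svs,patient_1
slide_2a,/data/slide_2a.svs,patient_2
"""

def group_by_patient(csv_text: str) -> dict[str, list[str]]:
    groups: dict[str, list[str]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # setdefault preserves first-appearance order (dicts are insertion-ordered)
        groups.setdefault(row["patient_id"], []).append(row["sample_id"])
    return groups

print(group_by_patient(manifest))
# {'patient_1': ['slide_1a', 'slide_1b'], 'patient_2': ['slide_2a']}
```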

### Per-slide embeddings

When running a patient-level model via `Pipeline`, the intermediate per-slide MOOZY embeddings can be saved alongside the patient embeddings by setting `save_slide_embeddings: true` in config (or `ExecutionOptions(save_slide_embeddings=True)` in the Python API). Saved slide embeddings are written to `slide_embeddings/` in the output directory.
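As a sketch, a config fragment enabling this could look like the following (the `save_slide_embeddings` key follows `slide2vec/configs/default.yaml` in this PR; the `name` key for selecting the preset is an assumption — check the actual config schema):

```yaml
model:
  name: moozy                  # assumed preset-selection key
  save_slide_embeddings: true  # per-slide MOOZY embeddings written to slide_embeddings/
```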
53 changes: 51 additions & 2 deletions docs/python-api.md
@@ -2,7 +2,7 @@

`slide2vec` exposes two main workflows:

- direct in-memory embedding with `Model.embed_slide(...)` and `Model.embed_slides(...)`
- direct in-memory embedding with `Model.embed_slide(...)`, `Model.embed_slides(...)`, `Model.embed_patient(...)`, and `Model.embed_patients(...)`
- artifact generation with `Pipeline.run(...)`

## Minimal interactive usage
@@ -108,12 +108,60 @@ Common fields:
- `output_dir`
- `output_format` - `"pt"` (default) or `"npz"`
- `save_tile_embeddings` - persist tile embeddings for slide-level models (default `False`)
- `save_slide_embeddings` - persist per-slide embeddings when running a patient-level model (default `False`)
- `save_latents` - persist latent representations when available (default `False`)

`num_gpus` defaults to all available GPUs. `embed_slide(...)` uses tile sharding for one slide, and `embed_slides(...)` balances whole slides across GPUs while preserving input order.
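As an illustrative sketch of the scheduling just described (not slide2vec's actual implementation), balancing whole slides across GPUs while preserving input order might look like:

```python
# Illustrative: round-robin whole slides across GPUs, then reassemble
# results by their original input index.
def assign_round_robin(slides: list[str], num_gpus: int) -> dict[int, list[tuple[int, str]]]:
    shards: dict[int, list[tuple[int, str]]] = {g: [] for g in range(num_gpus)}
    for idx, slide in enumerate(slides):
        shards[idx % num_gpus].append((idx, slide))
    return shards

def run_and_reorder(slides: list[str], num_gpus: int) -> list[str]:
    shards = assign_round_robin(slides, num_gpus)
    results: list[str | None] = [None] * len(slides)
    for _gpu, work in shards.items():
        for idx, slide in work:
            results[idx] = f"embedding({slide})"  # placeholder for real inference
    return results  # same order as the input, regardless of shard layout

print(run_and_reorder(["a.svs", "b.svs", "c.svs"], num_gpus=2))
# ['embedding(a.svs)', 'embedding(b.svs)', 'embedding(c.svs)']
```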

If you need persisted artifact generation without using `Pipeline.run(...)`, use `Model.embed_tiles(...)` and `Model.aggregate_tiles(...)`.

## Patient-level embedding

For patient-level models (e.g. `moozy`), use `Model.embed_patient(...)` for a single patient or `Model.embed_patients(...)` for a batch of patients.

### Single patient

```python
from slide2vec import Model

model = Model.from_preset("moozy")
result = model.embed_patient(
    ["/data/slide_1a.svs", "/data/slide_1b.svs"],
    patient_id="patient_1",
)

print(result.patient_id) # "patient_1"
print(result.patient_embedding.shape) # torch.Size([768])
print(result.slide_embeddings) # {"slide_1a": tensor, "slide_1b": tensor}
```

`embed_patient(...)` returns a single `EmbeddedPatient`. The `patient_id` argument is optional — when omitted, it is read from `patient_id` keys in the slide dicts, or falls back to `sample_id`.

### Multiple patients

```python
results = model.embed_patients(
    [
        {"sample_id": "slide_1a", "image_path": "/data/slide_1a.svs", "patient_id": "patient_1"},
        {"sample_id": "slide_1b", "image_path": "/data/slide_1b.svs", "patient_id": "patient_1"},
        {"sample_id": "slide_2a", "image_path": "/data/slide_2a.svs", "patient_id": "patient_2"},
    ]
)

for r in results:
    print(r.patient_id, r.patient_embedding.shape)
```

`embed_patients(...)` returns one `EmbeddedPatient` per unique patient, ordered by first appearance. Pass an explicit `patient_id_map` dict (`{sample_id: patient_id}`) to override the per-slide `patient_id` keys.

Each `EmbeddedPatient` has:

- `patient_id`
- `patient_embedding` — tensor of shape `(D,)` (768 for MOOZY)
- `slide_embeddings` — `{sample_id: tensor}` for each contributing slide

Both methods raise a `ValueError` if called on a non-patient-level model.

## Hierarchical Feature Extraction

Hierarchical mode spatially groups tiles into regions before embedding, producing outputs with shape `(num_regions, tiles_per_region, feature_dim)`. This is useful for downstream models that consume region-level spatial structure rather than flat tile bags.
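One way to picture the spatial grouping is a toy sketch that buckets tile coordinates into fixed-size square regions (assumed square regions; this is not slide2vec's actual algorithm, and a real implementation would also pad or truncate each region to `tiles_per_region`):

```python
def group_tiles(coords: list[tuple[int, int]], tile_size: int, tiles_per_side: int):
    # Bucket tile top-left coordinates into square regions of
    # tiles_per_side x tiles_per_side tiles.
    region_px = tile_size * tiles_per_side  # region edge length in pixels
    regions: dict[tuple[int, int], list[tuple[int, int]]] = {}
    for x, y in coords:
        key = (x // region_px, y // region_px)
        regions.setdefault(key, []).append((x, y))
    return regions

coords = [(0, 0), (256, 0), (0, 256), (4096, 0)]
print(group_tiles(coords, tile_size=256, tiles_per_side=16))
# region (0, 0) holds the first three tiles; (4096, 0) starts region (1, 0)
```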
@@ -170,9 +218,10 @@ result = pipeline.run(manifest_path="/path/to/slides.csv")
- `tile_artifacts`
- `hierarchical_artifacts`
- `slide_artifacts`
- `patient_artifacts` — populated when using a patient-level model (e.g. `moozy`); one entry per unique patient, written to `patient_embeddings/` in the output directory
- `process_list_path`

The manifest schema matches HS2P and accepts optional `mask_path` and `spacing_at_level_0` columns.
The manifest schema matches HS2P and accepts optional `mask_path` and `spacing_at_level_0` columns. Patient-level models additionally require a `patient_id` column; see [Patient manifest format](models.md#patient-manifest-format).

### Reusing pre-extracted coordinates

4 changes: 4 additions & 0 deletions pyproject.toml
@@ -71,6 +71,9 @@ hibou = [
"scipy~=1.8.1",
"scikit-image~=0.19.3",
]
moozy = [
"moozy",
]
titan = [
    "torch==2.0.1",
    "timm==1.0.3",
@@ -106,6 +109,7 @@ fm = [
"scikit-survival",
"scikit-learn",
"fairscale",
"moozy",
"packaging==23.2",
"ninja==1.11.1.1",
"psutil<6",
87 changes: 87 additions & 0 deletions slide2vec/api.py
@@ -11,6 +11,7 @@

from slide2vec.artifacts import (
    HierarchicalEmbeddingArtifact,
    PatientEmbeddingArtifact,
    SlideEmbeddingArtifact,
    TileEmbeddingArtifact,
)
@@ -127,6 +128,7 @@ class ExecutionOptions:
    prefetch_factor: int = 4
    persistent_workers: bool = True
    save_tile_embeddings: bool = False
    save_slide_embeddings: bool = False
    save_latents: bool = False

    @classmethod
@@ -151,6 +153,7 @@ def from_config(cls, cfg: Any, *, run_on_cpu: bool = False) -> "ExecutionOptions
            prefetch_factor=prefetch_factor,
            persistent_workers=persistent_workers,
            save_tile_embeddings=bool(cfg.model.save_tile_embeddings),
            save_slide_embeddings=bool(cfg.model.save_slide_embeddings),
            save_latents=bool(cfg.model.save_latents),
        )

@@ -200,9 +203,17 @@ class RunResult:
    tile_artifacts: list[TileEmbeddingArtifact]
    hierarchical_artifacts: list[HierarchicalEmbeddingArtifact]
    slide_artifacts: list[SlideEmbeddingArtifact]
    patient_artifacts: list[PatientEmbeddingArtifact] = field(default_factory=list)
    process_list_path: Path | None = None


@dataclass(frozen=True, kw_only=True)
class EmbeddedPatient:
    patient_id: str
    patient_embedding: Any  # torch.Tensor [D]
    slide_embeddings: dict[str, Any]  # {sample_id: torch.Tensor [D]}


@dataclass(frozen=True, kw_only=True)
class EmbeddedSlide:
    sample_id: str
@@ -343,6 +354,82 @@ def embed_slides(
            execution=resolved,
        )

    def embed_patient(
        self,
        slides: SlideSequence,
        patient_id: str | None = None,
        *,
        preprocessing: PreprocessingConfig | None = None,
        execution: ExecutionOptions | None = None,
    ) -> "EmbeddedPatient":
        """Embed a single patient's slides and return one ``EmbeddedPatient``.

        Convenience wrapper around :meth:`embed_patients` for the common case
        where all *slides* belong to the same patient.

        Args:
            slides: All slides for this patient.
            patient_id: Optional patient identifier applied to every slide.
                When omitted, ``patient_id`` is read from slide dict keys or
                object attributes; slides that carry no ``patient_id`` fall
                back to ``sample_id``.
        """
        patient_id_map: dict | None = None
        if patient_id is not None:
            patient_id_map = {}
            for s in slides:
                if isinstance(s, (str, Path)):
                    patient_id_map[Path(s).stem] = patient_id
                elif isinstance(s, dict):
                    patient_id_map[str(s["sample_id"])] = patient_id
                else:
                    patient_id_map[str(s.sample_id)] = patient_id
        return self.embed_patients(
            slides,
            patient_id_map=patient_id_map,
            preprocessing=preprocessing,
            execution=execution,
        )[0]

    def embed_patients(
        self,
        slides: SlideSequence,
        patient_id_map: dict | None = None,
        *,
        preprocessing: PreprocessingConfig | None = None,
        execution: ExecutionOptions | None = None,
    ) -> "list[EmbeddedPatient]":
        """Embed slides and aggregate them into patient-level embeddings.

        Requires a patient-level model (e.g. ``moozy``). For each patient,
        all contributing slide embeddings are aggregated by the model's
        ``encode_patient`` method.

        Args:
            slides: Slides to process. Each entry may be a path, a
                ``SlideSpec``, or a dict with ``sample_id`` / ``image_path``
                keys. When *patient_id_map* is ``None``, a ``patient_id``
                key in each dict is used to group slides.
            patient_id_map: Optional explicit ``{sample_id: patient_id}``
                mapping. When provided, it takes precedence over any
                ``patient_id`` key embedded in the slide dicts. When
                omitted and the slide dicts carry no ``patient_id``, each
                slide is treated as its own patient.
        """
        from slide2vec.inference import embed_patients

        resolved = _coerce_execution_options(execution, model=self)
        resolved_preprocessing = _resolve_direct_api_preprocessing(self, preprocessing)
        with _auto_progress_reporting(output_dir=resolved.output_dir):
            _validate_model_config(self, resolved_preprocessing, resolved)
            return embed_patients(
                self,
                slides,
                patient_id_map=patient_id_map,
                preprocessing=resolved_preprocessing,
                execution=resolved,
            )

    def _load_backend(self) -> LoadedModel:
        if self._backend is None:
            from slide2vec.inference import load_model
53 changes: 53 additions & 0 deletions slide2vec/artifacts.py
@@ -35,6 +35,20 @@ def metadata(self) -> dict[str, Any]:
        return load_metadata(self.metadata_path)


@dataclass(frozen=True, kw_only=True)
class PatientEmbeddingArtifact:
    patient_id: str
    path: Path
    metadata_path: Path
    format: str
    feature_dim: int
    num_slides: int

    @property
    def metadata(self) -> dict[str, Any]:
        return load_metadata(self.metadata_path)


@dataclass(frozen=True, kw_only=True)
class HierarchicalEmbeddingArtifact:
    sample_id: str
@@ -223,6 +237,45 @@ def write_slide_embeddings(
    )


def write_patient_embeddings(
    patient_id: str,
    embedding,
    *,
    output_dir: str | Path,
    output_format: str = "pt",
    metadata: dict[str, Any] | None = None,
    num_slides: int = 0,
) -> PatientEmbeddingArtifact:
    output_format = _validate_output_format(output_format)
    artifact_path, metadata_path = _setup_artifact_paths(
        output_dir, "patient_embeddings", patient_id, output_format
    )
    embedding_array = _ensure_array(embedding)
    if output_format == "pt":
        torch.save(_ensure_tensor(embedding), artifact_path)
    else:
        np.savez_compressed(artifact_path, features=embedding_array)

    patient_metadata = {
        "patient_id": patient_id,
        "artifact_type": "patient_embeddings",
        "format": output_format,
        "feature_dim": int(embedding_array.shape[-1]) if embedding_array.ndim else 1,
        "num_slides": num_slides,
    }
    if metadata:
        patient_metadata.update(metadata)
    _write_metadata(metadata_path, patient_metadata)
    return PatientEmbeddingArtifact(
        patient_id=patient_id,
        path=artifact_path,
        metadata_path=metadata_path,
        format=output_format,
        feature_dim=patient_metadata["feature_dim"],
        num_slides=num_slides,
    )


def write_hierarchical_embeddings(
    sample_id: str,
    features,
1 change: 1 addition & 0 deletions slide2vec/configs/default.yaml
@@ -13,6 +13,7 @@ model:
  output_variant: # requested output variant for presets that expose multiple outputs
  batch_size: 32
  save_tile_embeddings: false # whether to save tile embeddings alongside the pooled slide embedding when level is "slide"
  save_slide_embeddings: false # whether to save per-slide embeddings when level is "patient" (e.g. moozy); requires a 'patient_id' column in the input CSV
  save_latents: false # whether to save the latent representations from the model alongside the slide embedding (only supported for 'prism')
  allow_non_recommended_settings: false # when true, non-recommended spacing / tile size / precision combinations warn instead of erroring

2 changes: 2 additions & 0 deletions slide2vec/encoders/__init__.py
@@ -6,6 +6,7 @@

from slide2vec.encoders.base import (
    Encoder,
    PatientEncoder,
    SlideEncoder,
    TileEncoder,
    TimmTileEncoder,
@@ -24,6 +25,7 @@

__all__ = [
    "Encoder",
    "PatientEncoder",
    "TileEncoder",
    "SlideEncoder",
    "TimmTileEncoder",
Expand Down