141 changes: 141 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,141 @@
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### 1. High-level
- No unreleased feature changes yet.

### 2. Low-level
- N/A.

## [0.2.0] - 2026-02-12

Comparison scope for this release note (relative to `v0.1.0`):
- Baseline reference: `dd681a8` (`2025-12-17`, `pyproject.toml` version `0.1.0`)
- Base comparison: `dd681a8...release/v2-stable`
- Branch snapshot used for this summary: `2c92b43` (`2026-02-13`)
- Delta size at that snapshot: `33` commits, `76` files changed, `18,232` insertions, `321` deletions.

### 0. Technical Summary (concise)

#### New CLI workflows
- `segger predict`:
  - Checkpoint-only inference with strict checkpoint/data compatibility checks (`segger_vocab`, `segger_me_gene_pairs`, `n_genes`).
  - Supports inference-time graph overrides, assignment threshold controls, fragment controls, and `--use-3d`.
- `segger export`:
  - Unified format conversion (`xenium_explorer|merged|spatialdata|anndata`) from parquet/csv/SpatialData segmentation inputs.
  - Adds explicit input resolution (`--input-format`) and boundary policy controls (`--boundary-method`).
- `segger plot`:
  - Resolves Lightning metrics automatically (or via `--log-version`), groups train/val curves by metric key, and renders terminal or PNG outputs.

#### New capabilities
- End-to-end SpatialData support (ingest + export), including optional AnnData table embedding.
- Alignment-loss pipeline with ME-gene constraints, scheduled weighting, and checkpoint metadata persistence.
- Fragment-mode assignment for unassigned transcripts via tx-tx connected components with GPU-first/CPU-fallback execution.

#### Stability/performance changes
- Strong checkpoint-first safety checks to prevent silent inference mismatches.
- Improved thresholding and memory behavior in segmentation writing.
- Hardened boundary generation and parallel Xenium export fallback (process -> thread retry).
- Expanded lazy optional-dependency handling with clearer failure modes.
- Broader tests/CI coverage across CLI, export, alignment, fragment, and SpatialData paths.

### 1. High-level (major changes)

#### 1.1 CLI and workflow expansion
- Added a checkpoint-first inference command: `segger predict -c <checkpoint>`.
- Added checkpoint metadata validation for saved vocabulary and ME-gene pairs before inference starts.
- Added training early-stopping controls and best-checkpoint prediction handoff in `segger segment`.
- Added `segger plot` for loss curves with both terminal output (`--quick`, `uniplot`) and image output (`matplotlib`).
- Expanded CLI output controls to multi-format segmentation exports (`segger_raw`, `merged`, `spatialdata`, `anndata`, `all`).
- Expanded export controls to include `--input-format`, `--boundary-method`, and related boundary-generation knobs.
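
The metadata validation described above can be pictured with a small sketch. It is illustrative only: the function name and error messages are hypothetical and are not taken from the segger code base; only the idea (fail loudly when the saved vocabulary and the input genes disagree) comes from this changelog.

```python
from typing import Sequence


def check_checkpoint_compat(
    ckpt_vocab: Sequence[str], data_genes: Sequence[str], n_genes: int
) -> None:
    """Hypothetical check: raise ValueError if checkpoint and data disagree."""
    if len(ckpt_vocab) != n_genes:
        raise ValueError(
            f"checkpoint vocabulary has {len(ckpt_vocab)} genes, model expects {n_genes}"
        )
    if list(ckpt_vocab) != list(data_genes):
        # The same gene set in a different order would silently permute embeddings.
        if set(ckpt_vocab) == set(data_genes):
            raise ValueError("gene order differs between checkpoint and data")
        missing = sorted(set(ckpt_vocab) - set(data_genes))
        raise ValueError(f"gene sets differ; missing from data: {missing[:5]}")
```

Failing before inference starts is the point of the design: a reordered vocabulary would otherwise still run and produce plausible but wrong assignments.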

#### 1.2 New export architecture and format support
- Added a format registry (`OutputFormat`, writer protocol/registration) for consistent export extension.
- Added dedicated writers for merged transcript output, AnnData output, and SpatialData output.
- Added a richer Xenium Explorer export path with improved polygon handling and metadata consistency.
- Added support for choosing boundary-generation strategy (`input`, `convex_hull`, `delaunay`, `skip` where supported).
- Added SOPA compatibility helpers and conversion utilities for SpatialData-centric downstream workflows.
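
As a rough illustration of what a format registry with a writer protocol can look like, here is a minimal sketch. Only the name `OutputFormat` appears in this changelog; the enum members, `Writer`, `register_writer`, `get_writer`, and `MergedWriter` are assumptions made for the example and may not match segger's actual classes.

```python
from enum import Enum
from pathlib import Path
from typing import Callable, Dict, Protocol


class OutputFormat(str, Enum):
    # Member names are illustrative, not segger's real enum.
    MERGED = "merged"
    ANNDATA = "anndata"
    SPATIALDATA = "spatialdata"


class Writer(Protocol):
    def write(self, records: list, out_dir: Path) -> Path: ...


_WRITERS: Dict[OutputFormat, Callable[[], Writer]] = {}


def register_writer(fmt: OutputFormat):
    """Class decorator that records a writer class under its format."""
    def decorator(cls):
        _WRITERS[fmt] = cls
        return cls
    return decorator


def get_writer(fmt: str) -> Writer:
    key = OutputFormat(fmt)  # raises ValueError for unknown format names
    if key not in _WRITERS:
        raise KeyError(f"no writer registered for {key.value!r}")
    return _WRITERS[key]()


@register_writer(OutputFormat.MERGED)
class MergedWriter:
    def write(self, records: list, out_dir: Path) -> Path:
        path = Path(out_dir) / "merged.csv"
        path.write_text("\n".join(",".join(map(str, r)) for r in records))
        return path
```

With this shape, adding a new export format means writing one class and decorating it; the CLI only ever calls `get_writer(name)`.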

#### 1.3 SpatialData support from input to output
- Added SpatialData loader support and `.zarr` path detection in the data module and CLI.
- Added SpatialData export writer support, including transcript points and optional shapes.
- Added optional embedding of an AnnData table in SpatialData output.
- Added lightweight SpatialData Zarr read/write utilities for environments that avoid full `spatialdata` dependency trees.

#### 1.4 Data loading and graph construction upgrades
- Added configurable transcript quality filtering (`min_qv`) with platform-aware logic.
- Added explicit quality-filter classes for Xenium, CosMx, MERSCOPE, and SpatialData-based inputs.
- Added 3D-aware graph construction controls (`use_3d` with `auto/true/false` semantics).
- Added plumbing for the prediction-graph scale factor and aligned its handling so that CLI and data-module behavior stay consistent.
- Added optional transcript-edge similarity capture in graph construction for downstream fragment operations.
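
To make the scale-factor idea concrete, the sketch below enlarges a boundary polygon about a centre point. It is an assumption-laden illustration: `scale_polygon` is a hypothetical helper, and it scales about the vertex mean rather than the area centroid, which may differ from what segger does internally.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def scale_polygon(vertices: List[Point], factor: float = 1.2) -> List[Point]:
    """Scale a polygon about the mean of its vertices (hypothetical helper)."""
    if not vertices:
        return []
    cx = sum(x for x, _ in vertices) / len(vertices)
    cy = sum(y for _, y in vertices) / len(vertices)
    # Each vertex moves away from (or toward) the centre by the same ratio.
    return [(cx + factor * (x - cx), cy + factor * (y - cy)) for x, y in vertices]


square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
enlarged = scale_polygon(square, factor=2.0)
```

A factor above 1 widens the boundary so that transcripts just outside the original polygon can still become candidates for assignment.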

#### 1.5 Model/loss evolution (alignment + metadata-aware inference)
- Added `AlignmentLoss` integration with scheduled weighting and combination modes (`interpolate` and `additive`).
- Added ME-gene edge generation and labeling in heterodata construction.
- Added contrastive same-gene positive edges and ME-pair negative edges for alignment training.
- Added positive subsampling logic to control alignment class imbalance.
- Added checkpoint persistence and restore of `segger_vocab` and `segger_me_gene_pairs`.
- Added stricter runtime compatibility checks between checkpoint metadata and prediction input data.
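
One plausible reading of "scheduled weighting" and the two combination modes is sketched below. The linear schedule and both formulas are assumptions for illustration; the changelog names the modes (`interpolate`, `additive`) but does not define them, so the actual `AlignmentLoss` integration may differ.

```python
def scheduled_weight(step: int, total_steps: int, start: float, end: float) -> float:
    """Linearly move the alignment weight from start to end (assumed schedule)."""
    if total_steps <= 1:
        return end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + frac * (end - start)


def combine_losses(main: float, align: float, weight: float, mode: str = "interpolate") -> float:
    """Combine the main and alignment losses under an assumed definition of each mode."""
    if mode == "interpolate":
        # Convex combination: total magnitude stays comparable as weight grows.
        return (1.0 - weight) * main + weight * align
    if mode == "additive":
        # Alignment term is added on top of the unchanged main loss.
        return main + weight * align
    raise ValueError(f"unknown combination mode: {mode!r}")
```

Under these definitions, `interpolate` trades the main loss off against alignment, while `additive` leaves the main objective untouched and only adds a penalty.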

#### 1.6 Fragment-mode segmentation for unassigned transcripts
- Added fragment-mode assignment pipeline for previously unassigned transcripts.
- Added connected-component grouping using transcript-transcript edges with similarity thresholding.
- Added GPU-first execution path (when RAPIDS is available) with CPU fallback behavior.
- Added minimum-fragment-size controls and auto-threshold options for fragment similarity.
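
The grouping step can be sketched on the CPU with a union-find structure. This is a simplified stand-in, not segger's implementation: `fragment_labels` and its signature are hypothetical, and the GPU path mentioned above is not shown.

```python
from collections import Counter
from typing import List, Tuple


def fragment_labels(
    n: int, edges: List[Tuple[int, int, float]], threshold: float, min_size: int
) -> List[int]:
    """Label transcripts by connected component; -1 marks fragments below min_size."""
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j, sim in edges:
        if sim >= threshold:  # keep only sufficiently similar tx-tx edges
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri

    roots = [find(i) for i in range(n)]
    sizes = Counter(roots)
    ids: dict = {}
    labels = []
    for r in roots:
        labels.append(-1 if sizes[r] < min_size else ids.setdefault(r, len(ids)))
    return labels


edges = [(0, 1, 0.9), (1, 2, 0.8), (3, 4, 0.9), (4, 5, 0.2)]
labels = fragment_labels(6, edges, threshold=0.5, min_size=3)
```

Here transcripts 0 to 2 form one fragment, while the pair 3 and 4 is too small and transcript 5 is cut off by the similarity threshold, so all three stay unassigned.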

#### 1.7 Optional dependency model and package surface cleanup
- Added centralized optional dependency utilities (`segger.utils.optional_deps`) with clear install guidance.
- Added lazy module loading in `segger.io`, `segger.export`, `segger.datasets`, and other package entry points.
- Added explicit RAPIDS requirement checks where GPU-only operations are required.
- Added optional dependency groups in `pyproject.toml` (`spatialdata`, `spatialdata-io`, `sopa`, `plot`, `spatialdata-all`, `dev`).
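
A centralized optional-dependency helper typically wraps the import and attaches install guidance to the failure. The sketch below shows that pattern under stated assumptions: `require` is a hypothetical name and signature, not the actual API of `segger.utils.optional_deps`.

```python
import importlib
from types import ModuleType
from typing import Optional


def require(module_name: str, feature: str, extra: Optional[str] = None) -> ModuleType:
    """Import a module on demand, or fail with an actionable install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        hint = f"pip install 'segger[{extra}]'" if extra else f"pip install {module_name}"
        raise ImportError(
            f"{feature} requires '{module_name}'. Install it with: {hint}"
        ) from exc
```

Calling such a helper inside the function that needs the dependency, rather than at module top level, is what keeps the package importable on partial installs.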

#### 1.8 New datasets/helpers for reproducible testing and demos
- Added `segger.datasets` with toy Xenium loaders and synthetic data generation.
- Added sample-output generation helpers for merged/parquet and SpatialData conversion workflows.
- Added plotting and SpatialData demo notebooks to document end-to-end usage.

#### 1.9 Testing and CI expansion
- Added a full test suite scaffold (`tests/`, fixtures, and targeted modules by subsystem).
- Added tests for alignment loss, fragment mode, prediction graph behavior, exporters, optional deps, and SpatialData I/O.
- Added CI workflow (`.github/workflows/test.yml`) and Dependabot config for dependency hygiene.
- Added pytest and coverage configuration directly in `pyproject.toml`.

#### 1.10 Documentation expansion
- Added dedicated docs for installation troubleshooting, release process, versioning policy, loss functions, and math foundations.
- Added structured release note document for `v0.2.0`.

### 2. Low-level (minor changes and refinements)

#### 2.1 Accuracy, performance, and stability refinements
- Improved thresholding logic in segmentation writing with robust Li/Yen handling and safe fallbacks.
- Reduced peak memory in per-gene threshold calculations through iterative sampling-based processing.
- Improved boundary generation throughput with parallel Delaunay options.
- Added fallback from process workers to thread workers in parallel Xenium export when process pools fail.
- Added safer empty/degenerate polygon handling in boundary extraction and export code paths.
- Added additional positional-embedding guards for empty batches and zero-variance coordinates.
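
The process-to-thread fallback mentioned above follows a common standard-library pattern, sketched here. `map_with_fallback` is a hypothetical helper, and the set of exceptions that triggers the retry is an assumption; segger's export code may catch a different set.

```python
import pickle
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def map_with_fallback(fn, items, workers=2,
                      process_pool=ProcessPoolExecutor,
                      thread_pool=ThreadPoolExecutor):
    """Map fn over items in processes; retry in threads if the pool fails."""
    items = list(items)
    try:
        with process_pool(max_workers=workers) as pool:
            return list(pool.map(fn, items))
    except (BrokenProcessPool, OSError, pickle.PicklingError):
        # Threads share memory, so they avoid pickling and fork/spawn failures,
        # at the cost of contending for the GIL in pure-Python work.
        with thread_pool(max_workers=workers) as pool:
            return list(pool.map(fn, items))
```

The pool factories are parameters only so the fallback branch can be exercised in a test without actually breaking a process pool.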

#### 2.2 ME-gene discovery and alignment tuning refinements
- Added ME-gene discovery caching keyed by scRNA input metadata and discovery parameters.
- Added scRNA preprocessing normalization and optional per-cell-type subsampling for faster ME discovery.
- Added progress/debug messages for ME discovery and alignment-edge creation (`SEGGER_ME_VERBOSE` / debug flags).
- Tightened default ME exclusivity criteria and increased pair coverage tuning in discovery defaults.
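
Caching keyed by input metadata and parameters can be sketched as a stable hash over both. The choice of fields (absolute path, size, modification time) and the name `discovery_cache_key` are assumptions for illustration; the changelog does not say which metadata segger uses.

```python
import hashlib
import json
import os


def discovery_cache_key(scrna_path: str, params: dict) -> str:
    """Build a short, deterministic key from file metadata and parameters."""
    st = os.stat(scrna_path)
    payload = {
        "path": os.path.abspath(scrna_path),
        "size": st.st_size,
        "mtime_ns": st.st_mtime_ns,
        "params": params,
    }
    # sort_keys makes the key independent of dict insertion order.
    blob = json.dumps(payload, sort_keys=True, default=str).encode()
    return hashlib.sha256(blob).hexdigest()[:16]
```

Any change to the reference file or to a discovery parameter yields a new key, so stale results are never reused.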

#### 2.3 CLI polish and compatibility refinements
- Unified worker-count semantics across related CLI steps.
- Improved CLI help text for format/export settings and deprecation messaging.
- Added robust cell-id column alias resolution for export inputs.
- Added typed handling for unassigned IDs in AnnData export paths.
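
Alias resolution for the cell-id column amounts to trying a list of accepted names in priority order. In the sketch below the alias list and the function name are hypothetical; the actual names accepted by `segger export` are not listed in this changelog.

```python
from typing import Iterable, Sequence

# Hypothetical alias list, highest priority first.
CELL_ID_ALIASES = ("segger_cell_id", "cell_id", "cell")


def resolve_cell_id_column(
    columns: Iterable[str], aliases: Sequence[str] = CELL_ID_ALIASES
) -> str:
    """Return the first column matching an alias (case-insensitive)."""
    lowered = {c.lower(): c for c in columns}
    for alias in aliases:
        if alias in lowered:
            return lowered[alias]  # original spelling of the column
    raise KeyError(f"no cell-id column found; tried {list(aliases)}")
```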

#### 2.4 Internal API and import refinements
- Switched multiple package-level imports to lazy-loading patterns to reduce import side effects and startup overhead.
- Updated data utility import strategy to stay consistent with existing project patterns.
- Added compatibility comments and deprecation guidance around legacy `cli/config.yaml` defaults.

#### 2.5 Housekeeping
- No additional housekeeping notes in this release summary.
141 changes: 140 additions & 1 deletion README.md
@@ -4,6 +4,8 @@

Before installing **segger**, please install GPU-accelerated versions of PyTorch, RAPIDS, and related packages compatible with your system. *Please ensure all CUDA-enabled packages are compiled for the same CUDA version.*

- Segger is GPU-only and requires the RAPIDS stack (no CPU-only mode).

- **PyTorch & torchvision:** [Installation guide](https://pytorch.org/get-started/locally/)
- **torch_scatter:** [Installation guide](https://github.com/rusty1s/pytorch_scatter#installation)
- **RAPIDS (cuDF, cuML, cuGraph):** [Installation guide](https://docs.rapids.ai/install)
@@ -32,6 +34,87 @@ git clone https://github.com/dpeerlab/segger.git segger && cd segger
pip install -e .
```

## Tips & Troubleshooting (v0.2.0)

- Avoid user-site bleed: set `PYTHONNOUSERSITE=1` so `~/.local` packages do not shadow the env.
- Torch Geometric wheels must match your `torch` and CUDA versions; the `data.pyg.org` wheel index URL you install from must name the same pair.
- Keep RAPIDS packages from a single channel/version set; avoid pip/conda mixing for RAPIDS.
- NFS cleanup noise (`.nfs*`): set `TMPDIR` to local scratch to avoid exit-time errors.
- UCX/CUDA segfaults: try `UCX_MEMTYPE_CACHE=n` and `UCX_TLS=sm,self`.

## Optional Dependencies & Lazy Imports

Segger defers imports for several heavy/optional features to keep `import segger` fast and to allow partial installs. If an optional dependency is missing, some top-level re-exports (notably in `segger.io` and `segger.export`) will be `None` rather than raising at import time.

```python
from segger.io import get_preprocessor
if get_preprocessor is None:
    raise ImportError("Install opencv-python for preprocessors.")
```

For strict import errors, import from submodules directly:

```python
from segger.io.preprocessor import get_preprocessor
```

Common optional dependencies:
- `opencv-python` (preprocessors)
- `spatialdata` + `dask` (SpatialData loader/writer)
- `spatialdata-io` (platform-specific SpatialData readers)
- `uniplot` + `matplotlib` (loss curve plotting; install with `segger[plot]`)
- `sopa` (SOPA export helpers)
- `geopandas`/`shapely` (geometry utilities)

## v0.2.0 Detailed Delta vs `v0.1.0`

This summary is intentionally based on the release baseline comparison (`dd681a8...release/v2-stable`), not only on the latest commit.

- Comparison snapshot:
  - `v0.1.0` baseline reference: `dd681a8` (`2025-12-17`, `pyproject.toml` version `0.1.0`)
  - Release snapshot: `2c92b43` (`2026-02-13`)
  - Delta size: `33` commits, `76` files changed, `18,232` insertions, `321` deletions

### New CLI workflows

- `segger predict`:
  - Checkpoint-only inference (`-c`) with strict checkpoint/data compatibility checks for `segger_vocab`, `segger_me_gene_pairs`, and `n_genes`.
  - Supports inference-time graph overrides (`--transcripts-max-k`, `--transcripts-max-dist`, `--prediction-max-k`), assignment controls (`--min-similarity`, `--min-similarity-shift`), fragment controls, and `--use-3d`.
  - Supports post-predict multi-format output (`--output-format`) with optional overwrite semantics.
- `segger export`:
  - Unified export entry point for `xenium_explorer|merged|spatialdata|anndata`.
  - Handles segmentation inputs from parquet/csv/SpatialData with `--input-format auto|raw|spatialdata`.
  - Adds explicit boundary policy (`--boundary-method input|convex_hull|delaunay|skip`), worker controls, polygon vertex limits, and cell-id alias resolution.
- `segger plot`:
  - Resolves latest or specific Lightning run metrics (`--log-version`) from `lightning_logs/version_*`.
  - Groups train/val series by metric key, applies rolling smoothing, and renders either terminal pages (`--quick`) or paginated PNG outputs.
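
Two pieces of the `segger plot` behavior described above are easy to sketch in isolation: picking the latest `lightning_logs/version_*` directory and smoothing a metric series. Both helpers are hypothetical illustrations rather than segger's code, and the trailing-window mean is only one possible form of "rolling smoothing".

```python
import re
from pathlib import Path
from typing import List


def latest_version_dir(log_root) -> Path:
    """Return the version_N directory with the largest N (numeric, not lexical)."""
    pattern = re.compile(r"version_(\d+)$")
    candidates = []
    for p in Path(log_root).glob("version_*"):
        m = pattern.match(p.name)
        if m and p.is_dir():
            candidates.append((int(m.group(1)), p))
    if not candidates:
        raise FileNotFoundError(f"no version_* directories under {log_root}")
    return max(candidates, key=lambda item: item[0])[1]


def rolling_mean(values: List[float], window: int = 3) -> List[float]:
    """Trailing mean; the first window-1 points average over what is available."""
    out, total = [], 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]
        out.append(total / min(i + 1, window))
    return out
```

Comparing the numeric suffix matters because a plain string sort would rank `version_2` above `version_10`.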

### New capabilities

- End-to-end SpatialData support:
  - `.zarr` ingest path in data loading and export path to SpatialData output.
  - Optional AnnData table embedding in SpatialData output.
  - Lightweight direct SpatialData Zarr I/O utilities for reduced dependency footprints.
- Alignment-loss pipeline:
  - ME-gene constraints integrated into graph/loss flow with scheduled weighting and combination modes.
  - Checkpoint persistence + restore of `segger_vocab` and `segger_me_gene_pairs`.
- Fragment-mode segmentation:
  - Unassigned transcript recovery via tx-tx connected components with similarity thresholding.
  - GPU-first path with CPU fallback.

### Stability/performance changes

- Checkpoint-first inference hardening:
  - Explicit mismatch failures for vocabulary order and gene-count incompatibility to prevent silent misalignment.
- Segmentation writer improvements:
  - More robust auto-thresholding path with safer memory behavior and sign-stable threshold shifting.
- Boundary/export resilience:
  - Safer polygon handling and process-to-thread fallback for parallel Xenium export when process pools fail.
- Optional dependency behavior:
  - Expanded lazy-loading and explicit install guidance for partial environments.
- Validation surface:
  - Significant test/CI expansion across CLI, export, alignment, fragment, and SpatialData paths.

# Usage

You can run **segger** from the command line with:
@@ -42,4 +125,60 @@ segger segment -i /path/to/your/ist/data/ -o /path/to/save/outputs/
To see all available parameter options:
```bash
segger segment --help
```

Run prediction only from a saved checkpoint (no retraining):
```bash
segger predict -c /path/to/checkpoints/segger-best-epoch.ckpt \
    -i /path/to/your/ist/data/ \
    -o /path/to/save/outputs/
```

Plot loss curves from the latest training run:
```bash
segger plot -o /path/to/save/outputs/
```

Quick terminal plot (no image saved):
```bash
segger plot -o /path/to/save/outputs/ --quick
```

Plot a specific Lightning run version:
```bash
segger plot -o /path/to/save/outputs/ --log-version 0
```

## CLI Parameters (New/Updated)

- `--input-format` (`auto` | `raw` | `spatialdata`) and `--output-format` (`segger_raw` | `merged` | `spatialdata` | `anndata` | `all`).
- `--boundary-method` (`input` | `convex_hull` | `delaunay` | `skip`) and `--boundary-n-jobs` (0 uses `--num-workers`).
- `--sopa-compatible` for SOPA-ready SpatialData output.
- `--num-workers` for data loading (and as the default for boundary generation).
- `--prediction-scale-factor`: polygon scaling for tx→bd candidate edges (default 1.2).
- `--min-similarity`: fixed similarity threshold; if unset, per-gene auto-thresholding.
- `--fragment-mode`, `--fragment-min-transcripts`, `--fragment-similarity-threshold`.
- `--alignment-loss`, `--scrna-reference-path`, `--scrna-celltype-column`.
- `--alignment-loss-weight-start`, `--alignment-loss-weight-end`, `--loss-combination-mode`.
- `--early-stopping-patience` (default `10`) and `--early-stopping-min-delta` (default `1e-4`) for validation-based stopping on `val:loss`.
- `--use-3d` (`auto` | `true` | `false`) and `--min-qv` for quality filtering.
- `--tiling-margin-training`, `--tiling-margin-prediction`, `--max-nodes-per-tile`, `--max-edges-per-batch`.
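
For readers unfamiliar with the patience and min-delta options, the sketch below shows the usual stopping rule they control. It is a generic illustration with a hypothetical `EarlyStopper` class, using the defaults quoted above; it is not the callback segger actually wires into training.

```python
class EarlyStopper:
    """Stop when the monitored loss has not improved by min_delta for `patience` checks."""

    def __init__(self, patience: int = 10, min_delta: float = 1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one validation result; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0  # a real improvement resets the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopper(patience=2, min_delta=1e-4)
decisions = [stopper.step(loss) for loss in (1.0, 0.9, 0.9, 0.9)]
```

A larger `min_delta` treats tiny fluctuations as no improvement, so training stops sooner on a plateau.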

## Alignment Loss Example

```bash
segger segment -i /path/to/your/ist/data/ -o /path/to/save/outputs/ \
    --alignment-loss \
    --scrna-reference-path segger_experiments/data_raw/scrnaseq/human_crc.h5ad \
    --scrna-celltype-column celltype
```

# Project Docs

- Versioning: `docs/VERSIONING.md`
- Release process: `docs/RELEASE.md`
- Release notes: `docs/releases/v0.2.0.md`
- Installation notes: `docs/INSTALLATION.md`
- Loss functions: `docs/LOSS_FUNCTIONS.md`
- Math foundations: `docs/MATH.md`
- Changelog: `CHANGELOG.md`
48 changes: 48 additions & 0 deletions docs/INSTALLATION.md
@@ -0,0 +1,48 @@
# Installation Notes (v0.2.0)

This project relies on GPU-accelerated packages (PyTorch, RAPIDS, cuSpatial). A clean, consistent environment avoids most runtime errors.

- Segger is GPU-only and requires the RAPIDS stack (no CPU-only mode).

## Clean Install Checklist

- Use a fresh env; avoid mixing pip/conda for RAPIDS packages.
- Keep CUDA versions consistent across PyTorch, RAPIDS, and cuSpatial.
- Install `torch-geometric` from a wheel that matches your `torch` + CUDA version.
- Pin `sympy` to `1.13.1` (matches PyTorch 2.5.x) and ensure `mpmath` is installed.
- Install Lightning in the same env, and set `PYTHONNOUSERSITE=1` before running jobs so that `~/.local` packages cannot shadow it.

## Cluster Tips

- NFS cleanup errors (`.nfs*`) are harmless but noisy. Set `TMPDIR` to local scratch:
  - `export TMPDIR=/ssd/$USER/segger_tmp` (or cluster-specific scratch).
- UCX/CUDA segfaults: try
  - `export UCX_MEMTYPE_CACHE=n`
  - `export UCX_TLS=sm,self`
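
A job script can check these settings before launching anything expensive. The helper below is a hypothetical convenience, not part of segger; it only inspects the variables named in this document and treats the UCX settings as optional, since they are suggested here only as things to try.

```python
import os
from typing import List, Mapping, Optional


def environment_warnings(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return human-readable warnings about the settings recommended above."""
    env = os.environ if env is None else env
    warnings = []
    if not env.get("PYTHONNOUSERSITE"):
        warnings.append("PYTHONNOUSERSITE is unset: ~/.local packages may shadow the env")
    if not env.get("TMPDIR"):
        warnings.append("TMPDIR is unset: temp files may land on NFS")
    return warnings


for message in environment_warnings():
    print("warning:", message)
```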

## Alignment Loss

Alignment loss requires an scRNA-seq reference:

```bash
segger segment -i /path/to/data -o /path/to/output \
    --alignment-loss \
    --scrna-reference-path segger_experiments/data_raw/scrnaseq/human_crc.h5ad \
    --scrna-celltype-column celltype
```

## Optional Dependencies (Lazy-Loaded)

Segger defers imports for several heavy or optional features, so `import segger` works without them. These features become available only when the corresponding dependency is installed.

- Preprocessors: `opencv-python`
- SpatialData loader/writer: `spatialdata`, `dask`, `zarr` (and `geopandas` for shapes)
- SpatialData platform readers: `spatialdata-io` (install with `segger[spatialdata-io]`)
- Loss curve plotting: `uniplot` + `matplotlib` (install with `segger[plot]`)
- SOPA helpers: `sopa`
- Geometry utilities: `geopandas`, `shapely`
- scRNA utilities: `scanpy`, `scikit-learn`
- RAPIDS/GPU helpers: `cudf`, `cuml`, `cugraph`, `cupy`, `cupyx`

When importing from top-level modules like `segger.io` or `segger.export`, optional re-exports may be `None` if dependencies are missing. Import from the submodule directly to get a strict `ImportError`.
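
To see which optional features an environment can support without importing anything heavy, `importlib.util.find_spec` is enough. The sketch below is illustrative: the feature-to-module mapping is abridged from the list above, using import names (`cv2` for `opencv-python`, `sklearn` for `scikit-learn`), and `feature_available` is a hypothetical helper.

```python
from importlib.util import find_spec
from typing import Dict, Iterable

# Feature name -> top-level import names (abridged from the list above).
OPTIONAL_FEATURES: Dict[str, list] = {
    "Preprocessors": ["cv2"],
    "SpatialData loader/writer": ["spatialdata", "dask", "zarr"],
    "Loss curve plotting": ["uniplot", "matplotlib"],
    "scRNA utilities": ["scanpy", "sklearn"],
}


def feature_available(modules: Iterable[str]) -> bool:
    """True if every top-level module can be located, without importing it."""
    return all(find_spec(name) is not None for name in modules)


report = {name: feature_available(mods) for name, mods in OPTIONAL_FEATURES.items()}
for name, ok in report.items():
    print(f"{name}: {'available' if ok else 'missing'}")
```

Because `find_spec` only locates the module, the check stays fast and has no import side effects.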