dpeerlab · Tobiaspk · Mar 4, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,141 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [Unreleased]
+
+### 1. High-level
+- No unreleased feature changes yet.
+
+### 2. Low-level
+- N/A.
+
+## [0.2.0] - 2026-02-12
+
+Comparison scope for this release note (relative to `v0.1.0`):
+- Baseline reference: `dd681a8` (`2025-12-17`, `pyproject.toml` version `0.1.0`)
+- Base comparison: `dd681a8...release/v2-stable`
+- Branch snapshot used for this summary: `2c92b43` (`2026-02-13`)
+- Delta size at that snapshot: `33` commits, `76` files changed, `18,232` insertions, `321` deletions.
+
+### 0. Technical Summary (concise)
+
+#### New CLI workflows
+- `segger predict`:
+  - Checkpoint-only inference with strict checkpoint/data compatibility checks (`segger_vocab`, `segger_me_gene_pairs`, `n_genes`).
+  - Supports inference-time graph overrides, assignment threshold controls, fragment controls, and `--use-3d`.
+- `segger export`:
+  - Unified format conversion (`xenium_explorer|merged|spatialdata|anndata`) from parquet/csv/SpatialData segmentation inputs.
+  - Adds explicit input resolution (`--input-format`) and boundary policy controls (`--boundary-method`).
+- `segger plot`:
+  - Resolves Lightning metrics automatically (or via `--log-version`), groups train/val curves by metric key, and renders terminal or PNG outputs.
+
+#### New capabilities
+- End-to-end SpatialData support (ingest + export), including optional AnnData table embedding.
+- Alignment-loss pipeline with ME-gene constraints, scheduled weighting, and checkpoint metadata persistence.
+- Fragment-mode assignment for unassigned transcripts via tx-tx connected components with GPU-first/CPU-fallback execution.
+
+#### Stability/performance changes
+- Strong checkpoint-first safety checks to prevent silent inference mismatches.
+- More robust auto-thresholding and lower peak memory in segmentation writing.
+- Hardened boundary generation and parallel Xenium export fallback (process -> thread retry).
+- Expanded lazy optional-dependency handling with clearer failure modes.
+- Broader tests/CI coverage across CLI, export, alignment, fragment, and SpatialData paths.
+
+### 1. High-level (major changes)
+
+#### 1.1 CLI and workflow expansion
+- Added a checkpoint-first inference command: `segger predict -c <checkpoint>`.
+- Added checkpoint metadata validation for saved vocabulary and ME-gene pairs before inference starts.
+- Added training early-stopping controls and best-checkpoint prediction handoff in `segger segment`.
+- Added `segger plot` for loss curves with both terminal output (`--quick`, `uniplot`) and image output (`matplotlib`).
+- Expanded CLI output controls to multi-format segmentation exports (`segger_raw`, `merged`, `spatialdata`, `anndata`, `all`).
+- Expanded export controls to include `--input-format`, `--boundary-method`, and related boundary-generation knobs.
+
+#### 1.2 New export architecture and format support
+- Added a format registry (`OutputFormat`, writer protocol/registration) for consistent export extension.
+- Added dedicated writers for merged transcript output, AnnData output, and SpatialData output.
+- Added a richer Xenium Explorer export path with improved polygon handling and metadata consistency.
+- Added support for choosing boundary-generation strategy (`input`, `convex_hull`, `delaunay`, `skip` where supported).
+- Added SOPA compatibility helpers and conversion utilities for SpatialData-centric downstream workflows.
+
+#### 1.3 SpatialData support from input to output
+- Added SpatialData loader support and `.zarr` path detection in the data module and CLI.
+- Added SpatialData export writer support, including transcript points and optional shapes.
+- Added optional embedding of an AnnData table in SpatialData output.
+- Added lightweight SpatialData Zarr read/write utilities for environments that avoid full `spatialdata` dependency trees.
+
+#### 1.4 Data loading and graph construction upgrades
+- Added configurable transcript quality filtering (`min_qv`) with platform-aware logic.
+- Added explicit quality-filter classes for Xenium, CosMx, MERSCOPE, and SpatialData-based inputs.
+- Added 3D-aware graph construction controls (`use_3d` with `auto/true/false` semantics).
+- Added prediction-graph scale-factor plumbing so that CLI and data-module behavior stays consistent.
+- Added optional transcript-edge similarity capture in graph construction for downstream fragment operations.
+
+#### 1.5 Model/loss evolution (alignment + metadata-aware inference)
+- Added `AlignmentLoss` integration with scheduled weighting and combination modes (`interpolate` and `additive`).
+- Added ME-gene edge generation and labeling in heterodata construction.
+- Added contrastive same-gene positive edges and ME-pair negative edges for alignment training.
+- Added positive subsampling logic to control alignment class imbalance.
+- Added checkpoint persistence and restore of `segger_vocab` and `segger_me_gene_pairs`.
+- Added stricter runtime compatibility checks between checkpoint metadata and prediction input data.
+
+#### 1.6 Fragment-mode segmentation for unassigned transcripts
+- Added fragment-mode assignment pipeline for previously unassigned transcripts.
+- Added connected-component grouping using transcript-transcript edges with similarity thresholding.
+- Added GPU-first execution path (when RAPIDS is available) with CPU fallback behavior.
+- Added minimum-fragment-size controls and auto-threshold options for fragment similarity.
+
+#### 1.7 Optional dependency model and package surface cleanup
+- Added centralized optional dependency utilities (`segger.utils.optional_deps`) with clear install guidance.
+- Added lazy module loading in `segger.io`, `segger.export`, `segger.datasets`, and other package entry points.
+- Added explicit RAPIDS requirement checks where GPU-only operations are required.
+- Added optional dependency groups in `pyproject.toml` (`spatialdata`, `spatialdata-io`, `sopa`, `plot`, `spatialdata-all`, `dev`).
+
+#### 1.8 New datasets/helpers for reproducible testing and demos
+- Added `segger.datasets` with toy Xenium loaders and synthetic data generation.
+- Added sample-output generation helpers for merged/parquet and SpatialData conversion workflows.
+- Added plotting and SpatialData demo notebooks to document end-to-end usage.
+
+#### 1.9 Testing and CI expansion
+- Added a full test suite scaffold (`tests/`, fixtures, and targeted modules by subsystem).
+- Added tests for alignment loss, fragment mode, prediction graph behavior, exporters, optional deps, and SpatialData I/O.
+- Added CI workflow (`.github/workflows/test.yml`) and Dependabot config for dependency hygiene.
+- Added pytest and coverage configuration directly in `pyproject.toml`.
+
+#### 1.10 Documentation expansion
+- Added dedicated docs for installation troubleshooting, release process, versioning policy, loss functions, and math foundations.
+- Added a structured release-notes document for `v0.2.0`.
+
+### 2. Low-level (minor changes and refinements)
+
+#### 2.1 Accuracy, performance, and stability refinements
+- Improved thresholding logic in segmentation writing with robust Li/Yen handling and safe fallbacks.
+- Reduced peak memory in per-gene threshold calculations through iterative sampling-based processing.
+- Improved boundary generation throughput with parallel Delaunay options.
+- Added fallback from process workers to thread workers in parallel Xenium export when process pools fail.
+- Added safer empty/degenerate polygon handling in boundary extraction and export code paths.
+- Added additional positional-embedding guards for empty batches and zero-variance coordinates.
+
+#### 2.2 ME-gene discovery and alignment tuning refinements
+- Added ME-gene discovery caching keyed by scRNA input metadata and discovery parameters.
+- Added scRNA preprocessing normalization and optional per-cell-type subsampling for faster ME discovery.
+- Added progress/debug messages for ME discovery and alignment-edge creation (`SEGGER_ME_VERBOSE` / debug flags).
+- Tightened the default ME exclusivity criteria and tuned the discovery defaults for higher pair coverage.
+
+#### 2.3 CLI polish and compatibility refinements
+- Unified worker-count semantics across related CLI steps.
+- Improved CLI help text for format/export settings and deprecation messaging.
+- Added robust cell-id column alias resolution for export inputs.
+- Added typed handling for unassigned IDs in AnnData export paths.
+
+#### 2.4 Internal API and import refinements
+- Switched multiple package-level imports to lazy-loading patterns to reduce import side effects and startup overhead.
+- Updated data utility import strategy to stay consistent with existing project patterns.
+- Added compatibility comments and deprecation guidance around legacy `cli/config.yaml` defaults.
+
+#### 2.5 Housekeeping
+- No additional housekeeping notes in this release summary.
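
Note on changelog section 1.6 (fragment mode): the entries describe grouping previously unassigned transcripts into connected components over transcript-transcript edges, with a similarity threshold and a minimum fragment size. The following is a minimal CPU-only sketch of that idea, assuming edges arrive as `(i, j, similarity)` tuples. The function name `fragment_components`, its parameters, and the `-1` sentinel are hypothetical and are not Segger's API; the real pipeline is described as GPU-first (RAPIDS) with a CPU fallback and may work differently.

```python
from collections import defaultdict


def fragment_components(n_tx, edges, sim_threshold=0.5, min_size=2):
    """Group transcripts into fragments (hypothetical helper, not Segger's API).

    edges: iterable of (i, j, similarity) over transcript indices 0..n_tx-1.
    Returns one label per transcript, or -1 where the transcript's
    component has fewer than min_size members.
    """
    parent = list(range(n_tx))

    def find(x):
        # Path halving keeps the union-find trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the endpoints of every edge that passes the similarity threshold.
    for i, j, sim in edges:
        if sim >= sim_threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    # Collect members per root, then keep only components of at least min_size.
    members = defaultdict(list)
    for tx in range(n_tx):
        members[find(tx)].append(tx)

    labels = [-1] * n_tx
    next_label = 0
    for root in sorted(members):
        group = members[root]
        if len(group) >= min_size:
            for tx in group:
                labels[tx] = next_label
            next_label += 1
    return labels


# Toy input: the weak edge (2, 3) is dropped, and transcript 5 has no edges.
edges = [(0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.1), (3, 4, 0.7)]
labels = fragment_components(6, edges, sim_threshold=0.5, min_size=2)
print(labels)
```

Thresholding before the union step means a single weak edge cannot merge two otherwise separate fragments, and the size filter leaves isolated transcripts unassigned.
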
diff --git a/README.md b/README.md
@@ -4,6 +4,8 @@
 
 Before installing **segger**, please install GPU-accelerated versions of PyTorch, RAPIDS, and related packages compatible with your system. *Please ensure all CUDA-enabled packages are compiled for the same CUDA version.*
 
+- Segger is GPU-only and requires the RAPIDS stack (no CPU-only mode).
+
 - **PyTorch & torchvision:** [Installation guide](https://pytorch.org/get-started/locally/)
 - **torch_scatter:** [Installation guide](https://github.com/rusty1s/pytorch_scatter#installation)
 - **RAPIDS (cuDF, cuML, cuGraph):** [Installation guide](https://docs.rapids.ai/install)
@@ -32,6 +34,87 @@ git clone https://github.com/dpeerlab/segger.git segger && cd segger
 pip install -e .
 ```
 
+## Tips & Troubleshooting (v0.2.0)
+
+- Avoid user-site package leakage: set `PYTHONNOUSERSITE=1` so packages in `~/.local` do not shadow those in the environment.
+- Torch Geometric wheels must match your `torch` and CUDA versions (the `data.pyg.org` wheel URL must name the same versions).
+- Keep RAPIDS packages from a single channel/version set; avoid pip/conda mixing for RAPIDS.
+- NFS cleanup noise (`.nfs*`): set `TMPDIR` to local scratch to avoid exit-time errors.
+- UCX/CUDA segfaults: try `UCX_MEMTYPE_CACHE=n` and `UCX_TLS=sm,self`.
+
+## Optional Dependencies & Lazy Imports
+
+Segger defers imports for several heavy/optional features to keep `import segger` fast and to allow partial installs. If an optional dependency is missing, some top-level re-exports (notably in `segger.io` and `segger.export`) will be `None` rather than raising at import time.
+
+```python
+from segger.io import get_preprocessor
+if get_preprocessor is None:  # re-export is None when the optional dependency is missing
+    raise ImportError("Install opencv-python for preprocessors.")
+```
+
+For strict import errors, import from submodules directly:
+
+```python
+from segger.io.preprocessor import get_preprocessor
+```
+
+Common optional dependencies:
+- `opencv-python` (preprocessors)
+- `spatialdata` + `dask` (SpatialData loader/writer)
+- `spatialdata-io` (platform-specific SpatialData readers)
+- `uniplot` + `matplotlib` (loss curve plotting; install with `segger[plot]`)
+- `sopa` (SOPA export helpers)
+- `geopandas`/`shapely` (geometry utilities)
+
+## v0.2.0 Detailed Delta vs `v0.1.0`
+
+This summary covers the full release range (`dd681a8...release/v2-stable`), not just the latest commit.
+
+- Comparison snapshot:
+  - `v0.1.0` baseline reference: `dd681a8` (`2025-12-17`, `pyproject.toml` version `0.1.0`)
+  - Release snapshot: `2c92b43` (`2026-02-13`)
+  - Delta size: `33` commits, `76` files changed, `18,232` insertions, `321` deletions
+
+### New CLI workflows
+
+- `segger predict`:
+  - Checkpoint-only inference (`-c`) with strict checkpoint/data compatibility checks for `segger_vocab`, `segger_me_gene_pairs`, and `n_genes`.
+  - Supports inference-time graph overrides (`--transcripts-max-k`, `--transcripts-max-dist`, `--prediction-max-k`), assignment controls (`--min-similarity`, `--min-similarity-shift`), fragment controls, and `--use-3d`.
+  - Supports post-predict multi-format output (`--output-format`) with optional overwrite semantics.
+- `segger export`:
+  - Unified export entry point for `xenium_explorer|merged|spatialdata|anndata`.
+  - Handles segmentation inputs from parquet/csv/SpatialData with `--input-format auto|raw|spatialdata`.
+  - Adds explicit boundary policy (`--boundary-method input|convex_hull|delaunay|skip`), worker controls, polygon vertex limits, and cell-id alias resolution.
+- `segger plot`:
+  - Resolves latest or specific Lightning run metrics (`--log-version`) from `lightning_logs/version_*`.
+  - Groups train/val series by metric key, applies rolling smoothing, and renders either terminal pages (`--quick`) or paginated PNG outputs.
+
+### New capabilities
+
+- End-to-end SpatialData support:
+  - `.zarr` ingest path in data loading and export path to SpatialData output.
+  - Optional AnnData table embedding in SpatialData output.
+  - Lightweight direct SpatialData Zarr I/O utilities for reduced dependency footprints.
+- Alignment-loss pipeline:
+  - ME-gene constraints integrated into graph/loss flow with scheduled weighting and combination modes.
+  - Checkpoint persistence + restore of `segger_vocab` and `segger_me_gene_pairs`.
+- Fragment-mode segmentation:
+  - Unassigned transcript recovery via tx-tx connected components with similarity thresholding.
+  - GPU-first path with CPU fallback.
+
+### Stability/performance changes
+
+- Checkpoint-first inference hardening:
+  - Explicit mismatch failures for vocabulary order and gene-count incompatibility to prevent silent misalignment.
+- Segmentation writer improvements:
+  - More robust auto-thresholding path with safer memory behavior and sign-stable threshold shifting.
+- Boundary/export resilience:
+  - Safer polygon handling and process-to-thread fallback for parallel Xenium export when process pools fail.
+- Optional dependency behavior:
+  - Expanded lazy-loading and explicit install guidance for partial environments.
+- Validation surface:
+  - Significant test/CI expansion across CLI, export, alignment, fragment, and SpatialData paths.
+
 # Usage
 
 You can run **segger** from the command line with:
@@ -42,4 +125,60 @@ segger segment -i /path/to/your/ist/data/ -o /path/to/save/outputs/
 To see all available parameter options:
 ```bash
 segger segment --help
-```
+```
+
+Run prediction only from a saved checkpoint (no retraining):
+```bash
+segger predict -c /path/to/checkpoints/segger-best-epoch.ckpt \
+  -i /path/to/your/ist/data/ \
+  -o /path/to/save/outputs/
+```
+
+Plot loss curves from the latest training run:
+```bash
+segger plot -o /path/to/save/outputs/
+```
+
+Quick terminal plot (no image saved):
+```bash
+segger plot -o /path/to/save/outputs/ --quick
+```
+
+Plot a specific Lightning run version:
+```bash
+segger plot -o /path/to/save/outputs/ --log-version 0
+```
+
+## CLI Parameters (New/Updated)
+
+- `--input-format` (`auto` | `raw` | `spatialdata`) and `--output-format` (`segger_raw` | `merged` | `spatialdata` | `anndata` | `all`).
+- `--boundary-method` (`input` | `convex_hull` | `delaunay` | `skip`) and `--boundary-n-jobs` (`0` falls back to `--num-workers`).
+- `--sopa-compatible` for SOPA-ready SpatialData output.
+- `--num-workers` for data loading (and as the default for boundary generation).
+- `--prediction-scale-factor`: polygon scaling factor for transcript-to-boundary (tx→bd) candidate edges (default `1.2`).
+- `--min-similarity`: fixed similarity threshold; if unset, thresholds are chosen automatically per gene.
+- `--fragment-mode`, `--fragment-min-transcripts`, `--fragment-similarity-threshold`.
+- `--alignment-loss`, `--scrna-reference-path`, `--scrna-celltype-column`.
+- `--alignment-loss-weight-start`, `--alignment-loss-weight-end`, `--loss-combination-mode`.
+- `--early-stopping-patience` (default `10`) and `--early-stopping-min-delta` (default `1e-4`) for validation-based stopping on `val:loss`.
+- `--use-3d` (`auto` | `true` | `false`) for 3D-aware graph construction, and `--min-qv` for transcript quality filtering.
+- `--tiling-margin-training`, `--tiling-margin-prediction`, `--max-nodes-per-tile`, `--max-edges-per-batch`.
+
+## Alignment Loss Example
+
+```bash
+segger segment -i /path/to/your/ist/data/ -o /path/to/save/outputs/ \
+  --alignment-loss \
+  --scrna-reference-path segger_experiments/data_raw/scrnaseq/human_crc.h5ad \
+  --scrna-celltype-column celltype
+```
+
+# Project Docs
+
+- Versioning: `docs/VERSIONING.md`
+- Release process: `docs/RELEASE.md`
+- Release notes: `docs/releases/v0.2.0.md`
+- Installation notes: `docs/INSTALLATION.md`
+- Loss functions: `docs/LOSS_FUNCTIONS.md`
+- Math foundations: `docs/MATH.md`
+- Changelog: `CHANGELOG.md`
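
Note on the README's "Checkpoint-first inference hardening" entry: it mentions explicit failures on vocabulary-order and gene-count mismatches between a checkpoint and the prediction data. A minimal sketch of what such a check could look like is below. The function name `check_checkpoint_compat` and the plain-list representation of the vocabulary are assumptions for illustration, not Segger's actual code, and the gene names are only example values.

```python
def check_checkpoint_compat(ckpt_vocab, data_vocab):
    """Fail loudly if the data vocabulary disagrees with the checkpoint's.

    Hypothetical helper, not Segger's API. Both arguments are sequences of
    gene names in index order.
    """
    # A different gene count can never be compatible.
    if len(ckpt_vocab) != len(data_vocab):
        raise ValueError(
            f"gene count mismatch: checkpoint has {len(ckpt_vocab)}, "
            f"data has {len(data_vocab)}"
        )
    # Same genes in a different order would silently remap gene indices.
    for idx, (ckpt_gene, data_gene) in enumerate(zip(ckpt_vocab, data_vocab)):
        if ckpt_gene != data_gene:
            raise ValueError(
                f"vocabulary order mismatch at index {idx}: "
                f"checkpoint has {ckpt_gene!r}, data has {data_gene!r}"
            )


# Identical vocabularies pass silently.
check_checkpoint_compat(["ACTB", "CD3E", "EPCAM"], ["ACTB", "CD3E", "EPCAM"])

# Same genes in a different order are rejected.
try:
    check_checkpoint_compat(["ACTB", "CD3E", "EPCAM"], ["CD3E", "ACTB", "EPCAM"])
except ValueError as err:
    print(err)
```

Comparing position by position, rather than comparing sets, is what catches a reordered vocabulary: the gene sets can be identical while every index points at a different gene.
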
diff --git a/docs/INSTALLATION.md b/docs/INSTALLATION.md
@@ -0,0 +1,48 @@
+# Installation Notes (v0.2.0)
+
+This project relies on GPU-accelerated packages (PyTorch, RAPIDS, cuSpatial). A clean, consistent environment avoids most runtime errors.
+
+- Segger is GPU-only and requires the RAPIDS stack (no CPU-only mode).
+
+## Clean Install Checklist
+
+- Use a fresh env; avoid mixing pip/conda for RAPIDS packages.
+- Keep CUDA versions consistent across PyTorch, RAPIDS, and cuSpatial.
+- Install `torch-geometric` from a wheel that matches your `torch` + CUDA version.
+- Pin `sympy` to `1.13.1` (matches PyTorch 2.5.x) and ensure `mpmath` is installed.
+- Install Lightning into the same env, and keep `~/.local` packages from leaking in:
+  - Set `PYTHONNOUSERSITE=1` before running jobs.
+
+## Cluster Tips
+
+- NFS cleanup errors (`.nfs*`) are harmless but noisy. Set `TMPDIR` to local scratch:
+  - `export TMPDIR=/ssd/$USER/segger_tmp` (or cluster-specific scratch).
+- UCX/CUDA segfaults: try
+  - `export UCX_MEMTYPE_CACHE=n`
+  - `export UCX_TLS=sm,self`
+
+## Alignment Loss
+
+Alignment loss requires an scRNA-seq reference:
+
+```bash
+segger segment -i /path/to/data -o /path/to/output \
+  --alignment-loss \
+  --scrna-reference-path segger_experiments/data_raw/scrnaseq/human_crc.h5ad \
+  --scrna-celltype-column celltype
+```
+
+## Optional Dependencies (Lazy-Loaded)
+
+Segger defers imports for several heavy or optional features, so `import segger` works without them. These features become available only when the corresponding dependency is installed.
+
+- Preprocessors: `opencv-python`
+- SpatialData loader/writer: `spatialdata`, `dask`, `zarr` (and `geopandas` for shapes)
+- SpatialData platform readers: `spatialdata-io` (install with `segger[spatialdata-io]`)
+- Loss curve plotting: `uniplot` + `matplotlib` (install with `segger[plot]`)
+- SOPA helpers: `sopa`
+- Geometry utilities: `geopandas`, `shapely`
+- scRNA utilities: `scanpy`, `scikit-learn`
+- RAPIDS/GPU helpers: `cudf`, `cuml`, `cugraph`, `cupy`, `cupyx`
+
+When importing from top-level modules like `segger.io` or `segger.export`, optional re-exports may be `None` if dependencies are missing. Import from the submodule directly to get a strict `ImportError`.
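
Note on the optional-dependency list in `docs/INSTALLATION.md`: before running a job, it can help to see which optional features an environment actually supports. The sketch below uses only the standard library (`importlib.util.find_spec`) and does not import Segger. The feature labels and the helper names `is_installed` and `missing_by_feature` are hypothetical; the module names are the usual import names for the listed packages (`cv2` for `opencv-python`, `sklearn` for `scikit-learn`, `spatialdata_io` for `spatialdata-io`).

```python
import importlib.util

# Feature label -> importable module names, mirroring the list in the doc.
# The labels are illustrative only and are not Segger extras or APIs.
OPTIONAL_FEATURES = {
    "preprocessors": ["cv2"],
    "spatialdata": ["spatialdata", "dask", "zarr"],
    "spatialdata-io": ["spatialdata_io"],
    "plot": ["uniplot", "matplotlib"],
    "sopa": ["sopa"],
    "geometry": ["geopandas", "shapely"],
    "scrna": ["scanpy", "sklearn"],
}


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def missing_by_feature(features=OPTIONAL_FEATURES):
    """Map each feature label to the modules that could not be located."""
    return {
        feature: [name for name in modules if not is_installed(name)]
        for feature, modules in features.items()
    }


for feature, missing in missing_by_feature().items():
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{feature:15s} {status}")
```

Because `find_spec` only locates a top-level module, the check stays fast and avoids the import side effects that lazy loading is meant to prevent.
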