System Design Document — April 2026
The goal is not to build an agent that passes scBench by memorizing answers. It's to build a system so well-structured that correctness follows from design. The hypothesis:
An agent that deeply understands experimental paradigms, enforces proper tool sequencing, tracks provenance, and presents biologists with verifiable intermediate results will outperform general-purpose coding agents on scRNA-seq tasks — without ever seeing the benchmark.
Four pillars:
- Question-driven analysis with dependency awareness — the system responds to what the researcher asks, validates steps against a dependency graph, and optionally generates a structured plan when the researcher has a clear goal. Paradigm selection is helpful but not required.
- State-aware onboarding — the system inspects foreign data to determine what has been done, what's available, and what's needed. It never assumes it controls the data from the start.
- Provenance-first — every result is traceable to inputs, parameters, and tool versions
- Branched exploration — biologists can fork, compare, and merge analysis paths
┌─────────────────────────────────────────────────────────────────┐
│ scAgent CLI (REPL) │
│ Researcher types natural language ←→ Agent responds + acts │
└──────────────────────────┬──────────────────────────────────────┘
│
┌────────────┴────────────┐
│ ORCHESTRATOR │
│ ┌──────────────────┐ │
│ │ Paradigm Router │ │ ← Identifies experiment type
│ │ Plan Generator │ │ ← Generates analysis DAG
│ │ Step Executor │ │ ← Runs tools with params
│ │ Verifier │ │ ← Checks outputs for sanity
│ └──────────────────┘ │
└────────────┬────────────┘
│
┌─────────────────┼─────────────────┐
│ │ │
┌────┴────┐ ┌──────┴──────┐ ┌─────┴─────┐
│ TOOL │ │ PROVENANCE │ │ STATE │
│ REGISTRY│ │ GRAPH │ │ MANAGER │
│ │ │ (PROV-JSONLD│ │ │
│ Scanpy │ │ lineage) │ │ AnnData │
│ Seurat │ │ │ │ snapshots │
│ CellR. │ │ │ │ branch │
│ DESeq2 │ │ │ │ tree │
│ ... │ │ │ │ │
└─────────┘ └─────────────┘ └───────────┘
│ │ │
└─────────────────┼─────────────────┘
│
┌─────────────────┼─────────────────┐
│ │ │
┌────┴────┐ ┌──────┴──────┐ ┌─────┴─────┐
│ MEMORY │ │ KNOWLEDGE │ │ EXPERIMENT│
│ PALACE │ │ MarkerDB + │ │ CONTEXT │
│(mempal.)│ │ best prac- │ │ (minSCe) │
│ │ │ tice refs │ │ │
│ Chat │ │ │ │ Species │
│ history │ │ Canonical + │ │ Platform │
│ across │ │ CellTypist +│ │ Tissue │
│ months │ │ User DBs │ │ Chemistry │
└─────────┘ └─────────────┘ └───────────┘
Based on Füllgrabe et al. (2020) "Guidelines for reporting single-cell RNA-seq experiments" (Nature Biotechnology 38:1384–1386), every scAgent project is initialized with a structured context object. This is the system's "ground truth" about the experiment and gates all downstream decisions.
Why this matters for the agent: When the paradigm is disease_vs_healthy, the orchestrator knows that:
- Cross-condition DE (pseudobulk) is a required downstream step
- Batch correction is likely needed
- Differential abundance testing is relevant
- Trajectory inference is probably NOT the primary goal
When the paradigm is developmental_trajectory, the orchestrator knows:
- Pseudotime/trajectory inference IS the primary goal
- RNA velocity may be relevant
- Cross-condition DE may not apply
Each tool is registered with its capabilities, valid paradigms, step placement, parameter schemas, and validation rules.
// tools/leiden_clustering.json
{
"tool_id": "leiden_clustering",
"name": "Leiden Clustering",
"category": "clustering",
"framework": "scanpy",
"function": "sc.tl.leiden",
"valid_after": ["neighbor_graph"], // must come after this step
"valid_before": ["differential_expression", "cell_annotation"],
"paradigms": ["all"], // valid in all paradigms
"parameters": {
"resolution": {
"type": "float",
"default": 1.0,
"range": [0.1, 5.0],
"guidance": "Higher values → more clusters. Use 0.3-0.5 for major lineages, 0.8-1.2 for standard, 1.5-3.0 for subtypes. Sciaraffa et al. 2025 found resolution=2.0 optimal across 3 datasets.",
"auto_strategy": "sweep", // agent should try multiple values
"sweep_values": [0.3, 0.5, 0.8, 1.0, 1.5, 2.0]
},
"key_added": {
"type": "string",
"default": "leiden_{resolution}"
}
},
"outputs": {
"cluster_labels": "adata.obs['leiden_{resolution}']",
"n_clusters": "int"
},
"validation": {
"min_clusters": 2,
"max_clusters_ratio": 0.1, // no more than 10% of cells as clusters
"check": "all clusters have ≥ 10 cells"
},
"provenance_captures": ["resolution", "n_neighbors_used", "n_pcs_used", "random_state"]
}All 31 tool schemas live flat in tools/ — no subdirectories. Categories are encoded in each schema's category field, not the filesystem. This keeps the registry simple to enumerate and extend.
tools/
├── load_10x_h5.json # loading
├── filter_cells.json # qc
├── filter_genes.json # qc
├── scrublet_doublets.json # qc
├── log_normalize.json # normalization
├── highly_variable_genes.json # feature_selection
├── pca.json # dimensionality_reduction
├── umap.json # dimensionality_reduction
├── neighbors.json # graph
├── harmony.json # integration
├── scvi.json # integration
├── scanorama.json # integration
├── bbknn.json # integration
├── leiden.json # clustering
├── louvain.json # clustering
├── celltypist.json # annotation
├── validate_annotation.json # annotation
├── query_markers.json # annotation
├── wilcoxon_markers.json # differential_expression
├── deseq2_pseudobulk.json # differential_expression
├── edger_pseudobulk.json # differential_expression
├── paga.json # trajectory (topology)
├── diffusion_pseudotime.json # trajectory (ordering) — NEW
├── slingshot.json # trajectory (R-only, deferred)
├── scvelo.json # trajectory (RNA velocity)
├── guide_assignment.json # perturbation — NEW
├── perturbation_de.json # perturbation — NEW
├── cellchat.json # communication (R-only, deferred)
├── cellphonedb.json # communication
├── gsea.json # enrichment
├── cluster_profiler.json # enrichment
├── sccoda.json # composition
└── milor.json # composition
Python implementations that the agent calls. Each follows the same pattern: validate inputs → run analysis → generate plots → return structured result + provenance.
scagent/tools/
├── loading.py # Load 10x data
├── qc.py # QC metrics + filtering
├── doublets.py # Scrublet doublet detection
├── normalize.py # Log-normalization
├── feature_selection.py # HVG selection
├── pca.py # PCA
├── neighbors.py # KNN graph
├── integration.py # Harmony batch correction
├── clustering.py # Leiden/Louvain
├── embedding.py # UMAP
├── markers.py # Wilcoxon marker genes
├── annotation.py # CellTypist annotation
├── pseudobulk_de.py # Pseudobulk DE (PyDESeq2)
├── enrichment.py # GSEA pathway enrichment
├── knowledge_tools.py # Marker gene knowledge queries
├── trajectory.py # PAGA + DPT + scVelo — NEW (chunk 11)
├── composition.py # scCODA + Milo via pertpy — NEW (chunk 11)
├── communication.py # LIANA+ consensus L-R — NEW (chunk 11)
├── perturbation.py # Guide assignment + perturbation DE — NEW (chunk 11)
├── repertoire.py # Scirpy: VDJ loading + clonotype — NEW (chunk 11)
├── multimodal.py # CITE-seq: ADT + CLR + WNN — NEW (chunk 11)
└── pipeline.py # Pipeline orchestration
Analysis planning has two layers:
The dependency graph encodes what each step requires — prerequisite steps and data conditions. It is not a plan — it's static knowledge used to validate any step the user requests.
# scagent/dependencies.py
DEPENDENCIES = {
'qc_metrics': {'requires': [], 'requires_x': ['raw_counts']},
'normalize': {'requires': [], 'requires_x': ['raw_counts']},
'hvg': {'requires': ['normalize'], 'requires_x': ['log_normalized']},
'pca': {'requires': ['hvg']},
'neighbors': {'requires': ['pca']},
'clustering': {'requires': ['neighbors']},
'umap': {'requires': ['neighbors']},
'annotation': {'requires': ['clustering']},
'pseudobulk_de': {'requires': ['annotation'], 'requires_data': ['raw_counts', 'condition_key']},
'trajectory': {'requires': ['neighbors']},
...
}When the researcher asks for something, the agent:
- Checks
check_prerequisites(goal, state)— can this run now? - If not, calls
plan_steps(goal, state)— what's the minimal set of steps to get there? - Fills the gaps, then runs the requested step.
This works with or without a DAG. A researcher can ask "run trajectory" mid-atlas-analysis and the dependency graph validates it.
When the researcher states a clear goal ("I want to build a cell atlas"), the system generates a paradigm-specific DAG. This DAG is advisory — it tracks progress and suggests next steps, but doesn't block out-of-plan requests.
The DAG can be modified dynamically:
dag.add_step(step, after="clustering")— add trajectory to an atlas DAGdag.mark_precomputed_from_state(state)— mark already-completed steps when onboarding foreign data
For each paradigm, the orchestrator generates an analysis DAG. Steps are nodes; edges are data dependencies. The researcher can modify the DAG interactively.
cellranger_multi
│
▼
qc_filtering ──────────────────┐
│ │
▼ ▼
normalization doublet_detection
│ │
▼ │
feature_selection ◄────────────┘
│
▼
pca
│
▼
batch_correction (if multi-batch)
│
▼
neighbor_graph
│
├──────────────────┐
▼ ▼
clustering umap
│ │
▼ │
marker_de ◄─────────────┘
│
▼
cell_annotation
│
├──────────────────┐
▼ ▼
cross_condition_de composition_analysis
│ │
▼ │
pathway_enrichment │
│ │
└──────┬───────────┘
▼
report_generation
cellranger_multi → qc → normalization → hvg → pca → integration
│
├→ neighbor_graph → clustering → annotation
│ │ │
│ │ └→ PAGA topology → DPT pseudotime
│ │
│ └→ RNA velocity (scVelo, optional — requires spliced/unspliced)
│
└→ marker_de
Tool IDs: paga → diffusion_pseudotime → scvelo_velocity (optional).
PAGA beats Slingshot on complex topologies (tree: 0.770 vs 0.630, Saelens et al. 2019).
DPT ordering follows PAGA topology. scVelo requires velocyto/STARsolo preprocessing.
shared_prefix → annotation → guide_assignment → perturbation_de → enrichment
shared_prefix → batch_correction (always required) → neighbors → clustering → annotation
│
├→ pseudobulk_de (timepoint contrasts)
├→ composition (proportion trends over time)
└→ pathway_enrichment
Batch correction is mandatory — different timepoints = different batches [BP-1].
shared_prefix → annotation → load_vdj → clonotype_analysis → repertoire_overlap
Uses Scirpy (scverse) for TCR/BCR analysis from 10x Chromium VDJ.
load → qc → normalize → hvg → pca
↘
load_protein → normalize_protein (CLR) → WNN → clustering → annotation
↗
protein_markers ────┘
Uses muon for WNN (weighted nearest neighbors) joint RNA + protein graph.
When scAgent encounters data it didn't create — a researcher's pre-processed .h5ad, scBench eval data, or a collaborator's shared file — it needs to figure out what state the data is in before doing anything.
adata.X could be raw integer counts, log-normalized floats, or z-score scaled values with negatives. Attempting to normalize already-scaled data produces NaN values and crashes. The inspector runs once at data load and produces a comprehensive state report.
| Check | Signal | State |
|---|---|---|
(X < 0).any() |
Negative values | scaled |
allclose(sample, round(sample)) |
Integer values | raw_counts |
'log1p' in adata.uns |
Scanpy breadcrumb | log_normalized |
X.max() < 20 |
Value range heuristic | log_normalized (with warning) |
| else | Unknown transform | transformed_unknown |
Each Scanpy step leaves specific keys. The inspector checks for all of them:
| Step | Breadcrumb | Reliable? |
|---|---|---|
| QC metrics | obs['n_genes_by_counts'] |
✅ |
| Normalization | uns['log1p'] |
✅ |
| HVG selection | var['highly_variable'] |
✅ |
| PCA | obsm['X_pca'] |
✅ |
| Neighbors | uns['neighbors'] + obsp['connectivities'] |
✅ |
| UMAP | obsm['X_umap'] |
✅ |
| Clustering | obs['leiden*'] / obs['louvain*'] |
|
| Cell types | obs['cell_type'] / obs['annotation'] / etc. |
|
| DE results | uns['rank_genes_groups'] |
✅ |
The inspector also locates raw counts for re-processing:
layers['counts'](Scanpy convention)layers['raw_counts'](some pipelines)Xitself (if state israw_counts)
from scagent.inspector import inspect_adata, summarize_state
from scagent.dependencies import ensure_ready_for
state = inspect_adata(adata)
print(summarize_state(state))
# → "6,420 cells × 19,918 genes (mouse), X: z-score scaled,
# raw counts in layers['raw_counts'], clusters in obs['leiden']..."
# Bring X to the right state for HVG selection
ensure_ready_for(adata, state, needs="log_normalized")User: [drops a pre-processed .h5ad]
scAgent: I've inspected your data:
6,420 cells × 19,918 genes (mouse)
X is z-score scaled, raw counts in layers['raw_counts']
Clusters: obs['leiden'] (12 clusters)
Cell types: obs['cell_type'] (8 types)
What would you like to do?
User: Find DE genes between the CAF subtypes
scAgent: I see 'CAF' in your cell type labels. I'll:
1. Subset to CAF cells
2. Re-cluster to find subtypes (raw counts → normalize → PCA → cluster)
3. Run DE between subtypes
Proceed?
Every step executed produces a provenance record. The format follows W3C PROV-O, serialized as JSON-LD. This is the "portable knowledge format" — a colleague can reconstruct exactly what happened.
{
"@context": {
"prov": "http://www.w3.org/ns/prov#",
"sca": "https://scagent.dev/ontology/v1#",
"xsd": "http://www.w3.org/2001/XMLSchema#"
},
"@graph": [
{
// The output data entity
"@id": "sca:adata_clustered_v3",
"@type": "prov:Entity",
"prov:wasGeneratedBy": "sca:activity_leiden_001",
"prov:wasDerivedFrom": "sca:adata_neighbors_v2",
"sca:snapshot_hash": "sha256:a3f8c2...",
"sca:branch": "main",
"sca:step_index": 7
},
{
// The activity that produced it
"@id": "sca:activity_leiden_001",
"@type": "prov:Activity",
"prov:used": "sca:adata_neighbors_v2",
"prov:wasAssociatedWith": "sca:tool_leiden_clustering",
"prov:startedAtTime": "2026-04-11T14:23:00Z",
"prov:endedAtTime": "2026-04-11T14:23:12Z",
"sca:parameters": {
"resolution": 1.0,
"random_state": 42,
"n_iterations": -1,
"key_added": "leiden_1.0"
},
"sca:tool_version": "scanpy==1.10.4",
"sca:python_version": "3.11.8",
"sca:triggered_by": "user_request",
"sca:user_prompt": "cluster the cells at resolution 1.0"
},
{
// The tool (agent)
"@id": "sca:tool_leiden_clustering",
"@type": ["prov:Agent", "prov:SoftwareAgent"],
"sca:tool_id": "leiden_clustering",
"sca:registry_version": "0.1.0"
}
]
}Portability: This file is ~2KB per step. A full analysis of 15 steps is ~30KB. Send it to a colleague → they can see exactly what was done, with what parameters, in what order, and reproduce it.
The provenance graph is append-only. Branches create new provenance chains. The graph is queryable:
- "What parameters were used for clustering?" → look up all activities with tool_id=leiden_clustering
- "What changed between branch A and branch B?" → diff the provenance chains
- "Reproduce this analysis" → replay all activities in order with recorded parameters
The researcher can fork any point in the analysis to explore alternatives. This is NOT git for code — it's version control for analysis state.
main: qc → norm → hvg → pca → neighbors → leiden(0.8) → annotation → DE
│
├── branch: "high_res"
│ leiden(2.0) → annotation → DE
│
└── branch: "harmony_corrected"
harmony → neighbors → leiden(1.0) → annotation → DE
Each "state" is an AnnData snapshot identified by a content hash. We don't copy the entire AnnData each time — we use a copy-on-write strategy:
.scagent/
├── project.json # experiment context (minSCe)
├── provenance.jsonld # full provenance graph
├── dag.json # current analysis DAG
├── branches/
│ ├── main/
│ │ ├── HEAD # pointer to current state hash
│ │ ├── states/
│ │ │ ├── a3f8c2.h5ad # snapshot after clustering
│ │ │ ├── b7d1e4.h5ad # snapshot after annotation
│ │ │ └── ...
│ │ └── log.jsonl # step-by-step execution log
│ ├── high_res/
│ │ ├── HEAD
│ │ ├── parent_branch: "main"
│ │ ├── fork_point: "a3f8c2" # forked from this state
│ │ └── states/
│ │ └── c9e3f1.h5ad
│ └── harmony_corrected/
│ └── ...
├── tools/ # tool registry (local overrides)
├── knowledge/ # MarkerDB user databases
│ ├── cellmarker2.xlsx # optional CellMarker 2.0
│ └── panglaodb.tsv # optional PanglaoDB
└── memory/ # mempalace data
└── palace.db
> /branch list
main (active) — 12 steps, last: differential_expression
high_res — 3 steps, forked from main@leiden_clustering
harmony_corrected — 5 steps, forked from main@pca
> /branch create "no_mito_filter" from main@qc_filtering
Created branch 'no_mito_filter' from state a1b2c3
> /branch compare high_res main
Diverges at: leiden_clustering
high_res: resolution=2.0 → 24 clusters → ...
main: resolution=0.8 → 12 clusters → ...
Shared cell types: 9/12 overlap
high_res splits: CD4_naive → CD4_naive_1, CD4_naive_2
> /branch merge high_res into main
⚠ Cannot auto-merge analysis branches. Use /branch switch to activate.
MemPalace stores all researcher–agent conversations verbatim in ChromaDB, organized by the palace metaphor:
- Wing: The project (e.g., "PBMC aging study")
- Hall: Type of interaction (analysis decisions, parameter discussions, interpretation)
- Room: Specific topic (e.g., "clustering resolution debate", "B cell marker validation")
On each REPL turn, the agent can search MemPalace for relevant past context:
# Before generating a response, the agent searches
results = mempalace.search("why did we choose resolution 1.0", wing="pbmc_aging")
# → Returns the conversation from 3 weeks ago where the researcher
# discussed marker genes and decided 0.8 was too coarseKnowledge is structured, not a generic knowledge graph. Two components:
MarkerDB (knowledge.py) — 3-tier marker gene lookup:
- Built-in canonical markers (always available, no download)
- CellTypist model extraction (61 trained models, ~50 MB each)
- User-supplied databases (CellMarker 2.0 xlsx, PanglaoDB TSV in
.scagent/knowledge/)
> /knowledge query "NK cells"
Canonical: NKG7, GNLY, PRF1, GZMB, KLRD1, NCAM1 (CD56)
CellTypist (Immune_All_Low): NKG7 (rank=1), GNLY (rank=2), GZMB (rank=3)
Confidence: HIGH (sources agree)
> /knowledge validate --markers GNLY,NKG7,GZMB,PRF1 --label "NK cells"
Match: 4/6 canonical markers, overlap=67%, confidence=HIGH
Best practice references (best_practices/reference/, 12 Markdown files) — distilled literature-backed guidance per analysis step, sourced from Heumos et al. 2023, 10x Genomics Analysis Guide 2025, and sc-best-practices.org. Loaded into agent context via skills on demand. Backed by two full PDFs.
Modern LLMs provide 200K+ token context windows (Claude: 200K, Gemini: 1M+). scAgent should use this capacity aggressively — the whole point is that biologists accumulate context over months.
The strategy is tiered: a small always-pinned layer, a large working memory layer that grows with the session, and a compaction mechanism that fires only when genuinely necessary.
┌──────────────────────────────────────────────────────────────┐
│ CONTEXT WINDOW (200K tokens available) │
│ │
│ TIER 1 — ALWAYS PINNED (~3K tokens): │
│ ├─ System prompt + paradigm-specific rules ~1,200 │
│ ├─ Experiment context (minSCe JSON) ~500 │
│ ├─ Current branch + state summary ~500 │
│ ├─ Current DAG position + next steps ~400 │
│ └─ Active tool schema (for current step) ~400 │
│ │
│ TIER 2 — WORKING MEMORY (grows up to ~150K tokens): │
│ ├─ Full conversation history (all turns this session) │
│ │ Typical: ~1,500 tokens/turn × 40 turns = ~60K │
│ ├─ Provenance chain for current branch ~2-5K │
│ ├─ Last tool outputs (QC plots, DE tables, UMAPs) ~5-10K │
│ ├─ Knowledge graph retrievals (papers, markers) ~3-5K │
│ └─ MemPalace retrievals (prior session context) ~3-5K │
│ │
│ TIER 3 — OVERFLOW / RETRIEVAL (~47K reserved): │
│ ├─ Full provenance history (all branches) │
│ ├─ Extended marker gene tables │
│ ├─ Paper excerpts from knowledge graph │
│ └─ Earlier conversation turns (retrieved on demand) │
│ │
│ COMPACTION TRIGGER: │
│ When Tier 2 exceeds 150K tokens: │
│ → Summarize conversation turns older than 20 turns │
│ → Mine full verbatim history to MemPalace │
│ → Compress tool outputs to key findings only │
│ → Keep Tier 1 + last 20 turns + summaries of older turns │
│ │
│ CROSS-SESSION CONTINUITY: │
│ On session start: │
│ → Load Tier 1 (always) │
│ → Load last session summary from MemPalace (~2K) │
│ → Load recent provenance chain (~2K) │
│ → Retrieve relevant past discussions via MemPalace search │
│ Total cold-start context: ~10K tokens │
│ Researcher can /recall "topic" to pull more past context │
└──────────────────────────────────────────────────────────────┘
Key principle: Don't compress prematurely. With 200K tokens, a researcher can have ~80-100 back-and-forths before compaction is needed. For a typical analysis session (20-30 turns), the full conversation fits comfortably. Compaction is a safety valve, not a routine operation.
For 1M-token models (Gemini), compaction may never be needed within a single session.
$ scagent init --paradigm disease_vs_healthy
╔══════════════════════════════════════════════════════════════╗
║ scAgent v0.1 — Single-Cell Analysis Agent ║
║ Project: pbmc_aging_study ║
║ Paradigm: Disease vs. Healthy ║
║ Platform: 10x Chromium GEM-X 3' v4 ║
║ Branch: main (step 0/12) ║
╚══════════════════════════════════════════════════════════════╝
Commands:
/status Current analysis state
/plan Show analysis DAG
/branch Branch management
/provenance View lineage
/knowledge Query knowledge graph
/tools List available tools
/export Export provenance / report
/help Full command list
You: I've uploaded my Cell Ranger outputs for 4 donors. Can you run QC?
scAgent: I see 4 filtered_feature_bc_matrix directories. Let me load them
and run QC. Here's what I'll do:
1. Load all 4 samples into a merged AnnData
2. Compute QC metrics (n_genes, n_counts, pct_mito, pct_ribo)
3. Show you the distributions so we can set thresholds together
[Executing... ████████████ 100%]
Here are the QC distributions:
┌─ Genes per cell ──────────────────┐
│ Donor 1: median=3,274 (IQR: 2,100-4,500)
│ Donor 2: median=3,102 (IQR: 1,900-4,200)
│ ...
└────────────────────────────────────┘
┌─ % Mitochondrial ────────────────┐
│ All donors: median=4.2% (95th pctl: 8.7%)
│ Suggested threshold: 10%
└────────────────────────────────────┘
Based on your PBMC samples, I suggest:
• Min genes: 200 | Max genes: 5000
• Max mito: 10% | Min cells per gene: 3
These are conservative defaults. Want to adjust?
You: Looks good, apply those. Also use scrublet for doublet detection.
scAgent: Applying QC filters + Scrublet doublet detection...
┌─ Filtering Results ──────────────┐
│ Donor 1: 5,710 → 5,423 cells (95.0% retained)
│ Donor 2: 5,200 → 4,891 cells (94.1% retained)
│ ...
│ Doublets removed: 312 total (2.8%)
└────────────────────────────────────┘
State saved: branch=main, step=qc_filtering, hash=a1b2c3
Provenance recorded: filter_cells(min_genes=200, max_genes=5000, max_mito=0.10)
+ scrublet(expected_doublet_rate=0.06)
Next suggested step: normalization (SCTransform recommended for Chromium)
Leave a clean interface for foundation model embeddings:
// tools/foundation_models/scgpt_embeddings.json
{
"tool_id": "scgpt_embeddings",
"name": "scGPT Cell Embeddings",
"category": "dimensionality_reduction",
"framework": "scgpt",
"status": "planned", // not yet active
"requires": {
"gpu": true,
"vram_gb": 16,
"model_checkpoint": "scGPT_human"
},
"valid_after": ["normalization"],
"valid_before": ["clustering", "cell_annotation"],
"parameters": {
"model": {
"type": "string",
"options": ["scGPT_human", "scGPT_mouse", "scGPT_brain"],
"default": "scGPT_human"
},
"batch_size": { "type": "int", "default": 64 }
},
"outputs": {
"cell_embeddings": "adata.obsm['X_scgpt']"
},
"notes": "Alternative to PCA for downstream clustering/annotation. scCluBench (2025) found foundation model embeddings underperform task-specific clustering methods but excel at classification. Consider using for annotation transfer rather than clustering."
}When hardware becomes available, the tool just gets activated in the registry. No architecture changes needed.
| scBench Failure Mode | scAgent Design Response |
|---|---|
| Agent doesn't know what platform the data is from | Experiment context explicitly encodes platform, chemistry, kit |
| Agent doesn't know data is pre-processed | Inspector detects X state (raw/normalized/scaled), finds raw counts in layers |
| Agent uses wrong normalization for the data type | Inspector + dependency graph check X state before any transformation; ensure_ready_for() handles recovery |
| Agent treats cells as independent replicates in cross-condition DE | Dependency graph for pseudobulk requires condition column + raw counts |
| Agent picks arbitrary clustering resolution | Parameter guidance from literature baked into tool schema; resolution sweep is default |
| Agent can't identify cell types | Knowledge graph loaded with marker databases; annotation validated against DE markers |
| Agent forgets what it did 5 steps ago | Provenance graph always available; current state summary always in context |
| Agent hallucinates gene names | Validation rules check gene names against the loaded AnnData var_names |
| Agent crashes on unfamiliar data structures | All data is AnnData — the system enforces a single data format |
scBench's hardest tasks are cell typing (48% best) and DE (41% best). These fail because general agents:
- Don't know what tissue they're working with
- Don't know what markers to expect
- Don't enforce pseudobulk for cross-sample comparisons
- Pick arbitrary parameters without justification
scAgent addresses all four by construction — the experiment context tells the agent the tissue, the knowledge graph provides markers, the DAG enforces pseudobulk, and the tool registry constrains parameters with literature-backed defaults.
| Component | Technology | Role |
|---|---|---|
| CLI/REPL | pi (platform) + scAgent system prompt | User interface |
| LLM backbone | Claude via API (through pi) | Reasoning, code generation, interpretation |
| Data format | AnnData (.h5ad) | Universal single-cell data container |
| Analysis core | Scanpy (Python) | Primary analysis framework |
| R bridge | rpy2 or subprocess | For Seurat, DESeq2, edgeR when needed |
| Chat memory | MemPalace (ChromaDB) | Long-term conversation storage |
| Knowledge | MarkerDB + best_practices/reference/ | Marker genes + literature-backed guidance |
| Provenance | Custom PROV-JSONLD writer | Lineage tracking |
| State snapshots | AnnData .h5ad + SHA256 | Branched analysis states |
| Tool registry | JSON schemas | Tool capabilities and constraints |
| Visualization | Matplotlib/Scanpy plots + Rich terminal | Inline plot display |
- Project initialization (
scagent init) - Experiment context schema (minSCe JSON)
- Tool registry with 10 core tools (QC through clustering)
- Basic REPL with single-branch execution
- Provenance recording (PROV-JSONLD)
- AnnData state management (save/load snapshots)
- Paradigm router (maps experiment type → analysis DAG)
- LLM integration for natural language → tool invocation
- Parameter guidance from tool schemas
- Validation checks on tool outputs
- MemPalace integration for chat persistence
- Branch/fork/compare workflow
- Multiple resolution sweep with comparison view
- Expand MarkerDB with tissue-specific databases
- Cell annotation pipeline with validation
- DE pipeline (markers + pseudobulk)
- Full DAG for all 7 paradigms (chunk 11: trajectory, temporal, perturbation, repertoire, multimodal)
- Context compaction strategy
- Export: provenance file, methods section, reproducibility package
- Tool registry expanded to 34 tools + 21 Python wrappers
- Documentation and onboarding flow
- Run against scBench canonical 30 (Chromium subset: 7 evals)
- Request full 394-eval access from LatchBio
- Identify paradigm/task gaps and iterate
- scGPT integration when GPU available
| Question | Options | Recommendation |
|---|---|---|
| Should branching copy full AnnData or use deltas? | Full copy (simple), delta/diff (space-efficient), lazy checkpointing (only save on branch) | Start with lazy checkpointing — only snapshot when branching. Use full copy for simplicity. |
| How to handle R tools (DESeq2, Seurat)? | rpy2 in-process, subprocess with file exchange, separate R server | Subprocess with .h5ad → .rds conversion. Cleanest isolation. |
| Should the agent auto-advance the DAG or wait for user? | Auto-advance, always wait, suggest + confirm | Suggest + confirm. Show next step, ask "proceed?" |
| How granular should provenance be? | Per-tool, per-function-call, per-line-of-code | Per-tool invocation. Each tool is one provenance Activity. |
| Context compaction: summarize or discard? | Summarize to MemPalace, discard old turns, hierarchical summary | Summarize + mine to MemPalace. Keep summaries queryable. |