Skip to content

RF: Decouple fMRIPrep from sMRIPrep with a precomputed-first strategy #3610

@oesteban

Description

@oesteban

Summary

This issue proposes decoupling fMRIPrep from sMRIPrep by adopting a precomputed-first strategy:

  1. fMRIPrep first checks if all sMRIPrep fit derivatives are already present (via --derivatives). If so, it populates buffer nodes and skips all sMRIPrep computation.
  2. If the fit derivatives are incomplete, fMRIPrep falls back to calling init_anat_fit_wf exactly as today, passing any partial cache as precomputed=.

This would remove ~200 lines of fragile inline orchestration from init_single_subject_wf, make the sMRIPrep output contract explicit, and enable the "run sMRIPrep once, fMRIPrep many times" workflow for multi-task/multi-session datasets.


Current coupling: three layers

Layer 1: Workflow embedding (tight)

fMRIPrep imports init_anat_fit_wf() and embeds it directly into its Nipype DAG (call site at L353-L371), passing 16 configuration kwargs. The returned pe.Workflow is a sub-workflow of fMRIPrep's subject graph — they share the same execution context.

Additionally, clean_datasinks() patches every DataSink in the entire workflow tree to remove out_path_base, forcing all sMRIPrep outputs into fMRIPrep's output directory. This is a maintenance hazard: any new DataSink added in sMRIPrep is silently patched. It is called from L617 and L941.

Layer 2: Transform-stage orchestration (~200 lines — the main cost)

Beyond init_anat_fit_wf, fMRIPrep imports and assembles ~10 sMRIPrep transform-stage workflows inline at level == 'full'. These are the most fragile lines in fMRIPrep — they break whenever sMRIPrep changes workflow signatures:

Code block Lines What it does
Template iterator + standard-space volumes 29 init_template_iterator_wf + init_ds_anat_volumes_wf — resample T1w/mask/dseg/tpms into each standard space
Surface derivatives + outputs 39 init_surface_derivatives_wf + init_ds_surfaces_wf + init_ds_surface_metrics_wf + init_ds_fs_segs_wf — inflated surfaces, curvature, aparc, aseg
CIFTI morphometrics pipeline 111 init_gifti_morphometrics_wf + init_hcp_morphometrics_wf + init_morph_grayords_wf + init_resample_surfaces_wf + init_ds_grayord_metrics_wf + init_ds_surfaces_wf

Layer 3: Utility imports (loose, acceptable)

Other sMRIPrep imports across fMRIPrep:

File Line Import
base.py:100 stringify_sessions
base.py:271 collect_derivatives (as collect_anat_derivatives)
base.py:505 TemplateFlowSelect
bold/base.py:678 init_resample_surfaces_wf (for non-CIFTI surface outputs)
bold/resampling.py:884,961 smriprep.data (atlas ROI files)
interfaces/reports.py:43 ReconAll (for reports)
interfaces/bids.py:184 stringify_sessions

What fMRIPrep consumes from sMRIPrep

Every output field flowing from anat_fit_wf.outputnode into downstream workflows (connection block at L882-L900 for BOLD, plus L427-L445 for standard space, L474-L498 for surfaces, L556-L614 for CIFTI):

Tier 1 — Core volumetric (always needed):
t1w_preproc, t1w_mask, t1w_dseg, t1w_tpms, t1w_valid_list

Tier 2 — FreeSurfer / surfaces (when run_reconall):
subjects_dir, subject_id, fsnative2t1w_xfm, white, pial, midthickness, thickness, sulc, curv, anat_ribbon, cortex_mask, sphere_reg_fsLR/sphere_reg_msm

Tier 3 — Standard space and CIFTI (at level='full'):
template, anat2std_xfm, std2anat_xfm, std_t1w, std_mask, midthickness_fsLR, curv_fsLR, thickness_fsLR, sulc_fsLR

Critical observation: There is no formal contract specification. The coupling is encoded purely in Nipype connection tuples scattered across base.py. A field rename in sMRIPrep silently breaks fMRIPrep.


Proposed architecture

The two-phase design

Phase 1 — Precomputed-first lookup: fMRIPrep checks for existing sMRIPrep derivatives (via --derivatives or --anat-derivatives). Using smriprep.utils.bids.collect_derivatives(), it populates an anatomical_cache dict. If the cache contains all fields required by the current configuration (level, run_reconall, spaces, cifti_output, msm_sulc), fMRIPrep skips init_anat_fit_wf entirely and feeds the cache into buffer nodes.

Phase 2 — Fallback to sMRIPrep: If the cache is incomplete, fMRIPrep calls init_anat_fit_wf as today, with the partial cache as precomputed=. This preserves the single-command user experience — fmriprep /data /out participant continues to work without pre-running sMRIPrep.

Targeted edits in init_single_subject_wf

1. Cache collection + validation (enhance L269-L283)

Currently ~15 lines collecting the cache. Add a validation function (~40 lines) that checks whether the cache satisfies the current configuration:

anatomical_cache = collect_and_validate_anat_cache(
    derivatives=config.execution.derivatives,
    subject_id=subject_id,
    session_id=session_id,
    spaces=spaces,
    freesurfer=freesurfer,
    msm_sulc=msm_sulc,
    cifti_output=config.workflow.cifti_output,
    level=config.workflow.level,
)
anat_cache_complete = _is_anat_cache_complete(anatomical_cache, ...)

Validate outputs, not settings: if the right files exist in the right spaces, they are usable regardless of the parameters that produced them.

2. Conditional anat_fit_wf (replace L353-L371)

if anat_cache_complete:
    # Buffer-only: no sMRIPrep computation
    anat_fit_wf = _init_anat_buffer_wf(anatomical_cache, name='anat_fit_wf')
else:
    # Fallback: run sMRIPrep, passing partial cache
    from smriprep.workflows.anatomical import init_anat_fit_wf
    anat_fit_wf = init_anat_fit_wf(
        ...same 16 kwargs...,
        precomputed=anatomical_cache,
    )

The _init_anat_buffer_wf function (~30 lines) returns a trivial workflow whose outputnode mirrors init_anat_fit_wf's outputnode, populated from the cache.

3. REMOVE inline transform-stage orchestration (L413-L614)

This is the big win. These ~200 lines assemble 10+ sMRIPrep workflows with intricate connections. Under decoupling, all of these become sMRIPrep's responsibility when run standalone. fMRIPrep would only retain:

These are not anatomical outputs — they are transform selectors that thread spatial mappings into the functional pipeline. They should stay.

4. REMOVE clean_datasinks() (L1022-L1027)

With sMRIPrep writing to its own derivatives directory, there is no need to patch out_path_base. fMRIPrep would instead add sMRIPrep's output as a DatasetLink for BIDS-URI provenance. Remove both call sites at L617 and L941.

5. BOLD connections stay unchanged (L882-L939)

The connections from anat_fit_wf.outputnode to bold_wf.inputnode remain identical. The buffer workflow presents the same outputnode interface as the full init_anat_fit_wf.

6. Remaining sMRIPrep imports to address

File Import Action
bold/base.py:678 init_resample_surfaces_wf (for non-CIFTI surface outputs) Could be buffered if sMRIPrep emits midthickness_{template}_{density}
bold/resampling.py:884,961 smriprep.data (atlas ROI files for CIFTI) Atlas ROIs should be moved to a shared location or looked up from the sMRIPrep derivatives
interfaces/reports.py:43 ReconAll (for reports) This is a lightweight interface import; acceptable to keep
interfaces/bids.py:184 stringify_sessions Utility; acceptable to keep

Code simplification assessment

Section Current lines After decoupling Change
Cache collection + validation ~15 ~60 +45 (new validation + buffer function)
anat_fit_wf call ~19 ~25 +6 (conditional + fallback)
Transform-stage orchestration (L413-L614) ~200 ~0 -200
clean_datasinks() + calls ~8 ~0 -8
sMRIPrep import block (L173-L185) ~13 ~3 -10 (keep init_anat_fit_wf, collect_derivatives, stringify_sessions)
Net ~255 ~88 -167 lines, -27% of init_single_subject_wf

The 200 most complex lines are eliminated — the ones that assemble 10+ sMRIPrep workflows with intricate connection tuples that break whenever sMRIPrep's function signatures change.


Prerequisites (changes needed in sMRIPrep)

  1. sMRIPrep standalone must emit ALL outputs fMRIPrep needs — including surface derivatives, fsLR-resampled surfaces, CIFTI morphometrics, and cortex masks. Currently, fMRIPrep builds these transform-stage workflows itself because init_anat_preproc_wf does not cover everything.

  2. Add --level flag to sMRIPrep CLI--level minimal (fit only, fast, deterministic, reusable) vs --level full (fit + all transforms). This mirrors fMRIPrep's existing level gating.

  3. Expand collect_derivatives() to cover all fields — including midthickness_fsLR, sphere_reg_fsLR/sphere_reg_msm, cortex_mask, anat_ribbon, CIFTI morphometric dscalars.

  4. Standardize FreeSurfer output path — fMRIPrep's BBR registration (init_bbreg_wf) needs a live SUBJECTS_DIR with the full reconstruction (not just output surfaces). sMRIPrep should place this at a standard path (sourcedata/freesurfer/).

  5. Version stamp in dataset_description.jsonGeneratedBy with sMRIPrep version and container info for compatibility checking.


Risks and mitigations

Risk Mitigation
Version mismatch between sMRIPrep producer and fMRIPrep consumer Validate outputs, not settings. If the required files exist, they are usable. Check GeneratedBy in dataset_description.json for compatibility warnings.
Configuration divergence (e.g., sMRIPrep run without MNI152NLin2009cAsym but fMRIPrep needs it for carpetplots) _is_anat_cache_complete() checks that needed normalizations/surfaces exist. If missing, fail with a clear error or fall back to running sMRIPrep.
FreeSurfer directory completeness collect_derivatives() validates the FreeSurfer dir exists and contains key files (mri/brain.mgz, surf/lh.white, etc.)
Single-command UX regression The fallback path ensures fmriprep /data /out participant continues to work without pre-running sMRIPrep. Decoupling is an optimization, not a requirement.
Transform-stage boundary is fuzzy (surface derivatives involve computation, not just file writes) sMRIPrep owns these computations when run standalone. The contract is "provide these BIDS-Derivatives files," not "run these specific workflows."

Recommended phasing

Phase Scope Work
1 sMRIPrep only Add --level, emit all outputs fMRIPrep needs standalone, expand collect_derivatives(), standardize FreeSurfer path
2 fMRIPrep only Add _is_anat_cache_complete() + _init_anat_buffer_wf(), make anat_fit_wf conditional, remove inline transform orchestration (L413-L614), remove clean_datasinks() (L1022-L1027), add DatasetLink for sMRIPrep
3 Integration Integration tests with synthetic sMRIPrep derivative fixtures, contract documentation, user-facing documentation for the two-step workflow

Phase 1 is backwards-compatible: fMRIPrep continues to work unchanged until Phase 2 is implemented.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions