Refactor and API changes #4950

@jasonb5

Compatibility-first change policy

A high-priority goal of this refactor is to preserve current usage patterns,
existing workflows, and current functionality unless change is absolutely
required.

Where possible, the refactor should preserve both:

  • interface compatibility
  • functional behavior and supported capabilities

If a breaking or behavior-changing modification is unavoidable, it must be
explicitly justified and accompanied by a migration plan.

Refactor

This RFC proposes an incremental refactor of CIME to improve:

  • dependency injection
  • resiliency and fault recovery
  • modularity and maintainability
  • testability
  • compatibility-preserving internal evolution

The proposed package layout uses the following structure as a baseline:

CIME/api/ -> user-facing classes, e.g. Case, Downloader
CIME/cli/
CIME/core/
CIME/core/build/
CIME/core/baseline/
CIME/core/batch/
CIME/core/config/
CIME/core/compare/
CIME/core/downloader/
CIME/core/locking/
CIME/core/mods/
CIME/core/namelist/
CIME/core/status/
CIME/core/timing/
CIME/core/tools/
CIME/core/xml/
CIME/data/
CIME/non_py/
CIME/build_scripts/
CIME/SystemTests/
CIME/tests/

This layout is a starting point, not a hard limit. Additional directories under
CIME/core/ may be added as needed.

The refactor is explicitly designed to preserve compatibility for:

  • symlinked tools created in case directories
  • external scripts using build_scripts
  • external models that integrate with CIME, including:
    • E3SM (E3SM-Project/E3SM)
    • CESM (ESCOMP/CESM)
    • NorESM (NorESMhub/NorESM)

Constraints and assumptions

CIME is not currently an installed package

Today, CIME is not generally consumed as an installed Python package. Instead,
cases create symlinked tools in the case directory, and those tools must modify
sys.path so they can import CIME.

This behavior must continue to work until a packaging transition is actually
adopted. This RFC does not assume immediate installed-package semantics.

CLI is optional

A CIME/cli/ layer is allowed in the design, but implementing it is optional.
The architecture should support it, not require it.

build_scripts must remain

External scripts depend on build_scripts, so the directory must remain as a
compatibility surface.

Logic may move into CIME/core/build/, but CIME/build_scripts must remain as
a stable wrapper or adapter layer.

External model compatibility is a core requirement

CIME is used by external models, notably E3SM, CESM, and NorESM. Internal
refactors must minimize behavioral impact on these downstream consumers.

Baseline layout, not final taxonomy

The provided directory layout is the baseline. Additional CIME/core/*
subpackages may be added if they improve boundaries.

Goals

Primary goals

  • Improve dependency injection around side-effecting infrastructure
  • Improve resiliency of build, submit, run, and recovery workflows
  • Reduce hidden global state and implicit runtime coupling
  • Preserve external-facing behavior where possible
  • Separate user-facing API from core orchestration logic
  • Make current non-installed usage safer and more structured

Secondary goals

  • Prepare for a future installed-package model without requiring it now
  • Keep symlinked tool behavior working during transition
  • Preserve build_scripts compatibility for outside consumers
  • Make integration boundaries clearer for external models

Non-goals

This RFC does not require:

  • immediate conversion to an installed package
  • immediate introduction of a CLI layer
  • removal of symlinked case tools
  • breaking changes to build_scripts
  • breaking changes for external model integrations
  • wholesale rewrite of CIME internals in a single step

Compatibility-first principles

Preserve external model behavior

External models such as E3SM, CESM, and NorESM are first-class
compatibility targets.

Changes should preserve current invocation patterns unless a migration path is
explicitly provided.

Preserve symlinked case tools

Case-created symlinked tools must continue to run even when CIME is not
installed.

Preserve build_scripts entrypoints

CIME/build_scripts remains externally visible and stable, even if its logic is
internally delegated.

Internal evolution behind stable surfaces

Move implementation into CIME/core/*, but keep stable wrappers in:

  • CIME/api/
  • CIME/build_scripts
  • legacy tool locations as needed

Architectural model

Baseline layout

CIME/
  api/
  cli/                  # optional
  core/
    build/
    baseline/
    batch/
    config/
    compare/
    downloader/
    locking/
    mods/
    namelist/
    status/
    timing/
    tools/
    xml/
    ...                 # additional core dirs allowed
  data/
  non_py/
  build_scripts/
  SystemTests/
  tests/

Recommended additional core/ directories

Not required immediately, but likely useful:

  • CIME/core/case/
  • CIME/core/runtime/
  • CIME/core/plugins/
  • CIME/core/logging/

These help avoid turning existing buckets like tools/ or config/ into
miscellaneous catch-alls.

User-facing and compatibility layers

CIME/api/

Purpose:

  • stable user-facing classes
  • preserve conceptual public API
  • delegate to core services

Expected examples:

  • Case
  • Downloader

CIME/api/ should contain facade classes, not core workflow logic.

CIME/build_scripts/

Purpose:

  • preserve compatibility for outside scripts
  • remain importable or invokable by current consumers
  • delegate implementation to CIME/core/build/

This directory stays in place even if most logic moves elsewhere.

Optional CIME/cli/

Purpose:

  • possible future thin entrypoints
  • not required by this RFC

If implemented, it should remain thin:

  • parse args
  • compose services
  • call use cases
  • map exceptions to exit codes
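A thin entrypoint with those four responsibilities might look like the sketch below. The command name `cime-submit`, the `submit_case` use case, and the exit-code mapping are all hypothetical, and the three exception classes are local stand-ins for the hierarchy proposed under Resiliency design.

```python
import argparse
import sys
from typing import Callable


class CimeError(Exception):
    """Local stand-ins for the proposed exception hierarchy."""


class ConfigurationError(CimeError):
    pass


class ExternalCommandError(CimeError):
    pass


# Hypothetical exit codes; argparse itself already exits with 2 on usage errors.
EXIT_CODES: dict[type[CimeError], int] = {
    ConfigurationError: 3,
    ExternalCommandError: 4,
}


def submit_case(case_root: str, dry_run: bool) -> str:
    """Hypothetical use case standing in for a core/batch service call."""
    if not case_root:
        raise ConfigurationError("case root must not be empty")
    return f"would submit {case_root}" if dry_run else f"submitted {case_root}"


def main(argv=None, use_case: Callable[[str, bool], str] = submit_case) -> int:
    # 1. parse args
    parser = argparse.ArgumentParser(prog="cime-submit")
    parser.add_argument("case_root")
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args(argv)
    # 2-3. compose services (here: the injected use case) and call it
    try:
        print(use_case(args.case_root, args.dry_run))
    except CimeError as exc:
        # 4. map exceptions to exit codes, most specific class first
        print(f"ERROR: {exc}", file=sys.stderr)
        for exc_type in type(exc).__mro__:
            if exc_type in EXIT_CODES:
                return EXIT_CODES[exc_type]
        return 1
    return 0
```

A real entrypoint would finish with `sys.exit(main())`. Codes 3 and 4 are chosen here only to avoid colliding with the status argparse already uses for usage errors.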

Managing the non-installed package model

This is one of the most important parts of the RFC.

Current reality

Symlinked tools inside a case directory need to locate the CIME source tree and
mutate sys.path in order to import CIME.

This is currently necessary and must continue to work until CIME is packaged and
installed consistently.

Problem

Today, sys.path manipulation appears in multiple places and is mixed with
general-purpose logic, which creates:

  • inconsistent import behavior
  • process-global mutation
  • brittle assumptions about current working directory and source layout
  • difficult debugging for external consumers

Proposed compatibility design

Instead of eliminating path setup immediately, centralize it.

The canonical location for this logic will be:

CIME/core/config/bootstrap.py

Responsibilities of the bootstrap layer:

  • determine CIME source root from:
    • symlink target location
    • script location
    • case metadata
    • known relative layouts
  • add only the minimum required entries to sys.path
  • do so in one controlled place
  • expose a single helper for legacy scripts and symlinked tools

Conceptual API:

def ensure_cime_on_path(script_path=None, case_root=None):
    """
    Ensure the source-tree CIME package is importable in a non-installed setup.

    Returns the resolved CIME root.
    """

Design rule

Until CIME is packaged and installed consistently, controlled path bootstraps
are acceptable, but ad hoc sys.path manipulation scattered through the
codebase is not.

All new bootstrap or path-resolution logic should go through
CIME/core/config/bootstrap.py.

Long-term transition path

If and when CIME becomes installable:

  1. bootstrap first checks normal import
  2. falls back to source-tree path bootstrap
  3. eventually source-tree bootstrap can be deprecated
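The three steps could be expressed as a single import helper. This is a sketch: `import_cime` and its `warn_on_fallback` switch are hypothetical, and `source_fallback` stands for whatever performs the source-tree bootstrap (for example, the `ensure_cime_on_path` helper described above).

```python
import importlib
import importlib.util
import warnings
from types import ModuleType
from typing import Callable


def import_cime(
    source_fallback: Callable[[], object],
    package_name: str = "CIME",
    warn_on_fallback: bool = False,
) -> tuple[ModuleType, str]:
    """Import CIME, preferring a normal import over the source-tree bootstrap.

    Returns the module and which path was taken: "import" or "source-tree".
    """
    # Step 1: a normal import works if CIME is installed (or already on sys.path).
    if importlib.util.find_spec(package_name) is not None:
        return importlib.import_module(package_name), "import"
    # Step 3 (later): enable the warning to start deprecating the fallback.
    if warn_on_fallback:
        warnings.warn(
            "source-tree bootstrap is deprecated",
            DeprecationWarning,
            stacklevel=2,
        )
    # Step 2: fall back to the controlled source-tree bootstrap.
    source_fallback()
    importlib.invalidate_caches()
    return importlib.import_module(package_name), "source-tree"
```

With this shape, deprecation is a one-line change in one place (turn on the warning, later remove the fallback) rather than an edit at every call site.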

This provides a clean migration path without forcing it now.

Dependency injection design

DI strategy

Use lightweight dependency injection:

  • constructor injection
  • explicit factories
  • composition at entrypoints and facades
  • no framework required

Injectable boundaries

Inject only side-effecting or environment-specific services:

  • filesystem
  • subprocess or process execution
  • environment variables
  • XML store
  • scheduler integration
  • clock or time
  • bootstrap or path resolver
  • plugin or customization loading

Example interfaces

from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Protocol, Sequence

@dataclass(frozen=True)
class CommandResult:
    returncode: int
    stdout: str
    stderr: str

class FileSystem(Protocol):
    def exists(self, path: Path) -> bool: ...
    def is_dir(self, path: Path) -> bool: ...
    def mkdir(self, path: Path, parents: bool = False) -> None: ...
    def read_text(self, path: Path) -> str: ...
    def write_text(self, path: Path, content: str) -> None: ...

class ProcessRunner(Protocol):
    def run(
        self,
        argv: Sequence[str],
        cwd: Path | None = None,
        env: Mapping[str, str] | None = None,
        timeout: int | None = None,
    ) -> CommandResult: ...

class EnvironmentProvider(Protocol):
    def get(self, key: str, default: str | None = None) -> str | None: ...

class Clock(Protocol):
    def now_iso(self) -> str: ...

Core subsystem guidance

CIME/core/build/

Owns:

  • build planning
  • build orchestration
  • build dependency logic

Externally visible wrappers remain in CIME/build_scripts.

CIME/core/batch/

Owns:

  • submit and resubmit behavior
  • scheduler abstraction
  • batch dependency rules
  • recovery-aware submission flow

CIME/core/config/

Owns:

  • typed config models
  • runtime config loading
  • customization coordination
  • bootstrap for non-installed source-tree execution

CIME/core/xml/

Owns:

  • XML store abstraction
  • caching policy
  • schema validation
  • domain mapping from XML to typed objects

CIME/core/status/

Owns:

  • workflow state tracking
  • case status writes
  • structured events
  • state transition rules

CIME/core/locking/

Owns:

  • lock-file policy
  • mutation guards
  • concurrency-sensitive file handling
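A minimal lock-file guard is sketched below, assuming a lock is represented by a file created atomically with `O_CREAT | O_EXCL`. `FileLock` is hypothetical, and `LockingError` is a local stand-in for the class in the proposed hierarchy. Stale-lock detection (for example, checking whether the recorded PID is still alive) is deliberately left out.

```python
import os
from pathlib import Path


class LockingError(Exception):
    """Local stand-in for the proposed LockingError."""


class FileLock:
    """Hypothetical lock-file guard based on atomic exclusive creation."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)
        self._held = False

    def acquire(self) -> None:
        if self._held:
            return  # idempotent: re-acquiring our own lock is a no-op
        try:
            # O_EXCL makes creation fail if the file already exists, so only
            # one process can win the race.
            fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            raise LockingError(f"lock already held: {self.path}") from None
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))  # record the owner for diagnostics
        self._held = True

    def release(self) -> None:
        if not self._held:
            return  # idempotent: releasing twice is harmless
        self.path.unlink(missing_ok=True)
        self._held = False

    def __enter__(self) -> "FileLock":
        self.acquire()
        return self

    def __exit__(self, *exc_info) -> bool:
        self.release()
        return False  # never swallow exceptions from the guarded block
```

Both `acquire` and `release` are idempotent for the owning instance, which matches the idempotency goals listed under Resiliency design.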

CIME/core/tools/

Owns:

  • shared tool support logic only

It should not become a generic dumping ground.

Case refactor strategy

Preserve public role

Case remains the primary user-facing class for now.

Change implementation role

Internal logic may be extracted from Case incrementally so that Case
increasingly serves as a compatibility facade over core services.

Compatibility requirement

Existing external code that imports or uses Case should continue to work with
minimal or no changes wherever feasible.

Physical move timing

Case should remain in its current location until the final stage of migration.

A physical move of Case to a new package location, if still desired at that
point, should occur only as part of one coordinated downstream model migration
across supported consumers.

This avoids repeated churn for downstream integrations and keeps the current
import surface stable during the refactor.

Resiliency design

Typed errors

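Replace broad, generic fatal-error control flow with typed exceptions so that
callers can distinguish failure kinds (for example, a retryable scheduler
failure versus a configuration error) and handle each appropriately.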

Example hierarchy:

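class CimeError(Exception):
    """Base class for all CIME-specific errors."""

class ConfigurationError(CimeError):
    """Missing, invalid, or inconsistent configuration."""

class ValidationError(CimeError):
    """Input or case state failed validation."""

class FilesystemError(CimeError):
    """A filesystem operation failed."""

class ExternalCommandError(CimeError):
    """An external command (build tool, scheduler, etc.) failed."""

class RetryableExternalCommandError(ExternalCommandError):
    """An external command failed in a way that may succeed on retry."""

class LockingError(CimeError):
    """A lock could not be acquired, released, or validated."""

class StateTransitionError(CimeError):
    """A requested workflow state transition is not allowed."""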

Retry and recovery policy

Move retry and recovery policy into explicit services:

  • FailureClassifier
  • RetryPolicy
  • RecoveryService
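The division of labor between the first two services could look like the sketch below: the classifier decides whether a failure is retryable, the policy decides how long to wait, and a small driver combines them. The class names follow the list above, but their methods and the exponential-backoff policy are assumptions; the two exception classes are local stand-ins, and `RecoveryService` is omitted. `sleep` is injected so tests do not actually wait.

```python
import time
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


class ExternalCommandError(Exception):
    """Local stand-in for the proposed exception."""


class RetryableExternalCommandError(ExternalCommandError):
    """Local stand-in for the proposed exception."""


class FailureClassifier:
    """Decides whether a failure is worth retrying (assumed rule: by type)."""

    def is_retryable(self, exc: BaseException) -> bool:
        return isinstance(exc, RetryableExternalCommandError)


@dataclass(frozen=True)
class RetryPolicy:
    """Exponential backoff; the defaults are illustrative, not tuned."""

    max_attempts: int = 3
    base_delay: float = 1.0  # seconds
    backoff: float = 2.0

    def delay_for(self, attempt: int) -> float:
        # attempt is 1-based: 1.0s, 2.0s, 4.0s, ... with the defaults
        return self.base_delay * self.backoff ** (attempt - 1)


def run_with_retry(
    operation: Callable[[], T],
    classifier: FailureClassifier,
    policy: RetryPolicy,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    attempt = 1
    while True:
        try:
            return operation()
        except Exception as exc:
            # Give up on non-retryable failures or when attempts run out.
            if attempt >= policy.max_attempts or not classifier.is_retryable(exc):
                raise
            sleep(policy.delay_for(attempt))
            attempt += 1
```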

Explicit workflow states

For critical operations like submit, run, and archive, define explicit states
such as:

  • READY
  • VALIDATING
  • SUBMITTING
  • RUNNING
  • FAILED_RETRYABLE
  • RECOVERING
  • SUCCEEDED
  • FAILED_FINAL
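These states can be enforced with an explicit transition table. The table below is an assumption for illustration, since this RFC does not define the allowed transitions; `SUCCEEDED` and `FAILED_FINAL` are treated as terminal, and `StateTransitionError` is a local stand-in for the proposed exception.

```python
from enum import Enum


class StateTransitionError(Exception):
    """Local stand-in for the proposed StateTransitionError."""


class WorkflowState(Enum):
    READY = "READY"
    VALIDATING = "VALIDATING"
    SUBMITTING = "SUBMITTING"
    RUNNING = "RUNNING"
    FAILED_RETRYABLE = "FAILED_RETRYABLE"
    RECOVERING = "RECOVERING"
    SUCCEEDED = "SUCCEEDED"
    FAILED_FINAL = "FAILED_FINAL"


S = WorkflowState
# Assumed transition table; SUCCEEDED and FAILED_FINAL are terminal.
ALLOWED: dict[WorkflowState, frozenset[WorkflowState]] = {
    S.READY: frozenset({S.VALIDATING}),
    S.VALIDATING: frozenset({S.SUBMITTING, S.FAILED_FINAL}),
    S.SUBMITTING: frozenset({S.RUNNING, S.FAILED_RETRYABLE, S.FAILED_FINAL}),
    S.RUNNING: frozenset({S.SUCCEEDED, S.FAILED_RETRYABLE, S.FAILED_FINAL}),
    S.FAILED_RETRYABLE: frozenset({S.RECOVERING, S.FAILED_FINAL}),
    S.RECOVERING: frozenset({S.READY, S.FAILED_FINAL}),
    S.SUCCEEDED: frozenset(),
    S.FAILED_FINAL: frozenset(),
}


class Workflow:
    def __init__(self, state: WorkflowState = S.READY) -> None:
        self.state = state
        self.history = [state]

    def transition(self, target: WorkflowState) -> None:
        if target is self.state:
            return  # repeating the current state is a no-op (idempotent)
        if target not in ALLOWED[self.state]:
            raise StateTransitionError(f"{self.state.value} -> {target.value}")
        self.state = target
        self.history.append(target)
```

Keeping the rules in one table makes illegal sequences (for example, READY straight to RUNNING) fail loudly instead of silently corrupting case status.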

Idempotency

Prioritize idempotency for:

  • lock and unlock operations
  • status updates
  • regeneration steps
  • archive and restore steps
  • submit and resubmit transitions
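For status updates, one way to get idempotency is to key each entry and skip writes that are already recorded, combined with an atomic replace so an interrupted write cannot truncate the file. `record_status` and its one-line-per-event format are hypothetical and are not the current case status file format; concurrent writers would still need the locking layer.

```python
import os
import tempfile
from pathlib import Path


def record_status(status_file: Path, event_id: str, message: str) -> bool:
    """Append `event_id: message` once; return True only if the file changed.

    Re-running the same step (for example after a crash and restart) leaves
    the file untouched instead of duplicating the entry.
    """
    status_file = Path(status_file)
    lines = status_file.read_text().splitlines() if status_file.exists() else []
    if any(line.split(":", 1)[0] == event_id for line in lines):
        return False
    lines.append(f"{event_id}: {message}")
    # Write to a temp file in the same directory, then atomically replace,
    # so a crash mid-write never leaves a truncated status file behind.
    fd, tmp = tempfile.mkstemp(
        dir=status_file.parent, prefix=status_file.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write("\n".join(lines) + "\n")
        os.replace(tmp, status_file)
    except BaseException:
        Path(tmp).unlink(missing_ok=True)
        raise
    return True
```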

External model compatibility plan

CIME compatibility must be evaluated against at least these active downstream
model integrations:

  • E3SM (E3SM-Project/E3SM)
  • CESM (ESCOMP/CESM)
  • NorESM (NorESMhub/NorESM)

Why this is a hard constraint

These downstream models rely on CIME-centered case and workflow behavior.
Refactors that change source-tree execution, script invocation, bootstrap
behavior, Case, or build_scripts can affect their workflows.

Compatibility rules

  1. preserve current import surfaces where possible
  2. preserve case-created symlink tool behavior
  3. preserve build_scripts
  4. preserve existing script entrypoints or provide stable wrappers
  5. avoid requiring downstream models to adopt installed-package semantics
  6. validate major refactors against representative downstream integration paths
    from E3SM, CESM, and NorESM

build_scripts compatibility surface

For build_scripts, the following are part of the strict compatibility surface:

  • command names
  • invocation patterns
  • accepted arguments and options
  • argument semantics
  • user-visible parsing behavior
  • argument-handling behavior
  • exit behavior tied to argument handling

Implementation beneath that layer may move into CIME/core/build/ or other core
packages.

Downstream regression checkpoints

Major refactors should be validated against representative downstream workflows
for E3SM, CESM, and NorESM.

At minimum, regression checkpoints should include:

  • source-tree CIME usage
  • case creation workflow
  • symlinked case-tool execution
  • standard setup, build, and run path

Cross-model checkpoint categories

For all three downstream models, validate:

  • import and bootstrap behavior in non-installed mode
  • user-facing script invocation compatibility
  • Case compatibility at current import location
  • build_scripts argument parsing compatibility
  • no unexpected changes in standard case lifecycle sequencing

This should be treated as a migration validation requirement for major refactor
milestones.

Proposed migration phases

Phase 1 — Compatibility bootstrap and seams

  • create centralized source-tree bootstrap for non-installed imports
  • add FileSystem, ProcessRunner, EnvironmentProvider, Clock
  • leave behavior unchanged

Phase 2 — Error model and status layer

  • add typed exceptions
  • add workflow and status services
  • improve failure classification in high-value paths

Phase 3 — Build and batch extraction

  • move build logic into CIME/core/build/
  • keep CIME/build_scripts as compatibility wrapper layer
  • move submit logic into CIME/core/batch/

Phase 4 — Case internal extraction

  • continue extracting logic behind Case
  • preserve current import location and compatibility methods
  • delegate to core services

Phase 5 — XML and config extraction

  • create CIME/core/xml/ and typed config loading
  • reduce process-global XML caching behavior

Phase 6 — Optional CLI

  • implement CIME/cli/ only if desired
  • keep it thin and optional

Phase 7 — Coordinated downstream migration

  • perform one coordinated downstream model migration if needed
  • consider moving Case only at this stage
  • support installed-package mode when available
  • keep source-tree bootstrap fallback until migration is complete

Revised target tree

CIME/
  api/
    downloader.py

  cli/                       # optional

  core/
    build/
    baseline/
    batch/
    config/
      bootstrap.py
    compare/
    downloader/
    locking/
    mods/
    namelist/
    status/
    timing/
    tools/
    xml/
    case/                    # optional addition
    runtime/                 # optional addition
    plugins/                 # optional addition

  data/
  non_py/
  build_scripts/
  SystemTests/
  tests/

Recommended first implementation slice

Given these constraints, the recommended first slice is:

Slice 1

  • add centralized source-tree bootstrap helper for symlinked tools
  • add typed exception hierarchy
  • add ProcessRunner abstraction
  • add FileSystem abstraction

Slice 2

  • move batch and submit logic into CIME/core/batch/
  • keep current entrypoints and wrappers intact

Slice 3

  • move build logic into CIME/core/build/
  • keep CIME/build_scripts stable and delegating

Slice 4

  • begin extracting logic behind Case
  • preserve external consumer behavior
