# Configuration Module

## Responsibility

The `config` module is the centralized authority for configuration loading and
validation.

It is the only module that:
- Reads YAML configuration files from disk
- Parses raw YAML safely (`yaml.safe_load`)
- Validates configuration structure and values
- Constructs typed, immutable configuration objects

All other modules receive an already-validated config object (e.g.
`PipelineConfig`) and never touch raw YAML.

---

## Design principles

- Fail-fast validation — invalid config raises before any stage runs
- Explicit required keys, checked per section
- Controlled allowed values (task type, log level, stage names, algorithms, …)
- Immutable configuration (`@dataclass(frozen=True)`)
- Clear separation between parsing and usage
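
As a minimal sketch of how fail-fast validation and immutability fit together,
the example below builds a frozen dataclass from an already-parsed mapping. The
names `LoggingConfig`, `ConfigError`, `build_logging_config`, and the contents
of `VALID_LOG_LEVELS` are assumptions made for illustration and are not taken
from the module's actual code.

```python
from dataclasses import dataclass, FrozenInstanceError

# Assumed set of allowed values; the real VALID_* constants live in schema.py.
VALID_LOG_LEVELS = frozenset({"DEBUG", "INFO", "WARNING", "ERROR"})


class ConfigError(ValueError):
    """Hypothetical error type raised for any invalid configuration."""


@dataclass(frozen=True)
class LoggingConfig:
    """Hypothetical typed config object, used only for this illustration."""

    log_level: str
    output_dir: str


def build_logging_config(raw: dict) -> LoggingConfig:
    # Fail fast: reject the mapping before any object is constructed.
    missing = {"log_level", "output_dir"} - set(raw)
    if missing:
        raise ConfigError(f"missing required keys: {sorted(missing)}")
    if raw["log_level"] not in VALID_LOG_LEVELS:
        raise ConfigError(f"invalid log_level: {raw['log_level']!r}")
    return LoggingConfig(log_level=raw["log_level"], output_dir=raw["output_dir"])


cfg = build_logging_config({"log_level": "INFO", "output_dir": "outputs"})

try:
    cfg.log_level = "DEBUG"  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False
```

Because the object cannot be mutated after construction, a stage that receives
it can rely on the values having passed validation exactly once.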

---

## Module layout

The Python code is split by responsibility:

| File | Responsibility |
|------|----------------|
| `schema.py` | All frozen dataclasses (`PipelineConfig`, `TrainingConfig`, …) and the `VALID_*` / `REQUIRED_*` constant sets. |
| `validation.py` | Shared validation helpers (`_validate_enum`, `_validate_section`, …). |
| `loader.py` | Public facade — re-exports every `load_*` function and dataclass so existing imports keep working. |
| `pipeline_loader.py` | `load_config` → `PipelineConfig` (the top-level config). |
| `training_loader.py` | `load_training_config` → `TrainingConfig`. |
| `preprocessing_loader.py` | `load_preprocessing_config` → `PreprocessingConfig`. |
| `evaluation_loader.py` | `load_evaluation_config` → `EvaluationConfig`. |
| `promotion_loader.py` | `load_promotion_config` → `PromotionConfig`. |
| `deployment_loader.py` | `load_deployment_config` → `DeploymentConfig`. |
| `drift_loader.py` | `load_drift_config` → `DriftConfig`. |

---

## Pipeline configs

A run is selected by passing one pipeline config to `run-pipeline --config`.
There is one file per task rather than a single generic `pipeline.yaml`:

| File | Task |
|------|------|
| `pipeline_tabular_classification.yaml` | Tabular classification |
| `pipeline_tabular_classification_ci.yaml` | Tabular classification, CI variant (no `promotion` stage) |
| `pipeline_tabular_regression.yaml` | Tabular regression |
| `pipeline_image_cnn.yaml`, `pipeline_cifar10.yaml` | Standard-image CNN classification |
| `pipeline_image_raw.yaml`, `pipeline_fivek.yaml` | Raw-image (ISP → CNN) classification |

Each pipeline config defines:
- Project metadata (`project.name`, `project.version`)
- `task_type` — one of `classification`, `regression`, `image_classification_cnn`
- `random_seed`
- `dataset` and `data` directory paths
- `pipeline_stages` — ordered subset of `preprocessing`, `training`,
  `evaluation`, `model_analysis`, `promotion`, `deployment`
- `output_dir`
- `log_level`
- `mlflow` settings
- `configs` — paths to the sub-configs below

`pipeline_loader.py` validates the required top-level keys, the enum values,
the stage names, and that `configs` contains all required sub-keys:
`preprocessing`, `training`, `evaluation`, `deployment`, `promotion`, `drift`.
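
The checks described above can be sketched as follows. The allowed values and
the required `configs` sub-keys are taken from this section; the function name
`validate_pipeline_dict`, the constant names, the reduced set of required
top-level keys, and the bare file paths in the example are assumptions for
illustration only.

```python
VALID_TASK_TYPES = frozenset(
    {"classification", "regression", "image_classification_cnn"}
)
VALID_STAGES = frozenset(
    {"preprocessing", "training", "evaluation", "model_analysis",
     "promotion", "deployment"}
)
REQUIRED_SUB_CONFIGS = frozenset(
    {"preprocessing", "training", "evaluation", "deployment", "promotion", "drift"}
)
# Assumed subset; the real loader checks more top-level keys than these.
REQUIRED_TOP_LEVEL = frozenset({"task_type", "pipeline_stages", "configs"})


def validate_pipeline_dict(raw: dict) -> None:
    """Raise ValueError on the first problem found in a parsed pipeline config."""
    missing = sorted(REQUIRED_TOP_LEVEL - set(raw))
    if missing:
        raise ValueError(f"missing top-level keys: {missing}")
    if raw["task_type"] not in VALID_TASK_TYPES:
        raise ValueError(f"invalid task_type: {raw['task_type']!r}")
    unknown = [s for s in raw["pipeline_stages"] if s not in VALID_STAGES]
    if unknown:
        raise ValueError(f"unknown pipeline stages: {unknown}")
    missing_sub = sorted(REQUIRED_SUB_CONFIGS - set(raw["configs"]))
    if missing_sub:
        raise ValueError(f"configs is missing sub-keys: {missing_sub}")


# A CI-style config: no promotion stage, but every sub-config path is present.
ci_config = {
    "task_type": "classification",
    "pipeline_stages": ["preprocessing", "training", "evaluation"],
    "configs": {
        "preprocessing": "preprocessing_tabular.yaml",
        "training": "training_classification.yaml",
        "evaluation": "evaluation.yaml",
        "deployment": "deployment.yaml",
        "promotion": "promotion.yaml",
        "drift": "drift.yaml",
    },
}
validate_pipeline_dict(ci_config)  # passes silently
```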

---

## Sub-configs

Referenced from a pipeline config's `configs` block. Most have per-task
variants; the loader reads whichever path the pipeline config points to.

### Preprocessing (`preprocessing_tabular.yaml`, `preprocessing_image.yaml`, `preprocessing_cifar10.yaml`, `preprocessing_raw_image.yaml`)

Controls data-contract enforcement and the transform policy. Tabular contract
keys:

| Key                | Type | Description                                              |
|--------------------|------|----------------------------------------------------------|
| `fail_on_nulls`    | bool | Hard-fail if a feature column contains nulls when `missing_values.policy=fail` |
| `min_rows`         | int  | Minimum rows required in each split                      |
| `validate_types`   | bool | Enforce that column dtypes match the declared schema     |
| `validate_labels`  | bool | Verify the target column only contains declared labels   |
| `validate_on_skip` | bool | Re-validate splits on a cache hit (slower, used in CI)   |

The preprocessing config also controls encoding, scaling, and the missing-value
policy; the image variants control resize/normalisation and, for raw images, the
ISP pipeline stages.
Per-dataset schema/constraints (`label_classes`, `max_null_fraction`) live in
`data/raw/<dataset>/dataset.yaml` under the `constraints` key.
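
To illustrate how two of the contract keys might be enforced, the sketch below
checks `min_rows` and `fail_on_nulls` against a split represented as a list of
row dictionaries. The function `check_contract`, its parameters, and the
row representation are hypothetical; the real preprocessing stage may work on a
different data structure entirely.

```python
def check_contract(rows, feature_columns, contract, missing_policy="fail"):
    """Raise ValueError if a split violates the (assumed) contract semantics."""
    if len(rows) < contract["min_rows"]:
        raise ValueError(
            f"split has {len(rows)} rows, fewer than min_rows={contract['min_rows']}"
        )
    # Nulls are only a hard failure when the missing-value policy is 'fail'.
    if contract["fail_on_nulls"] and missing_policy == "fail":
        for column in feature_columns:
            if any(row.get(column) is None for row in rows):
                raise ValueError(f"feature column {column!r} contains nulls")


contract = {"fail_on_nulls": True, "min_rows": 2}
train_rows = [
    {"age": 31, "income": 52000.0},
    {"age": 45, "income": None},
]

# With a non-failing policy (the name 'impute' is an assumption) nulls pass.
check_contract(train_rows, ["age", "income"], contract, missing_policy="impute")

try:
    check_contract(train_rows, ["age", "income"], contract, missing_policy="fail")
except ValueError as exc:
    null_error = str(exc)
```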

### Training (`training_classification.yaml`, `training_regression.yaml`, `training_image_cnn.yaml`)

Model algorithm, hyperparameters, and (for CNNs) architecture and fine-tuning
settings.

### `evaluation.yaml`

Classification and regression evaluation parameters.

### `promotion.yaml`

Declarative promotion rules (metric thresholds and operators) and the
`promotion_evaluation_split` setting (`val` / `test` / `both`).
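
One way to evaluate declarative rules of this kind is sketched below. The rule
keys (`metric`, `operator`, `threshold`), the supported operator strings, and
the function names are assumptions; only the `val` / `test` / `both` values of
`promotion_evaluation_split` come from the description above.

```python
import operator

# Assumed operator vocabulary for the rules.
OPERATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
}


def splits_to_check(setting: str) -> tuple:
    """Map promotion_evaluation_split to the splits whose metrics are checked."""
    if setting == "both":
        return ("val", "test")
    if setting in ("val", "test"):
        return (setting,)
    raise ValueError(f"invalid promotion_evaluation_split: {setting!r}")


def should_promote(rules: list, metrics_by_split: dict, setting: str) -> bool:
    """Return True only if every rule holds on every selected split."""
    for split in splits_to_check(setting):
        for rule in rules:
            value = metrics_by_split[split][rule["metric"]]
            compare = OPERATORS[rule["operator"]]
            if not compare(value, rule["threshold"]):
                return False
    return True


# Illustrative rules and metric values, not real results.
rules = [
    {"metric": "accuracy", "operator": ">=", "threshold": 0.90},
    {"metric": "log_loss", "operator": "<", "threshold": 0.35},
]
metrics = {
    "val": {"accuracy": 0.93, "log_loss": 0.30},
    "test": {"accuracy": 0.88, "log_loss": 0.33},
}

promote_on_val = should_promote(rules, metrics, "val")
promote_on_both = should_promote(rules, metrics, "both")
```

In this example the model would pass on `val` alone but not on `both`, because
the test-split accuracy falls below the threshold.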

### `deployment.yaml`

Prediction-service settings: server host/port, model serving stage, healthcheck,
and the optional reload endpoint.

### `drift.yaml`

Drift-monitoring configuration: statistical tests, thresholds, and severity
bands shared by the tabular and image drift workflows.