MLOps-CI-Pipeline/src/config at main · DriftForgeLab/MLOps-CI-Pipeline

# Configuration Module

## Responsibility

The `config` module is the centralized authority for configuration loading and
validation.

It is the only module that:
- Reads YAML configuration files from disk
- Parses raw YAML safely (`yaml.safe_load`)
- Validates configuration structure and values
- Constructs typed, immutable configuration objects

All other modules receive an already-validated config object (e.g.
`PipelineConfig`) and never touch raw YAML.
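
The boundary described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the reduced field set, `build_pipeline_config`, and `describe_run` are hypothetical names, and the `raw` mapping stands in for whatever `yaml.safe_load` would return.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class PipelineConfig:
    # Hypothetical, reduced field set; the real class in schema.py has more fields.
    task_type: str
    random_seed: int
    output_dir: str


def build_pipeline_config(raw: Mapping[str, Any]) -> PipelineConfig:
    """Turn a parsed mapping (e.g. the result of yaml.safe_load) into a typed object."""
    required = ("task_type", "random_seed", "output_dir")
    missing = [key for key in required if key not in raw]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return PipelineConfig(
        task_type=str(raw["task_type"]),
        random_seed=int(raw["random_seed"]),
        output_dir=str(raw["output_dir"]),
    )


def describe_run(config: PipelineConfig) -> str:
    """A downstream consumer: it reads typed attributes, never the raw mapping."""
    return f"{config.task_type} run, seed={config.random_seed}, output={config.output_dir}"


# Illustrative values only.
raw = {"task_type": "classification", "random_seed": 42, "output_dir": "outputs"}
config = build_pipeline_config(raw)
summary = describe_run(config)
```

Keeping the mapping-to-object step in one place means a downstream function such as `describe_run` can rely on every attribute being present and correctly typed.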

---

## Design Principles

- Fail-fast validation — invalid config raises before any stage runs
- Explicit required keys, checked per section
- Controlled allowed values (task type, log level, stage names, algorithms, …)
- Immutable configuration (`@dataclass(frozen=True)`)
- Clear separation between parsing and usage
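
A small sketch of how fail-fast validation and immutability can combine in one frozen dataclass. The `validate_enum` helper, the `LoggingConfig` class, and the contents of `VALID_LOG_LEVELS` are assumptions made for illustration; the real `_validate_enum` in `validation.py` may have a different signature, and the real allowed values are defined in `schema.py`.

```python
from dataclasses import FrozenInstanceError, dataclass

# Hypothetical constant in the spirit of the VALID_* sets; values are illustrative.
VALID_LOG_LEVELS = frozenset({"DEBUG", "INFO", "WARNING", "ERROR"})


def validate_enum(name: str, value: str, allowed: frozenset) -> str:
    """Raise immediately when a value falls outside the allowed set."""
    if value not in allowed:
        raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    return value


@dataclass(frozen=True)
class LoggingConfig:
    log_level: str

    def __post_init__(self) -> None:
        # Runs at construction time, so an invalid object can never exist.
        validate_enum("log_level", self.log_level, VALID_LOG_LEVELS)


config = LoggingConfig(log_level="INFO")

try:
    config.log_level = "DEBUG"  # frozen: attribute assignment is rejected
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False

try:
    LoggingConfig(log_level="VERBOSE")
except ValueError as exc:
    error_message = str(exc)
```

Validating in `__post_init__` is one possible design; validating in the loader before constructing the dataclass achieves the same fail-fast behaviour.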

---

## Module layout

The Python code is split by responsibility:

| File | Responsibility |
|------|----------------|
| `schema.py` | All frozen dataclasses (`PipelineConfig`, `TrainingConfig`, …) and the `VALID_*` / `REQUIRED_*` constant sets. |
| `validation.py` | Shared validation helpers (`_validate_enum`, `_validate_section`, …). |
| `loader.py` | Public facade — re-exports every `load_*` function and dataclass so existing imports keep working. |
| `pipeline_loader.py` | `load_config` → `PipelineConfig` (the top-level config). |
| `training_loader.py` | `load_training_config` → `TrainingConfig`. |
| `preprocessing_loader.py` | `load_preprocessing_config` → `PreprocessingConfig`. |
| `evaluation_loader.py` | `load_evaluation_config` → `EvaluationConfig`. |
| `promotion_loader.py` | `load_promotion_config` → `PromotionConfig`. |
| `deployment_loader.py` | `load_deployment_config` → `DeploymentConfig`. |
| `drift_loader.py` | `load_drift_config` → `DriftConfig`. |

---

## Pipeline configs

A run is selected by passing one pipeline config to `run-pipeline --config`.
There is one file per task rather than a single generic `pipeline.yaml`:

| File | Task |
|------|------|
| `pipeline_tabular_classification.yaml` | Tabular classification |
| `pipeline_tabular_classification_ci.yaml` | Tabular classification, CI variant (no `promotion` stage) |
| `pipeline_tabular_regression.yaml` | Tabular regression |
| `pipeline_image_cnn.yaml`, `pipeline_cifar10.yaml` | Standard-image CNN classification |
| `pipeline_image_raw.yaml`, `pipeline_fivek.yaml` | Raw-image (ISP → CNN) classification |

Each pipeline config defines:
- Project metadata (`project.name`, `project.version`)
- `task_type` — one of `classification`, `regression`, `image_classification_cnn`
- `random_seed`
- `dataset` and `data` directory paths
- `pipeline_stages` — ordered subset of `preprocessing`, `training`,
  `evaluation`, `model_analysis`, `promotion`, `deployment`
- `output_dir`
- `log_level`
- `mlflow` settings
- `configs` — paths to the sub-configs below

`pipeline_loader.py` validates the required top-level keys, the enum values,
the stage names, and that `configs` contains all required sub-keys:
`preprocessing`, `training`, `evaluation`, `deployment`, `promotion`, `drift`.
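
The checks listed above could look roughly like the sketch below. The task types, stage names, and required `configs` sub-keys are taken from this README; the `REQUIRED_TOP_LEVEL` tuple, the function name, and the example values (project name, paths) are assumptions, and the real loader may check more or report errors differently.

```python
from typing import Any, Mapping

VALID_TASK_TYPES = frozenset({"classification", "regression", "image_classification_cnn"})
VALID_STAGES = frozenset(
    {"preprocessing", "training", "evaluation", "model_analysis", "promotion", "deployment"}
)
REQUIRED_SUBCONFIGS = ("preprocessing", "training", "evaluation", "deployment", "promotion", "drift")
# Assumption: which top-level keys are mandatory is not spelled out in this README.
REQUIRED_TOP_LEVEL = ("project", "task_type", "random_seed", "pipeline_stages", "configs")


def validate_pipeline_mapping(raw: Mapping[str, Any]) -> None:
    """Raise ValueError on the first structural problem found."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in raw]
    if missing:
        raise ValueError(f"missing top-level keys: {missing}")
    if raw["task_type"] not in VALID_TASK_TYPES:
        raise ValueError(f"unknown task_type: {raw['task_type']!r}")
    unknown = [stage for stage in raw["pipeline_stages"] if stage not in VALID_STAGES]
    if unknown:
        raise ValueError(f"unknown pipeline stages: {unknown}")
    missing_sub = [key for key in REQUIRED_SUBCONFIGS if key not in raw["configs"]]
    if missing_sub:
        raise ValueError(f"configs is missing sub-keys: {missing_sub}")


# A parsed pipeline config as a plain mapping; all values are illustrative.
example = {
    "project": {"name": "example-project", "version": "0.1.0"},
    "task_type": "classification",
    "random_seed": 42,
    "pipeline_stages": ["preprocessing", "training", "evaluation"],
    "configs": {
        "preprocessing": "preprocessing_tabular.yaml",
        "training": "training_classification.yaml",
        "evaluation": "evaluation.yaml",
        "deployment": "deployment.yaml",
        "promotion": "promotion.yaml",
        "drift": "drift.yaml",
    },
}
validate_pipeline_mapping(example)  # passes silently

try:
    validate_pipeline_mapping({**example, "pipeline_stages": ["training", "tuning"]})
except ValueError as exc:
    stage_error = str(exc)
```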

---

## Sub-configs

Referenced from a pipeline config's `configs` block. Most have per-task
variants; the loader reads whichever path the pipeline config points to.

### Preprocessing (`preprocessing_tabular.yaml`, `preprocessing_image.yaml`, `preprocessing_cifar10.yaml`, `preprocessing_raw_image.yaml`)

Controls data-contract enforcement and the transform policy. Tabular contract
keys:

| Key                | Type | Description                                              |
|--------------------|------|----------------------------------------------------------|
| `fail_on_nulls`    | bool | Hard-fail if a feature column contains nulls; applies when `missing_values.policy` is `fail` |
| `min_rows`         | int  | Minimum rows required in each split                      |
| `validate_types`   | bool | Enforce that column dtypes match the declared schema     |
| `validate_labels`  | bool | Verify the target column only contains declared labels   |
| `validate_on_skip` | bool | Re-validate splits on a cache hit (slower, used in CI)   |

It also controls encoding, scaling, and missing-value policy; image variants
control resize/normalisation and (for raw images) the ISP pipeline stages.
Per-dataset schema/constraints (`label_classes`, `max_null_fraction`) live in
`data/raw/<dataset>/dataset.yaml` under the `constraints` key.
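
To make two of the tabular contract keys concrete, here is a sketch of how `min_rows` and `fail_on_nulls` might be enforced on one split. It uses plain lists of dicts instead of a dataframe to stay dependency-free. The function name, its parameters, the `"impute"` policy value, and the sample rows are all hypothetical, and the interaction with `missing_values.policy` follows one reading of the table above.

```python
from typing import Any, Mapping, Sequence


def check_split_contract(
    rows: Sequence[Mapping[str, Any]],
    feature_columns: Sequence[str],
    *,
    min_rows: int,
    fail_on_nulls: bool,
    missing_policy: str,
) -> None:
    """Raise ValueError when a split violates the row-count or null contract."""
    if len(rows) < min_rows:
        raise ValueError(f"split has {len(rows)} rows, fewer than min_rows={min_rows}")
    # Assumption: nulls are only fatal when the missing-value policy is "fail".
    if fail_on_nulls and missing_policy == "fail":
        for column in feature_columns:
            if any(row.get(column) is None for row in rows):
                raise ValueError(f"feature column {column!r} contains nulls")


# Illustrative rows; None stands in for a null cell.
train_rows = [
    {"age": 31, "income": 52000.0},
    {"age": None, "income": 48000.0},
]

# A non-failing policy lets the null through to the transform step.
check_split_contract(
    train_rows, ["age", "income"], min_rows=2, fail_on_nulls=True, missing_policy="impute"
)

try:
    check_split_contract(
        train_rows, ["age", "income"], min_rows=2, fail_on_nulls=True, missing_policy="fail"
    )
except ValueError as exc:
    contract_error = str(exc)
```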

### Training (`training_classification.yaml`, `training_regression.yaml`, `training_image_cnn.yaml`)

Model algorithm, hyperparameters, and (for CNNs) architecture and fine-tuning
settings.

### `evaluation.yaml`

Classification and regression evaluation parameters.

### `promotion.yaml`

Declarative promotion rules (metric thresholds and operators) and the
`promotion_evaluation_split` setting (`val` / `test` / `both`).
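
One way to evaluate declarative rules of this kind is sketched below. The rule keys (`metric`, `operator`, `threshold`), the operator spellings, and the function names are assumptions, as is the reading that `both` requires the rules to pass on the validation and the test split; the actual schema is defined by `promotion.yaml` and `promotion_loader.py`.

```python
import operator
from typing import Any, Mapping, Sequence

# Map operator strings from the config to comparison functions.
OPERATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
}


def passes(metrics: Mapping[str, float], rules: Sequence[Mapping[str, Any]]) -> bool:
    """True when every rule's metric is present and satisfies its threshold."""
    return all(
        rule["metric"] in metrics
        and OPERATORS[rule["operator"]](metrics[rule["metric"]], rule["threshold"])
        for rule in rules
    )


def should_promote(
    metrics_by_split: Mapping[str, Mapping[str, float]],
    rules: Sequence[Mapping[str, Any]],
    split: str,
) -> bool:
    """Apply the rules to the split(s) selected by promotion_evaluation_split."""
    if split not in ("val", "test", "both"):
        raise ValueError(f"unknown promotion_evaluation_split: {split!r}")
    splits = ("val", "test") if split == "both" else (split,)
    return all(passes(metrics_by_split[name], rules) for name in splits)


# Illustrative rule and metrics, not values from the repository.
rules = [{"metric": "accuracy", "operator": ">=", "threshold": 0.90}]
metrics_by_split = {"val": {"accuracy": 0.93}, "test": {"accuracy": 0.88}}

promote_on_val = should_promote(metrics_by_split, rules, "val")
promote_on_both = should_promote(metrics_by_split, rules, "both")
```

Because the rules are data rather than code, thresholds can be tightened in the config file without touching the promotion stage.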

### `deployment.yaml`

Prediction-service settings: server host/port, model serving stage, healthcheck,
and the optional reload endpoint.

### `drift.yaml`

Drift-monitoring configuration: statistical tests, thresholds, and severity
bands shared by the tabular and image drift workflows.
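
As a rough illustration of what severity bands can mean in practice, the sketch below maps a drift score to a named band by lower bound. The band names, the bounds, and the idea of a single non-negative score are all hypothetical; the real tests and thresholds are whatever `drift.yaml` declares.

```python
# Hypothetical bands as (name, inclusive lower bound), sorted by bound.
SEVERITY_BANDS = (
    ("none", 0.0),
    ("low", 0.1),
    ("medium", 0.25),
    ("high", 0.5),
)


def severity(score: float) -> str:
    """Return the highest band whose lower bound the score reaches."""
    if score < 0:
        raise ValueError("drift score must be non-negative")
    label = SEVERITY_BANDS[0][0]
    for name, lower_bound in SEVERITY_BANDS:
        if score >= lower_bound:
            label = name
    return label


band = severity(0.3)
```

Sharing one band table between the tabular and image workflows keeps alerts comparable across data types, at the cost of requiring scores on a common scale.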
