# Configuration Module

## Responsibility

The `config` module is the centralized authority for configuration loading and validation. It is the only module that:

- Reads YAML configuration files from disk
- Parses raw YAML safely (`yaml.safe_load`)
- Validates configuration structure and values
- Constructs typed, immutable configuration objects

All other modules receive an already-validated config object (e.g. `PipelineConfig`) and never touch raw YAML.

---

## Design Principles

- Fail-fast validation: invalid config raises before any stage runs
- Explicit required keys, checked per section
- Controlled allowed values (task type, log level, stage names, algorithms, …)
- Immutable configuration (`@dataclass(frozen=True)`)
- Clear separation between parsing and usage

---

## Module layout

The Python code is split by responsibility:

| File | Responsibility |
|------|----------------|
| `schema.py` | All frozen dataclasses (`PipelineConfig`, `TrainingConfig`, …) and the `VALID_*` / `REQUIRED_*` constant sets. |
| `validation.py` | Shared validation helpers (`_validate_enum`, `_validate_section`, …). |
| `loader.py` | Public facade: re-exports every `load_*` function and dataclass so existing imports keep working. |
| `pipeline_loader.py` | `load_config` → `PipelineConfig` (the top-level config). |
| `training_loader.py` | `load_training_config` → `TrainingConfig`. |
| `preprocessing_loader.py` | `load_preprocessing_config` → `PreprocessingConfig`. |
| `evaluation_loader.py` | `load_evaluation_config` → `EvaluationConfig`. |
| `promotion_loader.py` | `load_promotion_config` → `PromotionConfig`. |
| `deployment_loader.py` | `load_deployment_config` → `DeploymentConfig`. |
| `drift_loader.py` | `load_drift_config` → `DriftConfig`. |

---

## Pipeline configs

A run is selected by passing one pipeline config to `run-pipeline --config`.
There is one file per task rather than a single generic `pipeline.yaml`:

| File | Task |
|------|------|
| `pipeline_tabular_classification.yaml` | Tabular classification |
| `pipeline_tabular_classification_ci.yaml` | Tabular classification, CI variant (no `promotion` stage) |
| `pipeline_tabular_regression.yaml` | Tabular regression |
| `pipeline_image_cnn.yaml`, `pipeline_cifar10.yaml` | Standard-image CNN classification |
| `pipeline_image_raw.yaml`, `pipeline_fivek.yaml` | Raw-image (ISP → CNN) classification |

Each pipeline config defines:

- Project metadata (`project.name`, `project.version`)
- `task_type`: one of `classification`, `regression`, `image_classification_cnn`
- `random_seed`
- `dataset` and `data` directory paths
- `pipeline_stages`: ordered subset of `preprocessing`, `training`, `evaluation`, `model_analysis`, `promotion`, `deployment`
- `output_dir`
- `log_level`
- `mlflow` settings
- `configs`: paths to the sub-configs below

`pipeline_loader.py` validates the required top-level keys, the enum values, the stage names, and that `configs` contains all required sub-keys: `preprocessing`, `training`, `evaluation`, `deployment`, `promotion`, `drift`.

---

## Sub-configs

Referenced from a pipeline config's `configs` block. Most have per-task variants; the loader reads whichever path the pipeline config points to.

### Preprocessing (`preprocessing_tabular.yaml`, `preprocessing_image.yaml`, `preprocessing_cifar10.yaml`, `preprocessing_raw_image.yaml`)

Controls data-contract enforcement and the transform policy.
Tabular contract keys:

| Key | Type | Description |
|--------------------|------|----------------------------------------------------------|
| `fail_on_nulls` | bool | Hard-fail if a feature column contains nulls when `missing_values.policy=fail` |
| `min_rows` | int | Minimum rows required in each split |
| `validate_types` | bool | Enforce that column dtypes match the declared schema |
| `validate_labels` | bool | Verify the target column only contains declared labels |
| `validate_on_skip` | bool | Re-validate splits on a cache hit (slower, used in CI) |

It also controls encoding, scaling, and missing-value policy; image variants control resize/normalisation and (for raw images) the ISP pipeline stages. Per-dataset schema/constraints (`label_classes`, `max_null_fraction`) live in `data/raw/<dataset>/dataset.yaml` under the `constraints` key.

### Training (`training_classification.yaml`, `training_regression.yaml`, `training_image_cnn.yaml`)

Model algorithm, hyperparameters, and (for CNNs) architecture and fine-tuning settings.

### `evaluation.yaml`

Classification and regression evaluation parameters.

### `promotion.yaml`

Declarative promotion rules (metric thresholds and operators) and the `promotion_evaluation_split` setting (`val` / `test` / `both`).

### `deployment.yaml`

Prediction-service settings: server host/port, model serving stage, healthcheck, and the optional reload endpoint.

### `drift.yaml`

Drift-monitoring configuration: statistical tests, thresholds, and severity bands shared by the tabular and image drift workflows.
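
To illustrate how the design principles (explicit required keys, fail-fast validation, frozen dataclasses) might apply to the tabular contract keys from the Preprocessing section, here is a minimal sketch. It is not the module's actual code: `TabularContract`, `parse_tabular_contract`, `ConfigError`, and `CONTRACT_KEY_TYPES` are hypothetical names, and the non-negativity check on `min_rows` is an assumption. The function takes an already-parsed mapping, such as the result of `yaml.safe_load`, so the example runs without reading files.

```python
from dataclasses import dataclass
from typing import Any, Mapping

# Hypothetical constant: key names and types are taken from the contract table,
# but the real module may organise its REQUIRED_* sets differently.
CONTRACT_KEY_TYPES = {
    "fail_on_nulls": bool,
    "min_rows": int,
    "validate_types": bool,
    "validate_labels": bool,
    "validate_on_skip": bool,
}


class ConfigError(ValueError):
    """Hypothetical error type raised when a config mapping is invalid."""


@dataclass(frozen=True)
class TabularContract:
    """Immutable view of the tabular contract keys (illustrative only)."""

    fail_on_nulls: bool
    min_rows: int
    validate_types: bool
    validate_labels: bool
    validate_on_skip: bool


def parse_tabular_contract(raw: Mapping[str, Any]) -> TabularContract:
    """Validate an already-parsed mapping and build a frozen config object."""
    # Explicit required keys: report everything that is missing at once.
    missing = sorted(set(CONTRACT_KEY_TYPES) - set(raw))
    if missing:
        raise ConfigError(f"missing contract keys: {missing}")

    for key, expected in CONTRACT_KEY_TYPES.items():
        value = raw[key]
        # Exact type match: bool is a subclass of int, so isinstance()
        # would wrongly accept True/False for min_rows.
        if type(value) is not expected:
            raise ConfigError(
                f"{key} must be {expected.__name__}, got {type(value).__name__}"
            )

    # Assumption for this sketch: a negative row minimum is never meaningful.
    if raw["min_rows"] < 0:
        raise ConfigError("min_rows must be non-negative")

    return TabularContract(**{key: raw[key] for key in CONTRACT_KEY_TYPES})
```

Because the dataclass is frozen, assigning to a field after construction raises `dataclasses.FrozenInstanceError`, so downstream stages cannot mutate the validated values.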