The pipeline supports CNN-based image classification alongside tabular datasets.
Image datasets flow through the same pipeline entry point (run-pipeline) with
task_type: image_classification_cnn — no separate CLI or workflow is needed.
Image datasets use the ImageFolder convention:
data/raw/<dataset_name>/
    dataset.yaml
    images/
        class_a/
            img001.jpg
            img002.png
        class_b/
            img003.jpg
Each subdirectory under images/ is a class label. All images in that
directory are assigned that label.
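The labeling rule above can be sketched in a few lines of Python. This is only an illustration of the folder convention, not the pipeline's own loader: the function name `scan_image_folder` is hypothetical, and the extension set is copied from the supported-formats note in this document.

```python
from pathlib import Path

# Extensions taken from the supported-formats note in this document.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".gif", ".dng"}


def scan_image_folder(dataset_dir):
    """Map each class label to the sorted image paths under images/<label>/.

    Hypothetical helper: the label is simply the name of the subdirectory.
    """
    images_root = Path(dataset_dir) / "images"
    samples = {}
    for class_dir in sorted(p for p in images_root.iterdir() if p.is_dir()):
        files = sorted(
            f for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
        samples[class_dir.name] = files
    return samples
```

Calling `scan_image_folder("data/raw/my_images")` on the layout shown above would return a dict with the keys `class_a` and `class_b`.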
A dataset.yaml for an image dataset looks like this:
name: my_images
task_type: image_classification_cnn
description: "My image classification dataset"
source: "custom"
created_at: "2026-03-18"
features: [] # no tabular features
target: label
schema: {} # no column-level schema
image_properties:
  expected_formats: [".jpg", ".png"]
  min_images_per_class: 5
constraints:
  min_rows: 10
  max_null_fraction: 0.0
label_classes: [class_a, class_b]

Key differences from tabular datasets:
- features: [] and schema: {} are empty because images have no tabular columns.
- image_properties declares the expected image formats and minimum class sizes.
- task_type: image_classification_cnn drives dispatch throughout the pipeline.
If dataset.yaml is missing, the pipeline will auto-detect the images/ folder
and prompt you interactively to generate it.
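The image_properties and label_classes fields lend themselves to a simple pre-flight check. The sketch below is an assumption about how such a check could look, not the pipeline's actual validation code; `check_image_dataset` is a hypothetical name, and `spec` stands for the parsed dataset.yaml as a plain dict.

```python
import os


def check_image_dataset(samples, spec):
    """Return a list of problems; an empty list means every check passed.

    samples: dict mapping class label -> list of file names in that folder
    spec:    dict holding the dataset.yaml fields used below
    """
    problems = []
    props = spec.get("image_properties", {})
    allowed = {ext.lower() for ext in props.get("expected_formats", [])}
    min_per_class = props.get("min_images_per_class", 0)

    # Folder names and declared labels should agree.
    declared = set(spec.get("label_classes", []))
    found = set(samples)
    for label in sorted(declared - found):
        problems.append(f"class '{label}' is declared but has no folder")
    for label in sorted(found - declared):
        problems.append(f"folder '{label}' is not listed in label_classes")

    for label in sorted(samples):
        files = samples[label]
        if len(files) < min_per_class:
            problems.append(
                f"class '{label}' has {len(files)} images, "
                f"expected at least {min_per_class}"
            )
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            if allowed and ext not in allowed:
                problems.append(f"{label}/{name} has unexpected format '{ext}'")
    return problems
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem in one pass.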
Two preprocessing configs are available depending on image format:
Standard JPG/PNG (src/config/preprocessing_image.yaml):
image:
  target_size: [64, 64]
  color_mode: "rgb"
  normalize: true
  flatten: false # preserve spatial structure for CNN
augmentation:
  enabled: false
  horizontal_flip: false
  rotation_degrees: 0
  augmentation_factor: 1

Raw DNG with the ISP pipeline (src/config/preprocessing_raw_image.yaml):
image:
  target_size: [64, 64]
  color_mode: "rgb"
  normalize: true
  flatten: false
  raw_input: true # triggers ISP pipeline: DNG → demosaic → WB → CCM → gamma → CNN
isp:
  demosaicing:
    algorithm: "bilinear" # bilinear | malvar2004 | menon2007
  gamma_correction:
    gamma: 2.2
  # ... (see full config for all ISP options)

Normalization statistics (mean, std) are computed only from training images and then applied unchanged to all splits, so no validation or test data leaks into them.
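The train-only normalization rule can be illustrated with a short NumPy sketch. It assumes per-channel statistics over arrays of shape N×H×W×C. Whether the pipeline uses per-channel or global statistics is not stated here, and the function names are hypothetical.

```python
import numpy as np


def fit_normalizer(train_images):
    """Compute per-channel mean and std from the training images only."""
    mean = train_images.mean(axis=(0, 1, 2))
    std = train_images.std(axis=(0, 1, 2))
    std = np.where(std == 0, 1.0, std)  # guard against constant channels
    return mean, std


def apply_normalizer(images, mean, std):
    """Apply previously fitted statistics to any split (train, val or test)."""
    return (images - mean) / std
```

The statistics are fitted once on the training array and reused for validation and test, for example `mean, std = fit_normalizer(X_train)` followed by `apply_normalizer(X_val, mean, std)`.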
# src/config/pipeline_image_cnn.yaml
task_type: "image_classification_cnn"
dataset: sample_images # change to your dataset name under data/raw/
configs:
  preprocessing: "src/config/preprocessing_image.yaml"
  training: "src/config/training_image_cnn.yaml"
  # ... other configs unchanged

For raw DNG datasets, use pipeline_image_raw.yaml instead, which points to
preprocessing_raw_image.yaml.
CIFAR-10 is used as a larger multi-class image benchmark. Once the dataset is present on disk in the ImageFolder layout shown below, run:
run-pipeline --config src/config/pipeline_cifar10.yaml

The conversion from the original CIFAR-10 archive into the ImageFolder layout
was done with a separate import script that is not included in this
repository. To recreate the dataset, download CIFAR-10 through torchvision
and write one PNG per image into the per-class folders shown below, together
with data/raw/cifar10/dataset.yaml:
data/raw/cifar10/
    dataset.yaml
    images/
        airplane/
        automobile/
        bird/
        cat/
        deer/
        dog/
        frog/
        horse/
        ship/
        truck/
src/config/preprocessing_cifar10.yaml preserves the native 32×32 input size
instead of upscaling to the generic 64×64 image default.
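Since the original import script is not part of the repository, the following is only a sketch of how the conversion described above might be recreated. It is not the script that was actually used. The function names, the file naming scheme, the download directory and the choice of the training split are all assumptions, and dataset.yaml still has to be written separately.

```python
from pathlib import Path


def export_to_image_folder(samples, class_names, images_root):
    """Write one PNG per (image, label_index) pair into images_root/<class>/.

    Each image must offer a PIL-style save(path) method. torchvision's
    CIFAR10 dataset yields such pairs when built without a transform.
    Returns the number of files written per class.
    """
    images_root = Path(images_root)
    counters = {name: 0 for name in class_names}
    for image, label in samples:
        name = class_names[label]
        class_dir = images_root / name
        class_dir.mkdir(parents=True, exist_ok=True)
        # Hypothetical naming scheme: <class>_<running index>.png
        image.save(class_dir / f"{name}_{counters[name]:05d}.png")
        counters[name] += 1
    return counters


def export_cifar10(download_dir="downloads/cifar10",
                   images_root="data/raw/cifar10/images"):
    """Download the CIFAR-10 training split and export it (needs torchvision)."""
    from torchvision.datasets import CIFAR10  # lazy import, optional dependency

    dataset = CIFAR10(root=download_dir, train=True, download=True)
    samples = (dataset[i] for i in range(len(dataset)))
    return export_to_image_folder(samples, dataset.classes, images_root)
```

Keeping the writer separate from the download step means the export logic can be exercised without torchvision or network access.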
# src/config/training_image_cnn.yaml
model:
  algorithm: "cnn"
  architecture:
    conv_layers:
      - out_channels: 32
        kernel_size: 3
      - out_channels: 64
        kernel_size: 3
      - out_channels: 128
        kernel_size: 3
    fc_units: 256
    dropout: 0.3
  hyperparameters:
    epochs: 30
    batch_size: 32
    learning_rate: 0.001

# Standard JPG/PNG dataset
run-pipeline --config src/config/pipeline_image_cnn.yaml
# Raw DNG dataset (ISP pipeline)
run-pipeline --config src/config/pipeline_image_raw.yaml

The pipeline executes: versioning → splitting → preprocessing → training → evaluation → promotion, identical to tabular datasets.
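Of these stages, splitting is described in this document as stratified. For a folder-per-class layout that amounts to splitting each class's file list separately, as in the sketch below. The split fractions, the seed and the function name are hypothetical and do not reflect the pipeline's actual defaults.

```python
import random


def stratified_split(samples, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Split each class's files into train/val/test with the same proportions.

    samples: dict mapping class label -> list of file names
    Returns {"train": {...}, "val": {...}, "test": {...}} with the same keys.
    """
    rng = random.Random(seed)
    splits = {"train": {}, "val": {}, "test": {}}
    for label in sorted(samples):
        files = sorted(samples[label])  # sort first so the shuffle is reproducible
        rng.shuffle(files)
        n_val = round(len(files) * val_fraction)
        n_test = round(len(files) * test_fraction)
        splits["val"][label] = files[:n_val]
        splits["test"][label] = files[n_val:n_val + n_test]
        splits["train"][label] = files[n_val + n_test:]
    return splits
```

Because every class is shuffled and cut independently, each split keeps roughly the class balance of the full dataset.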
data/raw/<dataset>/images/{class}/...
    ↓ versioning (manifest hash)
data/processed/<dataset>/<version_id>/images/{class}/...
    ↓ stratified splitting
data/processed/<dataset>/<version_id>/train/images/{class}/...
                                      val/images/{class}/...
                                      test/images/{class}/...
    ↓ preprocessing (resize, normalize)
data/processed/<dataset>/<version_id>/preprocessed/
    train.npz val.npz test.npz
    feature_map.json pipeline.pkl metadata.json
- {split}.npz: compressed NumPy archives with X (features, shape N×H×W×C) and y (integer labels).
- feature_map.json: class mapping, normalization stats, image shape, feature count.
- pipeline.pkl: serialized normalization parameters for inference.
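To inspect these artifacts outside the pipeline, something along the following lines should work. It relies only on the array names X and y given above. The helper names are hypothetical, and no particular keys inside feature_map.json are assumed.

```python
import json
from pathlib import Path

import numpy as np


def load_split(preprocessed_dir, split):
    """Load X (N×H×W×C) and y (integer labels) from <preprocessed_dir>/<split>.npz."""
    with np.load(Path(preprocessed_dir) / f"{split}.npz") as archive:
        return archive["X"], archive["y"]


def load_feature_map(preprocessed_dir):
    """Read feature_map.json as a plain dict."""
    with open(Path(preprocessed_dir) / "feature_map.json", encoding="utf-8") as fh:
        return json.load(fh)
```

For example, `X, y = load_split(preprocessed_dir, "train")` followed by `X.shape` gives the number of training images and their height, width and channel count.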
Offline augmentation expands the training set before model fitting:
augmentation:
  enabled: true
  horizontal_flip: true
  rotation_degrees: 90 # 0/90/180/270 degree rotations
  augmentation_factor: 3 # 3x training set size

Augmentation is applied only to training data, never to validation or test. The augmented dataset is deterministic given the same seed and config.
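One way such an offline augmentation step could be implemented is sketched below. It is an assumption, not the pipeline's code: the function and parameter names are hypothetical, each extra copy applies a random flip and a random multiple-of-90° rotation, and images are assumed to be square (as with target_size: [64, 64]) so that rotation preserves the shape.

```python
import numpy as np


def augment_training_set(X, y, factor=3, horizontal_flip=True,
                         rotate_90=True, seed=0):
    """Return the original samples plus (factor - 1) augmented copies.

    X has shape N×H×W×C with H == W; y holds N integer labels.
    The same seed and arguments always produce the same output.
    """
    rng = np.random.default_rng(seed)
    out_X, out_y = [X], [y]
    for _ in range(factor - 1):
        batch = np.empty_like(X)
        for i, image in enumerate(X):
            if horizontal_flip and rng.random() < 0.5:
                image = image[:, ::-1, :]  # mirror along the width axis
            if rotate_90:
                k = int(rng.integers(0, 4))  # 0, 90, 180 or 270 degrees
                image = np.rot90(image, k=k, axes=(0, 1))
            batch[i] = image
        out_X.append(batch)
        out_y.append(y)
    return np.concatenate(out_X), np.concatenate(out_y)
```

With factor=3 the result holds three times as many samples as the input, matching the "3x training set size" comment. Only the training arrays should be passed to it.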
- All images are loaded into memory as NumPy arrays. Datasets larger than ~10,000 images at 64×64 may require significant RAM.
- Supported formats: .jpg, .jpeg, .png, .bmp, .tiff, .gif, .dng. Unreadable images are skipped with a warning.
- Raw DNG processing requires rawpy and colour-demosaicing to be installed.
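To put the RAM note into numbers, the size of the in-memory image array can be estimated directly. The sketch assumes three channels and 4 bytes per value (float32). The dtype the pipeline actually uses is not stated here, and the function name is hypothetical.

```python
def estimate_array_bytes(num_images, height, width, channels=3, bytes_per_value=4):
    """Bytes needed for one dense N×H×W×C array (float32 by default)."""
    return num_images * height * width * channels * bytes_per_value


size = estimate_array_bytes(10_000, 64, 64)
print(f"{size / 2**20:.2f} MiB")  # 468.75 MiB for 10,000 RGB images at 64×64
```

Under the same assumptions a float64 array would need twice as much, and an augmentation_factor of 3 triples the training portion, before counting any intermediate copies made during preprocessing.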