Image Dataset Support

The pipeline supports CNN-based image classification alongside tabular datasets. Image datasets go through the same entry point (run-pipeline) with task_type: image_classification_cnn, so no separate CLI or workflow is needed.

Folder Structure

Image datasets use the ImageFolder convention:

data/raw/<dataset_name>/
  dataset.yaml
  images/
    class_a/
      img001.jpg
      img002.png
    class_b/
      img003.jpg

Each subdirectory under images/ is a class label. All images in that directory are assigned that label.
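
The folder-to-label rule can be sketched in a few lines of Python. This is an illustration only, not the pipeline's own loader: the function name scan_image_folder and the choice to number classes in sorted order are assumptions.

```python
from pathlib import Path


def scan_image_folder(images_dir):
    """Return (paths, labels, class_names) for an ImageFolder-style tree."""
    images_dir = Path(images_dir)
    # Assumption: classes are indexed in sorted order so the mapping is stable.
    class_names = sorted(p.name for p in images_dir.iterdir() if p.is_dir())
    paths, labels = [], []
    for index, name in enumerate(class_names):
        for path in sorted((images_dir / name).iterdir()):
            if path.is_file():
                paths.append(path)
                labels.append(index)  # every file inherits its folder's label
    return paths, labels, class_names
```

Calling scan_image_folder("data/raw/<dataset_name>/images") on the tree above would give class_names of ["class_a", "class_b"], with the two class_a files labelled 0 and the class_b file labelled 1.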

dataset.yaml

name: my_images
task_type: image_classification_cnn
description: "My image classification dataset"
source: "custom"
created_at: "2026-03-18"
features: []          # no tabular features
target: label
schema: {}            # no column-level schema
image_properties:
  expected_formats: [".jpg", ".png"]
  min_images_per_class: 5
constraints:
  min_rows: 10
  max_null_fraction: 0.0
  label_classes: [class_a, class_b]

Key differences from tabular datasets:

features: [] and schema: {} — images have no tabular columns.
image_properties — declares expected image formats and minimum class sizes.
task_type: image_classification_cnn — drives dispatch throughout the pipeline.

If dataset.yaml is missing, the pipeline will auto-detect the images/ folder and prompt you interactively to generate it.
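
To make the min_images_per_class and label_classes fields concrete, here is a minimal sketch of the kind of check they imply. Whether the pipeline validates in exactly this way is not stated here; the function check_class_counts and its message wording are hypothetical.

```python
def check_class_counts(counts, label_classes, min_images_per_class):
    """Compare per-class image counts with the dataset.yaml declarations.

    counts maps class folder name -> number of images found on disk.
    Returns a list of human-readable problems (empty when all is well).
    """
    problems = []
    for name in label_classes:
        found = counts.get(name, 0)
        if found < min_images_per_class:
            problems.append(
                f"{name}: {found} images, need at least {min_images_per_class}"
            )
    # Folders on disk that the YAML does not declare are reported as well.
    for name in sorted(set(counts) - set(label_classes)):
        problems.append(f"{name}: folder not listed in label_classes")
    return problems
```

With the example dataset.yaml above, check_class_counts({"class_a": 6, "class_b": 2}, ["class_a", "class_b"], 5) would report class_b as too small.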

Preprocessing Configuration

Two preprocessing configs are available depending on image format:

Standard JPG/PNG — src/config/preprocessing_image.yaml:

image:
  target_size: [64, 64]
  color_mode: "rgb"
  normalize: true
  flatten: false        # preserve spatial structure for CNN
  augmentation:
    enabled: false
    horizontal_flip: false
    rotation_degrees: 0
    augmentation_factor: 1

Raw DNG (ISP pipeline) — src/config/preprocessing_raw_image.yaml:

image:
  target_size: [64, 64]
  color_mode: "rgb"
  normalize: true
  flatten: false
  raw_input: true       # triggers ISP pipeline: DNG → demosaic → WB → CCM → gamma → CNN
  isp:
    demosaicing:
      algorithm: "bilinear"   # bilinear | malvar2004 | menon2007
    gamma_correction:
      gamma: 2.2
    # ... (see full config for all ISP options)

Normalization statistics (mean, std) are computed only from training images and applied to all splits (leak-proof).
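
The leak-proof pattern can be sketched with NumPy: fit the statistics on the training array only, then reuse them for every split. Per-channel statistics and the helper names below are assumptions; the pipeline may compute a single global mean and std instead.

```python
import numpy as np


def fit_channel_stats(train_x):
    """Per-channel mean and std over an (N, H, W, C) training array."""
    mean = train_x.mean(axis=(0, 1, 2))
    std = train_x.std(axis=(0, 1, 2))
    # Avoid division by zero for channels that are constant.
    std = np.where(std == 0, 1.0, std)
    return mean, std


def apply_channel_stats(x, mean, std):
    """Normalize any split with statistics fitted on the training split."""
    return (x - mean) / std


rng = np.random.default_rng(0)
train = rng.random((8, 4, 4, 3), dtype=np.float32)  # stand-in training images
val = rng.random((2, 4, 4, 3), dtype=np.float32)    # stand-in validation images

mean, std = fit_channel_stats(train)         # statistics see training data only
train_n = apply_channel_stats(train, mean, std)
val_n = apply_channel_stats(val, mean, std)  # val reuses the training statistics
```

Because val never contributes to mean or std, nothing about the held-out images leaks into preprocessing.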

Pipeline Configuration

# src/config/pipeline_image_cnn.yaml
task_type: "image_classification_cnn"
dataset: sample_images    # change to your dataset name under data/raw/
configs:
  preprocessing: "src/config/preprocessing_image.yaml"
  training: "src/config/training_image_cnn.yaml"
  # ... other configs unchanged

For raw DNG datasets, use pipeline_image_raw.yaml instead, which points to preprocessing_raw_image.yaml.

CIFAR-10

CIFAR-10 is used as a larger multi-class image benchmark. Once the dataset is present on disk in the ImageFolder layout shown below, run:

run-pipeline --config src/config/pipeline_cifar10.yaml

The conversion from the original CIFAR-10 archive into the ImageFolder layout was done with a separate import script that is not included in this repository. To recreate the dataset, download CIFAR-10 through torchvision and write one PNG per image into the per-class folders shown below, together with data/raw/cifar10/dataset.yaml:

data/raw/cifar10/
  dataset.yaml
  images/
    airplane/
    automobile/
    bird/
    cat/
    deer/
    dog/
    frog/
    horse/
    ship/
    truck/
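
Since the original import script is not part of the repository, the following is only one possible reconstruction. It uses torchvision.datasets.CIFAR10, whose samples are (PIL image, integer label) pairs when no transform is set. The file naming scheme, the decision to export both the train and test archives, and the function names are assumptions; dataset.yaml still has to be written separately.

```python
from pathlib import Path


def png_path(out_dir, class_name, split, index):
    """Destination for one image, e.g. images/cat/train_00042.png (assumed naming)."""
    return Path(out_dir) / "images" / class_name / f"{split}_{index:05d}.png"


def export_cifar10(download_root, out_dir):
    """Download CIFAR-10 and write one PNG per image into per-class folders."""
    # Imported lazily so png_path stays usable without torchvision installed.
    from torchvision.datasets import CIFAR10

    for split, is_train in (("train", True), ("test", False)):
        dataset = CIFAR10(root=download_root, train=is_train, download=True)
        for index, (image, label) in enumerate(dataset):
            # dataset.classes maps the integer label to names such as "airplane".
            target = png_path(out_dir, dataset.classes[label], split, index)
            target.parent.mkdir(parents=True, exist_ok=True)
            image.save(target)


# Example call (downloads the archive on first use):
# export_cifar10("downloads/cifar10", "data/raw/cifar10")
```

Both archives are written into the same class folders here because the pipeline performs its own stratified split; exporting only the training archive would be an equally valid choice.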

src/config/preprocessing_cifar10.yaml preserves the native 32×32 input size instead of upscaling to the generic 64×64 image default.

Architecture Configuration

# src/config/training_image_cnn.yaml
model:
  algorithm: "cnn"
  architecture:
    conv_layers:
      - out_channels: 32
        kernel_size: 3
      - out_channels: 64
        kernel_size: 3
      - out_channels: 128
        kernel_size: 3
    fc_units: 256
    dropout: 0.3
  hyperparameters:
    epochs: 30
    batch_size: 32
    learning_rate: 0.001
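
To get a feel for the size of the convolutional stack above, the parameter count can be worked out directly from conv_layers. The sketch assumes 3 input channels (color_mode: "rgb"), one bias per filter, and no normalization layers; none of these are confirmed by the config, and the fully connected layer is left out because its input size depends on pooling and padding choices not shown here.

```python
def conv_param_count(in_channels, conv_layers):
    """Weights plus biases for a stack of square-kernel 2D convolutions."""
    total = 0
    for layer in conv_layers:
        k, out = layer["kernel_size"], layer["out_channels"]
        total += (k * k * in_channels + 1) * out  # +1 is the bias of each filter
        in_channels = out  # this layer's output feeds the next layer
    return total


conv_layers = [
    {"out_channels": 32, "kernel_size": 3},
    {"out_channels": 64, "kernel_size": 3},
    {"out_channels": 128, "kernel_size": 3},
]
print(conv_param_count(3, conv_layers))  # 93248
```

The three layers contribute 896, 18,496 and 73,856 parameters respectively, so most of the convolutional capacity sits in the last layer.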

Running

# Standard JPG/PNG dataset
run-pipeline --config src/config/pipeline_image_cnn.yaml

# Raw DNG dataset (ISP pipeline)
run-pipeline --config src/config/pipeline_image_raw.yaml

The pipeline executes versioning → splitting → preprocessing → training → evaluation → promotion, the same stage sequence used for tabular datasets.

Data Flow

data/raw/<dataset>/images/{class}/...
        ↓  versioning (manifest hash)
data/processed/<dataset>/<version_id>/images/{class}/...
        ↓  stratified splitting
data/processed/<dataset>/<version_id>/train/images/{class}/...
                                      val/images/{class}/...
                                      test/images/{class}/...
        ↓  preprocessing (resize, normalize)
data/processed/<dataset>/<version_id>/preprocessed/
    train.npz  val.npz  test.npz
    feature_map.json  pipeline.pkl  metadata.json
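
The stratified splitting step can be illustrated as follows. This is a sketch of the general technique, not the pipeline's code: the 15% validation and test fractions, the seed, and the function name are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Split sample indices so each class keeps roughly the same proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        n_test = round(len(indices) * test_fraction)
        splits["val"] += indices[:n_val]
        splits["test"] += indices[n_val:n_val + n_test]
        splits["train"] += indices[n_val + n_test:]
    return splits
```

Splitting per class rather than over the whole dataset keeps small classes represented in val and test, which matters when min_images_per_class is low.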

Output Artifacts

{split}.npz — Compressed numpy archives with X (features, shape N×H×W×C) and y (integer labels).
feature_map.json — Class mapping, normalization stats, image shape, feature count.
pipeline.pkl — Serialized normalization parameters for inference.
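
A preprocessed split can be inspected with NumPy. The array keys X and y come from the description above; the helper name load_split is hypothetical.

```python
import numpy as np


def load_split(path):
    """Load one preprocessed split and return (X, y) as in-memory arrays."""
    # The context manager closes the archive once both arrays are read.
    with np.load(path) as archive:
        return archive["X"], archive["y"]


# Example:
# X, y = load_split("data/processed/<dataset>/<version_id>/preprocessed/train.npz")
# X.shape is (N, H, W, C) and y.shape is (N,)
```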

Augmentation

Offline augmentation expands the training set before model fitting:

augmentation:
  enabled: true
  horizontal_flip: true
  rotation_degrees: 90        # 0/90/180/270 degree rotations
  augmentation_factor: 3      # 3x training set size

Augmentation is applied only to training data, never to validation or test. The augmented dataset is deterministic given the same seed and config.
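
A minimal sketch of seeded offline augmentation is shown below. It reads augmentation_factor: 3 as "the original images plus two augmented copies" and picks a random multiple of 90 degrees per image; both readings, and the function name, are assumptions rather than the pipeline's documented behaviour. Rotating by 90 or 270 degrees keeps the array shape only for square images.

```python
import numpy as np


def augment(x, y, factor=3, seed=0):
    """Expand an (N, H, W, C) training set to factor times its size."""
    rng = np.random.default_rng(seed)  # same seed and config -> same output
    xs, ys = [x], [y]
    for _ in range(factor - 1):
        copy = np.empty_like(x)
        for i, image in enumerate(x):
            # Rotate in the height/width plane by 0, 90, 180 or 270 degrees.
            image = np.rot90(image, k=int(rng.integers(4)), axes=(0, 1))
            if rng.random() < 0.5:
                image = image[:, ::-1, :]  # horizontal flip along the width axis
            copy[i] = image
        xs.append(copy)
        ys.append(y)  # labels are unchanged by flips and rotations
    return np.concatenate(xs), np.concatenate(ys)
```

Only the training arrays would be passed through such a function; validation and test arrays stay as produced by preprocessing.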

Limitations

All images are loaded into memory as numpy arrays. Datasets larger than ~10,000 images at 64×64 may require significant RAM.
Supported formats: .jpg, .jpeg, .png, .bmp, .tiff, .gif, .dng. Unreadable images are skipped with a warning.
Raw DNG processing requires rawpy and colour-demosaicing to be installed.
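
As a rough guide to the memory limitation, the size of one in-memory copy of a dataset can be estimated from its shape. The figure assumes 3 channels stored as 4-byte float32 values, which is an assumption about the pipeline's array dtype; intermediate copies made during preprocessing or augmentation add to it.

```python
def array_bytes(n_images, height, width, channels=3, bytes_per_value=4):
    """Bytes needed for one dense (N, H, W, C) array."""
    return n_images * height * width * channels * bytes_per_value


print(array_bytes(10_000, 64, 64) / 1e6)  # 491.52 (MB)
```

With augmentation_factor: 3 the training array alone would be about three times that size under the same assumptions.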
