
Raw Data

This directory contains the raw datasets used as input to the pipeline. Each dataset lives in its own subdirectory. dataset.yaml files are tracked in git; image files are generally excluded, with sample_images as the exception (see below).

Datasets

california_housing (`data/raw/california_housing/`)

Tabular regression dataset. data.csv is tracked in git.

Task: Regression — predict median house value (MedHouseVal) in $100k units
Source: sklearn.datasets.fetch_california_housing() (Pace & Barry, 1997)
Rows: 20,640 | Features: 8 numeric (MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude)
No missing values and no categorical features, so preprocessing is limited to column selection and optional scaling
Pipeline config: src/config/pipeline_tabular_regression.yaml
Expected RF performance (val split): R² ≈ 0.81, MAE ≈ 0.38, RMSE ≈ 0.52
Promotion gates: R² ≥ 0.80 and MAE ≤ 0.40 (configured in src/config/promotion.yaml)
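The two promotion gates above amount to a simple conjunction. As a minimal sketch, assuming the thresholds are read from src/config/promotion.yaml elsewhere (the class and function names below are hypothetical and not part of the pipeline), the check could look like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromotionGates:
    """Thresholds copied from this README; the class itself is hypothetical."""
    min_r2: float = 0.80
    max_mae: float = 0.40


def passes_gates(r2: float, mae: float, gates: PromotionGates = PromotionGates()) -> bool:
    """Return True only if R2 is at or above its gate and MAE is at or below its gate."""
    return r2 >= gates.min_r2 and mae <= gates.max_mae


if __name__ == "__main__":
    # Expected validation metrics quoted above: R2 of about 0.81, MAE of about 0.38.
    print(passes_gates(r2=0.81, mae=0.38))
```

With the expected validation metrics both gates hold, but the margins are narrow (0.01 on R² and 0.02 on MAE), so small changes in the split or model settings may flip the result.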

To regenerate data.csv from scratch:

from sklearn.datasets import fetch_california_housing
import pandas as pd
fetch_california_housing(as_frame=True).frame.to_csv("data/raw/california_housing/data.csv", index=False)
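After regenerating the file, it may be worth confirming that it still matches the description above (8 features plus the MedHouseVal target, 20,640 rows, no missing values). The following is a sketch using only the standard library; the function name check_housing_csv is hypothetical, and it assumes that pandas wrote missing values as empty cells, which is its default.

```python
import csv
from pathlib import Path

# Column names as listed in this README, plus the target column.
EXPECTED_COLUMNS = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms",
    "Population", "AveOccup", "Latitude", "Longitude",
    "MedHouseVal",
]


def check_housing_csv(path, expected_rows=20640):
    """Raise ValueError if the CSV does not match the documented shape.

    Returns the number of data rows on success. Pass expected_rows=None
    to skip the row-count check.
    """
    with Path(path).open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        # Compare names without depending on column order.
        if sorted(header) != sorted(EXPECTED_COLUMNS):
            raise ValueError(f"unexpected columns: {header}")
        n_rows = 0
        for row in reader:
            n_rows += 1
            if len(row) != len(header) or any(cell.strip() == "" for cell in row):
                raise ValueError(f"missing value in data row {n_rows}")
    if expected_rows is not None and n_rows != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {n_rows}")
    return n_rows
```

Calling check_housing_csv("data/raw/california_housing/data.csv") returns the row count when the file matches and raises ValueError otherwise.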

iris (`data/raw/iris/`)

Tabular classification dataset. data.csv is tracked in git.

sample_images (`data/raw/sample_images/`)

Small JPG image dataset for testing the image classification pipeline. Unlike the other image datasets, these images are tracked in git because the files are small.

cifar10 (`data/raw/cifar10/`)

Imported CIFAR-10 dataset converted into the pipeline's ImageFolder PNG layout.

Source: https://www.cs.toronto.edu/~kriz/cifar.html
Pipeline config: src/config/pipeline_cifar10.yaml

CIFAR-10 is consumed in the standard ImageFolder layout (data/raw/cifar10/images/<class>/*.png plus data/raw/cifar10/dataset.yaml). The conversion from the original CIFAR-10 archive into this layout was performed with a separate import script that is not included in this repository. To use this dataset, recreate that layout yourself — for example by downloading CIFAR-10 via torchvision and writing one PNG per image into the per-class folders. The generated PNG files are not tracked in git.
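Since the original import script is not available, the following is only a sketch of one way to rebuild the per-class PNG folders. The function name export_image_folder and the zero-padded file naming are assumptions, not the original script's behaviour. It accepts any iterable of (image, label_index) pairs whose images expose a PIL-style save(path) method, so it does not depend on a particular loader.

```python
from pathlib import Path


def export_image_folder(samples, class_names, out_dir):
    """Write each (image, label_index) pair to out_dir/<class>/<counter>.png.

    `samples` is any iterable of (image, int) pairs where `image` has a
    PIL-style save(path) method. Returns the number of files written per class.
    The file naming scheme (00000.png, 00001.png, ...) is an assumption.
    """
    out_dir = Path(out_dir)
    counts = {name: 0 for name in class_names}
    for image, label in samples:
        name = class_names[label]
        class_dir = out_dir / name
        class_dir.mkdir(parents=True, exist_ok=True)
        # The .png suffix lets PIL infer the output format.
        image.save(class_dir / f"{counts[name]:05d}.png")
        counts[name] += 1
    return counts
```

With torchvision installed, `ds = torchvision.datasets.CIFAR10(root="downloads", train=True, download=True)` yields (PIL image, label) pairs and exposes `ds.classes`, so `export_image_folder((ds[i] for i in range(len(ds))), ds.classes, "data/raw/cifar10/images")` should produce the layout described above; the `downloads` directory is a placeholder. The accompanying dataset.yaml still has to be written separately, using the existing files as templates.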

fivek (`data/raw/fivek/`)

Raw DNG photographs from the MIT-Adobe FiveK dataset, used to demonstrate the ISP preprocessing pipeline on raw images.

Image source: MIT-Adobe FiveK Dataset
Pipeline config: src/config/pipeline_fivek.yaml

Images are NOT tracked in git because raw DNG files are large.

The pipeline consumes FiveK as a raw-image classification dataset, end-to-end: raw DNG → ISP preprocessing → train/val/test split → CNN pipeline. FiveK is published as an image-enhancement dataset rather than a labelled classification dataset, so the class labels come from how you sort the photos into subfolders.

Folder structure for any raw DNG dataset:

data/raw/<dataset_name>/images/
    <class_a>/    ← all DNG images for class A
    <class_b>/    ← all DNG images for class B

Subdirectory names become the class labels. At least 5 images per class are required.
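Before starting a long ISP preprocessing run, it can help to confirm that every class folder meets the minimum. The helpers below are a sketch with hypothetical names; they are not part of the pipeline and count only files with a .dng suffix, compared case-insensitively.

```python
from pathlib import Path

MIN_IMAGES_PER_CLASS = 5  # minimum stated in this README


def count_images_per_class(images_dir, extensions=(".dng",)):
    """Map each class subfolder name to its number of matching image files."""
    counts = {}
    for class_dir in sorted(Path(images_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in extensions
            )
    return counts


def undersized_classes(images_dir, minimum=MIN_IMAGES_PER_CLASS):
    """Return only the classes that hold fewer than `minimum` images."""
    counts = count_images_per_class(images_dir)
    return {name: n for name, n in counts.items() if n < minimum}
```

An empty result from `undersized_classes("data/raw/fivek/images")` means every class folder has enough images.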

To reproduce:

Download the raw DNG files from the MIT-Adobe FiveK Dataset.
Sort a subset of the photos into per-class subfolders under data/raw/fivek/images/:

data/raw/fivek/images/
    <class_a>/    ← DNG images for class A
    <class_b>/    ← DNG images for class B

Run: run-pipeline --config src/config/pipeline_fivek.yaml

Note: Larger raw datasets are supported, but they require more storage, preprocessing time, and compute. In practice, large raw-image runs are better suited to local execution than to routine runs in GitHub Actions.

Directory Structure

Tabular datasets:

data/raw/<dataset_name>/
    dataset.yaml   ← tracked in git
    data.csv       ← tracked in git (small files only)

Image datasets:

data/raw/<dataset_name>/
    dataset.yaml          ← tracked in git
    images/
        <class_name>/
            image1.jpg    ← not tracked in git (except sample_images)

Adding a New Tabular Dataset

Create data/raw/<name>/data.csv
Run the pipeline — it detects missing metadata and prompts for target column and task type:

run-pipeline --config src/config/pipeline_tabular_classification.yaml

Adding a New Image Dataset

Create data/raw/<name>/images/<class>/ and place images inside
Create data/raw/<name>/dataset.yaml — see existing files as templates
Create src/config/pipeline_<name>.yaml pointing to the correct preprocessing config
Run the pipeline:

run-pipeline --config src/config/pipeline_<name>.yaml

For raw DNG images that require ISP preprocessing, use preprocessing_raw_image.yaml as the preprocessing config. For standard JPG/PNG images, use preprocessing_image.yaml.
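The rule above can be expressed as a small lookup on file extensions. This is an illustrative sketch only: the function name is hypothetical, treating .jpeg as equivalent to .jpg is an assumption, and the pipeline itself does not necessarily select the config this way.

```python
from pathlib import Path

RAW_SUFFIXES = {".dng"}
STANDARD_SUFFIXES = {".jpg", ".jpeg", ".png"}  # .jpeg is an assumption


def pick_preprocessing_config(images_dir):
    """Return the preprocessing config file name that matches the images found.

    Raises ValueError if the folder mixes raw and standard images or
    contains no supported image files.
    """
    suffixes = {
        f.suffix.lower() for f in Path(images_dir).rglob("*") if f.is_file()
    }
    has_raw = bool(suffixes & RAW_SUFFIXES)
    has_standard = bool(suffixes & STANDARD_SUFFIXES)
    if has_raw and has_standard:
        raise ValueError("folder mixes raw DNG and standard JPG/PNG images")
    if has_raw:
        return "preprocessing_raw_image.yaml"
    if has_standard:
        return "preprocessing_image.yaml"
    raise ValueError("no supported image files found")
```

Rejecting mixed folders is a design choice of this sketch, since a single preprocessing config is referenced per pipeline config.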
