ZebraOps — specification.md

Status: Draft v0.1
Audience: ML beginners, slightly-advanced developers, data engineers
Scope: Local-first, self-hosted, open-source reference implementation for “good MLOps” with minimal friction.

1. Purpose

ZebraOps is a local-first MLOps devkit and reference implementation that lets a user:

clone a repo
run docker compose up
drop in train.py or a train.ipynb
run training + evaluation + promotion + local serving
get monitoring + drift checks + retrain triggers
scale later to remote compute (Vast / SageMaker / Vertex) without rewriting the core project shape

ZebraOps does not target hyperscale (FAANG-sized) deployments. It targets small-to-mid teams that want correctness, reproducibility, and a paved path.

2. Design principles

Local-first: the default workflow runs fully on a laptop or a single workstation.
Paved road: one “golden path” that works immediately; advanced users can extend via adapters.
Contracts over frameworks: users keep their train.py and notebooks; ZebraOps wraps them.
Minimal new abstractions: only a few concepts are introduced (Model Spec, Dataset Manifest, Runner).
Reproducible by default: every run is tied to code version + config + data manifest + environment.
Composable open-source stack: integrate best-of-breed instead of reimplementing them.
Deterministic cleanup and scaffolding: no dependency on LLMs for core operations (LLM-assisted customization is optional).

3. Non-goals

Competing with full enterprise platforms (ClearML, Vertex AI, SageMaker end-to-end governance).
Replacing experiment trackers (use MLflow) or orchestrators (use Prefect).
Building a full feature store (Feast integration optional; ZebraOps can provide lightweight dataset/feature materialization patterns instead).

4. Target user personas

Beginner ML engineer: has a train.py and wants “real MLOps” without platform engineering.
Data engineer: wants clean ingestion + caching + scheduled retrains + observability.
Small team lead: wants shared runs, simple promotion gates, basic rollback, and dashboards.

5. Reference stack (default)

Local Devstack (Docker Compose):

Orchestration: Prefect (server + UI)
Tracking + registry: MLflow (tracking server + UI, model registry)
Storage:
- Postgres (Prefect + ZebraOps metadata if needed)
- MinIO (S3-compatible artifacts + datasets)
Serving: FastAPI (local inference service)
Monitoring: Prometheus + Grafana
Drift/data quality: Evidently (reports + metrics export)

Optional integrations (profiles):

Feature store: Feast (optional)
Remote compute: Vast.ai (SSH/Docker runner)
Cloud training: AWS SageMaker, Google Vertex AI

6. High-level architecture

             +----------------------+
             |  ZebraOps UI + CLI   |
             |  (Control Plane)     |
             +----------+-----------+
                        |
                        v
+-------------------+    +-----------+    +------------------+
| Prefect (flows)   |--> | Runner    |--> | MLflow Tracking  |
| ingest/train/eval |    | (py/nb)   |    | + Model Registry |
+-------------------+    +-----------+    +------------------+
          |                    |                   |
          v                    v                   v
     +---------+          +----------+        +----------+
     | MinIO   |          | Artifacts|        | Metrics  |
     | datasets|          | models   |        | params   |
     +---------+          +----------+        +----------+
          |
          v
+-------------------+    +-------------------+
| Monitoring jobs   |--> | Prometheus/Grafana|
| drift/quality/perf|    +-------------------+
+-------------------+
          |
          v
+-------------------+
| FastAPI Serving   |
| loads "prod" alias|
+-------------------+

7. Repository layout (reference)


zebraops/
  models/
    <model_name>/
      model.yaml
      train.py | train.ipynb
      predict.py              # optional
      requirements.txt        # optional (or extras in pyproject)
  data_sources/
    connectors/               # db/s3/files/http connectors
    flows/                    # Prefect flows: ingest/feature/labels
  cached_data/
    manifests/                # dataset manifests, schema, hashes
    parquet/                  # local materializations
  libraries/
    contracts/                # schema validators, manifests, events
    runner/                   # python + notebook runners
    promotion/                # gates + stage transitions
    serving/                  # FastAPI app + loaders
    monitoring/               # drift/quality/perf jobs
    adapters/                 # Vast/SageMaker/Vertex integrations
  platform/
    compose/
      docker-compose.yaml
    profiles/
      local.yaml
      shared.yaml
      vast.yaml
      sagemaker.yaml
      vertex.yaml
    ui/                       # simple web UI (thin)
  scripts/
    init_project.py
    prune_examples.py
    doctor.py
    reset_local.py
  docs/
    specification.md
    skills.md

8. Core concepts

8.1 Model Spec (`models/<name>/model.yaml`)

Declarative configuration for a model.

Minimum fields (v0.1):

name: string (unique)
type: python | notebook
entrypoint:
- python: train.py
- notebook: train.ipynb
datasets:
- train: dataset id
- valid: dataset id
- test: dataset id (optional)
resources:
- accelerator: cpu | gpu
- gpu_count: int (optional)
tracking:
- experiment_name: string
metrics:
- list of metric definitions:
  - name
  - direction: maximize | minimize
  - threshold: number (optional; promotion gate)
promotion:
- registry: mlflow
- alias_prod: default prod
- alias_staging: default staging

Optional fields:

hyperparams:
- grid: dict of lists (for parallel runs)
env:
- python_version
- pip_requirements (path)
notes: free text
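
As a rough illustration of the minimum fields above, the following sketch checks a spec that has already been parsed from model.yaml into a dict. The function name, the error messages, and all example values (dataset ids, the 0.8 threshold) are assumptions made for this example, not part of the spec.

```python
# Hypothetical v0.1 validator: field names follow the lists above,
# everything else (names, messages, example values) is illustrative.
REQUIRED_FIELDS = ("name", "type", "entrypoint", "datasets",
                   "resources", "tracking", "metrics", "promotion")
VALID_TYPES = ("python", "notebook")


def validate_model_spec(spec: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in spec]
    if errors:
        return errors
    if spec["type"] not in VALID_TYPES:
        errors.append("type must be 'python' or 'notebook'")
    for split in ("train", "valid"):  # test is optional
        if split not in spec["datasets"]:
            errors.append(f"datasets.{split} is required")
    if spec["resources"].get("accelerator") not in ("cpu", "gpu"):
        errors.append("resources.accelerator must be 'cpu' or 'gpu'")
    for metric in spec["metrics"]:
        if metric.get("direction") not in ("maximize", "minimize"):
            errors.append(f"metric {metric.get('name')!r}: bad direction")
    return errors


# Example spec as it might look after loading model.yaml (values are made up).
example_spec = {
    "name": "tabular-churn",
    "type": "python",
    "entrypoint": "train.py",
    "datasets": {"train": "churn-train-v1", "valid": "churn-valid-v1"},
    "resources": {"accelerator": "cpu"},
    "tracking": {"experiment_name": "churn"},
    "metrics": [{"name": "auc", "direction": "maximize", "threshold": 0.8}],
    "promotion": {"registry": "mlflow", "alias_prod": "prod",
                  "alias_staging": "staging"},
}
print(validate_model_spec(example_spec))  # prints []
```

Returning a list of problems instead of raising on the first one fits the "actionable fixes" goal of mlops doctor, since every issue can be reported in a single pass.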

8.2 Dataset Manifest (`cached_data/manifests/<id>.json`)

Immutable description of a dataset materialization used for training/eval.

Required fields:

id: unique id (content-addressable recommended)
created_at
source:
- connector type + query/path + parameters
schema:
- columns, dtypes, target label definition
splits:
- train/valid/test definitions (hash-based or explicit)
stats (optional):
- row count, missingness summary, basic distributions
artifacts:
- object store locations (MinIO/S3 URIs)
hashes:
- content hash of materialization (or partition hashes)
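
A minimal sketch of how a content-addressable manifest could be assembled from the required fields above. The `ds-` id prefix, the helper name, and the example source, schema, and URI values are assumptions for illustration only.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_manifest(data: bytes, source: dict, schema: dict,
                   splits: dict, artifact_uri: str) -> dict:
    """Assemble a manifest whose id is derived from the content hash."""
    content_hash = hashlib.sha256(data).hexdigest()
    return {
        "id": f"ds-{content_hash[:12]}",  # id scheme is an assumption
        "created_at": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "schema": schema,
        "splits": splits,
        "artifacts": [artifact_uri],
        "hashes": {"sha256": content_hash},
    }


# Example values are made up; a real ingest flow would hash the
# materialized parquet files rather than an inline byte string.
manifest = build_manifest(
    data=b"customer_id,churned\n1,0\n2,1\n",
    source={"connector": "files", "path": "data/churn.csv"},
    schema={"columns": {"customer_id": "int64", "churned": "int64"},
            "target": "churned"},
    splits={"strategy": "hash", "train": 0.8, "valid": 0.2},
    artifact_uri="s3://datasets/churn/part-0.parquet",
)
print(json.dumps(manifest, indent=2))
```

Because the id is derived from the bytes, re-materializing identical data yields the same id, which is what makes the manifest safe to treat as immutable.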

8.3 Run

A single execution of training/evaluation. ZebraOps guarantees:

run is traceable to:
- model spec version
- code revision
- dataset manifest(s)
- environment/container digest
run logs to MLflow:
- params, metrics, artifacts, tags

8.4 Promotion

Moving a model artifact to a named alias/stage (staging/prod) in the MLflow registry, gated by the metric thresholds declared in the Model Spec.
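
The gate itself can be sketched as a pure function over the metric definitions from model.yaml and the metrics observed for a run. The function name and the example metrics are hypothetical; treating a missing gated metric as a failure is a design assumption of this sketch.

```python
def passes_gates(metric_defs: list[dict], observed: dict[str, float]) -> bool:
    """Return True only if every metric that has a threshold meets it."""
    for metric in metric_defs:
        threshold = metric.get("threshold")
        if threshold is None:   # no threshold means no gate for this metric
            continue
        value = observed.get(metric["name"])
        if value is None:       # a gated metric that was never logged fails
            return False
        if metric["direction"] == "maximize" and value < threshold:
            return False
        if metric["direction"] == "minimize" and value > threshold:
            return False
    return True


# Example metric definitions (made-up values).
gates = [
    {"name": "auc", "direction": "maximize", "threshold": 0.80},
    {"name": "logloss", "direction": "minimize", "threshold": 0.50},
    {"name": "train_seconds", "direction": "minimize"},  # tracked, not gated
]
print(passes_gates(gates, {"auc": 0.84, "logloss": 0.41}))  # True
print(passes_gates(gates, {"auc": 0.78, "logloss": 0.41}))  # False
```

Once a run passes, one way to move the alias is MLflow's `MlflowClient.set_registered_model_alias(name, alias, version)`; whether ZebraOps wraps that call directly is not specified here.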

9. Runner contracts (drop-in support)

9.1 Python runner (`train.py`)

ZebraOps supports existing scripts if they can be executed with a config.

Contract:

ZebraOps invokes:
- python models/<name>/train.py --config <path>
Script must:
- read config (YAML/JSON)
- load dataset paths from config/env
- log to MLflow (directly or via ZebraOps helper)
- write model artifact(s) to an output directory provided in config/env
- exit non-zero on failure

Standard env variables set by ZebraOps:

ZEBRA_MODEL_NAME
ZEBRA_RUN_ID
ZEBRA_OUTPUT_DIR
ZEBRA_DATASET_TRAIN_URI
ZEBRA_DATASET_VALID_URI
MLFLOW_TRACKING_URI
MLFLOW_EXPERIMENT_NAME
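
A minimal train.py skeleton that satisfies the contract above might look like the sketch below. It uses only the standard library, reads a JSON config for brevity, and leaves the actual training and MLflow logging as comments; the `params` config key and the `model.json` artifact name are assumptions.

```python
import argparse
import json
import os
from pathlib import Path


def main(argv=None) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True)
    args = parser.parse_args(argv)

    # JSON is used here for brevity; the contract also allows YAML.
    config = json.loads(Path(args.config).read_text())

    # Dataset locations and the output directory come from ZebraOps env vars.
    train_uri = os.environ["ZEBRA_DATASET_TRAIN_URI"]
    valid_uri = os.environ["ZEBRA_DATASET_VALID_URI"]
    output_dir = Path(os.environ["ZEBRA_OUTPUT_DIR"])
    output_dir.mkdir(parents=True, exist_ok=True)

    # Placeholder for real work: load train_uri / valid_uri, fit a model,
    # and log params, metrics, and artifacts to MLflow.
    model = {
        "run_id": os.environ.get("ZEBRA_RUN_ID"),
        "params": config.get("params", {}),  # "params" key is hypothetical
        "trained_on": train_uri,
        "validated_on": valid_uri,
    }
    (output_dir / "model.json").write_text(json.dumps(model))
    return 0
```

A real script would end with `raise SystemExit(main())`. Any uncaught exception then makes the interpreter exit with a non-zero status, which covers the "exit non-zero on failure" requirement without extra handling.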

9.2 Notebook runner (`train.ipynb`)

ZebraOps executes parameterized notebooks via Papermill (or equivalent).

Contract:

ZebraOps injects parameters:
- dataset URIs, output dir, run id, model name, tracking URI
Notebook must log to MLflow and write artifacts to output dir.
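
The parameter injection can be sketched as follows, assuming Papermill is the executor. The parameter names are hypothetical (the spec lists what is injected, not how it is named), and Papermill is imported lazily so the first helper works without it installed.

```python
def notebook_parameters(model_name: str, run_id: str, output_dir: str,
                        train_uri: str, valid_uri: str,
                        tracking_uri: str) -> dict:
    """Collect the values ZebraOps injects; key names are assumptions."""
    return {
        "model_name": model_name,
        "run_id": run_id,
        "output_dir": output_dir,
        "dataset_train_uri": train_uri,
        "dataset_valid_uri": valid_uri,
        "mlflow_tracking_uri": tracking_uri,
    }


def run_notebook(notebook_path: str, executed_path: str,
                 parameters: dict) -> None:
    """Execute the notebook with injected parameters via Papermill."""
    import papermill as pm  # lazy import: only needed when actually running

    # Papermill overrides defaults in the cell tagged "parameters" and
    # writes the executed copy to executed_path.
    pm.execute_notebook(notebook_path, executed_path, parameters=parameters)


params = notebook_parameters(
    model_name="tabular-churn", run_id="r1", output_dir="/tmp/out",
    train_uri="s3://datasets/train", valid_uri="s3://datasets/valid",
    tracking_uri="http://localhost:5000",
)
print(sorted(params))
```

For this to work, train.ipynb needs a cell tagged `parameters` holding default values for the same names, so the notebook still runs interactively without ZebraOps.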

10. Control plane

10.1 CLI (`mlops` entrypoint)

The CLI is the primary interface; the UI calls the same backend.

Minimum commands (v0.1):

mlops doctor — validate environment, services, contracts
mlops up / mlops down — start/stop local devstack (compose wrappers)
mlops list models|datasets|runs
mlops ingest <dataset_id> — run ingest flow to materialize cached dataset + manifest
mlops train <model> [--grid|--params] [--profile local|vast|sagemaker|vertex]
mlops eval <model> --run <run_id>
mlops promote <model> --run <run_id> --to staging|prod
mlops rollback <model> --to <run_id>
mlops serve <model> [--alias prod|staging]
mlops monitor <model> — run drift/quality checks and export metrics
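
One possible shape for the command surface, sketched with argparse and covering only a subset of the commands above. The parser layout is an assumption, as is `local` being the default value of `--profile`.

```python
import argparse

PROFILES = ["local", "vast", "sagemaker", "vertex"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a subset of the v0.1 commands (illustrative)."""
    parser = argparse.ArgumentParser(prog="mlops")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("doctor")
    sub.add_parser("up")
    sub.add_parser("down")
    sub.add_parser("list").add_argument(
        "kind", choices=["models", "datasets", "runs"])
    sub.add_parser("ingest").add_argument("dataset_id")

    train = sub.add_parser("train")
    train.add_argument("model")
    train.add_argument("--profile", default="local", choices=PROFILES)

    promote = sub.add_parser("promote")
    promote.add_argument("model")
    promote.add_argument("--run", required=True)
    promote.add_argument("--to", required=True, choices=["staging", "prod"])
    return parser


args = build_parser().parse_args(
    ["promote", "tabular-churn", "--run", "abc123", "--to", "staging"])
print(args.command, args.model, args.run, args.to)
# prints: promote tabular-churn abc123 staging
```

Keeping parsing separate from execution lets the UI reuse the same backend functions without going through the command line.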

10.2 UI (simple)

A thin local web UI for beginners:

list models and their status (latest run, current prod alias)
buttons: ingest/train/eval/promote/serve/monitor
links out to:
- Prefect UI
- MLflow UI
- Grafana dashboards
view “recommendations” (e.g., drift exceeded → retrain suggested)

10.3 Local state DB

Default: SQLite at .zebraops/state.db (gitignored).
Stores:
- discovered models + spec hashes
- last known endpoints (mlflow/prefect/minio)
- local “active serving” pointers
- last monitoring results + acknowledgements
Must be portable to Postgres (shared deployment mode).
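
A sketch of what the local state schema could look like with the standard-library sqlite3 module. Table and column names are assumptions derived from the list of stored items above.

```python
import sqlite3
from pathlib import Path

# Hypothetical schema: one table per item in the "Stores" list.
SCHEMA = """
CREATE TABLE IF NOT EXISTS models (
    name TEXT PRIMARY KEY,
    spec_hash TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS endpoints (
    service TEXT PRIMARY KEY,   -- mlflow / prefect / minio
    url TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS serving_pointers (
    model_name TEXT PRIMARY KEY,
    alias TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS monitoring_results (
    id INTEGER PRIMARY KEY,
    model_name TEXT NOT NULL,
    checked_at TEXT NOT NULL,
    drift_detected INTEGER NOT NULL,
    acknowledged INTEGER NOT NULL DEFAULT 0
);
"""


def open_state_db(path: str = ".zebraops/state.db") -> sqlite3.Connection:
    """Open (and create if needed) the local state database."""
    if path != ":memory:":
        Path(path).parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = open_state_db(":memory:")
conn.execute("INSERT INTO models VALUES (?, ?)",
             ("tabular-churn", "sha256:abc"))
print(conn.execute("SELECT name, spec_hash FROM models").fetchall())
# prints: [('tabular-churn', 'sha256:abc')]
```

Sticking to plain TEXT and INTEGER columns keeps most of this portable. The auto-incrementing `id INTEGER PRIMARY KEY` is SQLite-specific and would need an identity column on Postgres.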

11. Orchestration (Prefect)

Prefect flows (minimum):

ingest_flow: materialize dataset + manifest
train_flow: run runner, log MLflow, store artifacts
eval_flow: compute metrics + produce eval report artifact
promote_flow: apply gates + update MLflow alias/stage
monitor_flow: drift/quality checks, export metrics, recommend retrain
retrain_flow: conditional wrapper (monitor → train → eval → promote)

Execution requirements:

local concurrency supports parallel hyperparam experiments
all flows parametrized by:
- model name
- dataset manifest ids
- profile (local/remote/cloud)
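
The conditional retrain wrapper can be illustrated in plain Python, with each step passed in as a callable. This is a sketch of the control flow only: in the reference stack each step would presumably be a Prefect flow or task (for example a function decorated with `@flow` from `prefect`), and the step signatures here are assumptions.

```python
from typing import Callable


def retrain_flow(
    model: str,
    monitor: Callable[[str], bool],           # True if retrain is recommended
    train: Callable[[str], str],              # returns a run id
    evaluate: Callable[[str, str], dict],     # returns metrics for the run
    promote: Callable[[str, str, dict], bool],  # True if gates passed
) -> str:
    """monitor -> train -> eval -> promote; returns a short outcome string."""
    if not monitor(model):
        return "skipped: no retrain recommended"
    run_id = train(model)
    metrics = evaluate(model, run_id)
    if promote(model, run_id, metrics):
        return f"promoted run {run_id}"
    return f"run {run_id} failed promotion gates"


# Stub steps standing in for the real flows.
outcome = retrain_flow(
    "tabular-churn",
    monitor=lambda m: True,
    train=lambda m: "run-42",
    evaluate=lambda m, r: {"auc": 0.84},
    promote=lambda m, r, metrics: metrics["auc"] >= 0.8,
)
print(outcome)  # promoted run run-42
```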

12. Monitoring and feedback loop

12.1 Local monitoring (minimum viable)

drift against training baseline using Evidently
export drift metrics to Prometheus
render an HTML report artifact stored in MinIO and linked in UI
retrain suggestion if thresholds exceeded
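
To make the drift-then-export loop concrete, the sketch below computes a Population Stability Index (PSI) with the standard library and formats it as a Prometheus text-exposition line. PSI is used here as a simple stand-in for the Evidently report, not as the method ZebraOps ships; the metric name and the 0.2 threshold (a common rule of thumb) are assumptions.

```python
import math


def psi(baseline: list[float], current: list[float], bins: int = 10) -> float:
    """Population Stability Index of current vs. baseline (equal-width bins)."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    expected, actual = proportions(baseline), proportions(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def prometheus_line(model: str, feature: str, value: float) -> str:
    """Format one sample in the Prometheus text exposition format."""
    return (f'zebraops_feature_drift_psi'
            f'{{model="{model}",feature="{feature}"}} {value:.4f}')


baseline = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in baseline]       # simulated drift
score = psi(baseline, shifted)

RETRAIN_THRESHOLD = 0.2                      # assumed threshold
print(prometheus_line("tabular-churn", "age", score))
print("retrain suggested" if score > RETRAIN_THRESHOLD else "no action")
```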

12.2 Production feedback (optional in v0.1, required later)

Define a minimal event schema for production services to emit:

Event types:

prediction_event (sampled)
data_quality_event
drift_event
label_event (when ground truth arrives)
performance_event (computed metrics)
incident_event (latency/error/cost anomalies)
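
A minimal event envelope covering the types above could look like this sketch. Only the event type names come from the spec; the field names (`model_name`, `payload`, `emitted_at`) are hypothetical.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

EVENT_TYPES = {
    "prediction_event", "data_quality_event", "drift_event",
    "label_event", "performance_event", "incident_event",
}


@dataclass
class Event:
    """Hypothetical envelope shared by all production feedback events."""
    event_type: str
    model_name: str
    payload: dict = field(default_factory=dict)
    emitted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def __post_init__(self) -> None:
        # Reject unknown types early so collectors see a closed vocabulary.
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type}")


event = Event("label_event", "tabular-churn",
              {"prediction_id": "p-1", "label": 1})
print(asdict(event))
```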

Collector mode (later):

simple ingestion API
storage in Postgres/ClickHouse (configurable)
jobs that produce:
- segment regressions
- retrain recommendations
- alerts

13. Serving

A FastAPI service that loads the model referenced by an MLflow registry alias (default: prod).
Endpoints:
- POST /predict (JSON)
- GET /health
- GET /model (current model version/alias)
Logging:
- request counts, latency to Prometheus
- optional sampled payload summaries for drift monitoring (privacy-aware)
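
The three endpoints can be sketched as a small service object plus a FastAPI wrapper. The class, the `features` payload key, and the in-process counters are assumptions; FastAPI is imported lazily so the service logic can be exercised without it.

```python
import time
from typing import Any, Callable


class ModelService:
    """Holds the loaded model and simple request statistics."""

    def __init__(self, model: Callable[[Any], Any], name: str,
                 alias: str = "prod", version: str = "unknown") -> None:
        # In a real service the model might come from
        # mlflow.pyfunc.load_model(f"models:/{name}@{alias}").
        self.model, self.name = model, name
        self.alias, self.version = alias, version
        self.request_count = 0
        self.total_latency_s = 0.0

    def predict(self, payload: dict) -> dict:
        start = time.perf_counter()
        prediction = self.model(payload["features"])  # key is hypothetical
        self.request_count += 1
        self.total_latency_s += time.perf_counter() - start
        return {"prediction": prediction}

    def health(self) -> dict:
        return {"status": "ok"}

    def model_info(self) -> dict:
        return {"name": self.name, "alias": self.alias,
                "version": self.version}


def create_app(service: ModelService):
    """Wire the service into FastAPI routes."""
    from fastapi import Body, FastAPI  # lazy import: only needed to serve

    app = FastAPI()

    @app.post("/predict")
    def predict(payload: dict = Body(...)) -> dict:
        return service.predict(payload)

    @app.get("/health")
    def health() -> dict:
        return service.health()

    @app.get("/model")
    def model() -> dict:
        return service.model_info()

    return app


service = ModelService(model=lambda feats: sum(feats),
                       name="tabular-churn", version="3")
print(service.predict({"features": [1, 2, 3]}))  # {'prediction': 6}
```

The counters here only illustrate where request counts and latency would be recorded; exporting them to Prometheus would need a metrics endpoint or client library on top.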

14. Profiles (local → shared → cloud)

14.1 Local (default)

Everything via Docker Compose.
Storage in MinIO.
Runs executed locally.

14.2 Shared

Shared MLflow/Prefect/MinIO endpoints.
Developers run CLI/UI locally pointing to shared services.

14.3 Vast

Submit training job to a remote machine via SSH/Docker.
Same container image; logs to shared MLflow; artifacts to MinIO/S3.

14.4 SageMaker / Vertex

Adapter submits a training job spec using provider SDK.
Requirements:
- container image
- dataset URIs in S3/GCS
- logs and artifacts routed to MLflow-compatible tracking (either via network or post-run sync)
ZebraOps keeps the contract stable across profiles: mlops train <model> --profile sagemaker.
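
One way to keep the command stable while the backend changes is a registry that maps profile names to adapter functions. The sketch below uses stub adapters; the function names and return values are placeholders, not the real Vast, SageMaker, or Vertex integrations.

```python
from typing import Callable


# Stub adapters: each would submit a job and return a run identifier.
def submit_local(model: str) -> str:
    return f"local:{model}"


def submit_vast(model: str) -> str:
    return f"vast:{model}"       # real adapter: SSH/Docker on a remote box


def submit_sagemaker(model: str) -> str:
    return f"sagemaker:{model}"  # real adapter: provider SDK training job


def submit_vertex(model: str) -> str:
    return f"vertex:{model}"     # real adapter: provider SDK training job


ADAPTERS: dict[str, Callable[[str], str]] = {
    "local": submit_local,
    "vast": submit_vast,
    "sagemaker": submit_sagemaker,
    "vertex": submit_vertex,
}


def train(model: str, profile: str = "local") -> str:
    """Dispatch a training request to the adapter for the chosen profile."""
    try:
        submit = ADAPTERS[profile]
    except KeyError:
        raise ValueError(f"unknown profile: {profile}") from None
    return submit(model)


print(train("tabular-churn", profile="sagemaker"))  # sagemaker:tabular-churn
```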

15. CI/CD (reference implementation)

GitHub Actions (minimum):

lint + unit tests
contract tests:
- validate model.yaml schema
- validate dataset manifest schema
integration test:
- start compose stack
- run one example end-to-end on small data
- smoke test serving endpoint

Optional:

“continuous training” workflow on schedule or when data changes
promotion gates enforced in CI (same code as local)

16. Security and secrets

Default secrets via .env (local), never committed.
Support secret backends later:
- environment variables
- SOPS-encrypted env files
- cloud secret managers (profile-specific)
No secret values are stored in SQLite/Postgres; only references to them (for example, the name of the secret to look up).

17. UX requirements (must-haves)

docker compose up results in:
- Prefect UI reachable
- MLflow UI reachable
- basic Grafana dashboards reachable
- ZebraOps UI reachable
mlops doctor prints actionable fixes.
mlops init can create a new model skeleton from templates.
Each shipped example runs end-to-end on CPU in under 10 minutes on a typical laptop, using small datasets.

18. Examples (shipped)

Minimum:

tabular-churn: classic supervised ML with clear metrics
llm-rag: retrieval + eval dataset + offline scoring

Optional:

timeseries-forecast
streaming-fraud (streaming simulated as micro-batches)

19. Roadmap

v0.1

local devstack
python + notebook runner
dataset manifests
MLflow tracking + basic promotion
FastAPI serving
drift report + Prometheus export
CLI + thin UI

v0.2

shared profile
Vast adapter
richer gates (latency/cost budget checks)
better hyperparam sweep UX

v0.3

basic production collector mode
label join + performance monitoring
retrain triggers via Prefect deployments

20. Contribution rules

Keep the default path simple; advanced features must be optional profiles.
Avoid introducing new infrastructure unless it replaces 3+ existing moving parts.
Maintain backward compatibility for model.yaml with explicit versioning.
Every new component must include:
- a runnable example
- docs updates
- integration test coverage where feasible
