ZebraOps — specification.md

Status: Draft v0.1
Audience: ML beginners, slightly-advanced developers, data engineers
Scope: Local-first, self-hosted, open-source reference implementation for “good MLOps” with minimal friction.


1. Purpose

ZebraOps is a local-first MLOps devkit and reference implementation that lets a user:

  • clone a repo
  • run docker compose up
  • drop in train.py or a train.ipynb
  • run training + evaluation + promotion + local serving
  • get monitoring + drift checks + retrain triggers
  • scale later to remote compute (Vast / SageMaker / Vertex) without rewriting the core project shape

ZebraOps does not target hyperscale (FAANG-sized) deployments. It targets small-to-mid teams that want correctness, reproducibility, and a paved path.


2. Design principles

  1. Local-first: the default workflow runs fully on a laptop or a single workstation.
  2. Paved road: one “golden path” that works immediately; advanced users can extend via adapters.
  3. Contracts over frameworks: users keep their train.py and notebooks; ZebraOps wraps them.
  4. Minimal new abstractions: only a few concepts are introduced (Model Spec, Dataset Manifest, Runner).
  5. Reproducible by default: every run is tied to code version + config + data manifest + environment.
  6. Composable open-source stack: integrate best-of-breed tools instead of reimplementing them.
  7. Deterministic cleanup and scaffolding: no dependency on LLMs for core operations (LLM-assisted customization is optional).

3. Non-goals

  • Competing with full enterprise platforms (ClearML, Vertex AI, SageMaker end-to-end governance).
  • Replacing experiment trackers (use MLflow) or orchestrators (use Prefect).
  • Building a full feature store (Feast integration optional; ZebraOps can provide lightweight dataset/feature materialization patterns instead).

4. Target user personas

  • Beginner ML engineer: has a train.py and wants “real MLOps” without platform engineering.
  • Data engineer: wants clean ingestion + caching + scheduled retrains + observability.
  • Small team lead: wants shared runs, simple promotion gates, basic rollback, and dashboards.

5. Reference stack (default)

Local Devstack (Docker Compose):

  • Orchestration: Prefect (server + UI)
  • Tracking + registry: MLflow (tracking server + UI, model registry)
  • Storage:
    • Postgres (Prefect + ZebraOps metadata if needed)
    • MinIO (S3-compatible artifacts + datasets)
  • Serving: FastAPI (local inference service)
  • Monitoring: Prometheus + Grafana
  • Drift/data quality: Evidently (reports + metrics export)

Optional integrations (profiles):

  • Feature store: Feast (optional)
  • Remote compute: Vast.ai (SSH/Docker runner)
  • Cloud training: AWS SageMaker, Google Vertex AI

6. High-level architecture


                 +----------------------+
                 |  ZebraOps UI + CLI   |
                 |  (Control Plane)     |
                 +----------+-----------+
                            |
                            v
+-------------------+    +-----------+    +------------------+
| Prefect (flows)   |--> | Runner    |--> | MLflow Tracking  |
| ingest/train/eval |    | (py/nb)   |    | + Model Registry |
+-------------------+    +-----------+    +------------------+
     |                        |                   |
     v                        v                   v
+---------+              +----------+        +----------+
| MinIO   |              | Artifacts|        | Metrics  |
| datasets|              | models   |        | params   |
+---------+              +----------+        +----------+
     |
     v
+-------------------+    +-------------------+
| Monitoring jobs   |--> | Prometheus/Grafana|
| drift/quality/perf|    +-------------------+
+-------------------+
     |
     v
+--------------------+
| FastAPI Serving    |
| loads "prod" alias |
+--------------------+


7. Repository layout (reference)


zebraops/
  models/
    <model_name>/
      model.yaml
      train.py | train.ipynb
      predict.py            # optional
      requirements.txt      # optional (or extras in pyproject)
  data_sources/
    connectors/             # db/s3/files/http connectors
    flows/                  # Prefect flows: ingest/feature/labels
  cached_data/
    manifests/              # dataset manifests, schema, hashes
    parquet/                # local materializations
  libraries/
    contracts/              # schema validators, manifests, events
    runner/                 # python + notebook runners
    promotion/              # gates + stage transitions
    serving/                # FastAPI app + loaders
    monitoring/             # drift/quality/perf jobs
    adapters/               # Vast/SageMaker/Vertex integrations
  platform/
    compose/
      docker-compose.yaml
    profiles/
      local.yaml
      shared.yaml
      vast.yaml
      sagemaker.yaml
      vertex.yaml
    ui/                     # simple web UI (thin)
  scripts/
    init_project.py
    prune_examples.py
    doctor.py
    reset_local.py
  docs/
    specification.md
    skills.md


8. Core concepts

8.1 Model Spec (models/<name>/model.yaml)

Declarative configuration for a model.

Minimum fields (v0.1):

  • name: string (unique)
  • type: python | notebook
  • entrypoint:
    • python: train.py
    • notebook: train.ipynb
  • datasets:
    • train: dataset id
    • valid: dataset id
    • test: dataset id (optional)
  • resources:
    • accelerator: cpu | gpu
    • gpu_count: int (optional)
  • tracking:
    • experiment_name: string
  • metrics:
    • list of metric definitions:
      • name
      • direction: maximize | minimize
      • threshold: number (optional; promotion gate)
  • promotion:
    • registry: mlflow
    • alias_prod: default prod
    • alias_staging: default staging

Optional fields:

  • hyperparams:
    • grid: dict of lists (for parallel runs)
  • env:
    • python_version
    • pip_requirements (path)
  • notes: free text
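
A contract check for the minimum fields above could look roughly like the following. This is a sketch, not the project's validator: the function name `validate_model_spec` and the error wording are assumptions, and it treats `promotion` as optional because the spec gives its aliases default values.

```python
from typing import Any, Dict, List

# Top-level keys taken from the "Minimum fields (v0.1)" list.
REQUIRED_TOP_LEVEL = ("name", "type", "entrypoint", "datasets",
                      "resources", "tracking", "metrics")
# Expected entrypoint suffix per model type.
ENTRYPOINT_SUFFIX = {"python": ".py", "notebook": ".ipynb"}


def validate_model_spec(spec: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the spec passed."""
    errors = [f"missing required field: {key}"
              for key in REQUIRED_TOP_LEVEL if key not in spec]
    if errors:
        return errors  # nested checks below need every top-level key

    suffix = ENTRYPOINT_SUFFIX.get(spec["type"])
    if suffix is None:
        errors.append("type must be 'python' or 'notebook'")
    elif not str(spec["entrypoint"]).endswith(suffix):
        errors.append(f"entrypoint must end with {suffix}")

    for split in ("train", "valid"):  # "test" is optional
        if split not in spec["datasets"]:
            errors.append(f"datasets.{split} is required")

    if spec["resources"].get("accelerator") not in ("cpu", "gpu"):
        errors.append("resources.accelerator must be 'cpu' or 'gpu'")

    if "experiment_name" not in spec["tracking"]:
        errors.append("tracking.experiment_name is required")

    for index, metric in enumerate(spec["metrics"]):
        if "name" not in metric:
            errors.append(f"metrics[{index}].name is required")
        if metric.get("direction") not in ("maximize", "minimize"):
            errors.append(f"metrics[{index}].direction must be "
                          "'maximize' or 'minimize'")
    return errors
```

Returning a list of messages rather than raising on the first problem fits the "mlops doctor prints actionable fixes" requirement, since all problems can be reported in one pass.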

8.2 Dataset Manifest (cached_data/manifests/<id>.json)

Immutable description of a dataset materialization used for training/eval.

Required fields:

  • id: unique id (content-addressable recommended)
  • created_at
  • source:
    • connector type + query/path + parameters
  • schema:
    • columns, dtypes, target label definition
  • splits:
    • train/valid/test definitions (hash-based or explicit)
  • stats (optional):
    • row count, missingness summary, basic distributions
  • artifacts:
    • object store locations (MinIO/S3 URIs)
  • hashes:
    • content hash of materialization (or partition hashes)
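
One way to get a content-addressable id is to derive it from a hash of the materialized file. The sketch below assumes a single-file materialization; the helper names, the `ds-` id prefix, and the exact JSON layout are illustrative assumptions rather than part of the spec.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large materializations fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_file: Path, source: Dict[str, Any],
                   schema: Dict[str, Any], splits: Dict[str, Any],
                   artifact_uri: str) -> Dict[str, Any]:
    """Assemble the required manifest fields; the id follows the content."""
    content_hash = file_sha256(data_file)
    return {
        "id": f"ds-{content_hash[:16]}",  # hypothetical id format
        "created_at": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "schema": schema,
        "splits": splits,
        "artifacts": {"uri": artifact_uri},
        "hashes": {"sha256": content_hash},
    }


def write_manifest(manifest: Dict[str, Any], manifests_dir: Path) -> Path:
    """Write <id>.json once; an existing manifest is never overwritten."""
    manifests_dir.mkdir(parents=True, exist_ok=True)
    target = manifests_dir / f"{manifest['id']}.json"
    if not target.exists():
        target.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return target
```

Because the id is derived from the bytes, re-materializing identical data yields the same id, which keeps manifests immutable without extra bookkeeping.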

8.3 Run

A single execution of training/evaluation. ZebraOps guarantees:

  • run is traceable to:
    • model spec version
    • code revision
    • dataset manifest(s)
    • environment/container digest
  • run logs to MLflow:
    • params, metrics, artifacts, tags

8.4 Promotion

Moving a model version to a named alias/stage (staging/prod) in the MLflow model registry, gated by the metric thresholds declared in the Model Spec.
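
The gate itself can be a pure function over the metric definitions from model.yaml and the metrics logged for a run. The following is a minimal sketch under that assumption; `passes_gates` is a hypothetical name, and a metric without a threshold is treated as ungated.

```python
from typing import Any, Dict, List, Tuple


def passes_gates(metric_defs: List[Dict[str, Any]],
                 run_metrics: Dict[str, float]) -> Tuple[bool, List[str]]:
    """Check logged metrics against the thresholds from model.yaml.

    Returns (ok, failures); failures explains every gate that did not pass.
    """
    failures: List[str] = []
    for definition in metric_defs:
        threshold = definition.get("threshold")
        if threshold is None:
            continue  # threshold is optional: no gate for this metric
        name = definition["name"]
        value = run_metrics.get(name)
        if value is None:
            failures.append(f"{name}: metric was not logged")
        elif definition["direction"] == "maximize" and value < threshold:
            failures.append(f"{name}: {value} is below threshold {threshold}")
        elif definition["direction"] == "minimize" and value > threshold:
            failures.append(f"{name}: {value} is above threshold {threshold}")
    return (not failures, failures)
```

Once the gates pass, the alias update could be a single registry call, for example MLflow's `MlflowClient.set_registered_model_alias(name, alias, version)` in versions that support aliases. Keeping the gate separate from that call lets the same check run locally and in CI.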


9. Runner contracts (drop-in support)

9.1 Python runner (train.py)

ZebraOps supports existing scripts if they can be executed with a config.

Contract:

  • ZebraOps invokes:
    • python models/<name>/train.py --config <path>
  • Script must:
    • read config (YAML/JSON)
    • load dataset paths from config/env
    • log to MLflow (directly or via ZebraOps helper)
    • write model artifact(s) to an output directory provided in config/env
    • exit non-zero on failure

Standard env variables set by ZebraOps:

  • ZEBRA_MODEL_NAME
  • ZEBRA_RUN_ID
  • ZEBRA_OUTPUT_DIR
  • ZEBRA_DATASET_TRAIN_URI
  • ZEBRA_DATASET_VALID_URI
  • MLFLOW_TRACKING_URI
  • MLFLOW_EXPERIMENT_NAME
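
A train.py that satisfies this contract might be shaped like the sketch below. It assumes a JSON config with hypothetical `datasets` and `output_dir` keys as fallbacks for the env variables, and the `train` function is a placeholder that writes a dummy artifact rather than fitting a model.

```python
import argparse
import json
import os
import sys
from pathlib import Path
from typing import Dict, List, Optional


def train(train_uri: str, valid_uri: str, output_dir: Path) -> Dict[str, float]:
    """Placeholder training step; swap in real model fitting here."""
    output_dir.mkdir(parents=True, exist_ok=True)
    metrics = {"accuracy": 0.0}  # dummy value, not a real result
    (output_dir / "metrics.json").write_text(json.dumps(metrics))
    return metrics


def main(argv: Optional[List[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="ZebraOps-style train script")
    parser.add_argument("--config", required=True, help="path to a JSON config")
    args = parser.parse_args(argv)
    try:
        config = json.loads(Path(args.config).read_text())
        datasets = config.get("datasets", {})
        # Env variables set by the runner win over config values.
        train_uri = os.environ.get("ZEBRA_DATASET_TRAIN_URI") or datasets.get("train")
        valid_uri = os.environ.get("ZEBRA_DATASET_VALID_URI") or datasets.get("valid")
        output_dir = os.environ.get("ZEBRA_OUTPUT_DIR") or config.get("output_dir")
        if not (train_uri and valid_uri and output_dir):
            raise ValueError("dataset URIs and output dir must come from env or config")
        train(train_uri, valid_uri, Path(output_dir))
        # MLflow logging (e.g. mlflow.log_metric) would go here.
    except Exception as exc:
        print(f"training failed: {exc}", file=sys.stderr)
        return 1  # non-zero exit signals failure to the runner
    return 0

# In a real train.py, finish with: raise SystemExit(main())
```

Returning the exit code from `main` instead of calling `sys.exit` inside it keeps the script importable, which makes the contract easy to cover in tests.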

9.2 Notebook runner (train.ipynb)

ZebraOps executes parameterized notebooks via Papermill (or equivalent).

Contract:

  • ZebraOps injects parameters:
    • dataset URIs, output dir, run id, model name, tracking URI
  • Notebook must log to MLflow and write artifacts to output dir.
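
With Papermill, the injection step could be sketched as below. `papermill.execute_notebook` with a `parameters` dict is Papermill's documented entry point; the parameter names themselves are assumptions chosen to mirror the env variables of the Python runner.

```python
from typing import Dict


def build_notebook_parameters(model_name: str, run_id: str, output_dir: str,
                              train_uri: str, valid_uri: str,
                              tracking_uri: str) -> Dict[str, str]:
    """Collect the values the contract says ZebraOps injects.

    The keys are hypothetical; the notebook needs matching variable names
    in its cell tagged "parameters".
    """
    return {
        "model_name": model_name,
        "run_id": run_id,
        "output_dir": output_dir,
        "dataset_train_uri": train_uri,
        "dataset_valid_uri": valid_uri,
        "mlflow_tracking_uri": tracking_uri,
    }


def run_notebook(notebook_path: str, executed_path: str,
                 parameters: Dict[str, str]) -> None:
    """Execute the notebook and save the executed copy for inspection."""
    import papermill as pm  # imported lazily; only needed when executing

    pm.execute_notebook(notebook_path, executed_path, parameters=parameters)
```

The executed copy at `executed_path` is a natural candidate to store as a run artifact, since it records the injected values alongside the cell outputs.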

10. Control plane

10.1 CLI (mlops entrypoint)

CLI is the primary interface; UI calls the same backend.

Minimum commands (v0.1):

  • mlops doctor — validate environment, services, contracts
  • mlops up / mlops down — start/stop local devstack (compose wrappers)
  • mlops list models|datasets|runs
  • mlops ingest <dataset_id> — run ingest flow to materialize cached dataset + manifest
  • mlops train <model> [--grid|--params] [--profile local|vast|sagemaker|vertex]
  • mlops eval <model> --run <run_id>
  • mlops promote <model> --run <run_id> --to staging|prod
  • mlops rollback <model> --to <run_id>
  • mlops serve <model> [--alias prod|staging]
  • mlops monitor <model> — run drift/quality checks and export metrics
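
A subset of these commands can be expressed with the standard library's argparse, as in the sketch below. It covers only four commands; the `dest` names and the defaults (local profile, prod alias) are assumptions that follow the defaults stated elsewhere in this spec.

```python
import argparse

PROFILES = ("local", "vast", "sagemaker", "vertex")
ALIASES = ("prod", "staging")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a few of the v0.1 commands."""
    parser = argparse.ArgumentParser(prog="mlops")
    commands = parser.add_subparsers(dest="command", required=True)

    commands.add_parser("doctor", help="validate environment, services, contracts")

    train = commands.add_parser("train", help="run training for a model")
    train.add_argument("model")
    train.add_argument("--profile", choices=PROFILES, default="local")

    promote = commands.add_parser("promote", help="move a run to an alias")
    promote.add_argument("model")
    promote.add_argument("--run", dest="run_id", required=True)
    promote.add_argument("--to", dest="target", choices=("staging", "prod"),
                         required=True)

    serve = commands.add_parser("serve", help="serve a model locally")
    serve.add_argument("model")
    serve.add_argument("--alias", choices=ALIASES, default="prod")
    return parser
```

For example, `build_parser().parse_args(["train", "tabular-churn", "--profile", "vast"])` yields a namespace with `command="train"`, `model="tabular-churn"`, and `profile="vast"`, which the UI backend could construct directly instead of parsing strings.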

10.2 UI (simple)

A thin local web UI for beginners:

  • list models and their status (latest run, current prod alias)
  • buttons: ingest/train/eval/promote/serve/monitor
  • links out to:
    • Prefect UI
    • MLflow UI
    • Grafana dashboards
  • view “recommendations” (e.g., drift exceeded → retrain suggested)

10.3 Local state DB

  • Default: SQLite at .zebraops/state.db (gitignored).
  • Stores:
    • discovered models + spec hashes
    • last known endpoints (mlflow/prefect/minio)
    • local “active serving” pointers
    • last monitoring results + acknowledgements
  • Must be portable to Postgres (shared deployment mode).
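
The stored items above map onto a small relational schema. The sketch below uses the standard library's sqlite3; the table and column names are assumptions, and the SQL is kept to constructs that Postgres also accepts (the `?` placeholders would need to change for a Postgres driver).

```python
import sqlite3

# One table per item in the "Stores" list; names are illustrative.
SCHEMA = """
CREATE TABLE IF NOT EXISTS models (
    name TEXT PRIMARY KEY,
    spec_hash TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS endpoints (
    service TEXT PRIMARY KEY,
    url TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS serving_pointers (
    model_name TEXT PRIMARY KEY,
    alias TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS monitoring_results (
    model_name TEXT NOT NULL,
    checked_at TEXT NOT NULL,
    drift_detected INTEGER NOT NULL,
    acknowledged INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (model_name, checked_at)
);
"""


def open_state_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the state DB and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_model(conn: sqlite3.Connection, name: str, spec_hash: str) -> None:
    """Upsert a discovered model and the hash of its model.yaml."""
    conn.execute(
        "INSERT INTO models (name, spec_hash) VALUES (?, ?) "
        "ON CONFLICT(name) DO UPDATE SET spec_hash = excluded.spec_hash",
        (name, spec_hash),
    )
    conn.commit()
```

Passing `.zebraops/state.db` as `path` gives the default on-disk location; the in-memory default is only there to keep the sketch easy to try.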

11. Orchestration (Prefect)

Prefect flows (minimum):

  • ingest_flow: materialize dataset + manifest
  • train_flow: run runner, log MLflow, store artifacts
  • eval_flow: compute metrics + produce eval report artifact
  • promote_flow: apply gates + update MLflow alias/stage
  • monitor_flow: drift/quality checks, export metrics, recommend retrain
  • retrain_flow: conditional wrapper (monitor → train → eval → promote)

Execution requirements:

  • local concurrency supports parallel hyperparam experiments
  • all flows parametrized by:
    • model name
    • dataset manifest ids
    • profile (local/remote/cloud)
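
The conditional wrapper in retrain_flow can be sketched in plain Python with the four steps passed in as callables. In the reference implementation these would presumably be Prefect flows or tasks (Prefect's `@flow` decorator); the signatures and the `retrain_recommended` key used here are assumptions.

```python
from typing import Any, Callable, Dict


def retrain_flow(model_name: str,
                 monitor: Callable[[str], Dict[str, Any]],
                 train: Callable[[str], str],
                 evaluate: Callable[[str, str], bool],
                 promote: Callable[[str, str], None]) -> Dict[str, Any]:
    """monitor -> train -> eval -> promote, stopping early when possible.

    monitor returns a report with a boolean "retrain_recommended";
    train returns a run id; evaluate returns True when gates pass.
    """
    report = monitor(model_name)
    if not report["retrain_recommended"]:
        return {"retrained": False, "promoted": False, "run_id": None}

    run_id = train(model_name)
    gates_passed = evaluate(model_name, run_id)
    if gates_passed:
        promote(model_name, run_id)
    return {"retrained": True, "promoted": gates_passed, "run_id": run_id}
```

Injecting the steps keeps the control flow testable without a running Prefect server, and the same function body can be wrapped by an orchestrator later.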

12. Monitoring and feedback loop

12.1 Local monitoring (minimum viable)

  • drift against training baseline using Evidently
  • export drift metrics to Prometheus
  • render an HTML report artifact stored in MinIO and linked in UI
  • retrain suggestion if thresholds exceeded
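
To make the "drift against a baseline, then threshold" idea concrete, here is a sketch using the Population Stability Index on one numeric feature. This is a stand-in technique, not Evidently's API or its default drift method, and the 0.2 threshold is an illustrative value rather than one defined by this spec.

```python
import math
from typing import List, Sequence


def psi(baseline: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` against `baseline`.

    Bins are equal-width over the baseline range; values outside that
    range fall into the first or last bin. Both inputs must be non-empty.
    """
    low, high = min(baseline), max(baseline)
    width = (high - low) / bins or 1.0  # avoid zero width for constant data

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - low) / width)
            counts[min(max(index, 0), bins - 1)] += 1
        # eps keeps the logarithm finite when a bin is empty
        return [max(count / len(values), eps) for count in counts]

    expected, actual = proportions(baseline), proportions(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def retrain_suggested(score: float, threshold: float = 0.2) -> bool:
    """Suggest a retrain when the drift score exceeds the threshold."""
    return score > threshold
```

Identical distributions score zero, and the score grows as mass moves between bins, so a single number per feature can be exported as a gauge and compared against a per-model threshold.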

12.2 Production feedback (optional in v0.1, required later)

Define a minimal event schema for production services to emit:

Event types:

  • prediction_event (sampled)
  • data_quality_event
  • drift_event
  • label_event (when ground truth arrives)
  • performance_event (computed metrics)
  • incident_event (latency/error/cost anomalies)
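
A minimal envelope shared by all six event types could be modeled as a dataclass. Only the event type names come from the list above; the field names (`model_name`, `payload`, `emitted_at`) are assumptions, since the spec does not define the schema yet.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict

EVENT_TYPES = frozenset({
    "prediction_event", "data_quality_event", "drift_event",
    "label_event", "performance_event", "incident_event",
})


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class Event:
    """Common envelope; type-specific details live in `payload`."""
    event_type: str
    model_name: str
    payload: Dict[str, Any] = field(default_factory=dict)
    emitted_at: str = field(default_factory=_utc_now)

    def __post_init__(self) -> None:
        # Reject unknown types early so collectors see a closed set.
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type}")

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)
```

A single envelope with a free-form payload keeps the collector's ingestion API to one endpoint while leaving room to tighten each payload schema later.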

Collector mode (later):

  • simple ingestion API
  • storage in Postgres/ClickHouse (configurable)
  • jobs that produce:
    • segment regressions
    • retrain recommendations
    • alerts

13. Serving

  • FastAPI service that loads the model referenced by MLflow registry alias (default: prod).
  • Endpoints:
    • POST /predict (JSON)
    • GET /health
    • GET /model (current model version/alias)
  • Logging:
    • request counts, latency to Prometheus
    • optional sampled payload summaries for drift monitoring (privacy-aware)
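
The three endpoints could be wired up as in the sketch below. The request shape (`{"rows": [...]}`), the `create_app` factory, and the `predict_fn` callable are assumptions; in practice `predict_fn` would wrap a model loaded from the registry alias, for example through MLflow's `models:/<name>@prod` URI form.

```python
from typing import Any, Callable, Dict, List

PredictFn = Callable[[List[Dict[str, Any]]], List[Any]]


def handle_predict(body: Dict[str, Any], predict_fn: PredictFn) -> Dict[str, Any]:
    """Validate the JSON body and run the model on its rows."""
    rows = body.get("rows")
    if not isinstance(rows, list):
        raise ValueError('body must contain a "rows" list')
    return {"predictions": predict_fn(rows)}


def create_app(predict_fn: PredictFn, model_info: Dict[str, str]):
    """Build the FastAPI app; FastAPI is imported lazily on purpose."""
    from fastapi import Body, FastAPI, HTTPException

    app = FastAPI(title="ZebraOps serving (sketch)")

    @app.get("/health")
    def health():
        return {"status": "ok"}

    @app.get("/model")
    def model():
        # e.g. {"name": ..., "alias": "prod", "version": ...}
        return model_info

    @app.post("/predict")
    def predict(body: Dict[str, Any] = Body(...)):
        try:
            return handle_predict(body, predict_fn)
        except ValueError as exc:
            raise HTTPException(status_code=422, detail=str(exc))

    return app
```

Keeping `handle_predict` free of framework imports means the request logic can be unit-tested directly, while the CI smoke test exercises the HTTP layer.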

14. Profiles (local → shared → cloud)

14.1 Local (default)

  • Everything via Docker Compose.
  • Storage in MinIO.
  • Runs executed locally.

14.2 Shared

  • Shared MLflow/Prefect/MinIO endpoints.
  • Developers run CLI/UI locally pointing to shared services.

14.3 Vast

  • Submit training job to a remote machine via SSH/Docker.
  • Same container image; logs to shared MLflow; artifacts to MinIO/S3.

14.4 SageMaker / Vertex

  • Adapter submits a training job spec using provider SDK.
  • Requirements:
    • container image
    • dataset URIs in S3/GCS
    • logs and artifacts routed to MLflow-compatible tracking (either via network or post-run sync)
  • ZebraOps keeps contract stable: mlops train <model> --profile sagemaker.
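
Keeping the command stable across profiles suggests a small adapter interface behind `--profile`. The sketch below is one possible shape: the `TrainingAdapter` protocol, its `submit` signature, and the registry are all hypothetical, and only a trivial local adapter is shown.

```python
from typing import Dict, Protocol


class TrainingAdapter(Protocol):
    """What every profile-specific adapter would need to provide."""

    def submit(self, model_name: str, image: str,
               dataset_uris: Dict[str, str]) -> str:
        """Start a training job and return a job identifier."""
        ...


class LocalAdapter:
    """Stand-in adapter; a real one would launch the local runner."""

    def submit(self, model_name: str, image: str,
               dataset_uris: Dict[str, str]) -> str:
        return f"local-{model_name}"


# Remote adapters (vast, sagemaker, vertex) would register here as well.
ADAPTERS: Dict[str, TrainingAdapter] = {"local": LocalAdapter()}


def submit_training(profile: str, model_name: str, image: str,
                    dataset_uris: Dict[str, str]) -> str:
    """Dispatch `mlops train <model> --profile <profile>` to an adapter."""
    adapter = ADAPTERS.get(profile)
    if adapter is None:
        raise ValueError(f"no adapter registered for profile {profile!r}")
    return adapter.submit(model_name, image, dataset_uris)
```

With this shape, adding a provider means registering one more adapter; the CLI and the flows keep calling `submit_training` unchanged.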

15. CI/CD (reference implementation)

GitHub Actions (minimum):

  • lint + unit tests
  • contract tests:
    • validate model.yaml schema
    • validate dataset manifest schema
  • integration test:
    • start compose stack
    • run one example end-to-end on small data
    • smoke test serving endpoint

Optional:

  • “continuous training” workflow on schedule or when data changes
  • promotion gates enforced in CI (same code as local)

16. Security and secrets

  • Default secrets via .env (local), never committed.
  • Support secret backends later:
    • environment variables
    • SOPS-encrypted env files
    • cloud secret managers (profile-specific)
  • No secret values stored in SQLite/Postgres; only references (the names or keys used to look a secret up).
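
Storing references rather than values implies a resolution step at the point of use. The sketch below assumes a hypothetical `env:NAME` reference format and supports only the environment-variable backend; other backends would add further schemes.

```python
import os
from typing import Mapping


def resolve_secret(reference: str,
                   environ: Mapping[str, str] = os.environ) -> str:
    """Turn a stored reference such as "env:MINIO_SECRET_KEY" into a value.

    Only the reference string is persisted; the value is read at call time.
    """
    scheme, _, key = reference.partition(":")
    if scheme != "env" or not key:
        raise ValueError(f"unsupported secret reference: {reference!r}")
    if key not in environ:
        raise LookupError(f"secret {key!r} is not set in the environment")
    return environ[key]
```

Resolving at call time also means a rotated secret is picked up on the next call, without touching the state DB.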

17. UX requirements (must-haves)

  • docker compose up results in:
    • Prefect UI reachable
    • MLflow UI reachable
    • basic Grafana dashboards reachable
    • ZebraOps UI reachable
  • mlops doctor prints actionable fixes.
  • mlops init can create a new model skeleton from templates.
  • Example(s) run end-to-end on CPU in <10 minutes on a normal laptop (small datasets).

18. Examples (shipped)

Minimum:

  1. tabular-churn: classic supervised ML with clear metrics
  2. llm-rag: retrieval + eval dataset + offline scoring

Optional:

  3. timeseries-forecast
  4. streaming-fraud (streaming simulated as micro-batches)

19. Roadmap

v0.1

  • local devstack
  • python + notebook runner
  • dataset manifests
  • MLflow tracking + basic promotion
  • FastAPI serving
  • drift report + Prometheus export
  • CLI + thin UI

v0.2

  • shared profile
  • Vast adapter
  • richer gates (latency/cost budget checks)
  • better hyperparam sweep UX

v0.3

  • basic production collector mode
  • label join + performance monitoring
  • retrain triggers via Prefect deployments

20. Contribution rules

  • Keep the default path simple; advanced features must be optional profiles.
  • Avoid introducing new infrastructure unless it replaces 3+ existing moving parts.
  • Maintain backward compatibility for model.yaml with explicit versioning.
  • Every new component must include:
    • a runnable example
    • docs updates
    • integration test coverage where feasible