feat: add fleet version and binary hash observability#156

Closed
ethenotethan wants to merge 11 commits into swift-provider from feat/version-observability

Conversation

@ethenotethan
Contributor

Summary

  • Add providers.per_version and providers.per_binary_hash gauges to the Datadog gauge loop
  • Add coordinator.min_provider_version_set gauge
  • Add provider_version_below_minimum counter at all 3 version gate points (registration, challenge revalidation, manifest sync), tagged by gate and version
  • Add ProviderCountByVersion() and ProviderCountByBinaryHash() to registry
  • Add "Fleet Version & Binary Hash" section to Datadog dashboard
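The two registry helpers named above could be shaped roughly as follows. This is a minimal sketch, not the PR's code: the `Provider` and `Registry` types, the `countBy` helper, the `"unknown"` bucket for empty values, and the demo data are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Provider is a hypothetical, trimmed-down registry entry.
type Provider struct {
	Version    string
	BinaryHash string
}

// Registry is a hypothetical in-memory provider registry.
type Registry struct {
	mu        sync.RWMutex
	providers map[string]*Provider // keyed by provider ID
}

// countBy groups providers by the key that keyOf extracts.
// Empty keys are bucketed as "unknown" (an assumption of this sketch).
func (r *Registry) countBy(keyOf func(*Provider) string) map[string]int {
	r.mu.RLock()
	defer r.mu.RUnlock()
	counts := make(map[string]int)
	for _, p := range r.providers {
		key := keyOf(p)
		if key == "" {
			key = "unknown"
		}
		counts[key]++
	}
	return counts
}

// ProviderCountByVersion returns the number of providers per reported version.
func (r *Registry) ProviderCountByVersion() map[string]int {
	return r.countBy(func(p *Provider) string { return p.Version })
}

// ProviderCountByBinaryHash returns the number of providers per binary hash.
func (r *Registry) ProviderCountByBinaryHash() map[string]int {
	return r.countBy(func(p *Provider) string { return p.BinaryHash })
}

// newDemoRegistry builds a small registry with made-up providers.
func newDemoRegistry() *Registry {
	return &Registry{providers: map[string]*Provider{
		"p1": {Version: "0.4.1", BinaryHash: "aaa"},
		"p2": {Version: "0.4.1", BinaryHash: "bbb"},
		"p3": {Version: "0.4.2", BinaryHash: "bbb"},
		"p4": {},
	}}
}

func main() {
	r := newDemoRegistry()
	// A gauge loop would emit one gauge per key, tagged with that key.
	for version, n := range r.ProviderCountByVersion() {
		fmt.Printf("providers.per_version version:%s = %d\n", version, n)
	}
	for hash, n := range r.ProviderCountByBinaryHash() {
		fmt.Printf("providers.per_binary_hash binary_hash:%s = %d\n", hash, n)
	}
}
```

Returning a fresh map on each call keeps the read lock short and lets the gauge loop iterate without holding the registry lock.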

New Metrics

| Metric | Type | Tags |
|---|---|---|
| providers.per_version | gauge | version |
| providers.per_binary_hash | gauge | binary_hash |
| coordinator.min_provider_version_set | gauge | min_version |
| provider_version_below_minimum | counter | gate, version |
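To illustrate how a counter tagged by gate and version might be incremented at a version gate, here is a hedged sketch. The `counterSink` interface, the `memorySink` stand-in, the `versionLess` comparison, and the gate tag values are hypothetical; the PR's actual metrics client and version parsing are not shown in this conversation. The only convention borrowed from DogStatsD is the `key:value` tag format.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// counterSink is a hypothetical stand-in for a metrics client.
type counterSink interface {
	Incr(name string, tags []string)
}

// memorySink records increments in memory so the sketch runs without an agent.
type memorySink struct{ counts map[string]int }

func (m *memorySink) Incr(name string, tags []string) {
	m.counts[name+"|"+strings.Join(tags, ",")]++
}

// versionLess reports whether a < b, comparing dot-separated numeric parts.
// Missing or non-numeric parts count as 0 (a simplification of this sketch).
func versionLess(a, b string) bool {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		var x, y int
		if i < len(as) {
			x, _ = strconv.Atoi(as[i])
		}
		if i < len(bs) {
			y, _ = strconv.Atoi(bs[i])
		}
		if x != y {
			return x < y
		}
	}
	return false
}

// checkMinVersion returns true when the provider passes the gate. On failure
// it increments the counter, tagged with the gate name and offending version.
func checkMinVersion(sink counterSink, gate, version, minVersion string) bool {
	if minVersion == "" || !versionLess(version, minVersion) {
		return true
	}
	sink.Incr("provider_version_below_minimum",
		[]string{"gate:" + gate, "version:" + version})
	return false
}

func main() {
	sink := &memorySink{counts: map[string]int{}}
	fmt.Println(checkMinVersion(sink, "registration", "0.3.9", "0.4.0"))
	fmt.Println(checkMinVersion(sink, "manifest_sync", "0.4.1", "0.4.0"))
	fmt.Println(sink.counts)
}
```

Sharing one helper across the three gates keeps the tag set identical at each call site, so a dashboard can group below-minimum events by gate or by version.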

Gajesh2007 and others added 11 commits April 28, 2026 08:30
* Harden release registration and binary hash policy

* derive release download URL from allowlist

* Stabilize provider coordinator test

---------

Co-authored-by: Gajesh Naik <26431906+Gajesh2007@users.noreply.github.com>
* e2e: add local simulation environment skeleton

Introduces scripts/e2e-runner.py, a Python orchestrator that spins up the
real coordinator binary with test-friendly configuration (in-memory store,
mock billing, no trust requirements) alongside a simulated or real
provider, and runs HTTP/WebSocket-level assertions against the live stack.

Key components:
- Coordinator class: builds and spawns coordinator with EIGENINFERENCE_MIN_TRUST=none,
  EIGENINFERENCE_BILLING_MOCK=true, and in-memory store
- SimulatedProvider: pure-Python WebSocket client speaking the full provider protocol
  (register, attestation challenge/response, heartbeat, inference request/response)
- Test framework: decorator-based test registration, pass/fail summary, signal-safe
  cleanup via atexit + signal handlers
- Test stubs: test_basic (registration + discovery), test_inference (consumer
  request routing), test_multi_provider (two providers, same model)

TODO:
- RealProvider wrapper around darkbloom serve --coordinator
- Coordination between provider challenge cycle and consumer request timing
- API key handling for consumer vs admin routes
- Python dependency management (websockets, cryptography)

* Revert "e2e: add local simulation environment skeleton"

This reverts commit d02074e. The Python E2E runner adds noise on top of
the existing Go integration tests (internal/api/integration_test.go +
fullstack_integration_test.go) which already cover the full coordinator
protocol surface. The cross-language orchestration doesn't buy anything
over what httptest.Server + simulated providers already provide.

* Remove stale Python integration test

tests/integration_test.py is superseded by the Go-based coordinator
integration tests at coordinator/internal/api/:

- The coordinator protocol (register, challenge, heartbeat, inference) is
  covered by integration_test.go using httptest.Server + Go simulated
  providers: same coverage, no binary build needed
- Full-stack GPU inference is covered by fullstack_integration_test.go
  with real vllm-mlx backends (gated behind LIVE_FULLSTACK_TEST=1)
- The Python test uses stale binary names ('eigeninference-provider'),
  old flags ('--backend mlx-lm'), and predates attestation challenges,
  E2E encryption, and the vllm-mlx backend migration
- No external dependency coverage (Postgres, Stripe, etc.) is lost — the
  coordinator main.go wiring for those is trivially tested elsewhere
- The Python SDK tests (4.5.x) belong in the SDK repo, not the infra repo

---------

Co-authored-by: Hank Bob <hankbob@researchoors.com>
* chore: remove unused dependencies

* test: fix console ui test isolation

* chore: prune repo-wide dead code findings

Cloud Build (deploy/gcp/cloudbuild.yaml) already deploys the coordinator
on the same trigger (push to master touching coordinator/** or deploy/gcp/**).
Having both paths active creates a race condition where two CI systems
simultaneously deploy to the same dev VM (see #115).

Install Datadog Agent on the dev GCE VM (DogStatsD, APM, journald logs)
and wire the coordinator to emit structured metrics, split attestation
counters, model_type tags, reactive provider-count gauges, and a
completion-tokens counter. Rebuild the dev dashboard with 7 sections
covering metrics, logs, traces, and system health.

Disconnect now checks StatusUntrusted before decrementing the online
counter and model-provider gauges, since MarkUntrusted already
decremented them.
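
The double-decrement guard described in the Disconnect commit message can be sketched as below. Only the names `Disconnect`, `MarkUntrusted`, and `StatusUntrusted` come from the commit message; the `Registry` and `Provider` shapes, the `Register` and `release` helpers, and the plain integer counters standing in for gauges are assumptions.

```go
package main

import "fmt"

type Status int

const (
	StatusOnline Status = iota
	StatusUntrusted
)

// Provider is a hypothetical, simplified provider record.
type Provider struct {
	Status Status
	Models []string
}

// Registry holds plain counters standing in for the online and
// model-provider gauges.
type Registry struct {
	online         int
	modelProviders map[string]int
}

func (r *Registry) Register(p *Provider) {
	p.Status = StatusOnline
	r.online++
	for _, m := range p.Models {
		r.modelProviders[m]++
	}
}

// release decrements every counter the provider contributed to.
func (r *Registry) release(p *Provider) {
	r.online--
	for _, m := range p.Models {
		r.modelProviders[m]--
	}
}

// MarkUntrusted removes the provider from the counters once.
func (r *Registry) MarkUntrusted(p *Provider) {
	if p.Status == StatusUntrusted {
		return
	}
	p.Status = StatusUntrusted
	r.release(p)
}

// Disconnect skips the decrement when MarkUntrusted already did it.
func (r *Registry) Disconnect(p *Provider) {
	if p.Status == StatusUntrusted {
		return
	}
	r.release(p)
}

func main() {
	r := &Registry{modelProviders: map[string]int{}}
	p := &Provider{Models: []string{"model-a"}}
	r.Register(p)
	r.MarkUntrusted(p)
	r.Disconnect(p)
	fmt.Println(r.online, r.modelProviders["model-a"]) // prints "0 0"
}
```

Without the status check in `Disconnect`, the same sequence would leave both counters at -1.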
- Accept swift-provider deletions (release.yml, StatusViewModel.swift, release-runbook.md)
- Accept swift-provider's evolved test names/behavior in provider_test.go
- Add metallib_hash/backend fields to registerReleaseRequest and validateReleaseMetadata
- Remove duplicate normalizeSHA256Hex from server.go (already in release_handlers.go)
- Update edge_case_test.go to set R2 CDN URL for artifact verification
- Remove duplicate test functions from merge conflict resolution

New metrics:
- providers.per_version gauge (per provider binary version)
- providers.per_binary_hash gauge (per attested binary hash)
- coordinator.min_provider_version_set gauge (1 when configured)
- provider_version_below_minimum counter (tagged by gate and version)

Gates instrumented:
- registration (provider.go)
- challenge revalidation (provider.go)
- manifest sync (server.go)

Registry additions:
- ProviderCountByVersion()
- ProviderCountByBinaryHash()

Dashboard: Fleet Version & Binary Hash group with providers by version,
providers by binary hash, min provider version, below-minimum events,
and top binary hashes toplist.
@vercel

vercel Bot commented May 12, 2026

Deployment failed with the following error:

You don't have permission to create a Preview Deployment for this Vercel project: d-inference.

View Documentation: https://vercel.com/docs/accounts/team-members-and-roles

@vercel

vercel Bot commented May 12, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| d-inference-console-ui-dev | Ready | Preview | May 12, 2026 6:53pm |

@ethenotethan
Contributor Author

Closing — these changes are now part of #143 instead.

@github-actions

Benchmark Results

Runner: macos-15 (M1 Virtual) | Date: 2026-05-12 19:00 UTC

1-provider-streaming

1 providers, 1 users, 30 requests, concurrency=5, streaming=true

| Model | Providers | RAM |
|---|---|---|
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 1 | 0.5 GB |

| Metric | Value |
|---|---|
| Total Requests | 30 |
| Success | 30 |
| Errors | 0 |
| Total Duration | 12.43s |
| Throughput | 2.4 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
|---|---|---|---|---|---|
| total_e2e | 30 | 966ms | 9ms | 4.218s | 9.025s |
| parse | 30 | 45µs | 29µs | 162µs | 188µs |
| reserve | 30 | 2ms | 1ms | 7ms | 9ms |
| route | 30 | 396ms | 0s | 615ms | 8.998s |
| queue_wait | 7 | 1.699s | 453ms | 8.998s | 8.998s |
| encrypt | 30 | 182µs | 151µs | 338µs | 400µs |
| dispatch | 30 | 44µs | 29µs | 145µs | 187µs |
| coordinator_to_provider | 30 | 565ms | 5ms | 4.206s | 4.208s |

Assertion Report: PASS

| Assertion | Result | Detail |
|---|---|---|
| parse:mean<=1ms | PASS | mean=44.633µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=162µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=2.187633ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=7.355ms (threshold=200ms) |
| encrypt:mean<=5ms | PASS | mean=182.466µs (threshold=5ms) |
| encrypt:p95<=50ms | PASS | p95=338µs (threshold=50ms) |
| dispatch:mean<=5ms | PASS | mean=43.8µs (threshold=5ms) |
| dispatch:p95<=50ms | PASS | p95=145µs (threshold=50ms) |

1-provider-non-streaming

1 providers, 1 users, 20 requests, concurrency=5, streaming=false

| Model | Providers | RAM |
|---|---|---|
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 1 | 0.5 GB |

| Metric | Value |
|---|---|
| Total Requests | 20 |
| Success | 20 |
| Errors | 0 |
| Total Duration | 5.34s |
| Throughput | 3.7 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
|---|---|---|---|---|---|
| total_e2e | 20 | 1.315s | 405ms | 4.16s | 4.16s |
| parse | 20 | 25µs | 14µs | 135µs | 135µs |
| reserve | 20 | 2ms | 1ms | 9ms | 9ms |
| route | 20 | 245ms | 0s | 3.806s | 3.806s |
| queue_wait | 4 | 1.226s | 394ms | 3.806s | 3.806s |
| encrypt | 20 | 167µs | 150µs | 442µs | 442µs |
| dispatch | 20 | 25µs | 19µs | 103µs | 103µs |
| coordinator_to_provider | 20 | 631ms | 3ms | 3.15s | 3.15s |

Assertion Report: PASS

| Assertion | Result | Detail |
|---|---|---|
| parse:mean<=1ms | PASS | mean=24.85µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=135µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=1.9285ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=8.638ms (threshold=200ms) |
| encrypt:mean<=5ms | PASS | mean=167.3µs (threshold=5ms) |
| encrypt:p95<=50ms | PASS | p95=442µs (threshold=50ms) |
| dispatch:mean<=5ms | PASS | mean=25.4µs (threshold=5ms) |
| dispatch:p95<=50ms | PASS | p95=103µs (threshold=50ms) |

7-provider-multi-model

7 providers, 5 users, 50 requests, concurrency=10, streaming=true

| Model | Providers | RAM |
|---|---|---|
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 4 | 0.5 GB |
| mlx-community/gemma-3-270m-4bit | 3 | 0.2 GB |

| Metric | Value |
|---|---|
| Total Requests | 50 |
| Success | 50 |
| Errors | 0 |
| Total Duration | 44.221s |
| Throughput | 1.1 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
|---|---|---|---|---|---|
| total_e2e | 50 | 4.196s | 367ms | 23.136s | 23.167s |
| parse | 50 | 38µs | 20µs | 127µs | 338µs |
| reserve | 50 | 10ms | 2ms | 43ms | 127ms |
| route | 50 | 1.308s | 0s | 10.002s | 20.006s |
| queue_wait | 10 | 1.534s | 2.078s | 2.656s | 2.656s |
| encrypt | 50 | 161µs | 140µs | 247µs | 525µs |
| dispatch | 50 | 40µs | 32µs | 108µs | 179µs |
| coordinator_to_provider | 50 | 2.865s | 7ms | 23.104s | 23.151s |

Assertion Report: PASS

| Assertion | Result | Detail |
|---|---|---|
| parse:mean<=1ms | PASS | mean=37.68µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=127µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=10.09578ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=42.639ms (threshold=200ms) |
| encrypt:mean<=5ms | PASS | mean=161µs (threshold=5ms) |
| encrypt:p95<=50ms | PASS | p95=247µs (threshold=50ms) |
| dispatch:mean<=5ms | PASS | mean=40.04µs (threshold=5ms) |
| dispatch:p95<=50ms | PASS | p95=108µs (threshold=50ms) |

3-provider-high-concurrency

3 providers, 10 users, 60 requests, concurrency=20, streaming=true

| Model | Providers | RAM |
|---|---|---|
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 3 | 0.5 GB |

| Metric | Value |
|---|---|
| Total Requests | 60 |
| Success | 60 |
| Errors | 0 |
| Total Duration | 10.539s |
| Throughput | 5.7 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
|---|---|---|---|---|---|
| total_e2e | 60 | 2.375s | 1.208s | 6.659s | 6.902s |
| parse | 60 | 35µs | 27µs | 64µs | 293µs |
| reserve | 60 | 13ms | 2ms | 59ms | 62ms |
| route | 60 | 1.475s | 1.014s | 6.542s | 6.78s |
| queue_wait | 43 | 2.058s | 1.195s | 6.542s | 6.781s |
| encrypt | 60 | 150µs | 138µs | 219µs | 297µs |
| dispatch | 60 | 25µs | 22µs | 44µs | 130µs |
| coordinator_to_provider | 60 | 875ms | 5ms | 4.513s | 4.559s |

Assertion Report: PASS

| Assertion | Result | Detail |
|---|---|---|
| parse:mean<=1ms | PASS | mean=34.566µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=64µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=13.472766ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=58.69ms (threshold=200ms) |
| encrypt:mean<=5ms | PASS | mean=150.2µs (threshold=5ms) |
| encrypt:p95<=50ms | PASS | p95=219µs (threshold=50ms) |
| dispatch:mean<=5ms | PASS | mean=24.683µs (threshold=5ms) |
| dispatch:p95<=50ms | PASS | p95=44µs (threshold=50ms) |

1-provider-queue-saturation

1 providers, 10 users, 40 requests, concurrency=15, streaming=true

| Model | Providers | RAM |
|---|---|---|
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 1 | 0.5 GB |

| Metric | Value |
|---|---|
| Total Requests | 40 |
| Success | 40 |
| Errors | 0 |
| Total Duration | 8.877s |
| Throughput | 4.5 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
|---|---|---|---|---|---|
| total_e2e | 40 | 2.436s | 1.696s | 5.323s | 5.324s |
| parse | 40 | 36µs | 23µs | 140µs | 219µs |
| reserve | 40 | 6ms | 1ms | 22ms | 22ms |
| route | 40 | 2.108s | 1.57s | 5.276s | 5.276s |
| queue_wait | 35 | 2.409s | 1.66s | 5.276s | 5.276s |
| encrypt | 40 | 148µs | 137µs | 215µs | 277µs |
| dispatch | 40 | 18µs | 17µs | 36µs | 41µs |
| coordinator_to_provider | 40 | 313ms | 3ms | 3.105s | 3.105s |

Assertion Report: PASS

| Assertion | Result | Detail |
|---|---|---|
| parse:mean<=1ms | PASS | mean=35.875µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=140µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=6.36715ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=21.772ms (threshold=200ms) |
| encrypt:mean<=5ms | PASS | mean=148.425µs (threshold=5ms) |
| encrypt:p95<=50ms | PASS | p95=215µs (threshold=50ms) |
| dispatch:mean<=5ms | PASS | mean=18.2µs (threshold=5ms) |
| dispatch:p95<=50ms | PASS | p95=36µs (threshold=50ms) |

3-provider-20-users

3 providers, 20 users, 60 requests, concurrency=10, streaming=true

| Model | Providers | RAM |
|---|---|---|
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 3 | 0.5 GB |

| Metric | Value |
|---|---|
| Total Requests | 60 |
| Success | 60 |
| Errors | 0 |
| Total Duration | 9.5s |
| Throughput | 6.3 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
|---|---|---|---|---|---|
| total_e2e | 60 | 726ms | 13ms | 3.624s | 3.624s |
| parse | 60 | 24µs | 18µs | 72µs | 116µs |
| reserve | 60 | 3ms | 1ms | 15ms | 20ms |
| route | 60 | 121ms | 0s | 600ms | 641ms |
| queue_wait | 20 | 364ms | 414ms | 641ms | 641ms |
| encrypt | 60 | 157µs | 137µs | 240µs | 802µs |
| dispatch | 60 | 21µs | 18µs | 40µs | 55µs |
| coordinator_to_provider | 60 | 599ms | 4ms | 3.6s | 3.612s |

Assertion Report: PASS

| Assertion | Result | Detail |
|---|---|---|
| parse:mean<=1ms | PASS | mean=24.466µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=72µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=3.1461ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=15.331ms (threshold=200ms) |
| encrypt:mean<=5ms | PASS | mean=156.65µs (threshold=5ms) |
| encrypt:p95<=50ms | PASS | p95=240µs (threshold=50ms) |
| dispatch:mean<=5ms | PASS | mean=20.933µs (threshold=5ms) |
| dispatch:p95<=50ms | PASS | p95=40µs (threshold=50ms) |

1-provider-scaling

1 providers, 5 users, 30 requests, concurrency=10, streaming=true

| Model | Providers | RAM |
|---|---|---|
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 1 | 0.5 GB |

| Metric | Value |
|---|---|
| Total Requests | 30 |
| Success | 30 |
| Errors | 0 |
| Total Duration | 7.536s |
| Throughput | 4.0 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
|---|---|---|---|---|---|
| total_e2e | 30 | 1.846s | 967ms | 4.748s | 4.749s |
| parse | 30 | 24µs | 22µs | 50µs | 66µs |
| reserve | 30 | 3ms | 1ms | 8ms | 8ms |
| route | 30 | 1.353s | 747ms | 4.728s | 4.729s |
| queue_wait | 25 | 1.624s | 944ms | 4.728s | 4.729s |
| encrypt | 30 | 153µs | 139µs | 237µs | 331µs |
| dispatch | 30 | 19µs | 15µs | 42µs | 43µs |
| coordinator_to_provider | 30 | 488ms | 4ms | 3.634s | 3.634s |

Assertion Report: PASS

| Assertion | Result | Detail |
|---|---|---|
| parse:mean<=1ms | PASS | mean=24µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=50µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=2.520933ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=7.79ms (threshold=200ms) |
| encrypt:mean<=5ms | PASS | mean=152.8µs (threshold=5ms) |
| encrypt:p95<=50ms | PASS | p95=237µs (threshold=50ms) |
| dispatch:mean<=5ms | PASS | mean=19.4µs (threshold=5ms) |
| dispatch:p95<=50ms | PASS | p95=42µs (threshold=50ms) |

3-provider-scaling

3 providers, 5 users, 30 requests, concurrency=10, streaming=true

| Model | Providers | RAM |
|---|---|---|
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 3 | 0.5 GB |

| Metric | Value |
|---|---|
| Total Requests | 30 |
| Success | 30 |
| Errors | 0 |
| Total Duration | 8.315s |
| Throughput | 3.6 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
|---|---|---|---|---|---|
| total_e2e | 30 | 1.548s | 9ms | 4.862s | 4.864s |
| parse | 30 | 38µs | 30µs | 85µs | 139µs |
| reserve | 30 | 9ms | 4ms | 48ms | 55ms |
| route | 30 | 28µs | 22µs | 68µs | 69µs |
| encrypt | 30 | 148µs | 134µs | 259µs | 327µs |
| dispatch | 30 | 44µs | 31µs | 88µs | 225µs |
| coordinator_to_provider | 30 | 1.535s | 5ms | 4.852s | 4.856s |

Assertion Report: PASS

| Assertion | Result | Detail |
|---|---|---|
| parse:mean<=1ms | PASS | mean=38.033µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=85µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=8.719533ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=48.2ms (threshold=200ms) |
| encrypt:mean<=5ms | PASS | mean=147.966µs (threshold=5ms) |
| encrypt:p95<=50ms | PASS | p95=259µs (threshold=50ms) |
| dispatch:mean<=5ms | PASS | mean=43.966µs (threshold=5ms) |
| dispatch:p95<=50ms | PASS | p95=88µs (threshold=50ms) |

5-provider-scaling

5 providers, 5 users, 30 requests, concurrency=10, streaming=true

| Model | Providers | RAM |
|---|---|---|
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 5 | 0.5 GB |

| Metric | Value |
|---|---|
| Total Requests | 30 |
| Success | 30 |
| Errors | 0 |
| Total Duration | 13.081s |
| Throughput | 2.3 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
|---|---|---|---|---|---|
| total_e2e | 30 | 2.687s | 11ms | 8.227s | 8.227s |
| parse | 30 | 38µs | 29µs | 98µs | 107µs |
| reserve | 30 | 13ms | 3ms | 46ms | 120ms |
| route | 30 | 48µs | 32µs | 170µs | 281µs |
| encrypt | 30 | 157µs | 139µs | 261µs | 315µs |
| dispatch | 30 | 46µs | 41µs | 70µs | 246µs |
| coordinator_to_provider | 30 | 2.669s | 5ms | 8.17s | 8.189s |

Assertion Report: PASS

| Assertion | Result | Detail |
|---|---|---|
| parse:mean<=1ms | PASS | mean=37.7µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=98µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=12.910933ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=45.67ms (threshold=200ms) |
| encrypt:mean<=5ms | PASS | mean=157.466µs (threshold=5ms) |
| encrypt:p95<=50ms | PASS | p95=261µs (threshold=50ms) |
| dispatch:mean<=5ms | PASS | mean=46.4µs (threshold=5ms) |
| dispatch:p95<=50ms | PASS | p95=70µs (threshold=50ms) |

3-provider-heavy-100conc-10kb

3 providers, 20 users, 100 requests, concurrency=100, streaming=true

| Model | Providers | RAM |
|---|---|---|
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 3 | 0.5 GB |

| Metric | Value |
|---|---|
| Total Requests | 100 |
| Success | 100 |
| Errors | 0 |
| Total Duration | 13.913s |
| Throughput | 7.2 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
|---|---|---|---|---|---|
| total_e2e | 100 | 9.152s | 9.503s | 13.265s | 13.476s |
| parse | 100 | 0s | 0s | 1ms | 2ms |
| reserve | 100 | 68ms | 68ms | 85ms | 86ms |
| route | 100 | 8.53s | 9.378s | 13.113s | 13.328s |
| queue_wait | 88 | 9.694s | 9.86s | 13.113s | 13.328s |
| encrypt | 100 | 0s | 0s | 0s | 1ms |
| dispatch | 100 | 0s | 0s | 1ms | 2ms |
| coordinator_to_provider | 100 | 499ms | 6ms | 4.146s | 4.186s |

Assertion Report: FAIL

| Assertion | Result | Detail |
|---|---|---|
| parse:mean<=1ms | PASS | mean=261.5µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=1.435ms (threshold=5ms) |
| reserve:mean<=50ms | FAIL | mean=68.174ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=84.614ms (threshold=200ms) |
| encrypt:mean<=5ms | PASS | mean=246.31µs (threshold=5ms) |
| encrypt:p95<=50ms | PASS | p95=341µs (threshold=50ms) |
| dispatch:mean<=5ms | PASS | mean=121.1µs (threshold=5ms) |
| dispatch:p95<=50ms | PASS | p95=792µs (threshold=50ms) |

4 participants