feat: add fleet version and binary hash observability by ethenotethan · Pull Request #156 · Layr-Labs/d-inference

ethenotethan · 2026-05-12T18:52:57Z

Summary

- Add `providers.per_version` and `providers.per_binary_hash` gauges to the Datadog (DD) gauge loop
- Add `coordinator.min_provider_version_set` gauge
- Add `provider_version_below_minimum` counter at all 3 version gate points (registration, challenge revalidation, manifest sync), tagged by gate and version
- Add `ProviderCountByVersion()` and `ProviderCountByBinaryHash()` to the registry
- Add "Fleet Version & Binary Hash" section to the Datadog dashboard

New Metrics

Metric	Type	Tags
`providers.per_version`	gauge	`version`
`providers.per_binary_hash`	gauge	`binary_hash`
`coordinator.min_provider_version_set`	gauge	`min_version`
`provider_version_below_minimum`	counter	`gate`, `version`
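
To make the gauge side of this concrete, here is a minimal sketch of how the two registry helpers and one pass of the gauge loop could fit together. Only the metric names, tag keys, and the method names `ProviderCountByVersion()` / `ProviderCountByBinaryHash()` come from this PR; the `Provider` and `Registry` types, the `GaugeFunc` callback (a stand-in for whatever DogStatsD client the coordinator uses), and the choice to skip `coordinator.min_provider_version_set` when no minimum is configured are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Provider is a hypothetical, trimmed-down view of a connected provider.
type Provider struct {
	ID         string
	Version    string
	BinaryHash string
}

// Registry is a hypothetical stand-in for the coordinator's provider registry.
type Registry struct {
	mu        sync.RWMutex
	providers map[string]Provider
}

func NewRegistry() *Registry {
	return &Registry{providers: make(map[string]Provider)}
}

func (r *Registry) Add(p Provider) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.providers[p.ID] = p
}

// countBy groups providers by an arbitrary key while holding the read lock.
func (r *Registry) countBy(key func(Provider) string) map[string]int {
	r.mu.RLock()
	defer r.mu.RUnlock()
	counts := make(map[string]int)
	for _, p := range r.providers {
		counts[key(p)]++
	}
	return counts
}

// ProviderCountByVersion returns the number of providers per reported version.
func (r *Registry) ProviderCountByVersion() map[string]int {
	return r.countBy(func(p Provider) string { return p.Version })
}

// ProviderCountByBinaryHash returns the number of providers per binary hash.
func (r *Registry) ProviderCountByBinaryHash() map[string]int {
	return r.countBy(func(p Provider) string { return p.BinaryHash })
}

// GaugeFunc abstracts metric emission (assumed shape, not the real client API).
type GaugeFunc func(name string, value float64, tags []string)

// EmitFleetGauges performs one pass of the gauge loop.
func EmitFleetGauges(r *Registry, minVersion string, gauge GaugeFunc) {
	for v, n := range r.ProviderCountByVersion() {
		gauge("providers.per_version", float64(n), []string{"version:" + v})
	}
	for h, n := range r.ProviderCountByBinaryHash() {
		gauge("providers.per_binary_hash", float64(n), []string{"binary_hash:" + h})
	}
	// Assumption: the gauge is only emitted (as 1) when a minimum is configured.
	if minVersion != "" {
		gauge("coordinator.min_provider_version_set", 1, []string{"min_version:" + minVersion})
	}
}

func main() {
	r := NewRegistry()
	r.Add(Provider{ID: "a", Version: "0.4.1", BinaryHash: "h1"})
	r.Add(Provider{ID: "b", Version: "0.4.1", BinaryHash: "h2"})
	r.Add(Provider{ID: "c", Version: "0.3.9", BinaryHash: "h1"})

	var lines []string
	EmitFleetGauges(r, "0.4.0", func(name string, value float64, tags []string) {
		lines = append(lines, fmt.Sprintf("%s %v %v", name, value, tags))
	})
	sort.Strings(lines) // map iteration order is random; sort for stable output
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

In this sketch a version or hash whose count drops to zero is simply no longer emitted; whether the real loop sends an explicit 0 for stale tag values is not stated in this PR.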


* e2e: add local simulation environment skeleton

  Introduces scripts/e2e-runner.py, a Python orchestrator that spins up the real coordinator binary with test-friendly configuration (in-memory store, mock billing, no trust requirements) alongside a simulated or real provider, and runs HTTP/WebSocket-level assertions against the live stack.

  Key components:
  - Coordinator class: builds and spawns coordinator with EIGENINFERENCE_MIN_TRUST=none, EIGENINFERENCE_BILLING_MOCK=true, and in-memory store
  - SimulatedProvider: pure-Python WebSocket client speaking the full provider protocol (register, attestation challenge/response, heartbeat, inference request/response)
  - Test framework: decorator-based test registration, pass/fail summary, signal-safe cleanup via atexit + signal handlers
  - Test stubs: test_basic (registration + discovery), test_inference (consumer request routing), test_multi_provider (two providers, same model)

  TODO:
  - RealProvider wrapper around darkbloom serve --coordinator
  - Coordination between provider challenge cycle and consumer request timing
  - API key handling for consumer vs admin routes
  - Python dependency management (websockets, cryptography)

* Revert "e2e: add local simulation environment skeleton"

  This reverts commit d02074e. The Python E2E runner adds noise on top of the existing Go integration tests (internal/api/integration_test.go + fullstack_integration_test.go), which already cover the full coordinator protocol surface. The cross-language orchestration doesn't buy anything over what httptest.Server + simulated providers already provide.

* Remove stale Python integration test

  tests/integration_test.py is superseded by the Go-based coordinator integration tests at coordinator/internal/api/:
  - The coordinator protocol (register, challenge, heartbeat, inference) is covered by integration_test.go using httptest.Server + Go simulated providers: same coverage, no binary build needed
  - Full-stack GPU inference is covered by fullstack_integration_test.go with real vllm-mlx backends (gated behind LIVE_FULLSTACK_TEST=1)
  - The Python test uses stale binary names ('eigeninference-provider'), old flags ('--backend mlx-lm'), and predates attestation challenges, E2E encryption, and the vllm-mlx backend migration
  - No external dependency coverage (Postgres, Stripe, etc.) is lost; the coordinator main.go wiring for those is trivially tested elsewhere
  - The Python SDK tests (4.5.x) belong in the SDK repo, not the infra repo

---------

Co-authored-by: Hank Bob <hankbob@researchoors.com>


Cloud Build (deploy/gcp/cloudbuild.yaml) already deploys the coordinator on the same trigger (push to master touching coordinator/** or deploy/gcp/**). Having both paths active creates a race condition where two CI systems simultaneously deploy to the same dev VM — see #115.

Install Datadog Agent on the dev GCE VM (DogStatsD, APM, journald logs) and wire the coordinator to emit structured metrics, split attestation counters, model_type tags, reactive provider-count gauges, and a completion-tokens counter. Rebuild the dev dashboard with 7 sections covering metrics, logs, traces, and system health.


- Accept swift-provider deletions (release.yml, StatusViewModel.swift, release-runbook.md)
- Accept swift-provider's evolved test names/behavior in provider_test.go
- Add metallib_hash/backend fields to registerReleaseRequest and validateReleaseMetadata
- Remove duplicate normalizeSHA256Hex from server.go (already in release_handlers.go)
- Update edge_case_test.go to set R2 CDN URL for artifact verification
- Remove duplicate test functions from merge conflict resolution

New metrics:
- providers.per_version gauge (per provider binary version)
- providers.per_binary_hash gauge (per attested binary hash)
- coordinator.min_provider_version_set gauge (1 when configured)
- provider_version_below_minimum counter (tagged by gate and version)

Gates instrumented:
- registration (provider.go)
- challenge revalidation (provider.go)
- manifest sync (server.go)

Registry additions:
- ProviderCountByVersion()
- ProviderCountByBinaryHash()

Dashboard: Fleet Version & Binary Hash group with providers by version, providers by binary hash, min provider version, below-minimum events, and top binary hashes toplist.
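
As a rough illustration of the gate instrumentation listed above, the sketch below shows one way a shared helper could bump `provider_version_below_minimum` from each gate. The counter name and the `gate` / `version` tag keys are from this PR; everything else is assumed: the helper names, the `CounterFunc` callback standing in for the metrics client, the gate tag values, the dotted-numeric version comparison, and the decision to treat an unparsable version as below minimum.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns "1.2.3" (optionally "v"-prefixed) into numeric parts.
func parseVersion(v string) ([]int, error) {
	fields := strings.Split(strings.TrimPrefix(v, "v"), ".")
	parts := make([]int, len(fields))
	for i, f := range fields {
		n, err := strconv.Atoi(f)
		if err != nil {
			return nil, fmt.Errorf("invalid version %q: %w", v, err)
		}
		parts[i] = n
	}
	return parts, nil
}

// belowMinimum reports whether version < minVersion, comparing component by
// component; missing trailing components count as 0.
func belowMinimum(version, minVersion string) (bool, error) {
	a, err := parseVersion(version)
	if err != nil {
		return false, err
	}
	b, err := parseVersion(minVersion)
	if err != nil {
		return false, err
	}
	for i := 0; i < len(a) || i < len(b); i++ {
		var x, y int
		if i < len(a) {
			x = a[i]
		}
		if i < len(b) {
			y = b[i]
		}
		if x != y {
			return x < y, nil
		}
	}
	return false, nil
}

// CounterFunc abstracts counter increments (assumed shape, not the real client API).
type CounterFunc func(name string, tags []string)

// checkVersionGate returns true when the provider may pass the gate. On a
// rejection it increments the counter tagged with the gate and the version.
func checkVersionGate(gate, version, minVersion string, incr CounterFunc) bool {
	if minVersion == "" {
		return true // no minimum configured, nothing to enforce
	}
	below, err := belowMinimum(version, minVersion)
	if err != nil || below { // assumption: unparsable versions are rejected
		incr("provider_version_below_minimum", []string{"gate:" + gate, "version:" + version})
		return false
	}
	return true
}

func main() {
	counts := map[string]int{}
	incr := func(name string, tags []string) {
		counts[name+"|"+strings.Join(tags, ",")]++
	}
	// Hypothetical tag values for the three gates named in the commit message.
	for _, gate := range []string{"registration", "challenge_revalidation", "manifest_sync"} {
		ok := checkVersionGate(gate, "0.3.9", "0.4.0", incr)
		fmt.Printf("gate=%s admitted=%v\n", gate, ok)
	}
	fmt.Println(len(counts), "distinct counter series incremented")
}
```

Tagging by `version` keeps the cardinality bounded by the number of distinct provider versions in the fleet, which is presumably why the counter can be broken down per gate and per version on the dashboard.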

vercel · 2026-05-12T18:53:01Z

Deployment failed with the following error:

You don't have permission to create a Preview Deployment for this Vercel project: d-inference.

View Documentation: https://vercel.com/docs/accounts/team-members-and-roles

vercel · 2026-05-12T18:53:02Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
d-inference-console-ui-dev	Ready	Preview	May 12, 2026 6:53pm

ethenotethan · 2026-05-12T18:55:52Z

Closing — these changes are now part of #143 instead.

github-actions · 2026-05-12T19:04:40Z

Benchmark Results

Runner: macos-15 (M1 Virtual) | Date: 2026-05-12 19:00 UTC

1-provider-streaming

1 providers, 1 users, 30 requests, concurrency=5, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	1	0.5 GB

Metric	Value
Total Requests	30
Success	30
Errors	0
Total Duration	12.43s
Throughput	2.4 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	30	966ms	9ms	4.218s	9.025s
parse	30	45µs	29µs	162µs	188µs
reserve	30	2ms	1ms	7ms	9ms
route	30	396ms	0s	615ms	8.998s
queue_wait	7	1.699s	453ms	8.998s	8.998s
encrypt	30	182µs	151µs	338µs	400µs
dispatch	30	44µs	29µs	145µs	187µs
coordinator_to_provider	30	565ms	5ms	4.206s	4.208s

Assertion Report: PASS

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=44.633µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=162µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=2.187633ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=7.355ms (threshold=200ms)
encrypt:mean<=5ms	PASS	mean=182.466µs (threshold=5ms)
encrypt:p95<=50ms	PASS	p95=338µs (threshold=50ms)
dispatch:mean<=5ms	PASS	mean=43.8µs (threshold=5ms)
dispatch:p95<=50ms	PASS	p95=145µs (threshold=50ms)

1-provider-non-streaming

1 providers, 1 users, 20 requests, concurrency=5, streaming=false

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	1	0.5 GB

Metric	Value
Total Requests	20
Success	20
Errors	0
Total Duration	5.34s
Throughput	3.7 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	20	1.315s	405ms	4.16s	4.16s
parse	20	25µs	14µs	135µs	135µs
reserve	20	2ms	1ms	9ms	9ms
route	20	245ms	0s	3.806s	3.806s
queue_wait	4	1.226s	394ms	3.806s	3.806s
encrypt	20	167µs	150µs	442µs	442µs
dispatch	20	25µs	19µs	103µs	103µs
coordinator_to_provider	20	631ms	3ms	3.15s	3.15s

Assertion Report: PASS

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=24.85µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=135µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=1.9285ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=8.638ms (threshold=200ms)
encrypt:mean<=5ms	PASS	mean=167.3µs (threshold=5ms)
encrypt:p95<=50ms	PASS	p95=442µs (threshold=50ms)
dispatch:mean<=5ms	PASS	mean=25.4µs (threshold=5ms)
dispatch:p95<=50ms	PASS	p95=103µs (threshold=50ms)

7-provider-multi-model

7 providers, 5 users, 50 requests, concurrency=10, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	4	0.5 GB
mlx-community/gemma-3-270m-4bit	3	0.2 GB

Metric	Value
Total Requests	50
Success	50
Errors	0
Total Duration	44.221s
Throughput	1.1 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	50	4.196s	367ms	23.136s	23.167s
parse	50	38µs	20µs	127µs	338µs
reserve	50	10ms	2ms	43ms	127ms
route	50	1.308s	0s	10.002s	20.006s
queue_wait	10	1.534s	2.078s	2.656s	2.656s
encrypt	50	161µs	140µs	247µs	525µs
dispatch	50	40µs	32µs	108µs	179µs
coordinator_to_provider	50	2.865s	7ms	23.104s	23.151s

Assertion Report: PASS

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=37.68µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=127µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=10.09578ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=42.639ms (threshold=200ms)
encrypt:mean<=5ms	PASS	mean=161µs (threshold=5ms)
encrypt:p95<=50ms	PASS	p95=247µs (threshold=50ms)
dispatch:mean<=5ms	PASS	mean=40.04µs (threshold=5ms)
dispatch:p95<=50ms	PASS	p95=108µs (threshold=50ms)

3-provider-high-concurrency

3 providers, 10 users, 60 requests, concurrency=20, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	3	0.5 GB

Metric	Value
Total Requests	60
Success	60
Errors	0
Total Duration	10.539s
Throughput	5.7 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	60	2.375s	1.208s	6.659s	6.902s
parse	60	35µs	27µs	64µs	293µs
reserve	60	13ms	2ms	59ms	62ms
route	60	1.475s	1.014s	6.542s	6.78s
queue_wait	43	2.058s	1.195s	6.542s	6.781s
encrypt	60	150µs	138µs	219µs	297µs
dispatch	60	25µs	22µs	44µs	130µs
coordinator_to_provider	60	875ms	5ms	4.513s	4.559s

Assertion Report: PASS

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=34.566µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=64µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=13.472766ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=58.69ms (threshold=200ms)
encrypt:mean<=5ms	PASS	mean=150.2µs (threshold=5ms)
encrypt:p95<=50ms	PASS	p95=219µs (threshold=50ms)
dispatch:mean<=5ms	PASS	mean=24.683µs (threshold=5ms)
dispatch:p95<=50ms	PASS	p95=44µs (threshold=50ms)

1-provider-queue-saturation

1 providers, 10 users, 40 requests, concurrency=15, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	1	0.5 GB

Metric	Value
Total Requests	40
Success	40
Errors	0
Total Duration	8.877s
Throughput	4.5 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	40	2.436s	1.696s	5.323s	5.324s
parse	40	36µs	23µs	140µs	219µs
reserve	40	6ms	1ms	22ms	22ms
route	40	2.108s	1.57s	5.276s	5.276s
queue_wait	35	2.409s	1.66s	5.276s	5.276s
encrypt	40	148µs	137µs	215µs	277µs
dispatch	40	18µs	17µs	36µs	41µs
coordinator_to_provider	40	313ms	3ms	3.105s	3.105s

Assertion Report: PASS

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=35.875µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=140µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=6.36715ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=21.772ms (threshold=200ms)
encrypt:mean<=5ms	PASS	mean=148.425µs (threshold=5ms)
encrypt:p95<=50ms	PASS	p95=215µs (threshold=50ms)
dispatch:mean<=5ms	PASS	mean=18.2µs (threshold=5ms)
dispatch:p95<=50ms	PASS	p95=36µs (threshold=50ms)

3-provider-20-users

3 providers, 20 users, 60 requests, concurrency=10, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	3	0.5 GB

Metric	Value
Total Requests	60
Success	60
Errors	0
Total Duration	9.5s
Throughput	6.3 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	60	726ms	13ms	3.624s	3.624s
parse	60	24µs	18µs	72µs	116µs
reserve	60	3ms	1ms	15ms	20ms
route	60	121ms	0s	600ms	641ms
queue_wait	20	364ms	414ms	641ms	641ms
encrypt	60	157µs	137µs	240µs	802µs
dispatch	60	21µs	18µs	40µs	55µs
coordinator_to_provider	60	599ms	4ms	3.6s	3.612s

Assertion Report: PASS

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=24.466µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=72µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=3.1461ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=15.331ms (threshold=200ms)
encrypt:mean<=5ms	PASS	mean=156.65µs (threshold=5ms)
encrypt:p95<=50ms	PASS	p95=240µs (threshold=50ms)
dispatch:mean<=5ms	PASS	mean=20.933µs (threshold=5ms)
dispatch:p95<=50ms	PASS	p95=40µs (threshold=50ms)

1-provider-scaling

1 providers, 5 users, 30 requests, concurrency=10, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	1	0.5 GB

Metric	Value
Total Requests	30
Success	30
Errors	0
Total Duration	7.536s
Throughput	4.0 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	30	1.846s	967ms	4.748s	4.749s
parse	30	24µs	22µs	50µs	66µs
reserve	30	3ms	1ms	8ms	8ms
route	30	1.353s	747ms	4.728s	4.729s
queue_wait	25	1.624s	944ms	4.728s	4.729s
encrypt	30	153µs	139µs	237µs	331µs
dispatch	30	19µs	15µs	42µs	43µs
coordinator_to_provider	30	488ms	4ms	3.634s	3.634s

Assertion Report: PASS

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=24µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=50µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=2.520933ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=7.79ms (threshold=200ms)
encrypt:mean<=5ms	PASS	mean=152.8µs (threshold=5ms)
encrypt:p95<=50ms	PASS	p95=237µs (threshold=50ms)
dispatch:mean<=5ms	PASS	mean=19.4µs (threshold=5ms)
dispatch:p95<=50ms	PASS	p95=42µs (threshold=50ms)

3-provider-scaling

3 providers, 5 users, 30 requests, concurrency=10, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	3	0.5 GB

Metric	Value
Total Requests	30
Success	30
Errors	0
Total Duration	8.315s
Throughput	3.6 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	30	1.548s	9ms	4.862s	4.864s
parse	30	38µs	30µs	85µs	139µs
reserve	30	9ms	4ms	48ms	55ms
route	30	28µs	22µs	68µs	69µs
encrypt	30	148µs	134µs	259µs	327µs
dispatch	30	44µs	31µs	88µs	225µs
coordinator_to_provider	30	1.535s	5ms	4.852s	4.856s

Assertion Report: PASS

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=38.033µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=85µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=8.719533ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=48.2ms (threshold=200ms)
encrypt:mean<=5ms	PASS	mean=147.966µs (threshold=5ms)
encrypt:p95<=50ms	PASS	p95=259µs (threshold=50ms)
dispatch:mean<=5ms	PASS	mean=43.966µs (threshold=5ms)
dispatch:p95<=50ms	PASS	p95=88µs (threshold=50ms)

5-provider-scaling

5 providers, 5 users, 30 requests, concurrency=10, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	5	0.5 GB

Metric	Value
Total Requests	30
Success	30
Errors	0
Total Duration	13.081s
Throughput	2.3 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	30	2.687s	11ms	8.227s	8.227s
parse	30	38µs	29µs	98µs	107µs
reserve	30	13ms	3ms	46ms	120ms
route	30	48µs	32µs	170µs	281µs
encrypt	30	157µs	139µs	261µs	315µs
dispatch	30	46µs	41µs	70µs	246µs
coordinator_to_provider	30	2.669s	5ms	8.17s	8.189s

Assertion Report: PASS

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=37.7µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=98µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=12.910933ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=45.67ms (threshold=200ms)
encrypt:mean<=5ms	PASS	mean=157.466µs (threshold=5ms)
encrypt:p95<=50ms	PASS	p95=261µs (threshold=50ms)
dispatch:mean<=5ms	PASS	mean=46.4µs (threshold=5ms)
dispatch:p95<=50ms	PASS	p95=70µs (threshold=50ms)

3-provider-heavy-100conc-10kb

3 providers, 20 users, 100 requests, concurrency=100, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	3	0.5 GB

Metric	Value
Total Requests	100
Success	100
Errors	0
Total Duration	13.913s
Throughput	7.2 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	100	9.152s	9.503s	13.265s	13.476s
parse	100	0s	0s	1ms	2ms
reserve	100	68ms	68ms	85ms	86ms
route	100	8.53s	9.378s	13.113s	13.328s
queue_wait	88	9.694s	9.86s	13.113s	13.328s
encrypt	100	0s	0s	0s	1ms
dispatch	100	0s	0s	1ms	2ms
coordinator_to_provider	100	499ms	6ms	4.146s	4.186s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=261.5µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=1.435ms (threshold=5ms)
reserve:mean<=50ms	FAIL	mean=68.174ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=84.614ms (threshold=200ms)
encrypt:mean<=5ms	PASS	mean=246.31µs (threshold=5ms)
encrypt:p95<=50ms	PASS	p95=341µs (threshold=50ms)
dispatch:mean<=5ms	PASS	mean=121.1µs (threshold=5ms)
dispatch:p95<=50ms	PASS	p95=792µs (threshold=50ms)

Gajesh2007 and others added 11 commits April 28, 2026 08:30

Fix Darkbloom analytics tracking

f7dab6f

Harden release workflow protections (#103)

e515244

Harden release registration and binary hash policy (#99)

b5dd048

* Harden release registration and binary hash policy
* derive release download URL from allowlist
* Stabilize provider coordinator test

---------

Co-authored-by: Gajesh Naik <26431906+Gajesh2007@users.noreply.github.com>

chore: remove unused dependencies (#112)

7ccc592

* chore: remove unused dependencies
* test: fix console ui test isolation
* chore: prune repo-wide dead code findings

ci: run CI on any PR, not just master/main (#119)

98a3a02

fix: prevent double-decrement when untrusted provider disconnects

d57c8dd

Disconnect now checks StatusUntrusted before decrementing the online counter and model-provider gauges, since MarkUntrusted already decremented them.
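
The fix described in this commit can be pictured with the sketch below. `StatusUntrusted`, `MarkUntrusted`, and `Disconnect` are names from the commit message; the rest (the `Registry` layout, the `online` and `perModel` fields, the `release` helper) is invented for illustration, and locking is omitted to keep the example short.

```go
package main

import "fmt"

type Status int

const (
	StatusOnline Status = iota
	StatusUntrusted
)

// Provider is a hypothetical, minimal provider record.
type Provider struct {
	ID     string
	Model  string
	Status Status
}

// Registry tracks providers plus the counters that back the gauges.
type Registry struct {
	providers map[string]*Provider
	online    int            // stands in for the online counter
	perModel  map[string]int // stands in for the model-provider gauges
}

func NewRegistry() *Registry {
	return &Registry{providers: map[string]*Provider{}, perModel: map[string]int{}}
}

func (r *Registry) Register(id, model string) {
	r.providers[id] = &Provider{ID: id, Model: model, Status: StatusOnline}
	r.online++
	r.perModel[model]++
}

// release removes a provider's contribution from the counters.
func (r *Registry) release(p *Provider) {
	r.online--
	r.perModel[p.Model]--
}

// MarkUntrusted flips the status and decrements the counters exactly once.
func (r *Registry) MarkUntrusted(id string) {
	p, ok := r.providers[id]
	if !ok || p.Status == StatusUntrusted {
		return
	}
	p.Status = StatusUntrusted
	r.release(p)
}

// Disconnect removes the provider. It only decrements the counters when
// MarkUntrusted has not already done so, which is the guard this commit adds.
func (r *Registry) Disconnect(id string) {
	p, ok := r.providers[id]
	if !ok {
		return
	}
	if p.Status != StatusUntrusted {
		r.release(p)
	}
	delete(r.providers, id)
}

func main() {
	r := NewRegistry()
	r.Register("p1", "model-a")
	r.MarkUntrusted("p1")
	r.Disconnect("p1")
	fmt.Println("online:", r.online, "model-a:", r.perModel["model-a"])
}
```

Without the status check in `Disconnect`, the sequence in `main` would drive both counters to -1, which is the double-decrement the commit title refers to.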

ethenotethan closed this May 12, 2026

ethenotethan wants to merge 11 commits into swift-provider from feat/version-observability

4 participants
