feat: add fleet version and binary hash observability #156
ethenotethan wants to merge 11 commits into
* Harden release registration and binary hash policy
* derive release download URL from allowlist
* Stabilize provider coordinator test

---------

Co-authored-by: Gajesh Naik <26431906+Gajesh2007@users.noreply.github.com>
* e2e: add local simulation environment skeleton

Introduces scripts/e2e-runner.py, a Python orchestrator that spins up the real coordinator binary with test-friendly configuration (in-memory store, mock billing, no trust requirements) alongside a simulated or real provider, and runs HTTP/WebSocket-level assertions against the live stack.

Key components:
- Coordinator class: builds and spawns coordinator with EIGENINFERENCE_MIN_TRUST=none, EIGENINFERENCE_BILLING_MOCK=true, and in-memory store
- SimulatedProvider: pure-Python WebSocket client speaking the full provider protocol (register, attestation challenge/response, heartbeat, inference request/response)
- Test framework: decorator-based test registration, pass/fail summary, signal-safe cleanup via atexit + signal handlers
- Test stubs: test_basic (registration + discovery), test_inference (consumer request routing), test_multi_provider (two providers, same model)

TODO:
- RealProvider wrapper around darkbloom serve --coordinator
- Coordination between provider challenge cycle and consumer request timing
- API key handling for consumer vs admin routes
- Python dependency management (websockets, cryptography)

* Revert "e2e: add local simulation environment skeleton"

This reverts commit d02074e. The Python E2E runner adds noise on top of the existing Go integration tests (internal/api/integration_test.go + fullstack_integration_test.go), which already cover the full coordinator protocol surface. The cross-language orchestration doesn't buy anything over what httptest.Server + simulated providers already provide.
* Remove stale Python integration test

@ethenotethan tests/integration_test.py is superseded by the Go-based coordinator integration tests at coordinator/internal/api/:
- Test coverage for coordinator protocol (register, challenge, heartbeat, inference) is covered by integration_test.go using httptest.Server + Go simulated providers: same coverage, no binary build needed
- Full-stack GPU inference is covered by fullstack_integration_test.go with real vllm-mlx backends (gated behind LIVE_FULLSTACK_TEST=1)
- The Python test uses stale binary names ('eigeninference-provider'), old flags ('--backend mlx-lm'), and predates attestation challenges, E2E encryption, and the vllm-mlx backend migration
- No external dependency coverage (Postgres, Stripe, etc.) is lost; the coordinator main.go wiring for those is trivially tested elsewhere
- The Python SDK tests (4.5.x) belong in the SDK repo, not the infra repo

---------

Co-authored-by: Hank Bob <hankbob@researchoors.com>
* chore: remove unused dependencies
* test: fix console ui test isolation
* chore: prune repo-wide dead code findings
Cloud Build (deploy/gcp/cloudbuild.yaml) already deploys the coordinator on the same trigger (push to master touching coordinator/** or deploy/gcp/**). Having both paths active creates a race condition where two CI systems simultaneously deploy to the same dev VM — see #115.
Install Datadog Agent on the dev GCE VM (DogStatsD, APM, journald logs) and wire the coordinator to emit structured metrics, split attestation counters, model_type tags, reactive provider-count gauges, and a completion-tokens counter. Rebuild the dev dashboard with 7 sections covering metrics, logs, traces, and system health.
Disconnect now checks StatusUntrusted before decrementing the online counter and model-provider gauges, since MarkUntrusted already decremented them.
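The double-decrement guard described above can be sketched as follows. This is a minimal illustration, not the coordinator's actual code: only `Disconnect`, `MarkUntrusted`, and `StatusUntrusted` are named in the source, while the `Registry`, `Provider`, and `Gauges` types, their fields, and `StatusOnline` are hypothetical stand-ins. Locking is omitted for brevity.

```go
package main

import "fmt"

// Status is a hypothetical provider state; only StatusUntrusted is named in the PR.
type Status int

const (
	StatusOnline Status = iota // assumed name for the healthy state
	StatusUntrusted
)

// Provider is a hypothetical, stripped-down provider record.
type Provider struct {
	Model  string
	Status Status
}

// Gauges stands in for the online counter and the per-model provider gauges.
type Gauges struct {
	Online   int
	PerModel map[string]int
}

// Registry is a hypothetical in-memory registry (no locking in this sketch).
type Registry struct {
	providers map[string]*Provider
	gauges    Gauges
}

func NewRegistry() *Registry {
	return &Registry{
		providers: map[string]*Provider{},
		gauges:    Gauges{PerModel: map[string]int{}},
	}
}

// Register adds a provider and counts it as online.
func (r *Registry) Register(id, model string) {
	r.providers[id] = &Provider{Model: model, Status: StatusOnline}
	r.gauges.Online++
	r.gauges.PerModel[model]++
}

// MarkUntrusted flips the status and decrements the gauges exactly once.
func (r *Registry) MarkUntrusted(id string) {
	p, ok := r.providers[id]
	if !ok || p.Status == StatusUntrusted {
		return
	}
	p.Status = StatusUntrusted
	r.gauges.Online--
	r.gauges.PerModel[p.Model]--
}

// Disconnect removes the provider. It skips the decrement when the provider
// is already untrusted, because MarkUntrusted has done it.
func (r *Registry) Disconnect(id string) {
	p, ok := r.providers[id]
	if !ok {
		return
	}
	if p.Status != StatusUntrusted {
		r.gauges.Online--
		r.gauges.PerModel[p.Model]--
	}
	delete(r.providers, id)
}

func main() {
	r := NewRegistry()
	r.Register("a", "m1")
	r.Register("b", "m1")
	r.MarkUntrusted("a")
	r.Disconnect("a") // must not decrement a second time
	fmt.Println(r.gauges.Online, r.gauges.PerModel["m1"]) // 1 1
}
```

Without the status check in `Disconnect`, the example would end at `0 0` even though provider `b` is still online.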
- Accept swift-provider deletions (release.yml, StatusViewModel.swift, release-runbook.md)
- Accept swift-provider's evolved test names/behavior in provider_test.go
- Add metallib_hash/backend fields to registerReleaseRequest and validateReleaseMetadata
- Remove duplicate normalizeSHA256Hex from server.go (already in release_handlers.go)
- Update edge_case_test.go to set R2 CDN URL for artifact verification
- Remove duplicate test functions from merge conflict resolution
New metrics:
- providers.per_version gauge (per provider binary version)
- providers.per_binary_hash gauge (per attested binary hash)
- coordinator.min_provider_version_set gauge (1 when configured)
- provider_version_below_minimum counter (tagged by gate and version)

Gates instrumented:
- registration (provider.go)
- challenge revalidation (provider.go)
- manifest sync (server.go)

Registry additions:
- ProviderCountByVersion()
- ProviderCountByBinaryHash()

Dashboard: Fleet Version & Binary Hash group with providers by version, providers by binary hash, min provider version, below-minimum events, and top binary hashes toplist.
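As a rough sketch of what the two registry additions might compute, the example below groups an in-memory provider set by version and by binary hash. Only the method names `ProviderCountByVersion()` and `ProviderCountByBinaryHash()` come from the PR. The `FleetRegistry` and `ProviderInfo` types, the `Add` helper, and the `"unknown"` bucket for empty values are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// ProviderInfo is a hypothetical record of what a provider reported.
type ProviderInfo struct {
	Version    string
	BinaryHash string
}

// FleetRegistry is a hypothetical stand-in for the coordinator registry.
type FleetRegistry struct {
	mu        sync.RWMutex
	providers map[string]ProviderInfo
}

func NewFleetRegistry() *FleetRegistry {
	return &FleetRegistry{providers: map[string]ProviderInfo{}}
}

// Add stores or replaces a provider record.
func (r *FleetRegistry) Add(id string, info ProviderInfo) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.providers[id] = info
}

// bucket maps an empty tag value to "unknown" (an assumption of this sketch).
func bucket(v string) string {
	if v == "" {
		return "unknown"
	}
	return v
}

// ProviderCountByVersion returns the number of providers per reported version.
func (r *FleetRegistry) ProviderCountByVersion() map[string]int {
	r.mu.RLock()
	defer r.mu.RUnlock()
	counts := make(map[string]int)
	for _, p := range r.providers {
		counts[bucket(p.Version)]++
	}
	return counts
}

// ProviderCountByBinaryHash returns the number of providers per binary hash.
func (r *FleetRegistry) ProviderCountByBinaryHash() map[string]int {
	r.mu.RLock()
	defer r.mu.RUnlock()
	counts := make(map[string]int)
	for _, p := range r.providers {
		counts[bucket(p.BinaryHash)]++
	}
	return counts
}

func main() {
	r := NewFleetRegistry()
	r.Add("p1", ProviderInfo{Version: "0.3.1", BinaryHash: "aaa"})
	r.Add("p2", ProviderInfo{Version: "0.3.1", BinaryHash: "aaa"})
	r.Add("p3", ProviderInfo{Version: "0.4.0", BinaryHash: "bbb"})

	// A gauge loop could emit one providers.per_version point per key,
	// tagged with that key, and likewise for providers.per_binary_hash.
	fmt.Println(r.ProviderCountByVersion())    // map[0.3.1:2 0.4.0:1]
	fmt.Println(r.ProviderCountByBinaryHash()) // map[aaa:2 bbb:1]
}
```

Returning a fresh map on each call keeps the gauge loop from holding the registry lock while it emits metrics.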
Closing: these changes are now part of #143 instead.
Benchmark Results
Runner:

[Latency Decomposition table per scenario]

- 1-provider-streaming: 1 provider, 1 user, 30 requests, concurrency=5, streaming=true. Assertion Report: PASS
- 1-provider-non-streaming: 1 provider, 1 user, 20 requests, concurrency=5, streaming=false. Assertion Report: PASS
- 7-provider-multi-model: 7 providers, 5 users, 50 requests, concurrency=10, streaming=true. Assertion Report: PASS
- 3-provider-high-concurrency: 3 providers, 10 users, 60 requests, concurrency=20, streaming=true. Assertion Report: PASS
- 1-provider-queue-saturation: 1 provider, 10 users, 40 requests, concurrency=15, streaming=true. Assertion Report: PASS
- 3-provider-20-users: 3 providers, 20 users, 60 requests, concurrency=10, streaming=true. Assertion Report: PASS
- 1-provider-scaling: 1 provider, 5 users, 30 requests, concurrency=10, streaming=true. Assertion Report: PASS
- 3-provider-scaling: 3 providers, 5 users, 30 requests, concurrency=10, streaming=true. Assertion Report: PASS
- 5-provider-scaling: 5 providers, 5 users, 30 requests, concurrency=10, streaming=true. Assertion Report: PASS
- 3-provider-heavy-100conc-10kb: 3 providers, 20 users, 100 requests, concurrency=100, streaming=true. Assertion Report: FAIL
Summary
- Add providers.per_version and providers.per_binary_hash gauges to DD gauge loop
- Add coordinator.min_provider_version_set gauge
- Add provider_version_below_minimum counter at all 3 version gate points (registration, challenge revalidation, manifest sync), tagged by gate and version
- Add ProviderCountByVersion() and ProviderCountByBinaryHash() to registry

New Metrics
- providers.per_version (gauge), tag: version
- providers.per_binary_hash (gauge), tag: binary_hash
- coordinator.min_provider_version_set (gauge), tag: min_version
- provider_version_below_minimum (counter), tags: gate, version
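To illustrate how a version gate could feed the provider_version_below_minimum counter, here is a minimal sketch. Everything in it is an assumption apart from the metric's tag names (gate, version): the `GateCounter` type, the `belowMinimum` and `parseVersion` helpers, the gate tag strings, the strict `major.minor.patch` version format, and the choice to treat an unparseable provider version as below the minimum are illustrative only, and the PR does not say how versions are compared.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion parses "1.2.3" or "v1.2.3" into three integers.
// The major.minor.patch format is an assumption of this sketch.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("version %q is not major.minor.patch", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("bad component %q in version %q", p, v)
		}
		out[i] = n
	}
	return out, nil
}

// belowMinimum reports whether version is older than min.
// An empty or unparseable min disables the gate; an unparseable provider
// version is treated as below the minimum (both are assumptions).
func belowMinimum(version, min string) bool {
	if min == "" {
		return false
	}
	m, err := parseVersion(min)
	if err != nil {
		return false
	}
	v, err := parseVersion(version)
	if err != nil {
		return true
	}
	for i := range v {
		if v[i] != m[i] {
			return v[i] < m[i]
		}
	}
	return false
}

// counterKey mirrors the two tags on provider_version_below_minimum.
type counterKey struct{ Gate, Version string }

// GateCounter is a hypothetical in-memory stand-in for the metrics client.
type GateCounter struct {
	counts map[counterKey]int
}

func NewGateCounter() *GateCounter {
	return &GateCounter{counts: map[counterKey]int{}}
}

// Check returns true when the provider passes the gate; otherwise it
// increments the counter for this gate and version and returns false.
func (c *GateCounter) Check(gate, version, min string) bool {
	if belowMinimum(version, min) {
		c.counts[counterKey{Gate: gate, Version: version}]++
		return false
	}
	return true
}

func main() {
	c := NewGateCounter()
	const min = "0.4.0"
	fmt.Println(c.Check("registration", "0.3.1", min))  // false
	fmt.Println(c.Check("manifest_sync", "0.3.1", min)) // false
	fmt.Println(c.Check("registration", "0.4.2", min))  // true
	fmt.Println(c.counts[counterKey{Gate: "registration", Version: "0.3.1"}]) // 1
}
```

Tagging by both gate and version lets a dashboard distinguish providers rejected at registration from those that fall below the minimum later, during challenge revalidation or manifest sync.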