feat(infra): OpenTelemetry observability across all services #226
Merged
Changes from 23 commits
39 commits
- b15ef8a build(deps): bump the bsv-workspace group with 32 updates (dependabot[bot])
- f005f07 build(deps): bump the infra-deps group across 8 directories with 7 up… (dependabot[bot])
- e7a333e docs(infra): OpenTelemetry + structured logging design spec (sirdeggen)
- e29441f feat(overlay-server): emit OpenTelemetry traces, metrics, logs (sirdeggen)
- d8b73e0 feat(wallet-infra): emit OpenTelemetry traces, metrics, logs (sirdeggen)
- 22e5d57 feat(message-box-server): emit OpenTelemetry traces, metrics, logs (sirdeggen)
- 645e456 feat(chaintracks-server): emit OpenTelemetry traces, metrics, logs (sirdeggen)
- 5cf724c feat(uhrp-server-cloud-bucket): emit OpenTelemetry traces, metrics, logs (sirdeggen)
- eeca4e8 feat(uhrp-server-basic): emit OpenTelemetry traces, metrics, logs (sirdeggen)
- 6ebe455 feat(wab): emit OpenTelemetry traces, metrics, logs (sirdeggen)
- 4f86a74 chore(infra): unify telemetry bootstrap across components (sirdeggen)
- f8e8e60 docs(infra): add OpenTelemetry observability runbook (sirdeggen)
- 3d62128 refactor(wallet-infra): replace console.* with structured log.* (sirdeggen)
- 893fdab refactor(message-box-server): replace raw console.* with structured l… (sirdeggen)
- ef139fa refactor(chaintracks-server): replace console.* with structured log.* (sirdeggen)
- 79682ec refactor(uhrp-server-cloud-bucket): replace console.* with structured… (sirdeggen)
- d4c981f refactor(uhrp-server-basic): replace console.* with structured log.* (sirdeggen)
- 9c2cc24 refactor(wab): replace console.* with structured log.* (sirdeggen)
- 98268a5 fix(infra): redact PII/credentials from structured logs before egress (sirdeggen)
- 942abd4 fix(infra): boot overlay-server + repair ESM telemetry loader hook (sirdeggen)
- 6908481 feat(infra): unified local stack with Traefik hostname routing (sirdeggen)
- 07d3705 fix(infra): make the full local stack boot without crashing (sirdeggen)
- 9a9d9e6 change(wallet-infra): default network to mainnet, not mock/test (sirdeggen)
- 098f0ce fix(docs): add missing frontmatter to infra-opentelemetry design spec (sirdeggen)
- 0594fc6 fix(docs): add infra to allowed domain values in page schema (sirdeggen)
- 236e3fe Merge remote-tracking branch 'origin/dependabot/npm_and_yarn/bsv-work… (sirdeggen)
- 83a38c2 chore(infra): merge dependabot PRs #222 and #223 into feat/infra-open… (sirdeggen)
- 693adb7 fix(infra): address Copilot review findings on PR #226 (sirdeggen)
- 58fc3dc fix(docs-site): include .md/.mdx in React plugin to resolve jsx-runtime (sirdeggen)
- 1d108d9 fix(docs-site): alias react/jsx-runtime so MDX-compiled docs resolve (sirdeggen)
- 125d43d fix(docs-site): pin react-router-dom to v6 for vite-react-ssg compat (sirdeggen)
- 03e4555 fix(ts-p2p): cast gossipsub service factory after libp2p type drift (sirdeggen)
- 56f1aba fix(tests): make bsv-wallet-helper tests hermetic; drop live storage dep (sirdeggen)
- fc0be52 fix(wallet-toolbox): pin chalk to v4 so jest can parse createAction2 … (sirdeggen)
- 54070e2 Merge origin/main and resolve version conflicts (Copilot)
- d06b40e fix(infra): resolve wab + wallet-infra TypeScript build errors (BraydenLangley)
- a1c50d5 fix(infra): sync wab + chaintracks lockfiles to @bsv/wallet-toolbox 2… (BraydenLangley)
- df8a85c fix(ci): repair pnpm-lock.yaml broken entry for wallet-toolbox-client (sirdeggen)
- 13052cd test(ts-paymail): mock DNS + DoH in dnsResolver tests to remove CI flake (sirdeggen)
95 changes: 95 additions & 0 deletions
docs/superpowers/specs/2026-06-22-infra-opentelemetry-design.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,95 @@ | ||
| --- | ||
| id: infra-opentelemetry-design | ||
| title: Infra OpenTelemetry & Structured Logging — Design | ||
| kind: spec | ||
| domain: infra | ||
| version: 1.0.0 | ||
| last_updated: "2026-06-22" | ||
| last_verified: "2026-06-22" | ||
| status: experimental | ||
| tags: | ||
| - opentelemetry | ||
| - observability | ||
| - infra | ||
| - logging | ||
| --- | ||
|
|
||
| # Infra OpenTelemetry & Structured Logging — Design | ||
|
|
||
| **Date:** 2026-06-22 | ||
| **Goal:** Every infra component in the stack produces OpenTelemetry (traces, metrics, logs) to improve observability — specifically to find bugs faster and diagnose memory leaks / resource issues. | ||
|
|
||
| ## Scope | ||
|
|
||
| Seven standalone infra components (each its own npm project — own `package-lock.json`, **not** in the pnpm workspace): | ||
|
|
||
| | Component | Pkg name | Module | Build dir | Entry | Notes | | ||
| |---|---|---|---|---|---| | ||
| | overlay-server | `@bsv/overlay-express-examples` | CJS | `dist/` | `dist/index.js` | Express owned by `@bsv/overlay-express`; Mongo + MySQL/Knex. **Reference impl.** | | ||
| | wallet-infra | `@bsv/wallet-infra` | **ESM** | `out/` | `out/src/index.js` | Express; nginx front | | ||
| | message-box-server | `@bsv/messagebox-server` | **ESM** | `out/` | `out/src/index.js` | Express + auth/payment middleware; nginx | | ||
| | chaintracks-server | `chaintracks-server` | CJS | `dist/` | `dist/server.js` | Express | | ||
| | uhrp-server-cloud-bucket | `@bsv/uhrp-storage-server` | CJS | `out/` | `out/src/index.js` | Express + Bugsnag; notifier sidecar | | ||
| | uhrp-server-basic | `@bsv/uhrp-lite` | CJS | `out/` | `out/src/index.js` | Express; **no Dockerfile** | | ||
| | wab | `@bsv/wab-server` | CJS | `dist/` | `dist/server.js` | Express + rate-limit | | ||
|
|
||
| Rollout order (fixed by user): **overlay-server → wallet-infra → message-box-server → chaintracks-server → uhrp-server-cloud-bucket → uhrp-server-basic → wab**. | ||
|
|
||
| ## Decisions (locked) | ||
|
|
||
| - **Exporter:** OTLP/HTTP, all config from `OTEL_*` env. No vendor hardcoding (backend is OTLP-compatible, e.g. Coralogix collector). Endpoint unset → console exporters so boot never breaks in dev. | ||
| - **Signals:** Traces + Metrics + Logs. | ||
| - **Load:** Preload before app code. CJS → `node --require ./<out>/telemetry.js`; ESM → `node --import ./<out>/telemetry.mjs`. Guarantees auto-instrumentation patches modules before import. | ||
| - **Duplication:** Each component owns its `src/telemetry.ts` (identical content, compiled by existing `tsc`). No generator. | ||
| - **Structured logging:** Adopt **pino** as the structured logger, replacing ad-hoc `console.log`. `@opentelemetry/instrumentation-pino` auto-injects `trace_id`/`span_id` so logs correlate to spans. A console→OTel log shim stays as a fallback for un-converted call sites. | ||
|
|
||
| ## Architecture | ||
|
|
||
| ### Per-component telemetry bootstrap (`src/telemetry.ts`) | ||
|
|
||
| Starts a `NodeSDK` (`@opentelemetry/sdk-node`) with: | ||
|
|
||
| - **Resource**: `service.name` (= package name, overridable via `OTEL_SERVICE_NAME`), `service.version` (= package version), `deployment.environment` (from `DEPLOY_ENV`/`NODE_ENV`, default `development`). Correct even with zero env set. | ||
| - **Auto-instrumentation**: `getNodeAutoInstrumentations()` — HTTP, Express, Mongo/Mongoose, MySQL2, DNS, net, pino. Filesystem instrumentation disabled (noise). | ||
| - **Runtime metrics**: `@opentelemetry/instrumentation-runtime-node` — heap used/total, GC pause/count, event-loop lag, active handles. **This is the primary memory-leak signal.** | ||
| - **Exporters** (chosen at runtime by presence of `OTEL_EXPORTER_OTLP_ENDPOINT`): | ||
| - set → OTLP/HTTP trace + metric (PeriodicExportingMetricReader) + logs exporters. | ||
| - unset → `ConsoleSpanExporter` / console metric + log exporters. | ||
| - **Logs**: `LoggerProvider` with OTLP (or console) `BatchLogRecordProcessor`; console→OTel shim patches `console.*` to also emit log records at mapped severities. | ||
| - **Graceful shutdown**: `SIGTERM`/`SIGINT` → `sdk.shutdown()` to flush before exit. | ||
|
|
||
| ### Deps added per component (`@opentelemetry/…`) | ||
|
|
||
| `sdk-node`, `auto-instrumentations-node`, `instrumentation-runtime-node`, `exporter-trace-otlp-http`, `exporter-metrics-otlp-http`, `exporter-logs-otlp-http`, `resources`, `semantic-conventions`, `api-logs`, plus `pino`. | ||
|
|
||
| ### Dockerfile / compose changes | ||
|
|
||
| - `CMD` gains the preload flag (`--require`/`--import` per module type). | ||
| - `docker-compose.yml` passes through `OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_EXPORTER_OTLP_HEADERS`, `OTEL_SERVICE_NAME`, `OTEL_RESOURCE_ATTRIBUTES`, `DEPLOY_ENV`. | ||
| - uhrp-server-basic has no Dockerfile → preload added to `start`/`dev` scripts via `NODE_OPTIONS` or `--require`. | ||
|
|
||
| ## Per-component phases | ||
|
|
||
| Each component goes through three phases; depth of B/C scales with the component: | ||
|
|
||
| - **Phase A — Bootstrap:** add deps, `telemetry.ts`, preload wiring, compose env. Signals flow from auto-instrumentation + runtime metrics. Build verifies clean. | ||
| - **Phase B — Structured logging:** audit existing log sites, replace `console.*` with a pino logger emitting leveled, structured events with **stable field names** (`service`, `operation`, `duration_ms`, plus domain fields like `tx_id`, `topic`, `host`). Drop noisy/duplicate logs; promote silent failures to logged events. | ||
| - **Phase C — Domain spans/metrics:** wrap the operations that matter (overlay submit/lookup, wallet storage calls, message send/ack, header sync) in spans with attributes, and add a few custom counters/histograms where a bug or leak would show up. | ||
|
|
||
| overlay-server (reference) gets A+B+C fully, establishing the template; later components reuse its `telemetry.ts` verbatim and apply B/C proportional to their surface. | ||
|
|
||
| ## Field-name conventions (structured logs) | ||
|
|
||
| Stable keys so queries work across services: `service`, `env`, `operation`, `outcome` (`ok`|`error`), `duration_ms`, `error.type`, `error.msg`, plus OTel-injected `trace_id`/`span_id`. Domain keys namespaced per component. | ||
|
|
||
| ## Testing / verification | ||
|
|
||
| - Each component: `npm run build` clean; boot locally with `OTEL_EXPORTER_OTLP_ENDPOINT` unset → console spans/metrics/logs visible; boot with a local OTLP collector → spans/metrics/logs received. | ||
| - No new lint errors. Memory-leak signal confirmed by observing `v8js.memory.heap.used` + GC metrics in console/collector. | ||
| - Per the release flow: patch-bump only the component's own `version` field; do not run sync-versions; the user builds and tests the Docker image locally before any push. | ||
|
|
||
| ## Out of scope | ||
|
|
||
| - Choosing/standing up the collector or backend (env-driven; user supplies endpoint). | ||
| - Distributed-trace context propagation across components beyond what auto-instrumentation provides via HTTP headers (W3C tracecontext is on by default). | ||
| - Dashboards/alerts (separate effort; Coralogix CLI skills available later). |
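
The bootstrap decisions in the spec above (resource defaults, exporter choice by presence of `OTEL_EXPORTER_OTLP_ENDPOINT`, preload flag per module type) can be sketched as plain functions. This is a minimal illustration under stated assumptions, not the PR's `src/telemetry.ts`: it avoids the OpenTelemetry packages so it runs standalone, and `resolveTelemetryConfig`, `preloadFlag`, `PackageInfo`, and `TelemetryConfig` are hypothetical names introduced here.

```typescript
// Hypothetical helper types for illustration; none of these names come from the PR.
type EnvVars = Record<string, string | undefined>
type ModuleType = 'cjs' | 'esm'

interface PackageInfo {
  name: string
  version: string
}

interface TelemetryConfig {
  serviceName: string
  serviceVersion: string
  deploymentEnvironment: string
  exporter: 'otlp-http' | 'console'
  endpoint?: string
}

// Resolve resource attributes and exporter mode so the result is
// sensible even when no environment variables are set.
function resolveTelemetryConfig(env: EnvVars, pkg: PackageInfo): TelemetryConfig {
  const endpoint = env.OTEL_EXPORTER_OTLP_ENDPOINT?.trim()
  return {
    // OTEL_SERVICE_NAME overrides the package name.
    serviceName: env.OTEL_SERVICE_NAME || pkg.name,
    serviceVersion: pkg.version,
    // DEPLOY_ENV wins over NODE_ENV; default is 'development'.
    deploymentEnvironment: env.DEPLOY_ENV || env.NODE_ENV || 'development',
    // Endpoint set: OTLP/HTTP exporters. Unset or blank: console exporters.
    exporter: endpoint ? 'otlp-http' : 'console',
    ...(endpoint ? { endpoint } : {})
  }
}

// CJS entry points are preloaded with --require, ESM ones with --import.
function preloadFlag(moduleType: ModuleType): '--require' | '--import' {
  return moduleType === 'esm' ? '--import' : '--require'
}
```

In a real bootstrap the resolved values would feed the `NodeSDK` constructor; that wiring is omitted here because it depends on the SDK version in use.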
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,58 @@ | ||
| # Local infra stack | ||
|
|
||
| Runs the BSV infra components together behind a Traefik reverse proxy that routes | ||
| by hostname, so you can hit each service at `<name>.localhost` in the browser. | ||
|
|
||
| ```sh | ||
| docker compose -f infra/docker-compose.yaml up --build | ||
| ``` | ||
|
|
||
| | URL | Component | | ||
| |---|---| | ||
| | http://overlay.localhost | overlay-server | | ||
| | http://wallet.localhost | wallet-infra | | ||
| | http://messagebox.localhost | message-box-server | | ||
| | http://chaintracks.localhost | chaintracks-server | | ||
| | http://wab.localhost | wab | | ||
| | http://uhrp.localhost | uhrp-server-basic | | ||
| | http://localhost:8080/dashboard/ | Traefik dashboard | | ||
|
|
||
| (`uhrp-server-cloud-bucket` is intentionally excluded — it needs a real GCP bucket | ||
| + service-account credentials and can't run locally.) | ||
|
|
||
| ## Hostname resolution | ||
|
|
||
| Chromium-based browsers and Firefox resolve `*.localhost` to `127.0.0.1` | ||
| automatically. **Safari and `curl` on macOS do not** — add the hosts once: | ||
|
|
||
| ```sh | ||
| echo "127.0.0.1 overlay.localhost wallet.localhost messagebox.localhost chaintracks.localhost wab.localhost uhrp.localhost traefik.localhost" | sudo tee -a /etc/hosts | ||
| ``` | ||
|
|
||
| Quick check without editing hosts: | ||
|
|
||
| ```sh | ||
| curl -H 'Host: chaintracks.localhost' http://127.0.0.1/ | ||
| ``` | ||
|
|
||
| ## What runs | ||
|
|
||
| - **traefik** — fronts `:80`, routes by `Host` header using the file provider | ||
| (`local/traefik/dynamic.yml`); dashboard on `:8080`. (File provider, not the | ||
| docker provider: the local daemon rejects Traefik's docker API calls with a 400.) | ||
| - **mysql** (shared) — one container, four databases created on first boot | ||
| (`appdb`, `wallet_storage`, `messagebox-backend`, `app`); host port `3307`. | ||
| - **mongo** (shared) — for overlay-server; host port `27018`. | ||
| - the six app components, built from their own directories. | ||
|
|
||
| ## Notes / caveats | ||
|
|
||
| - Keys and passwords in the compose file are **throwaway local-dev values only**. | ||
| - `wallet-infra` runs with `BSV_NETWORK=mock` (no external chain services needed). | ||
| - `overlay-server`, `wab`, and `uhrp-server-basic` reach out to external BSV | ||
| services (wallet storage, ARC) at runtime; some operations need network access | ||
| or real backends to fully succeed. Routing + telemetry still work regardless. | ||
| - Telemetry: set `OTEL_EXPORTER_OTLP_ENDPOINT` (+ `OTEL_EXPORTER_OTLP_HEADERS`) | ||
| in your environment before `up` to ship traces/metrics/logs to your collector; | ||
| unset falls back to console exporters. See `infra/OBSERVABILITY.md`. | ||
| - First `up` builds six images and runs `npm ci` in each — expect a few minutes. | ||
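
To make the hostname routing concrete, the sketch below mirrors the URL table above as a lookup keyed by `Host` header and derives the `/etc/hosts` line from the same map. It only illustrates what the Traefik file-provider rules accomplish; it is not Traefik's implementation, and `ROUTES`, `routeByHost`, and `hostsLine` are hypothetical names.

```typescript
// Hostname -> component, copied from the URL table in the README above.
const ROUTES: Record<string, string> = {
  'overlay.localhost': 'overlay-server',
  'wallet.localhost': 'wallet-infra',
  'messagebox.localhost': 'message-box-server',
  'chaintracks.localhost': 'chaintracks-server',
  'wab.localhost': 'wab',
  'uhrp.localhost': 'uhrp-server-basic'
}

// Pick the target component for a request's Host header.
// Ignores an optional ":port" suffix and letter case.
function routeByHost(hostHeader: string): string | undefined {
  const [host = ''] = hostHeader.trim().toLowerCase().split(':')
  return ROUTES[host]
}

// Build the single hosts-file line needed where *.localhost
// is not resolved automatically.
function hostsLine(extraHosts: string[] = []): string {
  return ['127.0.0.1', ...Object.keys(ROUTES), ...extraHosts].join(' ')
}
```

Calling `hostsLine(['traefik.localhost'])` reproduces the line appended to `/etc/hosts` in the section above.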
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,88 @@ | ||
| # Infra Observability (OpenTelemetry) | ||
|
|
||
| Every infra component emits OpenTelemetry **traces, metrics and logs**. Each | ||
| component has a self-contained bootstrap (`src/telemetry.ts`) that is preloaded | ||
| before application code so auto-instrumentation can patch modules before they | ||
| are imported. | ||
|
|
||
| ## Components | ||
|
|
||
| | Component | Module | Preload | | ||
| |---|---|---| | ||
| | overlay-server | CJS | `node --require ./dist/telemetry.js dist/index.js` | | ||
| | chaintracks-server | CJS | `node --require ./dist/telemetry.js dist/server.js` | | ||
| | wab | CJS | `node --require ./dist/telemetry.js dist/server.js` | | ||
| | uhrp-server-cloud-bucket | CJS | `node --require ./out/src/telemetry.js … out/src/index.js` | | ||
| | uhrp-server-basic | CJS | `ts-node -r ./src/telemetry.ts src/index.ts` / `start:prod` | | ||
| | wallet-infra | ESM | `node --import ./out/src/telemetry.js out/src/index.js` | | ||
| | message-box-server | ESM | `node --import ./out/src/telemetry.js out/src/index.js` | | ||
|
|
||
| ESM components (overlay-server, wallet-infra, message-box-server) deliberately do | ||
| **not** register the `import-in-the-middle` loader hook. That hook rebuilds the | ||
| named exports of CJS packages imported as ESM and drops some of them (e.g. | ||
| `@bsv/sdk`'s `PushDrop`), crashing the app at import time. The libraries we | ||
| actually instrument (http, express, mongodb, mysql2, pino) are loaded through CJS | ||
| dependency chains (overlay-express, wallet-toolbox, authsocket) and remain patched | ||
| by `require-in-the-middle`, so auto-instrumentation coverage is retained. | ||
|
|
||
| ## Configuration | ||
|
|
||
| All wiring is driven by standard `OTEL_*` environment variables. The Dockerfiles | ||
| and `docker-compose.yml` files pass these through. | ||
|
|
||
| | Variable | Purpose | Default | | ||
| |---|---|---| | ||
| | `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP/HTTP collector base URL. **Unset → console exporters** (dev-safe). | — | | ||
| | `OTEL_EXPORTER_OTLP_HEADERS` | Comma-separated headers, e.g. auth for Coralogix. | — | | ||
| | `OTEL_SERVICE_NAME` | Overrides `service.name` (defaults to the package name). | package name | | ||
| | `OTEL_RESOURCE_ATTRIBUTES` | Extra resource attributes. | — | | ||
| | `DEPLOY_ENV` / `NODE_ENV` | Becomes `deployment.environment`. | `development` | | ||
| | `OTEL_METRIC_EXPORT_INTERVAL` | Metric export interval (ms). | `60000` | | ||
| | `OTEL_DIAG` | `true` enables OTel internal diagnostic logging. | off | | ||
| | `LOG_LEVEL` | pino log level. | `info` | | ||
|
|
||
| Point the whole stack at a collector by exporting once, e.g.: | ||
|
|
||
| ```sh | ||
| export OTEL_EXPORTER_OTLP_ENDPOINT="https://ingress.<region>.coralogix.com" | ||
| export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <key>" | ||
| docker compose up | ||
| ``` | ||
|
|
||
| With the endpoint **unset**, each service prints spans/metrics/logs to the | ||
| console — useful for verifying instrumentation locally without a backend. | ||
|
|
||
| ## Signals | ||
|
|
||
| - **Traces** — HTTP, Express, MongoDB, MySQL/Knex, DNS auto-instrumentation, plus | ||
| a `*.bootstrap` span per service wrapping startup. | ||
| - **Metrics** — HTTP server/client metrics, and **runtime metrics** | ||
| (`nodejs.eventloop.*`, `v8js.memory.heap.*`, GC) via | ||
| `@opentelemetry/instrumentation-runtime-node`. These are the primary signal for | ||
| **memory-leak and event-loop diagnosis**. | ||
| - **Logs** — structured JSON via **pino** (`src/logger.ts`), with `trace_id` / | ||
| `span_id` injected by `@opentelemetry/instrumentation-pino` so logs correlate to | ||
| traces, shipped over OTLP. Stray `console.*` calls are also bridged to OTel logs | ||
| during the migration to structured logging. | ||
|
|
||
| ### Structured logging conventions | ||
|
|
||
| Use stable field names so queries work across services: | ||
| `service`, `env`, `operation`, `outcome` (`ok` | `error`), `duration_ms`, `err`, | ||
| plus domain-specific keys. Example: | ||
|
|
||
| ```ts | ||
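| import { log } from './logger' | ||
| // `port` comes from the caller; the env var and default here are hypothetical | ||
| const port = Number(process.env.PORT ?? 3000) | ||
| log.info({ operation: 'listen', outcome: 'ok', port }, 'server listening') | ||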
| ``` | ||
|
|
||
| ## Notes | ||
|
|
||
| - Telemetry shutdown flushes the SDK on `SIGTERM`/`SIGINT` and only force-exits | ||
| when the app has no signal handler of its own (e.g. chaintracks owns its | ||
| lifecycle), so it never preempts application cleanup. | ||
| - Adding telemetry introduced no new dependency CVEs; pre-existing transitive | ||
| advisories (e.g. message-box `firebase-admin → @google-cloud/storage`) are | ||
| unrelated. | ||
|
|
||
| See the design spec: `docs/superpowers/specs/2026-06-22-infra-opentelemetry-design.md`. | ||
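
One way to keep the stable field names from the structured logging conventions consistent is to build every event through a single helper. The following is a minimal sketch under stated assumptions: `buildEvent`, `ServiceContext`, and `LogEvent` are hypothetical names that do not appear in the PR, and the helper is logger-agnostic so it runs without pino installed.

```typescript
type Outcome = 'ok' | 'error'

// Hypothetical per-service constants, set once at startup.
interface ServiceContext {
  service: string
  env: string
}

// Stable keys from the runbook, plus arbitrary domain-specific keys.
interface LogEvent {
  service: string
  env: string
  operation: string
  outcome: Outcome
  duration_ms: number
  err?: Error
  [domainKey: string]: unknown
}

function buildEvent(
  ctx: ServiceContext,
  operation: string,
  durationMs: number,
  domain: Record<string, unknown> = {},
  error?: Error
): LogEvent {
  return {
    // Domain keys are spread first so they cannot overwrite the stable keys.
    ...domain,
    service: ctx.service,
    env: ctx.env,
    operation,
    // Outcome is derived from whether an error was supplied.
    outcome: error ? 'error' : 'ok',
    duration_ms: Math.round(durationMs),
    // `err` is only present on failures.
    ...(error ? { err: error } : {})
  }
}
```

With pino, the returned object would be passed as the first argument, for example `log.info(buildEvent(ctx, 'listen', 12, { port }), 'server listening')`. Pino's default serializers handle an `Error` under the `err` key.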