Author prep for: Warwick Peatey (Kubecost / OpenCost maintainer)
Target consumer: Claude Code, for further development
Date: 2026-04-16
Upstream repo reviewed: https://github.com/opencost/opencost-ai @ main (3 commits, 0 stars, 0 issues, 1 drive-by README PR)
opencost-ai is not a product today. It is a ~160-line Dockerfile plus a ~210-line Flask script that shells out to a TUI (ollmcp) and scrapes its output with regex. The container bakes Ollama, ollmcp, Flask, and a tiny 0.5B model into a single monolithic image to talk to the OpenCost MCP server (built into OpenCost as of v1.118).
The intent — a local, air-gap-friendly, open-source AI assistant for OpenCost — is sound and fills a real gap. The execution as committed is a prototype that should not be built on. Specifically:
- The core integration mechanism is wrong. Driving an interactive TUI with `pexpect` and parsing ANSI-stripped regex groups out of the output is brittle by design. Any version bump to `ollmcp` breaks it silently.
- Security posture is absent. Runs as root, no authentication, no input validation, CORS wide open, binds `0.0.0.0`, subprocess spawn for every request with a 5-minute timeout each.
- No tests, no CI, no Helm chart, no README, no versioning. The repo is a personal experiment pushed to the org namespace.
- Ollama-in-container is the wrong unit of deployment for the stated air-gap use case. It couples model lifecycle to container rebuilds and makes GPU passthrough, model caching, and multi-tenant use harder, not easier.
The good news: there's a clean path forward that preserves the stated goal (local, open, air-gap-capable), reuses the OpenCost MCP server as-is, and swaps the brittle TUI-scraping for a first-class programmatic bridge. The bulk of this document is that design.
Full inventory:
opencost-ai/
├── LICENSE # Apache-2.0
├── docker/
│ └── Dockerfile.ollama # ~160 lines; Ollama + ollmcp + Flask in one image
└── src/
└── ollmcp-api-server.py # ~210 lines; Flask wrapper around ollmcp TUI
No README, no CONTRIBUTING, no tests, no CI, no Helm chart, no compose file, no requirements.txt (pinned in the Dockerfile only), no changelog, no .gitignore.
Four endpoints:
- `GET /health`: returns `{"status":"healthy","default_model":"qwen2.5:0.5b"}`.
- `GET /models`: shells out to `ollama list`, parses stdout line-by-line.
- `GET /tools`: spawns `ollmcp`, sends `quit`, greps the welcome screen for `✓ opencost.*`.
- `POST /query`: the main path. Spawns `ollmcp` via `pexpect`, waits for the `❯` prompt, sends `hil` to disable human-in-the-loop, sends the query, waits on a 5-minute timeout, optionally answers `y` to a confirmation prompt, sends `quit`, then regex-parses the captured output for `📝 Answer (Markdown):` blocks, strips ANSI escapes and box-drawing characters, and returns whatever survives.
Concrete problems, in order of severity:
- TUI scraping as the integration contract. `ollmcp` is an interactive terminal UI whose output format is an implementation detail, not an API. The parser already needs a primary regex, a fallback regex, and a third last-resort line-filter loop with a skip-list of emoji. Any version of `ollmcp` that changes a prompt character, a banner, or an emoji breaks production. This cannot be the interface.
- No authentication. `/query` executes LLM calls against infrastructure data. Anyone on the pod network can call it. No API key, no mTLS, no service-account check.
- Subprocess per request. Every `/query` call spawns a new `ollmcp` process, which reconnects to the MCP server, lists tools, loads the model context, and tears down. Startup cost is multiples of query cost for small queries.
- No request concurrency control. Flask dev server (`app.run`), no WSGI, no rate limit, no queue. A second request while the first is running will fight for the same `ollmcp` session depending on how the kernel schedules the spawn.
- Input goes straight into a shell-adjacent context. `user_query` is typed into a `pexpect` session via `sendline`. It's not obvious what happens when the query contains `\n`, a backtick, or a line starting with a slash-command that `ollmcp` would intercept. Needs adversarial testing before exposure.
- Errors leak internals. `except Exception as e: return jsonify({"error": str(e)})` returns raw exception strings: pexpect tracebacks, stack state, config paths.
- Hardcoded assumptions. `/root/.config/ollmcp/servers.json`, port 8888, Unicode `❯` as expect token, `qwen2.5:0.5b` as default.
- No structured logging, no metrics, no tracing. Not even a per-request ID.
Builds from ollama/ollama:latest, installs Python + ollmcp + flask + pexpect into a venv, pre-pulls qwen2.5:0.5b at build time (configurable via --build-arg), writes a startup script that rewrites servers.json from $MCP_SERVER_URL, starts ollama serve, waits 30s for it, probes the OpenCost MCP server with a JSON-RPC initialize call, starts the Flask API, and does wait -n on both PIDs.
Concrete problems:
- `FROM ollama/ollama:latest`: non-reproducible base, no digest pin, no SBOM.
- Runs as root. No `USER` directive.
- Model baked into image. A 0.5B model is ~400MB. A useful model (7B Q4) is 4–5GB. Baking into the image couples model updates to image rebuilds, breaks caching at the registry and OCI layer level, and makes air-gap deployments ship the model through the image registry rather than a dedicated artifact store. Ollama has a volume-mounted model cache for exactly this reason.
- `qwen2.5:0.5b` is too small to be useful for tool use. Tool-calling quality falls off a cliff below ~3B. This default will produce bad answers out of the box, which is the worst possible first impression.
- `apt-get update` with no cache cleanup audit, `latest` tags, no `--no-install-recommends` on the apt layer.
- Shell heredoc with `COPY <<EOF` for the startup script: harder to test, harder to lint, harder to rebuild incrementally than a file.
- Health probe is a curl against the MCP server inside the startup script that continues "anyway" on failure. The container reports healthy when the MCP path is dead.
- No `HEALTHCHECK` directive, no resource hints, no labels.
To be fair:
- The overall topology is correct: client → local LLM runtime with tool use → OpenCost MCP server → OpenCost HTTP API. That's the right shape.
- Apache-2.0 is the right license for an OpenCost-org project.
- Choosing MCP as the contract to OpenCost is correct — OpenCost ships a built-in MCP server on port 8081 as of v1.118, and building against that is strictly better than reinventing an LLM-specific API.
- Air-gap-first framing is the right differentiator. Kubecost users in regulated industries, defense, and finance cannot use hosted AI, and the current landscape is dominated by hosted offerings.
Your stated goal: "a LOCAL AI running that would support air-gapped installations and be open source."
The gap between that goal and the current code, stated plainly:
| Requirement (stated) | Current state | Gap |
|---|---|---|
| Local inference | ✅ Ollama in-container | Model choice is too small; no GPU story |
| Air-gap capable | Image pulls from `ollama/ollama:latest`; model pulled at build; no offline install flow documented | |
| Open source | ✅ Apache-2.0 | Fine |
| OpenCost integration | `ollmcp` TUI scrape | Wrong integration boundary; see §2.1 |
| Production deployable | ❌ | No Helm chart, no auth, runs as root, no tests, Flask dev server |
| Maintainable | ❌ | Zero docs, zero tests, brittle regex parser |
Before writing new code, three upstream projects matter:
Already exists. Runs on port 8081 in every default Helm install. Exposes three tools: cost allocation, asset cost, cloud cost — with filtering and aggregation. This is our data plane. We do not touch it; we consume it.
Same author as ollmcp. Critical architectural alternative: it's a transparent proxy in front of the Ollama API (FastAPI-based) that pre-loads MCP servers at startup, injects their tools into every /api/chat request, handles multi-round tool execution, and streams responses. It is a drop-in Ollama replacement — existing Ollama clients point at the bridge URL instead of Ollama and get MCP tools for free.
This matters because it replaces the entire "Flask wrapper scraping a TUI" layer with a battle-tested proxy that speaks the native Ollama /api/chat contract. No regex. No pexpect. No TUI. The right answer for the /query path in this project is almost certainly "use ollama-mcp-bridge, don't reinvent it."
ollmcp is a TUI for humans. It is not an API. It was never meant to be scripted. The current repo is using it as an API because that's what was reachable in a weekend prototype. Do not continue down this path.
Scope the v0.1 deliberately narrow so it can ship and get real usage feedback. Everything below is a recommendation; push back on any of it with counter-evidence.
A Kubernetes-native, air-gap-deployable, open-source AI assistant for OpenCost that lets platform teams ask cost questions in natural language without sending cluster data to a third-party LLM provider.
What that phrase excludes, intentionally:
- Not a general chatbot. It answers OpenCost-derived questions.
- Not a cost-recommendation engine (yet). Generating "you should rightsize X" requires evaluation harnesses we don't have. v0.1 exposes existing data through language; it does not prescribe.
- Not a multi-cluster federated system. One OpenCost instance per deployment.
- Not hosted. No SaaS in the open-source project.
- Platform / FinOps engineers running OpenCost in an air-gapped or sovereign cluster (DoD, regulated finance, EU data-residency).
- Kubecost users evaluating whether the OSS AI path is credible before asking for a managed equivalent.
- OpenCost contributors who want an AI dev-experience against their local cluster without signing up for anything.
- Cost forecasting or anomaly detection models.
- Fine-tuned / domain-specific models. Shipping a generic tool-use-capable model is good enough.
- Web UI. CLI + OpenAPI-compatible HTTP endpoint only; UI comes after the API is stable.
- Authenticated multi-tenancy. SPIFFE-style cluster identity is enough; user-level auth is v0.2.
┌───────────────────────────────────────────────────────────────────┐
│ Kubernetes cluster (air-gapped) │
│ │
│ ┌────────────────┐ kubectl/curl ┌──────────────────────┐ │
│ │ Platform user │ ───────────────▶ │ opencost-ai-gateway │ │
│ └────────────────┘ │ (Go, thin HTTP API) │ │
│ └────────┬─────────────┘ │
│ │ /api/chat │
│ ▼ │
│ ┌──────────────────────────────┐ │
│ │ ollama-mcp-bridge │ │
│ │ (FastAPI, upstream OSS) │ │
│ └──┬───────────────────┬───────┘ │
│ │ │ │
│ MCP tools │ │ inference│
│ ▼ ▼ │
│ ┌──────────────────┐ ┌────────────────┐ │
│ │ OpenCost exporter│ │ Ollama │ │
│ │ + MCP svr :8081 │ │ (GPU optional) │ │
│ └──────────────────┘ └────────────────┘ │
└───────────────────────────────────────────────────────────────────┘
Four containers, three of them upstream and untouched:
- `opencost-ai-gateway`: the only thing we own and ship. Go, thin, auth + audit + quota + prompt-shaping. Documented below.
- `ollama-mcp-bridge`: upstream, packaged in our Helm chart, configured to point at OpenCost's MCP endpoint.
- `ollama`: upstream, with a PVC for the model cache so models survive pod restarts and aren't baked into images.
- OpenCost: upstream, already shipping the MCP server.
Because the bridge speaks the Ollama /api/chat contract, which is intentionally unauthenticated (it's designed for trusted localhost). For a cluster-exposed API we need:
- Authentication (start with static bearer token, then SPIFFE/SPIRE).
- Per-caller rate limits.
- Audit logging of the query (not the completion — cost data is sensitive).
- Prompt guardrails — a system prompt that scopes the model to OpenCost questions.
- Result post-processing and optional schema enforcement (e.g. return structured JSON for UI consumption).
- A small, stable HTTP surface (`POST /v1/ask`, `GET /v1/health`, `GET /v1/tools`, `GET /v1/models`) decoupled from whatever Ollama's evolving `/api/chat` shape is.
Go, not Python, because:
- Aligns with Kubecost/OpenCost codebase skills.
- Smaller, statically linked container; simpler SBOM; faster cold start.
- Easier to share types with OpenCost if we ever inline the MCP client.
- Ollama has model-format standardization (GGUF), a cache, an Ollama Registry, and offline `ollama create` from GGUF files, all of which matter for air-gap.
- Swappable later. The gateway only sees `ollama-mcp-bridge`; swapping to vLLM or llama.cpp server is a bridge-level concern.
Tool use is the hard requirement; reasoning quality is secondary. The IBM Granite family is the chosen line: every published tag is Apache 2.0 (so no per-weight licensing check is needed), and the instruct models support native tool calling and structured JSON output. Candidates and trade-offs:
| Model | Size (Q4) | Tool use | Notes |
|---|---|---|---|
| `granite4.1:3b` (CI smoke default) | ~2 GB | Fair | Apache 2.0; smallest Granite; CI/plumbing only, below the production tool-use bar |
| `granite4.1:8b` (v0.1 default) | ~5 GB | Good | Apache 2.0; ~7 GB VRAM floor |
| `granite4.1:30b` | ~18 GB | Best | Apache 2.0; hybrid MoE, best reasoning, ~18 GB VRAM floor; documented upgrade path |
Ship granite4.1:8b as the default, exposed via Helm values key
ollama.defaultModel so operators with headroom can substitute
granite4.1:30b without rebuilding. README states the VRAM/RAM floor
for each option and lists the override command.
POST /v1/ask # main endpoint
GET /v1/tools # list MCP tools discovered through the bridge
GET /v1/models # list installed Ollama models
GET /v1/health # liveness + dependency readiness
GET /v1/version # build metadata (git SHA, version, SBOM hash)
GET /metrics # Prometheus metrics
Request:
{
"query": "string, required, max 4KB",
"model": "string, optional; defaults to server config",
"stream": false,
"format": "text|json",
"conversation_id": "optional uuid for multi-turn"
}
Response (non-streaming):
{
"request_id": "uuid",
"model": "granite4.1:8b",
"query": "echoed",
"answer": "markdown",
"tool_calls": [
{"name": "opencost.allocation", "args": {...}, "duration_ms": 142}
],
"usage": {"prompt_tokens": 412, "completion_tokens": 187},
"latency_ms": 1843
}
Response (streaming): SSE, events typed as `thinking`, `tool_call`, `tool_result`, `token`, `done`. Same schema as native Ollama streaming, wrapped.
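The request schema above can be turned into a strict decoder fairly directly. The following is a minimal sketch, assuming the field set and the 4KB query limit listed in the schema; the names `AskRequest`, `decodeAsk`, and `maxQueryBytes` are illustrative and not taken from the shipped code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
)

// AskRequest mirrors the request fields listed in the schema above.
type AskRequest struct {
	Query          string `json:"query"`
	Model          string `json:"model,omitempty"`
	Stream         bool   `json:"stream,omitempty"`
	Format         string `json:"format,omitempty"`
	ConversationID string `json:"conversation_id,omitempty"`
}

// maxQueryBytes encodes the "max 4KB" limit on the query field.
const maxQueryBytes = 4 * 1024

// decodeAsk parses one JSON object and rejects unknown fields, trailing
// data, an empty or oversized query, and unsupported format values.
func decodeAsk(body []byte) (AskRequest, error) {
	var req AskRequest
	dec := json.NewDecoder(bytes.NewReader(body))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&req); err != nil {
		return AskRequest{}, fmt.Errorf("invalid request body: %w", err)
	}
	// A second Decode must hit EOF; anything else means trailing data.
	var extra json.RawMessage
	if err := dec.Decode(&extra); !errors.Is(err, io.EOF) {
		return AskRequest{}, errors.New("unexpected data after JSON object")
	}
	switch {
	case req.Query == "":
		return AskRequest{}, errors.New("query is required")
	case len(req.Query) > maxQueryBytes:
		return AskRequest{}, errors.New("query exceeds 4KB")
	case req.Format != "" && req.Format != "text" && req.Format != "json":
		return AskRequest{}, errors.New("format must be text or json")
	}
	return req, nil
}

func main() {
	req, err := decodeAsk([]byte(`{"query":"Which namespace cost the most last week?"}`))
	fmt.Println(req.Query, err)

	_, err = decodeAsk([]byte(`{"query":"hi","temperature":2}`))
	fmt.Println(err != nil) // the unknown field is rejected
}
```

In a handler, the body would first be capped (for example with `http.MaxBytesReader`) before being passed to a decoder like this one.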
Errors: problem+json (RFC 7807, since obsoleted by RFC 9457); no raw exception strings ever.
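A problem-details error response might look like the sketch below. It assumes the standard `type`, `title`, `status`, and `detail` members plus a hypothetical `request_id` extension member; `writeProblem` and `render` are illustrative names, not the gateway's actual helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Problem carries the standard problem-details members plus one
// extension member (request_id, an assumption) for log correlation.
type Problem struct {
	Type      string `json:"type"`
	Title     string `json:"title"`
	Status    int    `json:"status"`
	Detail    string `json:"detail,omitempty"`
	RequestID string `json:"request_id,omitempty"`
}

// writeProblem sends a problem+json body. detail must be a fixed,
// client-safe sentence chosen by the handler, never err.Error().
func writeProblem(w http.ResponseWriter, status int, detail, requestID string) {
	w.Header().Set("Content-Type", "application/problem+json")
	w.WriteHeader(status)
	_ = json.NewEncoder(w).Encode(Problem{
		Type:      "about:blank",
		Title:     http.StatusText(status),
		Status:    status,
		Detail:    detail,
		RequestID: requestID,
	})
}

// render is a demo helper that captures what a client would receive.
func render(status int, detail, requestID string) (code int, contentType, body string) {
	rec := httptest.NewRecorder()
	writeProblem(rec, status, detail, requestID)
	return rec.Code, rec.Header().Get("Content-Type"), rec.Body.String()
}

func main() {
	code, ct, body := render(http.StatusBadGateway, "The model backend is unavailable.", "req-123")
	fmt.Println(code, ct)
	fmt.Print(body)
}
```

The underlying error would still be logged server-side under the same request ID, so operators keep the detail that clients never see.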
- v0.1: static bearer token read from a Kubernetes Secret. `Authorization: Bearer <token>`. Rotate via Secret update; gateway watches for changes.
- v0.2: SPIFFE/SPIRE workload identity. Documented as a follow-up.
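The v0.1 bearer check can be sketched as a small middleware. This is an illustration under assumptions: `validBearer` and `requireBearer` are hypothetical names, and the `currentToken` callback stands in for whatever the token-file watcher exposes.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// validBearer reports whether an Authorization header value carries the
// expected token. An empty expected token never matches, so a missing
// Secret fails closed.
func validBearer(header, want string) bool {
	got, ok := strings.CutPrefix(header, "Bearer ")
	if !ok || want == "" {
		return false
	}
	// The comparison time does not depend on where the tokens differ.
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

// requireBearer wraps next. currentToken is called on every request so
// that a rotated Secret takes effect without a restart.
func requireBearer(currentToken func() string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !validBearer(r.Header.Get("Authorization"), currentToken()) {
			w.Header().Set("WWW-Authenticate", "Bearer")
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	token := func() string { return "s3cret" } // stand-in for the file watcher
	h := requireBearer(token, http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	for _, hdr := range []string{"Bearer s3cret", "Bearer nope", ""} {
		req := httptest.NewRequest(http.MethodPost, "/v1/ask", nil)
		if hdr != "" {
			req.Header.Set("Authorization", hdr)
		}
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, req)
		fmt.Println(rec.Code)
	}
}
```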
Constrains model behavior to:
- Use MCP tools for cost data; never invent numbers.
- If a tool call fails, say so explicitly; do not hallucinate a fallback answer.
- Return markdown formatted for terminal and web readability.
- Refuse to answer questions unrelated to Kubernetes cost / OpenCost data.
- Runs as non-root UID 65532 with a read-only root filesystem.
- No host network, no privileged, no `hostPath` mounts.
- NetworkPolicy shipped in the Helm chart: egress only to the configured bridge + Ollama + (if needed) OpenCost MCP; no internet.
- PodSecurity `restricted` compliant.
- Images signed with cosign; SBOM published per release.
- Distroless or Chainguard base.
- All inputs length-validated, content-type-checked, and rejected on unexpected fields.
- Structured audit log to stdout with request ID, caller identity, timestamp, model, token counts, tool calls, but not the query text or completion text unless explicitly enabled per-deployment (opt-in, off by default).
- Prometheus metrics: request count by endpoint/status, latency histograms, in-flight requests, tool-call count and duration, per-model token totals, upstream error rate.
- OTLP tracing optional, off by default, configurable endpoint.
- Log format: JSON, one line per event, `slog`-style.
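The opt-in rule for query text in the audit log can be sketched with `log/slog`. The `AuditEvent` type, its field names, and `auditAttrs` are assumptions made for illustration; only the list of logged items comes from the bullets above.

```go
package main

import (
	"context"
	"log/slog"
	"os"
)

// AuditEvent holds the per-request facts named in the audit-log bullet.
type AuditEvent struct {
	RequestID        string
	Caller           string
	Model            string
	PromptTokens     int
	CompletionTokens int
	ToolCalls        []string
	Query            string // logged only when the operator opts in
}

// auditAttrs builds the log attributes. The query text is appended only
// when includeQuery is true, so the default output never contains it.
func auditAttrs(ev AuditEvent, includeQuery bool) []slog.Attr {
	attrs := []slog.Attr{
		slog.String("request_id", ev.RequestID),
		slog.String("caller", ev.Caller),
		slog.String("model", ev.Model),
		slog.Int("prompt_tokens", ev.PromptTokens),
		slog.Int("completion_tokens", ev.CompletionTokens),
		slog.Any("tool_calls", ev.ToolCalls),
	}
	if includeQuery {
		attrs = append(attrs, slog.String("query", ev.Query))
	}
	return attrs
}

func main() {
	// The JSON handler emits one line per event and adds the timestamp.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	ev := AuditEvent{
		RequestID:        "req-123",
		Caller:           "team-a",
		Model:            "granite4.1:8b",
		PromptTokens:     412,
		CompletionTokens: 187,
		ToolCalls:        []string{"opencost.allocation"},
		Query:            "Which namespace cost the most last week?",
	}
	logger.LogAttrs(context.Background(), slog.LevelInfo, "ask", auditAttrs(ev, false)...)
}
```

The `includeQuery` flag would be driven by `OPENCOST_AI_AUDIT_LOG_QUERY`, which defaults to false.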
OPENCOST_AI_BRIDGE_URL default: http://ollama-mcp-bridge:8000
OPENCOST_AI_LISTEN_ADDR default: :8080
OPENCOST_AI_DEFAULT_MODEL default: granite4.1:8b
OPENCOST_AI_REQUEST_TIMEOUT default: 120s
OPENCOST_AI_MAX_REQUEST_BYTES default: 8192
OPENCOST_AI_AUDIT_LOG_QUERY default: false
OPENCOST_AI_AUTH_TOKEN_FILE default: /var/run/secrets/opencost-ai/token
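A loader for these variables could look like the sketch below. The variable names and defaults are the ones listed above; the `Config` struct, the `load` function, and the injected `getenv` parameter are illustrative choices, the last one made so the loader can be tested without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// Config holds the gateway settings listed above.
type Config struct {
	BridgeURL       string
	ListenAddr      string
	DefaultModel    string
	RequestTimeout  time.Duration
	MaxRequestBytes int64
	AuditLogQuery   bool
	AuthTokenFile   string
}

// load reads settings through getenv (os.Getenv in production) and falls
// back to the documented default when a variable is unset or empty.
func load(getenv func(string) string) (Config, error) {
	get := func(key, def string) string {
		if v := getenv(key); v != "" {
			return v
		}
		return def
	}
	cfg := Config{
		BridgeURL:     get("OPENCOST_AI_BRIDGE_URL", "http://ollama-mcp-bridge:8000"),
		ListenAddr:    get("OPENCOST_AI_LISTEN_ADDR", ":8080"),
		DefaultModel:  get("OPENCOST_AI_DEFAULT_MODEL", "granite4.1:8b"),
		AuthTokenFile: get("OPENCOST_AI_AUTH_TOKEN_FILE", "/var/run/secrets/opencost-ai/token"),
	}
	var err error
	if cfg.RequestTimeout, err = time.ParseDuration(get("OPENCOST_AI_REQUEST_TIMEOUT", "120s")); err != nil {
		return Config{}, fmt.Errorf("OPENCOST_AI_REQUEST_TIMEOUT: %w", err)
	}
	if cfg.MaxRequestBytes, err = strconv.ParseInt(get("OPENCOST_AI_MAX_REQUEST_BYTES", "8192"), 10, 64); err != nil {
		return Config{}, fmt.Errorf("OPENCOST_AI_MAX_REQUEST_BYTES: %w", err)
	}
	if cfg.AuditLogQuery, err = strconv.ParseBool(get("OPENCOST_AI_AUDIT_LOG_QUERY", "false")); err != nil {
		return Config{}, fmt.Errorf("OPENCOST_AI_AUDIT_LOG_QUERY: %w", err)
	}
	if cfg.RequestTimeout <= 0 || cfg.MaxRequestBytes <= 0 {
		return Config{}, fmt.Errorf("timeout and max request bytes must be positive")
	}
	return cfg, nil
}

func main() {
	cfg, err := load(os.Getenv)
	fmt.Printf("%+v %v\n", cfg, err)
}
```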
Language: current stable Go (1.26 as of initial commit). go.mod and
the CI/build toolchain track the same line; this is a greenfield repo
with no consumers, so there is no reason to pin below current stable.
opencost-ai/
├── CLAUDE.md # project-level instructions for Claude Code
├── README.md
├── LICENSE # existing Apache-2.0
├── SECURITY.md
├── CONTRIBUTING.md
├── CODEOWNERS
├── .github/
│ └── workflows/
│ ├── ci.yml # build + test + lint + SLSA provenance
│ ├── release.yml # cosign-signed images + SBOM
│ └── codeql.yml
├── cmd/
│ └── gateway/main.go
├── internal/
│ ├── server/ # HTTP handlers, middleware
│ ├── auth/ # bearer-token validator, token file watcher
│ ├── bridge/ # ollama-mcp-bridge client
│ ├── prompt/ # system prompt loader, validator
│ ├── audit/ # structured audit log
│ ├── ratelimit/ # token-bucket per-caller limiter
│ └── config/ # env + file loader, validation
├── pkg/
│ └── apiv1/ # exported request/response types for SDKs
├── deploy/
│ ├── helm/opencost-ai/ # Helm chart: gateway + bridge + ollama
│ └── examples/
│ ├── air-gapped.md
│ └── dev-local/
├── test/
│ ├── integration/ # against kind + helm install
│ └── e2e/ # against real OpenCost
└── docs/
├── architecture.md
├── security.md
├── air-gap-install.md
└── prompts.md
CLAUDE.md at the root is important per your standing preference. It should encode: never commit secrets, use signed commits (your existing opencost-contributor skill already covers this), prefer stdlib over dependencies in internal/, keep the gateway under 2000 LOC.
The current src/ollmcp-api-server.py and docker/Dockerfile.ollama should be archived, not extended. Specifically:
- Move both files into `legacy/prototype-flask/` with a README noting the prototype's purpose and why it was replaced.
- New development starts clean in `cmd/gateway/` and `deploy/helm/`.
- The one-page `/query` contract from the prototype can inform the `/v1/ask` contract, but nothing else in that file is worth carrying over.
This is a judgment call: you could refactor incrementally, but the refactor surface is larger than the rewrite-from-scratch surface.
Sized for Warwick's TAU methodology (1 BE + 2 FE-capable contributors), but FE work is minimal in v0.1 so it's really 2 backend people + a reviewer.
| Week | Work |
|---|---|
| 1 | Scaffold Go gateway; CI/CD; distroless image; cosign signing; integration test harness (kind + OpenCost + bridge + Ollama with granite4.1:3b smoke model). |
| 2 | POST /v1/ask happy path against the bridge. System prompt + guardrails. Problem+json errors. Bearer-token auth + token-file watcher. |
| 3 | Streaming SSE. Rate limit. Audit log. Prometheus metrics. |
| 4 | Helm chart: gateway + bridge + Ollama with PVC. NetworkPolicy. PodSecurity. ServiceMonitor. |
| 5 | Air-gap install flow documented end-to-end: ollama pull on a connected machine → ollama save to GGUF → OCI artifact → internal registry → ollama create in-cluster. Validated on a disconnected kind cluster. |
| 6 | Docs pass; threat model writeup; release v0.1.0 with signed images and SBOM; community announcement. |
Explicit out-of-scope for v0.1: streaming multi-turn conversations with persisted history, per-user auth, fine-tuned models, evaluation harness, web UI.
Resolved by project lead (Warwick Peatey, 2026-04-16). Claude Code treats these as settled and implements against them.
- MCP transport: Streamable HTTP (MCP spec 2025-03-26). Gateway and bridge standardize on this. A one-hour spike in Session 1 confirms the OpenCost MCP server (v1.118+) serves it correctly; if it does not, stop and escalate rather than fall back.
- Bridge `servers.json` transport string: `streamable_http`. OpenCost advertises `type: "http"` at its endpoint; that is the endpoint description, not the bridge client config. Bridge config names it explicitly.
- Model weights in air-gap: OCI registry via ORAS. Reuses existing container-registry auth, mirroring, and signing. Documented end-to-end in `docs/air-gap-install.md` per Session 5.
- Helm chart home: `opencost-ai` repo (this repo). Separate release cadence from OpenCost core. Migration to `opencost-helm-chart` is deferred to v1.0 and out of scope.
- Default model: `granite4.1:8b` with Helm override. Values key `ollama.defaultModel` lets operators substitute `granite4.1:30b` (better reasoning, hybrid MoE, ~18 GB VRAM floor) or the smaller `granite4.1:3b` without rebuilding. README states the VRAM floor (~7 GB for the 8B default, ~18 GB for the 30B upgrade) and lists the override command. `granite4.1:30b` is the documented upgrade path for operators with headroom. No bundled-weights licensing check is needed because every Granite tag is Apache 2.0.
Written after the scaffold landed (2026-04-17). This section is the
authoritative delta between the design above and the code on main at
v0.1.0. Where the two disagree, the code wins and this section
explains why. Where something named in §6–§9 did not make the cut,
this section says so.
cmd/gateway shipped — main.go only, wire-up
internal/server shipped — handlers, middleware, SSE
internal/bridge shipped — ollama-mcp-bridge client
internal/auth shipped — file-backed bearer token
internal/audit shipped — JSON-line audit logger
internal/ratelimit shipped — per-caller token bucket
internal/config shipped — env loader + validate
internal/metrics shipped — Prometheus registry
internal/requestid shipped — per-request correlation
internal/prompt NOT SHIPPED — see §11.3
pkg/apiv1 shipped — wire types, no behavior
deploy/helm/opencost-ai shipped — gateway + bridge + ollama
scripts/air-gap shipped — ORAS export/push/pull, image mirror
test/integration shipped — gateway_test.go
test/airgap shipped — iptables egress-block harness
internal/requestid is a new package relative to §7.8: it was split
out to break a potential cycle between internal/server and
internal/auth (both need the per-request correlation token). No
behavioural change — it is the ctx key and a middleware, nothing
else.
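A context key plus a middleware is small enough to sketch in full. The header name matches the one this document gives for the shipped package (`X-Request-ID`); everything else here, including reusing an inbound ID and the 16-byte random format, is an assumption for illustration rather than a description of the shipped code.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// HeaderName is the correlation header carried on requests and responses.
const HeaderName = "X-Request-ID"

// ctxKey is unexported so no other package can collide with the key.
type ctxKey struct{}

// FromContext returns the request ID, or "" when none was attached.
func FromContext(ctx context.Context) string {
	id, _ := ctx.Value(ctxKey{}).(string)
	return id
}

// Middleware attaches an ID to the request context and echoes it on the
// response. Assumption: an inbound header value is reused as-is.
func Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(HeaderName)
		if id == "" {
			var b [16]byte
			_, _ = rand.Read(b[:]) // error ignored for brevity in this sketch
			id = hex.EncodeToString(b[:])
		}
		w.Header().Set(HeaderName, id)
		ctx := context.WithValue(r.Context(), ctxKey{}, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// idSeenBy is a demo helper: it returns the ID a downstream handler sees.
func idSeenBy(inbound string) string {
	var seen string
	h := Middleware(http.HandlerFunc(func(_ http.ResponseWriter, r *http.Request) {
		seen = FromContext(r.Context())
	}))
	req := httptest.NewRequest(http.MethodGet, "/v1/health", nil)
	if inbound != "" {
		req.Header.Set(HeaderName, inbound)
	}
	h.ServeHTTP(httptest.NewRecorder(), req)
	return seen
}

func main() {
	fmt.Println(idSeenBy("abc-123"))
	fmt.Println(len(idSeenBy("")))
}
```

Because both `internal/server` and `internal/auth` only need `FromContext`, neither has to import the other.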
| Endpoint | Status | Notes |
|---|---|---|
| `POST /v1/ask` | shipped | JSON + SSE streaming per §7.2. `format` field deferred; see §11.4. |
| `GET /v1/tools` | shipped | Returns `{tools:[], discovery_deferred:true}`. See §11.5. |
| `GET /v1/models` | shipped | Proxies Ollama `/api/tags` through the bridge. |
| `GET /v1/health` | shipped | Liveness-only. No upstream probe. See §11.6. |
| `GET /v1/version` | NOT SHIPPED | Build metadata surfaces via `HealthResponse.Version`. |
| `GET /metrics` | shipped | On a separate listener (loopback by default). See §11.7. |
§7.4 described a ConfigMap-loaded system prompt constraining the model
to OpenCost-only questions, refusing off-topic queries, and forbidding
hallucinated numbers. This did not ship in v0.1. The gateway
forwards the user query to the bridge with a single role:"user"
message and no system frame.
Rationale for the cut: jonigl/ollama-mcp-bridge already injects
tool definitions on every /api/chat request, which is the load-
bearing part of the guardrail. A system prompt without the
corresponding evaluation harness (out of scope per §5.3) lands as
unverified LLM-ergonomics prose — worth shipping, but not worth
blocking the release for. The internal/prompt package and its
ConfigMap wiring are tracked for v0.2.
Operators needing guardrails in v0.1 can pin them client-side by
prepending a system message to their query text. docs/prompts.md
documents the intended prompt verbatim and the reasoning behind it
so operators who choose to front-load the guardrail get the same
text the gateway will ship in v0.2.
§7.2 listed a "format":"text|json" field for optional
structured-JSON responses. pkg/apiv1.AskRequest does not expose it:
the gateway ships one markdown-string answer shape. No consumer has
asked for structured JSON yet, and adding the field speculatively
would commit the gateway to a schema we would have to maintain
through v0.x. Reintroduce when the first UI consumer lands.
§7.1 promised a live list of MCP tools discovered through the bridge.
jonigl/ollama-mcp-bridge does not expose an endpoint for that
today. Rather than invent one (which would mean forking the bridge),
the handler returns an empty list with discovery_deferred:true so
clients render "tool discovery not yet supported" instead of
silently assuming misconfiguration. When the bridge grows a listing
endpoint, or when the gateway caches tools observed in streaming
responses, this field goes false and the list populates.
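The deferred-discovery response is easy to get subtly wrong in Go, because a nil slice marshals to `null` rather than `[]`. A minimal sketch, assuming hypothetical `ToolsResponse` and `Tool` types (the per-tool fields are not specified in this document):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Tool is a placeholder shape; the real per-tool fields are not
// specified in this document.
type Tool struct {
	Name        string `json:"name"`
	Description string `json:"description,omitempty"`
}

// ToolsResponse is the body returned by GET /v1/tools.
type ToolsResponse struct {
	Tools             []Tool `json:"tools"`
	DiscoveryDeferred bool   `json:"discovery_deferred"`
}

// deferredTools builds the v0.1 response. The slice is initialised to an
// empty, non-nil value so it encodes as [] instead of null.
func deferredTools() (string, error) {
	b, err := json.Marshal(ToolsResponse{Tools: []Tool{}, DiscoveryDeferred: true})
	return string(b), err
}

func main() {
	out, err := deferredTools()
	fmt.Println(out, err)
}
```

Clients can then branch on `discovery_deferred` without also having to handle a `null` list.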
§7.1 called it "liveness + dependency readiness". The shipped
endpoint is pure liveness: it returns 200 and the build version
while the process is up. Readiness (bridge reachable, Ollama up,
OpenCost MCP answering) belongs on a separate /v1/ready endpoint
so Kubernetes liveness probes — which restart the pod on failure —
cannot cycle the gateway because an upstream blipped. The Helm
chart's livenessProbe wires /v1/health; readinessProbe is
left empty by design (see deploy/helm/opencost-ai/values.yaml).
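The split between the two probes can be sketched as two functions: liveness ignores upstreams entirely, while readiness aggregates upstream checks. `/v1/ready` is not shipped in v0.1, so everything below (the `Check` type, `readiness`, the report shape) is a hypothetical illustration of the intended separation.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Check probes one upstream and returns nil when it is reachable.
type Check func() error

// errDown is a stand-in failure used by the demo below.
var errDown = errors.New("connection refused")

// liveness depends only on the process being up, so an upstream outage
// can never cause Kubernetes to restart the gateway.
func liveness() int { return http.StatusOK }

// readiness runs every check and returns 503 if any of them fails. The
// report names the failing dependency but omits the raw error text.
func readiness(checks map[string]Check) (int, map[string]string) {
	status := http.StatusOK
	report := make(map[string]string, len(checks))
	for name, check := range checks {
		if err := check(); err != nil {
			status = http.StatusServiceUnavailable
			report[name] = "unavailable"
			continue
		}
		report[name] = "ok"
	}
	return status, report
}

func main() {
	checks := map[string]Check{
		"bridge": func() error { return nil },
		"ollama": func() error { return errDown },
	}
	code, report := readiness(checks)
	fmt.Println(liveness(), code, report)
}
```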
§7.6 described /metrics on the main listener. It ships on a
dedicated listener bound to 127.0.0.1:9090 by default
(OPENCOST_AI_METRICS_LISTEN_ADDR). Two reasons:
- `/metrics` is unauthenticated by design (scrapers are typically not provisioned with the gateway's bearer token), so keeping it off the main listener means the bearer-token gate cannot accidentally apply to it, and a middleware misconfiguration on the main listener cannot expose it alongside the API.
- Loopback-default means an operator who forgets to write a NetworkPolicy still does not expose per-caller token counters cluster-wide.
The chart's ServiceMonitor template targets the separate metrics
port (service.metricsPort, default 9090) and the chart ships a
NetworkPolicy that scopes ingress on that port to same-namespace
pods by default; operators pointing a cross-namespace Prometheus at
it override networkPolicy.metricsIngress.allowedFrom.
All §7.7 env vars shipped with the documented defaults. One
addition: OPENCOST_AI_METRICS_LISTEN_ADDR (default
127.0.0.1:9090) per §11.7. The internal/config package exports
constant names for every env var so ops docs and tests share the
same identifiers; see config/config.go for the canonical list.
§7.6 listed OTLP tracing as optional-off-by-default. It did not
ship: there is no OTLP exporter wired into the gateway and no
OTEL_* env var handling. log/slog structured logs with the
per-request ID (requestid.HeaderName == "X-Request-ID") are the
v0.1 correlation surface; tracing lands when there is a cross-pod
span to propagate (realistically once the bridge exposes spans).
Three third-party runtime dependencies, each with an import-site justification comment:
- `github.com/prometheus/client_golang`: metrics exposition. Used only by `internal/metrics`; the rest of the code base depends on the package's narrow wrapper types, not on Prometheus directly.
- `github.com/prometheus/client_model`: transitive, needed to read metric values for tests.
- `golang.org/x/time`: token-bucket primitive under `internal/ratelimit`. Justified inline per `CLAUDE.md`.
No dependencies under cmd/ or pkg/apiv1. Total third-party
runtime surface on the hot request path: one client for Prometheus
text exposition and one token-bucket primitive.
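For readers unfamiliar with the token-bucket primitive mentioned above, the behaviour can be illustrated with a standard-library-only stand-in. This is not the `golang.org/x/time` implementation and not the shipped `internal/ratelimit` code; `Limiter`, `AllowAt`, and `demo` are hypothetical names, and time is passed in explicitly so the example is deterministic.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"time"
)

// bucket tracks the remaining tokens for one caller.
type bucket struct {
	tokens float64
	last   time.Time
}

// Limiter keeps one token bucket per caller identity.
type Limiter struct {
	mu      sync.Mutex
	rate    float64 // tokens added per second
	burst   float64 // bucket capacity
	buckets map[string]*bucket
}

func NewLimiter(ratePerSec float64, burst int) *Limiter {
	return &Limiter{rate: ratePerSec, burst: float64(burst), buckets: map[string]*bucket{}}
}

// AllowAt refills the caller's bucket for the time elapsed since its
// last request, then spends one token if one is available.
func (l *Limiter) AllowAt(caller string, now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	b, ok := l.buckets[caller]
	if !ok {
		b = &bucket{tokens: l.burst, last: now}
		l.buckets[caller] = b
	}
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens = math.Min(l.burst, b.tokens+elapsed*l.rate)
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// demo runs a fixed scenario: 1 token per second with a burst of 2.
func demo() []bool {
	l := NewLimiter(1, 2)
	t0 := time.Unix(0, 0)
	return []bool{
		l.AllowAt("team-a", t0),                  // first token of the burst
		l.AllowAt("team-a", t0),                  // second token of the burst
		l.AllowAt("team-a", t0),                  // bucket is empty
		l.AllowAt("team-b", t0),                  // separate caller, separate bucket
		l.AllowAt("team-a", t0.Add(time.Second)), // one token refilled
	}
}

func main() {
	fmt.Println(demo())
}
```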
Gateway code (cmd/ + internal/ + pkg/) fits inside the 2000-
line CLAUDE.md budget at tag time. The budget is a soft contract —
CI does not enforce it — so future reviewers should sample
git ls-files cmd internal pkg | xargs wc -l on the PR branch to
keep the line honest.
The design above still describes the intended v0.2 surface. Concrete items promoted out of §5.3 / resolved during v0.1:
- `internal/prompt` + ConfigMap-driven system prompt (§11.3).
- `GET /v1/ready` endpoint + upstream-reachability probe (§11.6).
- `AskRequest.format="json"` when the first UI consumer lands (§11.4).
- `/v1/tools` discovery once the bridge exposes tool listing (§11.5).
- OTLP tracing when the bridge emits spans (§11.9).
- SPIFFE/SPIRE auth to replace static bearer token (§7.3).
None of these block v0.1.0. They are the documented reasons someone should expect a v0.2.