An async HTTP proxy that sits between OpenAI-compatible clients (agents, Slack
gateways, chat UIs) and local LLM serving backends (MLX, vLLM, llama.cpp-style
servers). Its primary job is to translate the text-format tool calls found in
raw model output into the structured `tool_calls` shape that OpenAI clients
expect — no more brittle regex hacks on the client side.
Many locally-served models emit tool calls as plain text (`<tool_call>` tags,
`TOOL_CALL:` JSON lines, or XML-style blocks) because their chat templates
weren't trained against the OpenAI `tool_calls` JSON schema. Agents and
gateways built against the OpenAI API don't know how to parse that. This proxy
does the translation transparently, so you can point any OpenAI-compatible
client at a local model and have tool-calling Just Work.
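For a concrete before/after: a model might emit the tag-wrapped text below, and the proxy re-emits it to the client as a standard OpenAI streaming delta (the chunk shown here is simplified, and the `id` is illustrative):

```
<tool_call>
{"name": "search", "arguments": {"query": "weather"}}
</tool_call>
```

becomes

```
data: {"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"id":"call_0","type":"function","function":{"name":"search","arguments":"{\"query\": \"weather\"}"}}]}}]}
```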
```
OpenAI-compatible client
          │
          ▼
  blockops-proxy :8080
          │
          ▼
 local LLM backend :8082
 (MLX / vLLM / llama.cpp / ...)
```
The proxy terminates the client's HTTP/SSE connection, forwards the request
upstream, parses the streamed output, rewrites tool-call text into structured
`tool_calls` chunks, and relays everything back to the client.
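A minimal sketch of that relay loop, using `httpx` for illustration (the helper names here are hypothetical, not the proxy's actual internals):

```python
import httpx

BACKEND_URL = "http://127.0.0.1:8082"  # upstream LLM server

def rewrite_line(line: str) -> str:
    # Placeholder: the real proxy parses each SSE data line, strips tool-call
    # text from the content delta, and emits structured tool_calls deltas.
    return line

async def relay_chat(request_body: dict):
    """Forward a streaming chat request upstream, yield rewritten SSE lines."""
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST", f"{BACKEND_URL}/v1/chat/completions", json=request_body
        ) as upstream:
            async for line in upstream.aiter_lines():
                # The real proxy also injects SSE comment heartbeats
                # (lines starting with ":") while the backend is quiet.
                yield rewrite_line(line)
```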
- Tool-call text-format translation — parses 3 formats and emits OpenAI `tool_calls`
- SSE streaming with heartbeats — keeps client sockets alive during tool-call latency and cold starts
- Concurrency gating — a `MAX_CONCURRENT` cap prevents backend saturation on single-GPU hosts (see the sketch after this list)
- Context truncation — soft/hard token thresholds drop middle-of-history turns before the backend OOMs
- Memory-pressure warnings — injects visible notices when system RAM gets tight (uses `psutil`)
- KV cache purge on session reset — hits backend `DELETE /v1/cache` when the client signals `/new`
- Traffic logging — every request/response to a configurable log file
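A plausible shape for the concurrency gate and the cache purge, assuming an `asyncio.Semaphore` sized by `MAX_CONCURRENT` (names are illustrative, not the proxy's actual code):

```python
import asyncio
import httpx

BACKEND_URL = "http://127.0.0.1:8082"
MAX_CONCURRENT = 3  # default, per the config table below

gate = asyncio.Semaphore(MAX_CONCURRENT)

async def forward_to_backend(request):
    ...  # proxy the request upstream (see the relay sketch above)

async def gated_handler(request):
    # Requests beyond MAX_CONCURRENT wait here instead of piling onto
    # a single-GPU backend; a variant could reject instead of queueing.
    async with gate:
        return await forward_to_backend(request)

async def purge_kv_cache():
    # On a client "/new" signal, ask the backend to drop its KV cache.
    async with httpx.AsyncClient() as client:
        await client.delete(f"{BACKEND_URL}/v1/cache")
```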
1. `<tool_call>` tags with JSON body (Qwen-Agent style)

   ```
   <tool_call>
   {"name": "search", "arguments": {"query": "weather"}}
   </tool_call>
   ```

2. `TOOL_CALL:` line-prefix JSON

   ```
   TOOL_CALL: {"name": "search", "arguments": {"query": "weather"}}
   ```

3. XML-style function/parameter blocks

   ```
   <tool_call><function=search><parameter=query>weather</parameter></function></tool_call>
   ```
All three are detected, extracted, stripped from the visible content, and
re-emitted as OpenAI `tool_calls` deltas on the SSE stream.
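Illustratively, the detection and extraction can be approximated with patterns like these (a sketch, not the proxy's exact parser; production handling would also need error recovery for malformed JSON):

```python
import json
import re

TAG_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
PREFIX_RE = re.compile(r"^TOOL_CALL:\s*(\{.*\})\s*$", re.MULTILINE)
XML_RE = re.compile(
    r"<tool_call><function=([\w.-]+)>(.*?)</function></tool_call>", re.DOTALL
)
PARAM_RE = re.compile(r"<parameter=([\w.-]+)>(.*?)</parameter>", re.DOTALL)

def extract_tool_calls(text: str):
    """Return (calls, visible): parsed {"name", "arguments"} dicts plus the
    content with all tool-call markup stripped out."""
    calls = []
    for match in TAG_RE.finditer(text):      # format 1: <tool_call> + JSON
        calls.append(json.loads(match.group(1)))
    for match in PREFIX_RE.finditer(text):   # format 2: TOOL_CALL: prefix
        calls.append(json.loads(match.group(1)))
    for match in XML_RE.finditer(text):      # format 3: XML-style blocks
        args = dict(PARAM_RE.findall(match.group(2)))
        calls.append({"name": match.group(1), "arguments": args})
    visible = text
    for pattern in (TAG_RE, PREFIX_RE, XML_RE):
        visible = pattern.sub("", visible)
    return calls, visible.strip()
```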
Configuration is via environment variables (values marked "edit in source" are currently constants tuned by editing the script):
| Variable | Default | Description |
|---|---|---|
| `BACKEND_URL` | `http://127.0.0.1:8082` | Upstream LLM server URL |
| `PORT` | `8080` | Port this proxy listens on |
| `MODEL_NAME` | (empty) | Optional display name shown in session banner |
| `BLOCKOPS_LOG_FILE` | `~/.blockops/proxy.log` | Traffic log destination |
| `MAX_CONCURRENT` | `3` | Max concurrent in-flight requests per backend (edit in source) |
| `TOKEN_WARN_THRESHOLD` | `80000` | Soft context warning threshold (edit in source) |
| `TOKEN_HARD_THRESHOLD` | `100000` | Hard truncation trigger (edit in source) |
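A rough sketch of the soft/hard truncation logic, assuming a caller-supplied token counter (the proxy's actual accounting and turn selection may differ):

```python
TOKEN_WARN_THRESHOLD = 80_000    # soft: warn the client, keep everything
TOKEN_HARD_THRESHOLD = 100_000   # hard: start dropping history

def truncate_history(messages, count_tokens):
    """Drop middle-of-history turns until under the hard threshold,
    keeping the system prompt and the most recent turns intact."""
    while count_tokens(messages) > TOKEN_HARD_THRESHOLD and len(messages) > 3:
        # index 0 is typically the system prompt; drop the oldest
        # non-system turn so recent context survives
        del messages[1]
    return messages
```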
```bash
git clone https://github.com/trevorgordon981/blockops-proxy.git
cd blockops-proxy
pip install -r requirements.txt

# point the proxy at your local backend and run it
export BACKEND_URL=http://127.0.0.1:8082
export PORT=8080
python3 blockops-proxy.py
```

Then point any OpenAI-compatible client at `http://127.0.0.1:8080/v1`:
```bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"hi"}],"stream":true}'
```
Part of a self-hosted LLM operations toolkit:

- llm-otel-proxy — natural companion layer: put it in front of this proxy for token/cost/latency telemetry
- context-bench — characterize the backend this proxy fronts before tuning `MAX_CONCURRENT` / `TOKEN_HARD_THRESHOLD`
- alfred-infra — infrastructure monitoring that visualizes this proxy's metrics and alerts on concurrency rejections
- alfred-rag — RAG stack that serves OpenAI-compatible clients through this proxy
MIT — see LICENSE.