blockops-proxy

An async HTTP proxy that sits between OpenAI-compatible clients (agents, Slack gateways, chat UIs) and local LLM serving backends (MLX, vLLM, llama.cpp-style servers). Its primary job is to translate text-format tool calls emitted by raw model output into the structured tool_calls shape that OpenAI clients expect — no more brittle regex hacks on the client side.

Why this exists

Many locally-served models emit tool calls as plain text (<tool_call> tags, TOOL_CALL: JSON lines, or XML-style blocks) because their chat templates weren't trained against the OpenAI tool_calls JSON schema. Agents and gateways built against the OpenAI API don't know how to parse that. This proxy does the translation transparently, so you can point any OpenAI-compatible client at a local model and have tool-calling Just Work.

Architecture

   OpenAI-compatible client
            │
            ▼
   blockops-proxy  :8080
            │
            ▼
   local LLM backend  :8082
   (MLX / vLLM / llama.cpp / ...)

The proxy terminates the client's HTTP/SSE connection, forwards the request upstream, parses the streamed output, rewrites tool-call text into structured tool_calls chunks, and relays everything back to the client.
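
A minimal sketch of that relay loop, assuming an httpx-based upstream client; the function name and the rewrite placeholder are illustrative, not the proxy's actual internals.

# Illustrative only: forward one streaming request upstream and relay SSE lines.
import httpx

BACKEND_URL = "http://127.0.0.1:8082"

async def relay_chat_completion(payload: dict):
    """Forward a streaming chat completion upstream and yield SSE lines back."""
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST", f"{BACKEND_URL}/v1/chat/completions", json=payload
        ) as upstream:
            async for line in upstream.aiter_lines():
                # Each "data: {...}" line is inspected here; tool-call text in
                # the delta would be rewritten before being relayed downstream.
                yield line + "\n"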

Features

  • Tool-call text-format translation — parses 3 formats and emits OpenAI tool_calls
  • SSE streaming with heartbeats — keeps client sockets alive during tool-call latency and cold starts
  • Concurrency gating — MAX_CONCURRENT cap prevents backend saturation on single-GPU hosts (see the sketch after this list)
  • Context truncation — soft/hard token thresholds drop middle-of-history turns before the backend OOMs
  • Memory-pressure warnings — injects visible notices when system RAM gets tight (uses psutil)
  • KV cache purge on session reset — hits backend DELETE /v1/cache when the client signals /new
  • Traffic logging — every request/response to a configurable log file
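
A minimal sketch of the concurrency gating, assuming an asyncio semaphore sized by MAX_CONCURRENT; names here are illustrative and the real handling in the proxy may differ.

# Illustrative only: hold requests beyond MAX_CONCURRENT until a slot frees up.
import asyncio

MAX_CONCURRENT = 3
_gate = asyncio.Semaphore(MAX_CONCURRENT)

async def gated_forward(forward, payload):
    """Run the upstream call only while a concurrency slot is held."""
    async with _gate:
        return await forward(payload)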

Supported tool-call formats

1. <tool_call> tags with JSON body (Qwen-Agent style)

<tool_call>
{"name": "search", "arguments": {"query": "weather"}}
</tool_call>

2. TOOL_CALL: line-prefix JSON

TOOL_CALL: {"name": "search", "arguments": {"query": "weather"}}

3. XML-style function/parameter blocks

<tool_call><function=search><parameter=query>weather</parameter></function></tool_call>

All three are detected, extracted, stripped from the visible content, and re-emitted as OpenAI tool_calls deltas on the SSE stream.
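
A minimal sketch of that translation for formats 1 and 2 (the XML-style format 3 is omitted for brevity); the function name and ID scheme are illustrative, not the proxy's actual code.

# Illustrative only: strip tool-call markup from visible text and build
# OpenAI-shaped tool_calls entries from the embedded JSON.
import json
import re
import uuid

TAG_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
LINE_RE = re.compile(r"^TOOL_CALL:\s*(\{.*\})\s*$", re.MULTILINE)

def extract_tool_calls(text):
    """Return (visible_text, tool_calls) with tool-call markup removed."""
    calls = []
    for match in list(TAG_RE.finditer(text)) + list(LINE_RE.finditer(text)):
        payload = json.loads(match.group(1))
        calls.append({
            "id": f"call_{uuid.uuid4().hex[:8]}",
            "type": "function",
            "function": {
                "name": payload["name"],
                # OpenAI clients expect arguments as a JSON string, not an object
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    visible = LINE_RE.sub("", TAG_RE.sub("", text)).strip()
    return visible, calls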

Configuration

Most settings are environment variables; the concurrency and token-threshold values are constants edited in the source:

Variable               Default                  Description
BACKEND_URL            http://127.0.0.1:8082    Upstream LLM server URL
PORT                   8080                     Port this proxy listens on
MODEL_NAME             (empty)                  Optional display name shown in session banner
BLOCKOPS_LOG_FILE      ~/.blockops/proxy.log    Traffic log destination
MAX_CONCURRENT         3                        Max concurrent in-flight requests per backend (edit in source)
TOKEN_WARN_THRESHOLD   80000                    Soft context warning threshold (edit in source)
TOKEN_HARD_THRESHOLD   100000                   Hard truncation trigger (edit in source)
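
Illustrative only: how the documented environment variables might be read, using the defaults from the table above; the actual handling lives in blockops-proxy.py.

import os

BACKEND_URL = os.environ.get("BACKEND_URL", "http://127.0.0.1:8082")
PORT = int(os.environ.get("PORT", "8080"))
MODEL_NAME = os.environ.get("MODEL_NAME", "")
LOG_FILE = os.environ.get(
    "BLOCKOPS_LOG_FILE", os.path.expanduser("~/.blockops/proxy.log")
)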

Quickstart

git clone https://github.com/trevorgordon981/blockops-proxy.git
cd blockops-proxy
pip install -r requirements.txt

# point the proxy at your local backend and run it
export BACKEND_URL=http://127.0.0.1:8082
export PORT=8080
python3 blockops-proxy.py

Then point any OpenAI-compatible client at http://127.0.0.1:8080/v1:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"hi"}],"stream":true}'

Related projects

Part of a self-hosted LLM operations toolkit:

  • llm-otel-proxy — natural companion layer: put it in front of this proxy for token/cost/latency telemetry
  • context-bench — characterize the backend this proxy fronts before tuning MAX_CONCURRENT / TOKEN_HARD_THRESHOLD
  • alfred-infra — infrastructure monitoring that visualizes this proxy's metrics and alerts on concurrency rejections
  • alfred-rag — RAG stack that routes through this proxy when serving OpenAI-compatible clients

License

MIT — see LICENSE.
