[Idea] IOChain — a request/response filter pipeline for the inference layer #20545
mukesh-hai started this conversation in Ideas
SGLang currently has no extensible way to inspect or act on requests and responses at the inference layer. The existing Starlette/FastAPI middlewares only see raw HTTP — they can't access tokenised inputs, generated completions, or usage stats. There are several open requests for this (#13825, #6621) but no solution yet.
Idea: an IOChain filter pipeline
The design is inspired by the IOChain pattern in network stacks: every request passes through an ordered pipeline of filters on ingress (before inference) and egress (after the response is built). Each filter is a simple class with two async hooks, one for each direction.
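As a rough sketch of what this could look like in Python: the type names (`Request`, `Response`), the hook names (`on_request`, `on_response`), and the chain API below are all hypothetical, not SGLang internals.

```python
import asyncio
from dataclasses import dataclass, field

# Hypothetical containers; field names are illustrative only,
# not SGLang's actual request/response types.
@dataclass
class Request:
    prompt: str
    meta: dict = field(default_factory=dict)

@dataclass
class Response:
    text: str
    meta: dict = field(default_factory=dict)

class Filter:
    """Base filter with one async hook per direction (assumed names)."""
    async def on_request(self, req: Request) -> Request:
        return req

    async def on_response(self, resp: Response) -> Response:
        return resp

class IOChain:
    """Ordered filter pipeline: ingress before inference, egress after."""
    def __init__(self, filters: list[Filter]):
        self.filters = list(filters)

    async def ingress(self, req: Request) -> Request:
        # Filters run in registration order on the way in.
        for f in self.filters:
            req = await f.on_request(req)
        return req

    async def egress(self, resp: Response) -> Response:
        # Unwind in reverse order on the way out, middleware-style.
        for f in reversed(self.filters):
            resp = await f.on_response(resp)
        return resp

# Example filter: redact a banned word before it reaches the model.
class Redactor(Filter):
    async def on_request(self, req: Request) -> Request:
        req.prompt = req.prompt.replace("secret", "[redacted]")
        return req

chain = IOChain([Redactor()])
req = asyncio.run(chain.ingress(Request(prompt="tell me the secret")))
print(req.prompt)
```

Running egress in reverse registration order mirrors how ASGI middleware stacks unwind, so a filter that wraps both directions sees a symmetric enter/exit ordering; whether IOChain should do the same is an open design question.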