Sieve

Transparent feedback compression middleware for LLM coding agents. Sieve sits between an agent and its tools, parsing tool output and emitting a compact form before it enters the conversation context.

83.9% of tokens in coding-agent trajectories are tool observations (JetBrains, NeurIPS 2025). Most of those are re-read on every subsequent turn. Sieve targets that bloat by parsing — not truncating — the output of common dev tools, then diffing against prior turns so the agent only sees what changed.

The full design is in docs/agent-compress-specs.md, and the direction of the project is in ROADMAP.md. Parser contributions are the easiest entry point (good first issues).

What's in the box


| Component | Included |
|---|---|
| Parsers | pytest, Python traceback, mypy, tsc, eslint, gcc/clang, pip, generic fallback |
| Output formats | plain, structured (JSON), XML, minimal |
| Integrations | MCP proxy, SWE-bench Lite paired Cursor / Codex runners |
| Dependencies | none for the library; `mcp` extra for the proxy (Python ≥ 3.11) |

Measured compression

Benchmark over the 22-sample fixture corpus in tests/fixtures/ (uv run python -m benchmarks.run). Ratio is the share of the raw size removed (1 - compressed/raw):

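| Category | Samples | Raw | Compressed | Ratio |
|---|---:|---:|---:|---:|
| pytest | 7 | 13,475 | 1,292 | 90.4% |
| pip | 2 | 9,505 | 138 | 98.5% |
| runtime | 6 | 3,150 | 1,068 | 66.1% |
| gcc | 1 | 1,318 | 721 | 45.3% |
| tsc | 2 | 1,264 | 708 | 44.0% |
| generic | 1 | 480 | 433 | 9.8% |
| eslint | 1 | 844 | 780 | 7.6% |
| mypy | 2 | 758 | 756 | 0.3% |
| total | 22 | 30,794 | 5,896 | 80.9% |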

A 5-turn delta scenario in which the same pytest failure repeats reaches 86.3% cumulative compression: turn 1 is 297 chars, and turns 2-5 collapse to 238 chars each ("PYTEST DELTA: unchanged" plus the still-failing nodeids).
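The figures above can be checked by hand. The sketch below recomputes the total row and the cumulative delta figure with the same formula, 1 - compressed/raw. The per-turn raw size of 1,818 chars is an assumption taken from the illustrative pytest example in "What it looks like"; it is not stated for the delta scenario itself.

```python
def compression_ratio(raw: int, compressed: int) -> float:
    """Share of the raw size removed, as a percentage."""
    return 100.0 * (1 - compressed / raw)


# (raw, compressed) pairs copied from the benchmark table.
table = {
    "pytest": (13_475, 1_292),
    "pip": (9_505, 138),
    "runtime": (3_150, 1_068),
    "gcc": (1_318, 721),
    "tsc": (1_264, 708),
    "generic": (480, 433),
    "eslint": (844, 780),
    "mypy": (758, 756),
}

# The total is computed over summed sizes, not as a mean of the row ratios.
total_raw = sum(raw for raw, _ in table.values())
total_compressed = sum(comp for _, comp in table.values())
total_ratio = compression_ratio(total_raw, total_compressed)

# Delta scenario: turn 1 costs 297 chars, turns 2-5 cost 238 chars each.
# ASSUMPTION: every raw turn is 1,818 chars.
delta_compressed = 297 + 4 * 238
delta_ratio = compression_ratio(5 * 1_818, delta_compressed)

print(f"total: {total_ratio:.1f}%")
print(f"delta: {delta_ratio:.1f}%")
```

Because the total is size-weighted, the large pytest and pip fixtures dominate it, which is why it sits well above most of the per-category ratios.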

The compressor enforces a never-larger-than-raw invariant on its default plain output: on inputs already smaller than the framing overhead (mypy clean output, ESLint with terse messages, etc.) it passes the raw text through unchanged. The low ratios on those categories aren't a bug — there's nothing to compress; the structured items are still extracted and used for cross-turn delta dedup. (The structured and xml formats wrap the result in a JSON/XML envelope, so on tiny inputs the framed output can exceed the raw size — they trade bytes for machine-readability.)
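A minimal sketch of the pass-through rule described above, not Sieve's actual code. The function name `enforce_never_larger`, the framing string, and the JSON field names are all hypothetical and chosen only for illustration.

```python
import json


def enforce_never_larger(raw: str, compressed: str) -> str:
    """Return the compressed text only if it is no longer than the raw text."""
    return compressed if len(compressed) <= len(raw) else raw


raw = "Success: no issues found in 3 source files"

# Hypothetical framing a compressor might put around clean mypy output.
framed = "MYPY: 0 errors in 3 files\n" + raw

# The framed form is longer than the input, so the raw text passes through.
plain = enforce_never_larger(raw, framed)

# A JSON envelope (illustrative field names) adds bytes on tiny inputs,
# which is why an enveloped format can exceed the raw size.
envelope = json.dumps({"tool": "mypy", "items": [], "text": raw})

print(plain == raw)
print(len(envelope) > len(raw))
```

The check is a plain length comparison, so it costs nothing on large inputs and only changes the result when the framing would make things worse.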

End-to-end agent run (SWE-bench Lite, Cursor Composer-2)

Paired baseline-vs-sieve trial with Cursor CLI / Composer-2, scored by the official swebench.harness.run_evaluation Docker harness.

| Metric | baseline | sieve |
|---|---:|---:|
| instances scored | 4 | 4 |
| resolved | 2 | 2 |
| resolve rate (of scored) | 50.0% | 50.0% |
| patch chars | 21,242 | 19,956 |
| agent-facing chars | 47,688 | 11,613 |
| raw chars | 47,688 | 40,416 |
| compression ratio | 0% | 71.3% |

Resolve rate is unchanged, and agent-facing context drops 75.6% relative to the baseline run (47,688 → 11,613 chars). The 71.3% compression ratio in the table is measured against the sieve run's own 40,416 raw chars. This is a small-N pilot (4 instances scored), so treat it as a directional signal, not a statistically robust result. The fixture-corpus numbers above (N=22) are the methodologically clean measurement. Reproduce with:

bash scripts/run_cursor_swe_bench_profiles.sh --resume \
  --eval-with-harness --harness-namespace none
PYTHONPATH=src python3 -m benchmarks.swe_bench_compare \
  --baseline artifacts/cursor-swe-bench-lite.baseline.jsonl \
  --sieve    artifacts/cursor-swe-bench-lite.sieve.jsonl

CI-Repair-Bench (noisy-logs benchmark)

CI-Repair-Bench is built from real GitHub Actions failures (workflow YAML + long logs). We measure observation compression over the ci-benchmark-user/ci-repair-bench dataset; each observation is workflow + flattened logs with the gold diff excluded, so the metric is diagnostic bulk, not patch leakage. For repair scoring use the paper's upstream harness.

uv sync --group swe-eval   # datasets
uv run python -m benchmarks.ci_repair_bench --compare --json   # all 567 rows

Installation

The library has no runtime dependencies:

pip install agent-sieve            # library + `sieve` / `sieve-run` CLIs
pip install 'agent-sieve[mcp]'     # add the MCP proxy

After install, wire the hooks into your agent harness:

sieve config claude --write        # Claude Code PreToolUse (Bash) hook
sieve config cursor --write        # Cursor preToolUse (Shell) hook
sieve config claude --status       # check what's installed

Both config commands print a dry-run preview by default; add --write to apply (a .sieve.bak backup is made first) and --uninstall to remove.

Quick start

uv run python -m unittest discover -s tests -v
uv run python -m benchmarks.run                   # compression report on fixture corpus

SWE-bench Lite (paired baseline-vs-sieve trials)

Wraps the official swebench.harness.run_evaluation Docker harness. The paired runner builds each instance's harness container, mounts the workspace at /testbed, runs the agent, and emits a predictions.jsonl per profile. The harness then rewrites each row with the authoritative resolved value.

uv sync --group swe-eval

bash scripts/run_cursor_swe_bench_profiles.sh \
  --manifest benchmarks/manifests/lite_smoke.jsonl \
  --resume --eval-with-harness --harness-namespace none

PYTHONPATH=src python3 -m benchmarks.swe_bench_compare \
  --baseline artifacts/cursor-swe-bench-lite.baseline.jsonl \
  --sieve    artifacts/cursor-swe-bench-lite.sieve.jsonl

Notes:

--harness-namespace none reuses locally-built sweb.eval.x86_64.<id>:latest images instead of pulling swebench/... from Docker Hub.
The runner deletes per-instance images and workspaces after each row, and passes --clean True --cache_level base to the harness. Pass --keep-images / --keep-workspaces to retain them.
Swap cursor for codex in the script name to use the Codex CLI agent instead.
The trajectory-only variant (benchmarks.swe_bench_lite, replays *.traj files for token counts without re-running tests) is still available; it does not run the harness.

benchmarks.swe_bench_compare reports three resolve-rate views: overall (resolved / instances), of-scored (resolved / scored), and scored-rate (scored / instances).
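The three views differ only in their denominators. The sketch below is an illustration of those definitions, not the code in benchmarks.swe_bench_compare; `ResolveRates`, `resolve_rates`, and the example counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolveRates:
    overall: float      # resolved / instances
    of_scored: float    # resolved / scored
    scored_rate: float  # scored / instances


def resolve_rates(instances: int, scored: int, resolved: int) -> ResolveRates:
    """Compute the three views from raw counts."""
    if not 0 <= resolved <= scored <= instances:
        raise ValueError("expected 0 <= resolved <= scored <= instances")

    def safe_div(numerator: int, denominator: int) -> float:
        # Avoid ZeroDivisionError when nothing was attempted or scored.
        return numerator / denominator if denominator else 0.0

    return ResolveRates(
        overall=safe_div(resolved, instances),
        of_scored=safe_div(resolved, scored),
        scored_rate=safe_div(scored, instances),
    )


# Hypothetical run: 5 instances attempted, 4 scored by the harness, 2 resolved.
rates = resolve_rates(instances=5, scored=4, resolved=2)
print(rates)
```

Reporting all three keeps a low scored-rate (for example, instances the harness could not evaluate) from hiding behind a healthy of-scored figure.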

Usage

Direct compression

from sieve import CompressSession

session = CompressSession()
result = session.compress(
    command="pytest tests/",
    stdout=raw_stdout,
    stderr=raw_stderr,
    exit_code=1,
)
print(result.text)              # send this to the LLM
print(result.stats.compression_ratio)

CompressSession keeps state across calls, so the second compress(...) for the same test suite emits a delta against the first.
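To show why session state matters, here is a toy sketch of cross-turn delta reporting over sets of failing test ids. It is not CompressSession's implementation; the class `ToyDeltaSession`, its `report` method, and the exact output wording are hypothetical.

```python
class ToyDeltaSession:
    """Remembers the failing test ids from the previous turn."""

    def __init__(self) -> None:
        self._previous = None  # set of ids from the last turn, or None
        self._turn = 0

    def report(self, failing: set) -> str:
        self._turn += 1
        previous, self._previous = self._previous, set(failing)
        if previous is None:
            # First turn: nothing to diff against, so emit the full list.
            return "\n".join(["FAILING:"] + [f"  {t}" for t in sorted(failing)])
        lines = [f"DELTA (turn {self._turn})"]
        if failing == previous:
            lines.append("unchanged")
        lines += [f"PASS {t} now passes" for t in sorted(previous - failing)]
        lines += [f"NEW FAIL {t}" for t in sorted(failing - previous)]
        lines += [f"STILL FAIL {t}" for t in sorted(failing & previous)]
        return "\n".join(lines)


session = ToyDeltaSession()
first = session.report({"t::update", "t::delete"})
second = session.report({"t::delete"})
third = session.report({"t::delete"})
print(second)
```

Only the first call pays for the full listing; later calls describe what moved between the two sets.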

Decorator wrapper

import subprocess
from sieve import wrap_tool

@wrap_tool
def run_bash(command: str) -> tuple[str, str, int]:
    p = subprocess.run(command, shell=True, capture_output=True, text=True)
    return p.stdout, p.stderr, p.returncode

run_bash(...) now returns the compressed string. Session state is held by the decorator.
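A rough sketch of how a decorator can hold state across calls while turning a (stdout, stderr, exit_code) tuple into a single string. This is not how `wrap_tool` is implemented: `wrap_tool_sketch`, `make_counting_compressor`, and `fake_tool` are hypothetical, and unlike the real decorator this one takes the compressor as an argument.

```python
import functools
from typing import Callable, Tuple

ToolResult = Tuple[str, str, int]  # (stdout, stderr, exit_code)


def wrap_tool_sketch(compress: Callable[[str, str, str, int], str]):
    """Build a decorator that routes a tool's result through `compress`."""

    def decorator(tool: Callable[[str], ToolResult]) -> Callable[[str], str]:
        @functools.wraps(tool)
        def wrapper(command: str) -> str:
            stdout, stderr, exit_code = tool(command)
            return compress(command, stdout, stderr, exit_code)

        return wrapper

    return decorator


def make_counting_compressor():
    """Stand-in compressor whose closure plays the role of session state."""
    calls = 0

    def compress(command: str, stdout: str, stderr: str, exit_code: int) -> str:
        nonlocal calls
        calls += 1
        text = stdout or stderr
        first_line = text.splitlines()[0] if text else ""
        return f"[call {calls}] exit={exit_code} {first_line}"

    return compress


@wrap_tool_sketch(make_counting_compressor())
def fake_tool(command: str) -> ToolResult:
    # Stand-in for a real subprocess call.
    return f"ran {command}\nmore output", "", 0


first = fake_tool("ls")
second = fake_tool("ls")
print(first)
print(second)
```

The counter survives between calls because it lives in the closure created once at decoration time, which is the same place a shared session object would live.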

Configuration

from sieve import CompressConfig, CompressSession, OutputFormat

session = CompressSession(CompressConfig(
    format=OutputFormat.STRUCTURED,   # plain | structured | xml | minimal
    delta_mode=True,
    include_pattern_hints=True,
    max_raw_lines=50,
))

MCP proxy (no application code changes)

sieve.integrations.mcp is an MCP proxy that wraps any upstream MCP server. The agent talks to the proxy as if it were the upstream; the proxy forwards tools/list and tools/call, and runs every TextContent block in the result through a shared CompressSession before returning it. A single session lives for the proxy's lifetime, so cross-tool delta compression works.
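The result-rewriting step can be pictured on plain dictionaries. The sketch below assumes the usual tools/call result shape (a `content` list whose text entries have `type` set to `"text"`); `compress_text_blocks` and the stand-in compressor are hypothetical and do not use the MCP SDK or Sieve itself.

```python
from typing import Callable


def compress_text_blocks(result: dict, compress: Callable[[str], str]) -> dict:
    """Return a copy of a tools/call result with every text block compressed."""
    new_content = []
    for block in result.get("content", []):
        if block.get("type") == "text":
            # Copy the block so the upstream result is left untouched.
            block = {**block, "text": compress(block["text"])}
        new_content.append(block)
    return {**result, "content": new_content}


def drop_blank_lines(text: str) -> str:
    """Stand-in compressor: removes blank lines only."""
    return "\n".join(line for line in text.splitlines() if line.strip())


upstream = {
    "content": [
        {"type": "text", "text": "a\n\n\nb"},
        {"type": "image", "data": "xyz", "mimeType": "image/png"},
    ],
    "isError": False,
}
proxied = compress_text_blocks(upstream, drop_blank_lines)
print(proxied["content"][0]["text"])
```

Non-text blocks pass through unchanged, so the agent still receives images and other content exactly as the upstream sent them.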

Install the optional dep:

pip install 'agent-sieve[mcp]'

Configure your MCP client (Claude Desktop, Cursor, Continue, etc.) to launch the proxy in place of the upstream. Example wrapping the official server-everything demo server (npx downloads it on first run):

For Claude Desktop's claude_desktop_config.json or Cursor's .cursor/mcp.json:

{
  "mcpServers": {
    "sieve-demo": {
      "command": "python",
      "args": [
        "-m", "sieve.integrations.mcp",
        "--",
        "npx", "-y", "@modelcontextprotocol/server-everything"
      ]
    }
  }
}

Use whatever upstream you already rely on (e.g. @modelcontextprotocol/server-filesystem with the paths your client expects); there is no published @modelcontextprotocol/server-bash on npm.

That's the entire integration: the agent sees the same upstream tools, but every TextContent result has been compressed.

For non-shell tools, the proxy uses the tool name as the parser-router hint; for shell-like tools that take a command / cmd / shellCommand argument, the actual command string is forwarded so parser detection (pytest, mypy, etc.) works correctly.
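The routing rule above can be sketched as a small lookup. The argument keys come from the paragraph above; the function name `parser_hint` and the example tool names are hypothetical, and the real proxy may apply further checks.

```python
COMMAND_KEYS = ("command", "cmd", "shellCommand")


def parser_hint(tool_name: str, arguments: dict) -> str:
    """Pick the string that parser detection should look at."""
    for key in COMMAND_KEYS:
        value = arguments.get(key)
        if isinstance(value, str) and value.strip():
            # Shell-like tool: the command itself reveals pytest, mypy, etc.
            return value
    # Non-shell tool: fall back to the tool name.
    return tool_name


print(parser_hint("run_shell", {"cmd": "pytest tests/"}))
print(parser_hint("read_file", {"path": "a.py"}))
```

Without the command string, a generic tool name such as a shell runner would tell the router nothing about which parser applies.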

What it looks like

Illustrative example (hand-written to show the shape of the transform, not a captured fixture):

Raw pytest run with two failures (1,818 chars):

============================= test session starts ==============================
platform linux -- Python 3.12.0, pytest-8.1.1, pluggy-1.4.0
... [40 lines of header + per-test output] ...
=================================== FAILURES ===================================
________________________________ test_user_update ________________________________
    def test_user_update(self):
        ...
>       assert response.status_code == 200
E       AssertionError: assert 403 == 200
tests/test_views.py:89: AssertionError
... [equivalent block for test_user_delete] ...
=========================== short test summary info ============================
FAILED tests/test_views.py::TestUserViewSet::test_user_update - AssertionError
FAILED tests/test_views.py::TestUserViewSet::test_user_delete - AssertionError
========================= 2 failed, 140 passed, 0 warnings ====================

After Sieve (297 chars, 83.7% reduction):

PYTEST: 2 failed, 140 passed (142 total)
FAIL tests/test_views.py::TestUserViewSet::test_user_update (test_views.py:89)
  expected 200, got 403
FAIL tests/test_views.py::TestUserViewSet::test_user_delete (test_views.py:102)
  expected 204, got 403
Pattern: All failures return 403 in test_views.py
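To make "parsing, not truncating" concrete, here is a toy digest that keeps only the short-summary FAILED lines and the final counts. It is far simpler than Sieve's pytest parser (no assertion details, no pattern hints); `toy_pytest_digest` and both regular expressions are assumptions made for this illustration.

```python
import re

SUMMARY_RE = re.compile(r"(\d+) (failed|passed)")
FAILED_RE = re.compile(r"^FAILED (\S+) - (.+)$")


def toy_pytest_digest(output: str) -> str:
    """Extract failing node ids and pass/fail counts from pytest output."""
    counts = {"failed": 0, "passed": 0}
    failures = []
    for line in output.splitlines():
        match = FAILED_RE.match(line)
        if match:
            failures.append(match.group(1))
        elif line.startswith("=") and line.rstrip().endswith("="):
            # Banner lines; only the final one carries "N failed, M passed".
            for number, kind in SUMMARY_RE.findall(line):
                counts[kind] = int(number)
    total = counts["failed"] + counts["passed"]
    header = (
        f"PYTEST: {counts['failed']} failed, "
        f"{counts['passed']} passed ({total} total)"
    )
    return "\n".join([header] + [f"FAIL {nodeid}" for nodeid in failures])


sample = "\n".join([
    "============================= test session starts ==============================",
    "=========================== short test summary info ============================",
    "FAILED tests/test_views.py::TestUserViewSet::test_user_update - AssertionError",
    "FAILED tests/test_views.py::TestUserViewSet::test_user_delete - AssertionError",
    "========================= 2 failed, 140 passed, 0 warnings ====================",
])
digest = toy_pytest_digest(sample)
print(digest)
```

A truncating approach would keep the first or last N lines regardless of meaning; a parser keeps the lines that carry the diagnosis wherever they appear.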

On the next turn, the agent fixes test_user_update and re-runs:

PYTEST DELTA (turn 2)
PASS tests/test_views.py::TestUserViewSet::test_user_update now passes
STILL FAIL tests/test_views.py::TestUserViewSet::test_user_delete (line 102) - expected 204, got 403
Result: 1 failed, 141 passed (142 total)
