Commit 965dce0

SonAIengine and claude committed
feat(v0.20): Phase A pagination + Phase 0 unified scorer
PHASE A (exhaustive recall):
- cursor / has_more / next_cursor / offset on filter_nodes, aggregate_nodes, top_nodes, join_related (4 structured tools). Stateless re-call protocol: agent re-issues same tool with cursor=<prior next_cursor>; pages disjoint, no dedup needed.
- Adaptive turn budget: enumeration markers (모두/전체/목록/리스트/list all/every/all the/모든-leading) bump max_turns 5→15 so the agent can walk pagination cursors. Caller max_turns >5 wins.
- Honest truncation signaling: project_tool_result now records _truncated_from per list (was: bare _trimmed_for_context bool).
- Cursor follow-through guidance added to both AGENT_SYSTEM prompts with worked multi-step example.
- 14 + 20 + 1 = 35 new tests (pagination contract, classifier, truncation signal). Full suite: 940 pass.

PHASE 0 (unified validation):
- eval/unified.py: dimension classifier (lang / recall_type / hop_count / structured_pct / enumeration / cross_domain / cross_language) + weighted UnifiedScore composite + per-axis breakdown + per-bench legacy view. CLI scorer + JSON report + diff-vs-baseline mode.
- Critical invariant: classifier.enumeration is bit-aligned with agent_loop._is_enumeration_query (test-locked) so scoring matches the upstream budget decision.
- 22 new tests covering weight normalisation, axis no-coverage flagging, multi-language detection, alignment with agent loop.

HONEST MEASUREMENT:
Phase A regresses on per-bench scoring (KRRA Hard 34→31, assort Hard 30→29): adaptive budget fires correctly on h012 but agent paginates wrong-set (phrase hubs, not docs); also deterministic prompt-shift reroutes h012/h025/h031. Single-bench scoring can't tell us if enumeration recall up > broad-topical down. The unified scorer is the path out: Phase 0.3 / 0.4 will fill cross-domain + cross-language coverage gaps so every future Phase competes on a single number that captures the real trade-offs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 108a44a commit 965dce0

9 files changed

Lines changed: 1661 additions & 23 deletions

CHANGELOG.md

Lines changed: 141 additions & 0 deletions
@@ -6,6 +6,147 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

## [Unreleased]

### Added — v0.20 Phase 0: unified validation scorer (`eval/unified.py`)

The cumulative bench infrastructure was per-corpus: KRRA Hard / assort
Hard / X2BEE / MuSiQue each reported a single MRR-or-hit number, and
ship/no-ship judgements relied on summing those across releases. This
hides cross-cutting regressions: a feature that wins enumeration but
loses broad-topical retrieval shows up as "+2 here, -3 there" with no
single metric explaining the trade-off.

`eval/unified.py` introduces a dimension-tagged composite. Each query
is auto-classified along seven axes:

- `language` (ko / en / mixed)
- `recall_type` (single / top_n / multi_hop / enumeration / summary)
- `hop_count`
- `structured_pct`
- `enumeration` (explicit "all/모두/전체/list all" markers)
- `cross_domain`
- `cross_language`

The scorer aggregates per-axis hit-rate and produces a single
**UnifiedScore** in [0, 1] as a weighted average of axis hit-rates.
Default weights: ko 0.30, en 0.10, mixed 0.05, multi_hop 0.15,
enumeration 0.10, structured 0.10, cross_domain 0.10, cross_language
0.10.
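
As a rough illustration of the weighting described above, the composite can be sketched as a weighted average over per-axis hit-rates. This is not the code in `eval/unified.py`: the names `unified_score` and `coverage_ceiling`, the choice to let an uncovered axis contribute zero instead of being renormalised away, and the hit-rates in the usage lines are all assumptions made for the example. Only the default weights come from the text.

```python
from typing import Mapping, Optional

# Default weights as listed in the changelog entry above.
DEFAULT_WEIGHTS = {
    "ko": 0.30, "en": 0.10, "mixed": 0.05, "multi_hop": 0.15,
    "enumeration": 0.10, "structured": 0.10,
    "cross_domain": 0.10, "cross_language": 0.10,
}


def unified_score(hit_rates: Mapping[str, Optional[float]],
                  weights: Mapping[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted average of per-axis hit-rates.

    Assumption: an axis with no coverage (missing or None) contributes
    zero rather than being dropped and renormalised.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = 0.0
    for axis, weight in weights.items():
        rate = hit_rates.get(axis)
        if rate is not None:
            weighted += weight * rate
    return weighted / total_weight


def coverage_ceiling(covered: set[str],
                     weights: Mapping[str, float] = DEFAULT_WEIGHTS) -> float:
    """Best achievable score when only the `covered` axes have queries."""
    return unified_score({axis: 1.0 for axis in covered}, weights)


# Hypothetical hit-rates, not taken from any bench log.
example = {"ko": 0.9, "multi_hop": 0.8, "enumeration": None}
score = unified_score(example)
```

Under this assumption an axis without queries caps the composite, which makes missing coverage visible in the headline number instead of hiding it.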

Critical invariant locked in tests
(`tests/test_eval_unified.py`): the enumeration classifier here is
bit-identical with `src/synaptic/agent_loop.py:_is_enumeration_query`,
so every query that triggers the agent's adaptive turn budget is
*also* counted in the enumeration recall slice. Any drift is a
reporting bug.

First UnifiedScore measurement on the v0.20 partial bench logs:

```
UnifiedScore: 0.598
lang:ko hit=0.870 (n=69) ✓
lang:en hit=— (n=0) ✗ NO COVERAGE — agent bench has no EN
recall:multi_hop hit=0.818 (n=11) ✓
recall:enumeration hit=0.861 (n=43) ✓
structured hit=0.897 (n=39) ✓
cross_domain hit=— (n=0) ✗ NO COVERAGE — no cross-corpus queries
cross_language hit=— (n=0) ✗ NO COVERAGE — no EN→KO paraphrase queries
```

A score of about 0.60 despite hit-rates between 0.82 and 0.90 on the
covered axes shown above, caused by 30% of weight sitting on three
uncovered axes, is exactly the reporting we want. Phase 0.3 / 0.4
will author Korean multi-hop + cross-domain / cross-language query
sets to fill the coverage gaps. Phase 0.5 then re-measures every
prior release against the new unified metric so progression is judged
on a single number going forward.

Tests: 22 new (classifier alignment with agent loop, weight
normalisation, axis no-coverage flagging, default-weight ceiling
sanity).

### Measured — v0.20 Phase A: honest result on partial 2-bench data

Two of the five planned v0.20 benches completed before the parallel
sweep was killed (vLLM contention with 5 concurrent agents); the
remaining three were re-run individually, but only the first 3-4
queries of each had survived in the log files as of this writing.
This section therefore reports the partial measurement:

| Bench | v0.19 @ T=0 | **v0.20 @ T=0** | Δ |
|---|---:|---:|---:|
| KRRA Hard (39q) | 34 / 39 = 87 % | **31 / 39 = 79 %** | **−3** |
| assort Hard (33q) | 30 / 33 = 91 % | **29 / 33 = 88 %** | **−1** |
| **Combined (2 benches, 72q)** | **64 / 72 = 89 %** | **60 / 72 = 83 %** | **−4** |

Per-query diff, KRRA Hard:

- **adaptive turn budget fired correctly** on h012 (`turns=12` vs
  prior 5): the feature works as designed
- **h012 still misses**, though: the agent paginated 20 phrase-hub nodes
  ("이용자보호 과제" [user-protection tasks], "한국마사회 이용자보호"
  [Korea Racing Authority user protection]) rather than documents.
  This is a different problem (entity-linker bias surfaces hubs above
  docs in FTS rank): pagination does not improve recall when the
  wrong set is being paginated
- **NEW misses h012, h025, h031** (vs v0.19): the same
  deterministic-prompt-shift dynamic seen in v0.19 X2BEE Conv c013.
  Adding cursor follow-through guidance to the system prompt reroutes
  some Korean structured queries down different deterministic paths
  even though the new rule technically targets only enumeration

Net read: Phase A's pagination + adaptive budget *succeeds at its
direct target* (adaptive budget fires, cursor parameter wires through
correctly, 14 + 20 + 1 = 35 new tests lock the contract) but
**regresses overall agent quality on the existing per-bench scoring**.
This is exactly the measurement gap that motivates Phase 0:
single-bench numbers can't tell us whether enumeration recall went
up faster than broad-topical retrieval went down. The unified scorer
shipped above will resolve it after Phase 0.3 / 0.4 add the missing
query coverage.

The Phase A code itself is correct, tested, and additive. We choose
to ship it and re-evaluate under the unified metric rather than
revert.

### Added — v0.20 Phase A: pagination protocol + adaptive enumeration budget

The first ship of the v0.20+ track ("multi-domain ontology + multi-turn
exhaustive recall"). It targets the structural ceiling that left
enumeration queries unsolved in earlier versions: the agent had no way
to retrieve results 21 through ``total`` when ``filter_nodes`` capped
at 20 with only a ``truncated=true`` boolean.

**1. Pagination on all four structured tools.**
``filter_nodes`` / ``aggregate_nodes`` / ``top_nodes`` / ``join_related``
all gain an opaque ``cursor`` parameter and emit ``has_more`` +
``next_cursor`` + ``offset`` in their response. Stateless: the agent
re-issues the same tool with ``cursor=<next_cursor>`` to advance. Pages
are disjoint, so no dedup is needed.
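
The re-call protocol above can be sketched with an in-memory stand-in for one of the tools. Everything here is illustrative: this `filter_nodes` is a toy replacement, not the real tool, and encoding the cursor as the next offset rendered as a string is an assumption (the changelog only calls the cursor opaque). The response keys `has_more`, `next_cursor`, `offset` and `total` follow the names used in this commit.

```python
from typing import Any, Optional

# Stand-in data set; the real tools query a graph backend.
ROWS = [f"doc-{i:03d}" for i in range(247)]


def filter_nodes(limit: int = 20, cursor: Optional[str] = None) -> dict[str, Any]:
    """Toy paginated tool. Assumption: cursor == next offset as a string."""
    offset = int(cursor) if cursor else 0
    page = ROWS[offset:offset + limit]
    end = offset + len(page)
    has_more = end < len(ROWS)
    return {
        "results": page,
        "total": len(ROWS),
        "offset": offset,
        "has_more": has_more,
        "next_cursor": str(end) if has_more else None,
    }


def collect_all(limit: int = 100, max_calls: int = 15) -> list[str]:
    """Re-issue the same call with cursor=<next_cursor> until has_more is false."""
    collected: list[str] = []
    cursor: Optional[str] = None
    for _ in range(max_calls):
        resp = filter_nodes(limit=limit, cursor=cursor)
        collected.extend(resp["results"])  # pages are disjoint, so no dedup
        if not resp["has_more"]:
            break
        cursor = resp["next_cursor"]
    return collected


rows = collect_all()
```

Because the caller carries the cursor, the tool keeps no per-session state; the `max_calls` bound plays the role of the agent's turn budget.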

**2. Adaptive turn budget for enumeration queries.**
``run_agent_loop`` now sniffs the query for enumeration markers
(``모두`` / ``전체`` / ``목록`` / ``리스트`` / ``전수`` / ``list all`` /
``every`` / ``all of the`` / ``all the`` / ``show me all`` / leading
``모든``). When a marker is detected, ``max_turns`` is raised from 5 to
15 so the agent has room to walk the cursor. A caller-provided
``max_turns > 5`` always wins. The classifier is deliberately biased
toward recall: a single marker flips the budget.
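
A minimal sketch of the marker check and the budget rule, assuming plain case-insensitive substring matching. The real `_is_enumeration_query` may tokenise or anchor differently; `is_enumeration_query` and `effective_max_turns` are hypothetical names used only for this example. The marker list is copied from the paragraph above, and the `max_turns <= 5` condition mirrors the diff to `eval/run_all.py` in this commit.

```python
# Markers copied from the changelog text; matching strategy is an assumption.
ENUM_SUBSTRINGS = (
    "모두", "전체", "목록", "리스트", "전수",
    "list all", "every", "all of the", "all the", "show me all",
)
LEADING_MARKER = "모든"  # only counted when it starts the query


def is_enumeration_query(query: str) -> bool:
    """Return True if any enumeration marker appears (recall-biased)."""
    q = query.strip().lower()
    if q.startswith(LEADING_MARKER):
        return True
    return any(marker in q for marker in ENUM_SUBSTRINGS)


def effective_max_turns(query: str, max_turns: int = 5) -> int:
    """Raise the budget to 15 for enumeration queries unless the caller
    already asked for more than 5 turns."""
    if max_turns <= 5 and is_enumeration_query(query):
        return 15
    return max_turns
```

Substring matching is intentionally loose: "every" also matches "everything", which errs on the side of granting the larger budget.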

**3. Honest truncation signaling.**
``project_tool_result`` already shrank lists to fit the 4 KB context
budget; previously it set a single ``_trimmed_for_context: true``
boolean. It now also records
``_truncated_from: {results: 200, evidence: 50}``, the per-list
pre-shrink size, so the agent can tell whether one item was dropped
or 199.
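
The idea can be sketched as follows. This is a hypothetical re-implementation for illustration only: the halving strategy and the byte accounting are assumptions, and only the two marker keys (`_trimmed_for_context`, `_truncated_from`) and the 4 KB figure come from the text above.

```python
import json
from typing import Any


def project_tool_result(result: dict[str, Any],
                        budget_bytes: int = 4096) -> dict[str, Any]:
    """Shrink list fields until the JSON encoding fits the budget and
    record each shrunk list's original length."""
    projected = dict(result)  # shallow copy; lists are replaced, not mutated
    original_sizes = {k: len(v) for k, v in result.items() if isinstance(v, list)}

    def encoded_size(obj: Any) -> int:
        return len(json.dumps(obj, ensure_ascii=False).encode("utf-8"))

    while encoded_size(projected) > budget_bytes:
        # Assumption: halve whichever list is currently longest.
        candidates = [k for k in original_sizes if len(projected[k]) > 1]
        if not candidates:
            break
        key = max(candidates, key=lambda k: len(projected[k]))
        projected[key] = projected[key][: len(projected[key]) // 2]

    truncated = {k: n for k, n in original_sizes.items() if len(projected[k]) < n}
    if truncated:
        # Markers are added after fitting, so the final payload can exceed
        # the budget by the few bytes these two keys take.
        projected["_trimmed_for_context"] = True
        projected["_truncated_from"] = truncated
    return projected
```

With the pre-shrink sizes in the payload, a reader of the result can compare `len(results)` against `_truncated_from["results"]` and decide whether to request another page.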

**4. Prompt updates.**
Both ``AGENT_SYSTEM`` prompts (``src/synaptic/agent_loop.py`` and
``eval/run_all.py``) now include explicit pagination guidance with a
worked multi-step example: a ``top_nodes → join_related (cursor=)``
loop for the "100건 이상 판매 상품의 리뷰 모두" ("all reviews of
products with 100 or more sales") pattern.

Test coverage: +35 tests (14 pagination contract, 20 enumeration
classifier, 1 truncation signal). Full suite: 940 pass.

### Measured — v0.19 prompt patch (English-paraphrase / search-first guidance)

Added a single new section to the agent system prompt — both

eval/run_all.py

Lines changed: 72 additions & 10 deletions
@@ -643,12 +643,38 @@ async def run_public_dataset(
   unfiltered topic search returns too many candidates AND you have
   evidence the user wants a specific year.

-## "List all" / enumeration questions
-- Queries like "X 목록", "X 상품 전체", "list all X" need the COMPLETE
-  set, not one representative. Use ``filter_nodes(limit=100)`` (or
-  higher) and keep scanning. The GT for these often has 5-10 specific
-  rows; a limit=20 default plus a retry that narrows instead of
-  widening will miss half of them.
+## "List all" / enumeration questions — paginate with cursor
+- Queries like "X 목록", "X 상품 전체", "list all X", "모두/전체"
+  need the COMPLETE set, not a representative sample. Strategy:
+  1. First call: filter_nodes / aggregate_nodes / top_nodes / join_related
+     with limit=100.
+  2. Read the result: if has_more=true, the response includes
+     next_cursor (a short token like "100").
+  3. Issue the SAME tool with the SAME args plus cursor=<next_cursor>.
+     Keep going until has_more=false (or you've gathered enough).
+- Each follow-through page is disjoint from prior pages — no
+  deduplication needed on your side.
+- Total/total_groups stays constant across pages; "showing" is the
+  size of the current page only. Use total to plan how many follow-
+  throughs you need.
+
+  Q: "이용자보호 관련 모든 문서 목록"
+  Step 1: filter_nodes(table="documents", property="category",
+          op="contains", value="이용자보호", limit=100)
+    → total=27, showing=27, has_more=false (one call sufficed)
+
+  Q: "100건 이상 판매 상품의 리뷰 모두"
+  Step 1: top_nodes(table="products", sort_by="cumulative_sales",
+          order="desc", limit=100,
+          where_property="cumulative_sales", where_op=">=",
+          where_value="100")
+    → results = top-100 products sorted by sales
+  Step 2: join_related(from_values=<all 100 product codes>,
+          fk_property="goods_no", target_table="reviews",
+          limit=100)
+    → has_more=true, next_cursor="100", total=247
+  Step 3: join_related(... same args ..., cursor="100")
+  Step 4: join_related(... same args ..., cursor="200") → has_more=false

 ## Multi-source questions
 - Queries like "X 관련 자료", "X 관련 내용", "X 관련 정보" explicitly
@@ -720,6 +746,10 @@ async def run_public_dataset(
             "type": "integer",
             "description": "Max results to return (default 20). Use higher for listings.",
         },
+        "cursor": {
+            "type": "string",
+            "description": "Pagination token from a prior call's next_cursor — pass to fetch the NEXT page when has_more=true. Use for 'list all X' queries.",
+        },
         "from_ids": {
             "type": "array",
             "items": {"type": "string"},
@@ -759,6 +789,10 @@ async def run_public_dataset(
             "description": "Date bucket format: 'YYYY', 'YYYY-MM', 'YYYY-MM-DD'. Use for monthly/yearly aggregation on datetime columns.",
         },
         "limit": {"type": "integer", "description": "Max groups (default 50)"},
+        "cursor": {
+            "type": "string",
+            "description": "Pagination token from a prior call's next_cursor.",
+        },
         "from_ids": {
             "type": "array",
             "items": {"type": "string"},
@@ -789,6 +823,10 @@ async def run_public_dataset(
             "description": "Target table e.g. pr_goods_sold_hist",
         },
         "limit": {"type": "integer", "description": "Max results (default 20)"},
+        "cursor": {
+            "type": "string",
+            "description": "Pagination token from a prior call's next_cursor.",
+        },
     },
     "required": ["fk_property", "target_table"],
 },
@@ -820,6 +858,10 @@ async def run_public_dataset(
         "where_property": {"type": "string"},
         "where_op": {"type": "string"},
         "where_value": {"type": "string"},
+        "cursor": {
+            "type": "string",
+            "description": "Pagination token from a prior call's next_cursor.",
+        },
         "from_ids": {
             "type": "array",
             "items": {"type": "string"},
@@ -1050,6 +1092,7 @@ async def _agent_dispatch(name, args, backend, session, *, embedder=None):
             op=args.get("op", "contains"),
             value=args.get("value", ""),
             limit=int(args.get("limit", 20)),
+            cursor=args.get("cursor"),
             from_ids=args.get("from_ids") or None,
         )
     elif name == "aggregate_nodes":
@@ -1065,6 +1108,7 @@ async def _agent_dispatch(name, args, backend, session, *, embedder=None):
             where_value=args.get("where_value", ""),
             group_by_format=args.get("group_by_format", ""),
             limit=int(args.get("limit", 50)),
+            cursor=args.get("cursor"),
             from_ids=args.get("from_ids") or None,
         )
     elif name == "join_related":
@@ -1076,6 +1120,7 @@ async def _agent_dispatch(name, args, backend, session, *, embedder=None):
             fk_property=args.get("fk_property", ""),
             target_table=args.get("target_table", ""),
             limit=int(args.get("limit", 20)),
+            cursor=args.get("cursor"),
         )
     elif name == "top_nodes":
         r = await top_nodes_tool(
@@ -1088,6 +1133,7 @@ async def _agent_dispatch(name, args, backend, session, *, embedder=None):
             where_property=args.get("where_property", ""),
             where_op=args.get("where_op", ""),
             where_value=args.get("where_value", ""),
+            cursor=args.get("cursor"),
             from_ids=args.get("from_ids") or None,
         )
     elif name == "get_document":
@@ -1198,7 +1244,15 @@ async def _agent_loop_run(
             continue
         total += 1

-        session = SearchSession(budget_tool_calls=max_turns * 3)
+        # Adaptive budget: enumeration queries ("X 모두/전체/목록",
+        # "list all X") get 3x the turns so the agent can walk
+        # pagination cursors. Mirror src/synaptic/agent_loop.py.
+        from synaptic.agent_loop import _is_enumeration_query
+
+        eff_max_turns = (
+            15 if (max_turns <= 5 and _is_enumeration_query(query_text)) else max_turns
+        )
+        session = SearchSession(budget_tool_calls=eff_max_turns * 3)
         messages = [
             {"role": "system", "content": system},
             {"role": "user", "content": query_text},
@@ -1208,7 +1262,7 @@ async def _agent_loop_run(
         turns_used = 0
         final_answer = ""

-        for turn in range(max_turns):
+        for turn in range(eff_max_turns):
             turns_used = turn + 1
             try:
                 resp = await client.chat.completions.create(
@@ -1443,7 +1497,15 @@ async def run_agent_benchmark(
             continue
         total += 1

-        session = SearchSession(budget_tool_calls=max_turns * 3)
+        # Adaptive budget: enumeration queries ("X 모두/전체/목록",
+        # "list all X") get 3x the turns so the agent can walk
+        # pagination cursors. Mirror src/synaptic/agent_loop.py.
+        from synaptic.agent_loop import _is_enumeration_query
+
+        eff_max_turns = (
+            15 if (max_turns <= 5 and _is_enumeration_query(query_text)) else max_turns
+        )
+        session = SearchSession(budget_tool_calls=eff_max_turns * 3)
         messages = [
             {"role": "system", "content": system},
             {"role": "user", "content": query_text},
@@ -1453,7 +1515,7 @@ async def run_agent_benchmark(
         turns_used = 0
         final_answer = ""

-        for turn in range(max_turns):
+        for turn in range(eff_max_turns):
             turns_used = turn + 1
             try:
                 resp = await client.chat.completions.create(
