PlateerLab
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 141 additions & 0 deletions
diff --git a/eval/run_all.py b/eval/run_all.py
Lines changed: 72 additions & 10 deletions
@@ -6,6 +6,147 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 
 ## [Unreleased]
 
+### Added — v0.20 Phase 0: unified validation scorer (`eval/unified.py`)
+
+The cumulative bench infrastructure was per-corpus: KRRA Hard / assort
+Hard / X2BEE / MuSiQue each reported a single MRR-or-hit number, and
+ship/no-ship judgements relied on summing those across releases. This
+hides cross-cutting regressions: a feature that wins enumeration but
+loses broad-topical retrieval shows up as "+2 here, -3 there" with no
+single metric explaining the trade-off.
+
+`eval/unified.py` introduces a dimension-tagged composite. Each query
+is auto-classified along seven axes:
+
+  - `language` (ko / en / mixed)
+  - `recall_type` (single / top_n / multi_hop / enumeration / summary)
+  - `hop_count`
+  - `structured_pct`
+  - `enumeration` (explicit "all/모두/전체/list all" markers)
+  - `cross_domain`
+  - `cross_language`
+
+The scorer aggregates per-axis hit-rate and produces a single
+**UnifiedScore** ∈ [0, 1] as a weighted average of axis hit-rates.
+Default weights: ko 0.30, en 0.10, mixed 0.05, multi_hop 0.15,
+enumeration 0.10, structured 0.10, cross_domain 0.10, cross_language
+0.10.
+
+Critical invariant locked in tests
+(`tests/test_eval_unified.py`): the enumeration classifier here is
+bit-identical with `src/synaptic/agent_loop.py:_is_enumeration_query`,
+so every query that triggers the agent's adaptive turn budget is
+*also* counted in the enumeration recall slice. Any drift is a
+reporting bug.
+
+First UnifiedScore measurement on the v0.20 partial bench logs:
+
+```
+UnifiedScore: 0.598
+  lang:ko        hit=0.870 (n=69)   ✓
+  lang:en        hit=—     (n=0)    ✗ NO COVERAGE — agent bench has no EN
+  recall:multi_hop  hit=0.818 (n=11) ✓
+  recall:enumeration hit=0.861 (n=43) ✓
+  structured     hit=0.897 (n=39)   ✓
+  cross_domain   hit=—     (n=0)    ✗ NO COVERAGE — no cross-corpus queries
+  cross_language hit=—     (n=0)    ✗ NO COVERAGE — no EN→KO paraphrase queries
+```
+
+The 0.60 ceiling at full Korean / structured competence — caused by
+30% of weight sitting on three uncovered axes — is exactly the
+reporting we want. Phase 0.3 / 0.4 author Korean multi-hop +
+cross-domain / cross-language query sets to fill the coverage gaps.
+Phase 0.5 then re-measures every prior release against the new
+unified metric so progression is judged on a single number going
+forward.
+
+Tests: 22 new (classifier alignment with agent loop, weight
+normalisation, axis no-coverage flagging, default-weight ceiling
+sanity).
+
+### Measured — v0.20 Phase A: honest result on partial 2-bench data
+
+Two of the five planned v0.20 benches completed before the parallel
+sweep was killed (vLLM contention with 5 concurrent agents). The
+remaining three were re-run individually, but only the first 3-4
+queries of each had survived in the log files as of this writing. This
+section therefore reports the partial measurement as-is:
+
+| Bench | v0.19 @ T=0 | **v0.20 @ T=0** | Δ |
+|---|---:|---:|---:|
+| KRRA Hard (39q) | 34 / 39 = 87 % | **31 / 39 = 79 %** | **−3** |
+| assort Hard (33q) | 30 / 33 = 91 % | **29 / 33 = 88 %** | **−1** |
+| **Combined (2 benches, 72q)** | **64 / 72 = 89 %** | **60 / 72 = 83 %** | **−4** |
+
+Per-query diff KRRA Hard:
+
+  - **adaptive turn budget fired correctly** on h012 (`turns=12` vs
+    prior 5) — feature works as designed
+  - **h012 still misses** though: agent paginated 20 phrase-hub nodes
+    ("이용자보호 과제", "한국마사회 이용자보호") rather than documents.
+    Different problem (entity-linker bias surfaces hubs above docs in
+    FTS rank) — pagination ≠ better recall when the wrong set is
+    being paginated
+  - **NEW misses h012, h025, h031** (vs v0.19) — same deterministic-
+    prompt-shift dynamic seen in v0.19 X2BEE Conv c013: adding cursor
+    follow-through guidance to the system prompt reroutes some
+    Korean structured queries down different deterministic paths
+    even though the new rule technically targets only enumeration
+
+Net read: Phase A's pagination + adaptive budget *succeeds at its
+direct target* (adaptive budget fires, cursor parameter wires through
+correctly, 14 + 20 + 1 = 35 new tests lock the contract) but
+**regresses overall agent quality on the existing per-bench scoring**.
+This is exactly the measurement gap that motivates Phase 0:
+single-bench numbers can't tell us whether enumeration recall went
+up faster than broad-topical retrieval went down. The unified scorer
+shipped above will resolve it after Phase 0.3 / 0.4 add the missing
+query coverage.
+
+The Phase A code itself is correct, tested, and additive. We choose
+to ship it and re-evaluate under the unified metric rather than
+revert.
+
+### Added — v0.20 Phase A: pagination protocol + adaptive enumeration budget
+
+The first ship of the v0.20+ track ("multi-domain ontology + multi-turn
+exhaustive recall"). It targets the structural ceiling that left
+enumeration queries unsolved in earlier versions: the agent had no way
+to retrieve results 21..total when ``filter_nodes`` capped at 20 with
+only a ``truncated=true`` boolean.
+
+**1. Pagination on all four structured tools.**
+``filter_nodes`` / ``aggregate_nodes`` / ``top_nodes`` / ``join_related``
+all gain an opaque ``cursor`` parameter and emit ``has_more`` +
+``next_cursor`` + ``offset`` in their response. Stateless: agent re-issues
+the same tool with ``cursor=<next_cursor>`` to advance. Pages are
+disjoint — no dedup needed.
+
+**2. Adaptive turn budget for enumeration queries.**
+``run_agent_loop`` now sniffs the query for enumeration markers
+(``모두`` / ``전체`` / ``목록`` / ``리스트`` / ``전수`` / ``list all`` /
+``every`` / ``all of the`` / ``all the`` / ``show me all`` / leading
+``모든``). Detected → ``max_turns`` bumps from 5 to 15 so the agent
+has room to walk the cursor. Caller-provided ``max_turns > 5`` always
+wins. Conservative classifier biased toward recall — a single marker
+flips the budget.
+
+**3. Honest truncation signaling.**
+``project_tool_result`` already shrank lists to fit the 4 KB context
+budget; previously it set only a single ``_trimmed_for_context: true``
+boolean. It now also records ``_truncated_from: {results: 200, evidence: 50}``
+(the per-list pre-shrink size) so the agent can tell whether one item
+was dropped or 199.
+
+**4. Prompt updates.**
+Both ``AGENT_SYSTEM`` prompts (``src/synaptic/agent_loop.py`` and
+``eval/run_all.py``) now include explicit pagination guidance with a
+worked multi-step example: ``top_nodes → join_related (cursor=)``
+loop for "100건 이상 판매 상품의 리뷰 모두" pattern.
+
+Test coverage: +35 tests (14 pagination contract, 20 enumeration
+classifier, 1 truncation signal). Full suite: 940 pass.
+
 ### Measured — v0.19 prompt patch (English-paraphrase / search-first guidance)
 
 Added a single new section to the agent system prompt — both
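
The UnifiedScore in the Phase 0 changelog entry is described as a weighted average of per-axis hit-rates, with uncovered axes flagged. The following is a minimal sketch of that aggregation, not the code in `eval/unified.py`. It rests on three assumptions: the slice keys follow the labels in the log excerpt (`lang:mixed` is extrapolated from them), the function name `unified_score` is hypothetical, and an axis with `n=0` contributes zero and is reported as uncovered.

```python
from typing import Dict, List, Optional, Tuple

# Default weights as listed in the changelog entry; key names are assumed.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "lang:ko": 0.30,
    "lang:en": 0.10,
    "lang:mixed": 0.05,
    "recall:multi_hop": 0.15,
    "recall:enumeration": 0.10,
    "structured": 0.10,
    "cross_domain": 0.10,
    "cross_language": 0.10,
}


def unified_score(
    hits: Dict[str, Tuple[int, int]],
    weights: Optional[Dict[str, float]] = None,
) -> Tuple[float, List[str]]:
    """Return (score, uncovered_axes).

    hits maps an axis label to (number_of_hits, number_of_queries).
    Assumption: an axis with zero queries adds nothing to the score
    and is listed as uncovered instead of being renormalised away.
    """
    weights = DEFAULT_WEIGHTS if weights is None else weights
    total_weight = sum(weights.values())
    score = 0.0
    uncovered: List[str] = []
    for axis, weight in weights.items():
        n_hit, n = hits.get(axis, (0, 0))
        if n == 0:
            uncovered.append(axis)
            continue
        # Normalise so the weights behave as fractions of 1.0.
        score += (weight / total_weight) * (n_hit / n)
    return score, uncovered
```

Under this reading, weight placed on an uncovered axis lowers the maximum reachable score, which is how a coverage gap shows up in the single number.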
 
@@ -643,12 +643,38 @@ async def run_public_dataset(
   unfiltered topic search returns too many candidates AND you have
   evidence the user wants a specific year.
 
-## "List all" / enumeration questions
-- Queries like "X 목록", "X 상품 전체", "list all X" need the COMPLETE
-  set, not one representative. Use ``filter_nodes(limit=100)`` (or
-  higher) and keep scanning. The GT for these often has 5-10 specific
-  rows; a limit=20 default plus a retry that narrows instead of
-  widening will miss half of them.
+## "List all" / enumeration questions — paginate with cursor
+- Queries like "X 목록", "X 상품 전체", "list all X", "모두/전체"
+  need the COMPLETE set, not a representative sample. Strategy:
+  1. First call: filter_nodes / aggregate_nodes / top_nodes / join_related
+     with limit=100.
+  2. Read the result: if has_more=true, the response includes
+     next_cursor (a short token like "100").
+  3. Issue the SAME tool with the SAME args plus cursor=<next_cursor>.
+     Keep going until has_more=false (or you've gathered enough).
+- Each follow-through page is disjoint from prior pages — no
+  deduplication needed on your side.
+- Total/total_groups stays constant across pages; "showing" is the
+  size of the current page only. Use total to plan how many follow-
+  throughs you need.
+
+Q: "이용자보호 관련 모든 문서 목록"
+Step 1: filter_nodes(table="documents", property="category",
+                     op="contains", value="이용자보호", limit=100)
+   → total=27, showing=27, has_more=false  (one call sufficed)
+
+Q: "100건 이상 판매 상품의 리뷰 모두"
+Step 1: top_nodes(table="products", sort_by="cumulative_sales",
+                  order="desc", limit=100,
+                  where_property="cumulative_sales", where_op=">=",
+                  where_value="100")
+   → results = top-100 products sorted by sales
+Step 2: join_related(from_values=<all 100 product codes>,
+                     fk_property="goods_no", target_table="reviews",
+                     limit=100)
+   → has_more=true, next_cursor="100", total=247
+Step 3: join_related(... same args ..., cursor="100")
+Step 4: join_related(... same args ..., cursor="200")  → has_more=false
 
 ## Multi-source questions
 - Queries like "X 관련 자료", "X 관련 내용", "X 관련 정보" explicitly
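
The cursor follow-through strategy in the prompt hunk above can be sketched as a plain loop. This is an illustrative, synchronous stand-in: `fetch_all_pages` and `fake_join_related` are hypothetical names, the real tools are awaited through `_agent_dispatch`, and the fake tool decodes the cursor as an integer offset only for the demo (the changelog describes the cursor as opaque, so callers should not rely on its contents).

```python
from typing import Any, Callable, Dict, List, Optional


def fetch_all_pages(
    tool: Callable[..., Dict[str, Any]],
    args: Dict[str, Any],
    max_pages: int = 15,
) -> List[Any]:
    """Re-issue the same tool call with cursor=<next_cursor> until has_more is false."""
    results: List[Any] = []
    cursor: Optional[str] = None
    for _ in range(max_pages):
        call_args = dict(args)
        if cursor is not None:
            call_args["cursor"] = cursor
        page = tool(**call_args)
        # Pages are stated to be disjoint, so no deduplication is done here.
        results.extend(page["results"])
        if not page.get("has_more"):
            break
        cursor = page["next_cursor"]
    return results


def fake_join_related(
    limit: int = 100, cursor: Optional[str] = None, **_: Any
) -> Dict[str, Any]:
    """Hypothetical in-memory tool with 247 rows, mimicking the response fields."""
    total = 247
    offset = int(cursor) if cursor else 0
    rows = list(range(offset, min(offset + limit, total)))
    end = offset + len(rows)
    has_more = end < total
    return {
        "results": rows,
        "total": total,
        "offset": offset,
        "has_more": has_more,
        "next_cursor": str(end) if has_more else None,
    }


rows = fetch_all_pages(fake_join_related, {"limit": 100})
```

The `max_pages` cap plays the same role as the turn budget: it bounds the walk if a tool keeps reporting `has_more=true`.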
@@ -720,6 +746,10 @@ async def run_public_dataset(
                         "type": "integer",
                         "description": "Max results to return (default 20). Use higher for listings.",
                     },
+                    "cursor": {
+                        "type": "string",
+                        "description": "Pagination token from a prior call's next_cursor — pass to fetch the NEXT page when has_more=true. Use for 'list all X' queries.",
+                    },
                     "from_ids": {
                         "type": "array",
                         "items": {"type": "string"},
@@ -759,6 +789,10 @@ async def run_public_dataset(
                         "description": "Date bucket format: 'YYYY', 'YYYY-MM', 'YYYY-MM-DD'. Use for monthly/yearly aggregation on datetime columns.",
                     },
                     "limit": {"type": "integer", "description": "Max groups (default 50)"},
+                    "cursor": {
+                        "type": "string",
+                        "description": "Pagination token from a prior call's next_cursor.",
+                    },
                     "from_ids": {
                         "type": "array",
                         "items": {"type": "string"},
@@ -789,6 +823,10 @@ async def run_public_dataset(
                         "description": "Target table e.g. pr_goods_sold_hist",
                     },
                     "limit": {"type": "integer", "description": "Max results (default 20)"},
+                    "cursor": {
+                        "type": "string",
+                        "description": "Pagination token from a prior call's next_cursor.",
+                    },
                 },
                 "required": ["fk_property", "target_table"],
             },
@@ -820,6 +858,10 @@ async def run_public_dataset(
                     "where_property": {"type": "string"},
                     "where_op": {"type": "string"},
                     "where_value": {"type": "string"},
+                    "cursor": {
+                        "type": "string",
+                        "description": "Pagination token from a prior call's next_cursor.",
+                    },
                     "from_ids": {
                         "type": "array",
                         "items": {"type": "string"},
@@ -1050,6 +1092,7 @@ async def _agent_dispatch(name, args, backend, session, *, embedder=None):
             op=args.get("op", "contains"),
             value=args.get("value", ""),
             limit=int(args.get("limit", 20)),
+            cursor=args.get("cursor"),
             from_ids=args.get("from_ids") or None,
         )
     elif name == "aggregate_nodes":
@@ -1065,6 +1108,7 @@ async def _agent_dispatch(name, args, backend, session, *, embedder=None):
             where_value=args.get("where_value", ""),
             group_by_format=args.get("group_by_format", ""),
             limit=int(args.get("limit", 50)),
+            cursor=args.get("cursor"),
             from_ids=args.get("from_ids") or None,
         )
     elif name == "join_related":
@@ -1076,6 +1120,7 @@ async def _agent_dispatch(name, args, backend, session, *, embedder=None):
             fk_property=args.get("fk_property", ""),
             target_table=args.get("target_table", ""),
             limit=int(args.get("limit", 20)),
+            cursor=args.get("cursor"),
         )
     elif name == "top_nodes":
         r = await top_nodes_tool(
@@ -1088,6 +1133,7 @@ async def _agent_dispatch(name, args, backend, session, *, embedder=None):
             where_property=args.get("where_property", ""),
             where_op=args.get("where_op", ""),
             where_value=args.get("where_value", ""),
+            cursor=args.get("cursor"),
             from_ids=args.get("from_ids") or None,
         )
     elif name == "get_document":
@@ -1198,7 +1244,15 @@ async def _agent_loop_run(
             continue
         total += 1
 
-        session = SearchSession(budget_tool_calls=max_turns * 3)
+        # Adaptive budget: enumeration queries ("X 모두/전체/목록",
+        # "list all X") get 3x the turns so the agent can walk
+        # pagination cursors. Mirror src/synaptic/agent_loop.py.
+        from synaptic.agent_loop import _is_enumeration_query
+
+        eff_max_turns = (
+            15 if (max_turns <= 5 and _is_enumeration_query(query_text)) else max_turns
+        )
+        session = SearchSession(budget_tool_calls=eff_max_turns * 3)
         messages = [
             {"role": "system", "content": system},
             {"role": "user", "content": query_text},
@@ -1208,7 +1262,7 @@ async def _agent_loop_run(
         turns_used = 0
         final_answer = ""
 
-        for turn in range(max_turns):
+        for turn in range(eff_max_turns):
             turns_used = turn + 1
             try:
                 resp = await client.chat.completions.create(
@@ -1443,7 +1497,15 @@ async def run_agent_benchmark(
             continue
         total += 1
 
-        session = SearchSession(budget_tool_calls=max_turns * 3)
+        # Adaptive budget: enumeration queries ("X 모두/전체/목록",
+        # "list all X") get 3x the turns so the agent can walk
+        # pagination cursors. Mirror src/synaptic/agent_loop.py.
+        from synaptic.agent_loop import _is_enumeration_query
+
+        eff_max_turns = (
+            15 if (max_turns <= 5 and _is_enumeration_query(query_text)) else max_turns
+        )
+        session = SearchSession(budget_tool_calls=eff_max_turns * 3)
         messages = [
             {"role": "system", "content": system},
             {"role": "user", "content": query_text},
@@ -1453,7 +1515,7 @@ async def run_agent_benchmark(
         turns_used = 0
         final_answer = ""
 
-        for turn in range(max_turns):
+        for turn in range(eff_max_turns):
             turns_used = turn + 1
             try:
                 resp = await client.chat.completions.create(
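
The adaptive budget applied in the two loops above can be isolated as a small pure function. The sketch below is an assumption-laden stand-in, not the code in `src/synaptic/agent_loop.py`: `is_enumeration_query` is a hypothetical substring matcher built from the markers listed in the changelog, and the real `_is_enumeration_query` may tokenise or match differently.

```python
# Markers taken from the changelog entry; the matching strategy is assumed.
ENUMERATION_MARKERS = (
    "모두", "전체", "목록", "리스트", "전수",
    "list all", "every", "all of the", "all the", "show me all",
)


def is_enumeration_query(query: str) -> bool:
    """Hypothetical classifier: one marker anywhere, or a leading '모든'."""
    q = query.strip().lower()
    return q.startswith("모든") or any(marker in q for marker in ENUMERATION_MARKERS)


def effective_max_turns(query: str, max_turns: int = 5) -> int:
    """Bump the budget to 15 turns for enumeration queries.

    A caller-provided max_turns above 5 is left untouched, matching the
    `max_turns <= 5` guard in the diff.
    """
    if max_turns <= 5 and is_enumeration_query(query):
        return 15
    return max_turns


def tool_call_budget(query: str, max_turns: int = 5) -> int:
    """Tool-call budget as in the diff: three calls per effective turn."""
    return effective_max_turns(query, max_turns) * 3
```

Keeping the rule in one helper would also let both benchmark loops share it instead of repeating the conditional expression.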