
Commit b2f94dc

SonAIengine and claude committed
feat(v0.21): Phase 1.5+1.6 — cross-domain bench wired + 25% baseline established
PHASE 1.5 (eval/run_all.py + eval/unified.py):
- run_agent_benchmark loop now respects validation:{type:domain_coverage} queries: it no longer skips on empty relevant_docs and scores by per-domain coverage of found_ids instead of doc-id matching.
- _count_domains_for_ids: bulk SQL helper that resolves the agent's heterogeneous found_ids (real node ids + titles + raw doc_id hashes from properties.doc_id extraction) by looking up each candidate against id / title / properties.doc_id in one pass. The first implementation only checked direct id and reported 0 coverage on cross-domain runs even when the corpus IS multi-domain.
- load_bench_log regex widened to accept xd012-style qids (1-3 letters + 3-4 digits) so cross_domain.json results parse.
- Cross-Domain DatasetConfig added (quick=False, opt-in via --agent-dataset). agent_datasets filter widened to match it.

PHASE 1.6 (measurement):
Ran the agent (Qwen3.5-27B vLLM, T=0/seed=42) against the combined metacorpus.sqlite on cross_domain.json: solved=3/12 (25 %), 1151 s.

Hits occur when query keywords explicitly anchor distinct domain content (xd007 'user feedback records', xd008 '방송 마케팅' ("broadcast marketing"), xd009 '상품 판매 데이터' ("product sales data")). Misses are dominated by ESG/카본/인권 (ESG/carbon/human rights) themes, where KRRA's explicit ESG category pulls FTS top-rank and the agent never branches to assort/x2bee.

Combined UnifiedScore on v0.20.1 + cross-domain logs: 0.7205 across 209 queries, the first time all 8 dimensions are covered:
  lang:ko 0.849 n=172 | lang:en 0.333 n=3 | lang:mixed 0.824 n=34
  multi_hop 0.756 n=45 | enumeration 0.859 n=92 | structured 0.888 n=98
  cross_domain 0.250 n=12 | cross_language 0.784 n=37

Phase 2 target: lift cross_domain 0.25 → 0.50+ via domain-aware agent (system-prompt enumeration of available domains + cross-domain intent routing).

Tests +2 (title-based bulk lookup + raw doc_id hash properties scan) on top of the existing 6, total 8 in test_eval_domain_coverage.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
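The widened qid rule in the commit message (1-3 letters followed by 3-4 digits) could be expressed roughly as below. This is only a sketch: `QID_PATTERN` and `is_valid_qid` are hypothetical names, and the actual regex inside `load_bench_log` is not shown on this page.

```python
import re

# Hypothetical pattern: 1-3 ASCII letters followed by 3-4 digits, e.g. "xd012".
QID_PATTERN = re.compile(r"[A-Za-z]{1,3}\d{3,4}")


def is_valid_qid(qid: str) -> bool:
    """Return True when the whole string has the letters-then-digits qid shape."""
    return QID_PATTERN.fullmatch(qid) is not None
```

Using `fullmatch` rather than `search` keeps longer tokens that merely contain a qid-like substring from being accepted.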
1 parent 0076fd9 commit b2f94dc

5 files changed

Lines changed: 589 additions & 18 deletions


CHANGELOG.md

Lines changed: 64 additions & 0 deletions
@@ -6,6 +6,70 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
## [Unreleased]

### Measured — v0.21 Phase 1.5/1.6: cross-domain federation bench (3/12 = 25 % baseline)
End-to-end demo of the Phase 1 stack: the Phase 1.4 MetaCorpus combiner,
the Phase 1.5 ``validation: domain_coverage`` scoring path, and the
per-domain node tally helper. The agent runs against the combined corpus
(``metacorpus.sqlite`` = krra + assort + x2bee, 123,877 nodes) on the
12 cross-domain queries from ``eval/data/queries/cross_domain.json``
and is scored by per-domain coverage of its found_ids instead of
doc-id matching.

**Result: 3 / 12 hits = 25 % cross-domain coverage**, runtime 1151 s.

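To make the per-domain tally concrete, here is a minimal sketch of resolving mixed found_ids (node ids, titles, raw doc_id hashes) to domains. It assumes a hypothetical `nodes(id, title, domain, properties)` table with `properties` stored as JSON text; the real schema and the real `_count_domains_for_ids` are not shown in this commit. For simplicity it matches in Python during a single table scan rather than pushing the match into SQL.

```python
import json
import sqlite3
from collections import Counter


def count_domains_for_ids(conn, found_ids):
    """Tally matched nodes per domain; each node is counted at most once."""
    wanted = set(found_ids)
    counts = Counter()
    if not wanted:
        return counts
    # One pass over the hypothetical nodes table.
    rows = conn.execute("SELECT id, title, domain, properties FROM nodes")
    for node_id, title, domain, props in rows:
        doc_id = json.loads(props or "{}").get("doc_id")
        # A candidate may be a node id, a title, or a raw doc_id hash.
        if node_id in wanted or title in wanted or doc_id in wanted:
            counts[domain] += 1
    return counts


# Tiny in-memory corpus for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (id TEXT PRIMARY KEY, title TEXT, domain TEXT, properties TEXT)"
)
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?)",
    [
        ("n1", "ESG report", "krra", '{"doc_id": "a1b2"}'),
        ("n2", "Feedback table", "x2bee", '{"doc_id": "c3d4"}'),
        ("n3", "Broadcast plan", "assort", "{}"),
    ],
)
tally = count_domains_for_ids(conn, ["n1", "Feedback table", "c3d4", "unknown"])
```

Checking only the `id` column here would miss the title and doc_id matches, which is the failure mode the commit message attributes to the first implementation.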
Per-query split:

| qid | required domains | got | hit |
|---|---|---|:---:|
| xd001 | assort, krra | krra=112, assort=0 | ✗ |
| xd002 | krra, x2bee | krra=80, x2bee=0 | ✗ |
| xd003 | assort, krra | krra=51, assort=0 | ✗ |
| xd004 | krra, x2bee | krra=48, x2bee=0 | ✗ |
| xd005 | assort, krra | krra=80, assort=0 | ✗ |
| xd006 | assort, krra (EN) | krra=96, assort=0 | ✗ |
| **xd007** | **krra, x2bee (EN)** | **krra=91, x2bee=20** | **✓** |
| **xd008** | **assort, krra** | **krra=100, assort=10** | **✓** |
| **xd009** | **krra, x2bee** | **krra=12, x2bee=30** | **✓** |
| xd010 | krra, x2bee, assort | krra=74, x2bee=56, assort=0 | ✗ (2 of 3) |
| xd011 | assort, krra (EN) | krra=36, assort=0 | ✗ |
| xd012 | krra, x2bee, assort | krra=83, x2bee=0, assort=0 | ✗ |

Pattern: **cross-domain succeeds when the query keywords hit distinct
domain-anchored content** ("user feedback records" → x2bee feedback
table; "방송 마케팅" ("broadcast marketing") → assort broadcasts table).
It fails when one domain has explicit category coverage that dominates
FTS ranking (ESG/carbon → KRRA's ``ESG 및 지속가능성`` ("ESG and
sustainability") category). All-or-nothing ``min_docs_per_domain=1``
validation is harsh: of the two 3-domain queries, xd010 covered 2 of 3
required domains and xd012 covered 1 of 3, and both scored as misses.
Partial-credit scoring is a v0.21+ refinement candidate.

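The difference between the all-or-nothing rule and a partial-credit variant can be sketched as follows. `domain_coverage` is a hypothetical helper written for illustration; only the ``min_docs_per_domain`` threshold and the xd010 counts come from the changelog text.

```python
def domain_coverage(required, got, min_docs_per_domain=1):
    """Return (strict_hit, partial_credit) for one query.

    required: list of domain names the query must cover.
    got: mapping of domain name -> number of found nodes in that domain.
    """
    covered = [d for d in required if got.get(d, 0) >= min_docs_per_domain]
    strict_hit = len(covered) == len(required)
    partial_credit = len(covered) / len(required) if required else 0.0
    return strict_hit, partial_credit


# xd010: krra and x2bee are covered, assort is not.
strict, partial = domain_coverage(
    ["krra", "x2bee", "assort"], {"krra": 74, "x2bee": 56, "assort": 0}
)
```

Under the strict rule xd010 is a miss, while a fractional score would credit it with two thirds of a hit.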
Combined UnifiedScore on the v0.20.1 + cross-domain logs: **0.7205**
across 209 queries (all 8 dimensions covered for the first time):

```
lang:ko         hit=0.849  n=172  (strong)
lang:en         hit=0.333  n=  3  (under-covered, 3 EN cross-domain queries only)
lang:mixed      hit=0.824  n= 34
multi_hop       hit=0.756  n= 45
enumeration     hit=0.859  n= 92
structured      hit=0.888  n= 98
cross_domain    hit=0.250  n= 12  ← NEW BASELINE (was 0%, no coverage)
cross_language  hit=0.784  n= 37
```

Saved to ``eval/baselines/unified-v021-with-cross-domain.json`` for
future ``--compare`` runs.

**Phase 2 target**: domain-aware agent (system-prompt enumeration of
available domains + intent classifier routing fan-out searches across
multiple domains rather than one). This should lift cross_domain from
0.25 to 0.50+, which at the cross_domain weight of 0.1 would add
≥0.025 to UnifiedScore.

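One possible reading of "fan-out searches across multiple domains" is to query each domain separately and interleave the ranked lists, so a single domain cannot fill the whole top-k. The sketch below is an assumption for illustration, not the planned Phase 2 design: `fan_out_merge` and `fake_search` are hypothetical, and `fake_search` merely stands in for a domain-scoped FTS call.

```python
from itertools import zip_longest


def fan_out_merge(search, query, domains, k=10):
    """Run `search` once per domain and interleave the results round-robin."""
    per_domain = [search(query, domain)[:k] for domain in domains]
    merged = []
    # Take rank 1 from every domain, then rank 2, and so on.
    for tier in zip_longest(*per_domain):
        merged.extend(hit for hit in tier if hit is not None)
    return merged[:k]


def fake_search(query, domain):
    """Toy stand-in: returns a fixed ranked list per domain."""
    corpus = {"krra": ["k1", "k2", "k3"], "x2bee": ["x1"], "assort": []}
    return corpus[domain]


top = fan_out_merge(fake_search, "esg", ["krra", "x2bee", "assort"], k=4)
```

With a single global ranking the three krra hits would come first; here the x2bee hit is placed second even though krra has more matches.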
**Phase 1 status**: complete. The framework correctly identified the
real architectural gap (single-domain search bias) instead of hiding
it in per-bench numbers.

### Measured — v0.20.1: prompt revert recovers + improves on v0.19

After the v0.20 cursor follow-through prompt was reverted (commit
Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
{
  "unified_score": 0.7205177538454509,
  "query_count": 209,
  "n_hits": 175,
  "per_dimension": {
    "lang:ko": {
      "n_queries": 172,
      "n_hits": 146,
      "hit_rate": 0.8488,
      "weight": 0.3
    },
    "lang:en": {
      "n_queries": 3,
      "n_hits": 1,
      "hit_rate": 0.3333,
      "weight": 0.1
    },
    "lang:mixed": {
      "n_queries": 34,
      "n_hits": 28,
      "hit_rate": 0.8235,
      "weight": 0.05
    },
    "recall:multi_hop": {
      "n_queries": 45,
      "n_hits": 34,
      "hit_rate": 0.7556,
      "weight": 0.15
    },
    "recall:enumeration": {
      "n_queries": 92,
      "n_hits": 79,
      "hit_rate": 0.8587,
      "weight": 0.1
    },
    "structured": {
      "n_queries": 98,
      "n_hits": 87,
      "hit_rate": 0.8878,
      "weight": 0.1
    },
    "cross_domain": {
      "n_queries": 12,
      "n_hits": 3,
      "hit_rate": 0.25,
      "weight": 0.1
    },
    "cross_language": {
      "n_queries": 37,
      "n_hits": 29,
      "hit_rate": 0.7838,
      "weight": 0.1
    }
  },
  "per_bench": {
    "v020-assort-conv": {
      "n_queries": 7,
      "n_hits": 6,
      "hit_rate": 0.8571
    },
    "v020-assort-hard": {
      "n_queries": 33,
      "n_hits": 29,
      "hit_rate": 0.8788
    },
    "v020-krra-hard": {
      "n_queries": 39,
      "n_hits": 31,
      "hit_rate": 0.7949
    },
    "v020-x2bee-conv": {
      "n_queries": 27,
      "n_hits": 23,
      "hit_rate": 0.8519
    },
    "v020-x2bee-hard": {
      "n_queries": 19,
      "n_hits": 18,
      "hit_rate": 0.9474
    },
    "v0201-assort-hard": {
      "n_queries": 33,
      "n_hits": 32,
      "hit_rate": 0.9697
    },
    "v0201-krra-hard": {
      "n_queries": 39,
      "n_hits": 33,
      "hit_rate": 0.8462
    },
    "v021-cross-domain": {
      "n_queries": 12,
      "n_hits": 3,
      "hit_rate": 0.25
    }
  },
  "weights": {
    "lang:ko": 0.3,
    "lang:en": 0.1,
    "lang:mixed": 0.05,
    "recall:multi_hop": 0.15,
    "recall:enumeration": 0.1,
    "structured": 0.1,
    "cross_domain": 0.1,
    "cross_language": 0.1
  },
  "notes": []
}
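The stored ``unified_score`` is numerically consistent with a plain weighted sum of per-dimension hit rates (the weights add up to 1.0). That formula is inferred from the figures in this file rather than confirmed by the source, so treat the following as a sanity-check sketch; `unified_score` here is a hypothetical re-implementation. It also reproduces the 0.025 gain quoted for lifting cross_domain to 0.50.

```python
# Weights and (n_hits, n_queries) pairs copied from the baseline JSON.
weights = {
    "lang:ko": 0.3, "lang:en": 0.1, "lang:mixed": 0.05,
    "recall:multi_hop": 0.15, "recall:enumeration": 0.1,
    "structured": 0.1, "cross_domain": 0.1, "cross_language": 0.1,
}
hits = {
    "lang:ko": (146, 172), "lang:en": (1, 3), "lang:mixed": (28, 34),
    "recall:multi_hop": (34, 45), "recall:enumeration": (79, 92),
    "structured": (87, 98), "cross_domain": (3, 12), "cross_language": (29, 37),
}


def unified_score(weights, hit_rates):
    """Weighted sum of per-dimension hit rates (assumed formula)."""
    return sum(weights[dim] * hit_rates[dim] for dim in weights)


rates = {dim: h / n for dim, (h, n) in hits.items()}
baseline = unified_score(weights, rates)

# What-if: cross_domain hit rate lifted from 0.25 to 0.50, everything else fixed.
what_if = unified_score(weights, {**rates, "cross_domain": 0.50})
gain = what_if - baseline

print(round(baseline, 4))  # 0.7205
print(round(gain, 3))      # 0.025
```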
