Commit 7265c7b
rescore(benchmark): qwen3.5:35b full 45-triple rescore (claude-opus-4-7-rubric-v1)
Adds a 2nd observation for qwen3.5:35b, rescoring all 45 translations
end-to-end with a fresh Opus 4.7 pass under the current rubric (no
post-hoc -reranked suffix, pre-§5).
Stored alongside the original (-reranked) at
qwen3.5-35b_rescore.json. Aggregator will surface n_obs=2 with median.
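
The n_obs/median behavior described above can be sketched as follows. This is a minimal illustration, not the aggregator's actual code: the `aggregate` function, the triple keys, and the in-memory layout are hypothetical, and the scores are the original/rescore pairs quoted in this commit message.

```python
from statistics import median


def aggregate(observations):
    """Collapse each triple's list of scores into n_obs and the median.

    `observations` maps a triple id to the list of overall scores it has
    received (hypothetical layout; the real JSON schema is not shown here).
    """
    return {
        triple: {"n_obs": len(scores), "median": median(scores)}
        for triple, scores in observations.items()
    }


# Original judge score first, rescore second (values from this commit message).
observations = {
    "ko->en/unsu_jouen_nal": [6.7, 4.5],
    "en->es/emerson_self_reliance": [6.7, 8.5],
}

summary = aggregate(observations)
for triple, stats in summary.items():
    print(triple, stats["n_obs"], round(stats["median"], 2))
```

With only two observations the median is just the midpoint of the two scores, so a third observation would be needed before the median starts discounting an outlier.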
Notable score movements vs the original judge:
- ko->en (unsu_jouen_nal): 6.7 -> 4.5 (caught a currency hallucination,
jeon vs. won, and the omission of the directional "to" Dongkwang School)
- en->es (emerson_self_reliance): 6.7 -> 8.5 (Spanish "traicionar"
carries the literary "reveal" sense; the original judge wrongly
penalized it as a contresens, mirroring the Chinese rendering's
actual reversal)
- en->zh-Hans (pride_prejudice): 8.0 -> 7.0 (Austen's iconic opening
reordered, ironic "must" weakened to 总想)
- en->fr (dorian_gray): scored 6.0 (cumulative lexical errors:
odorat for odeur, délié for délicat, l'épines agreement break,
laburnum untranslated, studio -> pièce)
Global avg overall: 7.70 -> 7.59 (-0.11). 18/45 moved by >=0.5.
§5 rerank skipped per session decision.
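
The movement statistics above could be reproduced along these lines. This is a sketch under assumptions: `score_movements` and the triple keys are hypothetical names, and the data covers only the three triples whose before/after scores are quoted in this message, not the full set of 45.

```python
def score_movements(original, rescored, threshold=0.5):
    """Return per-triple deltas and the triples that moved by >= threshold."""
    deltas = {
        triple: round(rescored[triple] - original[triple], 2)
        for triple in original
        if triple in rescored
    }
    moved = [t for t, d in deltas.items() if abs(d) >= threshold]
    return deltas, moved


# Before/after pairs quoted in this commit message (subset of the 45 triples).
original = {
    "ko->en/unsu_jouen_nal": 6.7,
    "en->es/emerson_self_reliance": 6.7,
    "en->zh-Hans/pride_prejudice": 8.0,
}
rescored = {
    "ko->en/unsu_jouen_nal": 4.5,
    "en->es/emerson_self_reliance": 8.5,
    "en->zh-Hans/pride_prejudice": 7.0,
}

deltas, moved = score_movements(original, rescored)
print(deltas)
print(f"{len(moved)}/{len(deltas)} moved by >= 0.5")

# The global average shift reported above is a plain difference of the means.
global_delta = round(7.59 - 7.70, 2)
print(global_delta)
```

Rounding the deltas to two decimals keeps floating-point noise out of the threshold comparison and the printed summary.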
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent f25e6a4 commit 7265c7b
1 file changed
Lines changed: 693 additions & 0 deletions