feat(mcp-readability): compliance orchestrator, LLM judge, and metrics scorer by akangsha7 · Pull Request #472 · GoogleCloudPlatform/evalbench

akangsha7 · 2026-07-01T03:36:51Z

Summary

The rest of the MCP-readability work, in one PR: the evaluation half that runs on
top of the mcp_tools generator from #469.

- McpReadabilityOrchestrator (orchestrator: mcp_readability): for each
  endpoint, fetch tools via the generator (rendered as a man page), compute
  deterministic size metrics, gather applicable waivers, and judge the man page
  against the style guide with an LLM. Emits one result row per endpoint through
  the shared EvalBench reporters (CSV / BigQuery). Driven entirely by #469's
  datasets/mcp_readability/run_config.yaml.
- McpToolMetricsScorer: deterministic tool count / estimated tokens /
  token-budget usage.
- McpStyleComplianceScorer: LLM judge scoring the man page vs the style
  guide (P0/P1/P2 findings, compliance score, waived rules), JSON output.
- enums + exceptions helpers, aligned to #469's endpoints/exceptions schema
  and the readability_judge run-config block.
- evalbench.py: dataset_config is now optional, and orchestrators may
  emit results without NL2SQL scores (None-guarded reporter writes).
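
For reviewers who want a feel for the deterministic metrics, here is a minimal sketch. It is not the code in this PR: the names `ToolMetrics` and `tool_metrics`, the `token_budget` parameter, and the rough 4-characters-per-token estimate are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolMetrics:
    tool_count: int
    estimated_tokens: int
    budget_usage: float  # fraction of the token budget consumed


def tool_metrics(tools: list[dict], man_page: str, token_budget: int) -> ToolMetrics:
    """Compute size metrics without calling a model (hypothetical helper)."""
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    # Assumption: roughly 4 characters per token, rounded up.
    # Integer-only arithmetic keeps the result deterministic.
    estimated_tokens = (len(man_page) + 3) // 4
    return ToolMetrics(
        tool_count=len(tools),
        estimated_tokens=estimated_tokens,
        budget_usage=estimated_tokens / token_budget,
    )
```

A `budget_usage` above 1.0 would mean the rendered man page exceeds the budget. Whether the real scorer caps or flags that case is not stated in this description.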

Testing

pytest evalbench/test/mcp_readability_test.py evalbench/test/mcp_tool_metrics_test.py — 23 passing.
Covers enums, exceptions matching, deterministic metrics, man-page formatter,
file-source generator, LLM-judge parsing/HTML, and an offline end-to-end
orchestrator run (file source + stubbed LLM) asserting the result schema.
Verified the orchestrator builds and runs the offline endpoint directly from
the canonical run_config.yaml.
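
As an illustration of what the judge-parsing tests with a stubbed LLM might look like, here is a hedged sketch. The helper `parse_judge_output`, the stub `stub_llm`, the JSON keys `compliance_score` and `findings`, and the code-fence stripping are assumptions for the example, not names or behavior confirmed by this PR.

```python
import json

FENCE = "`" * 3  # a Markdown code-fence marker, built without literal backticks


def parse_judge_output(raw: str) -> dict:
    """Parse a judge reply that should be a JSON object (hypothetical helper)."""
    text = raw.strip()
    # Assumption: the model may wrap its JSON in a Markdown code fence.
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines)
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    return data


def stub_llm(prompt: str) -> str:
    """Stand-in for the model so the test runs offline."""
    return FENCE + 'json\n{"compliance_score": 0.9, "findings": []}\n' + FENCE


result = parse_judge_output(stub_llm("judge this man page"))
```

Stubbing at the string level keeps the test independent of any model client, so it can run with no network access.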

google-cla · 2026-07-01T03:37:14Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

…s scorer

The evaluation half of the MCP-readability work, on top of the mcp_tools generator. For each endpoint the orchestrator fetches tools (rendered as a man page), computes deterministic size metrics, gathers applicable waivers, and judges the man page against the style guide with an LLM. One result row per endpoint is emitted through the shared EvalBench reporters (CSV/BigQuery).

- McpReadabilityOrchestrator (orchestrator: mcp_readability), driven entirely by datasets/mcp_readability/run_config.yaml.
- McpToolMetricsScorer: deterministic tool count / estimated tokens / token-budget usage.
- McpStyleComplianceScorer: LLM judge scoring the man page vs the style guide (P0/P1/P2 findings, compliance score, waived rules), JSON output.
- enums + exceptions helpers, aligned to the endpoints/exceptions schema and the readability_judge run-config block.

Reuses the standard evalbench.py report path with NO changes to evalbench.py:

- process() emits a real scores_tf (one standard style_compliance score row per endpoint; pass = SUCCESS with no P0 findings), so the run takes the existing results+scores branch and the shared analyzer produces a P0-clean compliance rate. run_config declares scorers: [style_compliance].
- dataset_config is made optional in the shared helpers: set_session_configs always sets it (default None) and load_dataset_from_json returns {} for a falsy path, so the datasetless orchestrator loads an empty dataset.
- CsvReporter.store no-ops on None/empty frames (mirroring BigQueryReporter), so subset-only report writes are safe.

Testing: pytest evalbench/test/mcp_readability_test.py evalbench/test/mcp_tool_metrics_test.py evalbench/test/evalbench_test.py (31 passing), including an offline end-to-end orchestrator run, an analyzer-integration test proving the emitted scores aggregate correctly, and the existing evalbench.py tests (unchanged).
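
The pass rule in the commit message (pass = SUCCESS with no P0 findings) and the resulting P0-clean compliance rate can be sketched as follows. This is only an illustration: the field names `status`, `findings`, and `severity`, and the helper names `is_p0_clean` and `p0_clean_rate`, are assumptions. The actual result schema and analyzer live in the PR.

```python
from typing import Iterable, Mapping


def is_p0_clean(result: Mapping) -> bool:
    """Pass when the judge run succeeded and reported no P0 findings."""
    if result.get("status") != "SUCCESS":
        return False
    findings = result.get("findings") or []  # treat None as "no findings"
    return not any(f.get("severity") == "P0" for f in findings)


def p0_clean_rate(results: Iterable[Mapping]) -> float:
    """Fraction of endpoints that pass; 0.0 for an empty run (an assumption)."""
    rows = list(results)
    if not rows:
        return 0.0
    return sum(is_p0_clean(r) for r in rows) / len(rows)


rows = [
    {"status": "SUCCESS", "findings": [{"severity": "P1"}]},  # pass
    {"status": "SUCCESS", "findings": [{"severity": "P0"}]},  # fail: P0 present
    {"status": "ERROR", "findings": []},                      # fail: judge did not succeed
    {"status": "SUCCESS", "findings": None},                  # pass
]
rate = p0_clean_rate(rows)
```

Under this rule, P1 and P2 findings do not fail an endpoint on their own. Only a failed judge run or a P0 finding does.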

akangsha7 requested a review from IsmailMehdi as a code owner July 1, 2026 03:36

akangsha7 mentioned this pull request Jul 1, 2026

Feat/mcp readability metrics scorer #470

Closed

akangsha7 force-pushed the feat/mcp-readability-compliance-eval branch from e675650 to ca1ee79 Compare July 1, 2026 03:53

akangsha7 self-assigned this Jul 1, 2026

akangsha7 marked this pull request as draft July 1, 2026 04:36

akangsha7 removed the request for review from IsmailMehdi July 1, 2026 04:37

akangsha7 force-pushed the feat/mcp-readability-compliance-eval branch 2 times, most recently from 8ce32ba to e0bb42d Compare July 1, 2026 19:36

akangsha7 force-pushed the feat/mcp-readability-compliance-eval branch from e0bb42d to 6d83453 Compare July 2, 2026 00:33

akangsha7 wants to merge 1 commit into GoogleCloudPlatform:main from akangsha7:feat/mcp-readability-compliance-eval
