
feat(mcp-readability): compliance orchestrator, LLM judge, and metrics scorer #472

Draft
akangsha7 wants to merge 1 commit into GoogleCloudPlatform:main from akangsha7:feat/mcp-readability-compliance-eval


@akangsha7 akangsha7 commented Jul 1, 2026

Collaborator

Summary

The rest of the MCP-readability work, in one PR: the evaluation half that runs on
top of the mcp_tools generator from #469.

  • McpReadabilityOrchestrator (orchestrator: mcp_readability) — for each
    endpoint: fetch tools via the generator (rendered as a man page), compute
    deterministic size metrics, gather applicable waivers, and judge the man page
    against the style guide with an LLM. Emits one result row per endpoint through
    the shared EvalBench reporters (CSV / BigQuery). Driven entirely by #469's
    datasets/mcp_readability/run_config.yaml.
  • McpToolMetricsScorer — deterministic tool count / estimated tokens /
    token-budget usage.
  • McpStyleComplianceScorer — LLM judge scoring the man page vs the style
    guide (P0/P1/P2 findings, compliance score, waived rules), JSON output.
  • enums + exceptions helpers, aligned to #469's endpoints/exceptions schema
    and the readability_judge run-config block.
  • evalbench.py: dataset_config is now optional, and orchestrators may
    emit results without NL2SQL scores (None-guarded reporter writes).
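
For reviewers, the deterministic metrics listed above (tool count, estimated tokens, token-budget usage) can be pictured with a minimal sketch. This is not the code in this PR: the names `ToolMetrics` and `compute_tool_metrics`, the `token_budget` parameter, and the 4-characters-per-token estimate are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumption: roughly 4 characters per token. This is a common rough
# heuristic, not necessarily the estimator the scorer actually uses.
CHARS_PER_TOKEN = 4


@dataclass(frozen=True)
class ToolMetrics:
    """Hypothetical container for one endpoint's size metrics."""
    tool_count: int
    estimated_tokens: int
    token_budget_usage: float  # fraction of the budget consumed (1.0 = full)


def compute_tool_metrics(tools: list, man_page: str, token_budget: int) -> ToolMetrics:
    """Compute size metrics from the tool list and its rendered man page.

    The result depends only on the inputs, so repeated runs give the same row.
    """
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    # Ceiling division so a short, non-empty page never rounds down to 0 tokens.
    estimated_tokens = -(-len(man_page) // CHARS_PER_TOKEN)
    return ToolMetrics(
        tool_count=len(tools),
        estimated_tokens=estimated_tokens,
        token_budget_usage=estimated_tokens / token_budget,
    )
```

Because nothing here calls an LLM, a metric of this shape can be asserted exactly in unit tests, in contrast to the judge's output.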

Testing

  • pytest evalbench/test/mcp_readability_test.py evalbench/test/mcp_tool_metrics_test.py — 23 passing.
  • Covers enums, exceptions matching, deterministic metrics, man-page formatter,
    file-source generator, LLM-judge parsing/HTML, and an offline end-to-end
    orchestrator run (file source + stubbed LLM) asserting the result schema.
  • Verified the orchestrator builds and runs the offline endpoint directly from
    the canonical run_config.yaml.
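
As a rough illustration of the "LLM-judge parsing" and "stubbed LLM" coverage above, the sketch below parses a judge reply offline. The JSON field names (`findings`, `severity`, `rule_id`, `compliance_score`, `waived_rules`) and the helper names are assumptions for this example; the actual schema lives in the PR's scorer and may differ.

```python
import json
import re

SEVERITIES = ("P0", "P1", "P2")
FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example tidy


def parse_judge_response(raw: str) -> dict:
    """Parse a judge reply (hypothetical schema) into findings grouped by severity.

    Tolerates a reply wrapped in a Markdown code fence by extracting the
    outermost {...} span before decoding.
    """
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge response")
    payload = json.loads(match.group(0))

    by_severity = {sev: [] for sev in SEVERITIES}
    for finding in payload.get("findings", []):
        severity = finding.get("severity")
        if severity not in by_severity:
            raise ValueError(f"unknown severity: {severity!r}")
        by_severity[severity].append(finding)

    return {
        "findings": by_severity,
        "compliance_score": float(payload.get("compliance_score", 0.0)),
        "waived_rules": list(payload.get("waived_rules", [])),
    }


def stub_llm(prompt: str) -> str:
    """Offline stand-in for the LLM call; returns a fixed, fenced JSON reply."""
    body = json.dumps({
        "findings": [{"rule_id": "R1", "severity": "P1", "message": "example"}],
        "compliance_score": 0.9,
        "waived_rules": ["R7"],
    })
    return f"{FENCE}json\n{body}\n{FENCE}"


result = parse_judge_response(stub_llm("judge this man page"))
```

A stub like this keeps the end-to-end test offline and deterministic while still exercising the parsing path.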


google-cla Bot commented Jul 1, 2026


Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@akangsha7 akangsha7 force-pushed the feat/mcp-readability-compliance-eval branch from e675650 to ca1ee79 Compare July 1, 2026 03:53
@akangsha7 akangsha7 self-assigned this Jul 1, 2026
@akangsha7 akangsha7 marked this pull request as draft July 1, 2026 04:36
@akangsha7 akangsha7 removed the request for review from IsmailMehdi July 1, 2026 04:37
@akangsha7 akangsha7 force-pushed the feat/mcp-readability-compliance-eval branch 2 times, most recently from 8ce32ba to e0bb42d Compare July 1, 2026 19:36
…s scorer

The evaluation half of the MCP-readability work, on top of the mcp_tools
generator. For each endpoint the orchestrator fetches tools (rendered as a
man page), computes deterministic size metrics, gathers applicable waivers,
and judges the man page against the style guide with an LLM. One result row
per endpoint is emitted through the shared EvalBench reporters (CSV/BigQuery).

- McpReadabilityOrchestrator (orchestrator: mcp_readability), driven entirely
  by datasets/mcp_readability/run_config.yaml.
- McpToolMetricsScorer: deterministic tool count / estimated tokens /
  token-budget usage.
- McpStyleComplianceScorer: LLM judge scoring the man page vs the style guide
  (P0/P1/P2 findings, compliance score, waived rules), JSON output.
- enums + exceptions helpers, aligned to the endpoints/exceptions schema and
  the readability_judge run-config block.

Reuses the standard evalbench.py report path with NO changes to evalbench.py:
- process() emits a real scores_tf (one standard style_compliance score row
  per endpoint; pass = SUCCESS with no P0 findings), so the run takes the
  existing results+scores branch and the shared analyzer produces a P0-clean
  compliance rate. run_config declares scorers: [style_compliance].
- dataset_config is made optional in the shared helpers: set_session_configs
  always sets it (default None) and load_dataset_from_json returns {} for a
  falsy path, so the datasetless orchestrator loads an empty dataset.
- CsvReporter.store no-ops on None/empty frames (mirroring BigQueryReporter),
  so subset-only report writes are safe.
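
The pass rule and the falsy-path behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the shared helpers themselves: the function names ending in `_sketch`, the row keys (`endpoint`, `scorer`, `passed`), and the aggregation helper are hypothetical.

```python
import json
from pathlib import Path


def load_dataset_from_json_sketch(path):
    """Stand-in for the described behaviour: a falsy path yields an empty dataset."""
    if not path:
        return {}
    return json.loads(Path(path).read_text(encoding="utf-8"))


def style_compliance_row_sketch(endpoint: str, status: str, p0_findings: list) -> dict:
    """Build one score row per endpoint (hypothetical keys).

    Pass rule from the commit message: SUCCESS status and no P0 findings.
    """
    passed = status == "SUCCESS" and len(p0_findings) == 0
    return {"endpoint": endpoint, "scorer": "style_compliance", "passed": passed}


def p0_clean_rate_sketch(rows: list) -> float:
    """Fraction of endpoints that passed; defined as 0.0 for an empty run."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row["passed"]) / len(rows)


rows = [
    style_compliance_row_sketch("endpoint-a", "SUCCESS", []),
    style_compliance_row_sketch("endpoint-b", "SUCCESS", [{"rule_id": "R1"}]),
]
rate = p0_clean_rate_sketch(rows)
```
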

Testing: pytest evalbench/test/mcp_readability_test.py
evalbench/test/mcp_tool_metrics_test.py evalbench/test/evalbench_test.py
— 31 passing, including an offline end-to-end orchestrator run, an
analyzer-integration test proving the emitted scores aggregate correctly, and
the existing evalbench.py tests (unchanged).
@akangsha7 akangsha7 force-pushed the feat/mcp-readability-compliance-eval branch from e0bb42d to 6d83453 Compare July 2, 2026 00:33