feat(mcp-readability): compliance orchestrator, LLM judge, and metrics scorer #472
Draft
akangsha7 wants to merge 1 commit into
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up-to-date status, view the checks section at the bottom of the pull request.
Force-pushed from e675650 to ca1ee79.
Force-pushed from 8ce32ba to e0bb42d.
…s scorer
The evaluation half of the MCP-readability work, on top of the mcp_tools
generator. For each endpoint the orchestrator fetches tools (rendered as a
man page), computes deterministic size metrics, gathers applicable waivers,
and judges the man page against the style guide with an LLM. One result row
per endpoint is emitted through the shared EvalBench reporters (CSV/BigQuery).
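The per-endpoint flow described above can be sketched roughly as follows. This is a minimal illustration, not the orchestrator's actual code: the function name `evaluate_endpoints`, the injected callables, and the row keys are all hypothetical stand-ins for the generator, scorers, and waiver lookup in this PR.

```python
from typing import Any, Callable, Dict, List


def evaluate_endpoints(
    endpoints: List[str],
    fetch_man_page: Callable[[str], str],
    compute_metrics: Callable[[str], Dict[str, Any]],
    gather_waivers: Callable[[str], List[str]],
    judge: Callable[[str, List[str]], Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Return one result row per endpoint (hypothetical row layout)."""
    rows: List[Dict[str, Any]] = []
    for endpoint in endpoints:
        # 1. Fetch the endpoint's tools, rendered as a man page.
        man_page = fetch_man_page(endpoint)
        # 2. Deterministic size metrics, computed without any LLM call.
        metrics = compute_metrics(man_page)
        # 3. Waivers that apply to this endpoint.
        waivers = gather_waivers(endpoint)
        # 4. LLM judgement of the man page against the style guide.
        verdict = judge(man_page, waivers)
        rows.append(
            {"endpoint": endpoint, **metrics, "waived_rules": waivers, **verdict}
        )
    return rows
```

Passing the four steps in as callables is only a convenience for the sketch; it makes the loop easy to exercise offline with stubs, in the spirit of the offline end-to-end test mentioned under Testing.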
- McpReadabilityOrchestrator (orchestrator: mcp_readability), driven entirely
by datasets/mcp_readability/run_config.yaml.
- McpToolMetricsScorer: deterministic tool count / estimated tokens /
token-budget usage.
- McpStyleComplianceScorer: LLM judge scoring the man page vs the style guide
(P0/P1/P2 findings, compliance score, waived rules), JSON output.
- enums + exceptions helpers, aligned to the endpoints/exceptions schema and
the readability_judge run-config block.
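As an illustration of what "deterministic tool count / estimated tokens / token-budget usage" could look like, here is a small sketch. The 4-characters-per-token heuristic, the `ToolMetrics` dataclass, and the `compute_tool_metrics` signature are assumptions made for this example; the PR's McpToolMetricsScorer may estimate tokens differently.

```python
import math
from dataclasses import dataclass
from typing import Any, Dict, List

# Assumption: a rough characters-per-token ratio, not taken from the PR.
CHARS_PER_TOKEN = 4


@dataclass
class ToolMetrics:
    tool_count: int
    estimated_tokens: int
    token_budget_usage: float  # fraction of the budget consumed


def compute_tool_metrics(
    tools: List[Dict[str, Any]], man_page: str, token_budget: int
) -> ToolMetrics:
    """Compute size metrics from the tool list and its rendered man page."""
    estimated_tokens = math.ceil(len(man_page) / CHARS_PER_TOKEN)
    # Guard against a zero or negative budget rather than dividing by it.
    usage = estimated_tokens / token_budget if token_budget > 0 else 0.0
    return ToolMetrics(len(tools), estimated_tokens, usage)
```

Because nothing here calls a model, the same input always yields the same numbers, which is what makes these metrics safe to assert on in unit tests.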
Reuses the standard evalbench.py report path with NO changes to evalbench.py:
- process() emits a real scores_tf (one standard style_compliance score row
per endpoint; pass = SUCCESS with no P0 findings), so the run takes the
existing results+scores branch and the shared analyzer produces a P0-clean
compliance rate. run_config declares scorers: [style_compliance].
- dataset_config is made optional in the shared helpers: set_session_configs
always sets it (default None) and load_dataset_from_json returns {} for a
falsy path, so the datasetless orchestrator loads an empty dataset.
- CsvReporter.store no-ops on None/empty frames (mirroring BigQueryReporter),
so subset-only report writes are safe.
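The two guards in the last bullets (an empty dataset for a falsy path, and a no-op store for None/empty frames) might look roughly like the sketch below. The names `load_dataset_sketch` and `store_rows_sketch` are hypothetical, and a list of dict rows stands in for the frame so the example needs only the standard library; with a pandas DataFrame the equivalent check would be `frame is None or frame.empty`.

```python
import csv
import json
from typing import Any, Dict, List, Optional


def load_dataset_sketch(path: Optional[str]) -> Dict[str, Any]:
    """Return {} for a falsy path so a datasetless orchestrator can run."""
    if not path:
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def store_rows_sketch(rows: Optional[List[Dict[str, Any]]], out_path: str) -> bool:
    """Write rows as CSV; do nothing (and report False) for None or empty input."""
    if not rows:
        return False  # no-op: no file is created
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return True
```

Returning early instead of raising keeps callers simple: an orchestrator that only produces some of the report tables can hand over whatever it has.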
Testing: pytest evalbench/test/mcp_readability_test.py
evalbench/test/mcp_tool_metrics_test.py evalbench/test/evalbench_test.py
— 31 passing, including an offline end-to-end orchestrator run, an
analyzer-integration test proving the emitted scores aggregate correctly, and
the existing evalbench.py tests (unchanged).
Force-pushed from e0bb42d to 6d83453.
Summary
The rest of the MCP-readability work, in one PR: the evaluation half that runs on
top of the `mcp_tools` generator from #469.
- `McpReadabilityOrchestrator` (`orchestrator: mcp_readability`): for each
  endpoint, fetch tools via the generator (rendered as a man page), compute
  deterministic size metrics, gather applicable waivers, and judge the man page
  against the style guide with an LLM. Emits one result row per endpoint through
  the shared EvalBench reporters (CSV / BigQuery). Driven entirely by #469's
  `datasets/mcp_readability/run_config.yaml`.
- `McpToolMetricsScorer`: deterministic tool count / estimated tokens /
  token-budget usage.
- `McpStyleComplianceScorer`: LLM judge scoring the man page vs the style
  guide (P0/P1/P2 findings, compliance score, waived rules), JSON output.
- … and the `readability_judge` run-config block.
- `evalbench.py`: `dataset_config` is now optional, and orchestrators may
  emit results without NL2SQL scores (None-guarded reporter writes).
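To make the judge's JSON output more concrete, here is a sketch of parsing a verdict and applying the pass rule stated in the commit message (pass = SUCCESS with no P0 findings). The JSON field names (`findings`, `severity`, `compliance_score`, `waived_rules`) and all function names are assumptions for illustration; the actual schema is defined by McpStyleComplianceScorer.

```python
import json
from typing import Any, Dict, List


def parse_judge_output(raw: str) -> Dict[str, Any]:
    """Parse the judge's JSON reply and count findings per severity."""
    data = json.loads(raw)
    counts = {"P0": 0, "P1": 0, "P2": 0}
    for finding in data.get("findings", []):
        severity = finding.get("severity")
        if severity in counts:
            counts[severity] += 1
    return {
        "counts": counts,
        "compliance_score": data.get("compliance_score"),
        "waived_rules": data.get("waived_rules", []),
    }


def is_pass(status: str, counts: Dict[str, int]) -> bool:
    """An endpoint passes when the run succeeded and has no P0 findings."""
    return status == "SUCCESS" and counts["P0"] == 0


def p0_clean_rate(passes: List[bool]) -> float:
    """Share of endpoints that pass; 0.0 when there are no endpoints."""
    return sum(passes) / len(passes) if passes else 0.0
```

For example, an endpoint whose verdict lists only P1 and P2 findings still counts as passing under this rule, while a single P0 finding fails it regardless of the compliance score.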
Testing
`pytest evalbench/test/mcp_readability_test.py evalbench/test/mcp_tool_metrics_test.py`: 23 passing.
- … file-source generator, LLM-judge parsing/HTML, and an offline end-to-end
  orchestrator run (file source + stubbed LLM) asserting the result schema.
- … the canonical `run_config.yaml`.