chore(eval): remove Gemini and Yutori agents by felarof99 · Pull Request #843 · browseros-ai/BrowserOS

Felarof (felarof99) · 2026-04-28T17:35:35Z

Summary

Remove the Gemini Computer Use and Yutori Navigator eval agent implementations and sample configs.
Prune active registry, schema, dashboard, docs/report, validation, package, and lockfile references to those eval agent types.
This branch also contains the existing dashboard test-run/save-to-reference commit that was already on the local branch before this removal was pushed.

Test plan

PATH="$HOME/.bun/bin:$PATH" bun install --frozen-lockfile
PATH="$HOME/.bun/bin:$PATH" bun run --filter @browseros/eval typecheck
PATH="$HOME/.bun/bin:$PATH" bun run lint
rg found no remaining Gemini/Yutori eval-agent references outside ignored/generated files

🤖 Generated with Claude Code

* feat: add Twitter share referral UI and expose browserosId
  When credits are exhausted, users now see a "Share on Twitter" CTA with a pre-filled tweet URL and an input to paste their tweet link. Reusable ShareForCredits component used in both ChatError and UsagePage. Server's GET /credits now includes browserosId for the extension to pass to the referral service.
  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* fix: rebuild chat session on provider change
* fix: address Greptile review comments
  - Move referral service URL to EXTERNAL_URLS
  - Guard submitReferral on !response.ok
  - Remove stale TODO comment
  Co-Authored-By: Claude Opus 4.6 <[email protected]>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

…fallback (#731)
* feat(referral): show share rules and lower default daily limit fallback
  Surface the three referral validation rules (must mention @browserOS_ai, posted within last 30 minutes, single-use) directly in the ShareForCredits UI so users understand submission requirements before pasting a tweet link. Also align the UsagePage daily-limit fallback (used while credits load) with the gateway default of 50.
  Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
* fix(usage): handle credit balance exceeding daily limit
  The "Credits used today" stat was computed as `dailyLimit - credits`, which goes negative once a referral bonus pushes the balance above the daily cap (e.g. balance 294 with cap 100 showed "-194 of 100"). Clamp the math to zero and surface a separate "Bonus credits" stat when the balance exceeds the daily allowance.
  Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
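The clamping described in the fix(usage) commit can be sketched as follows. This is a minimal illustration, not the UsagePage code: the function name `computeUsageStats` and the `UsageStats` shape are assumptions made for the example; only the `dailyLimit - credits` expression and the 294/100 figures come from the commit message.

```typescript
// Hypothetical result shape; the real UsagePage fields are not shown in this PR.
interface UsageStats {
  usedToday: number
  bonusCredits: number
}

// Clamp both figures at zero so a balance above the daily cap
// is reported as bonus credits instead of negative usage.
function computeUsageStats(credits: number, dailyLimit: number): UsageStats {
  const usedToday = Math.max(0, dailyLimit - credits)
  const bonusCredits = Math.max(0, credits - dailyLimit)
  return { usedToday, bonusCredits }
}

// The case from the commit message: balance 294 with cap 100.
console.log(computeUsageStats(294, 100)) // { usedToday: 0, bonusCredits: 194 }
```

With the unclamped formula the same inputs would yield -194 used credits, which is the "-194 of 100" display the commit describes.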

* fix(agent): clarify upstream provider rate-limit errors
  When a non-BrowserOS provider (OpenAI, Anthropic, OpenRouter, etc.) returned a 429, ChatError rendered the retry-wrapped message "Failed after 3 attempts. Last error: The usage limit has been reached" with a generic "Something went wrong" title, leading users to blame BrowserOS for throttling imposed by their configured upstream. Detect upstream 429s in parseErrorMessage, show the provider name in the title ("OpenAI rate limit reached"), strip the retry prefix, render the raw upstream message, and add clarifying subtext that names the provider and explicitly excludes BrowserOS. Skip the BrowserOS-specific ShareForCredits / survey / upgrade affordances on this path; they do not apply.
  Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
* fix: address Greptile review comments
  - Tighten 429 pattern to \b429\b so it only matches the standalone status code, not incidental substrings (model IDs, paths, etc.).
  - Unwrap JSON-encoded provider error bodies on the upstream-rate-limit path so users see the human-readable message instead of raw JSON.
  Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
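To illustrate the `\b429\b` tightening and the retry-prefix stripping, here is a small sketch. The helper names `isUpstreamRateLimit` and `stripRetryPrefix` and the exact prefix regex are assumptions for the example; the real logic lives in parseErrorMessage, which is not shown here.

```typescript
// Word boundaries keep the match to a standalone "429" token.
const STATUS_429 = /\b429\b/

// Assumed shape of the retry wrapper, based on the message quoted in the commit.
const RETRY_PREFIX = /^Failed after \d+ attempts\. Last error:\s*/

function isUpstreamRateLimit(message: string): boolean {
  return STATUS_429.test(message)
}

function stripRetryPrefix(message: string): string {
  return message.replace(RETRY_PREFIX, "")
}

console.log(isUpstreamRateLimit("HTTP 429 Too Many Requests")) // true
console.log(isUpstreamRateLimit("model gpt-4290 not found")) // false
console.log(
  stripRetryPrefix(
    "Failed after 3 attempts. Last error: The usage limit has been reached",
  ),
) // "The usage limit has been reached"
```

A word boundary still treats hyphens and slashes as separators, so an identifier such as `model-429` would match; the pattern only rules out digits or letters directly adjacent to `429`.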

* fix(agent): declare @browseros/shared as workspace dependency
  The agent app imports @browseros/shared/constants/urls in lib/referral/submit-referral.ts but never declared the package in its dependencies, so vite failed to resolve the import during dev.
  Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
* feat(referral): cap daily referral earnings at 500 credits
  Block tweet submissions client-side once the user's balance reaches 500 to prevent unlimited credit farming via repeated shares.
  Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
* feat(referral): randomize tweet variations for Twitter share
  Replace the single hardcoded share text with 10 feature-specific variations (agent mode, chat, scheduled tasks, connect apps, cowork, workflows, memory, skills, local models, ad blocking) and pick one at random each time the share button is clicked.
  Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
* fix(referral): regenerate share URL on click
  Previously getShareOnTwitterUrl() was evaluated once at render time as a static href, so every click produced the same tweet variation. Move the call into onClick so a new random variation is picked each time. Addresses Greptile P1 review on PR #737.
  Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

* fix(credits): move credits fetch to extension side using install_id
  Extension now reads `browseros.metrics_install_id` pref directly and fetches credits from `llm.browseros.com` without going through the bundled server. Unblocks the referral submit flow in prod without requiring a BrowserOS binary release.
  - Revert `/credits` route change that added `browserosId` to the response.
  - Add `getOrCreateBrowserosId()` helper reading from BrowserOS prefs.
  - Add `CREDITS_GATEWAY` to shared EXTERNAL_URLS.
  Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
* refactor(credits): drop fallback UUID, read install_id directly
  Extension only runs inside BrowserOS, so the prefs API is always available. The chrome.storage fallback was dead code that would generate a ghost ID diverging from the server's install_id anyway. Rename the helper to match its simpler contract.
  Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
* fix(credits): guard against empty install_id pref
  Address Greptile P1: throw instead of silently fetching `/credits/null` when `browseros.metrics_install_id` is unset. Fails loudly so the broken state is observable rather than masquerading as a credits outage.
  Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
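The empty-pref guard from the last commit above might look roughly like this. It is a sketch under assumptions: the helper name `buildCreditsUrl`, the `https://` scheme, and the `/credits/<id>` path layout are inferred from the commit text (`llm.browseros.com`, `/credits/null`) rather than taken from the actual code.

```typescript
// Assumption: gateway origin; only the host name appears in the commit message.
const CREDITS_GATEWAY = "https://llm.browseros.com"

// Fail loudly when the install ID pref is missing or blank,
// instead of building a URL such as /credits/null.
function buildCreditsUrl(installId: string | null | undefined): string {
  if (installId === null || installId === undefined || installId.trim() === "") {
    throw new Error("browseros.metrics_install_id is unset; cannot fetch credits")
  }
  return `${CREDITS_GATEWAY}/credits/${encodeURIComponent(installId)}`
}

console.log(buildCreditsUrl("abc-123")) // https://llm.browseros.com/credits/abc-123
```

Throwing here keeps a misconfigured install distinguishable from a genuine gateway outage, which is the motivation the commit gives.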

Document NVIDIA's free OpenAI-compatible API at build.nvidia.com — 80+ free models including GLM 5.1, MiniMax M2.7, Qwen 3.5, Mistral, and Nemotron — wired through BrowserOS's OpenAI Compatible provider template. Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

…756)
* feat(llm): Minimax Chinese and International Users providers
* fix(llm): Patch for p2 bugs
* fix(agent): correct MiniMax base URL handling and enforce API key validation
* fix(agent): add minimax entry to PROVIDER_DISPLAY_NAMES
  The Record<ProviderType, string> map in ChatError.tsx was missing the new minimax key added in this PR, causing a typecheck failure.
  Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
---------
Co-authored-by: krish-mm <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

* feat: add deterministic eval graders (AGI SDK + WebArena-Infinity)
  Two new benchmark integrations with programmatic grading, no LLM judge.
  AGI SDK / REAL Bench (52 tasks):
  - 11 React/Next.js clones of consumer apps (DoorDash, Amazon, Gmail, etc.)
  - Grader navigates browser to /finish, extracts state diff from <pre> tag
  - Python verifier checks exact values via jmespath queries
  WebArena-Infinity (50 hard tasks):
  - 13 LLM-generated SaaS clones (Gmail, GitLab, Linear, Figma, etc.)
  - InfinityAppManager starts fresh app server per task per worker
  - Python verifier calls /api/state and asserts on JSON state
  Infrastructure:
  - GraderInput extended with mcpUrl + infinityAppUrl for parallel workers
  - Each worker gets isolated ports (no cross-worker state contamination)
  - CI workflow: pip install agisdk, clone webarena-infinity repo
* chore: switch eval configs back to kimi-k2p5
* fix: register deterministic graders in pass rate calculation
  Add agisdk_state_diff and infinity_state to PASS_FAIL_GRADER_ORDER in both runner types and weekly report script, so scores show correctly in the dashboard.
* chore: temp switch to opus 4.6 for eval run
* chore: restore kimi-k2p5 as default eval config
* ci: add timeout and continue-on-error for trend report step

Adds two minimal additions to the eval dashboard so reference-config workflows are easier:
- POST /api/run accepts testRun: true, which forces 1 worker + first task only, exercising the full executor path so API key, model, dataset, and BrowserOS port are all sanity-checked before a 50/200-task run.
- PUT /api/config/:name overwrites the loaded reference config in place (overwrite-only, schema-validated, path-traversal guarded).
- UI gains a Test Run button, Save to Reference button (visible only when a saved config is loaded), and a "Loaded: <name>" indicator.
No schema, runner, or task-loader changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
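One common way to make a `:name` route parameter safe against path traversal is an allowlist check on the name before it is joined to a directory. The sketch below shows that general technique only; the actual guard in server.ts is not visible in this PR page, and `isSafeConfigName` and its character set are assumptions made for illustration.

```typescript
// Allow only simple file-name characters, starting with a letter or digit.
// This rejects path separators, leading dots, and empty names.
const SAFE_CONFIG_NAME = /^[A-Za-z0-9][A-Za-z0-9._-]*$/

function isSafeConfigName(name: string): boolean {
  // The extra ".." check rejects names like "a..b" that the
  // character allowlist alone would accept.
  return SAFE_CONFIG_NAME.test(name) && name.indexOf("..") === -1
}

console.log(isSafeConfigName("kimi-k2p5.json")) // true
console.log(isSafeConfigName("../secrets.json")) // false
console.log(isSafeConfigName("nested/config.json")) // false
```

A handler built on this would reject the request before touching the file system, then apply the overwrite-only and schema checks the commit message lists.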

Remove the Gemini Computer Use and Yutori Navigator eval integrations, including their agent folders, dashboard config branches, sample configs, docs, and eval workspace dependencies. Co-Authored-By: Claude Opus 4.6 <[email protected]>

greptile-apps · 2026-04-28T17:38:41Z

Greptile Summary

This PR completes the removal of the Gemini Computer Use and Yutori Navigator eval agent implementations — deleting ~2,900 lines of agent code, configs, and validation — and bundles two dashboard enhancements: a Test Run mode (1 worker, 1 task) and a Save to Reference button backed by a new PUT /api/config/:name endpoint with solid path-traversal protection.

Confidence Score: 4/5

Safe to merge; only P2 findings present

All issues found are P2 — a minor UI text-state bug and a backwards-compatibility edge case for legacy result files. The removal is thorough, typecheck and lint pass per the test plan, and the new server endpoint is well-guarded.

packages/browseros-agent/apps/eval/src/types/result.ts — narrowed AgentConfigMetaSchema may silently fail on old Gemini/Yutori result directories if re-grading is attempted

Important Files Changed

Filename	Overview
packages/browseros-agent/apps/eval/src/agents/gemini-computer-use/agent.ts	Deleted — Gemini Computer Use agent implementation fully removed
packages/browseros-agent/apps/eval/src/agents/yutori-navigator/agent.ts	Deleted — Yutori Navigator agent implementation fully removed
packages/browseros-agent/apps/eval/src/types/config.ts	Removed GeminiComputerUseConfigSchema and YutoriNavigatorConfigSchema from AgentConfigSchema discriminated union
packages/browseros-agent/apps/eval/src/types/result.ts	AgentConfigMetaSchema type enum narrowed to only 'single'/'orchestrator-executor'; may silently break re-grading of old Gemini/Yutori result directories
packages/browseros-agent/apps/eval/src/dashboard/server.ts	Adds PUT /api/config/:name (Save to Reference) with robust path-traversal protection and schema validation; adds testRun mode limiting to 1 worker + 1 task
packages/browseros-agent/apps/eval/src/dashboard/index.html	Adds Test Run button and Save to Reference button; minor UI bug — test button text stays "Starting…" while eval runs
packages/browseros-agent/apps/eval/src/utils/config-validator.ts	Removed gemini-computer-use API key env-var check; remaining logic is clean
packages/browseros-agent/apps/eval/scripts/weekly-report.ts	Removed gemini-computer-use label mappings; yutori-navigator never had a dedicated label, so no change needed there

Sequence Diagram

sequenceDiagram
    participant UI as Dashboard (index.html)
    participant Server as server.ts
    participant FS as File System

    Note over UI,FS: Save to Reference flow
    UI->>Server: PUT /api/config/:name (JSON body)
    Server->>FS: stat(filepath) — must already exist
    Server->>Server: EvalConfigSchema.safeParse(body)
    Server->>FS: Bun.write(filepath, JSON)
    Server-->>UI: { status: 'saved', name }

    Note over UI,FS: Test Run flow
    UI->>Server: POST /api/run { testRun: true }
    Server->>FS: loadTasks(datasetPath)
    Server->>Server: tasks = tasks.slice(0, 1)
    Server->>Server: ParallelExecutor({ numWorkers: 1 })
    Server-->>UI: { status: 'started', taskCount: 1, testRun: true }
    UI->>UI: setEvalRunningUI(true)
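
The Test Run branch of the diagram reduces to two overrides: keep only the first task and force a single worker. A minimal sketch of that step, assuming a hypothetical helper `planRun` and `RunPlan` shape (the real server passes these values to ParallelExecutor directly):

```typescript
// Hypothetical plan shape for illustration; not a type from the eval package.
interface RunPlan<T> {
  tasks: T[]
  numWorkers: number
  testRun: boolean
}

// For a test run, exercise the full executor path on one task with one worker.
// Otherwise leave the requested task list and worker count untouched.
function planRun<T>(
  tasks: T[],
  requestedWorkers: number,
  testRun: boolean,
): RunPlan<T> {
  if (!testRun) {
    return { tasks, numWorkers: requestedWorkers, testRun: false }
  }
  return { tasks: tasks.slice(0, 1), numWorkers: 1, testRun: true }
}

const plan = planRun(["task-a", "task-b", "task-c"], 4, true)
console.log(plan.tasks.length, plan.numWorkers) // 1 1
```

Because `slice(0, 1)` on an empty array returns an empty array, a test run against an empty dataset would start with zero tasks; whether the server reports that case separately is not stated in the review.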

Prompt To Fix All With AI

This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/dashboard/index.html
Line: 944

Comment:
**Test button text stuck on "Starting..." during run**

After a successful submission, `setEvalRunningUI(true)` is called but only sets `testBtn.disabled = true` — it never resets `testBtn.textContent`. This means the test button shows "Starting..." for the entire duration of the eval until `setEvalRunningUI(false)` is called on completion. Consider adding `testBtn.textContent = 'Test Run'` (or "Running…") inside the `if (running)` branch.

How can I resolve this? If you propose a fix, please make it concise.
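
A sketch of the suggested fix, under assumptions: the real `setEvalRunningUI` in index.html presumably takes only `running` and updates other controls too, so the `testBtn` parameter and the `ButtonLike` interface here exist only to keep the example self-contained.

```typescript
// Minimal stand-in for the two HTMLButtonElement properties the fix touches.
interface ButtonLike {
  disabled: boolean
  textContent: string | null
}

// Reset the label in both branches so "Starting..." never outlives submission.
function setEvalRunningUI(running: boolean, testBtn: ButtonLike): void {
  testBtn.disabled = running
  testBtn.textContent = running ? "Running…" : "Test Run"
}

const testBtn: ButtonLike = { disabled: true, textContent: "Starting..." }
setEvalRunningUI(true, testBtn)
console.log(testBtn.textContent) // Running…
```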

---

This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/types/result.ts
Line: 14-19

Comment:
**Narrowed enum breaks parsing of legacy result files**

`AgentConfigMetaSchema` now only accepts `'single'` and `'orchestrator-executor'`. Any previously-stored `metadata.json` file with `agent_config.type: 'gemini-computer-use'` or `'yutori-navigator'` will fail the `TaskMetadataSchema.parse()` call in `trajectory-saver.ts#loadMetadata`. The `hasExistingGraderResults` helper swallows the Zod error (returns `exists: false`), but `loadMetadata()` in `updateGraderResults` has no try/catch, so re-grading an old Gemini/Yutori result directory would throw. Consider using `z.string()` for the `type` field (the `.passthrough()` already allows unknown fields) if backwards compatibility with existing result files matters.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "chore(eval): remove Gemini and Yutori ag..."

greptile-apps · 2026-04-28T17:38:45Z

 const AgentConfigMetaSchema = z
  .object({
-    type: z.enum([
-      'single',
-      'orchestrator-executor',
-      'gemini-computer-use',
-      'yutori-navigator',
-    ]),
+    type: z.enum(['single', 'orchestrator-executor']),
    model: z.string().optional(),
  })
  .passthrough()


Narrowed enum breaks parsing of legacy result files

AgentConfigMetaSchema now only accepts 'single' and 'orchestrator-executor'. Any previously-stored metadata.json file with agent_config.type: 'gemini-computer-use' or 'yutori-navigator' will fail the TaskMetadataSchema.parse() call in trajectory-saver.ts#loadMetadata. The hasExistingGraderResults helper swallows the Zod error (returns exists: false), but loadMetadata() in updateGraderResults has no try/catch, so re-grading an old Gemini/Yutori result directory would throw. Consider using z.string() for the type field (the .passthrough() already allows unknown fields) if backwards compatibility with existing result files matters.
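
The backwards-compatible behaviour the comment asks for (any string accepted for `type`, unknown fields preserved) can be illustrated without Zod. The sketch below is a hand-written stand-in, not the project's schema: `parseAgentConfigMeta`, `isActiveAgentType`, and `AgentConfigMeta` are hypothetical names, and in the real code the equivalent change would be swapping `z.enum([...])` for `z.string()` as the reviewer suggests.

```typescript
// Agent types still supported after this PR.
const ACTIVE_AGENT_TYPES = ["single", "orchestrator-executor"]

interface AgentConfigMeta {
  type: string
  model?: string
  [extra: string]: unknown
}

// Accept any string `type` so legacy metadata still loads,
// and copy unknown fields through, mirroring .passthrough().
function parseAgentConfigMeta(raw: unknown): AgentConfigMeta {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("agent_config must be an object")
  }
  const record = raw as Record<string, unknown>
  const type = record["type"]
  const model = record["model"]
  if (typeof type !== "string") {
    throw new Error("agent_config.type must be a string")
  }
  if (model !== undefined && typeof model !== "string") {
    throw new Error("agent_config.model must be a string when present")
  }
  const meta: AgentConfigMeta = { type }
  if (model !== undefined) meta.model = model
  for (const key of Object.keys(record)) {
    if (key !== "type" && key !== "model") meta[key] = record[key]
  }
  return meta
}

// Callers that need the narrowed set can still check it explicitly.
function isActiveAgentType(type: string): boolean {
  return ACTIVE_AGENT_TYPES.indexOf(type) !== -1
}

const legacy = parseAgentConfigMeta({ type: "gemini-computer-use", model: "m" })
console.log(legacy.type, isActiveAgentType(legacy.type)) // gemini-computer-use false
```

Keeping the strict list as a separate check lets loading stay tolerant for old result directories while new runs can still be validated against the two remaining agent types.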


Felarof (felarof99) and others added 11 commits April 16, 2026 15:25

chore: add .auctor entries to gitignore (#738)

4f03afc

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

chore(eval): remove Gemini and Yutori agents

ed717f0

Remove the Gemini Computer Use and Yutori Navigator eval integrations, including their agent folders, dashboard config branches, sample configs, docs, and eval workspace dependencies. Co-Authored-By: Claude Opus 4.6 <[email protected]>

github-actions Bot added the chore label Apr 28, 2026

greptile-apps Bot reviewed Apr 28, 2026


Felarof (felarof99) changed the base branch from main to dev April 28, 2026 19:55

Felarof (felarof99) closed this Apr 28, 2026

Felarof (felarof99) wants to merge 11 commits into dev from feat/evals-nithin
