
chore(eval): remove Gemini and Yutori agents#843

Closed
Felarof (felarof99) wants to merge 11 commits into dev from feat/evals-nithin

Conversation

@felarof99
Contributor

Summary

  • Remove the Gemini Computer Use and Yutori Navigator eval agent implementations and sample configs.
  • Prune active registry, schema, dashboard, docs/report, validation, package, and lockfile references to those eval agent types.
  • This branch also contains the existing dashboard test-run/save-to-reference commit that was already on the local branch before this removal was pushed.

Test plan

  • PATH="$HOME/.bun/bin:$PATH" bun install --frozen-lockfile
  • PATH="$HOME/.bun/bin:$PATH" bun run --filter @browseros/eval typecheck
  • PATH="$HOME/.bun/bin:$PATH" bun run lint
  • rg found no remaining Gemini/Yutori eval-agent references outside ignored/generated files

🤖 Generated with Claude Code

Felarof (felarof99) and others added 11 commits April 16, 2026 15:25
* feat: add Twitter share referral UI and expose browserosId

When credits are exhausted, users now see a "Share on Twitter" CTA with
a pre-filled tweet URL and an input to paste their tweet link. Reusable
ShareForCredits component used in both ChatError and UsagePage. Server's
GET /credits now includes browserosId for the extension to pass to the
referral service.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: rebuild chat session on provider change

* fix: address Greptile review comments

- Move referral service URL to EXTERNAL_URLS
- Guard submitReferral on !response.ok
- Remove stale TODO comment

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
…fallback (#731)

* feat(referral): show share rules and lower default daily limit fallback

Surface the three referral validation rules (must mention @browserOS_ai,
posted within last 30 minutes, single-use) directly in the ShareForCredits
UI so users understand submission requirements before pasting a tweet link.
Also align the UsagePage daily-limit fallback (used while credits load) with
the gateway default of 50.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(usage): handle credit balance exceeding daily limit

The "Credits used today" stat was computed as `dailyLimit - credits`,
which goes negative once a referral bonus pushes the balance above the
daily cap (e.g. balance 294 with cap 100 showed "-194 of 100"). Clamp
the math to zero and surface a separate "Bonus credits" stat when the
balance exceeds the daily allowance.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
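
The clamping described in this commit can be sketched as follows. This is a minimal illustration, not the actual UsagePage code: the `UsageStats` shape and the `computeUsageStats` name are hypothetical, and only the `dailyLimit - credits` arithmetic and the 294/100 example come from the commit message.

```typescript
// Hypothetical result shape; the real UsagePage fields are not shown in this PR.
interface UsageStats {
  usedToday: number;
  bonusCredits: number;
}

function computeUsageStats(credits: number, dailyLimit: number): UsageStats {
  // Credits used can never be negative, even when the balance exceeds the cap.
  const usedToday = Math.max(0, dailyLimit - credits);
  // Anything above the daily allowance is surfaced as a separate bonus stat.
  const bonusCredits = Math.max(0, credits - dailyLimit);
  return { usedToday, bonusCredits };
}

// The example from the commit message: balance 294 with cap 100.
const stats = computeUsageStats(294, 100);
console.log(stats.usedToday, stats.bonusCredits); // 0 194
```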

---------

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
* fix(agent): clarify upstream provider rate-limit errors

When a non-BrowserOS provider (OpenAI, Anthropic, OpenRouter, etc.)
returned a 429, ChatError rendered the retry-wrapped message
"Failed after 3 attempts. Last error: The usage limit has been reached"
with a generic "Something went wrong" title, leading users to blame
BrowserOS for throttling imposed by their configured upstream.

Detect upstream 429s in parseErrorMessage, show the provider name in
the title ("OpenAI rate limit reached"), strip the retry prefix,
render the raw upstream message, and add clarifying subtext that
names the provider and explicitly excludes BrowserOS. Skip the
BrowserOS-specific ShareForCredits / survey / upgrade affordances on
this path — they do not apply.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix: address Greptile review comments

- Tighten 429 pattern to \b429\b so it only matches the standalone
  status code, not incidental substrings (model IDs, paths, etc.).
- Unwrap JSON-encoded provider error bodies on the upstream-rate-limit
  path so users see the human-readable message instead of raw JSON.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
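
To illustrate the two review fixes, here is a rough sketch of the tightened pattern and the JSON unwrapping. The function names and the `{ error: { message } }` body shape are assumptions made for illustration; the PR only states that the pattern is `\b429\b` and that JSON-encoded bodies are unwrapped.

```typescript
// Loose pattern: matches "429" anywhere, including inside longer numbers.
const LOOSE_429 = /429/;
// Tightened pattern from the commit message: the standalone status code only.
const STRICT_429 = /\b429\b/;

// Hypothetical helper name; the real check lives in parseErrorMessage.
function isUpstreamRateLimit(message: string): boolean {
  return STRICT_429.test(message);
}

// Assumed body shape { error: { message } }; falls back to the raw text
// when the input is not JSON or does not have that shape.
function unwrapProviderMessage(raw: string): string {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed === "object" && parsed !== null) {
      const err = (parsed as { error?: { message?: unknown } }).error;
      if (err && typeof err.message === "string") return err.message;
    }
  } catch {
    // Not JSON: keep the original message.
  }
  return raw;
}

console.log(LOOSE_429.test("model glm-4290 not found")); // true
console.log(isUpstreamRateLimit("model glm-4290 not found")); // false
```

Note that `\b` treats a hyphen as a boundary, so a string such as `model-429` would still match the strict pattern.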

---------

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
* fix(agent): declare @browseros/shared as workspace dependency

The agent app imports @browseros/shared/constants/urls in
lib/referral/submit-referral.ts but never declared the package in its
dependencies, so vite failed to resolve the import during dev.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* feat(referral): cap daily referral earnings at 500 credits

Block tweet submissions client-side once the user's balance reaches
500 to prevent unlimited credit farming via repeated shares.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* feat(referral): randomize tweet variations for Twitter share

Replace the single hardcoded share text with 10 feature-specific
variations (agent mode, chat, scheduled tasks, connect apps, cowork,
workflows, memory, skills, local models, ad blocking) and pick one at
random each time the share button is clicked.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(referral): regenerate share URL on click

Previously getShareOnTwitterUrl() was evaluated once at render time as
a static href, so every click produced the same tweet variation. Move
the call into onClick so a new random variation is picked each time.

Addresses Greptile P1 review on PR #737.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
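
The difference between evaluating the URL at render time and at click time can be sketched like this. Only the `getShareOnTwitterUrl` name comes from the commit message; the variation texts, the injectable `random` parameter, and `handleShareClick` are placeholders assumed for illustration.

```typescript
// Placeholder texts; the real component ships 10 feature-specific variations.
const TWEET_VARIATIONS: readonly string[] = [
  "Trying agent mode in @browserOS_ai",
  "Scheduled tasks in @browserOS_ai are handy",
  "Running local models with @browserOS_ai",
];

// The random source is injectable only so the sketch is testable (an assumption).
function getShareOnTwitterUrl(random: () => number = Math.random): string {
  const index = Math.floor(random() * TWEET_VARIATIONS.length);
  const text = TWEET_VARIATIONS[index] ?? "";
  return `https://twitter.com/intent/tweet?text=${encodeURIComponent(text)}`;
}

// Called from onClick, so every click picks a fresh variation instead of
// reusing a URL computed once when the component rendered.
function handleShareClick(
  open: (url: string) => void,
  random: () => number = Math.random,
): void {
  open(getShareOnTwitterUrl(random));
}
```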

---------

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
* fix(credits): move credits fetch to extension side using install_id

Extension now reads `browseros.metrics_install_id` pref directly and fetches
credits from `llm.browseros.com` without going through the bundled server.
Unblocks the referral submit flow in prod without requiring a BrowserOS
binary release.

- Revert `/credits` route change that added `browserosId` to the response.
- Add `getOrCreateBrowserosId()` helper reading from BrowserOS prefs.
- Add `CREDITS_GATEWAY` to shared EXTERNAL_URLS.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* refactor(credits): drop fallback UUID, read install_id directly

Extension only runs inside BrowserOS, so the prefs API is always available.
The chrome.storage fallback was dead code that would generate a ghost ID
diverging from the server's install_id anyway. Rename the helper to match
its simpler contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(credits): guard against empty install_id pref

Address Greptile P1 — throw instead of silently fetching `/credits/null`
when `browseros.metrics_install_id` is unset. Fails loudly so the broken
state is observable rather than masquerading as a credits outage.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
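
A minimal sketch of the guard, assuming the pref value arrives as a possibly empty string. The `requireInstallId` and `buildCreditsUrl` names are hypothetical; the pref name and the `/credits/<id>` path shape are taken from the commit messages above.

```typescript
const INSTALL_ID_PREF = "browseros.metrics_install_id";

// Hypothetical helper: fail loudly instead of letting null or "" reach the URL.
function requireInstallId(prefValue: string | null | undefined): string {
  if (typeof prefValue !== "string" || prefValue.trim() === "") {
    throw new Error(`${INSTALL_ID_PREF} is not set; cannot fetch credits`);
  }
  return prefValue;
}

// Hypothetical URL builder; gatewayBase would be the CREDITS_GATEWAY entry.
function buildCreditsUrl(gatewayBase: string, installId: string): string {
  return `${gatewayBase}/credits/${encodeURIComponent(installId)}`;
}

// Without the guard, a missing pref would produce ".../credits/null".
const url = buildCreditsUrl("https://llm.browseros.com", requireInstallId("abc-123"));
console.log(url); // https://llm.browseros.com/credits/abc-123
```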

---------

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
Document NVIDIA's free OpenAI-compatible API at build.nvidia.com — 80+ free models including GLM 5.1, MiniMax M2.7, Qwen 3.5, Mistral, and Nemotron — wired through BrowserOS's OpenAI Compatible provider template.

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
…756)

* feat(llm): Minimax Chinese and International Users providers

* fix(llm): Patch for p2 bugs

* fix(agent): correct MiniMax base URL handling and enforce API key validation

* fix(agent): add minimax entry to PROVIDER_DISPLAY_NAMES

The Record<ProviderType, string> map in ChatError.tsx was missing
the new minimax key added in this PR, causing a typecheck failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

---------

Co-authored-by: krish-mm <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
* feat: add deterministic eval graders (AGI SDK + WebArena-Infinity)

Two new benchmark integrations with programmatic grading — no LLM judge.

AGI SDK / REAL Bench (52 tasks):
- 11 React/Next.js clones of consumer apps (DoorDash, Amazon, Gmail, etc.)
- Grader navigates browser to /finish, extracts state diff from <pre> tag
- Python verifier checks exact values via jmespath queries

WebArena-Infinity (50 hard tasks):
- 13 LLM-generated SaaS clones (Gmail, GitLab, Linear, Figma, etc.)
- InfinityAppManager starts fresh app server per task per worker
- Python verifier calls /api/state and asserts on JSON state

Infrastructure:
- GraderInput extended with mcpUrl + infinityAppUrl for parallel workers
- Each worker gets isolated ports (no cross-worker state contamination)
- CI workflow: pip install agisdk, clone webarena-infinity repo

* chore: switch eval configs back to kimi-k2p5

* fix: register deterministic graders in pass rate calculation

Add agisdk_state_diff and infinity_state to PASS_FAIL_GRADER_ORDER
in both runner types and weekly report script, so scores show correctly
in the dashboard.

* chore: temp switch to opus 4.6 for eval run

* chore: restore kimi-k2p5 as default eval config

* ci: add timeout and continue-on-error for trend report step
Adds two minimal additions to the eval dashboard so reference-config
workflows are easier:

- POST /api/run accepts testRun: true — forces 1 worker + first task only,
  exercising the full executor path so API key, model, dataset, and
  BrowserOS port are all sanity-checked before a 50/200-task run.
- PUT /api/config/:name overwrites the loaded reference config in place
  (overwrite-only, schema-validated, path-traversal guarded).
- UI gains a Test Run button, Save to Reference button (visible only when
  a saved config is loaded), and a "Loaded: <name>" indicator.

No schema, runner, or task-loader changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
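
One way the overwrite-only, path-traversal-guarded check for `PUT /api/config/:name` could look is sketched below. The actual guard in server.ts is not shown in this PR, so the allow-list regex, the status codes, and the `checkConfigSave` name are all assumptions; schema validation would run as a separate step.

```typescript
type SaveCheck =
  | { ok: true; filename: string }
  | { ok: false; status: number; error: string };

// Hypothetical pre-write check: existingFiles stands in for a directory listing.
function checkConfigSave(name: string, existingFiles: ReadonlySet<string>): SaveCheck {
  // Allow only simple file names: no slashes, no backslashes, no "..".
  if (!/^[A-Za-z0-9][A-Za-z0-9._-]*$/.test(name) || name.includes("..")) {
    return { ok: false, status: 400, error: "invalid config name" };
  }
  const filename = name.endsWith(".json") ? name : `${name}.json`;
  // Overwrite-only: refuse to create configs that do not already exist.
  if (!existingFiles.has(filename)) {
    return { ok: false, status: 404, error: "config not found" };
  }
  return { ok: true, filename };
}

const files = new Set(["kimi-k2p5.json"]);
console.log(checkConfigSave("kimi-k2p5", files).ok); // true
console.log(checkConfigSave("../secrets", files).ok); // false
```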
Remove the Gemini Computer Use and Yutori Navigator eval integrations, including their agent folders, dashboard config branches, sample configs, docs, and eval workspace dependencies.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@github-actions github-actions Bot added the chore label Apr 28, 2026
@greptile-apps
Contributor

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

This PR completes the removal of the Gemini Computer Use and Yutori Navigator eval agent implementations — deleting ~2,900 lines of agent code, configs, and validation — and bundles two dashboard enhancements: a Test Run mode (1 worker, 1 task) and a Save to Reference button backed by a new PUT /api/config/:name endpoint with solid path-traversal protection.

Confidence Score: 4/5

Safe to merge; only P2 findings present

All issues found are P2 — a minor UI text-state bug and a backwards-compatibility edge case for legacy result files. The removal is thorough, typecheck and lint pass per the test plan, and the new server endpoint is well-guarded.

packages/browseros-agent/apps/eval/src/types/result.ts — narrowed AgentConfigMetaSchema may silently fail on old Gemini/Yutori result directories if re-grading is attempted

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/browseros-agent/apps/eval/src/agents/gemini-computer-use/agent.ts | Deleted: Gemini Computer Use agent implementation fully removed |
| packages/browseros-agent/apps/eval/src/agents/yutori-navigator/agent.ts | Deleted: Yutori Navigator agent implementation fully removed |
| packages/browseros-agent/apps/eval/src/types/config.ts | Removed GeminiComputerUseConfigSchema and YutoriNavigatorConfigSchema from AgentConfigSchema discriminated union |
| packages/browseros-agent/apps/eval/src/types/result.ts | AgentConfigMetaSchema type enum narrowed to only 'single'/'orchestrator-executor'; may silently break re-grading of old Gemini/Yutori result directories |
| packages/browseros-agent/apps/eval/src/dashboard/server.ts | Adds PUT /api/config/:name (Save to Reference) with robust path-traversal protection and schema validation; adds testRun mode limiting to 1 worker + 1 task |
| packages/browseros-agent/apps/eval/src/dashboard/index.html | Adds Test Run button and Save to Reference button; minor UI bug: test button text stays "Starting…" while eval runs |
| packages/browseros-agent/apps/eval/src/utils/config-validator.ts | Removed gemini-computer-use API key env-var check; remaining logic is clean |
| packages/browseros-agent/apps/eval/scripts/weekly-report.ts | Removed gemini-computer-use label mappings; yutori-navigator never had a dedicated label, so no change needed there |

Sequence Diagram

sequenceDiagram
    participant UI as Dashboard (index.html)
    participant Server as server.ts
    participant FS as File System

    Note over UI,FS: Save to Reference flow
    UI->>Server: PUT /api/config/:name (JSON body)
    Server->>FS: stat(filepath) — must already exist
    Server->>Server: EvalConfigSchema.safeParse(body)
    Server->>FS: Bun.write(filepath, JSON)
    Server-->>UI: { status: 'saved', name }

    Note over UI,FS: Test Run flow
    UI->>Server: POST /api/run { testRun: true }
    Server->>FS: loadTasks(datasetPath)
    Server->>Server: tasks = tasks.slice(0, 1)
    Server->>Server: ParallelExecutor({ numWorkers: 1 })
    Server-->>UI: { status: 'started', taskCount: 1, testRun: true }
    UI->>UI: setEvalRunningUI(true)
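
The Test Run branch of the diagram reduces to trimming the task list and forcing a single worker. A minimal sketch, with `planRun` and `RunPlan` as hypothetical names (the real server passes the result to `ParallelExecutor`):

```typescript
interface RunPlan<T> {
  tasks: T[];
  numWorkers: number;
  testRun: boolean;
}

// Hypothetical helper: decide what to hand to the executor.
function planRun<T>(tasks: T[], requestedWorkers: number, testRun: boolean): RunPlan<T> {
  if (!testRun) {
    return { tasks, numWorkers: requestedWorkers, testRun: false };
  }
  // Test run: first task only, one worker, same executor path as a full run.
  return { tasks: tasks.slice(0, 1), numWorkers: 1, testRun: true };
}

const plan = planRun(["task-a", "task-b", "task-c"], 4, true);
console.log(plan.tasks.length, plan.numWorkers); // 1 1
```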
Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/dashboard/index.html
Line: 944

Comment:
**Test button text stuck on "Starting..." during run**

After a successful submission, `setEvalRunningUI(true)` is called but only sets `testBtn.disabled = true` — it never resets `testBtn.textContent`. This means the test button shows "Starting..." for the entire duration of the eval until `setEvalRunningUI(false)` is called on completion. Consider adding `testBtn.textContent = 'Test Run'` (or "Running…") inside the `if (running)` branch.
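
A sketch of the suggested fix, assuming the button label should read "Running…" during a run. The real `setEvalRunningUI` takes only the `running` flag and looks up DOM elements itself; here the button is passed in as a plain object (the hypothetical `ButtonLike` interface) so the example runs without a browser.

```typescript
// Minimal stand-in for the parts of HTMLButtonElement this function touches.
interface ButtonLike {
  disabled: boolean;
  textContent: string | null;
}

function setEvalRunningUI(running: boolean, testBtn: ButtonLike): void {
  testBtn.disabled = running;
  // Reset the label in both branches so "Starting..." never lingers.
  testBtn.textContent = running ? "Running…" : "Test Run";
}

const testBtn: ButtonLike = { disabled: true, textContent: "Starting..." };
setEvalRunningUI(true, testBtn);
console.log(testBtn.textContent); // Running…
```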

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/types/result.ts
Line: 14-19

Comment:
**Narrowed enum breaks parsing of legacy result files**

`AgentConfigMetaSchema` now only accepts `'single'` and `'orchestrator-executor'`. Any previously-stored `metadata.json` file with `agent_config.type: 'gemini-computer-use'` or `'yutori-navigator'` will fail the `TaskMetadataSchema.parse()` call in `trajectory-saver.ts#loadMetadata`. The `hasExistingGraderResults` helper swallows the Zod error (returns `exists: false`), but `loadMetadata()` in `updateGraderResults` has no try/catch, so re-grading an old Gemini/Yutori result directory would throw. Consider using `z.string()` for the `type` field (the `.passthrough()` already allows unknown fields) if backwards compatibility with existing result files matters.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "chore(eval): remove Gemini and Yutori ag..."

Comment on lines 14 to 19
  const AgentConfigMetaSchema = z
    .object({
-     type: z.enum([
-       'single',
-       'orchestrator-executor',
-       'gemini-computer-use',
-       'yutori-navigator',
-     ]),
+     type: z.enum(['single', 'orchestrator-executor']),
      model: z.string().optional(),
    })
    .passthrough()

P2 Narrowed enum breaks parsing of legacy result files

AgentConfigMetaSchema now only accepts 'single' and 'orchestrator-executor'. Any previously-stored metadata.json file with agent_config.type: 'gemini-computer-use' or 'yutori-navigator' will fail the TaskMetadataSchema.parse() call in trajectory-saver.ts#loadMetadata. The hasExistingGraderResults helper swallows the Zod error (returns exists: false), but loadMetadata() in updateGraderResults has no try/catch, so re-grading an old Gemini/Yutori result directory would throw. Consider using z.string() for the type field (the .passthrough() already allows unknown fields) if backwards compatibility with existing result files matters.
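
The backwards-compatibility idea can be illustrated without Zod. The sketch below is a hand-written stand-in for the schema, not the project's code: `parseAgentConfigMeta` and `isKnownAgentType` are hypothetical names, and the parser accepts any string `type` and keeps unknown fields (mirroring `z.string()` plus `.passthrough()`), while a separate check reports whether the type is still supported.

```typescript
const KNOWN_AGENT_TYPES = ["single", "orchestrator-executor"] as const;

type AgentConfigMeta = { type: string } & Record<string, unknown>;

// Lenient parse: any string type is accepted, extra fields are preserved.
function parseAgentConfigMeta(input: unknown): AgentConfigMeta {
  if (typeof input !== "object" || input === null) {
    throw new Error("agent_config must be an object");
  }
  const record = input as Record<string, unknown>;
  const type = record["type"];
  if (typeof type !== "string") {
    throw new Error("agent_config.type must be a string");
  }
  const model = record["model"];
  if (model !== undefined && typeof model !== "string") {
    throw new Error("agent_config.model must be a string when present");
  }
  return { ...record, type };
}

// Callers that need a currently supported agent can still check explicitly.
function isKnownAgentType(type: string): boolean {
  return (KNOWN_AGENT_TYPES as readonly string[]).includes(type);
}

const legacy = parseAgentConfigMeta({ type: "gemini-computer-use", model: "example-model" });
console.log(legacy.type, isKnownAgentType(legacy.type)); // gemini-computer-use false
```

With this split, loading a legacy result directory no longer throws, and code that launches agents can still reject removed types.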


@felarof99 Felarof (felarof99) changed the base branch from main to dev April 28, 2026 19:55

2 participants