chore(eval): remove Gemini and Yutori agents#843
Felarof (felarof99) wants to merge 11 commits into dev
* feat: add Twitter share referral UI and expose browserosId
When credits are exhausted, users now see a "Share on Twitter" CTA with
a pre-filled tweet URL and an input to paste their tweet link. Reusable
ShareForCredits component used in both ChatError and UsagePage. Server's
GET /credits now includes browserosId for the extension to pass to the
referral service.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
* fix: rebuild chat session on provider change
* fix: address Greptile review comments
- Move referral service URL to EXTERNAL_URLS
- Guard submitReferral on !response.ok
- Remove stale TODO comment
Co-Authored-By: Claude Opus 4.6 <[email protected]>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
…fallback (#731)
* feat(referral): show share rules and lower default daily limit fallback
Surface the three referral validation rules (must mention @browserOS_ai,
posted within last 30 minutes, single-use) directly in the
ShareForCredits UI so users understand submission requirements before
pasting a tweet link. Also align the UsagePage daily-limit fallback
(used while credits load) with the gateway default of 50.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
* fix(usage): handle credit balance exceeding daily limit
The "Credits used today" stat was computed as `dailyLimit - credits`,
which goes negative once a referral bonus pushes the balance above the
daily cap (e.g. balance 294 with cap 100 showed "-194 of 100"). Clamp
the math to zero and surface a separate "Bonus credits" stat when the
balance exceeds the daily allowance.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
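The clamp described in the fix(usage) commit can be sketched as follows. This is a minimal illustration, not the UsagePage code: `UsageStats` and `computeUsageStats` are hypothetical names, and only the `dailyLimit - credits` expression and the 294/100 example come from the commit message.

```typescript
// Hypothetical shape; the real UsagePage types are not shown in this PR.
interface UsageStats {
  usedToday: number
  bonusCredits: number
}

function computeUsageStats(dailyLimit: number, credits: number): UsageStats {
  // Unclamped, dailyLimit - credits goes negative once credits > dailyLimit.
  const usedToday = Math.max(0, dailyLimit - credits)
  // Anything above the daily allowance is reported separately as a bonus.
  const bonusCredits = Math.max(0, credits - dailyLimit)
  return { usedToday, bonusCredits }
}
```

With the commit's example (cap 100, balance 294), this reports 0 credits used and 194 bonus credits instead of "-194 of 100".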
* fix(agent): clarify upstream provider rate-limit errors
When a non-BrowserOS provider (OpenAI, Anthropic, OpenRouter, etc.)
returned a 429, ChatError rendered the retry-wrapped message
"Failed after 3 attempts. Last error: The usage limit has been reached"
with a generic "Something went wrong" title, leading users to blame
BrowserOS for throttling imposed by their configured upstream.
Detect upstream 429s in parseErrorMessage, show the provider name in
the title ("OpenAI rate limit reached"), strip the retry prefix,
render the raw upstream message, and add clarifying subtext that
names the provider and explicitly excludes BrowserOS. Skip the
BrowserOS-specific ShareForCredits / survey / upgrade affordances on
this path — they do not apply.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
* fix: address Greptile review comments
- Tighten 429 pattern to \b429\b so it only matches the standalone
status code, not incidental substrings (model IDs, paths, etc.).
- Unwrap JSON-encoded provider error bodies on the upstream-rate-limit
path so users see the human-readable message instead of raw JSON.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
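The detection steps from the two commits above (match a standalone 429, strip the retry prefix, unwrap a JSON-encoded body) could look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, the retry-prefix regex is inferred from the quoted message, and the JSON body shapes (`message`, `error`, `error.message`) are guesses at common provider formats rather than what parseErrorMessage actually handles.

```typescript
// Inferred from the quoted message "Failed after 3 attempts. Last error: ...".
const RETRY_PREFIX = /^Failed after \d+ attempts\. Last error:\s*/
// \b429\b matches the standalone status code, not "4290" or "1429".
const STATUS_429 = /\b429\b/

function stripRetryPrefix(message: string): string {
  return message.replace(RETRY_PREFIX, '')
}

function mentionsStatus429(message: string): boolean {
  return STATUS_429.test(message)
}

// Assumed body shapes: {"message": "..."}, {"error": "..."}, {"error": {"message": "..."}}.
function unwrapJsonMessage(raw: string): string {
  try {
    const parsed: unknown = JSON.parse(raw)
    if (typeof parsed === 'object' && parsed !== null) {
      const body = parsed as { message?: unknown; error?: unknown }
      if (typeof body.message === 'string') return body.message
      if (typeof body.error === 'string') return body.error
      if (typeof body.error === 'object' && body.error !== null) {
        const inner = (body.error as { message?: unknown }).message
        if (typeof inner === 'string') return inner
      }
    }
  } catch {
    // Not JSON: fall through and return the text unchanged.
  }
  return raw
}
```

The word-boundary anchors are what keep a model ID such as `gpt-4290` or a path segment such as `/1429/` from being misread as a rate-limit status.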
* fix(agent): declare @browseros/shared as workspace dependency
The agent app imports @browseros/shared/constants/urls in
lib/referral/submit-referral.ts but never declared the package in its
dependencies, so vite failed to resolve the import during dev.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
* feat(referral): cap daily referral earnings at 500 credits
Block tweet submissions client-side once the user's balance reaches 500
to prevent unlimited credit farming via repeated shares.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
* feat(referral): randomize tweet variations for Twitter share
Replace the single hardcoded share text with 10 feature-specific
variations (agent mode, chat, scheduled tasks, connect apps, cowork,
workflows, memory, skills, local models, ad blocking) and pick one at
random each time the share button is clicked.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
* fix(referral): regenerate share URL on click
Previously getShareOnTwitterUrl() was evaluated once at render time as a
static href, so every click produced the same tweet variation. Move the
call into onClick so a new random variation is picked each time.
Addresses Greptile P1 review on PR #737.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
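A rough sketch of the per-click variation and the 500-credit cap described in these commits. Only `getShareOnTwitterUrl` and the cap value come from the commit messages; the tweet texts are placeholders (not the PR's ten variations), and `canSubmitReferral`, `handleShareClick`, and the injectable `random` parameter are assumptions added for illustration.

```typescript
// Placeholder copy; the PR's actual ten variations are not shown on this page.
const TWEET_VARIATIONS: readonly string[] = [
  'Placeholder: trying agent mode in @browserOS_ai',
  'Placeholder: scheduled tasks in @browserOS_ai',
  'Placeholder: local models in @browserOS_ai',
]

const REFERRAL_CREDIT_CAP = 500

// `random` is injectable so the choice can be made deterministic in tests.
function getShareOnTwitterUrl(random: () => number = Math.random): string {
  const index = Math.floor(random() * TWEET_VARIATIONS.length)
  const text = TWEET_VARIATIONS[index] ?? ''
  return `https://twitter.com/intent/tweet?text=${encodeURIComponent(text)}`
}

// Submissions are blocked once the balance reaches the cap.
function canSubmitReferral(balance: number): boolean {
  return balance < REFERRAL_CREDIT_CAP
}

// Called from onClick, so each click builds a fresh URL instead of
// reusing one computed at render time.
function handleShareClick(open: (url: string) => void): void {
  open(getShareOnTwitterUrl())
}
```

Building the URL inside the click handler is the point of the last commit: a static `href` is computed once per render, so the random pick would never change between clicks.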
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
* fix(credits): move credits fetch to extension side using install_id
Extension now reads `browseros.metrics_install_id` pref directly and
fetches credits from `llm.browseros.com` without going through the
bundled server. Unblocks the referral submit flow in prod without
requiring a BrowserOS binary release.
- Revert `/credits` route change that added `browserosId` to the response.
- Add `getOrCreateBrowserosId()` helper reading from BrowserOS prefs.
- Add `CREDITS_GATEWAY` to shared EXTERNAL_URLS.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
* refactor(credits): drop fallback UUID, read install_id directly
Extension only runs inside BrowserOS, so the prefs API is always
available. The chrome.storage fallback was dead code that would generate
a ghost ID diverging from the server's install_id anyway. Rename the
helper to match its simpler contract.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
* fix(credits): guard against empty install_id pref
Address Greptile P1: throw instead of silently fetching `/credits/null`
when `browseros.metrics_install_id` is unset. Fails loudly so the broken
state is observable rather than masquerading as a credits outage.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
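The empty-pref guard from the last commit might be sketched like this. The function name `buildCreditsUrl` is hypothetical, and the `https://` scheme and `/credits/<id>` layout are assumptions inferred from the host name and the `/credits/null` example in the commit message.

```typescript
// Assumed base URL; the commits name the host llm.browseros.com only.
const CREDITS_GATEWAY = 'https://llm.browseros.com'

function buildCreditsUrl(installId: string | null | undefined): string {
  // Fail loudly rather than request /credits/null or /credits/undefined.
  if (typeof installId !== 'string' || installId.trim() === '') {
    throw new Error('browseros.metrics_install_id is unset; cannot fetch credits')
  }
  return `${CREDITS_GATEWAY}/credits/${encodeURIComponent(installId)}`
}
```

Throwing here surfaces a missing pref as its own error, so it is not mistaken for a gateway outage when the request for a nonexistent ID fails.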
Document NVIDIA's free OpenAI-compatible API at build.nvidia.com — 80+ free models including GLM 5.1, MiniMax M2.7, Qwen 3.5, Mistral, and Nemotron — wired through BrowserOS's OpenAI Compatible provider template. Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
…756)
* feat(llm): Minimax Chinese and International Users providers
* fix(llm): Patch for p2 bugs
* fix(agent): correct MiniMax base URL handling and enforce API key validation
* fix(agent): add minimax entry to PROVIDER_DISPLAY_NAMES
The Record<ProviderType, string> map in ChatError.tsx was missing the
new minimax key added in this PR, causing a typecheck failure.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
---------
Co-authored-by: krish-mm <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
* feat: add deterministic eval graders (AGI SDK + WebArena-Infinity)
Two new benchmark integrations with programmatic grading, no LLM judge.
AGI SDK / REAL Bench (52 tasks):
- 11 React/Next.js clones of consumer apps (DoorDash, Amazon, Gmail, etc.)
- Grader navigates browser to /finish, extracts state diff from <pre> tag
- Python verifier checks exact values via jmespath queries
WebArena-Infinity (50 hard tasks):
- 13 LLM-generated SaaS clones (Gmail, GitLab, Linear, Figma, etc.)
- InfinityAppManager starts fresh app server per task per worker
- Python verifier calls /api/state and asserts on JSON state
Infrastructure:
- GraderInput extended with mcpUrl + infinityAppUrl for parallel workers
- Each worker gets isolated ports (no cross-worker state contamination)
- CI workflow: pip install agisdk, clone webarena-infinity repo
* chore: switch eval configs back to kimi-k2p5
* fix: register deterministic graders in pass rate calculation
Add agisdk_state_diff and infinity_state to PASS_FAIL_GRADER_ORDER in
both runner types and weekly report script, so scores show correctly in
the dashboard.
* chore: temp switch to opus 4.6 for eval run
* chore: restore kimi-k2p5 as default eval config
* ci: add timeout and continue-on-error for trend report step
Adds two minimal additions to the eval dashboard so reference-config workflows are easier:
- POST /api/run accepts testRun: true, which forces 1 worker + first task only, exercising the full executor path so API key, model, dataset, and BrowserOS port are all sanity-checked before a 50/200-task run.
- PUT /api/config/:name overwrites the loaded reference config in place (overwrite-only, schema-validated, path-traversal guarded).
- UI gains a Test Run button, Save to Reference button (visible only when a saved config is loaded), and a "Loaded: <name>" indicator.
No schema, runner, or task-loader changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
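The testRun narrowing described above (1 worker, first task only) can be sketched as a small pure function. `RunPlan` and `planRun` are hypothetical names; the real handler in server.ts is not shown on this page.

```typescript
// Hypothetical plan type; stands in for whatever server.ts passes to the executor.
interface RunPlan<T> {
  tasks: T[]
  numWorkers: number
  testRun: boolean
}

function planRun<T>(tasks: T[], requestedWorkers: number, testRun: boolean): RunPlan<T> {
  if (!testRun) {
    return { tasks, numWorkers: requestedWorkers, testRun: false }
  }
  // Test run: keep only the first task and force a single worker, so the
  // full executor path still runs but finishes quickly.
  return { tasks: tasks.slice(0, 1), numWorkers: 1, testRun: true }
}
```

Because the narrowing happens before the executor is created, a test run goes through the same code path as a full run, just on a smaller input.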
Remove the Gemini Computer Use and Yutori Navigator eval integrations, including their agent folders, dashboard config branches, sample configs, docs, and eval workspace dependencies. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Greptile Summary
This PR completes the removal of the Gemini Computer Use and Yutori Navigator eval agent implementations (deleting ~2,900 lines of agent code, configs, and validation) and bundles two dashboard enhancements: a Test Run mode (1 worker, 1 task) and a Save to Reference button backed by a new server endpoint.
Confidence Score: 4/5
Safe to merge; only P2 findings present
All issues found are P2: a minor UI text-state bug and a backwards-compatibility edge case for legacy result files. The removal is thorough, typecheck and lint pass per the test plan, and the new server endpoint is well-guarded.
packages/browseros-agent/apps/eval/src/types/result.ts: narrowed AgentConfigMetaSchema may silently fail on old Gemini/Yutori result directories if re-grading is attempted
Sequence Diagram
sequenceDiagram
participant UI as Dashboard (index.html)
participant Server as server.ts
participant FS as File System
Note over UI,FS: Save to Reference flow
UI->>Server: PUT /api/config/:name (JSON body)
Server->>FS: stat(filepath) — must already exist
Server->>Server: EvalConfigSchema.safeParse(body)
Server->>FS: Bun.write(filepath, JSON)
Server-->>UI: { status: 'saved', name }
Note over UI,FS: Test Run flow
UI->>Server: POST /api/run { testRun: true }
Server->>FS: loadTasks(datasetPath)
Server->>Server: tasks = tasks.slice(0, 1)
Server->>Server: ParallelExecutor({ numWorkers: 1 })
Server-->>UI: { status: 'started', taskCount: 1, testRun: true }
UI->>UI: setEvalRunningUI(true)
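The Save to Reference flow in the diagram (file must already exist, body must validate, then write) could be sketched as below, with a name check in front as one possible path-traversal guard. Everything here is an assumption for illustration: `ConfigStore`, `saveReferenceConfig`, the name regex, and the 400/404 status codes are hypothetical, and the store is synchronous for brevity where the real handler would await its file stat and `Bun.write` calls.

```typescript
type SaveResult =
  | { ok: true; status: 'saved'; name: string }
  | { ok: false; httpStatus: number; error: string }

// Hypothetical storage seam standing in for the stat/write file calls.
interface ConfigStore {
  exists(name: string): boolean
  write(name: string, json: string): void
}

// Letters, digits, underscore, hyphen only: no dots or slashes, so a name
// like "../secrets" cannot escape the config directory.
const SAFE_CONFIG_NAME = /^[A-Za-z0-9][A-Za-z0-9_-]*$/

function isSafeConfigName(name: string): boolean {
  return SAFE_CONFIG_NAME.test(name)
}

function saveReferenceConfig(
  store: ConfigStore,
  name: string,
  body: unknown,
  isValid: (body: unknown) => boolean,
): SaveResult {
  if (!isSafeConfigName(name)) {
    return { ok: false, httpStatus: 400, error: 'invalid config name' }
  }
  // Overwrite-only: refuse to create a config that does not exist yet.
  if (!store.exists(name)) {
    return { ok: false, httpStatus: 404, error: 'config not found' }
  }
  if (!isValid(body)) {
    return { ok: false, httpStatus: 400, error: 'invalid config' }
  }
  store.write(name, JSON.stringify(body, null, 2))
  return { ok: true, status: 'saved', name }
}
```

In the real endpoint, `isValid` would correspond to `EvalConfigSchema.safeParse(body)` succeeding, as shown in the diagram.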
Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/dashboard/index.html
Line: 944
Comment:
**Test button text stuck on "Starting..." during run**
After a successful submission, `setEvalRunningUI(true)` is called but only sets `testBtn.disabled = true` — it never resets `testBtn.textContent`. This means the test button shows "Starting..." for the entire duration of the eval until `setEvalRunningUI(false)` is called on completion. Consider adding `testBtn.textContent = 'Test Run'` (or "Running…") inside the `if (running)` branch.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/types/result.ts
Line: 14-19
Comment:
**Narrowed enum breaks parsing of legacy result files**
`AgentConfigMetaSchema` now only accepts `'single'` and `'orchestrator-executor'`. Any previously-stored `metadata.json` file with `agent_config.type: 'gemini-computer-use'` or `'yutori-navigator'` will fail the `TaskMetadataSchema.parse()` call in `trajectory-saver.ts#loadMetadata`. The `hasExistingGraderResults` helper swallows the Zod error (returns `exists: false`), but `loadMetadata()` in `updateGraderResults` has no try/catch, so re-grading an old Gemini/Yutori result directory would throw. Consider using `z.string()` for the `type` field (the `.passthrough()` already allows unknown fields) if backwards compatibility with existing result files matters.
How can I resolve this? If you propose a fix, please make it concise.
Reviews (1): Last reviewed commit: "chore(eval): remove Gemini and Yutori ag..."
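For the first review comment above (test button text stuck on "Starting..."), one possible shape of the fix is sketched here. This is not the dashboard's code: the real `setEvalRunningUI` lives in index.html and presumably looks up its own elements, whereas this version takes a hypothetical `ButtonLike` argument so it can run without a DOM. The labels 'Running…' and 'Test Run' are the ones the comment suggests.

```typescript
// Minimal stand-in for the HTMLButtonElement fields the fix touches.
interface ButtonLike {
  disabled: boolean
  textContent: string | null
}

function setEvalRunningUI(running: boolean, testBtn: ButtonLike): void {
  testBtn.disabled = running
  // Reset the label in both branches so "Starting..." never lingers
  // for the whole duration of the run.
  testBtn.textContent = running ? 'Running…' : 'Test Run'
}
```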
  const AgentConfigMetaSchema = z
    .object({
-     type: z.enum([
-       'single',
-       'orchestrator-executor',
-       'gemini-computer-use',
-       'yutori-navigator',
-     ]),
+     type: z.enum(['single', 'orchestrator-executor']),
      model: z.string().optional(),
    })
    .passthrough()
Narrowed enum breaks parsing of legacy result files
AgentConfigMetaSchema now only accepts 'single' and 'orchestrator-executor'. Any previously-stored metadata.json file with agent_config.type: 'gemini-computer-use' or 'yutori-navigator' will fail the TaskMetadataSchema.parse() call in trajectory-saver.ts#loadMetadata. The hasExistingGraderResults helper swallows the Zod error (returns exists: false), but loadMetadata() in updateGraderResults has no try/catch, so re-grading an old Gemini/Yutori result directory would throw. Consider using z.string() for the type field (the .passthrough() already allows unknown fields) if backwards compatibility with existing result files matters.
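The backwards-compatibility concern can be illustrated without Zod. The sketch below is a dependency-free stand-in for the reviewer's suggestion (accept any string for `type`, keep unknown fields as `.passthrough()` does), plus a non-throwing wrapper of the kind `updateGraderResults` lacks. `parseAgentConfigMeta` and `tryParseAgentConfigMeta` are hypothetical names, not functions from trajectory-saver.ts.

```typescript
interface AgentConfigMeta {
  type: string
  model?: string
  [extra: string]: unknown
}

// Accepts any string `type`, so legacy values such as 'gemini-computer-use'
// still parse; unknown fields are carried through unchanged.
function parseAgentConfigMeta(raw: unknown): AgentConfigMeta {
  if (typeof raw !== 'object' || raw === null) {
    throw new Error('agent_config must be an object')
  }
  const record = raw as Record<string, unknown>
  const type = record['type']
  const model = record['model']
  if (typeof type !== 'string') {
    throw new Error('agent_config.type must be a string')
  }
  if (model !== undefined && typeof model !== 'string') {
    throw new Error('agent_config.model must be a string when present')
  }
  return { ...record, type }
}

// Non-throwing variant for callers that prefer to skip unreadable metadata.
function tryParseAgentConfigMeta(raw: unknown): AgentConfigMeta | null {
  try {
    return parseAgentConfigMeta(raw)
  } catch {
    return null
  }
}
```

Whether to widen the type or keep the strict enum is a trade-off: the strict enum catches typos in new configs, while the widened field keeps old result directories re-gradable.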
Summary
Test plan
🤖 Generated with Claude Code