feat: nanochat worker — Karpathy's LLM pipeline on iii-engine#3
Conversation
proof is an iii worker that scans code changes and verifies them in a real Chromium browser using snapshot-driven accessibility testing. 25 registered functions:

- 14 browser tools (navigate, snapshot, click, type, screenshot, console logs, network requests, performance metrics, raw Playwright exec, assertions, CDP discovery, cookie injection)
- 11 pipeline functions (scan, coverage, execute, report, run, replay, flows, history, enqueue, cleanup)

8 HTTP endpoints for REST access. Uses iii primitives throughout:

- All inter-function calls via iii.trigger()
- State for reports and saved flows
- Streams for real-time progress
- Queue + DLQ for CI runs with auto-retry
- Logger with OTel tracing

Default mode: Claude Code or Codex as the agent (no API key). Automated mode: Anthropic API for headless CI (needs ANTHROPIC_API_KEY). 1,506 lines across 8 TypeScript files.
Wraps nanochat (tokenizer, pretraining, SFT, eval, inference, tool use) as 13 iii functions with typed Pydantic schemas, async handlers, and proper triggers. Any connected worker can train, evaluate, and chat with locally-trained GPT models.

- 13 functions (chat, model, tokenizer, tools, eval, train, health)
- 13 triggers (12 HTTP + 1 queue for long-running training)
- 4 iii state scopes (sessions, models, training, evals)
- Pydantic type hints for automatic request/response schema extraction
- Async handlers with trigger_async for state I/O
- safe() wrapper preventing WebSocket crashes from handler exceptions
- Tested 10/10 on live iii engine v0.10.0
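The safe() wrapper called out in the last bullet can be sketched with the standard library alone (a minimal sketch; the actual handler shape in worker.py may differ):

```python
import asyncio
import traceback

def safe(fn):
    """Wrap an async handler so unhandled exceptions come back as error
    dicts instead of propagating and tearing down the WebSocket loop."""
    async def wrapper(data):
        try:
            return await fn(data)
        except Exception as e:
            return {"error": str(e), "traceback": traceback.format_exc()}
    wrapper.__name__ = fn.__name__
    return wrapper

@safe
async def chat_handler(data):
    # illustrative failing handler, not the worker's real chat handler
    raise ValueError("model not loaded")

result = asyncio.run(chat_handler({}))
# the exception is captured and returned, not raised
```

The point of the pattern: whatever a handler does, the caller always gets a JSON-serializable response and the event loop stays alive.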
**Note: Reviews paused.** It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough

Adds two new workers: a Python nanochat worker (model loading, chat, tokenizer, eval, SFT training, checkpoints, health) and a TypeScript proof worker (Playwright browser automation, Anthropic agent orchestration, git scan/coverage, replay, and persistence).
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant III as iii Engine
    participant Nano as nanochat Worker
    participant GPU as GPUState
    participant Model as nanochat.engine
    Client->>III: POST /nanochat/chat.complete
    III->>Nano: invoke chat complete handler
    Nano->>GPU: acquire lock / ensure model loaded
    alt model not loaded
        GPU->>Model: load_model(source, device)
        Model-->>GPU: model + tokenizer
    end
    Nano->>Model: generate / stream tokens
    Model-->>Nano: tokens / final output
    Nano->>III: state::set session history
    Nano-->>Client: ChatCompleteOutput
```
```mermaid
sequenceDiagram
    participant Client
    participant III as iii Engine
    participant Proof as proof Worker
    participant Browser as Playwright
    participant Agent as Anthropic
    participant Tools as iii Tools
    Client->>III: POST /proof/run
    III->>Proof: trigger proof::run
    Proof->>III: trigger proof::scan
    alt no changes
        Proof-->>Client: {status: "skip"}
    else changes found
        Proof->>III: trigger proof::coverage
        Proof->>III: trigger proof::execute
        III->>Proof: launch browser session
        Proof->>Browser: new BrowserSession
        Proof->>Agent: runAgent(...) via trigger callback
        loop iterations
            Agent->>Anthropic: messages.create (SYSTEM_PROMPT + user)
            alt tool_use found
                Agent->>III: trigger browser tool fn
                III->>Browser: perform action (click/navigate/...)
                Browser-->>III: snapshot/result
                III-->>Agent: tool_result
            end
            Agent->>III: stream::set progress
        end
        Agent-->>Proof: RunReport
        Proof->>III: state::set report / save flow
        Proof->>Browser: closeAll()
        Proof-->>Client: RunReport
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Drop the SDK patterns section (internal detail users don't need), condense testing section to results and known issues only.
Actionable comments posted: 11
🧹 Nitpick comments (4)
nanochat/worker.py (3)
**55-64**: Intentional broad exception handling is acceptable here. The `safe()` wrapper's purpose is to prevent WebSocket crashes by converting any unhandled exception into an error response. This is a valid pattern for worker handlers where stability is critical. Consider using `functools.wraps` for cleaner attribute preservation:

♻️ Optional: Use functools.wraps

```diff
+import functools
+
 def safe(fn):
     """Wrap async handler so unhandled exceptions return error dicts, never crash the WebSocket."""
+    @functools.wraps(fn)
     async def wrapper(data):
         try:
             return await fn(data)
         except Exception as e:
             return {"error": str(e), "traceback": traceback.format_exc()}
-    wrapper.__name__ = fn.__name__
-    wrapper.__annotations__ = fn.__annotations__
     return wrapper
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nanochat/worker.py` around lines 55 - 64, The safe() wrapper currently copies name and annotations manually; replace that with functools.wraps to preserve metadata more robustly: import functools, then decorate wrapper with `@functools.wraps`(fn) (keeping wrapper async and returning the same error dict behavior), and remove the manual assignments to wrapper.__name__ and wrapper.__annotations__; ensure traceback.format_exc() remains used so error/traceback are still returned.
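What functools.wraps buys over the manual attribute copies can be shown with a runnable stdlib-only sketch (the handler name here is illustrative, not from the worker):

```python
import asyncio
import functools
import traceback

def safe(fn):
    @functools.wraps(fn)  # copies __name__, __doc__, and other metadata in one step
    async def wrapper(data):
        try:
            return await fn(data)
        except Exception as e:
            return {"error": str(e), "traceback": traceback.format_exc()}
    return wrapper

@safe
async def echo(data):
    """Echo the payload back."""
    return {"ok": data}

# metadata survives the wrapping, and behavior is unchanged
out = asyncio.run(echo({"x": 1}))
```

functools.wraps also sets `__wrapped__`, so introspection tools can reach the original function, which the manual two-attribute copy does not provide.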
**489**: Prefix unused variable with underscore. The `meta` variable from `load_model()` is unpacked but never used in this function.

♻️ Fix unused variable

```diff
-    model, tokenizer, meta = load_model(inp.source, device, "base", model_tag=inp.model_tag, step=inp.step)
+    model, tokenizer, _meta = load_model(inp.source, device, "base", model_tag=inp.model_tag, step=inp.step)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nanochat/worker.py` at line 489, The unpacked return from load_model(...) currently assigns to model, tokenizer, meta but meta is unused; change the unpacking to model, tokenizer, _ = load_model(inp.source, device, "base", model_tag=inp.model_tag, step=inp.step) (or model, tokenizer, _meta = ...) to prefix the unused third value with an underscore so linters know it is intentionally unused and no other code paths are affected.
**430-432**: Specify encoding when opening files. For cross-platform consistency, explicitly specify UTF-8 encoding.

♻️ Add encoding

```diff
-    with open(tasks_yaml) as f:
+    with open(tasks_yaml, encoding="utf-8") as f:
         tasks = yaml.safe_load(f)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nanochat/worker.py` around lines 430 - 432, When opening the tasks_yaml file before calling yaml.safe_load, specify the file encoding for cross-platform consistency by passing UTF-8 to the open call (i.e. update the open(tasks_yaml) usage so the file is opened with encoding="utf-8"); keep the same with-context and yaml.safe_load call and ensure variable names tasks and tasks_yaml remain unchanged.

proof/README.md (1)
**113-127**: Add language specifier to code blocks. Lines 113 and 230 have fenced code blocks without language specifiers, which triggers MD040 lint warnings. For ASCII diagrams, use `text` or `plaintext`.

♻️ Fix markdown lint warnings

```diff
-```
+```text
 proof::scan    git diff → changed files, commits
```

And similarly for line 230:

```diff
-```
+```text
 ┌──────────────────────────────────────────┐
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@proof/README.md` around lines 113 - 127, Update the two fenced ASCII diagram blocks in README.md (the block that begins with "proof::scan git diff → changed files, commits" and the block that starts with the box drawing "┌──────────────────────────────────────────┐") to include a language specifier by adding "text" to the opening fence so the blocks become fenced as text code blocks; this will silence the MD040 lint warnings while preserving the diagrams.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nanochat/pyproject.toml`:
- Around line 1-22: Create a package initializer file nanochat/__init__.py (e.g.
define __version__ = "0.1.0" or leave empty) so the nanochat module is a proper
package, and update the console script entry point in pyproject.toml from
"worker:main" to "nanochat.worker:main" to reference the worker module inside
the package; also add a PEP 517/518 [build-system] section to pyproject.toml
(include a minimal requires list such as setuptools and wheel and set
build-backend to setuptools.build_meta) so builds comply with build-system
metadata requirements.
In `@nanochat/README.md`:
- Around line 161-174: The fenced code block in README.md is missing a language
tag which causes markdownlint and syntax highlighting to fail; update the
opening fence from ``` to something like ```text or ```console (or ```log) so
the block is labeled (e.g., change the block showing nanochat
health/tokenizer/chat outputs to ```text) and save the file.
- Around line 78-80: Update the README to remove or reword the claim that
nanochat.tools.execute is "sandboxed" and instead explicitly state that the
worker runs arbitrary Python code using in-process exec() (referencing
nanochat.tools.execute and the worker implementation), note this does NOT
provide strong isolation, and add a clear operator warning recommending running
the worker in a hardened/isolated environment (e.g., separate VM/container,
unprivileged user, syscall filters) or replacing exec() with a true sandboxed
executor before exposing to untrusted inputs; apply this change to the
high-level description and the later implementation note that currently mentions
exec().
In `@nanochat/worker.py`:
- Line 657: The print statement using an unnecessary f-string prefix
(print(f"[nanochat] 13 functions, 13 triggers (12 HTTP + 1 queue)")) should be
changed to a regular string literal; locate the print call in worker.py (the
line that prints "[nanochat] 13 functions, 13 triggers (12 HTTP + 1 queue)") and
remove the leading f so it becomes print("[nanochat] 13 functions, 13 triggers
(12 HTTP + 1 queue)").
In `@proof/src/agent.ts`:
- Around line 35-36: The runStatus should not default to "pass" — change the
initialization of runStatus to a non-passing default (e.g., "error") and ensure
runStatus is only set to "pass" when a RUN_COMPLETED marker is observed;
explicitly set runStatus to "fail" when processing ASSERTION_FAILED or any tool
failure markers; update any code that synthesizes a fallback step (and usages
around recordedActions) to rely on the adjusted runStatus so a missing
RUN_COMPLETED does not yield a false "pass" (check all handling around
runStatus, RUN_COMPLETED, ASSERTION_FAILED, and recordedActions including the
other occurrences noted).
In `@proof/src/browser.ts`:
- Around line 129-149: The refs collapse when multiple elements share the same
role+name because refMap only stores {role,name} and resolveRef always uses
.first(); update the ref population (where refMap.set is called) to also capture
the element's ordinal (e.g., index among page.getByRole(role, { name })) or a
stable locator snapshot and store it in RefEntry (e.g., add an index/ordinal
property), and then change resolveRef to locate all matches and select the
stored ordinal (use locator.nth(entry.index) or equivalent) instead of .first();
keep a safe fallback if index is missing to preserve current behavior.
- Around line 237-249: The current handlePlaywrightExec uses AsyncFunction and
executes untrusted code in the worker process (see AsyncFunction and fn), so
change it to run the provided code inside the browser page context via
Playwright's page.evaluate to sandbox execution; serialize session.refMap
entries and pass them into page.evaluate, recreate the ref(id) helper inside the
browser context (using the serialized role/name entries) and then create and
invoke the async function from the code string inside page.evaluate (so no
host-side AsyncFunction or access to process/dynamic import), keeping function
name handlePlaywrightExec and ref as the identifying symbols to locate and
replace the implementation.
In `@proof/src/cookies.ts`:
- Around line 65-69: The SQLite SELECT that builds cookies (currently using
WHERE host_key LIKE '%${domain...}') is too permissive and should instead match
either the exact host or any subdomain (e.g., host_key = 'example.com' OR
host_key LIKE '%.example.com'); update the query in cookies.ts where
execFileAsync is called (the SELECT name, value, host_key as domain, ...) to use
host_key = '<domain>' OR host_key LIKE '.<domain>' pattern (escaping domain
properly), and apply the identical predicate change to the corresponding Firefox
query block referenced around lines 130-134 so both Chromium and Firefox imports
only match exact hosts or their subdomains.
In `@proof/src/worker.ts`:
- Around line 125-133: The handlers registered for "proof::scan" and
"proof::coverage" forward an untrusted input.cwd into
scanChanges/analyzeTestCoverage which is later passed to simpleGit, recursive
walks and file reads; change these handlers to ignore or validate input.cwd and
always resolve to a server-controlled safe root (e.g., repository root or a
configured workspace base) before calling scanChanges/analyzeTestCoverage.
Specifically, in the registerFunction blocks for id "proof::scan" and
"proof::coverage" (and the other similar blocks at the noted locations), replace
use of input.cwd with a sanitized value: compute path.resolve(baseRoot,
input.cwd ?? ".") and then assert the resolved path startsWith(baseRoot) (or
simply use baseRoot/untrusted cwd removed), and pass that sanitizedRoot to
scanChanges/analyzeTestCoverage; also update context.ts usage around
simpleGit/file reads to accept only validated roots.
- Around line 39-58: acquireRun() and releaseRun() must be exception-safe: wrap
the launch/interaction sequence in a try/finally so releaseRun() always runs
even if launchBrowser, cookie injection, autoDiscoverCdp(), or any subsequent
step throws; move cookie injection and any other pre-launch work inside the
protected block after acquireRun() and call releaseRun() from the finally; do
the same for the close path so closeBrowser(input.runId) is invoked inside try
and releaseRun() is called in finally to avoid leaving activeRunId set; adjust
the implementations around launchBrowser, closeBrowser, acquireRun, releaseRun
and any cookie injection logic accordingly.
---
Nitpick comments:
In `@nanochat/worker.py`:
- Around line 55-64: The safe() wrapper currently copies name and annotations
manually; replace that with functools.wraps to preserve metadata more robustly:
import functools, then decorate wrapper with `@functools.wraps`(fn) (keeping
wrapper async and returning the same error dict behavior), and remove the manual
assignments to wrapper.__name__ and wrapper.__annotations__; ensure
traceback.format_exc() remains used so error/traceback are still returned.
- Line 489: The unpacked return from load_model(...) currently assigns to model,
tokenizer, meta but meta is unused; change the unpacking to model, tokenizer, _
= load_model(inp.source, device, "base", model_tag=inp.model_tag, step=inp.step)
(or model, tokenizer, _meta = ...) to prefix the unused third value with an
underscore so linters know it is intentionally unused and no other code paths
are affected.
- Around line 430-432: When opening the tasks_yaml file before calling
yaml.safe_load, specify the file encoding for cross-platform consistency by
passing UTF-8 to the open call (i.e. update the open(tasks_yaml) usage so the
file is opened with encoding="utf-8"); keep the same with-context and
yaml.safe_load call and ensure variable names tasks and tasks_yaml remain
unchanged.
In `@proof/README.md`:
- Around line 113-127: Update the two fenced ASCII diagram blocks in README.md
(the block that begins with "proof::scan git diff → changed files,
commits" and the block that starts with the box drawing
"┌──────────────────────────────────────────┐") to include a language specifier
by adding "text" to the opening fence so the blocks become fenced as text code
blocks; this will silence the MD040 lint warnings while preserving the diagrams.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4b6e37ad-d249-4952-ad2a-fb3e7bd79fe3
📒 Files selected for processing (15)
- nanochat/README.md
- nanochat/pyproject.toml
- nanochat/worker.py
- proof/README.md
- proof/package.json
- proof/src/agent.ts
- proof/src/browser.ts
- proof/src/context.ts
- proof/src/cookies.ts
- proof/src/prompt.ts
- proof/src/tools.ts
- proof/src/types.ts
- proof/src/worker.ts
- proof/tsconfig.json
- registry/index.json
```toml
[project]
name = "iii-nanochat"
version = "0.1.0"
description = "nanochat LLM worker for iii-engine — train, fine-tune, evaluate, and chat with GPT models"
license = "Apache-2.0"
requires-python = ">=3.10"
dependencies = [
    "iii-sdk>=0.10.0",
    "torch>=2.0",
    "pydantic>=2.0",
    "tiktoken",
    "tokenizers",
    "datasets",
    "pyarrow",
    "psutil",
]

[project.optional-dependencies]
train = ["wandb"]

[project.scripts]
iii-nanochat = "worker:main"
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Check if worker.py is inside a nanochat package or at root level
fd -t f 'worker.py' | head -5
# Check for __init__.py to verify package structure
fd -t f '__init__.py' | head -10
```

Repository: iii-hq/workers
Length of output: 76
🏁 Script executed:
```shell
#!/bin/bash
# Check if nanochat/__init__.py exists
ls -la nanochat/__init__.py 2>/dev/null && echo "Found" || echo "Not found"
# List the contents of nanochat directory
ls -la nanochat/ | head -20
# Double-check the exact path of worker.py
find nanochat -name "worker.py" -type f
```

Repository: iii-hq/workers
Length of output: 385
Create nanochat/__init__.py and fix console script entry point.
The nanochat package is missing __init__.py, which means nanochat is not a proper Python package. This will cause the console script entry point "worker:main" to fail at runtime. Create an __init__.py file in the nanochat directory and update the entry point to "nanochat.worker:main".
Additionally, add the missing [build-system] section for PEP 517/518 compliance:
🔧 Required fixes
Create nanochat/__init__.py (can be empty or with version info):
```python
# nanochat/__init__.py
__version__ = "0.1.0"
```

In pyproject.toml:

```diff
+[build-system]
+requires = ["setuptools>=61.0"]
+build-backend = "setuptools.build_meta"
+
 [project]
```

Update the console script entry point:

```diff
 [project.scripts]
-iii-nanochat = "worker:main"
+iii-nanochat = "nanochat.worker:main"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```toml
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "iii-nanochat"
version = "0.1.0"
description = "nanochat LLM worker for iii-engine — train, fine-tune, evaluate, and chat with GPT models"
license = "Apache-2.0"
requires-python = ">=3.10"
dependencies = [
    "iii-sdk>=0.10.0",
    "torch>=2.0",
    "pydantic>=2.0",
    "tiktoken",
    "tokenizers",
    "datasets",
    "pyarrow",
    "psutil",
]

[project.optional-dependencies]
train = ["wandb"]

[project.scripts]
iii-nanochat = "nanochat.worker:main"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nanochat/pyproject.toml` around lines 1 - 22, Create a package initializer
file nanochat/__init__.py (e.g. define __version__ = "0.1.0" or leave empty) so
the nanochat module is a proper package, and update the console script entry
point in pyproject.toml from "worker:main" to "nanochat.worker:main" to
reference the worker module inside the package; also add a PEP 517/518
[build-system] section to pyproject.toml (include a minimal requires list such
as setuptools and wheel and set build-backend to setuptools.build_meta) so
builds comply with build-system metadata requirements.
Fixed. Added [build-system] section with setuptools backend (PEP 517/518).
```typescript
    if (isInteractive || isContent) {
      refCounter++;
      const ref = `e${refCounter}`;
      outputLine += ` [ref=${ref}]`;
      refMap.set(ref, { role, name: name ?? "" });
    }

    outputLines.push(outputLine);
  }

  return outputLines.join("\n");
}

export function resolveRef(
  ref: string,
  refMap: Map<string, RefEntry>,
  page: Page,
) {
  const entry = refMap.get(ref);
  if (!entry) throw new Error(`Ref "${ref}" not found in current snapshot. Take a new snapshot.`);
  return page.getByRole(entry.role as any, { name: entry.name }).first();
```
Snapshot refs are not stable when labels repeat.
The ref map only stores { role, name }, and resolveRef() always calls .first(). On pages with repeated labels like two "Delete" buttons, distinct refs collapse onto the same element and the agent will click/type into the wrong target.
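The ordinal-based fix the review suggests can be sketched outside Playwright. This Python toy (build_ref_map and resolve_ref are illustrative stand-ins for the TypeScript symbols) shows how storing an index disambiguates repeated role+name pairs:

```python
from dataclasses import dataclass

@dataclass
class RefEntry:
    role: str
    name: str
    index: int  # ordinal among elements sharing the same role+name

def build_ref_map(elements):
    """elements: (role, name) pairs in snapshot order."""
    ref_map, seen = {}, {}
    for i, (role, name) in enumerate(elements, start=1):
        ordinal = seen.get((role, name), 0)
        seen[(role, name)] = ordinal + 1
        ref_map[f"e{i}"] = RefEntry(role, name, ordinal)
    return ref_map

def resolve_ref(ref, ref_map, elements):
    """Return the position a ref points at, picking the nth role+name
    match instead of always the first."""
    entry = ref_map[ref]
    matches = [i for i, el in enumerate(elements) if el == (entry.role, entry.name)]
    return matches[entry.index]

elems = [("button", "Delete"), ("button", "Save"), ("button", "Delete")]
refs = build_ref_map(elems)
```

In Playwright terms, `matches[entry.index]` corresponds to `locator.nth(entry.index)` in place of `.first()`.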
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@proof/src/browser.ts` around lines 129 - 149, The refs collapse when multiple
elements share the same role+name because refMap only stores {role,name} and
resolveRef always uses .first(); update the ref population (where refMap.set is
called) to also capture the element's ordinal (e.g., index among
page.getByRole(role, { name })) or a stable locator snapshot and store it in
RefEntry (e.g., add an index/ordinal property), and then change resolveRef to
locate all matches and select the stored ordinal (use locator.nth(entry.index)
or equivalent) instead of .first(); keep a safe fallback if index is missing to
preserve current behavior.
Not part of the nanochat worker changes. This is pre-existing code in the proof worker.
```typescript
export async function handlePlaywrightExec(
  code: string,
  session: BrowserSession,
): Promise<unknown> {
  const { page, context, browser } = session;
  const ref = (id: string) => {
    const entry = session.refMap.get(id);
    if (!entry) throw new Error(`Ref "${id}" not found`);
    return page.getByRole(entry.role as any, { name: entry.name }).first();
  };
  const AsyncFunction = Object.getPrototypeOf(async () => {}).constructor;
  const fn = new AsyncFunction("page", "context", "browser", "ref", code);
  return fn(page, context, browser, ref);
```
browser_exec is host-side code execution.
new AsyncFunction(...) runs arbitrary JavaScript inside the worker process, not inside an isolated page sandbox. Because proof/src/worker.ts, Lines 106-107 register this as proof::browser::exec, any caller that can reach the worker gets access to process, dynamic imports, and the local host.
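The severity is easiest to see with Python's exec(), which has the same property as a host-side AsyncFunction: a dynamically evaluated string runs in-process with the worker's full privileges (illustrative sketch, not code from either worker):

```python
import os

# what host-side dynamic evaluation effectively does:
untrusted = "result = os.getcwd()"  # could just as easily read secrets or spawn processes
scope = {"os": os}
exec(untrusted, scope)          # runs in-process: no isolation from the host
leaked = scope["result"]        # the caller-supplied string saw the real host environment
```

Moving execution into `page.evaluate`, as the prompt suggests, confines the code to the browser's JavaScript sandbox instead of the worker process.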
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@proof/src/browser.ts` around lines 237 - 249, The current
handlePlaywrightExec uses AsyncFunction and executes untrusted code in the
worker process (see AsyncFunction and fn), so change it to run the provided code
inside the browser page context via Playwright's page.evaluate to sandbox
execution; serialize session.refMap entries and pass them into page.evaluate,
recreate the ref(id) helper inside the browser context (using the serialized
role/name entries) and then create and invoke the async function from the code
string inside page.evaluate (so no host-side AsyncFunction or access to
process/dynamic import), keeping function name handlePlaywrightExec and ref as
the identifying symbols to locate and replace the implementation.
Not part of the nanochat worker changes. This is pre-existing code in the proof worker.
```typescript
const { stdout } = await execFileAsync("sqlite3", [
  "-json",
  cookieDbPath,
  `SELECT name, value, host_key as domain, path, expires_utc, is_secure, is_httponly, samesite FROM cookies WHERE host_key LIKE '%${domain.replace(/'/g, "''")}'`,
]);
```
Use exact/subdomain matches when selecting cookies.
LIKE '%${domain}' also matches unrelated suffixes such as notexample.com when the target is example.com, so this can import third-party cookies into the Playwright context. Match the exact host plus .${domain} subdomains instead.
🔒 Suggested query shape
```diff
+  const escapedDomain = domain.replace(/'/g, "''");
   const { stdout } = await execFileAsync("sqlite3", [
     "-json",
     cookieDbPath,
-    `SELECT name, value, host_key as domain, path, expires_utc, is_secure, is_httponly, samesite FROM cookies WHERE host_key LIKE '%${domain.replace(/'/g, "''")}'`,
+    `SELECT name, value, host_key as domain, path, expires_utc, is_secure, is_httponly, samesite FROM cookies
+       WHERE host_key = '${escapedDomain}'
+          OR host_key LIKE '%.${escapedDomain}'`,
   ]);
```

Apply the same predicate to the Firefox query.
Also applies to: 130-134
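The suggested predicate is easy to check host-side; a Python sketch (domain_matches is hypothetical, not in the codebase) shows both the fix and the false positive the original `LIKE '%domain'` allows:

```python
def domain_matches(host_key: str, domain: str) -> bool:
    """Accept the exact host or any subdomain, nothing else."""
    return host_key == domain or host_key.endswith("." + domain)

# the LIKE '%example.com' predicate is a plain suffix match,
# so it would have accepted this unrelated host:
overly_broad = "notexample.com".endswith("example.com")
```

The leading dot in the subdomain test is what distinguishes `api.example.com` from `notexample.com`.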
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@proof/src/cookies.ts` around lines 65 - 69, The SQLite SELECT that builds
cookies (currently using WHERE host_key LIKE '%${domain...}') is too permissive
and should instead match either the exact host or any subdomain (e.g., host_key
= 'example.com' OR host_key LIKE '%.example.com'); update the query in
cookies.ts where execFileAsync is called (the SELECT name, value, host_key as
domain, ...) to use host_key = '<domain>' OR host_key LIKE '.<domain>' pattern
(escaping domain properly), and apply the identical predicate change to the
corresponding Firefox query block referenced around lines 130-134 so both
Chromium and Firefox imports only match exact hosts or their subdomains.
Not part of the nanochat worker changes. This is pre-existing code in the proof worker.
@rohitg00, understood — the concern is in the proof worker and outside the scope of this nanochat PR. Would you like me to open a follow-up GitHub issue to track the domain-matching fix in proof/src/cookies.ts (both the Chrome and Firefox query blocks) so it isn't lost?
🧠 Learnings used
Learnt from: rohitg00
Repo: iii-hq/workers PR: 3
File: nanochat/worker.py:0-0
Timestamp: 2026-03-30T09:59:20.981Z
Learning: In `nanochat/worker.py`, `fn_tools_execute` uses in-process `exec()` instead of `subprocess.run` or `multiprocessing.Process` because forking corrupts the asyncio event loop in the iii-sdk WebSocket connection. This is a known, accepted trade-off documented in the README under Known Issues. For production deployments with untrusted code, a separate sandboxed worker is required. The function description intentionally reads "Execute Python code" (not "sandboxed") to reflect the lack of real isolation.
```typescript
iii.registerFunction({ id: "proof::browser::launch" }, async (input) => {
  const { runId, headed, cdp } = input;
  acquireRun(runId);
  let cdpUrl: string | undefined;
  if (cdp === "auto") {
    cdpUrl = (await autoDiscoverCdp()) ?? undefined;
  } else if (cdp) {
    cdpUrl = cdp;
  }
  await launchBrowser(runId, headed, cdpUrl);
  logger.info("Browser launched", { runId, headed, cdp: cdpUrl ?? "none" });
  return { runId, launched: true };
});

iii.registerFunction({ id: "proof::browser::close" }, async (input) => {
  const result = await closeBrowser(input.runId);
  releaseRun();
  logger.info("Browser closed", { runId: input.runId });
  return result;
});
```
Make the run lock and browser cleanup exception-safe.
acquireRun() happens before launch succeeds, cookie injection happens before the try/finally, and releaseRun() runs only after closeBrowser() succeeds. A failure in any of those paths can leave activeRunId stuck and force manual proof::cleanup.
🧯 Suggested structure
iii.registerFunction({ id: "proof::browser::launch" }, async (input) => {
const { runId, headed, cdp } = input;
acquireRun(runId);
- let cdpUrl: string | undefined;
- if (cdp === "auto") {
- cdpUrl = (await autoDiscoverCdp()) ?? undefined;
- } else if (cdp) {
- cdpUrl = cdp;
- }
- await launchBrowser(runId, headed, cdpUrl);
- logger.info("Browser launched", { runId, headed, cdp: cdpUrl ?? "none" });
- return { runId, launched: true };
+ try {
+ let cdpUrl: string | undefined;
+ if (cdp === "auto") {
+ cdpUrl = (await autoDiscoverCdp()) ?? undefined;
+ } else if (cdp) {
+ cdpUrl = cdp;
+ }
+ await launchBrowser(runId, headed, cdpUrl);
+ logger.info("Browser launched", { runId, headed, cdp: cdpUrl ?? "none" });
+ return { runId, launched: true };
+ } catch (err) {
+ releaseRun();
+ throw err;
+ }
});
iii.registerFunction({ id: "proof::browser::close" }, async (input) => {
- const result = await closeBrowser(input.runId);
- releaseRun();
- logger.info("Browser closed", { runId: input.runId });
- return result;
+ try {
+ const result = await closeBrowser(input.runId);
+ logger.info("Browser closed", { runId: input.runId });
+ return result;
+ } finally {
+ releaseRun();
+ }
});
iii.registerFunction({ id: "proof::execute" }, async (input) => {
const { diff, files, base_url, instruction, runId, headed, commits, coverage, cdp, cookies } = input;
- await iii.trigger({
- function_id: "proof::browser::launch",
- payload: { runId, headed, cdp },
- });
-
- if (cookies) {
- await iii.trigger({
- function_id: "proof::cookies::inject",
- payload: { url: base_url },
- });
- }
-
try {
+ await iii.trigger({
+ function_id: "proof::browser::launch",
+ payload: { runId, headed, cdp },
+ });
+ if (cookies) {
+ await iii.trigger({
+ function_id: "proof::cookies::inject",
+ payload: { url: base_url },
+ });
+ }
const trigger = iii.trigger.bind(iii);
return await runAgent(trigger, diff, files, base_url, runId, instruction, commits, coverage);
} finally {
await iii.trigger({
      function_id: "proof::browser::close",

Also applies to: 135-159
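The invariant the suggestion enforces, that the run lock is always released when launch fails, can be sketched in a few lines of Python (acquire_run/release_run mirror the TypeScript names but are illustrative):

```python
active_run_id = None  # module-level lock, as in the worker

def acquire_run(run_id):
    global active_run_id
    if active_run_id is not None:
        raise RuntimeError(f"run {active_run_id} already active")
    active_run_id = run_id

def release_run():
    global active_run_id
    active_run_id = None

def launch(run_id, should_fail=False):
    acquire_run(run_id)
    try:
        if should_fail:
            raise RuntimeError("browser launch failed")
        return {"runId": run_id, "launched": True}
    except Exception:
        release_run()  # without this, a failed launch leaves the lock held forever
        raise

try:
    launch("r1", should_fail=True)
except RuntimeError:
    pass
# the lock was released despite the failure, so the next run can proceed
out = launch("r2")
```

The close path is symmetric: release in a finally so a failing closeBrowser() cannot strand the lock either.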
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@proof/src/worker.ts` around lines 39 - 58, acquireRun() and releaseRun() must
be exception-safe: wrap the launch/interaction sequence in a try/finally so
releaseRun() always runs even if launchBrowser, cookie injection,
autoDiscoverCdp(), or any subsequent step throws; move cookie injection and any
other pre-launch work inside the protected block after acquireRun() and call
releaseRun() from the finally; do the same for the close path so
closeBrowser(input.runId) is invoked inside try and releaseRun() is called in
finally to avoid leaving activeRunId set; adjust the implementations around
launchBrowser, closeBrowser, acquireRun, releaseRun and any cookie injection
logic accordingly.
Not part of the nanochat worker changes. This is pre-existing code in the proof worker.
iii.registerFunction({ id: "proof::scan" }, async (input) => {
  logger.info("Scanning changes", { target: input.target ?? "unstaged" });
  return scanChanges(input.target, input.cwd, input.main_branch, input.commit_hash);
});

iii.registerFunction({ id: "proof::coverage" }, async (input) => {
  logger.info("Analyzing test coverage", { files: input.files?.length });
  return analyzeTestCoverage(input.files ?? [], input.cwd);
});
Don't let HTTP callers choose an arbitrary filesystem root.
cwd is forwarded straight into scanChanges() and analyzeTestCoverage(). In proof/src/context.ts, Lines 20-21 and 88-101 then feed that into simpleGit(...), recursive directory walks, and file reads, so proof::run and proof::coverage can inspect any accessible checkout or directory on the host.
Also applies to: 211-229, 336-342
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@proof/src/worker.ts` around lines 125 - 133, The handlers registered for
"proof::scan" and "proof::coverage" forward an untrusted input.cwd into
scanChanges/analyzeTestCoverage which is later passed to simpleGit, recursive
walks and file reads; change these handlers to ignore or validate input.cwd and
always resolve to a server-controlled safe root (e.g., repository root or a
configured workspace base) before calling scanChanges/analyzeTestCoverage.
Specifically, in the registerFunction blocks for id "proof::scan" and
"proof::coverage" (and the other similar blocks at the noted locations), replace
use of input.cwd with a sanitized value: compute path.resolve(baseRoot,
input.cwd ?? ".") and then assert the resolved path startsWith(baseRoot) (or
simply use baseRoot/untrusted cwd removed), and pass that sanitizedRoot to
scanChanges/analyzeTestCoverage; also update context.ts usage around
simpleGit/file reads to accept only validated roots.
Not part of the nanochat worker changes. This is pre-existing code in the proof worker.
Actionable comments posted: 1
♻️ Duplicate comments (2)
nanochat/README.md (2)
115-128: ⚠️ Potential issue | 🟡 Minor: Add a language tag to the fenced code block.
Line 115 should specify a language (for markdownlint MD040 and proper rendering).
Suggested doc fix
-```
+```text
OK nanochat.health {"status": "ok", "model_loaded": false}
...
10/10 responded, 0 crashes

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nanochat/README.md` around lines 115 - 128, change the fenced code block that
begins with the line "OK nanochat.health {"status": "ok", "model_loaded": false}"
so the opening backticks include a language tag (e.g., `text` or `console`) to
satisfy markdownlint MD040 and ensure correct rendering; update the opening fence
only and leave the block content and closing fence unchanged.

78-80: ⚠️ Potential issue | 🟠 Major: Avoid calling `tools.execute` "sandboxed" when it runs in-process `exec()`.
Lines 78-80 currently overstate isolation and can mislead operators. Align the wording with the actual risk model and add a hard warning about untrusted input exposure.
Suggested doc fix

-**nanochat.tools.execute**: `POST /nanochat/tools/execute`
-
-Executes arbitrary Python code in a sandboxed environment. Returns stdout, stderr, success status, and any errors. This mirrors nanochat's built-in tool use (calculator, code execution) that models learn during SFT training.
+**nanochat.tools.execute**: `POST /nanochat/tools/execute`
+
+Executes arbitrary Python code using in-process `exec()` with stdout/stderr capture. Returns stdout, stderr, success status, and any errors. This mirrors nanochat's built-in tool use (calculator, code execution) that models learn during SFT training.
+⚠️ This is **not** a strong isolation boundary. Run this worker in a hardened isolated environment (dedicated VM/container, unprivileged user, syscall/network restrictions) or replace it with a true sandbox before exposing it to untrusted inputs.

Also applies to: 138-139

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nanochat/README.md` around lines 78 - 80, update the README text for
nanochat.tools.execute to remove the word "sandboxed" and clarify that execution
uses in-process Python exec(), which does not fully isolate processes or OS
resources; explicitly state the risks (full access to process memory, file
system, network, and ability to run arbitrary code) and add a prominent warning
that untrusted input must never be passed to nanochat.tools.execute. Locate and
update both occurrences (the description near "POST /nanochat/tools/execute" and
the repeated mention later) so the wording accurately reflects the risk model
and includes the hard warning for operators.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nanochat/README.md`:
- Around line 48-49: Fix missing spaces after colons in README prose by
replacing instances of "word:next" with "word: next"; specifically update the
sentence containing "Every handler uses Pydantic type hints for automatic
request/response schema extraction:the engine knows the exact input/output shape
of every function" to insert a space after the colon, and likewise correct the
similar colon-spacing issues referenced around the phrases on lines 56-57,
113-114, and 130 (search for any colon immediately followed by a letter and add
a space).
Duplicate comments:
In `@nanochat/README.md`:
- Around line 115-128: Change the fenced code block that begins with the lines
starting "OK nanochat.health {"status": "ok",
"model_loaded": false}" so the opening backticks include a language tag (e.g.,
`text` or `console`) to satisfy markdownlint MD040 and ensure correct
rendering; update the opening fence only and leave the block content and closing
fence unchanged.
- Around line 78-80: Update the README text for nanochat.tools.execute to remove
the word "sandboxed" and clarify that execution uses in-process Python exec(),
which does not fully isolate processes or OS resources; explicitly state the
risks (full access to process memory, file system, network, and ability to run
arbitrary code) and add a prominent warning that untrusted input must never be
passed to nanochat.tools.execute. Locate and update both occurrences referring
to nanochat.tools.execute (the description near "POST /nanochat/tools/execute"
and the repeated mention around the later lines) so the wording accurately
reflects the risk model and includes the hard warning for operators.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 802897be-471c-4082-83ba-c9d0029e7916
📥 Commits
Reviewing files that changed from the base of the PR and between 0213bb23e374fdf7ac9a69abf2dc35c822fa7b43 and 3d437be5e05cfe99323003a467c29fbc68e89fa3.
📒 Files selected for processing (1)
- nanochat/README.md
Covers every nanochat capability through iii primitives:
Training (all queued):
- train.tokenizer: BPE tokenizer training from dataset
- train.base: Base pretraining with depth scaling, LR scheduling, checkpoint saving, FP8 support, periodic BPB evaluation
- train.sft: SFT with real task mixture (SmolTalk, MMLU, GSM8K, SimpleSpelling, SpellingBee), warmdown scheduling, checkpoint saving
- train.rl: GRPO reinforcement learning on GSM8K with advantage weighting, multi-sample rollouts, checkpoint saving
Evaluation:
- eval.core: Real CORE benchmark via base_eval.evaluate_core()
- eval.loss: BPB on configurable split with batch size control
- eval.chat: ChatCORE evaluation (generative + categorical tasks)
New functions:
- model.sample: Raw text generation from loaded model
- checkpoint.save: Save current model to disk
- checkpoint.list: List available checkpoints by source
All training progress, eval results, and checkpoints tracked through iii state scopes.
- Added karpathy/nanochat as git submodule at nanochat-upstream/
- Training functions run the actual nanochat scripts as subprocesses (100% fidelity: gradient accum, Muon optimizer, best-fit packing, full task mixture, GRPO, DDP, FP8, checkpoint saving)
- eval.chat uses the real run_chat_eval dispatcher (all 6 tasks)
- tools.execute uses in-process exec (subprocess crashes iii WebSocket)
- 893 lines, 20 functions, 12/12 tested, 0 crashes
Actionable comments posted: 4
♻️ Duplicate comments (3)
nanochat/README.md (2)
81-84: ⚠️ Potential issue | 🟠 Major: Do not describe `tools.execute` as sandboxed.
The README still says "sandboxed environment," but the implementation note confirms in-process `exec()`. This is a security-boundary mismatch and can mislead operators.
Also applies to: 141-142
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nanochat/README.md` around lines 81 - 84, The README incorrectly describes nanochat.tools.execute as running in a "sandboxed environment"; update the README text for nanochat.tools.execute to remove the word "sandboxed" and explicitly state that the implementation uses in-process exec() (i.e., not a secure sandbox), add a clear security warning about arbitrary code execution risk and recommended operator mitigations (disable the endpoint or restrict access), and make the same wording change for the other occurrence mentioned (lines 141-142) so both descriptions match the actual nanochat.tools.execute behavior.
118-131: ⚠️ Potential issue | 🟡 Minor: Add a language to the fenced output block.
This block is currently unlabeled; markdownlint/syntax highlighting will continue to fail.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nanochat/README.md` around lines 118 - 131, the unlabeled fenced code block that starts with lines like "OK nanochat.health {"status": "ok", "model_loaded": false}" should be given a language identifier (for example ```text or ```console) to satisfy markdownlint and enable proper syntax highlighting; update the fenced block in README.md that contains the nanochat status/outputs to begin with a language label instead of a bare ```.
nanochat/worker.py (1)
117-120: ⚠️ Potential issue | 🔴 Critical: `tools.execute` is unsandboxed and ignores the declared timeout.
This endpoint executes arbitrary Python with full builtins, and `timeout` is never enforced. The registered description also says "in sandbox," which is inaccurate for the current implementation.
Suggested minimal mitigation
- ("nanochat.tools.execute", fn_tools_execute, "Execute Python code in sandbox", "http", {"api_path": "/nanochat/tools/execute", "http_method": "POST"}),
+ ("nanochat.tools.execute", fn_tools_execute, "Execute Python code (WARNING: unsandboxed; trusted input only)", "http", {"api_path": "/nanochat/tools/execute", "http_method": "POST"}),
Also applies to: 424-433, 796-796
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nanochat/worker.py` around lines 117 - 120, The ExecuteCodeInput model and tools.execute currently run arbitrary Python unsandboxed and ignore the timeout; update the execution path (the tools.execute handler and any helper that runs user code) to run user code in a separate process (e.g., subprocess or multiprocessing.Process) and enforce ExecuteCodeInput.timeout by passing it to subprocess.run or using Process.join with a timeout and terminating the child if exceeded; additionally restrict the execution environment by removing dangerous builtins and not importing modules in the child (or run a dedicated sandboxed interpreter) and update the registered description text from “in sandbox” to reflect the actual sandboxing level until a full sandbox is implemented.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nanochat/README.md`:
- Line 35: Replace the incorrect dependency install command 'pip install -r
nanochat-upstream/pyproject.toml' with an editable package install; change the
README to instruct users to run 'cd nanochat-upstream && pip install -e .' (or
an equivalent pip install .) so pip reads the package metadata instead of
treating a pyproject.toml as a requirements file.
In `@nanochat/worker.py`:
- Around line 53-58: The decorator safe currently returns full tracebacks to
callers; change wrapper in safe to log the full exception and traceback
server-side (e.g., use the app logger and logger.exception or
logger.error(traceback.format_exc())) and remove traceback from the response,
returning a sanitized error envelope such as {"error": "Internal server error"}
or {"error": str(e)} without any traceback; update any imports/usage accordingly
in the safe/wrapper code paths.
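A minimal sketch of a `safe()` wrapper along the lines the review suggests, assuming handlers may be sync or async and return dict envelopes (names are illustrative, not the worker's actual code):

```python
import asyncio
import functools
import logging
import traceback

logger = logging.getLogger("worker")

def safe(fn):
    """Wrap a handler so exceptions are logged server-side with a full
    traceback, while callers only receive a sanitized error envelope."""
    if asyncio.iscoroutinefunction(fn):
        @functools.wraps(fn)
        async def async_wrapper(*args, **kwargs):
            try:
                return await fn(*args, **kwargs)
            except Exception as e:
                # Full traceback stays in server logs only.
                logger.error("handler %s failed:\n%s", fn.__name__, traceback.format_exc())
                return {"error": str(e)}
        return async_wrapper

    @functools.wraps(fn)
    def sync_wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as e:
            logger.error("handler %s failed:\n%s", fn.__name__, traceback.format_exc())
            return {"error": str(e)}
    return sync_wrapper
```

Checking `iscoroutinefunction` at decoration time keeps async handlers awaitable, so the event loop sees a coroutine either way.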
- Around line 224-239: GPUState.load currently updates
model/tokenizer/engine/meta/source/model_tag/device under self._lock but
multiple handlers (fn_chat_complete, fn_chat_stream, fn_model_load,
fn_model_status, fn_eval_core, fn_eval_loss, fn_eval_chat) read those fields
without locking, risking races; fix by acquiring the same self._lock in each
read path (or at start of each handler) to either (a) copy model, tokenizer,
engine, meta, source, model_tag, device into local variables while holding the
lock and then release it before use, or (b) wrap the entire read/usage critical
section in with self._lock, ensuring all accesses use the protected references
and avoiding mixed-version/state reads; update all listed functions to use
self._lock consistently.
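The snapshot pattern described above could look roughly like this (a simplified sketch; the real `GPUState` holds actual model/tokenizer/engine objects):

```python
import threading

class GPUState:
    """Holds the loaded model plus metadata; snapshot() copies every field
    atomically so readers never see a mixed-version state during load()."""

    def __init__(self):
        self._lock = threading.Lock()
        self.model = None
        self.tokenizer = None
        self.engine = None
        self.meta = None
        self.source = None
        self.model_tag = None
        self.device = None

    def load(self, model, tokenizer, engine, meta, source, model_tag, device):
        with self._lock:  # writer replaces everything under one lock
            self.model, self.tokenizer, self.engine = model, tokenizer, engine
            self.meta, self.source = meta, source
            self.model_tag, self.device = model_tag, device

    def snapshot(self):
        with self._lock:  # readers take all references under the same lock
            return (self.model, self.tokenizer, self.engine,
                    self.meta, self.source, self.model_tag, self.device)
```

Handlers then unpack one `snapshot()` call into locals instead of reading `gpu.*` fields piecemeal, which is exactly what makes concurrent `model.load` safe.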
- Around line 472-492: fn_train_tokenizer (and the other async handlers
fn_train_base, fn_train_sft, fn_train_rl) currently call the blocking
_run_nanochat_script() directly and block the event loop; change each call to
run in a thread by awaiting asyncio.to_thread(_run_nanochat_script,
"scripts.tok_train", args, run_id, "tokenizer") (and the corresponding
script/name/args for the other handlers) so the blocking subprocess work is
offloaded, and add an import for asyncio if it's not already present.
---
Duplicate comments:
In `@nanochat/README.md`:
- Around line 81-84: The README incorrectly describes nanochat.tools.execute as
running in a "sandboxed environment"; update the README text for
nanochat.tools.execute to remove the word "sandboxed" and explicitly state that
the implementation uses in-process exec() (i.e., not a secure sandbox), add a
clear security warning about arbitrary code execution risk and recommended
operator mitigations (disable the endpoint or restrict access), and make the
same wording change for the other occurrence mentioned (lines 141-142) so both
descriptions match the actual nanochat.tools.execute behavior.
- Around line 118-131: The unlabeled fenced code block that starts with lines
like "OK nanochat.health {"status": "ok", "model_loaded": false}" should be
given a language identifier (for example ```text or ```console) to satisfy
markdownlint and enable proper syntax highlighting; update the fenced block in
README.md that contains the nanochat status/outputs to begin with a language
label instead of just ```.
In `@nanochat/worker.py`:
- Around line 117-120: The ExecuteCodeInput model and tools.execute currently
run arbitrary Python unsandboxed and ignore the timeout; update the execution
path (the tools.execute handler and any helper that runs user code) to run user
code in a separate process (e.g., subprocess or multiprocessing.Process) and
enforce ExecuteCodeInput.timeout by passing it to subprocess.run or using
Process.join with a timeout and terminating the child if exceeded; additionally
restrict the execution environment by removing dangerous builtins and not
importing modules in the child (or run a dedicated sandboxed interpreter) and
update the registered description text from “in sandbox” to reflect the actual
sandboxing level until a full sandbox is implemented.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a15d26e2-e7dd-477d-9032-aa0e20e6f6a0
📒 Files selected for processing (4)
- .gitmodules
- nanochat/README.md
- nanochat/nanochat-upstream
- nanochat/worker.py
✅ Files skipped from review due to trivial changes (2)
- nanochat/nanochat-upstream
- .gitmodules
Training functions now parse nanochat's stdout line-by-line and push structured metrics to iii state as they arrive. Any worker can poll nanochat.train.status and see live:
- step, loss, learning rate multiplier, tokens/sec, MFU, epoch
- validation BPB (when eval runs)
- CORE metric scores (base training)
- ChatCORE + ChatCORE_cat scores (SFT)
- average reward + sequence length (RL)
- pass@k accuracy (RL eval)
Eval results also written to nanochat:evals with run_id + step.
Parser handles all 6 nanochat output patterns:
- base/sft step line (step/total, loss, lrm, dt, tok/sec, mfu)
- Validation bpb line
- CORE metric line
- ChatCORE line
- RL average reward line
- RL pass@k line
937 lines, 20 functions, 12/12 tested, 0 crashes.
Actionable comments posted: 3
🧹 Nitpick comments (1)
nanochat/worker.py (1)
380-380: Minor: Prefer list unpacking over concatenation.
♻️ Suggested refactor
- tokens = [bos] + gpu.tokenizer.encode(inp.prompt) if inp.prompt else [bos]
+ tokens = [bos, *gpu.tokenizer.encode(inp.prompt)] if inp.prompt else [bos]
Verify each finding against the current code and only fix it if needed. In `@nanochat/worker.py` at line 380, The current tokens construction uses list concatenation; change it to list unpacking to be more idiomatic and slightly more efficient: when building tokens (variable tokens) use [bos, *gpu.tokenizer.encode(inp.prompt)] if inp.prompt else [bos], referencing tokens, bos, gpu.tokenizer.encode, and inp.prompt in the update.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nanochat/worker.py`:
- Around line 424-433: fn_tools_execute currently calls exec with full
__builtins__ and ignores ExecuteCodeInput.timeout, leaving execution unsafe and
unbounded; change it to either (A) enforce a hard timeout and limit builtins:
run the exec in a restricted globals dict (do not pass {"__builtins__":
__builtins__}) and only include a small allowlist of safe callables, execute the
code in a worker thread (e.g., asyncio.to_thread) and wrap that in
asyncio.wait_for(timeout=inp.timeout) to enforce the timeout, or (B) if true
sandboxing isn't feasible, update the function/docstring that claims "sandbox"
to explicitly warn not to run untrusted code and still implement the timeout via
asyncio.wait_for or signal.alarm; reference fn_tools_execute, ExecuteCodeInput,
inp.timeout, and the current exec call to locate and fix the code.
- Around line 801-805: The current filename parsing in the block that builds
steps for each tag (variables tag_path, tag_dir and the steps list appended to
checkpoints) assumes files are exactly "model_STEP.pt" and will raise
IndexError/ValueError on malformed files; fix by filtering and parsing more
defensively: iterate over os.listdir(tag_path), only consider names that match a
strict pattern (e.g., regex r"^model_(\d+)\.pt$") or otherwise wrap the int
extraction in try/except to skip non-matching files, collect valid integers,
sort them, and append the resulting steps list to checkpoints; ensure malformed
filenames are ignored rather than bubbling exceptions.
- Around line 349-350: The current logic sets device = inp.device or
autodetect_device_type(), which treats the string "auto" as a valid truthy
device and bypasses autodetection; update the selection so that if inp.device is
None or equals "auto" (case-insensitive) you call autodetect_device_type() and
otherwise use inp.device, then pass that resolved device to gpu.load(...) (refer
to the variables/function names inp.device, autodetect_device_type, device, and
gpu.load).
---
Nitpick comments:
In `@nanochat/worker.py`:
- Line 380: The current tokens construction uses list concatenation; change it
to list unpacking to be more idiomatic and slightly more efficient: when
building tokens (variable tokens) use [bos, *gpu.tokenizer.encode(inp.prompt)]
if inp.prompt else [bos], referencing tokens, bos, gpu.tokenizer.encode, and
inp.prompt in the update.
- safe() no longer leaks tracebacks to callers; logs server-side,
returns only {"error": "..."} (was: Major, L58)
- GPUState.snapshot() for thread-safe reads; all handlers use locals
from snapshot instead of reading gpu.* fields directly (was: Critical, L239)
- Training subprocess runs in asyncio.to_thread via split into
_run_subprocess_blocking + _run_training (was: Major, L568)
- tools.execute description changed from "sandboxed" to "in-process,
not sandboxed" since exec() with __builtins__ is not a real sandbox
(was: Major, L414)
- f-string without placeholders already fixed in previous commit (L657)
- registry device config: changed "auto" to null (matches worker's autodetect behavior when device is omitted)
- checkpoint listing: catch ValueError on malformed filenames instead of crashing (e.g. model_backup.pt no longer breaks the parser)
- exec() sandboxing: acknowledged as known limitation, documented
Actionable comments posted: 1
♻️ Duplicate comments (3)
nanochat/worker.py (3)
357-358: ⚠️ Potential issue | 🟠 Major: Handle "auto" explicitly when resolving device.
Line 357 treats "auto" as a valid truthy device and skips autodetection. Normalize and resolve "auto" (case-insensitive) to autodetect_device_type().
🔧 Suggested fix
- device = inp.device or autodetect_device_type()
+ device = (
+     inp.device
+     if inp.device and inp.device.strip().lower() != "auto"
+     else autodetect_device_type()
+ )
Verify each finding against the current code and only fix it if needed. In `@nanochat/worker.py` around lines 357 - 358, The device resolution treats a string "auto" as truthy and skips autodetection; update the logic around device = inp.device or autodetect_device_type() to detect when inp.device (case-insensitive) equals "auto" and in that case call autodetect_device_type(); ensure you normalize inp.device (e.g., lower()) before comparison and then pass the resolved device into gpu.load(inp.source, device, model_tag=inp.model_tag, step=inp.step) so gpu.load never receives the literal "auto".
434-442: ⚠️ Potential issue | 🟠 Major: Enforce `timeout` in `fn_tools_execute` to avoid unbounded handler blocking.
`ExecuteCodeInput.timeout` is currently unused, so user code can run indefinitely and stall this handler path.
🔧 Suggested mitigation
+import asyncio
@@
 async def fn_tools_execute(data: ExecuteCodeInput) -> dict:
     inp = ExecuteCodeInput.model_validate(data) if isinstance(data, dict) else data
     stdout_buf, stderr_buf = io.StringIO(), io.StringIO()
+    def _run_exec():
+        with contextlib.redirect_stdout(stdout_buf), contextlib.redirect_stderr(stderr_buf):
+            exec(inp.code, {"__builtins__": __builtins__}, {})
     try:
-        with contextlib.redirect_stdout(stdout_buf), contextlib.redirect_stderr(stderr_buf):
-            exec(inp.code, {"__builtins__": __builtins__}, {})
+        await asyncio.wait_for(asyncio.to_thread(_run_exec), timeout=inp.timeout)
         return {"success": True, "stdout": stdout_buf.getvalue(), "stderr": stderr_buf.getvalue(), "error": None}
+    except asyncio.TimeoutError:
+        return {"success": False, "stdout": stdout_buf.getvalue(), "stderr": stderr_buf.getvalue(), "error": "execution timed out"}
     except Exception as e:
         return {"success": False, "stdout": stdout_buf.getvalue(), "stderr": stderr_buf.getvalue(), "error": str(e)}

#!/bin/bash
# Verify timeout field definition and whether execution path enforces it.
rg -n --type=py 'class ExecuteCodeInput|timeout|fn_tools_execute|asyncio\.wait_for|to_thread' nanochat/worker.py -C2

Based on learnings: in-process `exec()` is an accepted trade-off here due to iii-sdk WebSocket/event-loop corruption with forked execution models; this comment is only about missing timeout enforcement.
Verify each finding against the current code and only fix it if needed. In `@nanochat/worker.py` around lines 434 - 442, fn_tools_execute currently ignores ExecuteCodeInput.timeout allowing user code to run indefinitely; enforce the timeout by running the exec body in a cancellable task or thread and wrapping it with asyncio.wait_for using inp.timeout (fallback to a sensible default if None). Specifically, update fn_tools_execute to validate inp.timeout from ExecuteCodeInput, run the exec call (the block that captures stdout_buf/stderr_buf and calls exec(inp.code,...)) inside asyncio.to_thread or an async wrapper, and await that wrapped call inside asyncio.wait_for(timeout=inp.timeout). Ensure exceptions from timeout are handled to return {"success": False, "stdout": ..., "stderr": ..., "error": "timeout"} and preserve existing stdout/stderr capture and other exception handling for non-timeout errors.
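Putting the suggested mitigation together, a standalone sketch (not the worker's actual handler) that bounds the caller's wait with `asyncio.wait_for`; note the helper thread cannot be killed, so this caps how long the caller waits, not how long the code runs, and it is still not a sandbox:

```python
import asyncio
import contextlib
import io

async def execute_code(code: str, timeout: float = 5.0) -> dict:
    """Run Python code with captured output and a bounded wait.
    NOT a sandbox: the code has full in-process access."""
    stdout_buf, stderr_buf = io.StringIO(), io.StringIO()

    def _run():
        with contextlib.redirect_stdout(stdout_buf), contextlib.redirect_stderr(stderr_buf):
            exec(code, {"__builtins__": __builtins__}, {})

    try:
        # to_thread cannot cancel the thread; wait_for only stops waiting.
        await asyncio.wait_for(asyncio.to_thread(_run), timeout=timeout)
        return {"success": True, "stdout": stdout_buf.getvalue(),
                "stderr": stderr_buf.getvalue(), "error": None}
    except asyncio.TimeoutError:
        return {"success": False, "stdout": stdout_buf.getvalue(),
                "stderr": stderr_buf.getvalue(), "error": "execution timed out"}
    except Exception as e:
        return {"success": False, "stdout": stdout_buf.getvalue(),
                "stderr": stderr_buf.getvalue(), "error": str(e)}
```

A hard kill would require a separate process, which is exactly the fork-safety trade-off discussed above.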
834-837: ⚠️ Potential issue | 🟡 Minor: Make checkpoint step parsing defensive against non-conforming filenames.
Current split/index parsing can raise on unexpected files in checkpoint directories and fail listing.
🔧 Suggested fix
+import re
-steps = sorted([
-    int(f.split("_")[1].split(".")[0])
-    for f in os.listdir(tag_path) if f.startswith("model_") and f.endswith(".pt")
-])
+steps = []
+for f in os.listdir(tag_path):
+    m = re.match(r"^model_(\d+)\.pt$", f)
+    if m:
+        steps.append(int(m.group(1)))
+steps.sort()
Verify each finding against the current code and only fix it if needed. In `@nanochat/worker.py` around lines 834 - 837, The checkpoint step parsing in the steps list comprehension can crash on non-conforming filenames; change the logic that builds steps (currently iterating os.listdir(tag_path) and parsing with split) to be defensive: iterate filenames from os.listdir(tag_path), match each against a regex like r"^model_(\d+)\.pt$" (or try/except around int extraction), only convert the captured group to int for matches, and skip any non-matching files so that the resulting steps list contains only valid integers before sorting.
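As a standalone sketch of the defensive parsing (the helper name is illustrative):

```python
import os
import re

_CKPT = re.compile(r"^model_(\d+)\.pt$")

def list_checkpoint_steps(tag_path: str) -> list[int]:
    """Collect step numbers from model_STEP.pt filenames, silently
    skipping anything malformed (e.g. model_backup.pt) instead of raising."""
    steps = []
    for name in os.listdir(tag_path):
        m = _CKPT.match(name)
        if m:
            steps.append(int(m.group(1)))
    return sorted(steps)
```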
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nanochat/worker.py`:
- Around line 241-245: The snapshot method in GPUState only returns (model,
tokenizer, engine, meta, source, device) but misses model_tag and callers still
read gpu.* directly; update snapshot(self) to include self.model_tag in its
returned tuple (e.g., (model, tokenizer, engine, meta, source, device,
model_tag)) and then change all downstream usages (places that currently read
gpu.model_tag or other gpu.* fields directly—e.g., the call sites around the
former Line 374, Lines ~801-815, and Lines ~847-850) to use the single atomic
GPUState.snapshot() return instead of mixing snapshot() with direct attribute
reads so metadata cannot become inconsistent during concurrent
nanochat.model.load operations.
---
Duplicate comments:
In `@nanochat/worker.py`:
- Around line 357-358: The device resolution treats a string "auto" as truthy
and skips autodetection; update the logic around device = inp.device or
autodetect_device_type() to detect when inp.device (case-insensitive) equals
"auto" and in that case call autodetect_device_type(); ensure you normalize
inp.device (e.g., lower()) before comparison and then pass the resolved device
into gpu.load(inp.source, device, model_tag=inp.model_tag, step=inp.step) so
gpu.load never receives the literal "auto".
- Around line 434-442: fn_tools_execute currently ignores
ExecuteCodeInput.timeout allowing user code to run indefinitely; enforce the
timeout by running the exec body in a cancellable task or thread and wrapping it
with asyncio.wait_for using inp.timeout (fallback to a sensible default if
None). Specifically, update fn_tools_execute to validate inp.timeout from
ExecuteCodeInput, run the exec call (the block that captures
stdout_buf/stderr_buf and calls exec(inp.code,...)) inside asyncio.to_thread or
an async wrapper, and await that wrapped call inside
asyncio.wait_for(timeout=inp.timeout). Ensure exceptions from timeout are
handled to return {"success": False, "stdout": ..., "stderr": ..., "error":
"timeout"} and preserve existing stdout/stderr capture and other exception
handling for non-timeout errors.
- Around line 834-837: The checkpoint step parsing in the steps list
comprehension can crash on non-conforming filenames; change the logic that
builds steps (currently iterating os.listdir(tag_path) and parsing with split)
to be defensive: iterate filenames from os.listdir(tag_path), match each against
a regex like r"^model_(\d+)\.pt$" (or try/except around int extraction), only
convert the captured group to int for matches, and skip any non-matching files
so that the resulting steps list contains only valid integers before sorting.
worker.py:
- snapshot() now returns model_tag (7-tuple), all callers updated
- zero direct gpu.* reads outside snapshot/load/main
- tokenizer handlers use snapshot for thread-safe tokenizer access
README.md:
- pip install -r pyproject.toml -> cd nanochat-upstream && pip install -e .
- tools.execute description: "sandboxed" -> "in-process, not sandboxed"
- add language label to test output code fence
- fix missing spaces after colons throughout
pyproject.toml:
- add [build-system] section (PEP 517/518)
Training:
- Pre-forked child process (fork before iii connects) runs Popen
safely without corrupting the WebSocket. Uses multiprocessing
with explicit fork context.
- Training handlers send jobs via Pipe, child runs nanochat scripts,
results come back with stdout lines for metric parsing.
Bug fixes:
- model.load: pass torch.device (not string), use phase="eval"
- chat.complete: conversation format is {"messages": [...]} not [...]
(nanochat's render_conversation expects this)
- model.sample: generate_batch takes tokens directly, not [tokens]
- safe() handles both sync and async handlers
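The chat.complete fix above hinges on the conversation shape. As an illustration (the message content is made up, and the role/content keys are assumed from common chat formats), the dict passed to nanochat's render_conversation would look like:

```python
# Wrapped form: {"messages": [...]}, not a bare list of messages.
conversation = {
    "messages": [
        {"role": "user", "content": "What is 7*6?"},
        {"role": "assistant", "content": "42"},
    ]
}
```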
E2E test results (2-layer GPT, 5 steps, CPU):
- Load model -> 1,966,134 params, 2 layers, 128 dim
- Sample -> generates text (gibberish from minimal training)
- Chat -> completion with session tracking in iii state
- History -> 1 session stored
- Tokenizer -> encode/decode roundtrip
- Tools -> code execution (7*6=42)
- Status -> full model config
- Health -> worker alive through all operations
Summary
Python worker that brings Karpathy's nanochat onto the III engine. 20 functions covering the full LLM pipeline, tested end-to-end: train a model, load it, generate text, chat, persist conversations, evaluate benchmarks, manage checkpoints.
Training runs the actual nanochat scripts via a pre-forked subprocess launcher (fork before iii connects, so the WebSocket stays intact). Inference, eval, and tokenization run in-process. All state through iii primitives. All handlers typed with Pydantic for auto schema extraction.
nanochat source included as a git submodule at nanochat-upstream/.
What's included
- nanochat/worker.py - 1,016 lines, 20 functions, 20 HTTP triggers, 5 state scopes
- nanochat/pyproject.toml - Dependencies with [build-system]
- nanochat/README.md - Full documentation with E2E test results and architecture
- nanochat/nanochat-upstream - Git submodule pointing to karpathy/nanochat
- registry/index.json - Updated with nanochat entry
- image-resize/src/manifest.rs - Fixed pre-existing test failure
Functions
E2E test results
Trained a 2-layer 1.9M param GPT on CPU (5 steps), loaded it through the worker, ran inference:
Architecture decisions
Pre-forked subprocess launcher. fork() from inside iii-sdk handlers corrupts the WebSocket on macOS. The worker forks a child process before connecting to iii. Training handlers send jobs to the child via a Pipe. The child runs Popen safely. Results come back with stdout lines that get parsed for metrics and pushed to iii state.
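The fork-before-connect pattern can be sketched as follows. This is a minimal illustration, not the worker's code: `run_jobs`, `start_launcher`, and the job/result dict shapes are hypothetical, and the real child runs nanochat training scripts rather than arbitrary commands.

```python
import multiprocessing as mp
import subprocess

def run_jobs(conn) -> None:
    # Child loop: receive a job over the Pipe, run it with Popen-style
    # subprocess machinery, send captured stdout lines back for parsing.
    while True:
        job = conn.recv()
        if job is None:  # shutdown sentinel
            break
        proc = subprocess.run(job["cmd"], capture_output=True, text=True)
        conn.send({"returncode": proc.returncode,
                   "stdout_lines": proc.stdout.splitlines()})

def start_launcher():
    ctx = mp.get_context("fork")  # explicit fork context, per the description
    parent_conn, child_conn = ctx.Pipe()
    ctx.Process(target=run_jobs, args=(child_conn,), daemon=True).start()
    # The fork happens here, BEFORE any WebSocket connection is opened,
    # so the child never inherits live socket state.
    return parent_conn
```

Because the child exists before the WebSocket is established, nothing it inherits can corrupt the connection; the parent only ever talks to it through the Pipe.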
Async handlers with trigger_async. All handlers that read/write iii state use async def and await trigger_async. GPU operations run synchronously within the async handler.
GPUState.snapshot(). All handlers copy model/tokenizer/engine/meta under lock before using them. Prevents races during concurrent model.load calls.
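The snapshot-under-lock idea reduces to copying every reference atomically. A simplified sketch (field names mirror the description, but the real class returns a 7-tuple including model_tag and holds actual torch objects):

```python
import threading
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class GPUState:
    _lock: threading.Lock = field(default_factory=threading.Lock)
    model: Optional[Any] = None
    tokenizer: Optional[Any] = None
    engine: Optional[Any] = None
    meta: Optional[dict] = None

    def snapshot(self):
        # Copy all references under one lock so a handler can never see
        # a model from one checkpoint paired with the tokenizer of another
        # during a concurrent model.load.
        with self._lock:
            return (self.model, self.tokenizer, self.engine, self.meta)

    def load(self, model, tokenizer, engine, meta):
        with self._lock:
            self.model, self.tokenizer = model, tokenizer
            self.engine, self.meta = engine, meta
```

Handlers then work only with the tuple they snapshotted, never with the live attributes.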
safe() wrapper. Every handler is wrapped to catch exceptions, log server-side, and return clean error dicts. Prevents WebSocket crashes from unhandled exceptions.
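A wrapper with that contract, covering both sync and async handlers (see the bug-fix note above), might look like this sketch; the error-dict shape is illustrative:

```python
import asyncio
import functools
import inspect
import logging

logger = logging.getLogger("worker")

def safe(handler):
    # Async handlers get an async wrapper so exceptions raised after an
    # await are still caught; sync handlers get a plain wrapper.
    if inspect.iscoroutinefunction(handler):
        @functools.wraps(handler)
        async def async_wrapped(*args, **kwargs):
            try:
                return await handler(*args, **kwargs)
            except Exception:
                logger.exception("handler %s failed", handler.__name__)
                return {"success": False, "error": f"{handler.__name__} failed"}
        return async_wrapped

    @functools.wraps(handler)
    def wrapped(*args, **kwargs):
        try:
            return handler(*args, **kwargs)
        except Exception:
            logger.exception("handler %s failed", handler.__name__)
            return {"success": False, "error": f"{handler.__name__} failed"}
    return wrapped
```

The exception is logged server-side with a full traceback, while the caller only ever receives a clean dict, so the WebSocket message loop never sees an unhandled exception.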
SDK v0.10.0 patterns. Pydantic type hints for auto schema extraction. TelemetryOptions. Service hierarchy. No sleep between registrations.