
feat: nanochat worker — Karpathy's LLM pipeline on iii-engine#3

Open
rohitg00 wants to merge 12 commits into main from
feat/nanochat-worker

Conversation


@rohitg00 rohitg00 commented Mar 29, 2026

Summary

A Python worker that brings Karpathy's nanochat onto the iii engine: 20 functions covering the full LLM pipeline, tested end-to-end. You can train a model, load it, generate text, chat, persist conversations, evaluate benchmarks, and manage checkpoints.

Training runs the actual nanochat scripts via a pre-forked subprocess launcher (the fork happens before iii connects, so the WebSocket stays intact). Inference, eval, and tokenization run in-process. All state goes through iii primitives, and all handlers are typed with Pydantic for automatic schema extraction.

nanochat source included as a git submodule at nanochat-upstream/.

What's included

  • nanochat/worker.py - 1,016 lines, 20 functions, 20 HTTP triggers, 5 state scopes
  • nanochat/pyproject.toml - Dependencies with [build-system]
  • nanochat/README.md - Full documentation with E2E test results and architecture
  • nanochat/nanochat-upstream - Git submodule pointing to karpathy/nanochat
  • registry/index.json - Updated with nanochat entry
  • image-resize/src/manifest.rs - Fixed pre-existing test failure

Functions

Function What it does
nanochat.chat.complete Chat completion with session persistence
nanochat.chat.stream Token-by-token generation
nanochat.chat.history Conversation history from iii state
nanochat.model.load Load checkpoint into memory
nanochat.model.status Model config and parameter count
nanochat.model.sample Raw text generation
nanochat.tokenizer.encode Text to BPE tokens
nanochat.tokenizer.decode Tokens to text
nanochat.tools.execute Python code execution
nanochat.train.tokenizer Train BPE tokenizer (runs scripts/tok_train.py)
nanochat.train.base Pretrain GPT from scratch (runs scripts/base_train.py)
nanochat.train.sft SFT with full task mixture (runs scripts/chat_sft.py)
nanochat.train.rl GRPO on GSM8K (runs scripts/chat_rl.py)
nanochat.train.status Training progress from iii state
nanochat.eval.core CORE benchmark (calls evaluate_core)
nanochat.eval.loss Bits-per-byte (calls evaluate_bpb)
nanochat.eval.chat ChatCORE eval (calls run_chat_eval)
nanochat.checkpoint.save Save model to disk
nanochat.checkpoint.list List checkpoints
nanochat.health Worker health check

E2E test results

Trained a 2-layer 1.9M param GPT on CPU (5 steps), loaded it through the worker, ran inference:

1. Load model   -> 1,966,134 params, 2 layers, 128 dim
2. Sample       -> generates text
3. Chat         -> completion with session tracking (26 tokens)
4. History      -> 1 session stored in iii state
5. Tokenizer    -> encode/decode roundtrip
6. Tools        -> print(42) = 42
7. Model status -> full config visible
8. Health       -> worker alive after all operations

8/8 passed

Architecture decisions

Pre-forked subprocess launcher. fork() from inside iii-sdk handlers corrupts the WebSocket on macOS. The worker forks a child process before connecting to iii. Training handlers send jobs to the child via a Pipe. The child runs Popen safely. Results come back with stdout lines that get parsed for metrics and pushed to iii state.
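A minimal sketch of that pattern, assuming nothing about iii-sdk's API (the job format and the "metric:" marker convention are illustrative, not the worker's actual protocol): the launcher child is created before any network connection exists and receives jobs over a multiprocessing Pipe.

```python
import multiprocessing as mp
import subprocess
import sys

def launcher(conn):
    # Child process: runs one subprocess per job received over the pipe,
    # parses stdout for "metric:" lines, and sends results back.
    while True:
        job = conn.recv()
        if job is None:  # shutdown sentinel
            break
        proc = subprocess.run(
            [sys.executable, "-c", job["code"]],
            capture_output=True, text=True,
        )
        metrics = [ln for ln in proc.stdout.splitlines() if ln.startswith("metric:")]
        conn.send({"returncode": proc.returncode, "metrics": metrics})

# Use the fork start method explicitly: the point of the pattern is that the
# child is forked while the process has no live sockets to corrupt.
ctx = mp.get_context("fork")
parent_conn, child_conn = ctx.Pipe()
child = ctx.Process(target=launcher, args=(child_conn,), daemon=True)
child.start()  # the child exists BEFORE the worker would open its WebSocket

# ...the worker would connect to iii here; training handlers then do:
parent_conn.send({"code": "print('metric: loss=1.23')"})
result = parent_conn.recv()
parent_conn.send(None)
child.join()
```

The parent never calls Popen itself; it only talks to the pre-forked child over the pipe, which is what keeps the later WebSocket safe.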

Async handlers with trigger_async. All handlers that read/write iii state use async def and await trigger_async. GPU operations run synchronously within the async handler.

GPUState.snapshot(). All handlers copy model/tokenizer/engine/meta under lock before using them. Prevents races during concurrent model.load calls.
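The snapshot idea in plain Python (this GPUState is a simplified stand-in for the worker's class, with strings as placeholders for the model and tokenizer): copy every field under one lock acquisition, then work with the copies without holding the lock.

```python
import threading
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Snapshot:
    model: Optional[Any]
    tokenizer: Optional[Any]
    meta: dict

class GPUState:
    def __init__(self):
        self._lock = threading.Lock()
        self.model = None
        self.tokenizer = None
        self.meta = {}

    def load(self, model, tokenizer, meta):
        # model.load replaces all fields atomically.
        with self._lock:
            self.model, self.tokenizer, self.meta = model, tokenizer, dict(meta)

    def snapshot(self) -> Snapshot:
        # One lock acquisition copies everything; handlers then use the
        # snapshot freely while a concurrent load swaps the live fields.
        with self._lock:
            return Snapshot(self.model, self.tokenizer, dict(self.meta))

state = GPUState()
state.load("model-v1", "tok-v1", {"params": 1_966_134})
snap = state.snapshot()
state.load("model-v2", "tok-v2", {"params": 42})  # concurrent reload
assert snap.model == "model-v1"  # the earlier snapshot is unaffected
```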

safe() wrapper. Every handler is wrapped to catch exceptions, log server-side, and return clean error dicts. Prevents WebSocket crashes from unhandled exceptions.

SDK v0.10.0 patterns. Pydantic type hints for auto schema extraction. TelemetryOptions. Service hierarchy. No sleep between registrations.

Also included in this branch: proof, an iii worker that scans code changes and verifies them in a real Chromium browser using snapshot-driven accessibility testing.

25 registered functions:
- 14 browser tools (navigate, snapshot, click, type, screenshot,
  console logs, network requests, performance metrics, raw Playwright
  exec, assertions, CDP discovery, cookie injection)
- 11 pipeline functions (scan, coverage, execute, report, run,
  replay, flows, history, enqueue, cleanup)

8 HTTP endpoints for REST access.

Uses iii primitives throughout:
- All inter-function calls via iii.trigger()
- State for reports and saved flows
- Streams for real-time progress
- Queue + DLQ for CI runs with auto-retry
- Logger with OTel tracing

Default mode: Claude Code or Codex as the agent (no API key).
Automated mode: Anthropic API for headless CI (needs ANTHROPIC_API_KEY).

1,506 lines across 8 TypeScript files.

coderabbitai bot commented Mar 29, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds two new workers: a Python nanochat worker (model loading, chat, tokenizer, eval, SFT training, checkpoints, health) and a TypeScript proof worker (Playwright browser automation, Anthropic agent orchestration, git scan/coverage, replay, and persistence).

Changes

Cohort / File(s) Summary
nanochat docs & packaging
nanochat/README.md, nanochat/pyproject.toml, .gitmodules, nanochat/nanochat-upstream
Adds README describing APIs/state scopes/triggers, pyproject with package metadata/deps/entrypoint, git submodule config and upstream pointer update.
nanochat worker implementation
nanochat/worker.py
New Python worker: GPUState, safe handler wrapper, model load/status/sample, chat (streaming & batch), session persistence, tokenizer encode/decode, sandbox exec, checkpoint save/list, eval (CORE/loss/chat), queued SFT training with state progress, health endpoint, and many Pydantic schemas.
registry
registry/index.json
Adds nanochat registry entry with metadata, default_config, language, supported targets, and version.
proof docs & packaging
proof/README.md, proof/package.json, proof/tsconfig.json
Adds proof README, npm package metadata (Playwright/iii-sdk deps, postinstall Chromium), scripts, and TypeScript config.
proof types & tooling
proof/src/types.ts, proof/src/tools.ts
Introduces TS types (RunReport, StepResult, BrowserSession, etc.), tool definitions, and Anthropic tool mapping.
proof git scan & coverage
proof/src/context.ts
Implements git diff scanning (unstaged/staged/branch/commit) and heuristic test-coverage analysis resolving imports to source files.
proof agent & prompt builder
proof/src/agent.ts, proof/src/prompt.ts
Adds Anthropic-driven agent loop (runAgent) that parses step markers, invokes tools via iii triggers, streams progress, and builds user prompts from diffs/files/coverage.
proof browser automation & cookies
proof/src/browser.ts, proof/src/cookies.ts
Playwright session management with ARIA snapshot ref mapping, navigate/click/type/select/press handlers, screenshot/console/network/perf capture, sandboxed exec, cookie extraction/injection for Chrome/Firefox, replay and cleanup.
proof orchestration worker
proof/src/worker.ts
New TS worker registering browser lifecycle/tools, scan/coverage/execute/run/replay/report/flows/history/cleanup handlers, queue enqueueing, single-active-run enforcement, state/stream persistence, and HTTP triggers.
proof build/test & misc
proof/src/..., proof/ files
New TypeScript source tree, build/test scripts (tsc, vitest), dev deps, and CI/usage notes.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant III as iii Engine
    participant Nano as nanochat Worker
    participant GPU as GPUState
    participant Model as nanochat.engine

    Client->>III: POST /nanochat/chat.complete
    III->>Nano: invoke chat complete handler
    Nano->>GPU: acquire lock / ensure model loaded
    alt model not loaded
        GPU->>Model: load_model(source, device)
        Model-->>GPU: model + tokenizer
    end
    Nano->>Model: generate / stream tokens
    Model-->>Nano: tokens / final output
    Nano->>III: state::set session history
    Nano-->>Client: ChatCompleteOutput
sequenceDiagram
    participant Client
    participant III as iii Engine
    participant Proof as proof Worker
    participant Browser as Playwright
    participant Agent as Anthropic
    participant Tools as iii Tools

    Client->>III: POST /proof/run
    III->>Proof: trigger proof::run
    Proof->>III: trigger proof::scan
    alt no changes
        Proof-->>Client: {status: "skip"}
    else changes found
        Proof->>III: trigger proof::coverage
        Proof->>III: trigger proof::execute
        III->>Proof: launch browser session
        Proof->>Browser: new BrowserSession
        Proof->>Agent: runAgent(...) via trigger callback
        loop iterations
            Agent->>Anthropic: messages.create (SYSTEM_PROMPT + user)
            alt tool_use found
                Agent->>III: trigger browser tool fn
                III->>Browser: perform action (click/navigate/...)
                Browser-->>III: snapshot/result
                III-->>Agent: tool_result
            end
            Agent->>III: stream::set progress
        end
        Agent-->>Proof: RunReport
        Proof->>III: state::set report / save flow
        Proof->>Browser: closeAll()
        Proof-->>Client: RunReport
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hopped through diffs and code tonight,

nanochat chats and models take flight,
proof clicks pages, snapshots gleam,
agents and browsers chase the dream,
two new workers, nimble and bright.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 6.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: nanochat worker — Karpathy's LLM pipeline on iii-engine' accurately and concisely summarizes the main change: adding a new nanochat worker that integrates Karpathy's LLM pipeline into the iii-engine.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

🧹 Nitpick comments (4)
nanochat/worker.py (3)

55-64: Intentional broad exception handling is acceptable here.

The safe() wrapper's purpose is to prevent WebSocket crashes by converting any unhandled exception into an error response. This is a valid pattern for worker handlers where stability is critical.

Consider using functools.wraps for cleaner attribute preservation:

♻️ Optional: Use functools.wraps
+import functools
+
 def safe(fn):
     """Wrap async handler so unhandled exceptions return error dicts, never crash the WebSocket."""
+    @functools.wraps(fn)
     async def wrapper(data):
         try:
             return await fn(data)
         except Exception as e:
             return {"error": str(e), "traceback": traceback.format_exc()}
-    wrapper.__name__ = fn.__name__
-    wrapper.__annotations__ = fn.__annotations__
     return wrapper
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nanochat/worker.py` around lines 55 - 64, The safe() wrapper currently copies
name and annotations manually; replace that with functools.wraps to preserve
metadata more robustly: import functools, then decorate wrapper with
`@functools.wraps`(fn) (keeping wrapper async and returning the same error dict
behavior), and remove the manual assignments to wrapper.__name__ and
wrapper.__annotations__; ensure traceback.format_exc() remains used so
error/traceback are still returned.

489-489: Prefix unused variable with underscore.

The meta variable from load_model() is unpacked but never used in this function.

♻️ Fix unused variable
-        model, tokenizer, meta = load_model(inp.source, device, "base", model_tag=inp.model_tag, step=inp.step)
+        model, tokenizer, _meta = load_model(inp.source, device, "base", model_tag=inp.model_tag, step=inp.step)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nanochat/worker.py` at line 489, The unpacked return from load_model(...)
currently assigns to model, tokenizer, meta but meta is unused; change the
unpacking to model, tokenizer, _ = load_model(inp.source, device, "base",
model_tag=inp.model_tag, step=inp.step) (or model, tokenizer, _meta = ...) to
prefix the unused third value with an underscore so linters know it is
intentionally unused and no other code paths are affected.

430-432: Specify encoding when opening files.

For cross-platform consistency, explicitly specify UTF-8 encoding.

♻️ Add encoding
-    with open(tasks_yaml) as f:
+    with open(tasks_yaml, encoding="utf-8") as f:
         tasks = yaml.safe_load(f)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nanochat/worker.py` around lines 430 - 432, When opening the tasks_yaml file
before calling yaml.safe_load, specify the file encoding for cross-platform
consistency by passing UTF-8 to the open call (i.e. update the open(tasks_yaml)
usage so the file is opened with encoding="utf-8"); keep the same with-context
and yaml.safe_load call and ensure variable names tasks and tasks_yaml remain
unchanged.
proof/README.md (1)

113-127: Add language specifier to code blocks.

Lines 113 and 230 have fenced code blocks without language specifiers, which triggers MD040 lint warnings. For ASCII diagrams, use text or plaintext.

♻️ Fix markdown lint warnings
-```
+```text
 proof::scan          git diff → changed files, commits

And similarly for line 230:

-```
+```text
 ┌──────────────────────────────────────────┐
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proof/README.md` around lines 113 - 127, Update the two fenced ASCII diagram
blocks in README.md (the block that begins with "proof::scan          git diff →
changed files, commits" and the block that starts with the box drawing
"┌──────────────────────────────────────────┐") to include a language specifier
by adding "text" to the opening fence so the blocks become fenced as text code
blocks; this will silence the MD040 lint warnings while preserving the diagrams.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nanochat/pyproject.toml`:
- Around line 1-22: Create a package initializer file nanochat/__init__.py (e.g.
define __version__ = "0.1.0" or leave empty) so the nanochat module is a proper
package, and update the console script entry point in pyproject.toml from
"worker:main" to "nanochat.worker:main" to reference the worker module inside
the package; also add a PEP 517/518 [build-system] section to pyproject.toml
(include a minimal requires list such as setuptools and wheel and set
build-backend to setuptools.build_meta) so builds comply with build-system
metadata requirements.

In `@nanochat/README.md`:
- Around line 161-174: The fenced code block in README.md is missing a language
tag which causes markdownlint and syntax highlighting to fail; update the
opening fence from ``` to something like ```text or ```console (or ```log) so
the block is labeled (e.g., change the block showing nanochat
health/tokenizer/chat outputs to ```text) and save the file.
- Around line 78-80: Update the README to remove or reword the claim that
nanochat.tools.execute is "sandboxed" and instead explicitly state that the
worker runs arbitrary Python code using in-process exec() (referencing
nanochat.tools.execute and the worker implementation), note this does NOT
provide strong isolation, and add a clear operator warning recommending running
the worker in a hardened/isolated environment (e.g., separate VM/container,
unprivileged user, syscall filters) or replacing exec() with a true sandboxed
executor before exposing to untrusted inputs; apply this change to the
high-level description and the later implementation note that currently mentions
exec().

In `@nanochat/worker.py`:
- Line 657: The print statement using an unnecessary f-string prefix
(print(f"[nanochat] 13 functions, 13 triggers (12 HTTP + 1 queue)")) should be
changed to a regular string literal; locate the print call in worker.py (the
line that prints "[nanochat] 13 functions, 13 triggers (12 HTTP + 1 queue)") and
remove the leading f so it becomes print("[nanochat] 13 functions, 13 triggers
(12 HTTP + 1 queue)").

In `@proof/src/agent.ts`:
- Around line 35-36: The runStatus should not default to "pass" — change the
initialization of runStatus to a non-passing default (e.g., "error") and ensure
runStatus is only set to "pass" when a RUN_COMPLETED marker is observed;
explicitly set runStatus to "fail" when processing ASSERTION_FAILED or any tool
failure markers; update any code that synthesizes a fallback step (and usages
around recordedActions) to rely on the adjusted runStatus so a missing
RUN_COMPLETED does not yield a false "pass" (check all handling around
runStatus, RUN_COMPLETED, ASSERTION_FAILED, and recordedActions including the
other occurrences noted).

In `@proof/src/browser.ts`:
- Around line 129-149: The refs collapse when multiple elements share the same
role+name because refMap only stores {role,name} and resolveRef always uses
.first(); update the ref population (where refMap.set is called) to also capture
the element's ordinal (e.g., index among page.getByRole(role, { name })) or a
stable locator snapshot and store it in RefEntry (e.g., add an index/ordinal
property), and then change resolveRef to locate all matches and select the
stored ordinal (use locator.nth(entry.index) or equivalent) instead of .first();
keep a safe fallback if index is missing to preserve current behavior.
- Around line 237-249: The current handlePlaywrightExec uses AsyncFunction and
executes untrusted code in the worker process (see AsyncFunction and fn), so
change it to run the provided code inside the browser page context via
Playwright's page.evaluate to sandbox execution; serialize session.refMap
entries and pass them into page.evaluate, recreate the ref(id) helper inside the
browser context (using the serialized role/name entries) and then create and
invoke the async function from the code string inside page.evaluate (so no
host-side AsyncFunction or access to process/dynamic import), keeping function
name handlePlaywrightExec and ref as the identifying symbols to locate and
replace the implementation.

In `@proof/src/cookies.ts`:
- Around line 65-69: The SQLite SELECT that builds cookies (currently using
WHERE host_key LIKE '%${domain...}') is too permissive and should instead match
either the exact host or any subdomain (e.g., host_key = 'example.com' OR
host_key LIKE '%.example.com'); update the query in cookies.ts where
execFileAsync is called (the SELECT name, value, host_key as domain, ...) to use
host_key = '<domain>' OR host_key LIKE '.<domain>' pattern (escaping domain
properly), and apply the identical predicate change to the corresponding Firefox
query block referenced around lines 130-134 so both Chromium and Firefox imports
only match exact hosts or their subdomains.
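The tightened predicate can be sanity-checked with a small sqlite3 sketch (an illustrative stand-in, not the cookies.ts code; the host_key column mirrors Chromium's cookie store and the rows are made up):

```python
import sqlite3

# Toy cookie store with one row per matching case.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cookies (name TEXT, value TEXT, host_key TEXT)")
db.executemany(
    "INSERT INTO cookies VALUES (?, ?, ?)",
    [
        ("sid", "a", "example.com"),      # exact host    -> should match
        ("sid", "b", ".example.com"),     # domain cookie -> should match
        ("sid", "c", "app.example.com"),  # subdomain     -> should match
        ("sid", "d", "notexample.com"),   # unrelated host the old '%example.com' pattern wrongly matched
    ],
)

domain = "example.com"
# Parameterized query also removes the string-interpolation risk of the
# original; LIKE wildcards inside `domain` would still need escaping for
# full robustness.
rows = db.execute(
    "SELECT value FROM cookies WHERE host_key = ? OR host_key LIKE ?",
    (domain, f"%.{domain}"),
).fetchall()
print(rows)
```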

In `@proof/src/worker.ts`:
- Around line 125-133: The handlers registered for "proof::scan" and
"proof::coverage" forward an untrusted input.cwd into
scanChanges/analyzeTestCoverage which is later passed to simpleGit, recursive
walks and file reads; change these handlers to ignore or validate input.cwd and
always resolve to a server-controlled safe root (e.g., repository root or a
configured workspace base) before calling scanChanges/analyzeTestCoverage.
Specifically, in the registerFunction blocks for id "proof::scan" and
"proof::coverage" (and the other similar blocks at the noted locations), replace
use of input.cwd with a sanitized value: compute path.resolve(baseRoot,
input.cwd ?? ".") and then assert the resolved path startsWith(baseRoot) (or
simply use baseRoot/untrusted cwd removed), and pass that sanitizedRoot to
scanChanges/analyzeTestCoverage; also update context.ts usage around
simpleGit/file reads to accept only validated roots.
- Around line 39-58: acquireRun() and releaseRun() must be exception-safe: wrap
the launch/interaction sequence in a try/finally so releaseRun() always runs
even if launchBrowser, cookie injection, autoDiscoverCdp(), or any subsequent
step throws; move cookie injection and any other pre-launch work inside the
protected block after acquireRun() and call releaseRun() from the finally; do
the same for the close path so closeBrowser(input.runId) is invoked inside try
and releaseRun() is called in finally to avoid leaving activeRunId set; adjust
the implementations around launchBrowser, closeBrowser, acquireRun, releaseRun
and any cookie injection logic accordingly.

---

Nitpick comments:
In `@nanochat/worker.py`:
- Around line 55-64: The safe() wrapper currently copies name and annotations
manually; replace that with functools.wraps to preserve metadata more robustly:
import functools, then decorate wrapper with `@functools.wraps`(fn) (keeping
wrapper async and returning the same error dict behavior), and remove the manual
assignments to wrapper.__name__ and wrapper.__annotations__; ensure
traceback.format_exc() remains used so error/traceback are still returned.
- Line 489: The unpacked return from load_model(...) currently assigns to model,
tokenizer, meta but meta is unused; change the unpacking to model, tokenizer, _
= load_model(inp.source, device, "base", model_tag=inp.model_tag, step=inp.step)
(or model, tokenizer, _meta = ...) to prefix the unused third value with an
underscore so linters know it is intentionally unused and no other code paths
are affected.
- Around line 430-432: When opening the tasks_yaml file before calling
yaml.safe_load, specify the file encoding for cross-platform consistency by
passing UTF-8 to the open call (i.e. update the open(tasks_yaml) usage so the
file is opened with encoding="utf-8"); keep the same with-context and
yaml.safe_load call and ensure variable names tasks and tasks_yaml remain
unchanged.

In `@proof/README.md`:
- Around line 113-127: Update the two fenced ASCII diagram blocks in README.md
(the block that begins with "proof::scan          git diff → changed files,
commits" and the block that starts with the box drawing
"┌──────────────────────────────────────────┐") to include a language specifier
by adding "text" to the opening fence so the blocks become fenced as text code
blocks; this will silence the MD040 lint warnings while preserving the diagrams.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4b6e37ad-d249-4952-ad2a-fb3e7bd79fe3

📥 Commits

Reviewing files that changed from the base of the PR and between a89c809 and 0213bb2.

📒 Files selected for processing (15)
  • nanochat/README.md
  • nanochat/pyproject.toml
  • nanochat/worker.py
  • proof/README.md
  • proof/package.json
  • proof/src/agent.ts
  • proof/src/browser.ts
  • proof/src/context.ts
  • proof/src/cookies.ts
  • proof/src/prompt.ts
  • proof/src/tools.ts
  • proof/src/types.ts
  • proof/src/worker.ts
  • proof/tsconfig.json
  • registry/index.json

Comment on lines +1 to +22
[project]
name = "iii-nanochat"
version = "0.1.0"
description = "nanochat LLM worker for iii-engine — train, fine-tune, evaluate, and chat with GPT models"
license = "Apache-2.0"
requires-python = ">=3.10"
dependencies = [
"iii-sdk>=0.10.0",
"torch>=2.0",
"pydantic>=2.0",
"tiktoken",
"tokenizers",
"datasets",
"pyarrow",
"psutil",
]

[project.optional-dependencies]
train = ["wandb"]

[project.scripts]
iii-nanochat = "worker:main"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if worker.py is inside a nanochat package or at root level
fd -t f 'worker.py' | head -5

# Check for __init__.py to verify package structure
fd -t f '__init__.py' | head -10

Repository: iii-hq/workers

Length of output: 76


🏁 Script executed:

#!/bin/bash
# Check if nanochat/__init__.py exists
ls -la nanochat/__init__.py 2>/dev/null && echo "Found" || echo "Not found"

# List the contents of nanochat directory
ls -la nanochat/ | head -20

# Double-check the exact path of worker.py
find nanochat -name "worker.py" -type f

Repository: iii-hq/workers

Length of output: 385


Create nanochat/__init__.py and fix console script entry point.

The nanochat package is missing __init__.py, which means nanochat is not a proper Python package. This will cause the console script entry point "worker:main" to fail at runtime. Create an __init__.py file in the nanochat directory and update the entry point to "nanochat.worker:main".

Additionally, add the missing [build-system] section for PEP 517/518 compliance:

🔧 Required fixes

Create nanochat/__init__.py (can be empty or with version info):

# nanochat/__init__.py
__version__ = "0.1.0"

In pyproject.toml:

+[build-system]
+requires = ["setuptools>=61.0"]
+build-backend = "setuptools.build_meta"
+
 [project]

Update the console script entry point:

 [project.scripts]
-iii-nanochat = "worker:main"
+iii-nanochat = "nanochat.worker:main"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

[project]
name = "iii-nanochat"
version = "0.1.0"
description = "nanochat LLM worker for iii-engine — train, fine-tune, evaluate, and chat with GPT models"
license = "Apache-2.0"
requires-python = ">=3.10"
dependencies = [
"iii-sdk>=0.10.0",
"torch>=2.0",
"pydantic>=2.0",
"tiktoken",
"tokenizers",
"datasets",
"pyarrow",
"psutil",
]

[project.optional-dependencies]
train = ["wandb"]

[project.scripts]
iii-nanochat = "worker:main"

After:

[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "iii-nanochat"
version = "0.1.0"
description = "nanochat LLM worker for iii-engine — train, fine-tune, evaluate, and chat with GPT models"
license = "Apache-2.0"
requires-python = ">=3.10"
dependencies = [
"iii-sdk>=0.10.0",
"torch>=2.0",
"pydantic>=2.0",
"tiktoken",
"tokenizers",
"datasets",
"pyarrow",
"psutil",
]

[project.optional-dependencies]
train = ["wandb"]

[project.scripts]
iii-nanochat = "nanochat.worker:main"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nanochat/pyproject.toml` around lines 1 - 22, Create a package initializer
file nanochat/__init__.py (e.g. define __version__ = "0.1.0" or leave empty) so
the nanochat module is a proper package, and update the console script entry
point in pyproject.toml from "worker:main" to "nanochat.worker:main" to
reference the worker module inside the package; also add a PEP 517/518
[build-system] section to pyproject.toml (include a minimal requires list such
as setuptools and wheel and set build-backend to setuptools.build_meta) so
builds comply with build-system metadata requirements.

Author

Fixed. Added [build-system] section with setuptools backend (PEP 517/518).

Comment on lines +129 to +149
if (isInteractive || isContent) {
refCounter++;
const ref = `e${refCounter}`;
outputLine += ` [ref=${ref}]`;
refMap.set(ref, { role, name: name ?? "" });
}

outputLines.push(outputLine);
}

return outputLines.join("\n");
}

export function resolveRef(
ref: string,
refMap: Map<string, RefEntry>,
page: Page,
) {
const entry = refMap.get(ref);
if (!entry) throw new Error(`Ref "${ref}" not found in current snapshot. Take a new snapshot.`);
return page.getByRole(entry.role as any, { name: entry.name }).first();

⚠️ Potential issue | 🟠 Major

Snapshot refs are not stable when labels repeat.

The ref map only stores { role, name }, and resolveRef() always calls .first(). On pages with repeated labels like two "Delete" buttons, distinct refs collapse onto the same element and the agent will click/type into the wrong target.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proof/src/browser.ts` around lines 129 - 149, The refs collapse when multiple
elements share the same role+name because refMap only stores {role,name} and
resolveRef always uses .first(); update the ref population (where refMap.set is
called) to also capture the element's ordinal (e.g., index among
page.getByRole(role, { name })) or a stable locator snapshot and store it in
RefEntry (e.g., add an index/ordinal property), and then change resolveRef to
locate all matches and select the stored ordinal (use locator.nth(entry.index)
or equivalent) instead of .first(); keep a safe fallback if index is missing to
preserve current behavior.

Author

Not part of the nanochat worker changes. This is pre-existing code in the proof worker.

Comment on lines +237 to +249
export async function handlePlaywrightExec(
code: string,
session: BrowserSession,
): Promise<unknown> {
const { page, context, browser } = session;
const ref = (id: string) => {
const entry = session.refMap.get(id);
if (!entry) throw new Error(`Ref "${id}" not found`);
return page.getByRole(entry.role as any, { name: entry.name }).first();
};
const AsyncFunction = Object.getPrototypeOf(async () => {}).constructor;
const fn = new AsyncFunction("page", "context", "browser", "ref", code);
return fn(page, context, browser, ref);

⚠️ Potential issue | 🔴 Critical

browser_exec is host-side code execution.

new AsyncFunction(...) runs arbitrary JavaScript inside the worker process, not inside an isolated page sandbox. Because proof/src/worker.ts, Lines 106-107 register this as proof::browser::exec, any caller that can reach the worker gets access to process, dynamic imports, and the local host.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proof/src/browser.ts` around lines 237 - 249, The current
handlePlaywrightExec uses AsyncFunction and executes untrusted code in the
worker process (see AsyncFunction and fn), so change it to run the provided code
inside the browser page context via Playwright's page.evaluate to sandbox
execution; serialize session.refMap entries and pass them into page.evaluate,
recreate the ref(id) helper inside the browser context (using the serialized
role/name entries) and then create and invoke the async function from the code
string inside page.evaluate (so no host-side AsyncFunction or access to
process/dynamic import), keeping function name handlePlaywrightExec and ref as
the identifying symbols to locate and replace the implementation.

Author

Not part of the nanochat worker changes. This is pre-existing code in the proof worker.

Comment on lines +65 to +69
const { stdout } = await execFileAsync("sqlite3", [
"-json",
cookieDbPath,
`SELECT name, value, host_key as domain, path, expires_utc, is_secure, is_httponly, samesite FROM cookies WHERE host_key LIKE '%${domain.replace(/'/g, "''")}'`,
]);
@coderabbitai coderabbitai bot Mar 29, 2026

⚠️ Potential issue | 🟠 Major

Use exact/subdomain matches when selecting cookies.

LIKE '%${domain}' also matches unrelated suffixes such as notexample.com when the target is example.com, so this can import third-party cookies into the Playwright context. Match the exact host plus .${domain} subdomains instead.

🔒 Suggested query shape
+    const escapedDomain = domain.replace(/'/g, "''");
     const { stdout } = await execFileAsync("sqlite3", [
       "-json",
       cookieDbPath,
-      `SELECT name, value, host_key as domain, path, expires_utc, is_secure, is_httponly, samesite FROM cookies WHERE host_key LIKE '%${domain.replace(/'/g, "''")}'`,
+      `SELECT name, value, host_key as domain, path, expires_utc, is_secure, is_httponly, samesite FROM cookies
+       WHERE host_key = '${escapedDomain}'
+          OR host_key LIKE '%.${escapedDomain}'`,
     ]);

Apply the same predicate to the Firefox query.

Also applies to: 130-134

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proof/src/cookies.ts` around lines 65 - 69, The SQLite SELECT that builds
cookies (currently using WHERE host_key LIKE '%${domain...}') is too permissive
and should instead match either the exact host or any subdomain (e.g., host_key
= 'example.com' OR host_key LIKE '%.example.com'); update the query in
cookies.ts where execFileAsync is called (the SELECT name, value, host_key as
domain, ...) to use host_key = '<domain>' OR host_key LIKE '.<domain>' pattern
(escaping domain properly), and apply the identical predicate change to the
corresponding Firefox query block referenced around lines 130-134 so both
Chromium and Firefox imports only match exact hosts or their subdomains.
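The safer predicate amounts to an exact-host-or-subdomain match. The check it encodes can be sketched host-side like this (`hostMatches` is an illustrative helper, not code from the PR):

```typescript
// Mirrors the suggested SQL predicate:
//   host_key = 'example.com' OR host_key LIKE '%.example.com'
// so "notexample.com" no longer slips through the old LIKE '%example.com'.
function hostMatches(hostKey: string, domain: string): boolean {
  // Chromium stores subdomain-scoped cookies with a leading dot, e.g. ".example.com".
  const host = hostKey.startsWith(".") ? hostKey.slice(1) : hostKey;
  return host === domain || host.endsWith("." + domain);
}
```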

Author

Not part of the nanochat worker changes. This is pre-existing code in the proof worker.


@rohitg00, understood — the concern is in the proof worker and outside the scope of this nanochat PR. Would you like me to open a follow-up GitHub issue to track the domain-matching fix in proof/src/cookies.ts (both the Chrome and Firefox query blocks) so it isn't lost?


🧠 Learnings used
Learnt from: rohitg00
Repo: iii-hq/workers PR: 3
File: nanochat/worker.py:0-0
Timestamp: 2026-03-30T09:59:20.981Z
Learning: In `nanochat/worker.py`, `fn_tools_execute` uses in-process `exec()` instead of `subprocess.run` or `multiprocessing.Process` because forking corrupts the asyncio event loop in the iii-sdk WebSocket connection. This is a known, accepted trade-off documented in the README under Known Issues. For production deployments with untrusted code, a separate sandboxed worker is required. The function description intentionally reads "Execute Python code" (not "sandboxed") to reflect the lack of real isolation.

Comment on lines +39 to +58
iii.registerFunction({ id: "proof::browser::launch" }, async (input) => {
const { runId, headed, cdp } = input;
acquireRun(runId);
let cdpUrl: string | undefined;
if (cdp === "auto") {
cdpUrl = (await autoDiscoverCdp()) ?? undefined;
} else if (cdp) {
cdpUrl = cdp;
}
await launchBrowser(runId, headed, cdpUrl);
logger.info("Browser launched", { runId, headed, cdp: cdpUrl ?? "none" });
return { runId, launched: true };
});

iii.registerFunction({ id: "proof::browser::close" }, async (input) => {
const result = await closeBrowser(input.runId);
releaseRun();
logger.info("Browser closed", { runId: input.runId });
return result;
});

⚠️ Potential issue | 🟠 Major

Make the run lock and browser cleanup exception-safe.

acquireRun() happens before launch succeeds, cookie injection happens before the try/finally, and releaseRun() runs only after closeBrowser() succeeds. A failure in any of those paths can leave activeRunId stuck and force manual proof::cleanup.

🧯 Suggested structure
 iii.registerFunction({ id: "proof::browser::launch" }, async (input) => {
   const { runId, headed, cdp } = input;
   acquireRun(runId);
-  let cdpUrl: string | undefined;
-  if (cdp === "auto") {
-    cdpUrl = (await autoDiscoverCdp()) ?? undefined;
-  } else if (cdp) {
-    cdpUrl = cdp;
-  }
-  await launchBrowser(runId, headed, cdpUrl);
-  logger.info("Browser launched", { runId, headed, cdp: cdpUrl ?? "none" });
-  return { runId, launched: true };
+  try {
+    let cdpUrl: string | undefined;
+    if (cdp === "auto") {
+      cdpUrl = (await autoDiscoverCdp()) ?? undefined;
+    } else if (cdp) {
+      cdpUrl = cdp;
+    }
+    await launchBrowser(runId, headed, cdpUrl);
+    logger.info("Browser launched", { runId, headed, cdp: cdpUrl ?? "none" });
+    return { runId, launched: true };
+  } catch (err) {
+    releaseRun();
+    throw err;
+  }
 });
 
 iii.registerFunction({ id: "proof::browser::close" }, async (input) => {
-  const result = await closeBrowser(input.runId);
-  releaseRun();
-  logger.info("Browser closed", { runId: input.runId });
-  return result;
+  try {
+    const result = await closeBrowser(input.runId);
+    logger.info("Browser closed", { runId: input.runId });
+    return result;
+  } finally {
+    releaseRun();
+  }
 });
 
 iii.registerFunction({ id: "proof::execute" }, async (input) => {
   const { diff, files, base_url, instruction, runId, headed, commits, coverage, cdp, cookies } = input;
-  await iii.trigger({
-    function_id: "proof::browser::launch",
-    payload: { runId, headed, cdp },
-  });
-
-  if (cookies) {
-    await iii.trigger({
-      function_id: "proof::cookies::inject",
-      payload: { url: base_url },
-    });
-  }
-
   try {
+    await iii.trigger({
+      function_id: "proof::browser::launch",
+      payload: { runId, headed, cdp },
+    });
+    if (cookies) {
+      await iii.trigger({
+        function_id: "proof::cookies::inject",
+        payload: { url: base_url },
+      });
+    }
     const trigger = iii.trigger.bind(iii);
     return await runAgent(trigger, diff, files, base_url, runId, instruction, commits, coverage);
   } finally {
     await iii.trigger({
       function_id: "proof::browser::close",

Also applies to: 135-159

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proof/src/worker.ts` around lines 39 - 58, acquireRun() and releaseRun() must
be exception-safe: wrap the launch/interaction sequence in a try/finally so
releaseRun() always runs even if launchBrowser, cookie injection,
autoDiscoverCdp(), or any subsequent step throws; move cookie injection and any
other pre-launch work inside the protected block after acquireRun() and call
releaseRun() from the finally; do the same for the close path so
closeBrowser(input.runId) is invoked inside try and releaseRun() is called in
finally to avoid leaving activeRunId set; adjust the implementations around
launchBrowser, closeBrowser, acquireRun, releaseRun and any cookie injection
logic accordingly.

Author

Not part of the nanochat worker changes. This is pre-existing code in the proof worker.

Comment on lines +125 to +133
iii.registerFunction({ id: "proof::scan" }, async (input) => {
logger.info("Scanning changes", { target: input.target ?? "unstaged" });
return scanChanges(input.target, input.cwd, input.main_branch, input.commit_hash);
});

iii.registerFunction({ id: "proof::coverage" }, async (input) => {
logger.info("Analyzing test coverage", { files: input.files?.length });
return analyzeTestCoverage(input.files ?? [], input.cwd);
});

⚠️ Potential issue | 🔴 Critical

Don't let HTTP callers choose an arbitrary filesystem root.

cwd is forwarded straight into scanChanges() and analyzeTestCoverage(). In proof/src/context.ts, Lines 20-21 and 88-101 then feed that into simpleGit(...), recursive directory walks, and file reads, so proof::run and proof::coverage can inspect any accessible checkout or directory on the host.

Also applies to: 211-229, 336-342

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proof/src/worker.ts` around lines 125 - 133, The handlers registered for
"proof::scan" and "proof::coverage" forward an untrusted input.cwd into
scanChanges/analyzeTestCoverage which is later passed to simpleGit, recursive
walks and file reads; change these handlers to ignore or validate input.cwd and
always resolve to a server-controlled safe root (e.g., repository root or a
configured workspace base) before calling scanChanges/analyzeTestCoverage.
Specifically, in the registerFunction blocks for id "proof::scan" and
"proof::coverage" (and the other similar blocks at the noted locations), replace
use of input.cwd with a sanitized value: compute path.resolve(baseRoot,
input.cwd ?? ".") and then assert the resolved path startsWith(baseRoot) (or
simply use baseRoot/untrusted cwd removed), and pass that sanitizedRoot to
scanChanges/analyzeTestCoverage; also update context.ts usage around
simpleGit/file reads to accept only validated roots.
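The containment check the prompt asks for can be sketched as follows (`resolveSafeCwd` and the `baseRoot` parameter are illustrative names, not the worker's actual API):

```typescript
import * as path from "path";

// Resolve a caller-supplied cwd against a server-controlled root and
// reject anything that escapes it (e.g. "../../etc").
function resolveSafeCwd(baseRoot: string, cwd?: string): string {
  const resolved = path.resolve(baseRoot, cwd ?? ".");
  if (resolved !== baseRoot && !resolved.startsWith(baseRoot + path.sep)) {
    throw new Error(`cwd escapes workspace root: ${cwd}`);
  }
  return resolved;
}
```

The `resolved !== baseRoot` special case matters because `startsWith(baseRoot + path.sep)` alone would reject the root itself.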

Author

Not part of the nanochat worker changes. This is pre-existing code in the proof worker.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
nanochat/README.md (2)

115-128: ⚠️ Potential issue | 🟡 Minor

Add a language tag to the fenced code block.

Line 115 should specify a language (for markdownlint MD040 and proper rendering).

Suggested doc fix

-```
+```text
 OK   nanochat.health              {"status": "ok", "model_loaded": false}
 ...
 10/10 responded, 0 crashes

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nanochat/README.md` around lines 115 - 128, Change the fenced code block
that begins with the line "OK   nanochat.health {"status": "ok",
"model_loaded": false}" so the opening backticks include a language tag (e.g.,
text or console) to satisfy markdownlint MD040 and ensure correct rendering;
update the opening fence only and leave the block content and closing fence
unchanged.

---

78-80: ⚠️ Potential issue | 🟠 Major

Avoid calling `tools.execute` "sandboxed" when it runs in-process `exec()`.

Lines 78-80 currently overstate isolation and can mislead operators. Align the wording with the actual risk model and add a hard warning about untrusted input exposure.

Suggested doc fix

-**nanochat.tools.execute**:`POST /nanochat/tools/execute`
-
-Executes arbitrary Python code in a sandboxed environment. Returns stdout, stderr, success status, and any errors. This mirrors nanochat's built-in tool use (calculator, code execution) that models learn during SFT training.
+**nanochat.tools.execute**: `POST /nanochat/tools/execute`
+
+Executes arbitrary Python code using in-process `exec()` with stdout/stderr capture. Returns stdout, stderr, success status, and any errors. This mirrors nanochat's built-in tool use (calculator, code execution) that models learn during SFT training.
+⚠️ This is **not** a strong isolation boundary. Run this worker in a hardened isolated environment (dedicated VM/container, unprivileged user, syscall/network restrictions) or replace with a true sandbox before exposing to untrusted inputs.


Also applies to: 138-139

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nanochat/README.md` around lines 78 - 80, Update the README text for
nanochat.tools.execute to remove the word "sandboxed" and clarify that execution
uses in-process Python exec(), which does not fully isolate processes or OS
resources; explicitly state the risks (full access to process memory, file
system, network, and ability to run arbitrary code) and add a prominent warning
that untrusted input must never be passed to nanochat.tools.execute. Locate and
update both occurrences referring to nanochat.tools.execute (the description
near "POST /nanochat/tools/execute" and the repeated mention around the later
lines) so the wording accurately reflects the risk model and includes the hard
warning for operators.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @nanochat/README.md:

  • Around line 48-49: Fix missing spaces after colons in README prose by
    replacing instances of "word:next" with "word: next"; specifically update the
    sentence containing "Every handler uses Pydantic type hints for automatic
    request/response schema extraction:the engine knows the exact input/output shape
    of every function" to insert a space after the colon, and likewise correct the
    similar colon-spacing issues referenced around the phrases on lines 56-57,
    113-114, and 130 (search for any colon immediately followed by a letter and add
    a space).

Duplicate comments:
In @nanochat/README.md:

  • Around line 115-128: Change the fenced code block that begins with the lines
    starting "OK nanochat.health {"status": "ok",
    "model_loaded": false}" so the opening backticks include a language tag (e.g.,
    text or console) to satisfy markdownlint MD040 and ensure correct
    rendering; update the opening fence only and leave the block content and closing
    fence unchanged.
  • Around line 78-80: Update the README text for nanochat.tools.execute to remove
    the word "sandboxed" and clarify that execution uses in-process Python exec(),
    which does not fully isolate processes or OS resources; explicitly state the
    risks (full access to process memory, file system, network, and ability to run
    arbitrary code) and add a prominent warning that untrusted input must never be
    passed to nanochat.tools.execute. Locate and update both occurrences referring
    to nanochat.tools.execute (the description near "POST /nanochat/tools/execute"
    and the repeated mention around the later lines) so the wording accurately
    reflects the risk model and includes the hard warning for operators.



---

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: `802897be-471c-4082-83ba-c9d0029e7916`

📥 Commits

Reviewing files that changed from the base of the PR and between 0213bb23e374fdf7ac9a69abf2dc35c822fa7b43 and 3d437be5e05cfe99323003a467c29fbc68e89fa3.

📒 Files selected for processing (1)
  • `nanochat/README.md`


Covers every nanochat capability through iii primitives:

Training (all queued):
- train.tokenizer: BPE tokenizer training from dataset
- train.base: Base pretraining with depth scaling, LR scheduling,
  checkpoint saving, FP8 support, periodic BPB evaluation
- train.sft: SFT with real task mixture (SmolTalk, MMLU, GSM8K,
  SimpleSpelling, SpellingBee), warmdown scheduling, checkpoint saving
- train.rl: GRPO reinforcement learning on GSM8K with advantage
  weighting, multi-sample rollouts, checkpoint saving

Evaluation:
- eval.core: Real CORE benchmark via base_eval.evaluate_core()
- eval.loss: BPB on configurable split with batch size control
- eval.chat: ChatCORE evaluation (generative + categorical tasks)

New functions:
- model.sample: Raw text generation from loaded model
- checkpoint.save: Save current model to disk
- checkpoint.list: List available checkpoints by source

All training progress, eval results, and checkpoints tracked
through iii state scopes.
- Added karpathy/nanochat as git submodule at nanochat-upstream/
- Training functions run the actual nanochat scripts as subprocesses
  (100% fidelity: gradient accum, Muon optimizer, best-fit packing,
  full task mixture, GRPO, DDP, FP8, checkpoint saving)
- eval.chat uses the real run_chat_eval dispatcher (all 6 tasks)
- tools.execute uses in-process exec (subprocess crashes iii WebSocket)
- 893 lines, 20 functions, 12/12 tested, 0 crashes
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (3)
nanochat/README.md (2)

81-84: ⚠️ Potential issue | 🟠 Major

Do not describe tools.execute as sandboxed.

The README still says “sandboxed environment,” but the implementation note confirms in-process exec(). This is a security-boundary mismatch and can mislead operators.

Also applies to: 141-142

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nanochat/README.md` around lines 81 - 84, The README incorrectly describes
nanochat.tools.execute as running in a "sandboxed environment"; update the
README text for nanochat.tools.execute to remove the word "sandboxed" and
explicitly state that the implementation uses in-process exec() (i.e., not a
secure sandbox), add a clear security warning about arbitrary code execution
risk and recommended operator mitigations (disable the endpoint or restrict
access), and make the same wording change for the other occurrence mentioned
(lines 141-142) so both descriptions match the actual nanochat.tools.execute
behavior.

118-131: ⚠️ Potential issue | 🟡 Minor

Add a language to the fenced output block.

This block is currently unlabeled; markdownlint/syntax highlighting will continue to fail.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nanochat/README.md` around lines 118 - 131, The unlabeled fenced code block
that starts with lines like "OK   nanochat.health {"status": "ok",
"model_loaded": false}" should be given a language identifier (for example
```text or ```console) to satisfy markdownlint and enable proper syntax
highlighting; update the fenced block in README.md that contains the nanochat
status/outputs to begin with a language label instead of just ```.
nanochat/worker.py (1)

117-120: ⚠️ Potential issue | 🔴 Critical

tools.execute is unsandboxed and ignores declared timeout.

This endpoint executes arbitrary Python with full builtins, and timeout is never enforced. The registered description also says “in sandbox,” which is inaccurate for the current implementation.

Suggested minimal mitigation
- ("nanochat.tools.execute", fn_tools_execute, "Execute Python code in sandbox", "http", {"api_path": "/nanochat/tools/execute", "http_method": "POST"}),
+ ("nanochat.tools.execute", fn_tools_execute, "Execute Python code (WARNING: unsandboxed; trusted input only)", "http", {"api_path": "/nanochat/tools/execute", "http_method": "POST"}),

Also applies to: 424-433, 796-796

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nanochat/worker.py` around lines 117 - 120, The ExecuteCodeInput model and
tools.execute currently run arbitrary Python unsandboxed and ignore the timeout;
update the execution path (the tools.execute handler and any helper that runs
user code) to run user code in a separate process (e.g., subprocess or
multiprocessing.Process) and enforce ExecuteCodeInput.timeout by passing it to
subprocess.run or using Process.join with a timeout and terminating the child if
exceeded; additionally restrict the execution environment by removing dangerous
builtins and not importing modules in the child (or run a dedicated sandboxed
interpreter) and update the registered description text from “in sandbox” to
reflect the actual sandboxing level until a full sandbox is implemented.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nanochat/README.md`:
- Line 35: Replace the incorrect dependency install command 'pip install -r
nanochat-upstream/pyproject.toml' with an editable package install; change the
README to instruct users to run 'cd nanochat-upstream && pip install -e .' (or
an equivalent pip install .) so pip reads the package metadata instead of
treating a pyproject.toml as a requirements file.

In `@nanochat/worker.py`:
- Around line 53-58: The decorator safe currently returns full tracebacks to
callers; change wrapper in safe to log the full exception and traceback
server-side (e.g., use the app logger and logger.exception or
logger.error(traceback.format_exc())) and remove traceback from the response,
returning a sanitized error envelope such as {"error": "Internal server error"}
or {"error": str(e)} without any traceback; update any imports/usage accordingly
in the safe/wrapper code paths.
- Around line 224-239: GPUState.load currently updates
model/tokenizer/engine/meta/source/model_tag/device under self._lock but
multiple handlers (fn_chat_complete, fn_chat_stream, fn_model_load,
fn_model_status, fn_eval_core, fn_eval_loss, fn_eval_chat) read those fields
without locking, risking races; fix by acquiring the same self._lock in each
read path (or at start of each handler) to either (a) copy model, tokenizer,
engine, meta, source, model_tag, device into local variables while holding the
lock and then release it before use, or (b) wrap the entire read/usage critical
section in with self._lock, ensuring all accesses use the protected references
and avoiding mixed-version/state reads; update all listed functions to use
self._lock consistently.
- Around line 472-492: fn_train_tokenizer (and the other async handlers
fn_train_base, fn_train_sft, fn_train_rl) currently call the blocking
_run_nanochat_script() directly and block the event loop; change each call to
run in a thread by awaiting asyncio.to_thread(_run_nanochat_script,
"scripts.tok_train", args, run_id, "tokenizer") (and the corresponding
script/name/args for the other handlers) so the blocking subprocess work is
offloaded, and add an import for asyncio if it's not already present.

---

Duplicate comments:
In `@nanochat/README.md`:
- Around line 81-84: The README incorrectly describes nanochat.tools.execute as
running in a "sandboxed environment"; update the README text for
nanochat.tools.execute to remove the word "sandboxed" and explicitly state that
the implementation uses in-process exec() (i.e., not a secure sandbox), add a
clear security warning about arbitrary code execution risk and recommended
operator mitigations (disable the endpoint or restrict access), and make the
same wording change for the other occurrence mentioned (lines 141-142) so both
descriptions match the actual nanochat.tools.execute behavior.
- Around line 118-131: The unlabeled fenced code block that starts with lines
like "OK   nanochat.health {"status": "ok", "model_loaded": false}" should be
given a language identifier (for example ```text or ```console) to satisfy
markdownlint and enable proper syntax highlighting; update the fenced block in
README.md that contains the nanochat status/outputs to begin with a language
label instead of just ```.

In `@nanochat/worker.py`:
- Around line 117-120: The ExecuteCodeInput model and tools.execute currently
run arbitrary Python unsandboxed and ignore the timeout; update the execution
path (the tools.execute handler and any helper that runs user code) to run user
code in a separate process (e.g., subprocess or multiprocessing.Process) and
enforce ExecuteCodeInput.timeout by passing it to subprocess.run or using
Process.join with a timeout and terminating the child if exceeded; additionally
restrict the execution environment by removing dangerous builtins and not
importing modules in the child (or run a dedicated sandboxed interpreter) and
update the registered description text from “in sandbox” to reflect the actual
sandboxing level until a full sandbox is implemented.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a15d26e2-e7dd-477d-9032-aa0e20e6f6a0

📥 Commits

Reviewing files that changed from the base of the PR and between 3d437be and 2345132.

📒 Files selected for processing (4)
  • .gitmodules
  • nanochat/README.md
  • nanochat/nanochat-upstream
  • nanochat/worker.py
✅ Files skipped from review due to trivial changes (2)
  • nanochat/nanochat-upstream
  • .gitmodules

Training functions now parse nanochat's stdout line-by-line and push
structured metrics to iii state as they arrive. Any worker can poll
nanochat.train.status and see live:

- step, loss, learning rate multiplier, tokens/sec, MFU, epoch
- validation BPB (when eval runs)
- CORE metric scores (base training)
- ChatCORE + ChatCORE_cat scores (SFT)
- average reward + sequence length (RL)
- pass@k accuracy (RL eval)

Eval results also written to nanochat:evals with run_id + step.

Parser handles all 6 nanochat output patterns:
- base/sft step line (step/total, loss, lrm, dt, tok/sec, mfu)
- Validation bpb line
- CORE metric line
- ChatCORE line
- RL average reward line
- RL pass@k line

937 lines, 20 functions, 12/12 tested, 0 crashes.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
nanochat/worker.py (1)

380-380: Minor: Prefer list unpacking over concatenation.

♻️ Suggested refactor
-        tokens = [bos] + gpu.tokenizer.encode(inp.prompt) if inp.prompt else [bos]
+        tokens = [bos, *gpu.tokenizer.encode(inp.prompt)] if inp.prompt else [bos]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nanochat/worker.py` at line 380, The current tokens construction uses list
concatenation; change it to list unpacking to be more idiomatic and slightly
more efficient: when building tokens (variable tokens) use [bos,
*gpu.tokenizer.encode(inp.prompt)] if inp.prompt else [bos], referencing tokens,
bos, gpu.tokenizer.encode, and inp.prompt in the update.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nanochat/worker.py`:
- Around line 424-433: fn_tools_execute currently calls exec with full
__builtins__ and ignores ExecuteCodeInput.timeout, leaving execution unsafe and
unbounded; change it to either (A) enforce a hard timeout and limit builtins:
run the exec in a restricted globals dict (do not pass {"__builtins__":
__builtins__}) and only include a small allowlist of safe callables, execute the
code in a worker thread (e.g., asyncio.to_thread) and wrap that in
asyncio.wait_for(timeout=inp.timeout) to enforce the timeout, or (B) if true
sandboxing isn't feasible, update the function/docstring that claims "sandbox"
to explicitly warn not to run untrusted code and still implement the timeout via
asyncio.wait_for or signal.alarm; reference fn_tools_execute, ExecuteCodeInput,
inp.timeout, and the current exec call to locate and fix the code.
- Around line 801-805: The current filename parsing in the block that builds
steps for each tag (variables tag_path, tag_dir and the steps list appended to
checkpoints) assumes files are exactly "model_STEP.pt" and will raise
IndexError/ValueError on malformed files; fix by filtering and parsing more
defensively: iterate over os.listdir(tag_path), only consider names that match a
strict pattern (e.g., regex r"^model_(\d+)\.pt$") or otherwise wrap the int
extraction in try/except to skip non-matching files, collect valid integers,
sort them, and append the resulting steps list to checkpoints; ensure malformed
filenames are ignored rather than bubbling exceptions.
- Around lines 349-350: the current logic sets device = inp.device or
autodetect_device_type(), which treats the string "auto" as a valid truthy
device and bypasses autodetection. Update the selection so that if inp.device
is None or equals "auto" (case-insensitive) you call autodetect_device_type(),
and otherwise use inp.device; then pass the resolved device to gpu.load(...)
(refer to inp.device, autodetect_device_type, device, and gpu.load).

---

Nitpick comments:
In `@nanochat/worker.py`:
- Line 380: the tokens construction uses list concatenation; switch to list
unpacking, which is more idiomatic and slightly more efficient: build tokens as
[bos, *gpu.tokenizer.encode(inp.prompt)] if inp.prompt else [bos] (referencing
tokens, bos, gpu.tokenizer.encode, and inp.prompt).
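The nitpick in isolation, with `bos` and a stand-in encoder as hypothetical placeholders for the worker's real tokenizer:

```python
bos = 1  # hypothetical BOS token id

def encode(s):
    # Stand-in for gpu.tokenizer.encode; real BPE encoding differs.
    return [ord(c) for c in s]

def build_tokens(prompt):
    # Unpacking builds the list in one pass instead of concatenating two lists.
    return [bos, *encode(prompt)] if prompt else [bos]
```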

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: eee41651-aa5c-4e37-b147-4c347d28f49c

📥 Commits

Reviewing files that changed from the base of the PR and between 2345132 and 86a1c43.

📒 Files selected for processing (1)
  • nanochat/worker.py

- safe() no longer leaks tracebacks to callers; logs server-side,
  returns only {"error": "..."} (was: Major, L58)
- GPUState.snapshot() for thread-safe reads; all handlers use locals
  from snapshot instead of reading gpu.* fields directly (was: Critical, L239)
- Training subprocess runs in asyncio.to_thread via split into
  _run_subprocess_blocking + _run_training (was: Major, L568)
- tools.execute description changed from "sandboxed" to "in-process,
  not sandboxed" since exec() with __builtins__ is not a real sandbox
  (was: Major, L414)
- f-string without placeholders already fixed in previous commit (L657)
- registry device config: changed "auto" to null (matches worker's
  autodetect behavior when device is omitted)
- checkpoint listing: catch ValueError on malformed filenames instead
  of crashing (e.g. model_backup.pt no longer breaks the parser)
- exec() sandboxing: acknowledged as known limitation, documented
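A wrapper matching the safe() behavior described above could look like the sketch below. It is hypothetical, not the worker's actual implementation: the logger name and the generic error payload are assumptions.

```python
import inspect
import logging

logger = logging.getLogger("worker")

def safe(handler):
    """Log the full traceback server-side; return only a generic error
    payload to callers. Supports both sync and async handlers."""
    async def wrapper(data):
        try:
            result = handler(data)
            if inspect.isawaitable(result):
                result = await result  # async handlers return an awaitable
            return result
        except Exception:
            # Traceback stays in the server logs, never in the response.
            logger.exception("handler %s failed", getattr(handler, "__name__", "?"))
            return {"error": "internal error"}
    return wrapper
```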
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
nanochat/worker.py (3)

357-358: ⚠️ Potential issue | 🟠 Major

Handle "auto" explicitly when resolving device.

Line 357 treats "auto" as a valid truthy device and skips autodetection. Normalize and resolve "auto" (case-insensitive) to autodetect_device_type().

🔧 Suggested fix

```diff
-    device = inp.device or autodetect_device_type()
+    device = (
+        inp.device
+        if inp.device and inp.device.strip().lower() != "auto"
+        else autodetect_device_type()
+    )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nanochat/worker.py` around lines 357 - 358, The device resolution treats a
string "auto" as truthy and skips autodetection; update the logic around device
= inp.device or autodetect_device_type() to detect when inp.device
(case-insensitive) equals "auto" and in that case call autodetect_device_type();
ensure you normalize inp.device (e.g., lower()) before comparison and then pass
the resolved device into gpu.load(inp.source, device, model_tag=inp.model_tag,
step=inp.step) so gpu.load never receives the literal "auto".

434-442: ⚠️ Potential issue | 🟠 Major

Enforce timeout in fn_tools_execute to avoid unbounded handler blocking.

ExecuteCodeInput.timeout is currently unused, so user code can run indefinitely and stall this handler path.

🔧 Suggested mitigation

```diff
+import asyncio
@@
 async def fn_tools_execute(data: ExecuteCodeInput) -> dict:
     inp = ExecuteCodeInput.model_validate(data) if isinstance(data, dict) else data
     stdout_buf, stderr_buf = io.StringIO(), io.StringIO()
+    def _run_exec():
+        with contextlib.redirect_stdout(stdout_buf), contextlib.redirect_stderr(stderr_buf):
+            exec(inp.code, {"__builtins__": __builtins__}, {})
     try:
-        with contextlib.redirect_stdout(stdout_buf), contextlib.redirect_stderr(stderr_buf):
-            exec(inp.code, {"__builtins__": __builtins__}, {})
+        await asyncio.wait_for(asyncio.to_thread(_run_exec), timeout=inp.timeout)
         return {"success": True, "stdout": stdout_buf.getvalue(), "stderr": stderr_buf.getvalue(), "error": None}
+    except asyncio.TimeoutError:
+        return {"success": False, "stdout": stdout_buf.getvalue(), "stderr": stderr_buf.getvalue(), "error": "execution timed out"}
     except Exception as e:
         return {"success": False, "stdout": stdout_buf.getvalue(), "stderr": stderr_buf.getvalue(), "error": str(e)}
```

```bash
#!/bin/bash
# Verify timeout field definition and whether execution path enforces it.
rg -n --type=py 'class ExecuteCodeInput|timeout|fn_tools_execute|asyncio\.wait_for|to_thread' nanochat/worker.py -C2
```

Based on learnings: in-process exec() is an accepted trade-off here due to iii-sdk WebSocket/event-loop corruption with forked execution models; this comment is only about missing timeout enforcement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nanochat/worker.py` around lines 434 - 442, fn_tools_execute currently
ignores ExecuteCodeInput.timeout allowing user code to run indefinitely; enforce
the timeout by running the exec body in a cancellable task or thread and
wrapping it with asyncio.wait_for using inp.timeout (fallback to a sensible
default if None). Specifically, update fn_tools_execute to validate inp.timeout
from ExecuteCodeInput, run the exec call (the block that captures
stdout_buf/stderr_buf and calls exec(inp.code,...)) inside asyncio.to_thread or
an async wrapper, and await that wrapped call inside
asyncio.wait_for(timeout=inp.timeout). Ensure exceptions from timeout are
handled to return {"success": False, "stdout": ..., "stderr": ..., "error":
"timeout"} and preserve existing stdout/stderr capture and other exception
handling for non-timeout errors.

834-837: ⚠️ Potential issue | 🟡 Minor

Make checkpoint step parsing defensive against non-conforming filenames.

Current split/index parsing can raise on unexpected files in checkpoint directories and fail listing.

🔧 Suggested fix

```diff
+                import re
-                steps = sorted([
-                    int(f.split("_")[1].split(".")[0])
-                    for f in os.listdir(tag_path) if f.startswith("model_") and f.endswith(".pt")
-                ])
+                steps = []
+                for f in os.listdir(tag_path):
+                    m = re.match(r"^model_(\d+)\.pt$", f)
+                    if m:
+                        steps.append(int(m.group(1)))
+                steps.sort()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nanochat/worker.py` around lines 834 - 837, The checkpoint step parsing in
the steps list comprehension can crash on non-conforming filenames; change the
logic that builds steps (currently iterating os.listdir(tag_path) and parsing
with split) to be defensive: iterate filenames from os.listdir(tag_path), match
each against a regex like r"^model_(\d+)\.pt$" (or try/except around int
extraction), only convert the captured group to int for matches, and skip any
non-matching files so that the resulting steps list contains only valid integers
before sorting.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nanochat/worker.py`:
- Around line 241-245: The snapshot method in GPUState only returns (model,
tokenizer, engine, meta, source, device) but misses model_tag and callers still
read gpu.* directly; update snapshot(self) to include self.model_tag in its
returned tuple (e.g., (model, tokenizer, engine, meta, source, device,
model_tag)) and then change all downstream usages (places that currently read
gpu.model_tag or other gpu.* fields directly—e.g., the call sites around the
former Line 374, Lines ~801-815, and Lines ~847-850) to use the single atomic
GPUState.snapshot() return instead of mixing snapshot() with direct attribute
reads so metadata cannot become inconsistent during concurrent
nanochat.model.load operations.
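The 7-tuple snapshot the comment asks for might look like the sketch below. The class shape is an illustrative reconstruction from the review comment; the real GPUState in worker.py may differ.

```python
import threading
from dataclasses import dataclass, field
from typing import Any, Optional, Tuple

@dataclass
class GPUState:
    model: Any = None
    tokenizer: Any = None
    engine: Any = None
    meta: dict = field(default_factory=dict)
    source: Optional[str] = None
    device: Optional[str] = None
    model_tag: Optional[str] = None
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def snapshot(self) -> Tuple[Any, Any, Any, dict, Optional[str], Optional[str], Optional[str]]:
        # One atomic read under the lock: callers get a consistent 7-tuple
        # even while a concurrent nanochat.model.load mutates the state.
        with self._lock:
            return (self.model, self.tokenizer, self.engine, self.meta,
                    self.source, self.device, self.model_tag)
```

Handlers then unpack the tuple into locals and never touch gpu.* fields directly, so a load that swaps the model mid-request cannot produce mixed metadata.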


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 29460c28-b37a-4bff-95f9-e3ea49b5f3e3

📥 Commits

Reviewing files that changed from the base of the PR and between 86a1c43 and 270b6b8.

📒 Files selected for processing (1)
  • nanochat/worker.py

worker.py:
- snapshot() now returns model_tag (7-tuple), all callers updated
- zero direct gpu.* reads outside snapshot/load/main
- tokenizer handlers use snapshot for thread-safe tokenizer access

README.md:
- pip install -r pyproject.toml -> cd nanochat-upstream && pip install -e .
- tools.execute description: "sandboxed" -> "in-process, not sandboxed"
- add language label to test output code fence
- fix missing spaces after colons throughout

pyproject.toml:
- add [build-system] section (PEP 517/518)
Training:
- Pre-forked child process (fork before iii connects) runs Popen
  safely without corrupting the WebSocket. Uses multiprocessing
  with explicit fork context.
- Training handlers send jobs via Pipe, child runs nanochat scripts,
  results come back with stdout lines for metric parsing.
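The pre-forked launcher described above can be sketched like this. It is a sketch under stated assumptions: _training_child and start_training_launcher are hypothetical names, and the job/result payload shapes are illustrative, not the worker's actual protocol.

```python
import multiprocessing as mp
import subprocess

def _training_child(conn):
    # Runs in a child forked before iii connects, so Popen here cannot
    # touch the parent's WebSocket or event loop.
    while True:
        job = conn.recv()  # e.g. {"argv": ["python", "scripts/base_train.py", ...]}
        if job is None:
            break
        proc = subprocess.run(job["argv"], capture_output=True, text=True)
        conn.send({
            "returncode": proc.returncode,
            # stdout lines go back to the parent for metric parsing
            "stdout_lines": proc.stdout.splitlines(),
        })

def start_training_launcher():
    # Explicit fork context: this must run before the worker opens its
    # WebSocket, or the child would inherit live connection state.
    ctx = mp.get_context("fork")
    parent_conn, child_conn = ctx.Pipe()
    child = ctx.Process(target=_training_child, args=(child_conn,), daemon=True)
    child.start()
    return parent_conn, child
```

Training handlers would then send a job dict over parent_conn and block (in a thread) on the reply; sending None shuts the child down cleanly.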

Bug fixes:
- model.load: pass torch.device (not string), use phase="eval"
- chat.complete: conversation format is {"messages": [...]} not [...]
  (nanochat's render_conversation expects this)
- model.sample: generate_batch takes tokens directly, not [tokens]
- safe() handles both sync and async handlers

E2E test results (2-layer GPT, 5 steps, CPU):
  Load model  -> 1,966,134 params, 2 layers, 128 dim
  Sample      -> generates text (gibberish from minimal training)
  Chat        -> completion with session tracking in iii state
  History     -> 1 session stored
  Tokenizer   -> encode/decode roundtrip
  Tools       -> code execution (7*6=42)
  Status      -> full model config
  Health      -> worker alive through all operations
