feat: add managed worker sandbox with libkrun VM isolation#1402

Open
andersonleal wants to merge 15 commits into main from feat/sandbox

Conversation

andersonleal (Contributor) commented Apr 3, 2026

Summary

Introduce microVM-based worker sandboxing using libkrun for secure, isolated function execution. This PR adds a complete sandbox infrastructure with four new crates (iii-worker, iii-filesystem, iii-network, iii-init) and refactors the engine CLI to delegate worker management to the standalone worker binary.

Key changes

  • iii-worker crate: Standalone worker binary with lazy firmware provisioning (libkrunfw), VM boot lifecycle, and managed worker state. Provides CLI commands for worker start/stop/dev/logs.
  • iii-filesystem crate: Passthrough filesystem backend for mounting host directories into the microVM guest, with inode management, directory operations, and platform-specific abstractions.
  • iii-network crate: User-space networking stack for the sandbox — virtual device, DNS proxy, TCP/UDP connection tracking, and smoltcp-based network stack.
  • iii-init crate: Minimal init process for the microVM guest — mounts filesystems, configures networking, sets resource limits, and supervises the user workload.
  • Sandbox Docker images: Node.js and Python base images with Dockerfiles and documentation for building sandbox root filesystems.
  • Engine refactor: Removed the old worker_manager module from the engine CLI, replaced with a lightweight managed_shim that delegates to the iii-worker binary. Improved logging with tracing instead of eprintln!.
  • CI/CD updates: Updated release pipeline and cross-compilation Makefile for the new worker crate and firmware assets.
  • Comprehensive tests: Unit and integration tests for filesystem, network, init, and worker components.

Type of Change

  • New feature (non-breaking change which adds functionality)
  • Refactoring (no functional changes, no API changes)

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • New and existing unit tests pass locally with my changes

Additional Context

This is the foundation for running user-submitted functions in isolated microVMs via libkrun. The architecture separates concerns into dedicated crates:

Engine (iii binary)
  └── managed_shim → delegates to iii-worker binary
        └── libkrun VM adapter
              ├── iii-filesystem (passthrough mount)
              ├── iii-network (user-space networking)
              └── iii-init (guest init process)

The worker binary lazily provisions firmware (libkrunfw) by checking for embedded binaries, local cache, or downloading from GitHub releases. Docker images for Node.js and Python provide the root filesystem layers for the sandbox VMs.
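The provisioning order described above (embedded binary, then local cache, then GitHub release download) can be sketched in Rust. All names below are illustrative, not the actual iii-worker API:

```rust
use std::path::PathBuf;

// Illustrative sketch of the lazy firmware resolution order: embedded
// binary first, then a local cache, then a release-asset download.
// Names are hypothetical, not the crate's real API.
enum FirmwareSource {
    Embedded(&'static [u8]),
    Cached(PathBuf),
    Download(String), // URL of the release asset to fetch
}

fn resolve_firmware(
    embedded: Option<&'static [u8]>,
    cache_path: PathBuf,
    release_url: &str,
) -> FirmwareSource {
    if let Some(bytes) = embedded {
        return FirmwareSource::Embedded(bytes); // compiled in via build.rs
    }
    if cache_path.exists() {
        return FirmwareSource::Cached(cache_path); // provisioned on a previous run
    }
    FirmwareSource::Download(release_url.to_string()) // last resort: download
}
```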

Summary by CodeRabbit

  • New Features

    • New managed worker CLI (add/remove/start/stop, dev) with hidden VM-boot flow
    • Userspace networking (TCP/UDP/DNS) and a passthrough filesystem with optional embedded init
    • Automatic firmware/init provisioning (embed or download)
    • New worker runtime crate and sandbox images (Node, Python)
  • Chores

    • CI/release and Makefile enhanced for cross-builds and artifact publishing; workspace expanded with new crates.
  • Documentation

    • Added local development guide and sandbox image READMEs.

Introduce microVM-based worker sandboxing via libkrun runtime adapter.
New CLI commands: worker start/stop/dev/logs for managing isolated
worker containers. Adds iii.workers.yaml state file, vm_boot subprocess,
and managed worker lifecycle (add/remove/start/stop/dev/logs).
…visioning

Extract VM worker runtime into a separate `iii-worker` crate that resolves
libkrunfw and iii-init via embed, local cache, or GitHub release download.
Add `has_init()` guard to iii-filesystem so PassthroughFs works without
embedded init. Add filterable target matrix to _rust-binary workflow,
firmware-upload job, and worker-release job to the release pipeline.
… binary handling

- Added `iii-worker` crate and its features to the Makefile for cross-compilation.
- Updated sandbox build process to include the worker crate with embedded features.
- Modified `main.rs` in `iii-init` to handle non-Linux targets gracefully.
- Improved error handling in `libkrun.rs` for missing root filesystem, with logging for automatic pulling.
- Adjusted stderr handling in `managed_shim.rs` for better visibility during worker startup.
…g in managed_shim

- Moved the `managed_engine_url` function from `managed.rs` to the test module for better encapsulation.
- Improved error handling and logging for worker processes in `managed_shim.rs`, capturing stderr output for better visibility during startup.
…e crates

- Updated `nix` dependency to version 0.31.2 in `Cargo.lock`, `iii-init`, and `iii-network`.
- Changed license to "Elastic-2.0" in `Cargo.toml` for `iii-filesystem`, `iii-init`, and `iii-network`.
- Modified `main.rs` in `iii-worker` to include version information in the command structure.
…and documentation

- Introduced Dockerfiles for Node.js and Python sandbox images, including essential development tools and non-root user configurations for improved security.
- Added README files for both images detailing features, build instructions, and usage examples.
- Updated libkrun to use the new Docker images for TypeScript and Python languages, replacing references to microsandbox images.
- Enhanced environment variable handling in the development worker setup to include OCI environment variables.
- Improved logging by replacing `eprintln!` with `tracing::debug!` and `tracing::info!` for better visibility and structured output.
- Updated output messages for VM booting and root filesystem preparation to provide clearer context.
- Enhanced project and engine information display in the worker development command.
- Adjusted existing messages to improve clarity and consistency across the CLI interface.
…tem and network components

- Introduced unit tests for PassthroughFsBuilder, ConnectionTracker, and various network functionalities to ensure robust behavior.
- Added integration tests for iii-filesystem and iii-network to validate end-to-end functionality and interactions between components.
- Enhanced error handling and validation checks in tests to cover edge cases and improve reliability.
- Updated Cargo.toml to include necessary development dependencies for testing.
…rs dependency

- Updated all instances of the `syn` dependency to version 2.0.117 in `Cargo.lock`.
- Added `machineid-rs` version 1.2.4 to `Cargo.toml` for enhanced machine identification functionality.
- Removed the `otel` feature from `iii-sdk` in `Cargo.toml` to streamline dependencies.
vercel bot commented Apr 3, 2026

The latest updates on your projects:

  • iii-website — Ready — Apr 3, 2026 10:28pm (UTC)
  • motia-docs — Ready — Apr 3, 2026 10:28pm (UTC)

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough

Adds four new Rust crates (iii-init, iii-filesystem, iii-network, iii-worker), extensive backend implementations (passthrough FUSE filesystem, smoltcp userspace network), CLI/runtime features for worker lifecycle and firmware/init provisioning, workspace/CI/release updates, and engine-side removal/replacement of prior worker-manager modules.

Changes

Cohort / File(s) — Summary

  • Workspace & Build config (.cargo/config.toml, Cargo.toml, Makefile, engine/Cargo.toml, engine/build.rs) — Add musl linker config, workspace members for the four new crates, new Makefile targets and host detection, small engine dependency edits, and a build.rs rerun directive.
  • CI / Release workflows (.github/workflows/_rust-binary.yml, .github/workflows/ci.yml, .github/workflows/release-iii.yml) — Introduce a prepare-matrix job and dynamic build matrix, new workflow inputs (features, targets, init_artifacts), conditional init-artifact downloads for MUSL targets, and init/firmware/worker release jobs plus aggregated notify logic.
  • Passthrough filesystem (crates/iii-filesystem/*) — New crate with a builder API and PassthroughFs: inode/handle tables, init.krun embedding, create/dir/file/link/unlink/rename/read/write/metadata/special ops, platform abstractions, a build.rs embed probe, and many unit/integration tests.
  • Init binary (crates/iii-init/*) — New PID‑1 init binary crate: mount setup, rlimit handling, network configuration (ioctl), best‑effort resolv.conf write, supervisor spawning the worker with signal forwarding, error types, and tests.
  • Userspace network (crates/iii-network/*) — New smoltcp-based network crate: SmoltcpBackend/device, ConnectionTracker, DnsInterceptor, UdpRelay, WakePipe, SharedState queues, poll-loop orchestration, helpers, and tests.
  • Worker runtime & CLI (crates/iii-worker/*) — New worker CLI/runtime: firmware/libkrunfw and iii-init embedding and download flows, build.rs for embedding, OCI/dev workflows, vm-boot glue (PassthroughFs + SmoltcpNetwork), worker-manager adapters (libkrun), many CLI subcommands, and tests.
  • Engine refactor & installer (engine/src/cli/*, engine/install.sh, engine/local-worker-dev.md) — Remove/replace prior engine worker-manager modules (manifest/registry/install/storage/spec/uninstall/etc.), add managed_shim for start/stop, adjust update semantics to skip unsupported platforms, and add an optional iii-init install step plus a local-dev doc.
  • Images, docs & scripts (crates/iii-worker/images/*, docs, engine/install.sh) — Add Node/Python sandbox Docker images and READMEs; extend the install script to optionally fetch iii-init from release assets; add local dev documentation.
  • Tests (crates/*/tests/*) — Extensive new unit and integration tests across the new crates covering filesystem, networking, init, firmware resolution, and worker CLI/manifest behaviors.

Sequence Diagram(s)

sequenceDiagram
    participant Guest as Guest VM
    participant Kernel as Kernel (FUSE)
    participant FS as PassthroughFs
    participant Init as Embedded Init
    participant Host as Host FS
    Guest->>Kernel: lookup/open/read "init.krun"
    Kernel->>FS: FUSE lookup/open/read request
    FS->>Init: check has_init / read_init
    alt init embedded
        Init-->>FS: return bytes / synthetic stat
    else no init
        FS->>Host: openat/stat operations for real inode
        Host-->>FS: fd / stat
    end
    FS-->>Kernel: FUSE response (entry/bytes)
    Kernel-->>Guest: VFS result
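The embedded-init branch in this diagram can be sketched as follows. This is a simplified illustration with synthetic names and signatures, not the actual PassthroughFs API:

```rust
// Sketch of the "init embedded" vs "real inode" branch from the diagram.
// Field and method names are illustrative only.
const INIT_NAME: &str = "init.krun";

struct PassthroughFs {
    init_bytes: Option<&'static [u8]>, // embedded at build time, if present
}

impl PassthroughFs {
    fn has_init(&self) -> bool {
        self.init_bytes.is_some()
    }

    /// Serve reads of "init.krun" from the embedded blob; otherwise fall
    /// back to the host filesystem (stubbed out here).
    fn read(&self, name: &str, offset: usize, size: usize) -> Vec<u8> {
        match self.init_bytes {
            Some(bytes) if name == INIT_NAME => {
                let end = (offset + size).min(bytes.len());
                bytes[offset.min(bytes.len())..end].to_vec()
            }
            _ => self.read_from_host(name, offset, size),
        }
    }

    fn read_from_host(&self, _name: &str, _offset: usize, _size: usize) -> Vec<u8> {
        Vec::new() // the real crate would openat/pread against the host dir
    }
}
```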
sequenceDiagram
    participant GuestNet as Guest (virtio)
    participant Backend as SmoltcpBackend
    participant Shared as SharedState (queues)
    participant Poll as Smoltcp Poll Loop
    participant Proxy as TCP Proxy Task
    participant Server as Remote Server
    GuestNet->>Backend: write_frame(hdr + payload)
    Backend->>Shared: push tx_ring
    Poll->>Shared: pop tx_ring
    Poll->>Poll: classify_frame -> TCP/UDP/DNS
    alt TCP SYN
        Poll->>Proxy: spawn proxy for connection
        Proxy->>Server: connect + relay
        Server-->>Proxy: response
        Proxy->>Shared: push rx_ring
    else UDP/DNS or passthrough
        Poll->>Shared: push rx_ring (responses or passthrough)
    end
    Backend->>Shared: pop rx_ring
    Backend->>GuestNet: read_frame(prepend virtio header + payload)

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested reviewers

  • sergiofilhowz
  • guibeira

Poem

🐰 I tunneled through crates, bytes snug in a chest,
Built init seeds and network nests,
Inodes hop, queues hum a tune,
Workers wake early beneath the moon —
🥕 Ready, set, the sandbox boots next!


coderabbitai bot left a comment

Actionable comments posted: 1

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🟠 Major comments (21)
crates/iii-worker/images/node/README.md-7-7 (1)

7-7: ⚠️ Potential issue | 🟠 Major

Documented Node version conflicts with the actual base image.

Line 7 says 20.x LTS, but the Dockerfile currently uses node:24-slim. Please align README and Dockerfile to one source of truth.

As per coding guidelines, README content should remain consistent with related implementation/docs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-worker/images/node/README.md` at line 7, The README's documented
Node version ("Node.js 20.x LTS") conflicts with the actual base image tag
("node:24-slim") used in the Dockerfile; update either the README entry in
crates/iii-worker/images/node/README.md to match the Dockerfile's node:24-slim
base or change the Dockerfile to use the documented 20.x LTS image so both
sources are consistent — ensure the README string and the Dockerfile FROM line
reference the same major Node version (e.g., "Node.js 24.x" if keeping
node:24-slim or change FROM to node:20-slim to keep "Node.js 20.x LTS").
crates/iii-worker/images/node/README.md-15-19 (1)

15-19: ⚠️ Potential issue | 🟠 Major

Build instruction uses the wrong Dockerfile path for project-root execution.

Same issue as Python README: from project root, -f Dockerfile is ambiguous/wrong for this image.

Suggested doc fix
- docker build -t iiidev/node -f Dockerfile .
+ docker build -t iiidev/node -f crates/iii-worker/images/node/Dockerfile .

As per coding guidelines, this README should follow the authoritative image build/run flow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-worker/images/node/README.md` around lines 15 - 19, The README
build instruction uses an ambiguous Dockerfile path; update the docker build
command in README.md so it explicitly references the Dockerfile inside the node
image directory (e.g., use -f crates/iii-worker/images/node/Dockerfile) and the
correct build context (project root or the image directory) to match the
authoritative image build/run flow; ensure the example preserves the image tag
(iiidev/node) and matches how other image READMEs specify build/run steps.
crates/iii-worker/images/python/README.md-13-17 (1)

13-17: ⚠️ Potential issue | 🟠 Major

Build command path is incorrect when run from project root.

Line 13 says “from the project root,” but Line 16 uses -f Dockerfile, which won’t target this file from root. Use the full path to avoid broken setup instructions.

Suggested doc fix
- docker build -t iiidev/python -f Dockerfile .
+ docker build -t iiidev/python -f crates/iii-worker/images/python/Dockerfile .

As per coding guidelines, READMEs for these images must follow the authoritative build/run commands and stay consistent with documented usage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-worker/images/python/README.md` around lines 13 - 17, The build
command in README.md is wrong for running from the project root; update the
docker build invocation in crates/iii-worker/images/python/README.md so the -f
flag points to the image Dockerfile path
(crates/iii-worker/images/python/Dockerfile) and use the image directory as the
build context instead of just `-f Dockerfile .`; edit the command shown (the
current "docker build -t iiidev/python -f Dockerfile .") to reference the full
Dockerfile path and the correct context so the command works when executed from
the project root.
crates/iii-worker/images/python/Dockerfile-1-1 (1)

1-1: ⚠️ Potential issue | 🟠 Major

Pin the Python base image to a specific version tag.

Using python:latest makes builds non-reproducible and non-deterministic. The latest tag can silently point to new Python releases, introducing unexpected breaking changes and build drift. Pin to a specific version (e.g., python:3.12-slim-bookworm) to ensure reproducible builds and consistent behavior across environments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-worker/images/python/Dockerfile` at line 1, Replace the
non-deterministic base image "python:latest" in the Dockerfile with a pinned,
specific tag (for example "python:3.12-slim-bookworm" or another supported
patch-level image) to ensure reproducible builds; edit the FROM instruction in
the Dockerfile to use the chosen explicit tag and rebuild to verify no
unexpected changes.
engine/src/cli/registry.rs-54-61 (1)

54-61: ⚠️ Potential issue | 🟠 Major

Align iii-init registry targets with actual binary distribution strategy.

iii-init is declared in the registry with specific Linux targets (lines 54-61), yet no iii-init binaries are published in releases. The comment at update.rs:399 documents it as unsupported on macOS/Windows, which contradicts the registry declaration and suggests the binary is not meant for distribution via this mechanism.

The current check_platform_support() gate prevents unsafe asset-fetch attempts, so immediate runtime risk is low. However, either:

  • Remove iii-init from the registry if it's not distributed as a managed binary, or
  • Configure the release pipeline to build and publish iii-init binaries for declared targets.

Verify with the project whether iii-init (a PID 1 init binary for microVM workers) should be managed through the update flow or distributed differently.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@engine/src/cli/registry.rs` around lines 54 - 61, The registry declares
supported_targets for the `iii-init` managed binary but `iii-init` binaries are
not actually published; update the registry to reflect reality by removing
`iii-init` from the supported_targets array in the registry declaration (the
array shown with "x86_64-unknown-linux-musl", "aarch64-apple-darwin", etc.) or
alternatively add build/publish steps for `iii-init` in the release pipeline so
those targets are produced; ensure this change is consistent with the platform
gate `check_platform_support()` and the note in update.rs (around the PID 1 init
comment) so runtime asset-fetch logic matches the declared distribution
strategy.
crates/iii-worker/src/cli/worker_manager/state.rs-46-50 (1)

46-50: ⚠️ Potential issue | 🟠 Major

Make writes atomic to avoid corrupting iii.workers.yaml.

Line 49 overwrites the file directly. A crash/interruption during write can leave a truncated manifest and break subsequent loads.

🛠️ Proposed fix
     pub fn save(&self) -> Result<()> {
         let data =
             serde_yaml::to_string(self).with_context(|| "failed to serialize workers file")?;
-        std::fs::write(WORKERS_FILE, data)
-            .with_context(|| format!("failed to write {}", WORKERS_FILE))?;
+        let tmp = format!("{}.tmp", WORKERS_FILE);
+        std::fs::write(&tmp, data).with_context(|| format!("failed to write {}", tmp))?;
+        std::fs::rename(&tmp, WORKERS_FILE)
+            .with_context(|| format!("failed to replace {}", WORKERS_FILE))?;
         Ok(())
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-worker/src/cli/worker_manager/state.rs` around lines 46 - 50, The
save() method currently writes directly to WORKERS_FILE which can leave a
truncated file on crash; change it to perform an atomic write by writing
serialized data to a temporary file in the same directory (e.g., WORKERS_FILE +
".tmp" or use tempfile::NamedTempFile) and then atomically replace the target
with std::fs::rename (or persist the NamedTempFile) to move the temp into place;
propagate errors with the existing with_context messages (e.g., "failed to write
{}") and ensure the temp file is cleaned up on failure.
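The write-to-temp-then-rename pattern suggested above can be shown as a minimal std-only sketch (the helper name is illustrative). A same-directory rename replaces the target in one step, so a crash mid-write leaves either the old file or the new one, never a torn mix:

```rust
use std::fs;
use std::path::Path;

/// Write `data` to `path` atomically: write a sibling temp file first,
/// then rename it over the target. Illustrative helper, not project code.
fn atomic_write(path: &Path, data: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    fs::write(&tmp, data)?; // a crash here leaves the old file untouched
    fs::rename(&tmp, path) // atomic replace within the same directory
}
```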
crates/iii-filesystem/build.rs-20-24 (1)

20-24: ⚠️ Potential issue | 🟠 Major

aarch64-unknown-linux-musl target unavailable; embed-init feature will silently fail on aarch64 Linux.

The build script uses aarch64-unknown-linux-musl (line 22), but platform.rs explicitly states "Linux aarch64 uses gnu (no musl builds available)". When the musl binary does not exist, the code silently writes a 1-byte placeholder (line 42) instead of the real binary, causing the embed-init feature to fail without error. The comment at line 19 claiming "init is always a Linux musl binary" is incorrect for aarch64.

Change aarch64 to map to "aarch64-unknown-linux-gnu" to align with install.sh (line 399) and platform.rs documentation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-filesystem/build.rs` around lines 20 - 24, The match on arch
assigns the wrong target triple for aarch64, causing embed-init to pick a
non-existent musl binary and silently write a 1-byte placeholder; update the
match arm in build.rs so that when arch == "aarch64" it maps triple =
"aarch64-unknown-linux-gnu" (instead of "aarch64-unknown-linux-musl"), and
correct the preceding comment that claims init is always a Linux musl binary to
reflect that aarch64 uses GNU; ensure this change makes the embed-init flow
select the real binary rather than the placeholder.
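The corrected arch-to-triple mapping the comment describes would look roughly like this (a sketch, not the actual build.rs code):

```rust
// Sketch of the target-triple selection per the comment above: guests are
// always Linux, but aarch64 has no musl build, so it maps to the gnu triple.
fn init_triple(arch: &str) -> Option<&'static str> {
    match arch {
        "x86_64" => Some("x86_64-unknown-linux-musl"),
        "aarch64" => Some("aarch64-unknown-linux-gnu"), // no musl builds available
        _ => None, // unknown arch: fail instead of writing a placeholder
    }
}
```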
engine/install.sh-394-408 (1)

394-408: ⚠️ Potential issue | 🟠 Major

Fix Darwin target mapping for iii-init.

The install script incorrectly maps Darwin hosts to Darwin binaries (x86_64-apple-darwin, aarch64-apple-darwin). However, crates/iii-filesystem/build.rs (line 18-19) explicitly states: "VMs always run Linux guests, so init is always a Linux musl binary regardless of the host OS." The engine/local-worker-dev.md documentation consistently confirms this—both Apple Silicon and x86_64 hosts build iii-init using aarch64-unknown-linux-musl and x86_64-unknown-linux-musl targets.

The engine/src/cli/registry.rs also incorrectly lists aarch64-apple-darwin and x86_64-apple-darwin as supported targets for iii-init (lines 54-59), contradicting the build.rs constraint. These Darwin targets do not exist in releases and should be removed.

The Darwin case in lines 394-408 should map to Linux musl targets matching the host architecture, not Darwin targets.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@engine/install.sh` around lines 394 - 408, The Darwin branch of the
init_target selection incorrectly assigns Darwin triples; change the Darwin case
in the init_target logic so that when uname_s == "Darwin" it sets init_target to
the Linux musl variants (x86_64-unknown-linux-musl for x86_64 and
aarch64-unknown-linux-musl for aarch64) so iii-init is always a Linux musl
binary (refer to init_target, uname_s, arch), and also remove the non-existent
aarch64-apple-darwin and x86_64-apple-darwin entries from the supported targets
list in engine/src/cli/registry.rs (the list that registers iii-init targets) so
only the linux-musl targets remain.
crates/iii-worker/src/cli/firmware/symlinks.rs-7-28 (1)

7-28: ⚠️ Potential issue | 🟠 Major

Surface symlink provisioning failures to the caller.

If one unlink/symlink operation fails here, firmware provisioning can still look successful and the breakage only shows up later when libkrun tries to load libkrunfw. Returning a Result would let the install path fail fast instead of leaving a partial SONAME layout behind.

Possible change
 #[cfg(unix)]
-pub fn create_libkrunfw_symlinks(lib_dir: &Path) {
+pub fn create_libkrunfw_symlinks(lib_dir: &Path) -> std::io::Result<()> {
     let symlinks = libkrunfw_symlink_pairs();
 
     for (link_name, target) in &symlinks {
         let link_path = lib_dir.join(link_name);
 
         if link_path.exists() || link_path.is_symlink() {
-            let _ = std::fs::remove_file(&link_path);
+            std::fs::remove_file(&link_path)?;
         }
 
-        if let Err(e) = std::os::unix::fs::symlink(target, &link_path) {
-            tracing::warn!(
-                link = %link_path.display(),
-                target = %target,
-                error = %e,
-                "failed to create libkrunfw symlink"
-            );
-        }
+        std::os::unix::fs::symlink(target, &link_path)?;
     }
+
+    Ok(())
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-worker/src/cli/firmware/symlinks.rs` around lines 7 - 28, The
function create_libkrunfw_symlinks currently swallows errors and only logs
warnings, which can leave partial SONAME layout; change its signature to return
a Result<(), E> (e.g., std::io::Result<()> or anyhow::Result<()>) and propagate
failures from std::fs::remove_file and std::os::unix::fs::symlink instead of
ignoring them; use the ? operator (or map_err to convert errors) when calling
remove_file and symlink inside the loop (referencing create_libkrunfw_symlinks
and libkrunfw_symlink_pairs) so the caller can fail fast on any unlink/symlink
error and handle rollback or surface a provisioning failure.
crates/iii-network/src/dns.rs-135-150 (1)

135-150: ⚠️ Potential issue | 🟠 Major

Bound DNS resolution fan-out.

The bounded query_rx (capacity 64) does not actually bound total concurrent work here, because every dequeued packet is immediately re-spawned onto its own Tokio task. A noisy guest can create unbounded in-flight DNS lookups and waiting tasks. Resolve inline in this loop, or gate tokio::spawn behind a semaphore or JoinSet limit to cap concurrent resolutions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-network/src/dns.rs` around lines 135 - 150, The loop spawns a new
Tokio task for every dequeued DNS query (query_rx) so the bounded channel
(capacity 64) doesn't limit concurrent resolve_query work; replace the
unconditional tokio::spawn by either performing resolve_query(&query.data,
&resolver).await inline inside the while let loop (and then sending the response
on response_tx) or gate task creation with a concurrency limiter (e.g., a
tokio::sync::Semaphore permit or a tokio::task::JoinSet with a max concurrent
count) so only a fixed number of resolve_query tasks run concurrently; update
the code paths that reference query, resolver, response_tx, and
shared.proxy_wake accordingly to use the chosen strategy.
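Since this crate is async, the real fix would gate tokio::spawn behind tokio::sync::Semaphore or a capped JoinSet; the bounding idea itself can be demonstrated with a std-only counting semaphore (all names illustrative):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Minimal counting semaphore; tokio::sync::Semaphore plays this role in async code.
struct Semaphore {
    count: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(n: usize) -> Self {
        Semaphore { count: Mutex::new(n), cv: Condvar::new() }
    }
    fn acquire(&self) {
        let mut c = self.count.lock().unwrap();
        while *c == 0 {
            c = self.cv.wait(c).unwrap();
        }
        *c -= 1;
    }
    fn release(&self) {
        *self.count.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

/// Run `jobs` fake lookups with at most `limit` in flight; returns the peak
/// observed concurrency so the bound can be checked.
fn run_bounded(jobs: usize, limit: usize) -> usize {
    let sem = Arc::new(Semaphore::new(limit));
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let (sem, in_flight, peak) = (sem.clone(), in_flight.clone(), peak.clone());
            thread::spawn(move || {
                sem.acquire(); // blocks once `limit` lookups are in flight
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(5)); // stand-in for the DNS lookup
                in_flight.fetch_sub(1, Ordering::SeqCst);
                sem.release();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

Acquiring the permit before spawning is what backpressures the loop: once the limit is reached, the consumer stops draining and the bounded query channel fills, pushing the excess back onto the guest.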
crates/iii-init/src/mount.rs-110-145 (1)

110-145: ⚠️ Potential issue | 🟠 Major

Don't silently skip worker memory enforcement.

mount_cgroup2().ok() and the ignored memory.max write let the worker boot without the requested cgroup limit and without any warning. At minimum log this; if III_WORKER_MEM_BYTES is set, I would surface the error so the sandbox does not run with weaker limits than requested.

Possible change
-    mount_cgroup2().ok();
+    if let Err(e) = mount_cgroup2() {
+        if std::env::var_os("III_WORKER_MEM_BYTES").is_some() {
+            return Err(e);
+        }
+        eprintln!("iii-init: warning: failed to configure cgroup2: {e}");
+    }
@@
     // Set memory limit from env var (passed by vm_boot.rs).
     if let Ok(mem_bytes) = std::env::var("III_WORKER_MEM_BYTES") {
-        let _ = std::fs::write("/sys/fs/cgroup/worker/memory.max", &mem_bytes);
+        std::fs::write("/sys/fs/cgroup/worker/memory.max", &mem_bytes).map_err(|e| {
+            InitError::WriteFile {
+                path: "/sys/fs/cgroup/worker/memory.max".into(),
+                source: e,
+            }
+        })?;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-init/src/mount.rs` around lines 110 - 145, The code currently
swallows errors by calling mount_cgroup2().ok() and ignoring the result of
writing memory.max, which allows the worker to run without requested memory
limits; change this so failures are surfaced or at least logged: (1) stop
dropping the result at the call site — replace mount_cgroup2().ok() with proper
error handling (propagate the InitError or log a clear warning) so mount
failures aren't silently ignored; (2) inside mount_cgroup2(), when
III_WORKER_MEM_BYTES is set, do not ignore the result of
std::fs::write("/sys/fs/cgroup/worker/memory.max", &mem_bytes) — return an
InitError on failure (or at minimum log an error with the path, mem_bytes
presence, and the io::Error) so the supervisor won't run with weaker limits than
requested; (3) ensure mkdir_ignore_exists("/sys/fs/cgroup") and the
cgroup.subtree_control write also surface/log errors consistently (they already
map errors to InitError) so behavior is uniform.
crates/iii-network/src/dns.rs-62-79 (1)

62-79: ⚠️ Potential issue | 🟠 Major

Make DNS interceptor startup fallible.

The resolver is built inside the spawned task (lines 127–133), so new() returns Self even if resolver construction fails. Once the task exits, process() drops queries due to the closed channel but logs the error as "DNS query channel full" (line 104), masking the actual startup failure. Additionally, resolve_query silently drops all resolution and parsing errors via .ok()? (lines 161, 164, 170, 185), causing guest-visible timeouts instead of negative DNS responses.

Build the resolver before spawning, return Result<Self>, and return negative responses for lookup failures so clients receive proper DNS errors instead of timeouts.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-network/src/dns.rs` around lines 62 - 79, Make DNS interceptor
startup fallible by constructing the DNS resolver synchronously before spawning
dns_resolver_task and change new() to return Result<Self> so callers can fail
fast; only spawn dns_resolver_task if resolver construction succeeds (remove
building resolver inside the task). In resolve_query replace the silent `.ok()?`
chains (lines around resolve_query) with explicit error handling that generates
proper negative DNS responses (SERVFAIL/NXDOMAIN as appropriate) and forwards
parse/lookup errors to response_tx instead of dropping them; also ensure
process() detects a closed query channel and surfaces the resolver startup error
(don't mask it as "DNS query channel full"). Use identifiers: new(),
dns_resolver_task, resolve_query, process(), CHANNEL_CAPACITY, DNS_PORT to find
and update the code.
crates/iii-network/src/dns.rs-167-187 (1)

167-187: ⚠️ Potential issue | 🟠 Major

Map DNS resolution errors to appropriate response codes instead of discarding them.

lookup(...).await.ok()? collapses resolver failures (NXDOMAIN, SERVFAIL, timeout, etc.) into None, causing the guest to receive a socket timeout instead of a proper DNS response. The lookup() method returns Result<Lookup, ResolveError>, where the error contains the ResponseCode that should be sent back. Once the query parses successfully, extract the error's ResponseCode and construct a response message with the appropriate code rather than abandoning the query.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-network/src/dns.rs` around lines 167 - 187, The current resolver
call uses resolver.lookup(...).await.ok()? which drops all ResolveError cases
and returns None; instead handle the Result from resolver.lookup(...) and on
Err(e) extract or map e to the appropriate hickory_proto::op::ResponseCode,
build a response Message (reuse response_msg, set_id/query/op_code/recursion
flags/add_query as you already do) but set_response_code to the mapped code
(e.g., NXDomain, ServFail, Refused, etc.) and serialize/return the bytes; only
when Ok(lookup) proceed to insert_answers from lookup.records() and set
ResponseCode::NoError. Ensure you reference resolver.lookup, ResolveError ->
ResponseCode mapping, response_msg, query_msg, question and return
Some(Bytes::from(response_bytes)) for both success and mapped-error cases.
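The error-to-response-code mapping described above can be sketched in isolation. This is an illustrative shape only: `ResponseCode` and `ResolveErrorKind` here are simplified local stand-ins, not hickory-proto's actual types. The point is that a resolver error carries a response code the guest should receive, rather than collapsing every failure into `None`.

```rust
// Simplified stand-ins for hickory-proto's ResponseCode and the resolver's
// error kinds; real code would match on hickory's own enums.
#[derive(Debug, PartialEq)]
enum ResponseCode { NoError, NxDomain, ServFail, Refused }

#[derive(Debug)]
enum ResolveErrorKind { NoRecordsFound, Refused, Timeout, Io }

fn map_resolve_error(kind: &ResolveErrorKind) -> ResponseCode {
    match kind {
        // Name exists but no records, or name does not exist -> NXDOMAIN.
        ResolveErrorKind::NoRecordsFound => ResponseCode::NxDomain,
        // Upstream refused the query -> REFUSED.
        ResolveErrorKind::Refused => ResponseCode::Refused,
        // Timeouts and transport errors become SERVFAIL rather than silence.
        ResolveErrorKind::Timeout | ResolveErrorKind::Io => ResponseCode::ServFail,
    }
}

fn main() {
    assert_eq!(map_resolve_error(&ResolveErrorKind::NoRecordsFound), ResponseCode::NxDomain);
    assert_eq!(map_resolve_error(&ResolveErrorKind::Timeout), ResponseCode::ServFail);
    assert_eq!(map_resolve_error(&ResolveErrorKind::Refused), ResponseCode::Refused);
    println!("mapped");
}
```

On the `Err` path, the proxy would then build a response message with the mapped code (reusing the parsed query's id and question) instead of dropping the query and letting the guest's socket time out.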
crates/iii-init/src/supervisor.rs-68-69 (1)

68-69: ⚠️ Potential issue | 🟠 Major

Don’t ignore a failed cgroup move.

If this write fails, the worker starts outside the intended memory-limited cgroup and the sandbox silently loses one of its resource bounds. Bubble the error up instead of treating it as best-effort.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-init/src/supervisor.rs` around lines 68 - 69, The write to move
the worker into the cgroup currently swallows errors (let _ =
std::fs::write(...)), so a failure leaves the worker outside the memory-limited
cgroup; change this to propagate the error instead of ignoring it by replacing
the ignored write with a proper Result check (use the ? operator or map_err and
return Err) on the std::fs::write call so that failures when writing
"/sys/fs/cgroup/worker/cgroup.procs" (where child_pid is used) bubble up from
the function (or are converted to the function's error type) rather than being
treated as best-effort.
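The fix is a one-line change in spirit: replace the ignored write with `?`. A minimal sketch, with a hypothetical `move_to_cgroup` helper standing in for the supervisor's code (the real path is `/sys/fs/cgroup/worker/cgroup.procs`):

```rust
use std::io;

// Hypothetical helper: move `pid` into the cgroup at `procs_path` and
// propagate any failure instead of discarding it with `let _ = ...`.
fn move_to_cgroup(procs_path: &str, pid: u32) -> io::Result<()> {
    // `?` bubbles the error up to the caller, so a failed move is no
    // longer silently treated as best-effort.
    std::fs::write(procs_path, pid.to_string())?;
    Ok(())
}

fn main() {
    // Writing under a directory that does not exist must fail, showing the
    // error now surfaces to the caller instead of being swallowed.
    assert!(move_to_cgroup("/nonexistent-cgroup-dir/cgroup.procs", 1234).is_err());

    // An ordinary writable file stands in for the cgroup.procs interface.
    let tmp = std::env::temp_dir().join("cgroup.procs.demo");
    move_to_cgroup(tmp.to_str().unwrap(), 1234).unwrap();
    assert_eq!(std::fs::read_to_string(&tmp).unwrap(), "1234");
    let _ = std::fs::remove_file(&tmp);
    println!("ok");
}
```

If the supervisor's function does not return `io::Result`, `map_err` can convert the error into its error type at the same call site.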
crates/iii-network/src/proxy.rs-72-83 (1)

72-83: ⚠️ Potential issue | 🟠 Major

Preserve server→guest relay after the guest closes its write side.

When from_smoltcp.recv() returns None, the proxy breaks the whole loop and stops reading the server socket too. That truncates responses for normal TCP close/FIN flows. Half-close the outbound socket and keep draining server_rx until EOF.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-network/src/proxy.rs` around lines 72 - 83, When
from_smoltcp.recv() returns None you must not break the entire relay loop;
instead half-close the outbound socket and keep draining the server_rx path
until EOF. Replace the current None => break branch with code that calls an
async shutdown of the server write side (use AsyncWriteExt::shutdown or the
stream's shutdown method) to send FIN, stop sending further writes to server_tx,
but continue the loop so the server_rx branch can keep reading from the server
socket until it returns None, then break. Ensure you reference
from_smoltcp.recv(), server_tx.write_all(...), and server_rx so the changes are
applied in the correct relay logic.
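The half-close behavior being asked for can be demonstrated outside tokio. The sketch below is a blocking `std::net` analogue, not the proxy's actual code: after the client finishes writing it sends FIN via `shutdown(Shutdown::Write)` but keeps draining the read side until the server closes, so the tail of the response is never truncated.

```rust
use std::io::{Read, Write};
use std::net::{Shutdown, TcpListener, TcpStream};
use std::thread;

// Round-trip a payload through an echo server using a half-close:
// the client stops sending (FIN) but continues reading until EOF.
fn half_close_roundtrip(payload: &[u8]) -> Vec<u8> {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();

    // "Server": read the full request (until the client's FIN), then reply.
    let server = thread::spawn(move || {
        let (mut sock, _) = listener.accept().unwrap();
        let mut req = Vec::new();
        sock.read_to_end(&mut req).unwrap(); // sees EOF thanks to half-close
        sock.write_all(&req).unwrap(); // response sent after client stopped sending
    });

    let mut client = TcpStream::connect(addr).unwrap();
    client.write_all(payload).unwrap();
    // Half-close: signal end-of-request without tearing down the read side.
    client.shutdown(Shutdown::Write).unwrap();

    // Keep draining the server->client direction until EOF.
    let mut resp = Vec::new();
    client.read_to_end(&mut resp).unwrap();
    server.join().unwrap();
    resp
}

fn main() {
    let resp = half_close_roundtrip(b"hello");
    assert_eq!(resp, b"hello");
    println!("response intact after half-close");
}
```

In the async proxy the equivalent is `AsyncWriteExt::shutdown` on the server write half when `from_smoltcp.recv()` returns `None`, while the `server_rx` branch keeps looping until it reads zero bytes.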
crates/iii-init/src/supervisor.rs-19-24 (1)

19-24: ⚠️ Potential issue | 🟠 Major

Forward termination to the workload group, not just the direct child.

Spawning via /bin/sh -c means the tracked PID can be a shell wrapper instead of the real workload. In wrapper-script or multi-command cases, SIGTERM/SIGINT stop at that shell and grandchildren can outlive shutdown. Prefer launching the workload directly, or put it in its own process group and signal the group.

Also applies to: 59-66

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-init/src/supervisor.rs` around lines 19 - 24, The signal handler
forwards signals to the child PID but that only hits a shell wrapper; update the
shutdown to target the child's process group and ensure the child is placed in
its own group when spawned: change signal_handler to call libc::kill with the
negative group id (e.g., kill(-pid, sig)) and, in the code that creates the
child process (the spawn/fork logic referenced around the other block at lines
59-66), set the child into a new process group (e.g., setpgid(child_pid,
child_pid) or the equivalent POSIX/posix_spawn attr) right after fork/exec so
signals sent to -CHILD_PID reach all descendants.
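The spawn-side half of this fix is available in std on Unix: `CommandExt::process_group(0)` makes the child the leader of a new process group (equivalent to `setpgid(0, 0)` before exec). A minimal sketch, assuming a Unix host and Rust 1.64+:

```rust
use std::os::unix::process::CommandExt;
use std::process::{Child, Command};

// Spawn a command as the leader of its own process group so that shutdown
// signals can later target the whole group, not just a shell wrapper.
fn spawn_in_own_group(program: &str, args: &[&str]) -> std::io::Result<Child> {
    Command::new(program)
        .args(args)
        .process_group(0) // child calls setpgid(0, 0) before exec
        .spawn()
}

fn main() {
    let mut child = spawn_in_own_group("sh", &["-c", "exit 0"]).unwrap();
    let status = child.wait().unwrap();
    assert!(status.success());
    // In the real supervisor, shutdown would then forward signals with
    // libc::kill(-(child.id() as i32), libc::SIGTERM) so grandchildren are
    // reached too (shown as a comment here since this sketch is std-only).
    println!("child ran in its own process group");
}
```

The signal-forwarding half still needs `libc` (or `nix`): killing the negative pgid delivers the signal to every member of the group, which is what closes the grandchild-outlives-shutdown gap.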
crates/iii-network/src/network.rs-77-95 (1)

77-95: ⚠️ Potential issue | 🟠 Major

Guard start() against double invocation.

A second call spawns another smoltcp-poll thread against the same shared queues and overwrites the first JoinHandle, so both loops can race the same backend. Make start() one-shot before spawning.

Suggested guard
     pub fn start(&mut self, tokio_handle: tokio::runtime::Handle) {
+        assert!(
+            self.poll_handle.is_none(),
+            "SmoltcpNetwork::start called more than once"
+        );
+
         let shared = self.shared.clone();
         let poll_config = PollLoopConfig {
             gateway_mac: self.gateway_mac,
             guest_mac: self.guest_mac,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-network/src/network.rs` around lines 77 - 95, The start() method
currently allows multiple invocations which spawns additional "smoltcp-poll"
threads and overwrites self.poll_handle; guard it so start() is one-shot by
checking self.poll_handle before spawning (e.g., return early or Err if
self.poll_handle.is_some()), and only set self.poll_handle to Some(...) when
creating the new thread that calls stack::smoltcp_poll_loop(shared, poll_config,
tokio_handle); ensure you reference and protect the shared queues and the
existing JoinHandle stored in self.poll_handle to avoid racing the same backend.
crates/iii-worker/src/cli/firmware/constants.rs-63-75 (1)

63-75: ⚠️ Potential issue | 🟠 Major

Fail unsupported host tuples before archive resolution.

check_libkrunfw_platform_support() only rejects darwin-x86_64. Every other unknown tuple currently looks “supported”, which means later code can search for non-existent archives or panic in iii_init_musl_target(). Make this an explicit allowlist of supported (OS, ARCH) pairs.

Suggested allowlist
 pub fn check_libkrunfw_platform_support() -> Result<(), String> {
-    let os = std::env::consts::OS;
-    let arch = std::env::consts::ARCH;
-    // Available firmware: darwin-aarch64, linux-x86_64, linux-aarch64
-    // Missing: darwin-x86_64 (Intel Mac)
-    if os == "macos" && arch == "x86_64" {
-        return Err(format!(
-            "libkrunfw firmware is not available for Intel Macs (darwin-x86_64).\n\
-             Managed workers require an Apple Silicon Mac (aarch64) or a Linux host.\n\
-             Set III_LIBKRUNFW_PATH to a manually-built firmware file to override."
-        ));
-    }
-    Ok(())
+    match (std::env::consts::OS, std::env::consts::ARCH) {
+        ("macos", "aarch64") | ("linux", "x86_64") | ("linux", "aarch64") => Ok(()),
+        ("macos", "x86_64") => Err(format!(
+            "libkrunfw firmware is not available for Intel Macs (darwin-x86_64).\n\
+             Managed workers require an Apple Silicon Mac (aarch64) or a Linux host.\n\
+             Set III_LIBKRUNFW_PATH to a manually-built firmware file to override."
+        )),
+        (os, arch) => Err(format!(
+            "libkrunfw firmware is not available for {os}-{arch}."
+        )),
+    }
 }

Also applies to: 89-99

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-worker/src/cli/firmware/constants.rs` around lines 63 - 75,
check_libkrunfw_platform_support currently only rejects darwin-x86_64 and leaves
all other tuples implicitly "supported", which can lead to searching
non-existent archives or panics; change it to an explicit allowlist: accept only
the supported (OS, ARCH) pairs ("macos","aarch64"), ("linux","x86_64"), and
("linux","aarch64") and return Err for any other tuple with a clear message
about unsupported host and how to override via III_LIBKRUNFW_PATH. Apply the
same allowlist logic to the other similar check in the file (the block around
the other function/logic referenced by lines 89-99 / any
iii_init_musl_target-related platform validation) so both functions validate
against the same enumerated supported tuples.
.github/workflows/_rust-binary.yml-98-107 (1)

98-107: ⚠️ Potential issue | 🟠 Major

Fail fast on unknown inputs.targets entries.

The filter silently drops typos or unsupported triples, which can produce a partial release without any explicit failure. Compare requested vs matched targets and abort when anything is unmatched.

Suggested validation
           else
             # Filter default matrix to only include targets in the provided JSON array
             FILTERED=$(echo "$DEFAULT" | jq -c --argjson targets "$TARGETS_INPUT" '
               .include |= [ .[] | select(.target as $t | $targets | index($t)) ]
             ')
+            MATCHED=$(echo "$FILTERED" | jq -cr '[.include[].target] | sort')
+            REQUESTED=$(echo "$TARGETS_INPUT" | jq -cr 'sort')
+            if [ "$MATCHED" != "$REQUESTED" ]; then
+              echo "::error::inputs.targets contains unsupported target triples"
+              exit 1
+            fi
             echo "matrix=$FILTERED" >> "$GITHUB_OUTPUT"
             echo "Filtered matrix to targets: $TARGETS_INPUT"
           fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/_rust-binary.yml around lines 98 - 107, The current
workflow silently drops unknown triples when building FILTERED from DEFAULT and
TARGETS_INPUT; update the shell logic after computing FILTERED to compare the
requested targets (TARGETS_INPUT) with the matched targets in FILTERED (extract
their .target fields) and if any requested entry is missing produce a clear
error message and exit non‑zero (e.g., echo "Unknown targets: <list>" >&2; exit
1). Use the same jq context to compute the matched-target list from FILTERED
(and the unmatched list by checking which entries in TARGETS_INPUT are not in
FILTERED) so you can log the unmatched items and abort before writing
GITHUB_OUTPUT.
crates/iii-filesystem/src/backends/passthroughfs/metadata.rs-62-81 (1)

62-81: ⚠️ Potential issue | 🟠 Major

Potential fd leak: owns_fd logic may incorrectly skip closing an opened fd.

When handle is Some(h) but h is not found in handles, the code opens a new fd via open_inode_fd. However, owns_fd is computed as handle.is_none() || ...get(&handle.unwrap())..., which will be false when the handle lookup fails (since handle.is_none() is false and the get returns None).

This means the newly opened fd won't be closed, causing a leak.

🐛 Proposed fix
     // Use handle fd if available, otherwise open from inode.
-    let fd = if let Some(h) = handle {
+    let (fd, owns_fd) = if let Some(h) = handle {
         let handles = fs.handles.read().unwrap();
         if let Some(hdata) = handles.get(&h) {
-            hdata.file.read().unwrap().as_raw_fd()
+            (hdata.file.read().unwrap().as_raw_fd(), false)
         } else {
-            inode::open_inode_fd(fs, ino, open_flags)?
+            (inode::open_inode_fd(fs, ino, open_flags)?, true)
         }
     } else {
-        inode::open_inode_fd(fs, ino, open_flags)?
+        (inode::open_inode_fd(fs, ino, open_flags)?, true)
     };
-
-    // Track whether we own the fd (and need to close it).
-    let owns_fd = handle.is_none()
-        || fs
-            .handles
-            .read()
-            .unwrap()
-            .get(&handle.unwrap_or(u64::MAX))
-            .is_none();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-filesystem/src/backends/passthroughfs/metadata.rs` around lines 62
- 81, The owns_fd calculation is wrong: when Some(h) is provided but not found,
you open a new fd via inode::open_inode_fd (assigned to fd) but owns_fd
evaluates to false, leaking the fd; fix by computing owns_fd from the same
lookup used to get fd—i.e., when handle is None OR when handle is Some(h) but
fs.handles.read().unwrap().get(&h) is None, owns_fd must be true; implement this
by reusing the handle lookup result used to select fd (store the lookup outcome
or the chosen branch) and set owns_fd accordingly so any fd created by
inode::open_inode_fd is closed.
crates/iii-worker/src/cli/managed.rs-625-651 (1)

625-651: ⚠️ Potential issue | 🟠 Major

detect_lan_ip uses macOS-specific commands without platform guards.

The function uses route -n get default and ifconfig, which are macOS-specific. On Linux, these commands either don't exist or produce different output, so the function will silently return None. Either guard this function with #[cfg(target_os = "macos")] or provide a cross-platform implementation (e.g., ip route and ip addr for Linux).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-worker/src/cli/managed.rs` around lines 625 - 651, The
detect_lan_ip function currently runs macOS-specific commands ("route -n get
default" and "ifconfig") and should be made platform-aware: either add
#[cfg(target_os = "macos")] to the detect_lan_ip function (and an alternative
stub that returns None or calls a cross-platform variant for other OSes), or
implement a cross-platform version that uses Linux commands (e.g., "ip route
show default" to get the interface and "ip -4 addr show dev <iface>" to parse
the IPv4 address) and annotate platform-specific implementations with
#[cfg(target_os = "...")] so detect_lan_ip behaves correctly on Linux and macOS;
update the function signatures and parsing logic in detect_lan_ip accordingly.
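The per-platform parsing a cross-platform `detect_lan_ip` needs can be sketched separately from command invocation. These helpers are hypothetical (not the crate's current code) and only show extracting the default-route interface from `ip route show default` output on Linux and `route -n get default` output on macOS:

```rust
// Linux: example line "default via 192.168.1.1 dev eth0 proto dhcp metric 100"
fn parse_linux_default_iface(route_output: &str) -> Option<String> {
    let mut words = route_output.split_whitespace();
    while let Some(w) = words.next() {
        if w == "dev" {
            // The interface name follows the "dev" keyword.
            return words.next().map(str::to_string);
        }
    }
    None
}

// macOS: `route -n get default` prints an indented "interface: en0" line.
fn parse_macos_default_iface(route_output: &str) -> Option<String> {
    route_output
        .lines()
        .find_map(|l| l.trim().strip_prefix("interface:"))
        .map(|s| s.trim().to_string())
}

fn main() {
    assert_eq!(
        parse_linux_default_iface("default via 192.168.1.1 dev eth0 proto dhcp"),
        Some("eth0".into())
    );
    assert_eq!(
        parse_macos_default_iface("   route to: default\n  interface: en0\n"),
        Some("en0".into())
    );
    println!("parsed");
}
```

The actual command execution would then be selected with `#[cfg(target_os = "linux")]` / `#[cfg(target_os = "macos")]`, with other targets returning `None` explicitly rather than failing silently.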
🟡 Minor comments (4)
crates/iii-worker/images/node/Dockerfile-1-1 (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Document or pin versions explicitly to ensure build reproducibility.

Using floating tags (node:24-slim and npm@latest) can cause builds to vary over time, complicating debugging and deployments. If reproducibility is critical, pin to explicit versions (e.g., node:24.14.0-slim and a specific npm release).

Note: Node.js 24.x (Krypton) is the current Active LTS line. npm@latest (11.x) is explicitly compatible with node:24-slim and maintains support for all active LTS versions, so floating to npm@latest is safe from a compatibility standpoint—the trade-off is reproducibility, not breakage risk.

Example: Explicit pinning approach
-FROM node:24-slim
+FROM node:24.14.0-slim

@@
-    && npm install -g npm@latest \
+    && npm install -g npm@11.9.0 \

Also applies to: 15-20

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-worker/images/node/Dockerfile` at line 1, The Dockerfile uses
floating image and package tags which harms reproducibility; update the FROM
line and any npm install invocations to pin explicit versions (e.g., change FROM
node:24-slim to a specific patch image like node:24.14.0-slim and replace
npm@latest with a fixed npm version such as npm@11.x.y) so builds are
deterministic, and ensure any other references around the Dockerfile lines 15–20
that mention npm or node are likewise pinned; locate the FROM line in the
Dockerfile and any npm install commands (npm@latest) and replace with the chosen
fixed version strings.
engine/local-worker-dev.md-125-126 (1)

125-126: ⚠️ Potential issue | 🟡 Minor

Fix libkrunfw resolution order in the table.

Line 125 currently puts env/cache before embedded assets, but this doc should match the documented primitive order used elsewhere.

📝 Proposed doc fix
-| **libkrunfw** | 1. `III_LIBKRUNFW_PATH` env var  2. `~/.iii/lib/libkrunfw.{5}.dylib`  3. Embedded bytes (if `--features embed-libkrunfw`)  4. Auto-download from GitHub release |
+| **libkrunfw** | 1. Embedded bytes (if `--features embed-libkrunfw`)  2. `III_LIBKRUNFW_PATH` env var  3. `~/.iii/lib/libkrunfw.{5}.dylib`  4. Auto-download from GitHub release |

As per coding guidelines: **/*.md*: - Ensure all references to primitives reflect those defined in docs/.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@engine/local-worker-dev.md` around lines 125 - 126, The libkrunfw resolution
order is wrong in the markdown table: update the row for **libkrunfw** so
embedded assets come first (matching other docs). Replace the current ordering
so it reads: 1. Embedded (if `--features embed-libkrunfw`) 2.
`III_LIBKRUNFW_PATH` env var 3. `~/.iii/lib/libkrunfw.{5}.dylib` 4.
Auto-download from GitHub release; ensure the symbol `III_LIBKRUNFW_PATH` and
the embedded feature name `embed-libkrunfw` are used exactly as shown.
crates/iii-worker/Cargo.toml-4-4 (1)

4-4: ⚠️ Potential issue | 🟡 Minor

Rust 1.85+ requirement is already documented; verify local development setup.

The project documentation (sdk/README.md) already specifies "Rust 1.85+ and Cargo (for Rust SDK)," and CI uses dtolnay/rust-toolchain@stable, which complies with this requirement. However, there is no rust-toolchain.toml file to enforce the minimum version for developers working locally. Consider adding one at the repository root to prevent accidental use of older Rust versions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-worker/Cargo.toml` at line 4, Add a rust-toolchain.toml at the
repository root to lock the minimum Rust version for local dev so contributors
don't accidentally use older toolchains; create a file named rust-toolchain.toml
and set the [toolchain] channel to 1.85.0 (and optionally list components like
rustfmt and clippy) so the workspace (including the crate that sets edition =
"2024" in Cargo.toml) is built with Rust 1.85+ by default.
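A minimal `rust-toolchain.toml` along the lines suggested (channel and components are a reasonable default, not prescribed by the repo):

```toml
# rust-toolchain.toml (repository root) — pins the toolchain for local builds
[toolchain]
channel = "1.85.0"
components = ["rustfmt", "clippy"]
```

With this file present, `rustup` automatically selects 1.85.0 for any `cargo` invocation inside the workspace.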
crates/iii-network/src/conn.rs-164-177 (1)

164-177: ⚠️ Potential issue | 🟡 Minor

Potential data loss when proxy channel is full.

At line 168, if try_send fails (channel full), the loop breaks and data from subsequent recv_slice calls in that iteration is lost. The connection continues, but some data may never reach the proxy.

Consider buffering the data or using a different strategy when the channel is temporarily full.

💡 Suggested improvement
             while socket.can_recv() {
                 match socket.recv_slice(&mut relay_buf) {
                     Ok(n) if n > 0 => {
                         let data = Bytes::copy_from_slice(&relay_buf[..n]);
                         if conn.to_proxy.try_send(data).is_err() {
+                            // Channel full - data will be re-read on next poll
+                            // since we didn't advance smoltcp's read pointer
                             break;
                         }
                     }
                     _ => break,
                 }
             }

Actually, recv_slice does advance the read pointer, so this data is lost. Consider:

  1. Buffering unsent data similar to write_buf
  2. Using blocking send (requires async context)
  3. Accepting the data loss as backpressure mechanism
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-network/src/conn.rs` around lines 164 - 177, The loop reading from
socket uses socket.recv_slice and discards data when conn.to_proxy.try_send
fails (channel full); fix by buffering unsent data instead of dropping it: add a
per-connection outbound buffer (e.g., reuse or add a field like
conn.proxy_write_buf similar to write_buf), on Ok(n) copy the slice into Bytes,
attempt conn.to_proxy.try_send(data) and if it Errs push that Bytes into
conn.proxy_write_buf and break out of the recv loop (stop reading more from
socket), then ensure write_proxy_data (or the task that drains conn.to_proxy) is
updated to first drain conn.proxy_write_buf into the channel when it becomes
writable. This preserves data when try_send returns full and avoids losing bytes
read by recv_slice.
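The buffering strategy from option 1 can be sketched with std primitives. `Vec<u8>` stands in for `Bytes` and a bounded `sync_channel` for the proxy channel; `PendingQueue` is a hypothetical per-connection buffer, not the crate's existing `write_buf`:

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

// Per-connection queue that parks chunks the channel could not accept,
// draining them in order before any newer data is sent.
struct PendingQueue {
    pending: VecDeque<Vec<u8>>,
}

impl PendingQueue {
    fn new() -> Self { Self { pending: VecDeque::new() } }

    fn buffered(&self) -> usize { self.pending.len() }

    /// Enqueue `data`, then flush as much of the queue as the channel
    /// accepts. On a full channel the chunk stays buffered instead of
    /// being dropped, preserving byte order.
    fn send_or_buffer(&mut self, tx: &SyncSender<Vec<u8>>, data: Vec<u8>) {
        self.pending.push_back(data);
        while let Some(front) = self.pending.pop_front() {
            match tx.try_send(front) {
                Ok(()) => {}
                Err(TrySendError::Full(front)) => {
                    // Put it back and stop; retried on the next call.
                    self.pending.push_front(front);
                    break;
                }
                Err(TrySendError::Disconnected(_)) => break,
            }
        }
    }
}

fn main() {
    let (tx, rx) = sync_channel::<Vec<u8>>(1); // capacity 1: easy to fill
    let mut q = PendingQueue::new();

    q.send_or_buffer(&tx, b"a".to_vec()); // fills the channel
    q.send_or_buffer(&tx, b"b".to_vec()); // channel full -> buffered, not lost
    assert_eq!(q.buffered(), 1);

    assert_eq!(rx.recv().unwrap(), b"a".to_vec()); // consumer frees a slot
    q.send_or_buffer(&tx, b"c".to_vec()); // flushes buffered "b" before "c"
    assert_eq!(rx.recv().unwrap(), b"b".to_vec()); // order preserved
    assert_eq!(q.buffered(), 1); // "c" waits its turn
    println!("no bytes lost");
}
```

In the real connection this queue would be drained whenever the poll loop observes channel capacity, and the `recv_slice` loop would break after buffering rather than keep reading from smoltcp while the channel is saturated.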

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 3d0e41a4-6fe9-475a-a79a-f182cc0b8ba3

📥 Commits

Reviewing files that changed from the base of the PR and between 7677350 and 2c6308c.

⛔ Files ignored due to path filters (4)
  • Cargo.lock is excluded by !**/*.lock
  • engine/firmware/libkrunfw-darwin-aarch64.dylib is excluded by !**/*.dylib
  • engine/firmware/libkrunfw-linux-aarch64.so is excluded by !**/*.so
  • engine/firmware/libkrunfw-linux-x86_64.so is excluded by !**/*.so
📒 Files selected for processing (90)
  • .cargo/config.toml
  • .github/workflows/_rust-binary.yml
  • .github/workflows/ci.yml
  • .github/workflows/release-iii.yml
  • Cargo.toml
  • Makefile
  • crates/iii-filesystem/Cargo.toml
  • crates/iii-filesystem/build.rs
  • crates/iii-filesystem/src/backends/mod.rs
  • crates/iii-filesystem/src/backends/passthroughfs/builder.rs
  • crates/iii-filesystem/src/backends/passthroughfs/create_ops.rs
  • crates/iii-filesystem/src/backends/passthroughfs/dir_ops.rs
  • crates/iii-filesystem/src/backends/passthroughfs/file_ops.rs
  • crates/iii-filesystem/src/backends/passthroughfs/inode.rs
  • crates/iii-filesystem/src/backends/passthroughfs/metadata.rs
  • crates/iii-filesystem/src/backends/passthroughfs/mod.rs
  • crates/iii-filesystem/src/backends/passthroughfs/remove_ops.rs
  • crates/iii-filesystem/src/backends/passthroughfs/special.rs
  • crates/iii-filesystem/src/backends/shared/handle_table.rs
  • crates/iii-filesystem/src/backends/shared/init_binary.rs
  • crates/iii-filesystem/src/backends/shared/inode_table.rs
  • crates/iii-filesystem/src/backends/shared/mod.rs
  • crates/iii-filesystem/src/backends/shared/name_validation.rs
  • crates/iii-filesystem/src/backends/shared/platform.rs
  • crates/iii-filesystem/src/init.rs
  • crates/iii-filesystem/src/lib.rs
  • crates/iii-filesystem/tests/filesystem_integration.rs
  • crates/iii-init/Cargo.toml
  • crates/iii-init/src/error.rs
  • crates/iii-init/src/main.rs
  • crates/iii-init/src/mount.rs
  • crates/iii-init/src/network.rs
  • crates/iii-init/src/rlimit.rs
  • crates/iii-init/src/supervisor.rs
  • crates/iii-init/tests/init_integration.rs
  • crates/iii-network/Cargo.toml
  • crates/iii-network/src/backend.rs
  • crates/iii-network/src/config.rs
  • crates/iii-network/src/conn.rs
  • crates/iii-network/src/device.rs
  • crates/iii-network/src/dns.rs
  • crates/iii-network/src/lib.rs
  • crates/iii-network/src/network.rs
  • crates/iii-network/src/proxy.rs
  • crates/iii-network/src/shared.rs
  • crates/iii-network/src/stack.rs
  • crates/iii-network/src/udp_relay.rs
  • crates/iii-network/src/wake_pipe.rs
  • crates/iii-network/tests/network_integration.rs
  • crates/iii-worker/Cargo.toml
  • crates/iii-worker/build.rs
  • crates/iii-worker/images/node/Dockerfile
  • crates/iii-worker/images/node/README.md
  • crates/iii-worker/images/python/Dockerfile
  • crates/iii-worker/images/python/README.md
  • crates/iii-worker/src/cli/firmware/constants.rs
  • crates/iii-worker/src/cli/firmware/download.rs
  • crates/iii-worker/src/cli/firmware/libkrunfw_bytes.rs
  • crates/iii-worker/src/cli/firmware/mod.rs
  • crates/iii-worker/src/cli/firmware/resolve.rs
  • crates/iii-worker/src/cli/firmware/symlinks.rs
  • crates/iii-worker/src/cli/managed.rs
  • crates/iii-worker/src/cli/mod.rs
  • crates/iii-worker/src/cli/vm_boot.rs
  • crates/iii-worker/src/cli/worker_manager/adapter.rs
  • crates/iii-worker/src/cli/worker_manager/libkrun.rs
  • crates/iii-worker/src/cli/worker_manager/mod.rs
  • crates/iii-worker/src/cli/worker_manager/state.rs
  • crates/iii-worker/src/main.rs
  • crates/iii-worker/tests/worker_integration.rs
  • engine/Cargo.toml
  • engine/build.rs
  • engine/install.sh
  • engine/local-worker-dev.md
  • engine/src/cli/error.rs
  • engine/src/cli/github.rs
  • engine/src/cli/managed_shim.rs
  • engine/src/cli/mod.rs
  • engine/src/cli/registry.rs
  • engine/src/cli/update.rs
  • engine/src/cli/worker_manager/config.rs
  • engine/src/cli/worker_manager/install.rs
  • engine/src/cli/worker_manager/manifest.rs
  • engine/src/cli/worker_manager/registry.rs
  • engine/src/cli/worker_manager/spec.rs
  • engine/src/cli/worker_manager/storage.rs
  • engine/src/cli/worker_manager/uninstall.rs
  • engine/src/main.rs
  • engine/tests/vm_platform_e2e.rs
  • sdk/packages/rust/iii/Cargo.toml
💤 Files with no reviewable changes (9)
  • engine/src/cli/github.rs
  • engine/src/cli/error.rs
  • engine/src/cli/worker_manager/spec.rs
  • engine/src/cli/worker_manager/manifest.rs
  • engine/src/cli/worker_manager/storage.rs
  • engine/src/cli/worker_manager/uninstall.rs
  • engine/src/cli/worker_manager/registry.rs
  • engine/src/cli/worker_manager/config.rs
  • engine/src/cli/worker_manager/install.rs

Comment on lines +211 to +221
matrix:
include:
- firmware_file: libkrunfw-darwin-aarch64.dylib
archive_name: libkrunfw-darwin-aarch64.tar.gz
publish_aliases: 'aarch64-apple-darwin'
- firmware_file: libkrunfw-linux-x86_64.so
archive_name: libkrunfw-linux-x86_64.tar.gz
publish_aliases: 'x86_64-unknown-linux-gnu x86_64-unknown-linux-musl'
- firmware_file: libkrunfw-linux-aarch64.so
archive_name: libkrunfw-linux-aarch64.tar.gz
publish_aliases: 'aarch64-unknown-linux-gnu'
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if x86_64 darwin firmware exists in the repository
fd -t f 'libkrunfw.*darwin.*x86' engine/firmware/ || echo "No x86_64 darwin firmware found"
ls -la engine/firmware/ 2>/dev/null || echo "engine/firmware/ directory not found"

Repository: iii-hq/iii

Length of output: 419


🏁 Script executed:

#!/bin/bash
# Check the worker-release job and its targets around line 309
sed -n '295,325p' .github/workflows/release-iii.yml

Repository: iii-hq/iii

Length of output: 1380


🏁 Script executed:

#!/bin/bash
# Search for any references to x86_64-apple-darwin in the workflow
rg 'x86_64-apple-darwin' .github/workflows/release-iii.yml

Repository: iii-hq/iii

Length of output: 276


🏁 Script executed:

#!/bin/bash
# Check if there's any documentation about macOS firmware support
fd -t f -e md -e txt | xargs rg -l 'darwin.*firmware\|x86_64.*darwin' 2>/dev/null | head -5

Repository: iii-hq/iii

Length of output: 36


🏁 Script executed:

#!/bin/bash
# Get the complete firmware-upload job and its matrix
sed -n '180,280p' .github/workflows/release-iii.yml

Repository: iii-hq/iii

Length of output: 4239


Mismatch between worker-release targets and firmware-upload matrix for x86_64-apple-darwin.

The worker-release job targets include x86_64-apple-darwin and use the embed-libkrunfw feature, but the firmware-upload matrix only provides aarch64-apple-darwin firmware. No x86_64-apple-darwin firmware file exists in engine/firmware/. Either add the missing x86_64-apple-darwin firmware or remove it from the worker-release targets if Intel Mac support is not intended.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/release-iii.yml around lines 211 - 221, The workflow's
worker-release targets include x86_64-apple-darwin but the firmware-upload
matrix only provides libkrunfw-darwin-aarch64.dylib (aarch64) so the x86_64
firmware is missing; either add a libkrunfw-darwin-x86_64.dylib/archive entry to
the firmware-upload matrix (matching the publish_aliases and archive_name
patterns) or remove x86_64-apple-darwin from the worker-release targets (and any
reference to the embed-libkrunfw feature) to keep targets and the
firmware_upload matrix (matrix.include entries like
firmware_file/archive_name/publish_aliases) consistent.

…ion tests

- Added steps to build and test the `iii-worker` crate within the CI workflow.
- Included installation of the `iii-worker` binary for integration tests.
- Removed redundant build and test steps for `iii-worker` to streamline the process.
- Updated test assertions in `cli_integration.rs` to reflect changes in command subcommands.
- Added a step to install `libcap-ng-dev` as a system dependency for the `iii-worker` during the CI process.
- Updated the CI workflow to ensure all necessary dependencies are available for building and testing the `iii-worker` crate.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
.github/workflows/ci.yml (1)

42-46: Remove the redundant cargo build step.

cargo test -p iii-worker already compiles the package and its tests. The preceding cargo build -p iii-worker duplicates compilation effort and can be safely removed to reduce CI duration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml around lines 42 - 46, Remove the redundant
compilation step by deleting the "cargo build -p iii-worker" invocation and keep
only "cargo test -p iii-worker" so the CI runs the tests (which already compile
the package) without duplicating the build for the iii-worker job.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/ci.yml:
- Around line 37-41: Remove the unused system dependency by deleting the "sudo
apt-get install -y libcap-ng-dev" entry in the "Install system dependencies for
iii-worker" step; if you intend to keep it for a future requirement, instead add
an inline comment in that step explaining the reason and link to the issue/docs,
otherwise remove the whole libcap-ng-dev install line to avoid unnecessary CI
overhead.

---

Nitpick comments:
In @.github/workflows/ci.yml:
- Around line 42-46: Remove the redundant compilation step by deleting the
"cargo build -p iii-worker" invocation and keep only "cargo test -p iii-worker"
so the CI runs the tests (which already compile the package) without duplicating
the build for the iii-worker job.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 0765be6e-5bab-4c6c-863d-24bb9a4a58b4

📥 Commits

Reviewing files that changed from the base of the PR and between c1df62b and 21a4b72.

📒 Files selected for processing (1)
  • .github/workflows/ci.yml

- Simplified test assertions in multiple test files by consolidating lines for better readability.
- Enhanced formatting of output messages in tests to maintain consistency and improve clarity.
- Updated the `PassthroughFsBuilder` test to reduce line breaks for a more concise presentation.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (4)
crates/iii-worker/src/cli/managed.rs (4)

625-651: detect_lan_ip uses macOS-specific commands without platform guard.

The function uses route -n get default and ifconfig, which are macOS-specific. On Linux, this will silently fail and return None. While engine_url_for_runtime currently ignores the LAN IP, this creates technical debt if the feature is later enabled.

Consider adding a #[cfg(target_os = "macos")] guard or implementing cross-platform support using ip route / ip addr on Linux.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-worker/src/cli/managed.rs` around lines 625 - 651, The
detect_lan_ip function is macOS-specific (uses `route -n get default` and
`ifconfig`) and should be guarded or made cross-platform: either add a
#[cfg(target_os = "macos")] on detect_lan_ip (and provide a stub returning None
for other targets) or implement a Linux branch that runs `ip route` / `ip addr`
to discover the default interface and its inet address; update the function (and
any callers) so the macOS implementation remains in `detect_lan_ip` and Linux
logic is used when cfg(target_os = "linux") to avoid silent failures on
non-macOS systems.

271-282: Inconsistent error handling for firmware provisioning.

On lines 272-273, if ensure_libkrunfw() fails, only a warning is logged and execution continues. However, lines 275-282 then check libkrun_available() and return an error if it is false. This creates potentially confusing behavior — the warning may mislead users into thinking the error on lines 276-281 is unrelated.

Consider making the firmware provisioning error more prominent, or consolidating the check:

♻️ Suggested refinement
     if let Err(e) = super::firmware::download::ensure_libkrunfw().await {
-        tracing::warn!(error = %e, "failed to ensure libkrunfw availability");
+        tracing::warn!(error = %e, "failed to ensure libkrunfw availability; checking if already present...");
     }
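For the consolidation option, a runnable sketch of the intended control flow — the stub functions stand in for ensure_libkrunfw() / libkrun_available(), and the messages are illustrative, not the PR's code:

```rust
// Stub standing in for firmware provisioning; here it always fails,
// as if the download could not complete.
fn ensure_fw() -> Result<(), String> {
    Err("download failed: network unreachable".into())
}

// Stub standing in for libkrun_available(); here the firmware is
// already present from a previous run.
fn fw_available() -> bool {
    true
}

// Consolidated check: a provisioning failure is fatal only when no
// firmware is present; otherwise a single, clearly-scoped warning
// explains why execution continues.
fn check_firmware() -> i32 {
    if let Err(e) = ensure_fw() {
        if !fw_available() {
            eprintln!("error: libkrunfw unavailable and provisioning failed: {e}");
            return 1;
        }
        eprintln!("warn: provisioning failed ({e}); using existing libkrunfw");
    }
    0
}
```

This keeps one decision point, so the warning can no longer appear next to an unrelated-looking availability error.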
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-worker/src/cli/managed.rs` around lines 271 - 282, The code logs a
warning when super::firmware::download::ensure_libkrunfw().await fails but then
separately checks super::worker_manager::libkrun::libkrun_available(), causing
confusing flow; change the handling so that any Err from ensure_libkrunfw() is
treated as a fatal/returnable error (e.g., log an error with the captured error
and immediately return 1) or else have ensure_libkrunfw() return a definitive
availability result that you use instead of calling libkrun_available()
separately; update the branch around ensure_libkrunfw() and libkrun_available()
(references: ensure_libkrunfw, libkrun_available) to either early-return on
failure or consolidate into a single availability check with a clear error log.

372-377: _follow parameter is declared but not implemented.

The follow parameter (prefixed with _ to silence warnings) appears intended for tail-follow functionality but is never used. This could confuse users who pass --follow expecting real-time log streaming.

Consider either implementing follow mode or documenting it as not yet supported.

Would you like me to open an issue to track implementing the --follow functionality?
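If follow mode is implemented, one approach is polling the log file for appended bytes. A sketch of the core building block — the helper name and poll-based strategy are assumptions, not the PR's design:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::path::Path;

// Hypothetical building block for --follow: return any text appended
// to `path` since `offset`, plus the updated offset. A follow loop
// would call this in a sleep/poll cycle until interrupted.
fn read_new(path: &Path, offset: u64) -> std::io::Result<(String, u64)> {
    let mut f = File::open(path)?;
    let len = f.metadata()?.len();
    if len <= offset {
        // Nothing new; if the file shrank (truncation/rotation),
        // reset the offset to the new length.
        return Ok((String::new(), len.min(offset)));
    }
    f.seek(SeekFrom::Start(offset))?;
    let mut buf = String::new();
    f.read_to_string(&mut buf)?;
    Ok((buf, len))
}
```

A streaming client attached to the worker would avoid polling entirely, at the cost of more wiring.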

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-worker/src/cli/managed.rs` around lines 372 - 377, The _follow
parameter in handle_managed_logs (signature: pub async fn
handle_managed_logs(worker_name: &str, _follow: bool, _address: &str, _port:
u16) -> i32) is unused; either implement follow-mode log streaming or remove the
unused flag. To fix, decide whether to support --follow: if yes, un-prefix
follow, wire the boolean into the log-fetching logic (e.g., poll or attach a
streaming client and tail new log lines until interrupted) in the body of
handle_managed_logs; if no, remove the parameter (and any CLI flag wiring) or
update the help/documentation to state follow is not supported. Ensure
references to worker_name, _address, and _port are preserved when adding
streaming behavior and return appropriate exit codes from handle_managed_logs.

395-419: Duplicated "last N lines" logic could be extracted.

The pattern of reading a file, splitting into lines, and taking the last 100 is repeated three times (lines 395-403, 410-418, and 435-443). Consider extracting to a helper function.

♻️ Suggested helper
fn tail_lines(path: &std::path::Path, n: usize) -> Option<String> {
    let contents = std::fs::read_to_string(path).ok()?;
    if contents.is_empty() {
        return None;
    }
    let lines: Vec<&str> = contents.lines().collect();
    let start = lines.len().saturating_sub(n);
    Some(lines[start..].join("\n"))
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/iii-worker/src/cli/managed.rs` around lines 395 - 419, Extract the
duplicated "read file, take last N lines" logic into a helper function (e.g., fn
tail_lines(path: &std::path::Path, n: usize) -> Option<String>) that reads the
file with std::fs::read_to_string, returns None for empty or missing files,
computes start with lines.len().saturating_sub(n), and returns the joined last-n
lines as a String; then replace the three repeated blocks that read stdout_path
and stderr_path and print to println!/eprintln! by calling
tail_lines(stdout_path, 100) / tail_lines(stderr_path, 100), checking for
Some(text) to set found_content and printing the returned text (use eprintln!
for stderr) so the shared logic is centralized.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/iii-worker/src/cli/managed.rs`:
- Around line 1136-1139: The debug message after calling
adapter.start(&spec).await is misleading because it says "managed worker process
exited successfully" even though start() spawns/starts the VM rather than
waiting for it to exit; update the log in the Ok(_) arm (the tracing::debug!
call near adapter.start(&spec).await) to say something like "managed worker
process started" or "spawned" and keep the same structured field (worker =
%name) so operators see an accurate lifecycle event.
- Around line 232-236: The code currently uses
dirs::home_dir().unwrap_or_default() to build pid_file which yields an empty
PathBuf on failure and can produce invalid root paths; change this to explicitly
handle None by returning an error (or bubbling a Result) when home_dir() is None
instead of defaulting to an empty path. Update the logic that constructs
pid_file (and other similar spots in this file) to first match or if-let the
dirs::home_dir() result, produce a descriptive error mentioning the missing home
directory and worker_name, and only then join the remaining components to build
vm.pid; ensure the surrounding function signature returns Result/any error type
used in this module so callers can handle the failure.
- Around line 827-830: The delete of the sandbox directory is currently ignoring
errors (std::fs::remove_dir_all(&dev_dir) is assigned to _) which can leave a
partial/unchanged dev_dir; change this to handle the Result: call
std::fs::remove_dir_all(&dev_dir) and on Err(e) either log the error with
eprintln! (including dev_dir and e) and return early (propagate the error from
the current function, e.g., return Err(e.into())) or otherwise bail out, so the
code does not continue using a partially-deleted directory; update the block
around rebuild, dev_dir, and remove_dir_all to match this error-handling
approach.
- Around line 178-190: The current flow in managed.rs calls WorkersFile::load()
then modifies in-memory and calls WorkersFile::save(), which causes a TOCTOU
race when concurrent `iii-worker managed add` commands run; modify the
implementation so WorkersFile operations acquire an exclusive file lock around
read-modify-write (e.g., use fs2::FileExt::lock_exclusive on the underlying
file) or implement save as an atomic write-then-rename while holding an
exclusive lock to prevent concurrent loads from racing. Specifically, update
WorkersFile::load and/or WorkersFile::save (and the usage in the add flow in
managed.rs where workers_file.add_worker(...) is invoked) to open the backing
file with std::fs::OpenOptions, acquire an exclusive lock, read current
contents, apply add_worker, write to a temporary file and rename (or write
directly while locked), then release the lock before returning; ensure errors
from locking are propagated back to managed.rs so the existing error handling
still reports failures.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 0b531255-6f6f-4fff-9508-db1caee7f964

📥 Commits

Reviewing files that changed from the base of the PR and between 21a4b72 and 11a1728.

📒 Files selected for processing (7)
  • crates/iii-filesystem/src/backends/passthroughfs/builder.rs
  • crates/iii-filesystem/tests/filesystem_integration.rs
  • crates/iii-init/tests/init_integration.rs
  • crates/iii-network/tests/network_integration.rs
  • crates/iii-worker/src/cli/managed.rs
  • crates/iii-worker/src/cli/worker_manager/libkrun.rs
  • crates/iii-worker/tests/worker_integration.rs
✅ Files skipped from review due to trivial changes (3)
  • crates/iii-filesystem/tests/filesystem_integration.rs
  • crates/iii-init/tests/init_integration.rs
  • crates/iii-worker/tests/worker_integration.rs
🚧 Files skipped from review as they are similar to previous changes (3)
  • crates/iii-network/tests/network_integration.rs
  • crates/iii-filesystem/src/backends/passthroughfs/builder.rs
  • crates/iii-worker/src/cli/worker_manager/libkrun.rs

Comment on lines +178 to +190
let mut workers_file = WorkersFile::load().unwrap_or_default();
workers_file.add_worker(
    name.clone(),
    WorkerDef {
        image: image_ref,
        env: HashMap::new(),
        resources,
    },
);
if let Err(e) = workers_file.save() {
    eprintln!("{} Failed to save iii.workers.yaml: {}", "error:".red(), e);
    return 1;
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if there's any file locking mechanism in WorkersFile
rg -n 'lock|flock|atomic' crates/iii-worker/src/cli/worker_manager/state.rs

Repository: iii-hq/iii

Length of output: 36


🏁 Script executed:

# Find where WorkersFile is defined
rg -n "struct WorkersFile|impl WorkersFile" --type rust

Repository: iii-hq/iii

Length of output: 311


🏁 Script executed:

# Check the actual implementation of load and save methods
fd "worker" --type f --name "*.rs" | head -20

Repository: iii-hq/iii

Length of output: 304


🏁 Script executed:

# Understand the file context and structure
cat -n crates/iii-worker/src/cli/managed.rs | head -50

Repository: iii-hq/iii

Length of output: 2062


🏁 Script executed:

# Read the WorkersFile implementation in state.rs
cat -n crates/iii-worker/src/cli/worker_manager/state.rs | head -100

Repository: iii-hq/iii

Length of output: 3762


🏁 Script executed:

# Search for the load and save implementation
rg -A 20 "fn load\(\)" crates/iii-worker/src/cli/worker_manager/state.rs

Repository: iii-hq/iii

Length of output: 822


🏁 Script executed:

# Search for save implementation
rg -A 20 "fn save\(" crates/iii-worker/src/cli/worker_manager/state.rs

Repository: iii-hq/iii

Length of output: 675


🏁 Script executed:

# Check if there are any documented issues or workarounds in managed.rs
rg -B 5 -A 5 "race\|TOCTOU\|concurrent\|lock" crates/iii-worker/src/cli/managed.rs

Repository: iii-hq/iii

Length of output: 36


🏁 Script executed:

# Search for file locking patterns elsewhere in the codebase (fs2, fcntl, etc.)
rg "lock_exclusive|try_lock|flock" --type rust

Repository: iii-hq/iii

Length of output: 332


🏁 Script executed:

# Check if there's any TODO or FIXME comments around WorkersFile
rg -B 3 -A 3 "TODO|FIXME" crates/iii-worker/src/cli/worker_manager/state.rs

Repository: iii-hq/iii

Length of output: 36


TOCTOU race condition on workers file — concurrent add commands lose data.

WorkersFile::load() (using std::fs::read_to_string()) and save() (using std::fs::write()) have no synchronization. When two iii-worker managed add commands run concurrently, both load the file, each modifies the in-memory structure, and the second save() overwrites the first, losing one worker definition.

Implement file locking (e.g., fs2::FileExt::lock_exclusive) or atomic write-with-rename to prevent concurrent modification.
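A minimal sketch of the write-then-rename half of the fix — the helper name is illustrative, and an exclusive lock (e.g., fs2::FileExt::lock_exclusive) around the whole load-modify-save cycle would still be needed to stop two `add` commands from overwriting each other's changes:

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Illustrative atomic save: write to a temp file in the same directory,
// then rename over the target. Readers never observe a partially
// written workers file, because rename replaces it in one step on
// POSIX filesystems.
fn save_atomic(path: &Path, contents: &str) -> std::io::Result<()> {
    let tmp = path.with_extension("yaml.tmp");
    let mut f = fs::File::create(&tmp)?;
    f.write_all(contents.as_bytes())?;
    f.sync_all()?; // flush data to disk before the rename
    fs::rename(&tmp, path)?;
    Ok(())
}
```

On its own this prevents torn reads but not lost updates; only the lock serializes concurrent read-modify-write cycles.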

