feat: add managed worker sandbox with libkrun VM isolation #1402
andersonleal wants to merge 15 commits into main from
Conversation
Introduce microVM-based worker sandboxing via libkrun runtime adapter. New CLI commands: worker start/stop/dev/logs for managing isolated worker containers. Adds iii.workers.yaml state file, vm_boot subprocess, and managed worker lifecycle (add/remove/start/stop/dev/logs).
…visioning
- Extract the VM worker runtime into a separate `iii-worker` crate that resolves libkrunfw and iii-init via embed, local cache, or GitHub release download.
- Add a `has_init()` guard to iii-filesystem so PassthroughFs works without an embedded init.
- Add a filterable target matrix to the _rust-binary workflow, and add firmware-upload and worker-release jobs to the release pipeline.
… binary handling
- Added the `iii-worker` crate and its features to the Makefile for cross-compilation.
- Updated the sandbox build process to include the worker crate with embedded features.
- Modified `main.rs` in `iii-init` to handle non-Linux targets gracefully.
- Improved error handling in `libkrun.rs` for a missing root filesystem, with logging for automatic pulling.
- Adjusted stderr handling in `managed_shim.rs` for better visibility during worker startup.
…g in managed_shim
- Moved the `managed_engine_url` function from `managed.rs` to the test module for better encapsulation.
- Improved error handling and logging for worker processes in `managed_shim.rs`, capturing stderr output for better visibility during startup.
…e crates
- Updated the `nix` dependency to version 0.31.2 in `Cargo.lock`, `iii-init`, and `iii-network`.
- Changed the license to "Elastic-2.0" in `Cargo.toml` for `iii-filesystem`, `iii-init`, and `iii-network`.
- Modified `main.rs` in `iii-worker` to include version information in the command structure.
… duplicated engine firmware modules
…and documentation
- Introduced Dockerfiles for Node.js and Python sandbox images, including essential development tools and non-root user configuration for improved security.
- Added README files for both images detailing features, build instructions, and usage examples.
- Updated libkrun to use the new Docker images for the TypeScript and Python languages, replacing references to microsandbox images.
- Enhanced environment variable handling in the development worker setup to include OCI environment variables.
- Improved logging by replacing `eprintln!` with `tracing::debug!` and `tracing::info!` for better visibility and structured output.
- Updated output messages for VM booting and root filesystem preparation to provide clearer context.
- Enhanced the project and engine information display in the worker development command.
- Adjusted existing messages to improve clarity and consistency across the CLI interface.
…tem and network components
- Introduced unit tests for PassthroughFsBuilder, ConnectionTracker, and various network functions to ensure robust behavior.
- Added integration tests for iii-filesystem and iii-network to validate end-to-end functionality and interactions between components.
- Enhanced error handling and validation checks in tests to cover edge cases and improve reliability.
- Updated Cargo.toml to include necessary development dependencies for testing.
…rs dependency
- Updated all instances of the `syn` dependency to version 2.0.117 in `Cargo.lock`.
- Added `machineid-rs` version 1.2.4 to `Cargo.toml` for enhanced machine identification.
- Removed the `otel` feature from `iii-sdk` in `Cargo.toml` to streamline dependencies.
📝 Walkthrough
Adds four new Rust crates (iii-init, iii-filesystem, iii-network, iii-worker), extensive backend implementations (passthrough FUSE filesystem, smoltcp userspace network), CLI/runtime features for worker lifecycle and firmware/init provisioning, workspace/CI/release updates, and engine-side removal/replacement of prior worker-manager modules.
Sequence Diagram(s)
sequenceDiagram
participant Guest as Guest VM
participant Kernel as Kernel (FUSE)
participant FS as PassthroughFs
participant Init as Embedded Init
participant Host as Host FS
Guest->>Kernel: lookup/open/read "init.krun"
Kernel->>FS: FUSE lookup/open/read request
FS->>Init: check has_init / read_init
alt init embedded
Init-->>FS: return bytes / synthetic stat
else no init
FS->>Host: openat/stat operations for real inode
Host-->>FS: fd / stat
end
FS-->>Kernel: FUSE response (entry/bytes)
Kernel-->>Guest: VFS result
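The init-interception branch in the diagram above can be sketched in a few lines. This is a hypothetical, self-contained illustration: the `has_init`/`read_init` names come from the diagram, but everything else (the constant, the string-based `resolve`) is invented for the sketch and is not the PR's actual FUSE code.

```rust
// Hypothetical sketch of the lookup dispatch from the diagram: if an
// init binary is embedded, "init.krun" is served synthetically;
// otherwise the request falls through to the host filesystem.
const EMBEDDED_INIT: Option<&[u8]> = Some(b"\x7fELF...");

fn has_init() -> bool {
    EMBEDDED_INIT.is_some()
}

fn read_init() -> Option<&'static [u8]> {
    EMBEDDED_INIT
}

/// Resolve a guest path: embedded bytes for "init.krun",
/// host passthrough for everything else.
fn resolve(path: &str) -> String {
    if path == "init.krun" && has_init() {
        format!("synthetic ({} bytes)", read_init().unwrap().len())
    } else {
        format!("passthrough to host: {path}")
    }
}
```

The real filesystem answers FUSE lookup/open/read requests with synthetic stats and bytes rather than strings, but the branch structure is the same.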
sequenceDiagram
participant GuestNet as Guest (virtio)
participant Backend as SmoltcpBackend
participant Shared as SharedState (queues)
participant Poll as Smoltcp Poll Loop
participant Proxy as TCP Proxy Task
participant Server as Remote Server
GuestNet->>Backend: write_frame(hdr + payload)
Backend->>Shared: push tx_ring
Poll->>Shared: pop tx_ring
Poll->>Poll: classify_frame -> TCP/UDP/DNS
alt TCP SYN
Poll->>Proxy: spawn proxy for connection
Proxy->>Server: connect + relay
Server-->>Proxy: response
Proxy->>Shared: push rx_ring
else UDP/DNS or passthrough
Poll->>Shared: push rx_ring (responses or passthrough)
end
Backend->>Shared: pop rx_ring
Backend->>GuestNet: read_frame(prepend virtio header + payload)
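The tx/rx ring handoff in this diagram can be illustrated with a minimal, self-contained sketch. A `Mutex<VecDeque>` stands in for the real shared queues, and the poll step simply echoes frames instead of classifying TCP/UDP/DNS; the `SharedState`/`tx_ring`/`rx_ring` names follow the diagram, the rest is hypothetical.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Hypothetical shape of the shared queues: the virtio backend pushes
// guest frames to tx_ring; the poll loop pops them and pushes
// responses to rx_ring for the backend to read back.
struct SharedState {
    tx_ring: Mutex<VecDeque<Vec<u8>>>,
    rx_ring: Mutex<VecDeque<Vec<u8>>>,
}

impl SharedState {
    fn new() -> Self {
        Self {
            tx_ring: Mutex::new(VecDeque::new()),
            rx_ring: Mutex::new(VecDeque::new()),
        }
    }

    // Backend side: write_frame pushes a guest frame, read_frame pops a response.
    fn write_frame(&self, frame: Vec<u8>) {
        self.tx_ring.lock().unwrap().push_back(frame);
    }

    fn read_frame(&self) -> Option<Vec<u8>> {
        self.rx_ring.lock().unwrap().pop_front()
    }

    // Poll-loop side: consume guest frames, produce responses.
    // The real loop classifies frames and spawns proxies; this just echoes.
    fn poll_once(&self) {
        while let Some(frame) = self.tx_ring.lock().unwrap().pop_front() {
            self.rx_ring.lock().unwrap().push_back(frame);
        }
    }
}
```

Nothing reaches `rx_ring` until the poll loop runs, which is why the diagram shows the backend's `read_frame` only after the poll/proxy steps.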
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable comments posted: 1
Note: Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
🟠 Major comments (21)
crates/iii-worker/images/node/README.md-7-7 (1)
7-7: ⚠️ Potential issue | 🟠 Major
Documented Node version conflicts with the actual base image.
Line 7 says `20.x LTS`, but the Dockerfile currently uses `node:24-slim`. Please align the README and Dockerfile to one source of truth. As per coding guidelines, README content should remain consistent with related implementation/docs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-worker/images/node/README.md` at line 7, The README's documented Node version ("Node.js 20.x LTS") conflicts with the actual base image tag ("node:24-slim") used in the Dockerfile; update either the README entry in crates/iii-worker/images/node/README.md to match the Dockerfile's node:24-slim base or change the Dockerfile to use the documented 20.x LTS image so both sources are consistent — ensure the README string and the Dockerfile FROM line reference the same major Node version (e.g., "Node.js 24.x" if keeping node:24-slim or change FROM to node:20-slim to keep "Node.js 20.x LTS").
crates/iii-worker/images/node/README.md-15-19 (1)
15-19: ⚠️ Potential issue | 🟠 Major
Build instruction uses the wrong Dockerfile path for project-root execution.
Same issue as the Python README: from the project root, `-f Dockerfile` is ambiguous/wrong for this image.
Suggested doc fix
- docker build -t iiidev/node -f Dockerfile .
+ docker build -t iiidev/node -f crates/iii-worker/images/node/Dockerfile .
As per coding guidelines, this README should follow the authoritative image build/run flow.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-worker/images/node/README.md` around lines 15 - 19, The README build instruction uses an ambiguous Dockerfile path; update the docker build command in README.md so it explicitly references the Dockerfile inside the node image directory (e.g., use -f crates/iii-worker/images/node/Dockerfile) and the correct build context (project root or the image directory) to match the authoritative image build/run flow; ensure the example preserves the image tag (iiidev/node) and matches how other image READMEs specify build/run steps.
crates/iii-worker/images/python/README.md-13-17 (1)
13-17: ⚠️ Potential issue | 🟠 Major
Build command path is incorrect when run from the project root.
Line 13 says "from the project root," but line 16 uses `-f Dockerfile`, which won't target this file from the root. Use the full path to avoid broken setup instructions.
Suggested doc fix
- docker build -t iiidev/python -f Dockerfile .
+ docker build -t iiidev/python -f crates/iii-worker/images/python/Dockerfile .
As per coding guidelines, READMEs for these images must follow the authoritative build/run commands and stay consistent with documented usage.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-worker/images/python/README.md` around lines 13 - 17, The build command in README.md is wrong for running from the project root; update the docker build invocation in crates/iii-worker/images/python/README.md so the -f flag points to the image Dockerfile path (crates/iii-worker/images/python/Dockerfile) and use the image directory as the build context instead of just `-f Dockerfile .`; edit the command shown (the current "docker build -t iiidev/python -f Dockerfile .") to reference the full Dockerfile path and the correct context so the command works when executed from the project root.
crates/iii-worker/images/python/Dockerfile-1-1 (1)
1-1: ⚠️ Potential issue | 🟠 Major
Pin the Python base image to a specific version tag.
Using `python:latest` makes builds non-reproducible and non-deterministic. The `latest` tag can silently point to new Python releases, introducing unexpected breaking changes and build drift. Pin to a specific version (e.g., `python:3.12-slim-bookworm`) to ensure reproducible builds and consistent behavior across environments.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-worker/images/python/Dockerfile` at line 1, Replace the non-deterministic base image "python:latest" in the Dockerfile with a pinned, specific tag (for example "python:3.12-slim-bookworm" or another supported patch-level image) to ensure reproducible builds; edit the FROM instruction in the Dockerfile to use the chosen explicit tag and rebuild to verify no unexpected changes.
engine/src/cli/registry.rs-54-61 (1)
54-61: ⚠️ Potential issue | 🟠 Major
Align `iii-init` registry targets with the actual binary distribution strategy.
`iii-init` is declared in the registry with specific Linux targets (lines 54-61), yet no `iii-init` binaries are published in releases. The comment at update.rs:399 documents it as unsupported on macOS/Windows, which contradicts the registry declaration and suggests the binary is not meant for distribution via this mechanism.
The current `check_platform_support()` gate prevents unsafe asset-fetch attempts, so immediate runtime risk is low. However, either:
- Remove `iii-init` from the registry if it's not distributed as a managed binary, or
- Configure the release pipeline to build and publish `iii-init` binaries for the declared targets.
Verify with the project whether `iii-init` (a PID 1 init binary for microVM workers) should be managed through the update flow or distributed differently.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@engine/src/cli/registry.rs` around lines 54 - 61, The registry declares supported_targets for the `iii-init` managed binary but `iii-init` binaries are not actually published; update the registry to reflect reality by removing `iii-init` from the supported_targets array in the registry declaration (the array shown with "x86_64-unknown-linux-musl", "aarch64-apple-darwin", etc.) or alternatively add build/publish steps for `iii-init` in the release pipeline so those targets are produced; ensure this change is consistent with the platform gate `check_platform_support()` and the note in update.rs (around the PID 1 init comment) so runtime asset-fetch logic matches the declared distribution strategy.
crates/iii-worker/src/cli/worker_manager/state.rs-46-50 (1)
46-50: ⚠️ Potential issue | 🟠 Major
Make writes atomic to avoid corrupting `iii.workers.yaml`.
Line 49 overwrites the file directly. A crash/interruption during write can leave a truncated manifest and break subsequent loads.
🛠️ Proposed fix
 pub fn save(&self) -> Result<()> {
     let data = serde_yaml::to_string(self).with_context(|| "failed to serialize workers file")?;
-    std::fs::write(WORKERS_FILE, data)
-        .with_context(|| format!("failed to write {}", WORKERS_FILE))?;
+    let tmp = format!("{}.tmp", WORKERS_FILE);
+    std::fs::write(&tmp, data).with_context(|| format!("failed to write {}", tmp))?;
+    std::fs::rename(&tmp, WORKERS_FILE)
+        .with_context(|| format!("failed to replace {}", WORKERS_FILE))?;
     Ok(())
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-worker/src/cli/worker_manager/state.rs` around lines 46 - 50, The save() method currently writes directly to WORKERS_FILE which can leave a truncated file on crash; change it to perform an atomic write by writing serialized data to a temporary file in the same directory (e.g., WORKERS_FILE + ".tmp" or use tempfile::NamedTempFile) and then atomically replace the target with std::fs::rename (or persist the NamedTempFile) to move the temp into place; propagate errors with the existing with_context messages (e.g., "failed to write {}") and ensure the temp file is cleaned up on failure.
crates/iii-filesystem/build.rs-20-24 (1)
20-24: ⚠️ Potential issue | 🟠 Major
aarch64-unknown-linux-musl target unavailable; the embed-init feature will silently fail on aarch64 Linux.
The build script uses `aarch64-unknown-linux-musl` (line 22), but `platform.rs` explicitly states "Linux aarch64 uses gnu (no musl builds available)". When the musl binary does not exist, the code silently writes a 1-byte placeholder (line 42) instead of the real binary, causing the `embed-init` feature to fail without error. The comment at line 19 claiming "init is always a Linux musl binary" is incorrect for aarch64.
Change `aarch64` to map to `"aarch64-unknown-linux-gnu"` to align with `install.sh` (line 399) and the `platform.rs` documentation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-filesystem/build.rs` around lines 20 - 24, The match on arch assigns the wrong target triple for aarch64, causing embed-init to pick a non-existent musl binary and silently write a 1-byte placeholder; update the match arm in build.rs so that when arch == "aarch64" it maps triple = "aarch64-unknown-linux-gnu" (instead of "aarch64-unknown-linux-musl"), and correct the preceding comment that claims init is always a Linux musl binary to reflect that aarch64 uses GNU; ensure this change makes the embed-init flow select the real binary rather than the placeholder.
engine/install.sh-394-408 (1)
394-408: ⚠️ Potential issue | 🟠 Major
Fix the Darwin target mapping for iii-init.
The install script incorrectly maps Darwin hosts to Darwin binaries (`x86_64-apple-darwin`, `aarch64-apple-darwin`). However, `crates/iii-filesystem/build.rs` (lines 18-19) explicitly states: "VMs always run Linux guests, so init is always a Linux musl binary regardless of the host OS." The `engine/local-worker-dev.md` documentation consistently confirms this—both Apple Silicon and x86_64 hosts build iii-init using the `aarch64-unknown-linux-musl` and `x86_64-unknown-linux-musl` targets.
`engine/src/cli/registry.rs` also incorrectly lists `aarch64-apple-darwin` and `x86_64-apple-darwin` as supported targets for iii-init (lines 54-59), contradicting the build.rs constraint. These Darwin targets do not exist in releases and should be removed.
The Darwin case in lines 394-408 should map to Linux musl targets matching the host architecture, not Darwin targets.
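A minimal sketch of the mapping this comment argues for, assuming the guest is always Linux so the init target depends only on host architecture. The function name and shape are hypothetical, not the actual install.sh or registry code (note the neighboring build.rs comment debates musl vs. gnu for aarch64; this sketch follows the musl naming used here).

```rust
/// Hypothetical mapping illustrating the fix: the iii-init target is
/// chosen by host architecture only, because guests always run Linux.
/// A Darwin host must still fetch a Linux init binary.
fn init_target(host_os: &str, host_arch: &str) -> Option<&'static str> {
    let _ = host_os; // deliberately unused: the guest OS is always Linux
    match host_arch {
        "x86_64" => Some("x86_64-unknown-linux-musl"),
        "aarch64" => Some("aarch64-unknown-linux-musl"),
        _ => None, // unsupported host architecture
    }
}
```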
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@engine/install.sh` around lines 394 - 408, The Darwin branch of the init_target selection incorrectly assigns Darwin triples; change the Darwin case in the init_target logic so that when uname_s == "Darwin" it sets init_target to the Linux musl variants (x86_64-unknown-linux-musl for x86_64 and aarch64-unknown-linux-musl for aarch64) so iii-init is always a Linux musl binary (refer to init_target, uname_s, arch), and also remove the non-existent aarch64-apple-darwin and x86_64-apple-darwin entries from the supported targets list in engine/src/cli/registry.rs (the list that registers iii-init targets) so only the linux-musl targets remain.
crates/iii-worker/src/cli/firmware/symlinks.rs-7-28 (1)
7-28: ⚠️ Potential issue | 🟠 Major
Surface symlink provisioning failures to the caller.
If one unlink/symlink operation fails here, firmware provisioning can still look successful and the breakage only shows up later when libkrun tries to load `libkrunfw`. Returning a `Result` would let the install path fail fast instead of leaving a partial SONAME layout behind.
Possible change
 #[cfg(unix)]
-pub fn create_libkrunfw_symlinks(lib_dir: &Path) {
+pub fn create_libkrunfw_symlinks(lib_dir: &Path) -> std::io::Result<()> {
     let symlinks = libkrunfw_symlink_pairs();
     for (link_name, target) in &symlinks {
         let link_path = lib_dir.join(link_name);
         if link_path.exists() || link_path.is_symlink() {
-            let _ = std::fs::remove_file(&link_path);
+            std::fs::remove_file(&link_path)?;
         }
-        if let Err(e) = std::os::unix::fs::symlink(target, &link_path) {
-            tracing::warn!(
-                link = %link_path.display(),
-                target = %target,
-                error = %e,
-                "failed to create libkrunfw symlink"
-            );
-        }
+        std::os::unix::fs::symlink(target, &link_path)?;
     }
+
+    Ok(())
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-worker/src/cli/firmware/symlinks.rs` around lines 7 - 28, The function create_libkrunfw_symlinks currently swallows errors and only logs warnings, which can leave partial SONAME layout; change its signature to return a Result<(), E> (e.g., std::io::Result<()> or anyhow::Result<()>) and propagate failures from std::fs::remove_file and std::os::unix::fs::symlink instead of ignoring them; use the ? operator (or map_err to convert errors) when calling remove_file and symlink inside the loop (referencing create_libkrunfw_symlinks and libkrunfw_symlink_pairs) so the caller can fail fast on any unlink/symlink error and handle rollback or surface a provisioning failure.
crates/iii-network/src/dns.rs-135-150 (1)
135-150: ⚠️ Potential issue | 🟠 Major
Bound DNS resolution fan-out.
The bounded `query_rx` (capacity 64) does not actually bound total concurrent work here, because every dequeued packet is immediately re-spawned onto its own Tokio task. A noisy guest can create unbounded in-flight DNS lookups and waiting tasks. Resolve inline in this loop, or gate `tokio::spawn` behind a semaphore or `JoinSet` limit to cap concurrent resolutions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-network/src/dns.rs` around lines 135 - 150, The loop spawns a new Tokio task for every dequeued DNS query (query_rx) so the bounded channel (capacity 64) doesn't limit concurrent resolve_query work; replace the unconditional tokio::spawn by either performing resolve_query(&query.data, &resolver).await inline inside the while let loop (and then sending the response on response_tx) or gate task creation with a concurrency limiter (e.g., a tokio::sync::Semaphore permit or a tokio::task::JoinSet with a max concurrent count) so only a fixed number of resolve_query tasks run concurrently; update the code paths that reference query, resolver, response_tx, and shared.proxy_wake accordingly to use the chosen strategy.
crates/iii-init/src/mount.rs-110-145 (1)
110-145: ⚠️ Potential issue | 🟠 Major
Don't silently skip worker memory enforcement.
`mount_cgroup2().ok()` and the ignored `memory.max` write let the worker boot without the requested cgroup limit and without any warning. At minimum log this; if `III_WORKER_MEM_BYTES` is set, I would surface the error so the sandbox does not run with weaker limits than requested.
Possible change
-    mount_cgroup2().ok();
+    if let Err(e) = mount_cgroup2() {
+        if std::env::var_os("III_WORKER_MEM_BYTES").is_some() {
+            return Err(e);
+        }
+        eprintln!("iii-init: warning: failed to configure cgroup2: {e}");
+    }
@@
     // Set memory limit from env var (passed by vm_boot.rs).
     if let Ok(mem_bytes) = std::env::var("III_WORKER_MEM_BYTES") {
-        let _ = std::fs::write("/sys/fs/cgroup/worker/memory.max", &mem_bytes);
+        std::fs::write("/sys/fs/cgroup/worker/memory.max", &mem_bytes).map_err(|e| {
+            InitError::WriteFile {
+                path: "/sys/fs/cgroup/worker/memory.max".into(),
+                source: e,
+            }
+        })?;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-init/src/mount.rs` around lines 110 - 145, The code currently swallows errors by calling mount_cgroup2().ok() and ignoring the result of writing memory.max, which allows the worker to run without requested memory limits; change this so failures are surfaced or at least logged: (1) stop dropping the result at the call site — replace mount_cgroup2().ok() with proper error handling (propagate the InitError or log a clear warning) so mount failures aren't silently ignored; (2) inside mount_cgroup2(), when III_WORKER_MEM_BYTES is set, do not ignore the result of std::fs::write("/sys/fs/cgroup/worker/memory.max", &mem_bytes) — return an InitError on failure (or at minimum log an error with the path, mem_bytes presence, and the io::Error) so the supervisor won't run with weaker limits than requested; (3) ensure mkdir_ignore_exists("/sys/fs/cgroup") and the cgroup.subtree_control write also surface/log errors consistently (they already map errors to InitError) so behavior is uniform.
crates/iii-network/src/dns.rs-62-79 (1)
62-79: ⚠️ Potential issue | 🟠 Major
Make DNS interceptor startup fallible.
The resolver is built inside the spawned task (lines 127–133), so `new()` returns `Self` even if resolver construction fails. Once the task exits, `process()` drops queries due to the closed channel but logs the error as "DNS query channel full" (line 104), masking the actual startup failure. Additionally, `resolve_query` silently drops all resolution and parsing errors via `.ok()?` (lines 161, 164, 170, 185), causing guest-visible timeouts instead of negative DNS responses.
Build the resolver before spawning, return `Result<Self>`, and return negative responses for lookup failures so clients receive proper DNS errors instead of timeouts.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-network/src/dns.rs` around lines 62 - 79, Make DNS interceptor startup fallible by constructing the DNS resolver synchronously before spawning dns_resolver_task and change new() to return Result<Self> so callers can fail fast; only spawn dns_resolver_task if resolver construction succeeds (remove building resolver inside the task). In resolve_query replace the silent `.ok()?` chains (lines around resolve_query) with explicit error handling that generates proper negative DNS responses (SERVFAIL/NXDOMAIN as appropriate) and forwards parse/lookup errors to response_tx instead of dropping them; also ensure process() detects a closed query channel and surfaces the resolver startup error (don't mask it as "DNS query channel full"). Use identifiers: new(), dns_resolver_task, resolve_query, process(), CHANNEL_CAPACITY, DNS_PORT to find and update the code.
crates/iii-network/src/dns.rs-167-187 (1)
167-187: ⚠️ Potential issue | 🟠 Major
Map DNS resolution errors to appropriate response codes instead of discarding them.
`lookup(...).await.ok()?` collapses resolver failures (NXDOMAIN, SERVFAIL, timeout, etc.) into `None`, causing the guest to receive a socket timeout instead of a proper DNS response. The `lookup()` method returns `Result<Lookup, ResolveError>`, where the error contains the `ResponseCode` that should be sent back. Once the query parses successfully, extract the error's `ResponseCode` and construct a response message with the appropriate code rather than abandoning the query.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-network/src/dns.rs` around lines 167 - 187, The current resolver call uses resolver.lookup(...).await.ok()? which drops all ResolveError cases and returns None; instead handle the Result from resolver.lookup(...) and on Err(e) extract or map e to the appropriate hickory_proto::op::ResponseCode, build a response Message (reuse response_msg, set_id/query/op_code/recursion flags/add_query as you already do) but set_response_code to the mapped code (e.g., NXDomain, ServFail, Refused, etc.) and serialize/return the bytes; only when Ok(lookup) proceed to insert_answers from lookup.records() and set ResponseCode::NoError. Ensure you reference resolver.lookup, ResolveError -> ResponseCode mapping, response_msg, query_msg, question and return Some(Bytes::from(response_bytes)) for both success and mapped-error cases.
crates/iii-init/src/supervisor.rs-68-69 (1)
68-69: ⚠️ Potential issue | 🟠 Major
Don't ignore a failed cgroup move.
If this write fails, the worker starts outside the intended memory-limited cgroup and the sandbox silently loses one of its resource bounds. Bubble the error up instead of treating it as best-effort.
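A minimal sketch of the suggested fix: return the `io::Result` instead of discarding it with `let _ =`. The path is parameterized here for illustration; the real code writes `/sys/fs/cgroup/worker/cgroup.procs`.

```rust
use std::io;
use std::path::Path;

/// Move a child PID into a cgroup by writing to its cgroup.procs file,
/// propagating the error instead of `let _ = std::fs::write(...)` so a
/// failed move aborts startup rather than silently dropping the limit.
fn move_to_cgroup(cgroup_procs: &Path, child_pid: u32) -> io::Result<()> {
    std::fs::write(cgroup_procs, child_pid.to_string())
}
```

The caller can then use `?` on the result, so a worker never starts outside its memory-limited cgroup unnoticed.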
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-init/src/supervisor.rs` around lines 68 - 69, The write to move the worker into the cgroup currently swallows errors (let _ = std::fs::write(...)), so a failure leaves the worker outside the memory-limited cgroup; change this to propagate the error instead of ignoring it by replacing the ignored write with a proper Result check (use the ? operator or map_err and return Err) on the std::fs::write call so that failures when writing "/sys/fs/cgroup/worker/cgroup.procs" (where child_pid is used) bubble up from the function (or are converted to the function's error type) rather than being treated as best-effort.
crates/iii-network/src/proxy.rs-72-83 (1)
72-83: ⚠️ Potential issue | 🟠 Major
Preserve the server→guest relay after the guest closes its write side.
When `from_smoltcp.recv()` returns `None`, the proxy breaks the whole loop and stops reading the server socket too. That truncates responses for normal TCP close/FIN flows. Half-close the outbound socket and keep draining `server_rx` until EOF.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-network/src/proxy.rs` around lines 72 - 83, When from_smoltcp.recv() returns None you must not break the entire relay loop; instead half-close the outbound socket and keep draining the server_rx path until EOF. Replace the current None => break branch with code that calls an async shutdown of the server write side (use AsyncWriteExt::shutdown or the stream's shutdown method) to send FIN, stop sending further writes to server_tx, but continue the loop so the server_rx branch can keep reading from the server socket until it returns None, then break. Ensure you reference from_smoltcp.recv(), server_tx.write_all(...), and server_rx so the changes are applied in the correct relay logic.
crates/iii-init/src/supervisor.rs-19-24 (1)
19-24: ⚠️ Potential issue | 🟠 Major
Forward termination to the workload group, not just the direct child.
Spawning via `/bin/sh -c` means the tracked PID can be a shell wrapper instead of the real workload. In wrapper-script or multi-command cases, SIGTERM/SIGINT stop at that shell and grandchildren can outlive shutdown. Prefer launching the workload directly, or put it in its own process group and signal the group.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-init/src/supervisor.rs` around lines 19 - 24, The signal handler forwards signals to the child PID but that only hits a shell wrapper; update the shutdown to target the child's process group and ensure the child is placed in its own group when spawned: change signal_handler to call libc::kill with the negative group id (e.g., kill(-pid, sig)) and, in the code that creates the child process (the spawn/fork logic referenced around the other block at lines 59-66), set the child into a new process group (e.g., setpgid(child_pid, child_pid) or the equivalent POSIX/posix_spawn attr) right after fork/exec so signals sent to -CHILD_PID reach all descendants.
crates/iii-network/src/network.rs-77-95 (1)
77-95: ⚠️ Potential issue | 🟠 Major
Guard `start()` against double invocation.
A second call spawns another `smoltcp-poll` thread against the same shared queues and overwrites the first `JoinHandle`, so both loops can race the same backend. Make `start()` one-shot before spawning.
Suggested guard
 pub fn start(&mut self, tokio_handle: tokio::runtime::Handle) {
+    assert!(
+        self.poll_handle.is_none(),
+        "SmoltcpNetwork::start called more than once"
+    );
+
     let shared = self.shared.clone();
     let poll_config = PollLoopConfig {
         gateway_mac: self.gateway_mac,
         guest_mac: self.guest_mac,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-network/src/network.rs` around lines 77 - 95, The start() method currently allows multiple invocations which spawns additional "smoltcp-poll" threads and overwrites self.poll_handle; guard it so start() is one-shot by checking self.poll_handle before spawning (e.g., return early or Err if self.poll_handle.is_some()), and only set self.poll_handle to Some(...) when creating the new thread that calls stack::smoltcp_poll_loop(shared, poll_config, tokio_handle); ensure you reference and protect the shared queues and the existing JoinHandle stored in self.poll_handle to avoid racing the same backend.
crates/iii-worker/src/cli/firmware/constants.rs-63-75 (1)
63-75: ⚠️ Potential issue | 🟠 Major
Fail unsupported host tuples before archive resolution.
`check_libkrunfw_platform_support()` only rejects darwin-x86_64. Every other unknown tuple currently looks "supported", which means later code can search for non-existent archives or panic in `iii_init_musl_target()`. Make this an explicit allowlist of supported `(OS, ARCH)` pairs.
Suggested allowlist
 pub fn check_libkrunfw_platform_support() -> Result<(), String> {
-    let os = std::env::consts::OS;
-    let arch = std::env::consts::ARCH;
-    // Available firmware: darwin-aarch64, linux-x86_64, linux-aarch64
-    // Missing: darwin-x86_64 (Intel Mac)
-    if os == "macos" && arch == "x86_64" {
-        return Err(format!(
-            "libkrunfw firmware is not available for Intel Macs (darwin-x86_64).\n\
-             Managed workers require an Apple Silicon Mac (aarch64) or a Linux host.\n\
-             Set III_LIBKRUNFW_PATH to a manually-built firmware file to override."
-        ));
-    }
-    Ok(())
+    match (std::env::consts::OS, std::env::consts::ARCH) {
+        ("macos", "aarch64") | ("linux", "x86_64") | ("linux", "aarch64") => Ok(()),
+        ("macos", "x86_64") => Err(format!(
+            "libkrunfw firmware is not available for Intel Macs (darwin-x86_64).\n\
+             Managed workers require an Apple Silicon Mac (aarch64) or a Linux host.\n\
+             Set III_LIBKRUNFW_PATH to a manually-built firmware file to override."
+        )),
+        (os, arch) => Err(format!(
+            "libkrunfw firmware is not available for {os}-{arch}."
+        )),
+    }
 }
Also applies to: 89-99
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-worker/src/cli/firmware/constants.rs` around lines 63 - 75, check_libkrunfw_platform_support currently only rejects darwin-x86_64 and leaves all other tuples implicitly "supported", which can lead to searching non-existent archives or panics; change it to an explicit allowlist: accept only the supported (OS, ARCH) pairs ("macos","aarch64"), ("linux","x86_64"), and ("linux","aarch64") and return Err for any other tuple with a clear message about unsupported host and how to override via III_LIBKRUNFW_PATH. Apply the same allowlist logic to the other similar check in the file (the block around the other function/logic referenced by lines 89-99 / any iii_init_musl_target-related platform validation) so both functions validate against the same enumerated supported tuples.
.github/workflows/_rust-binary.yml-98-107 (1)
98-107: ⚠️ Potential issue | 🟠 Major

Fail fast on unknown `inputs.targets` entries.

The filter silently drops typos or unsupported triples, which can produce a partial release without any explicit failure. Compare requested vs matched targets and abort when anything is unmatched.
Suggested validation
```diff
 else
   # Filter default matrix to only include targets in the provided JSON array
   FILTERED=$(echo "$DEFAULT" | jq -c --argjson targets "$TARGETS_INPUT" '
     .include |= [ .[] | select(.target as $t | $targets | index($t)) ]
   ')
+  MATCHED=$(echo "$FILTERED" | jq -cr '[.include[].target] | sort')
+  REQUESTED=$(echo "$TARGETS_INPUT" | jq -cr 'sort')
+  if [ "$MATCHED" != "$REQUESTED" ]; then
+    echo "::error::inputs.targets contains unsupported target triples"
+    exit 1
+  fi
   echo "matrix=$FILTERED" >> "$GITHUB_OUTPUT"
   echo "Filtered matrix to targets: $TARGETS_INPUT"
 fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/_rust-binary.yml around lines 98 - 107, The current workflow silently drops unknown triples when building FILTERED from DEFAULT and TARGETS_INPUT; update the shell logic after computing FILTERED to compare the requested targets (TARGETS_INPUT) with the matched targets in FILTERED (extract their .target fields) and if any requested entry is missing produce a clear error message and exit non‑zero (e.g., echo "Unknown targets: <list>" >&2; exit 1). Use the same jq context to compute the matched-target list from FILTERED (and the unmatched list by checking which entries in TARGETS_INPUT are not in FILTERED) so you can log the unmatched items and abort before writing GITHUB_OUTPUT.

crates/iii-filesystem/src/backends/passthroughfs/metadata.rs (1)
62-81: ⚠️ Potential issue | 🟠 Major

Potential fd leak: `owns_fd` logic may incorrectly skip closing an opened fd.

When `handle` is `Some(h)` but `h` is not found in `handles`, the code opens a new fd via `open_inode_fd`. However, `owns_fd` is computed as `handle.is_none() || ...get(&handle.unwrap())...`, which will be `false` when the handle lookup fails (since `handle.is_none()` is false and the `get` returns `None`). This means the newly opened fd won't be closed, causing a leak.
🐛 Proposed fix
```diff
     // Use handle fd if available, otherwise open from inode.
-    let fd = if let Some(h) = handle {
+    let (fd, owns_fd) = if let Some(h) = handle {
         let handles = fs.handles.read().unwrap();
         if let Some(hdata) = handles.get(&h) {
-            hdata.file.read().unwrap().as_raw_fd()
+            (hdata.file.read().unwrap().as_raw_fd(), false)
         } else {
-            inode::open_inode_fd(fs, ino, open_flags)?
+            (inode::open_inode_fd(fs, ino, open_flags)?, true)
         }
     } else {
-        inode::open_inode_fd(fs, ino, open_flags)?
+        (inode::open_inode_fd(fs, ino, open_flags)?, true)
     };
-
-    // Track whether we own the fd (and need to close it).
-    let owns_fd = handle.is_none()
-        || fs
-            .handles
-            .read()
-            .unwrap()
-            .get(&handle.unwrap_or(u64::MAX))
-            .is_none();
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-filesystem/src/backends/passthroughfs/metadata.rs` around lines 62 - 81, The owns_fd calculation is wrong: when Some(h) is provided but not found, you open a new fd via inode::open_inode_fd (assigned to fd) but owns_fd evaluates to false, leaking the fd; fix by computing owns_fd from the same lookup used to get fd—i.e., when handle is None OR when handle is Some(h) but fs.handles.read().unwrap().get(&h) is None, owns_fd must be true; implement this by reusing the handle lookup result used to select fd (store the lookup outcome or the chosen branch) and set owns_fd accordingly so any fd created by inode::open_inode_fd is closed.

crates/iii-worker/src/cli/managed.rs (1)
625-651: ⚠️ Potential issue | 🟠 Major

`detect_lan_ip` uses macOS-specific commands without platform guards.

The function uses `route -n get default` and `ifconfig`, which are macOS-specific. On Linux, these commands either don't exist or produce different output, so the function will silently return `None`. Either guard this function with `#[cfg(target_os = "macos")]` or provide a cross-platform implementation (e.g., `ip route` and `ip addr` for Linux).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-worker/src/cli/managed.rs` around lines 625 - 651, The detect_lan_ip function currently runs macOS-specific commands ("route -n get default" and "ifconfig") and should be made platform-aware: either add #[cfg(target_os = "macos")] to the detect_lan_ip function (and an alternative stub that returns None or calls a cross-platform variant for other OSes), or implement a cross-platform version that uses Linux commands (e.g., "ip route show default" to get the interface and "ip -4 addr show dev <iface>" to parse the IPv4 address) and annotate platform-specific implementations with #[cfg(target_os = "...")] so detect_lan_ip behaves correctly on Linux and macOS; update the function signatures and parsing logic in detect_lan_ip accordingly.
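The Linux half of the suggested cross-platform split mostly comes down to parsing `ip route show default`. A minimal sketch of that parsing step — `default_iface_from_ip_route` is a hypothetical helper for illustration, not code from the PR:

```rust
// Sketch of the Linux side of the suggested cross-platform split.
// `default_iface_from_ip_route` is a hypothetical helper (not in the PR):
// it parses `ip route show default` output to find the egress interface,
// which `ip -4 addr show dev <iface>` could then resolve to an address.
fn default_iface_from_ip_route(output: &str) -> Option<String> {
    // Typical line: "default via 192.168.1.1 dev eth0 proto dhcp metric 100"
    let line = output.lines().find(|l| l.starts_with("default "))?;
    let mut words = line.split_whitespace();
    while let Some(word) = words.next() {
        if word == "dev" {
            return words.next().map(str::to_string);
        }
    }
    None
}

fn main() {
    let sample = "default via 192.168.1.1 dev eth0 proto dhcp metric 100";
    println!("{:?}", default_iface_from_ip_route(sample)); // Some("eth0")
}
```

A `#[cfg(target_os = "linux")]` branch of `detect_lan_ip` could shell out to `ip` and feed the stdout through a helper like this, keeping the existing `route`/`ifconfig` logic behind `#[cfg(target_os = "macos")]`.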
🟡 Minor comments (4)
crates/iii-worker/images/node/Dockerfile (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Document or pin versions explicitly to ensure build reproducibility.

Using floating tags (`node:24-slim` and `npm@latest`) can cause builds to vary over time, complicating debugging and deployments. If reproducibility is critical, pin to explicit versions (e.g., `node:24.14.0-slim` and a specific npm release).

Note: Node.js 24.x (Krypton) is the current Active LTS line. npm@latest (11.x) is explicitly compatible with node:24-slim and maintains support for all active LTS versions, so floating to `npm@latest` is safe from a compatibility standpoint; the trade-off is reproducibility, not breakage risk.

Example: Explicit pinning approach
```diff
-FROM node:24-slim
+FROM node:24.14.0-slim
@@
-    && npm install -g npm@latest \
+    && npm install -g npm@11.9.0 \
```

Also applies to: 15–20
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-worker/images/node/Dockerfile` at line 1, The Dockerfile uses floating image and package tags which harms reproducibility; update the FROM line and any npm install invocations to pin explicit versions (e.g., change FROM node:24-slim to a specific patch image like node:24.14.0-slim and replace npm@latest with a fixed npm version such as npm@11.x.y) so builds are deterministic, and ensure any other references around the Dockerfile lines 15–20 that mention npm or node are likewise pinned; locate the FROM line in the Dockerfile and any npm install commands (npm@latest) and replace with the chosen fixed version strings.

engine/local-worker-dev.md (1)

125-126: ⚠️ Potential issue | 🟡 Minor

Fix `libkrunfw` resolution order in the table.

Line 125 currently puts env/cache before embedded assets, but this doc should match the documented primitive order used elsewhere.
📝 Proposed doc fix
```diff
-| **libkrunfw** | 1. `III_LIBKRUNFW_PATH` env var 2. `~/.iii/lib/libkrunfw.{5}.dylib` 3. Embedded bytes (if `--features embed-libkrunfw`) 4. Auto-download from GitHub release |
+| **libkrunfw** | 1. Embedded bytes (if `--features embed-libkrunfw`) 2. `III_LIBKRUNFW_PATH` env var 3. `~/.iii/lib/libkrunfw.{5}.dylib` 4. Auto-download from GitHub release |
```
`**/*.md`: Ensure all references to primitives reflect those defined in docs/.
Verify each finding against the current code and only fix it if needed. In `@engine/local-worker-dev.md` around lines 125 - 126, The libkrunfw resolution order is wrong in the markdown table: update the row for **libkrunfw** so embedded assets come first (matching other docs). Replace the current ordering so it reads: 1. Embedded (if `--features embed-libkrunfw`) 2. `III_LIBKRUNFW_PATH` env var 3. `~/.iii/lib/libkrunfw.{5}.dylib` 4. Auto-download from GitHub release; ensure the symbol `III_LIBKRUNFW_PATH` and the embedded feature name `embed-libkrunfw` are used exactly as shown.

crates/iii-worker/Cargo.toml (1)
4-4: ⚠️ Potential issue | 🟡 Minor

Rust 1.85+ requirement is already documented; verify local development setup.

The project documentation (sdk/README.md) already specifies "Rust 1.85+ and Cargo (for Rust SDK)," and CI uses `dtolnay/rust-toolchain@stable`, which complies with this requirement. However, there is no `rust-toolchain.toml` file to enforce the minimum version for developers working locally. Consider adding one at the repository root to prevent accidental use of older Rust versions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-worker/Cargo.toml` at line 4, Add a rust-toolchain.toml at the repository root to lock the minimum Rust version for local dev so contributors don't accidentally use older toolchains; create a file named rust-toolchain.toml and set the [toolchain] channel to 1.85.0 (and optionally list components like rustfmt and clippy) so the workspace (including the crate that sets edition = "2024" in Cargo.toml) is built with Rust 1.85+ by default.

crates/iii-network/src/conn.rs (1)
164-177: ⚠️ Potential issue | 🟡 Minor

Potential data loss when proxy channel is full.

At line 168, if `try_send` fails (channel full), the loop breaks and data from subsequent `recv_slice` calls in that iteration is lost. The connection continues, but some data may never reach the proxy.

Consider buffering the data or using a different strategy when the channel is temporarily full.
💡 Suggested improvement
```diff
 while socket.can_recv() {
     match socket.recv_slice(&mut relay_buf) {
         Ok(n) if n > 0 => {
             let data = Bytes::copy_from_slice(&relay_buf[..n]);
             if conn.to_proxy.try_send(data).is_err() {
+                // Channel full - data will be re-read on next poll
+                // since we didn't advance smoltcp's read pointer
                 break;
             }
         }
         _ => break,
     }
 }
```

Actually, `recv_slice` does advance the read pointer, so this data is lost. Consider:
- Buffering unsent data similar to `write_buf`
- Using blocking send (requires async context)
- Accepting the data loss as backpressure mechanism
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-network/src/conn.rs` around lines 164 - 177, The loop reading from socket uses socket.recv_slice and discards data when conn.to_proxy.try_send fails (channel full); fix by buffering unsent data instead of dropping it: add a per-connection outbound buffer (e.g., reuse or add a field like conn.proxy_write_buf similar to write_buf), on Ok(n) copy the slice into Bytes, attempt conn.to_proxy.try_send(data) and if it Errs push that Bytes into conn.proxy_write_buf and break out of the recv loop (stop reading more from socket), then ensure write_proxy_data (or the task that drains conn.to_proxy) is updated to first drain conn.proxy_write_buf into the channel when it becomes writable. This preserves data when try_send returns full and avoids losing bytes read by recv_slice.
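The first option (buffering) can be shown in miniature with a std bounded channel standing in for the crate's `to_proxy` sender. This is a sketch under that substitution; `forward` and `pending` mirror the suggested `proxy_write_buf` idea and are not real code from the PR:

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

// Sketch of the "buffer unsent data" option from the review, using a std
// bounded channel in place of the crate's `to_proxy` sender. `pending`
// plays the role of the suggested proxy_write_buf field.
fn forward(data: Vec<u8>, tx: &SyncSender<Vec<u8>>, pending: &mut VecDeque<Vec<u8>>) {
    // Drain previously buffered chunks first so byte order is preserved.
    while let Some(chunk) = pending.pop_front() {
        match tx.try_send(chunk) {
            Ok(()) => {}
            Err(TrySendError::Full(chunk)) => {
                pending.push_front(chunk); // still backed up
                break;
            }
            Err(TrySendError::Disconnected(_)) => return, // peer gone
        }
    }
    if !pending.is_empty() {
        // Queue behind older data instead of dropping bytes that
        // recv_slice has already consumed from the socket.
        pending.push_back(data);
        return;
    }
    if let Err(TrySendError::Full(data)) = tx.try_send(data) {
        pending.push_back(data);
    }
}

fn main() {
    let (tx, rx) = sync_channel(1);
    let mut pending = VecDeque::new();
    forward(b"a".to_vec(), &tx, &mut pending); // sent
    forward(b"b".to_vec(), &tx, &mut pending); // channel full -> buffered
    println!("buffered: {}", pending.len()); // buffered: 1
    assert_eq!(rx.recv().unwrap(), b"a".to_vec());
}
```

Because buffered chunks are drained before new data is sent, ordering is preserved, and nothing already pulled out of the smoltcp socket is lost; the buffer itself would still need a cap to bound memory.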
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 3d0e41a4-6fe9-475a-a79a-f182cc0b8ba3
⛔ Files ignored due to path filters (4)
- `Cargo.lock` is excluded by `!**/*.lock`
- `engine/firmware/libkrunfw-darwin-aarch64.dylib` is excluded by `!**/*.dylib`
- `engine/firmware/libkrunfw-linux-aarch64.so` is excluded by `!**/*.so`
- `engine/firmware/libkrunfw-linux-x86_64.so` is excluded by `!**/*.so`
📒 Files selected for processing (90)
- .cargo/config.toml
- .github/workflows/_rust-binary.yml
- .github/workflows/ci.yml
- .github/workflows/release-iii.yml
- Cargo.toml
- Makefile
- crates/iii-filesystem/Cargo.toml
- crates/iii-filesystem/build.rs
- crates/iii-filesystem/src/backends/mod.rs
- crates/iii-filesystem/src/backends/passthroughfs/builder.rs
- crates/iii-filesystem/src/backends/passthroughfs/create_ops.rs
- crates/iii-filesystem/src/backends/passthroughfs/dir_ops.rs
- crates/iii-filesystem/src/backends/passthroughfs/file_ops.rs
- crates/iii-filesystem/src/backends/passthroughfs/inode.rs
- crates/iii-filesystem/src/backends/passthroughfs/metadata.rs
- crates/iii-filesystem/src/backends/passthroughfs/mod.rs
- crates/iii-filesystem/src/backends/passthroughfs/remove_ops.rs
- crates/iii-filesystem/src/backends/passthroughfs/special.rs
- crates/iii-filesystem/src/backends/shared/handle_table.rs
- crates/iii-filesystem/src/backends/shared/init_binary.rs
- crates/iii-filesystem/src/backends/shared/inode_table.rs
- crates/iii-filesystem/src/backends/shared/mod.rs
- crates/iii-filesystem/src/backends/shared/name_validation.rs
- crates/iii-filesystem/src/backends/shared/platform.rs
- crates/iii-filesystem/src/init.rs
- crates/iii-filesystem/src/lib.rs
- crates/iii-filesystem/tests/filesystem_integration.rs
- crates/iii-init/Cargo.toml
- crates/iii-init/src/error.rs
- crates/iii-init/src/main.rs
- crates/iii-init/src/mount.rs
- crates/iii-init/src/network.rs
- crates/iii-init/src/rlimit.rs
- crates/iii-init/src/supervisor.rs
- crates/iii-init/tests/init_integration.rs
- crates/iii-network/Cargo.toml
- crates/iii-network/src/backend.rs
- crates/iii-network/src/config.rs
- crates/iii-network/src/conn.rs
- crates/iii-network/src/device.rs
- crates/iii-network/src/dns.rs
- crates/iii-network/src/lib.rs
- crates/iii-network/src/network.rs
- crates/iii-network/src/proxy.rs
- crates/iii-network/src/shared.rs
- crates/iii-network/src/stack.rs
- crates/iii-network/src/udp_relay.rs
- crates/iii-network/src/wake_pipe.rs
- crates/iii-network/tests/network_integration.rs
- crates/iii-worker/Cargo.toml
- crates/iii-worker/build.rs
- crates/iii-worker/images/node/Dockerfile
- crates/iii-worker/images/node/README.md
- crates/iii-worker/images/python/Dockerfile
- crates/iii-worker/images/python/README.md
- crates/iii-worker/src/cli/firmware/constants.rs
- crates/iii-worker/src/cli/firmware/download.rs
- crates/iii-worker/src/cli/firmware/libkrunfw_bytes.rs
- crates/iii-worker/src/cli/firmware/mod.rs
- crates/iii-worker/src/cli/firmware/resolve.rs
- crates/iii-worker/src/cli/firmware/symlinks.rs
- crates/iii-worker/src/cli/managed.rs
- crates/iii-worker/src/cli/mod.rs
- crates/iii-worker/src/cli/vm_boot.rs
- crates/iii-worker/src/cli/worker_manager/adapter.rs
- crates/iii-worker/src/cli/worker_manager/libkrun.rs
- crates/iii-worker/src/cli/worker_manager/mod.rs
- crates/iii-worker/src/cli/worker_manager/state.rs
- crates/iii-worker/src/main.rs
- crates/iii-worker/tests/worker_integration.rs
- engine/Cargo.toml
- engine/build.rs
- engine/install.sh
- engine/local-worker-dev.md
- engine/src/cli/error.rs
- engine/src/cli/github.rs
- engine/src/cli/managed_shim.rs
- engine/src/cli/mod.rs
- engine/src/cli/registry.rs
- engine/src/cli/update.rs
- engine/src/cli/worker_manager/config.rs
- engine/src/cli/worker_manager/install.rs
- engine/src/cli/worker_manager/manifest.rs
- engine/src/cli/worker_manager/registry.rs
- engine/src/cli/worker_manager/spec.rs
- engine/src/cli/worker_manager/storage.rs
- engine/src/cli/worker_manager/uninstall.rs
- engine/src/main.rs
- engine/tests/vm_platform_e2e.rs
- sdk/packages/rust/iii/Cargo.toml
💤 Files with no reviewable changes (9)
- engine/src/cli/github.rs
- engine/src/cli/error.rs
- engine/src/cli/worker_manager/spec.rs
- engine/src/cli/worker_manager/manifest.rs
- engine/src/cli/worker_manager/storage.rs
- engine/src/cli/worker_manager/uninstall.rs
- engine/src/cli/worker_manager/registry.rs
- engine/src/cli/worker_manager/config.rs
- engine/src/cli/worker_manager/install.rs
```yaml
matrix:
  include:
    - firmware_file: libkrunfw-darwin-aarch64.dylib
      archive_name: libkrunfw-darwin-aarch64.tar.gz
      publish_aliases: 'aarch64-apple-darwin'
    - firmware_file: libkrunfw-linux-x86_64.so
      archive_name: libkrunfw-linux-x86_64.tar.gz
      publish_aliases: 'x86_64-unknown-linux-gnu x86_64-unknown-linux-musl'
    - firmware_file: libkrunfw-linux-aarch64.so
      archive_name: libkrunfw-linux-aarch64.tar.gz
      publish_aliases: 'aarch64-unknown-linux-gnu'
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Check if x86_64 darwin firmware exists in the repository
fd -t f 'libkrunfw.*darwin.*x86' engine/firmware/ || echo "No x86_64 darwin firmware found"
ls -la engine/firmware/ 2>/dev/null || echo "engine/firmware/ directory not found"
```

Repository: iii-hq/iii
Length of output: 419
🏁 Script executed:
```shell
#!/bin/bash
# Check the worker-release job and its targets around line 309
sed -n '295,325p' .github/workflows/release-iii.yml
```

Repository: iii-hq/iii
Length of output: 1380
🏁 Script executed:
```shell
#!/bin/bash
# Search for any references to x86_64-apple-darwin in the workflow
rg 'x86_64-apple-darwin' .github/workflows/release-iii.yml
```

Repository: iii-hq/iii
Length of output: 276
🏁 Script executed:
```shell
#!/bin/bash
# Check if there's any documentation about macOS firmware support
fd -t f -e md -e txt | xargs rg -l 'darwin.*firmware\|x86_64.*darwin' 2>/dev/null | head -5
```

Repository: iii-hq/iii
Length of output: 36
🏁 Script executed:
```shell
#!/bin/bash
# Get the complete firmware-upload job and its matrix
sed -n '180,280p' .github/workflows/release-iii.yml
```

Repository: iii-hq/iii
Length of output: 4239
Mismatch between worker-release targets and firmware-upload matrix for x86_64-apple-darwin.
The worker-release job targets include x86_64-apple-darwin and use the embed-libkrunfw feature, but the firmware-upload matrix only provides aarch64-apple-darwin firmware. No x86_64-apple-darwin firmware file exists in engine/firmware/. Either add the missing x86_64-apple-darwin firmware or remove it from the worker-release targets if Intel Mac support is not intended.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/release-iii.yml around lines 211 - 221, The workflow's
worker-release targets include x86_64-apple-darwin but the firmware-upload
matrix only provides libkrunfw-darwin-aarch64.dylib (aarch64) so the x86_64
firmware is missing; either add a libkrunfw-darwin-x86_64.dylib/archive entry to
the firmware-upload matrix (matching the publish_aliases and archive_name
patterns) or remove x86_64-apple-darwin from the worker-release targets (and any
reference to the embed-libkrunfw feature) to keep targets and the
firmware_upload matrix (matrix.include entries like
firmware_file/archive_name/publish_aliases) consistent.
…ion tests

- Added steps to build and test the `iii-worker` crate within the CI workflow.
- Included installation of the `iii-worker` binary for integration tests.
- Removed redundant build and test steps for `iii-worker` to streamline the process.
- Updated test assertions in `cli_integration.rs` to reflect changes in command subcommands.
- Added a step to install `libcap-ng-dev` as a system dependency for the `iii-worker` during the CI process.
- Updated the CI workflow to ensure all necessary dependencies are available for building and testing the `iii-worker` crate.
Actionable comments posted: 1
🧹 Nitpick comments (1)
.github/workflows/ci.yml (1)
42-46: Remove the redundant `cargo build` step.

`cargo test -p iii-worker` already compiles the package and its tests. The preceding `cargo build -p iii-worker` duplicates compilation effort and can be safely removed to reduce CI duration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/ci.yml around lines 42 - 46, Remove the redundant compilation step by deleting the "cargo build -p iii-worker" invocation and keep only "cargo test -p iii-worker" so the CI runs the tests (which already compile the package) without duplicating the build for the iii-worker job.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/ci.yml:
- Around line 37-41: Remove the unused system dependency by deleting the "sudo
apt-get install -y libcap-ng-dev" entry in the "Install system dependencies for
iii-worker" step; if you intend to keep it for a future requirement, instead add
an inline comment in that step explaining the reason and link to the issue/docs,
otherwise remove the whole libcap-ng-dev install line to avoid unnecessary CI
overhead.
---
Nitpick comments:
In @.github/workflows/ci.yml:
- Around line 42-46: Remove the redundant compilation step by deleting the
"cargo build -p iii-worker" invocation and keep only "cargo test -p iii-worker"
so the CI runs the tests (which already compile the package) without duplicating
the build for the iii-worker job.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 0765be6e-5bab-4c6c-863d-24bb9a4a58b4
📒 Files selected for processing (1)
.github/workflows/ci.yml
- Simplified test assertions in multiple test files by consolidating lines for better readability.
- Enhanced formatting of output messages in tests to maintain consistency and improve clarity.
- Updated the `PassthroughFsBuilder` test to reduce line breaks for a more concise presentation.
Actionable comments posted: 4
🧹 Nitpick comments (4)
crates/iii-worker/src/cli/managed.rs (4)
625-651: `detect_lan_ip` uses macOS-specific commands without platform guard.

The function uses `route -n get default` and `ifconfig`, which are macOS-specific. On Linux, this will silently fail and return `None`. While `engine_url_for_runtime` currently ignores the LAN IP, this creates technical debt if the feature is later enabled.

Consider adding a `#[cfg(target_os = "macos")]` guard or implementing cross-platform support using `ip route`/`ip addr` on Linux.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-worker/src/cli/managed.rs` around lines 625 - 651, The detect_lan_ip function is macOS-specific (uses `route -n get default` and `ifconfig`) and should be guarded or made cross-platform: either add a #[cfg(target_os = "macos")] on detect_lan_ip (and provide a stub returning None for other targets) or implement a Linux branch that runs `ip route` / `ip addr` to discover the default interface and its inet address; update the function (and any callers) so the macOS implementation remains in `detect_lan_ip` and Linux logic is used when cfg(target_os = "linux") to avoid silent failures on non-macOS systems.
271-282: Inconsistent error handling for firmware provisioning.

On lines 272-273, if `ensure_libkrunfw()` fails, only a warning is logged and execution continues. However, lines 275-282 then check `libkrun_available()` and return an error if false. This creates potentially confusing behavior: the warning may mislead users into thinking the error on lines 276-281 is unrelated.

Consider making the firmware provisioning error more prominent, or consolidating the check:
♻️ Suggested refinement
```diff
 if let Err(e) = super::firmware::download::ensure_libkrunfw().await {
-    tracing::warn!(error = %e, "failed to ensure libkrunfw availability");
+    tracing::warn!(error = %e, "failed to ensure libkrunfw availability; checking if already present...");
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-worker/src/cli/managed.rs` around lines 271 - 282, The code logs a warning when super::firmware::download::ensure_libkrunfw().await fails but then separately checks super::worker_manager::libkrun::libkrun_available(), causing confusing flow; change the handling so that any Err from ensure_libkrunfw() is treated as a fatal/returnable error (e.g., log an error with the captured error and immediately return 1) or else have ensure_libkrunfw() return a definitive availability result that you use instead of calling libkrun_available() separately; update the branch around ensure_libkrunfw() and libkrun_available() (references: ensure_libkrunfw, libkrun_available) to either early-return on failure or consolidate into a single availability check with a clear error log.
372-377: `_follow` parameter is declared but not implemented.

The `follow` parameter (prefixed with `_` to silence warnings) appears intended for tail-follow functionality but is never used. This could confuse users who pass `--follow` expecting real-time log streaming.

Consider either implementing follow mode or documenting it as not yet supported.

Would you like me to open an issue to track implementing the `--follow` functionality?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-worker/src/cli/managed.rs` around lines 372 - 377, The _follow parameter in handle_managed_logs (signature: pub async fn handle_managed_logs(worker_name: &str, _follow: bool, _address: &str, _port: u16) -> i32) is unused; either implement follow-mode log streaming or remove the unused flag. To fix, decide whether to support --follow: if yes, un-prefix follow, wire the boolean into the log-fetching logic (e.g., poll or attach a streaming client and tail new log lines until interrupted) in the body of handle_managed_logs; if no, remove the parameter (and any CLI flag wiring) or update the help/documentation to state follow is not supported. Ensure references to worker_name, _address, and _port are preserved when adding streaming behavior and return appropriate exit codes from handle_managed_logs.
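If follow mode were implemented, the polling variant the prompt describes reduces to remembering a byte offset and reading only what was appended each tick. A rough sketch — `read_new` is a hypothetical helper, not code from the PR:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};

// Hypothetical polling primitive for a `--follow` implementation:
// remember a byte offset into the log file and read only what was
// appended since the last call. A follow loop would invoke this on a
// short timer until interrupted (e.g., Ctrl-C).
fn read_new(file: &mut File, offset: &mut u64) -> std::io::Result<String> {
    file.seek(SeekFrom::Start(*offset))?;
    let mut buf = String::new();
    file.read_to_string(&mut buf)?;
    *offset += buf.len() as u64; // String::len() is a byte count
    Ok(buf)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("iii_follow_sketch.log");
    std::fs::write(&path, "line 1\n")?;
    let mut reader = File::open(&path)?;
    let mut offset = 0u64;
    print!("{}", read_new(&mut reader, &mut offset)?);
    // Simulate the worker appending while we "follow".
    std::fs::OpenOptions::new().append(true).open(&path)?.write_all(b"line 2\n")?;
    print!("{}", read_new(&mut reader, &mut offset)?);
    std::fs::remove_file(&path)
}
```

In `handle_managed_logs` this would apply to both the stdout and stderr log paths; the non-follow branch could keep the existing last-100-lines behavior.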
395-419: Duplicated "last N lines" logic could be extracted.

The pattern of reading a file, splitting into lines, and taking the last 100 is repeated three times (lines 395-403, 410-418, and 435-443). Consider extracting to a helper function.
♻️ Suggested helper
```rust
fn tail_lines(path: &std::path::Path, n: usize) -> Option<String> {
    let contents = std::fs::read_to_string(path).ok()?;
    if contents.is_empty() {
        return None;
    }
    let lines: Vec<&str> = contents.lines().collect();
    let start = lines.len().saturating_sub(n);
    Some(lines[start..].join("\n"))
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/iii-worker/src/cli/managed.rs` around lines 395 - 419, Extract the duplicated "read file, take last N lines" logic into a helper function (e.g., fn tail_lines(path: &std::path::Path, n: usize) -> Option<String>) that reads the file with std::fs::read_to_string, returns None for empty or missing files, computes start with lines.len().saturating_sub(n), and returns the joined last-n lines as a String; then replace the three repeated blocks that read stdout_path and stderr_path and print to println!/eprintln! by calling tail_lines(stdout_path, 100) / tail_lines(stderr_path, 100), checking for Some(text) to set found_content and printing the returned text (use eprintln! for stderr) so the shared logic is centralized.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/iii-worker/src/cli/managed.rs`:
- Around line 1136-1139: The debug message after calling
adapter.start(&spec).await is misleading because it says "managed worker process
exited successfully" even though start() spawns/starts the VM rather than
waiting for it to exit; update the log in the Ok(_) arm (the tracing::debug!
call near adapter.start(&spec).await) to say something like "managed worker
process started" or "spawned" and keep the same structured field (worker =
%name) so operators see an accurate lifecycle event.
- Around line 232-236: The code currently uses
dirs::home_dir().unwrap_or_default() to build pid_file which yields an empty
PathBuf on failure and can produce invalid root paths; change this to explicitly
handle None by returning an error (or bubbling a Result) when home_dir() is None
instead of defaulting to an empty path. Update the logic that constructs
pid_file (and other similar spots in this file) to first match or if-let the
dirs::home_dir() result, produce a descriptive error mentioning the missing home
directory and worker_name, and only then join the remaining components to build
vm.pid; ensure the surrounding function signature returns Result/any error type
used in this module so callers can handle the failure.
- Around line 827-830: The delete of the sandbox directory is currently ignoring
errors (std::fs::remove_dir_all(&dev_dir) is assigned to _) which can leave a
partial/unchanged dev_dir; change this to handle the Result: call
std::fs::remove_dir_all(&dev_dir) and on Err(e) either log the error with
eprintln! (including dev_dir and e) and return early (propagate the error from
the current function, e.g., return Err(e.into())) or otherwise bail out, so the
code does not continue using a partially-deleted directory; update the block
around rebuild, dev_dir, and remove_dir_all to match this error-handling
approach.
- Around line 178-190: The current flow in managed.rs calls WorkersFile::load()
then modifies in-memory and calls WorkersFile::save(), which causes a TOCTOU
race when concurrent `iii-worker managed add` commands run; modify the
implementation so WorkersFile operations acquire an exclusive file lock around
read-modify-write (e.g., use fs2::FileExt::lock_exclusive on the underlying
file) or implement save as an atomic write-then-rename while holding an
exclusive lock to prevent concurrent loads from racing. Specifically, update
WorkersFile::load and/or WorkersFile::save (and the usage in the add flow in
managed.rs where workers_file.add_worker(...) is invoked) to open the backing
file with std::fs::OpenOptions, acquire an exclusive lock, read current
contents, apply add_worker, write to a temporary file and rename (or write
directly while locked), then release the lock before returning; ensure errors
from locking are propagated back to managed.rs so the existing error handling
still reports failures.
---
Nitpick comments:
In `@crates/iii-worker/src/cli/managed.rs`:
- Around line 625-651: The detect_lan_ip function is macOS-specific (uses `route
-n get default` and `ifconfig`) and should be guarded or made cross-platform:
either add a #[cfg(target_os = "macos")] on detect_lan_ip (and provide a stub
returning None for other targets) or implement a Linux branch that runs `ip
route` / `ip addr` to discover the default interface and its inet address;
update the function (and any callers) so the macOS implementation remains in
`detect_lan_ip` and Linux logic is used when cfg(target_os = "linux") to avoid
silent failures on non-macOS systems.
- Around line 271-282: The code logs a warning when
super::firmware::download::ensure_libkrunfw().await fails but then separately
checks super::worker_manager::libkrun::libkrun_available(), causing confusing
flow; change the handling so that any Err from ensure_libkrunfw() is treated as
a fatal/returnable error (e.g., log an error with the captured error and
immediately return 1) or else have ensure_libkrunfw() return a definitive
availability result that you use instead of calling libkrun_available()
separately; update the branch around ensure_libkrunfw() and libkrun_available()
(references: ensure_libkrunfw, libkrun_available) to either early-return on
failure or consolidate into a single availability check with a clear error log.
- Around line 372-377: The _follow parameter in handle_managed_logs (signature:
pub async fn handle_managed_logs(worker_name: &str, _follow: bool, _address:
&str, _port: u16) -> i32) is unused; either implement follow-mode log streaming
or remove the unused flag. To fix, decide whether to support --follow: if yes,
un-prefix follow, wire the boolean into the log-fetching logic (e.g., poll or
attach a streaming client and tail new log lines until interrupted) in the body
of handle_managed_logs; if no, remove the parameter (and any CLI flag wiring) or
update the help/documentation to state follow is not supported. Ensure
references to worker_name, _address, and _port are preserved when adding
streaming behavior and return appropriate exit codes from handle_managed_logs.
- Around lines 395-419: Extract the duplicated "read file, take last N lines"
logic into a helper function (e.g., fn tail_lines(path: &std::path::Path, n:
usize) -> Option<String>) that reads the file with std::fs::read_to_string,
returns None for empty or missing files, computes the start index with
lines.len().saturating_sub(n), and returns the last n lines joined as a String.
Then replace the three repeated blocks that read stdout_path and stderr_path
and print via println!/eprintln! with calls to tail_lines(stdout_path, 100) /
tail_lines(stderr_path, 100), setting found_content on Some(text) and printing
the returned text (use eprintln! for stderr) so the shared logic is
centralized.
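The suggested helper is straightforward with the standard library alone; a sketch following the review's own spec (the `tail_lines` name and behavior come from the comment above):

```rust
use std::path::Path;

// Read `path` and return the last `n` lines joined with '\n';
// None if the file is missing or empty, per the review's suggestion.
fn tail_lines(path: &Path, n: usize) -> Option<String> {
    let contents = std::fs::read_to_string(path).ok()?;
    if contents.is_empty() {
        return None;
    }
    let lines: Vec<&str> = contents.lines().collect();
    let start = lines.len().saturating_sub(n);
    Some(lines[start..].join("\n"))
}
```

Each of the three call sites then collapses to something like `if let Some(text) = tail_lines(stdout_path, 100) { found_content = true; println!("{text}"); }`, with `eprintln!` for the stderr path.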
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 0b531255-6f6f-4fff-9508-db1caee7f964
📒 Files selected for processing (7)
- crates/iii-filesystem/src/backends/passthroughfs/builder.rs
- crates/iii-filesystem/tests/filesystem_integration.rs
- crates/iii-init/tests/init_integration.rs
- crates/iii-network/tests/network_integration.rs
- crates/iii-worker/src/cli/managed.rs
- crates/iii-worker/src/cli/worker_manager/libkrun.rs
- crates/iii-worker/tests/worker_integration.rs
✅ Files skipped from review due to trivial changes (3)
- crates/iii-filesystem/tests/filesystem_integration.rs
- crates/iii-init/tests/init_integration.rs
- crates/iii-worker/tests/worker_integration.rs
🚧 Files skipped from review as they are similar to previous changes (3)
- crates/iii-network/tests/network_integration.rs
- crates/iii-filesystem/src/backends/passthroughfs/builder.rs
- crates/iii-worker/src/cli/worker_manager/libkrun.rs
let mut workers_file = WorkersFile::load().unwrap_or_default();
workers_file.add_worker(
    name.clone(),
    WorkerDef {
        image: image_ref,
        env: HashMap::new(),
        resources,
    },
);
if let Err(e) = workers_file.save() {
    eprintln!("{} Failed to save iii.workers.yaml: {}", "error:".red(), e);
    return 1;
}
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if there's any file locking mechanism in WorkersFile
rg -n 'lock|flock|atomic' crates/iii-worker/src/cli/worker_manager/state.rs

Repository: iii-hq/iii
Length of output: 36

🏁 Script executed:
# Find where WorkersFile is defined
rg -n "struct WorkersFile|impl WorkersFile" --type rust

Length of output: 311

🏁 Script executed:
# Check the actual implementation of load and save methods
fd "worker" --type f --name "*.rs" | head -20

Length of output: 304

🏁 Script executed:
# Understand the file context and structure
cat -n crates/iii-worker/src/cli/managed.rs | head -50

Length of output: 2062

🏁 Script executed:
# Read the WorkersFile implementation in state.rs
cat -n crates/iii-worker/src/cli/worker_manager/state.rs | head -100

Length of output: 3762

🏁 Script executed:
# Search for the load and save implementation
rg -A 20 "fn load\(\)" crates/iii-worker/src/cli/worker_manager/state.rs

Length of output: 822

🏁 Script executed:
# Search for save implementation
rg -A 20 "fn save\(" crates/iii-worker/src/cli/worker_manager/state.rs

Length of output: 675

🏁 Script executed:
# Check if there are any documented issues or workarounds in managed.rs
rg -B 5 -A 5 "race\|TOCTOU\|concurrent\|lock" crates/iii-worker/src/cli/managed.rs

Length of output: 36

🏁 Script executed:
# Search for file locking patterns elsewhere in the codebase (fs2, fcntl, etc.)
rg "lock_exclusive|try_lock|flock" --type rust

Length of output: 332

🏁 Script executed:
# Check if there's any TODO or FIXME comments around WorkersFile
rg -B 3 -A 3 "TODO|FIXME" crates/iii-worker/src/cli/worker_manager/state.rs

Length of output: 36
TOCTOU race condition on the workers file: concurrent add commands lose data.
WorkersFile::load() (using std::fs::read_to_string()) and save() (using std::fs::write()) have no synchronization. When two iii-worker managed add commands run concurrently, both load the file, each modifies its in-memory structure, and the second save() overwrites the first, losing one worker definition.
Implement file locking (e.g., fs2::FileExt::lock_exclusive) or atomic write-with-rename to prevent concurrent modification.
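As a sketch of the atomic-write option, a write-then-rename can be built on the standard library alone. `save_atomic` is a hypothetical helper (WorkersFile's real serialization is not shown here):

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Hypothetical helper: write `contents` to a sibling temp file, flush,
// then atomically rename over `path`. Readers never observe a half-written
// file, though two concurrent writers can still lose one update unless an
// exclusive lock is also held around the read-modify-write.
fn save_atomic(path: &Path, contents: &str) -> std::io::Result<()> {
    let tmp = path.with_extension("yaml.tmp");
    {
        let mut f = fs::File::create(&tmp)?;
        f.write_all(contents.as_bytes())?;
        f.sync_all()?; // ensure data reaches disk before the rename
    }
    fs::rename(&tmp, path) // atomic on POSIX when on the same filesystem
}
```

For full read-modify-write safety across processes, this still needs an exclusive lock (e.g. fs2::FileExt::lock_exclusive, as the review suggests) held from load through rename.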
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/iii-worker/src/cli/managed.rs` around lines 178 - 190: the current
flow calls WorkersFile::load(), modifies the structure in memory, and then calls
WorkersFile::save(), which causes a TOCTOU race when concurrent `iii-worker
managed add` commands run. Make the WorkersFile operations acquire an exclusive
file lock around the read-modify-write (e.g., use fs2::FileExt::lock_exclusive
on the underlying file), or implement save as an atomic write-then-rename while
holding an exclusive lock, so concurrent loads cannot race. Specifically, update
WorkersFile::load and/or WorkersFile::save (and the usage in the add flow in
managed.rs where workers_file.add_worker(...) is invoked) to open the backing
file with std::fs::OpenOptions, acquire an exclusive lock, read the current
contents, apply add_worker, write to a temporary file and rename (or write
directly while locked), then release the lock before returning. Propagate
locking errors back to managed.rs so the existing error handling still reports
failures.
Summary
Introduce microVM-based worker sandboxing using libkrun for secure, isolated function execution. This PR adds a complete sandbox infrastructure with four new crates (iii-worker, iii-filesystem, iii-network, iii-init) and refactors the engine CLI to delegate worker management to the standalone worker binary.
Key changes
- iii-worker crate: Standalone worker binary with lazy firmware provisioning (libkrunfw), VM boot lifecycle, and managed worker state. Provides CLI commands for worker start/stop/dev/logs.
- iii-filesystem crate: Passthrough filesystem backend for mounting host directories into the microVM guest, with inode management, directory operations, and platform-specific abstractions.
- iii-network crate: User-space networking stack for the sandbox: virtual device, DNS proxy, TCP/UDP connection tracking, and a smoltcp-based network stack.
- iii-init crate: Minimal init process for the microVM guest: mounts filesystems, configures networking, sets resource limits, and supervises the user workload.
- worker_manager module removed from the engine CLI, replaced with a lightweight managed_shim that delegates to the iii-worker binary. Improved logging with tracing instead of eprintln!.
Type of Change
Checklist
Additional Context
This is the foundation for running user-submitted functions in isolated microVMs via libkrun. The architecture separates concerns into dedicated crates:
The worker binary lazily provisions firmware (libkrunfw) by checking for embedded binaries, local cache, or downloading from GitHub releases. Docker images for Node.js and Python provide the root filesystem layers for the sandbox VMs.
Summary by CodeRabbit
New Features
Chores
Documentation