Skip to content

chore: experiment with Remote Build Execution (RBE) @ Namespace#10579

Draft
basvandijk wants to merge 117 commits into
masterfrom
basvandijk/namespace-bazel-remote-execution
Draft

chore: experiment with Remote Build Execution (RBE) @ Namespace#10579
basvandijk wants to merge 117 commits into
masterfrom
basvandijk/namespace-bazel-remote-execution

Conversation

@basvandijk

@basvandijk basvandijk commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

Overview

Experiment with Bazel's Remote Build Execution (RBE) @ Namespace: run bazel test on Namespace runners, with actions executed on remote Namespace workers booted from a custom worker image (a mirror of ic-build).

Changes

Extended: .github/workflows/container-autobuild.yml

  • New rbe-worker-image job (on the Namespace runner where nsc is pre-authenticated): mirrors the freshly built ic-build image — by digest — into the Namespace tenant registry ($NSC_CONTAINER_REGISTRY, i.e. nscr.io/<tenant>/ic-build-worker) via nsc base-image upload, resolves the pushed digest, and optimizes it for RBE via nsc base-image optimize. Same fork guard as above.
  • It is treated as a required job, like ic-build-image: if it fails, that's a bug to fix.
  • The existing update-image-references job now also pins the worker image. It needs rbe-worker-image (so its commit/push — which re-triggers the workflow and would otherwise cancel the in-flight optimize via cancel-in-progress — only happens once the optimize completes), and rewrites the pinned ref in rbe-namespace-test.yml in the same commit as the ic-build/ic-dev ref and TAG updates. The pin sed is tenant-agnostic, so it self-corrects if the Namespace tenant ever changes.

New workflow: .github/workflows/rbe-namespace-test.yml

  • Runs bazel test on a Namespace runner (namespace-profile-amd64-linux-32x64) using RBE. Targets default to //... and are overridable via a workflow_dispatch input.
  • Provisions the RBE cluster with nsc bazel execution setup (writes a bazelrc with the remote executor, cache and credentials — deliberately not printed, since it contains short-lived credentials).
  • Routes actions to the custom worker image via --remote_default_exec_properties=container-image=....
  • Bypasses the DFINITY-internal Bazel cache/RE config (--noworkspace_rc + explicit .bazelrc.build + the Namespace RBE bazelrc), mirroring the existing bazel-test-arm64 job.
  • Excludes long/nightly/fuzz/large-system tests via --test_tag_filters and runs with --keep_going.
  • Opt-in while experimental: workflow_dispatch, pushes to dev-gh-*, or non-fork PRs labeled CI_RBE. Restricted to dfinity/ic; fork PRs are excluded because the job runs on a privileged Namespace runner with pre-authenticated nsc.

Notes / follow-ups

  • The worker-image ref in the test workflow starts as a placeholder digest; it is populated on the next ic-build rebuild (bump ci/container/TAG).
  • Because rbe-worker-image is required, a Namespace/RBE outage would block the production ic-build/ic-dev reference bump — a conscious trade-off (both must succeed).
  • Expect rough edges running //... under RBE (ic-os local-strategy targets, privileged/system tests); the broad --test_tag_filters exclusions and --keep_going reduce noise while iterating.

@basvandijk basvandijk requested a review from Copilot June 26, 2026 12:28
@basvandijk basvandijk changed the title Experiment with Bazel Remote Execution (BRE) on Namespace chore: experiment with Bazel Remote Execution (BRE) on Namespace Jun 26, 2026
@github-actions github-actions Bot added the chore label Jun 26, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces an experimental GitHub Actions path to run bazel test using Namespace Bazel Remote Execution (BRE), including automation to build/mirror/optimize a worker image and keep the workflow pinned to an immutable digest.

Changes:

  • Adds a new experimental workflow to run bazel test on Namespace runners with remote execution enabled.
  • Extends the container autobuild workflow with jobs to mirror ic-build into nscr.io, optimize it for BRE, and automatically update the pinned worker-image digest used by the new workflow.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
.github/workflows/container-autobuild.yml Adds jobs to create an optimized Namespace BRE worker image and auto-update the pinned digest reference in workflows.
.github/workflows/bre-namespace-test.yml New opt-in workflow to run bazel test using Namespace remote execution with a pinned worker image.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread .github/workflows/bre-namespace-test.yml Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Comment thread .github/workflows/container-autobuild.yml Outdated
Comment thread .github/workflows/container-autobuild.yml Outdated
Comment thread .github/workflows/container-autobuild.yml Outdated
Comment thread .github/workflows/bre-namespace-test.yml Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Add a new 'bre-namespace-test.yml' workflow that runs 'bazel test' on Namespace runners using Bazel Remote Execution (BRE). Actions execute on Namespace workers booted from a custom worker image (a mirror of ic-build).

Extend 'container-autobuild.yml' to mirror the freshly built ic-build image into nscr.io, optimize it for BRE, and pin the resulting digest. These jobs are decoupled from the production image-reference update so an early-access BRE failure can never block it.
Addresses Copilot review: 'nsc bazel execution setup' writes short-lived credentials into the bazelrc, so 'cat'-ing it could leak auth material into the Actions logs.
…update job

- Use the same fully-qualified nscr.io ref for the upload destination and the digest lookup, so the inspected tag is guaranteed to exist.

- Guard bre-worker-image and bre-namespace-test against fork PRs via head.repo.full_name == github.repository (matching ci-kickoff.yml), since both run on privileged Namespace runners with pre-authenticated nsc.

- Drop update-image-references from update-worker-reference's needs for true decoupling; the existing 'git pull --rebase' absorbs any concurrent push.
@basvandijk basvandijk force-pushed the basvandijk/namespace-bazel-remote-execution branch from 27f3f10 to 95ee442 Compare June 28, 2026 12:15
…ac895bdd550cac7bacb9dad553bae

ic-build: sha256:f4c6c7e0e16da470cba7ebceb0145f588d5fd4859c04acfa607bee475ecfa914

ic-dev:   sha256:2f98d344d708a1ae70938d5e777a1f141f7f2a9545687653f407a405eb1a27ea
@github-actions

github-actions Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Run URL: https://github.com/dfinity/ic/actions/runs/28675533597

New container images with tag: 8daf75b044f16ab4395e9746071e3d091157dac66460efb58bd61345c7a376cd
ic-build: sha256:050045f83ecb0037c4fade65e322aac4d1e22e080f70c18d4c84504bba0acccc
ic-dev: sha256:ece8b162493b06a7e59642c24a1a8b27d1a05ad002c1e1241d0d8abc1e8d2ef8
ic-build-worker: nscr.io/c9ptjuknd7oc6/ic-build-worker@sha256:65f912a41be8cf3e9247d8dc34f317ddfc11badf8dc93f137c1702d6a2cd2790

update-image-references now depends on bre-worker-image (default needs semantics) and pins the worker digest unconditionally in the same commit, dropping the cancelled()-guard and the separate update-worker-reference job. Waiting on bre-worker-image before committing/pushing avoids the concurrency cancellation seen earlier.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.

Comment thread .github/workflows/bre-namespace-test.yml Outdated
Comment thread .github/workflows/bre-namespace-test.yml Outdated
Comment thread .github/workflows/container-autobuild.yml Outdated
nsc base-image upload prepends $NSC_CONTAINER_REGISTRY (nscr.io/<tenant>) to a relative name. Passing the fully-qualified nscr.io/dfinity ref caused a double-prefixed push (nscr.io/<tenant>/nscr.io/dfinity/...) and a 401 on the digest lookup. Derive the registry from $NSC_CONTAINER_REGISTRY, upload a relative name, pin the resulting full ref, and match any tenant in the pin sed. Also fixes a stale comment referencing the removed update-worker-reference job.
…rent one

On the Namespace RBE cluster the action runs as root inside a nested
user namespace but in a network namespace owned by an ancestor userns.
Capabilities are relative to the userns owning a resource, so root's
CAP_NET_ADMIN does not apply there and every RTNETLINK operation of the
local backend fails with 'Operation not permitted' -- even in a
privileged action container. Internally this never bites because
bazel's linux-sandbox gives each _local test a fresh netns owned by the
sandbox's userns.

Recreate what linux-sandbox provides: before the parent driver process
spawns the tokio runtime and any task subprocess, probe whether the
current netns is administrable and, if not, unshare(CLONE_NEWNET) into
a fresh netns owned by our userns (where root holds full CAP_NET_ADMIN)
and bring lo up. All descendants (tasks, libvirtd, QEMU, dnsmasq)
inherit it. The local backend needs no external connectivity, so an
isolated netns loses nothing.
Add the ic-test-utilities-privileges crate exposing run_as_nobody_if_root,
which runs a test closure in a forked child that drops to the nobody user when
the process is root, and in-process otherwise.

Root holds CAP_DAC_OVERRIDE and therefore bypasses filesystem permission bits,
so tests asserting PermissionDenied instead observe the operation succeeding and
fail when run as root (e.g. on Bazel remote-execution workers). The helper forks,
redirects TMPDIR/TEST_TMPDIR to a nobody-owned base, drops supplementary groups,
gid, and uid to nobody, and runs the whole test body there; the child's outcome
(including any panic message) is mirrored to the parent so #[should_panic] and
its expected message keep working.

Use it in the affected permission tests in rs/sys, the crypto service provider,
and rs/state_manager.
nix 0.24.3 configures `setgroups` out on Apple targets, so the `imp`
module failed to compile on darwin. The helper's semantics (the Linux
`nobody` uid 65534 and CAP_DAC_OVERRIDE) are Linux-specific anyway, so
gate `imp` on `target_os = "linux"` and fall through to the in-process
no-op on the other platforms.
The tests test_ensure_file_exists_and_is_writeable_fails_if_non_writeable and
test_save_proposal_id_to_file_fails_if_write_fails assert PermissionDenied on a
read-only file. Root holds CAP_DAC_OVERRIDE and bypasses filesystem permission
bits, so under Bazel remote-execution (running as root) the operations succeed
and the assertions fail.

Wrap both test bodies in ic_test_utilities_privileges::run_as_nobody_if_root so
they drop to the nobody user when the process is root, mirroring the treatment
applied in 380255a.
basvandijk and others added 12 commits July 2, 2026 10:45
* Terminate the forked child with _exit(2) instead of std::process::exit,
  which would run atexit(3) handlers and stdio flushing inherited from the
  (potentially multi-threaded) parent test process.

* Make the child's temp base nobody-owned with mode 0700 instead of
  root-owned world-writable 0777, matching the documentation and avoiding a
  predictable world-writable directory created as root. Create it with
  create_dir rather than create_dir_all so a pre-existing directory of
  unknown ownership is rejected.

* Make the fork path safe in a multi-threaded test process without forcing
  --test-threads=1 on whole test suites: the temp base and putenv(3)
  NAME=VALUE strings are prepared in the parent before forking, and the
  child no longer calls std::env::set_var (whose global environment lock
  another parent thread could have held at fork time), using putenv with
  the pre-allocated strings instead.

* Add smoke tests for the helper, including a concurrency regression test
  that forks from multiple threads; verified as root, where they exercise
  the fork + privilege-drop path.
Requested in review: each of the 12 wrapped tests now carries a short
comment explaining that root bypasses file permission bits
(CAP_DAC_OVERRIDE), so the permission denial the test expects only
happens when dropping to nobody on root-run workers (e.g. Bazel remote
execution).
Requested in review: wrapping whole test bodies in a closure pollutes them
with extra code and indentation. Add an attribute that does the wrapping
instead.

Proc-macros must live in their own crate, so the attribute is defined in the
new ic-test-utilities-privileges-macros crate and re-exported from
ic-test-utilities-privileges (the serde/serde_derive facade pattern, like
canlog re-exports canlog_derive); consumers keep a single dependency. The
attribute composes with #[test] and #[should_panic] (the wrapper re-raises
the child's panic message) and rejects async fns and non-unit return types
at compile time.

Convert all 12 call sites to the attribute form and extend the smoke tests
with attribute-form cases; verified as root, where the permission-denied
tests only pass because the attribute drops privileges to nobody.
The merge auto-combined HEAD's closure form (run_as_nobody_if_root(|| {...}))
with the incoming branch's #[as_nobody_when_root] attribute, double-wrapping
the permission-denied tests. Take the incoming branch's final attribute-only
form for the affected test files.
The RBE Namespace job crashed linking big system-test binaries
(e.g. //rs/tests/consensus/tecdsa:tecdsa_signature_from_other_subnet_test_bin)
with the zig linker child dying with SIGABRT:

  abnormal exit: child_process.ChildProcess.Term{ .Signal = 6 }

On Namespace RBE workers every action runs in its own execroot
(.../execroots/action-<id>) but shares the worker's /tmp/zig-cache, so all
concurrently executing actions fall into the same 'config-catchall' shard of
the zig cache. zig 0.12's cache is racy under concurrent writers
(ziglang/zig#23993,
uber/hermetic_cc_toolchain#224) which is what the
existing cache-dir workaround mitigates for foreign_cc configure builds. The
master merge (rules_rust 0.71.3) invalidated every Rust action at once, so
this run was the first to execute thousands of cold zig links concurrently on
freshly provisioned workers, maximizing the race window.

Extend the zig-wrapper's cache sharding to key the cache on the per-action
execroot segment when the working directory contains an 'execroots' path
segment. Invocations from subdirectories of one action (e.g. foreign_cc
configure scripts) still share that action's cache, and local execution
(no 'execroots' segment) is unaffected. Cold-cache cost is only paid by link
actions (~17s); compile-only zig invocations barely use the cache.
basvandijk and others added 11 commits July 3, 2026 14:24
Replace the large comment block above the bazel invocation with a
bazel_args array where each flag (or group of flags) is documented at
its definition.
…7dac66460efb58bd61345c7a376cd

ic-build: sha256:050045f83ecb0037c4fade65e322aac4d1e22e080f70c18d4c84504bba0acccc

ic-dev:   sha256:ece8b162493b06a7e59642c24a1a8b27d1a05ad002c1e1241d0d8abc1e8d2ef8

ic-build-worker: nscr.io/c9ptjuknd7oc6/ic-build-worker@sha256:65f912a41be8cf3e9247d8dc34f317ddfc11badf8dc93f137c1702d6a2cd2790
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

chore CI_RBE Trigger the bazel-test-rbe job to run bazel test via Remote Build Execution @ Namespace

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants