Commit 766a6f2

fix(ci): unstick WarpBuild nightly release builds (#1183)
* fix(ci): use supported warp-windows-2022 runner label for nightly
* feat(ci): fail nightly fast when WarpBuild runners never pick up jobs
* docs(ci): WarpBuild one-time setup and stuck-queue runbook
* fix(ci): harden nightly queue watchdog per review
* docs(ci): verify runner-group contents before allowing public repos
* docs(ci): correct concurrency pending-run wording
1 parent 8afadd6 commit 766a6f2

2 files changed

Lines changed: 148 additions & 2 deletions

.github/workflows/nightly-release.yml

Lines changed: 61 additions & 1 deletion
@@ -59,7 +59,7 @@ jobs:
 
           matrix='[
             {"platform":"linux","arch":"x64","runner":"warp-ubuntu-2204-x64-32x","config":"release.linux.ci.yaml","timeout":660},
-            {"platform":"windows","arch":"x64","runner":"warp-windows-2025-x64-32x","config":"release.windows.ci.yaml","timeout":780},
+            {"platform":"windows","arch":"x64","runner":"warp-windows-2022-x64-32x","config":"release.windows.ci.yaml","timeout":780},
             {"platform":"macos","arch":"arm64","runner":"warp-macos-15-arm64-12x","config":"release.macos.arm64.ci.yaml","timeout":720}
           ]'
           if [ "$PLATFORMS" != "all" ]; then
@@ -80,6 +80,66 @@ jobs:
             echo "publish=$publish"
           } >> "$GITHUB_OUTPUT"
 
+  # WarpBuild provisions runners on demand via webhook; when none arrive
+  # (org runner-group policy, bad label, account issue) the build jobs sit
+  # queued until GitHub discards them 24h later, holding the
+  # nightly-release concurrency group all day. Fail fast instead — see the
+  # troubleshooting section of
+  # packages/browseros/build/docs/nightly-warpbuild-ci.md.
+  queue-watchdog:
+    needs: plan
+    runs-on: ubuntu-latest
+    timeout-minutes: 30
+    permissions:
+      actions: write
+    steps:
+      - name: Fail fast when no runner picks up the builds
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          set -euo pipefail
+          deadline=$(( $(date +%s) + 20 * 60 ))
+          failures=0
+          while :; do
+            if jobs="$(gh api "repos/$GITHUB_REPOSITORY/actions/runs/$GITHUB_RUN_ID/jobs?per_page=100" \
+              --jq '[.jobs[] | select(.name | startswith("build (")) | {name, status}]')"; then
+              failures=0
+            else
+              failures=$(( failures + 1 ))
+              if [ "$failures" -ge 3 ]; then
+                echo "::error::queue-watchdog could not list this run's jobs (3 consecutive failures); builds are unwatched this run."
+                exit 1
+              fi
+              sleep 120
+              continue
+            fi
+            total="$(jq 'length' <<<"$jobs")"
+            queued="$(jq '[.[] | select(.status == "queued")] | length' <<<"$jobs")"
+            in_progress="$(jq '[.[] | select(.status == "in_progress")] | length' <<<"$jobs")"
+            if [ "$total" -gt 0 ] && [ "$queued" -eq 0 ]; then
+              echo "All $total build jobs picked up by runners."
+              exit 0
+            fi
+            if [ "$(date +%s)" -ge "$deadline" ]; then
+              doc="packages/browseros/build/docs/nightly-warpbuild-ci.md"
+              if [ "$total" -eq 0 ]; then
+                echo "::error::queue-watchdog matched no 'build (' jobs — were the matrix job names changed? Fix the filter; not cancelling."
+                exit 1
+              fi
+              stuck="$(jq -r '[.[] | select(.status == "queued") | .name] | join(", ")' <<<"$jobs")"
+              # Cancel only when nothing is live: a queued job is then the
+              # only thing keeping the run (and concurrency group) pinned.
+              if [ "$in_progress" -eq 0 ]; then
+                echo "::error::No build job is running and these never left the queue: $stuck. Check the runner-group public-repo setting, runner labels, and the WarpBuild dashboard — see $doc. Cancelling the run to free the nightly-release concurrency group."
+                gh api -X POST "repos/$GITHUB_REPOSITORY/actions/runs/$GITHUB_RUN_ID/cancel"
+              else
+                echo "::error::Still queued after 20 min: $stuck. In-progress builds keep running — see $doc."
+              fi
+              exit 1
+            fi
+            sleep 120
+          done
+
   build:
     needs: plan
     strategy:
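The `queue-watchdog` step above reduces to a decision over three counts once the job list is fetched. A minimal sketch of that decision as a standalone shell function, useful for dry-running the branches without calling the GitHub API. The function name `watchdog_verdict` and its output strings are hypothetical and not part of the workflow; the `ok` case is what the real loop checks on every poll, while the other three apply only once the 20-minute deadline has passed.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the workflow): mirrors the branch order of the
# queue-watchdog step. Arguments are the counts of "build (" jobs:
#   $1 = total matched, $2 = still queued, $3 = in progress.
watchdog_verdict() {
  local total="$1" queued="$2" in_progress="$3"
  if [ "$total" -eq 0 ]; then
    # No job matched the name filter: fail, but do not cancel the run.
    echo "filter-mismatch"
  elif [ "$queued" -eq 0 ]; then
    # Every build job was picked up by a runner.
    echo "ok"
  elif [ "$in_progress" -eq 0 ]; then
    # Nothing is live: cancel to free the concurrency group.
    echo "cancel"
  else
    # Mixed case: fail loudly, let the live builds keep running.
    echo "fail-keep-running"
  fi
}
```

For example, `watchdog_verdict 3 1 2` prints `fail-keep-running`, which is the mixed case where the run must later be cancelled by hand.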

packages/browseros/build/docs/nightly-warpbuild-ci.md

Lines changed: 87 additions & 1 deletion
@@ -12,7 +12,7 @@ up here, that workflow can be retired.
 | Platform | Label | Specs | Disk |
 | --- | --- | --- | --- |
 | Linux x64 | `warp-ubuntu-2204-x64-32x` | 32 vCPU / 128 GB | 150 GB |
-| Windows x64 | `warp-windows-2025-x64-32x` | 32 vCPU / 128 GB | 256 GB |
+| Windows x64 | `warp-windows-2022-x64-32x` | 32 vCPU / 128 GB | 256 GB |
 | macOS arm64 | `warp-macos-15-arm64-12x` | M4 Pro, 12 vCPU / 44 GB | 270 GB |
 
 There is no 32-core macOS tier; 12x is WarpBuild's largest Mac. WarpBuild
@@ -21,6 +21,51 @@ apply — but `timeout-minutes` must be set explicitly (the implicit default is
 360). Linux's 150 GB disk is the tightest fit: ~60-75 GB checkout +
 ~25-40 GB out dir + OS image. The workflow prints `df -h` after each build.
 
+## One-time setup (WarpBuild)
+
+The `warpbuildbot` GitHub app is installed org-wide on `browseros-ai`
+(since 2026-06-11). Two more things must be true before any `warp-*` job
+leaves `queued`:
+
+1. **The org must allow self-hosted runners on public repos.** WarpBuild
+   runners register as org-level self-hosted runners, and GitHub blocks
+   those on public repositories by default
+   (https://www.warpbuild.com/docs/ci/public-repos). BrowserOS is public,
+   so an org admin must check: Organization Settings → Actions → Runner
+   groups → Default → "Allow public repositories". Via API (needs
+   `admin:org` scope):
+
+   ```bash
+   gh auth refresh -h github.com -s admin:org
+   gh api orgs/browseros-ai/actions/runner-groups \
+     --jq '.runner_groups[] | {id, name, allows_public_repositories}'
+   gh api -X PATCH "orgs/browseros-ai/actions/runner-groups/<id>" \
+     -F allows_public_repositories=true
+   ```
+
+   Before flipping the toggle, check what else lives in that group — it
+   widens exposure for every runner in it:
+
+   ```bash
+   gh api "orgs/browseros-ai/actions/runner-groups/<id>/runners" \
+     --jq '.runners[] | {name, status, labels: [.labels[].name]}'
+   ```
+
+   Expect only ephemeral `warp-*` runners (usually none while idle). The
+   signed-nightly Mac (`browseros-builder`) is registered at the repo
+   level, so this org-group toggle does not change its exposure. If the
+   group ever holds other persistent org-level runners, give WarpBuild a
+   dedicated runner group instead of widening Default.
+
+2. **The WarpBuild org must be active**: sign in at
+   https://app.warpbuild.com/, confirm the `browseros-ai` connection and
+   that billing/credits are set up — runners are not provisioned without
+   an active account.
+
+Smoke test after changing either:
+`gh workflow run "Nightly Release Build" -f platforms=linux`, then watch
+the build job leave `queued` within ~5 minutes (`gh run watch`).
+
 ## Per-night pipeline (per platform)
 
 1. `actions/checkout` + `astral-sh/setup-uv`.
@@ -127,3 +172,44 @@ The first run per platform is the cache warm-up; expect cold timings. If a
 pin bump lands, the next night is cold again for that version. To force a
 fresh checkout, bump the `v1` in the cache key (workflow) — for Windows also
 delete the old object under `ci-cache/chromium/` in R2.
+
+## Troubleshooting: jobs stuck in `queued`
+
+A job no runner ever picked up shows `runner_id: 0` and empty steps:
+
+```bash
+gh run view <run-id> --json jobs --jq '.jobs[] | {name, status}'
+gh api repos/browseros-ai/BrowserOS/actions/jobs/<job-id> \
+  --jq '{status, runner_id, runner_name, labels}'
+```
+
+Causes, in the order to check:
+
+1. **Runner group blocks public repos** — see one-time setup above. This
+   stalls all platforms at once.
+2. **Label not in WarpBuild's catalog** — supported images: Ubuntu
+   22.04/24.04 (x64, arm64), macOS 14/15/26 (arm64), Windows Server 2022
+   (x64) (https://www.warpbuild.com/docs/ci/preinstalled-software). An
+   unsupported label queues forever (this workflow originally shipped a
+   `windows-2025` label that WarpBuild does not image); WarpBuild reports
+   no error back to GitHub.
+3. **WarpBuild account** — org connection or billing lapsed
+   (https://app.warpbuild.com/).
+4. **WarpBuild capacity or incident** — rare; check their dashboard.
+
+Mechanics worth knowing:
+
+- GitHub discards self-hosted jobs queued for more than 24h, and the
+  workflow's `nightly-release` concurrency group
+  (`cancel-in-progress: false`) makes the next run wait (newer pending
+  runs supersede older pending ones) — one stuck night delays the next
+  by a full day (runs 27367077749 → 27407228486 did exactly this). The `queue-watchdog` job therefore steps in at the
+  20-minute mark: it cancels the run when no build job is actually
+  running (everything stuck in queue or already finished), and fails
+  loudly without cancelling while any build is in progress. In that
+  mixed case, cancel the run manually once the live builds finish — a
+  still-queued job otherwise pins the group for up to 24h with no
+  watcher left.
+- Fixing the root cause does not revive already-queued jobs: WarpBuild
+  provisions on the `workflow_job.queued` webhook, which has already
+  fired. Cancel the stuck run and re-dispatch.
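
Because an unsupported label queues silently (cause 2 in the troubleshooting list), a pre-flight check on the matrix labels could catch the mistake before dispatch. A rough sketch, assuming labels follow the `warp-<os>-<version>-<arch>-<size>` shape used in the workflow matrix and that the supported images are exactly those the runbook lists. The function name `is_supported_image` is hypothetical, and the version spellings for images not present in the matrix (`2404`, `14`, `26`) are guesses rather than confirmed WarpBuild labels.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check (not in the workflow): returns 0 when the
# label's OS image is in the runbook's supported list, 1 otherwise.
# ASSUMPTION: label shape warp-<os>-<version>-<arch>-<size>; the 2404, 14 and
# 26 spellings are inferred from the pattern, not taken from WarpBuild docs.
is_supported_image() {
  local label="$1"
  case "$label" in
    warp-ubuntu-2204-x64-*|warp-ubuntu-2204-arm64-*) return 0 ;;
    warp-ubuntu-2404-x64-*|warp-ubuntu-2404-arm64-*) return 0 ;;
    warp-macos-14-arm64-*|warp-macos-15-arm64-*|warp-macos-26-arm64-*) return 0 ;;
    warp-windows-2022-x64-*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: flag any label from a list that fails the check.
check_labels() {
  local label status=0
  for label in "$@"; do
    if ! is_supported_image "$label"; then
      echo "unsupported runner label: $label"
      status=1
    fi
  done
  return "$status"
}
```

Run against the label this commit replaces, `check_labels warp-windows-2025-x64-32x` prints `unsupported runner label: warp-windows-2025-x64-32x` and returns non-zero. The check only covers the image part of the label; it cannot confirm that a given size tier exists.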
