Workflow health (9d): Deployment E2E red all window (Aca/AzureStorage); CI main red (30%, Templates flake); Outerloop red (Playwright timeout); release/13.4 latest = infra flake; PR Docs + internal main recovered; Smoke/Quarantine green #18223

@radical

🔴 Needs attention

  • 🔴 Deployment E2E: the latest nightly (06-18 04:32) is red, and the workflow has failed 8 of 9 completed nights in the window (only 06-11 passed). The 06-18 run failed AzureStorageDeploymentTests and TypeScriptExpressDeploymentTests; AcaManagedRedis (#17430) recurred earlier in the week. Per-night trackers are auto-filed; the latest open one is #18309. The pattern looks like Azure/AKS environment flakiness.
  • 🔴 CI: main (rolling) passes only 30% (26 / 87) over 9 days and the tip is red. The latest completed build (06-18 06:28) failed on the Templates legs (ubuntu + windows), a recurring flaky-template signature rather than a clean regression. The week stays red on a mix of "Docker is not running" Windows-agent flakes plus Hosting.Azure, Templates and VS Code E2E flakes. Chronic instability, no single tracker.
  • 🔴 Outerloop Tests: the latest nightly (06-18 03:08) is red on a Playwright 30s timeout in Aspire.Templates.Tests (Templates ubuntu leg, browser WebSocket … 1006 disconnect). The one-day green on 06-17 didn't hold; the failures read as flaky browser-automation timeouts.
  • 🟡 CI: the latest completed release/13.4 build (06-17 23:03, Freeze AssemblyVersion release-prep) is red, but only the Windows Hosting-4/Hosting-5 legs failed. That is the recurring "Docker is not running" runner-infra flake, not a release regression (release was 100% green on 06-16). No newer release push since.

Everything else has a green latest run and is not on fire: Quarantined Tests recovered (latest 06-18 05:33 green; the 03:04 red was a flaky shard), PR Documentation Check stayed recovered (latest green 06-18 06:28 after the daily-AI-credits guardrail fix #18307 merged), Internal main recovered to green (latest 06-18 05:30; shaky 78% week with a sharp 06-15 dip), and Smoke / Internal release/13.4 are healthy. Bot-PR-producing automation is not tracked here (see #18285); that includes the broken update-azure-vm-sizes.yml (expired AZURE_CREDENTIALS, #18305).

Today

Health of microsoft/aspire's scheduled, rolling, and automated workflows. Pass / Runs counts only completed runs that ended in success or failure; cancelled, skipped, queued, in-progress and awaiting-approval runs are excluded. Scheduled test workflows count only schedule runs on main; CI counts only rolling (push) builds, split by branch; PR Documentation Check counts its PR runs. The window is the last 9 days (06-10 → 06-18). The column-1 emoji is the 9-day health; Latest run is the most recent completed run, so a green latest can sit next to a poor weekly rate (and vice versa).

| Workflow | Pass / Runs (9d) | Latest run | What's up |
| --- | --- | --- | --- |
| 🔴 CI - main | 26 / 87 (30%) | ❌ failure | Rolling main is red far more often than green. Latest completed build (06-18 06:28) is red on the Templates legs (ubuntu + windows), a recurring flaky-template signature. The week is otherwise dominated by "Docker is not running" Windows-agent flakes plus Hosting.Azure / Templates / VS Code E2E flakes. Chronic 9-day instability. |
| 🟡 CI - release/13.4 | 3 / 7 (43%) | ❌ failure | Small sample (others cancelled by newer pushes). Latest (06-17 23:03, Freeze Aspire.TypeSystem AssemblyVersion release-prep) is red, but only the Windows Hosting-4/Hosting-5 legs failed: the recurring "Docker is not running" / runner infra flake, not a release regression. Was 100% on 06-16. |
| 🔴 Outerloop Tests | 1 / 9 (11%) | ❌ failure | Red on 8 of 9 nightlies. Latest (06-18 03:08) red on a Playwright 30s timeout in Aspire.Templates.Tests (Templates ubuntu leg, browser WebSocket … 1006 disconnect). The 06-17 green didn't hold. Failures read as flaky browser-automation timeouts rather than a workflow break. No dedicated tracker. |
| 🔴 Deployment E2E Tests | 1 / 9 (11%) | ❌ failure | Red essentially all window: 8 of 9 completed nights failed (only 06-11 passed), latest included. Latest (06-18 04:32) failed AzureStorage + TypeScriptExpress deploy legs; AcaManagedRedis recurs (#17430). Per-night trackers auto-filed; latest open #18309. Looks like Azure/AKS flakiness. |
| 🟢 Daily CLI Smoke Tests | 8 / 8 (100%) | ✅ success | Green every day. Nothing to do. (06-18 nightly not yet run.) |
| 🟢 Quarantined Tests | 67 / 97 (69%) | ✅ success | Workflow itself is healthy and the latest run (06-18 05:33) is green. The intermittent red conclusions are flaky-test shards (the tests this workflow exists to catch), not workflow breakage. No infra/setup break in the window. |
| 🟢 PR Documentation Check | 47 / 71 (66%) | ✅ success | Recovered and holding. Earlier this agentic workflow kept tripping the daily AI-credits guardrail at the activation step; the fix #18307 merged and the latest PR run (06-18 06:28) is green. Prior trackers all closed (#18301, #18286, #18235, #18267). Separate agentic failure still open: Repo Pulse #18120 (automation-broken). |

Legend: 🔴 broken / chronically failing · 🟠 stuck / needs a human · 🟡 worth a glance · 🟢 healthy · ⚪ informational.
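The Pass / Runs counting rule described above can be sketched in a few lines. This is only an illustration, not the observer's actual implementation: the function name `pass_rate`, the sample data, and the assumption that each run is represented by a GitHub Actions style conclusion string are all hypothetical.

```python
from collections import Counter

# Only these conclusions count toward Pass / Runs; cancelled, skipped,
# queued, in-progress and awaiting-approval runs are ignored.
COUNTED = {"success", "failure"}


def pass_rate(conclusions):
    """Return (passed, runs, percent) over success/failure runs only.

    percent is None when no counted run exists in the window.
    """
    counts = Counter(c for c in conclusions if c in COUNTED)
    passed = counts["success"]
    runs = passed + counts["failure"]
    percent = round(100 * passed / runs) if runs else None
    return passed, runs, percent


# Hypothetical sample: 2 successes, 3 failures, 2 excluded runs.
sample = ["success", "failure", "cancelled", "failure",
          "skipped", "success", "failure"]
print(pass_rate(sample))  # (2, 5, 40)
```

Applied to the CI main row, 26 successes and 61 failures give 26 / 87, which rounds to the 30% shown in the table.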

Internal build (Azure DevOps)

Aggregate result counts only; "partially succeeded" is treated as success. No internal job names, logs, or content are exposed.

| Build | Pass / Runs (9d) | Latest | Note |
| --- | --- | --- | --- |
| 🟡 Internal - main | 50 / 64 (78%) | ✅ partially succeeded (06-18 05:30) | Recovered to green on the latest, and the last three days have been strong (06-16/06-17/06-18 ≥ 90%). But the rolling signal is shaky: 14 hard failures over the window, with a sharp 06-15 dip (25%). Latest failed build (06-17 19:20). |
| 🟢 Internal - release/13.4 | 7 / 7 (100%) | ✅ partially succeeded (06-18 01:11) | Clean all window. |
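For the internal builds, the only difference in counting is that "partially succeeded" is folded into the pass bucket. A minimal sketch, assuming each build is represented by an Azure DevOps style result string (`succeeded`, `partiallySucceeded`, `failed`, `canceled`); the function name and the sample mix are hypothetical:

```python
# Results treated as a pass; "partiallySucceeded" counts as success here.
PASSING = {"succeeded", "partiallySucceeded"}
# Results treated as a hard failure; anything else is left out of the count.
FAILING = {"failed"}


def internal_pass_rate(results):
    """Return (passed, runs, percent), folding partial success into passed."""
    passed = sum(1 for r in results if r in PASSING)
    failed = sum(1 for r in results if r in FAILING)
    runs = passed + failed
    percent = round(100 * passed / runs) if runs else None
    return passed, runs, percent


# Hypothetical mix that reproduces the Internal main row (50 / 64, 78%).
builds = (["succeeded"] * 20 + ["partiallySucceeded"] * 30
          + ["failed"] * 14 + ["canceled"] * 3)
print(internal_pass_rate(builds))  # (50, 64, 78)
```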

📈 9-day trend

Per-day health, most recent first (emoji = that day's pass%, not a rolling aggregate)

Each cell is that day's status: per-day pass% for push/PR workflows, or the single scheduled run's pass/fail for nightly workflows. A "-" means no completed run that day. Cell emoji: 🟢 ≥80% · 🟡 40–79% · 🔴 <40%. (06-18 is only a few hours old, so some nightly jobs, e.g. Smoke, haven't run yet.)

| Day | CI main | CI rel/13.4 | Outerloop | Deploy E2E | Smoke | Quarantine | PR Docs | Int main | Int rel/13.4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 06-18 | 🔴 0% | - | 🔴 0% | 🔴 0% | - | 🟡 67% | 🟡 50% | 🟢 100% | 🟢 100% |
| 06-17 | 🟡 60% | 🔴 0% | 🟢 100% | 🔴 0% | 🟢 100% | 🟡 75% | 🟢 87% | 🟢 90% | 🟢 100% |
| 06-16 | 🟡 47% | 🟢 100% | 🔴 0% | 🔴 0% | 🟢 100% | 🟢 83% | 🟡 67% | 🟢 100% | 🟢 100% |
| 06-15 | 🟡 40% | - | 🔴 0% | 🔴 0% | 🟢 100% | 🟡 50% | 🟢 100% | 🔴 25% | 🟢 100% |
| 06-14 | 🔴 0% | - | 🔴 0% | 🔴 0% | 🟢 100% | 🟡 75% | 🟡 50% | 🟢 100% | - |
| 06-13 | 🟡 67% | - | 🔴 0% | 🔴 0% | 🟢 100% | 🟡 58% | 🟢 100% | 🟡 67% | - |
| 06-12 | 🔴 11% | 🟡 50% | 🔴 0% | 🔴 0% | 🟢 100% | 🟡 67% | 🟡 69% | 🟡 73% | - |
| 06-11 | 🔴 0% | - | 🔴 0% | 🟢 100% | 🟢 100% | 🟡 67% | 🟡 50% | 🟡 67% | - |
| 06-10 | 🔴 36% | 🔴 0% | 🔴 0% | 🔴 0% | 🟢 100% | 🟡 75% | 🟡 45% | 🟡 67% | 🟢 100% |
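The per-day cell thresholds can be expressed as a small classifier. This is a sketch under stated assumptions: the function name `trend_cell` is hypothetical, and comparing against the rounded percentage (rather than the raw ratio) is a guess at how boundary cases are handled.

```python
def trend_cell(passed, runs):
    """Format one trend cell from a day's pass and run counts.

    Returns None when the day has no completed run.
    """
    if runs == 0:
        return None
    pct = round(100 * passed / runs)
    if pct >= 80:
        emoji = "🟢"  # healthy day
    elif pct >= 40:
        emoji = "🟡"  # 40-79%
    else:
        emoji = "🔴"  # below 40%
    return f"{emoji} {pct}%"


# A nightly workflow has a single run per day, so it is either 0% or 100%.
print(trend_cell(1, 1))   # 🟢 100%
print(trend_cell(2, 5))   # 🟡 40%
print(trend_cell(4, 11))  # 🔴 36%
```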

Per-workflow detail

🔴 Deployment E2E Tests

Red 8 of 9 completed nights: AzureStorage / TypeScriptExpress / AcaManagedRedis

Failed 8 of 9 completed nightlies in the window (only 06-11 passed), including the latest completed run (27736857554, 06-18 04:32). That run failed two legs, AzureStorageDeploymentTests and TypeScriptExpressDeploymentTests; AcaManagedRedisDeploymentTests (tracked by #17430) recurred earlier in the week. The workflow auto-files a per-night "[Deployment E2E] Nightly test failure" issue; the latest open one is #18309 (06-18). The pattern looks like Azure/AKS environment flakiness rather than a single code regression.

🔴 CI - main

Chronic 9-day instability (30%), latest red on a Templates flake

Rolling main passes 26 / 87 (30%) of completed push builds over the window. The latest completed build (27741140132, 06-18 06:28) is red on the Templates legs (NewUpAndBuildSupportProjectTemplatesTests ubuntu, plus two windows template legs), a recurring flaky-template signature rather than a clean code regression.

Across the week the dominant failure signatures are the "Docker is not running" Windows-agent infra flake plus recurring Hosting.Azure, Templates and VS Code extension E2E flakes. The mix of infra and flaky-test failures keeps the rolling rate red; there is no single tracking issue.

🔴 Outerloop Tests

Latest nightly red on a Playwright timeout in Aspire.Templates.Tests

Red on 8 of 9 nightlies. The latest (27734070919, 06-18 03:08) failed the Templates (ubuntu-latest) leg with System.TimeoutException : Timeout 30000ms exceeded and a Playwright browser disconnect (WebSocket closed with status code: 1006) in Aspire.Templates.Tests. The single green nightly (06-17) didn't hold. The failures read as flaky browser-automation timeouts, not a workflow break; no dedicated tracker exists for this signature yet.

🟡 CI - release/13.4

Latest run is red, but only the Windows Hosting-4/Hosting-5 legs failed, on the recurring "Docker is not running" / runner infra flake. This is not a release regression (release was 100% on 06-16). The completed sample is small (3 / 7) because newer pushes cancel older runs.

🟢 PR Documentation Check - recovered

Latest run green; guardrail fix is holding

This agentic workflow runs per-PR and opens agentic-workflows-labelled issues when it fails. It passes 47 / 71 (66%) of its PR runs over 9 days. Earlier, several runs tripped the activation step's daily workflow token guardrail (daily AI-credits budget), but the fix #18307 (Disable daily AI credits guardrail for agentic workflows) merged and the latest completed PR run (27741139971, 06-18 06:28) is green.

No new failure tracker was filed; prior ones are consolidated and closed: #18301 (daily AI-credits budget), #18286 (daily effective workflow budget), #18235 and #18267. One unrelated agentic failure remains open and labelled automation-broken: Repo Pulse - Daily Report (#18120, open since 06-11).

🟢 Quarantined Tests

Workflow is healthy; latest run (06-18 05:33) is green. The intermittent red conclusions are the flaky tests it exists to catch, not workflow breakage. No infra/setup break in the window.

🟢 Daily CLI Smoke Tests

8 / 8 green. No action. (06-18 nightly had not run at report time.)


Report generated 2026-06-18 06:56 UTC by the workflow-health observer.
