Flaky: ResourceCommand_FailsWhenInteractionServiceIsRequired times out on aspire stop with DCP container still running #17485

@mitchdenny


Failing test

Aspire.Cli.EndToEnd.Tests.ResourceCommandTests.ResourceCommand_FailsWhenInteractionServiceIsRequired

Source: tests/Aspire.Cli.EndToEnd.Tests/ResourceCommandTests.cs

Failure signature

The test exercises the full sequence: trigger a resource command that requires IInteractionService, assert the expected error, then run aspire stop. The final step times out at the Hex1b automation level:

Step 72 of 72 failed — WaitUntilText(" stopped successfully.")
  Timed out after 00:01:00 waiting for: text " stopped successfully." to appear
  at CliE2EAutomatorHelpers.cs:84

The terminal recording shows that aspire stop does not silently hang — it actively prints a failure after spinning for ~60s:

⠳ Stopping apphost.cs...
❌ Failed to stop apphost.cs.
📄 See logs at /root/.aspire/logs/cli_<id>.log
🔍 See AppHost logs at /root/.aspire/logs/cli_<id>_detach-child_<id>.log

The post-step docker container ls in the same CI job shows the redis container the AppHost owns is still up at the moment of failure:

CONTAINER ID   IMAGE          STATUS         NAMES
5aa9f3938387   redis:latest   Up 7 seconds   cache-xuhbbbzq

So ProcessShutdownService.StopProcessesAsync exhausted its graceful + force-kill + monitor window without DCP finishing the container teardown, returned false, and StopCommand emitted FailedToStopAppHost. The Hex1b WaitUntilText(" stopped successfully.") then times out as a downstream symptom.
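To illustrate why an explicit failure message still surfaces as a timeout, here is a minimal sketch of a wait-for-text loop. It is not the Hex1b implementation: `wait_until_text`, the `snapshots` input and the one-second poll interval are hypothetical names and values chosen for illustration.

```python
def wait_until_text(snapshots, needle, timeout_s, poll_s=1.0):
    """Return True once `needle` appears on screen, False when the budget runs out.

    `snapshots` is an iterable of terminal-screen strings, one per poll. It is a
    hypothetical stand-in for the real terminal automation.
    """
    elapsed = 0.0
    for screen in snapshots:
        if needle in screen:
            return True
        elapsed += poll_s
        if elapsed >= timeout_s:
            return False
    return False


# aspire stop spins for a while and then reports a failure; the success text
# never appears, so the wait can only end by exhausting its timeout.
screens = ["⠳ Stopping apphost.cs..."] * 59 + ["❌ Failed to stop apphost.cs."] * 10
timed_out = not wait_until_text(screens, " stopped successfully.", timeout_s=60)
```

Because the loop only looks for the success text, the "Failed to stop" line is invisible to it, which is why the test report shows a timeout rather than the underlying stop failure.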

Why this is a flake, not a deterministic failure

Observed on PR #17452, run 26425521402:

Attempt   ResourceCommandTests
1         ❌ fails (same shape as above)
2         ✅ passes
3         ❌ fails (same shape)

PR #17452 only touches CLI init / channel / scaffolding code paths (InitCommand, PackageChannel, GuestAppHostProject, ProjectUpdater, ScaffoldingService); none of those are on the aspire stop / ProcessShutdownService / backchannel / DCP container shutdown path, so this is not a regression introduced by that PR — it surfaces a pre-existing intermittent shutdown-timing flake.

Suspected root cause

This test is the only ResourceCommandTests case that combines:

  • mountDockerSocket: true → runs a real redis container under DCP
  • a deliberately-failing resource command (IInteractionService unavailable) immediately before aspire stop

ProcessShutdownService.StopProcessesAsync (see src/Aspire.Cli/Processes/ProcessShutdownService.cs) currently uses:

  • s_processTerminationTimeout = 10s for the post-graceful-shutdown monitor
  • followed by ProcessSignaler.ForceKill on the AppHost process tree
  • followed by another MonitorProcessesForTerminationAsync pass

Under Docker-in-Docker contention on the GitHub ubuntu-latest runner, DCP's container teardown can outlast that budget, so the AppHost process does not exit and the CLI prints FailedToStopAppHost.
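The staged sequence above can be modelled on a simulated clock. This is a sketch under stated assumptions, not the code in ProcessShutdownService.cs: `stop_processes`, `SimulatedAppHost` and the 0.5s poll interval are hypothetical, the second monitor pass is assumed to reuse the same 10s window, and the simulated force kill is a no-op because the issue does not establish why the process tree outlives it.

```python
from typing import Callable


class SimulatedAppHost:
    """Hypothetical AppHost that exits only once DCP container teardown finishes."""

    def __init__(self, exits_at_s: float) -> None:
        self.exits_at_s = exits_at_s

    def is_alive(self, now: float) -> bool:
        return now < self.exits_at_s

    def force_kill(self, now: float) -> None:
        # Assumption: in the failing runs the kill did not make the tree exit
        # within the budget, so this model leaves the exit time unchanged.
        pass


def stop_processes(
    is_alive: Callable[[float], bool],
    force_kill: Callable[[float], None],
    termination_timeout_s: float = 10.0,  # mirrors s_processTerminationTimeout
    poll_s: float = 0.5,                  # hypothetical poll interval
) -> bool:
    """Graceful-shutdown monitor, then force kill, then a second monitor pass."""
    now = 0.0

    def monitor(deadline: float) -> bool:
        nonlocal now
        while now < deadline:
            if not is_alive(now):
                return True
            now += poll_s
        return not is_alive(now)

    if monitor(now + termination_timeout_s):
        return True
    force_kill(now)
    return monitor(now + termination_timeout_s)


fast = SimulatedAppHost(exits_at_s=8.0)   # teardown fits inside the first window
slow = SimulatedAppHost(exits_at_s=45.0)  # teardown outlasts both windows
fast_stopped = stop_processes(fast.is_alive, fast.force_kill)
slow_stopped = stop_processes(slow.is_alive, slow.force_kill)
```

In this model the slow case returns False, which corresponds to the path where StopCommand emits FailedToStopAppHost.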

The companion test ResourceCommand_FailedExecution_DisplaysAppHostLogPathAndLogContainsEntries exercises a similar shape (redis + failing resource command + aspire stop) and has so far passed on the same runs — but it almost certainly shares the same underlying risk.

Speculative follow-up (separate from quarantine)

Consider whether ProcessShutdownService should extend its monitor window when the AppHost owns DCP-managed containers, or whether StopCommand should pass a longer per-call cancellation timeout in container-heavy scenarios. Not in scope for this issue.
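One possible shape for such an extension, as a rough sketch only: scale the monitor window with the number of DCP-managed containers and cap it. `monitor_window_s`, the 5s per-container allowance and the 60s cap are hypothetical; only the 10s base comes from the current s_processTerminationTimeout.

```python
def monitor_window_s(
    container_count: int,
    base_s: float = 10.0,          # current s_processTerminationTimeout
    per_container_s: float = 5.0,  # hypothetical extra allowance per container
    cap_s: float = 60.0,           # hypothetical upper bound on the window
) -> float:
    """Return a monitor window that grows with the container count, up to a cap."""
    if container_count < 0:
        raise ValueError("container_count must be non-negative")
    return min(base_s + per_container_s * container_count, cap_s)


no_containers = monitor_window_s(0)
one_container = monitor_window_s(1)
many_containers = monitor_window_s(100)
```

The cap keeps aspire stop from waiting indefinitely when teardown is genuinely stuck rather than merely slow.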

Artifacts

Labels: area-cli, flaky-test, quarantined-test (Quarantined tests that run only in the Outerloop Tests workflow), triage:bot-seen (Aspire triage bot has seen this issue)
