Skip to content

Convert polyglot SDK validation to C# unit tests (#16701)#16965

Draft
radical wants to merge 10 commits into
microsoft:mainfrom
radical:ankj/16701-polyglot-sdk-unit-tests
Draft

Convert polyglot SDK validation to C# unit tests (#16701)#16965
radical wants to merge 10 commits into
microsoft:mainfrom
radical:ankj/16701-polyglot-sdk-unit-tests

Conversation

@radical
Copy link
Copy Markdown
Member

@radical radical commented May 11, 2026

Fixes #16701.

What changed

Replaces the polyglot-validation.yml GitHub Actions harness (5 sibling jobs running per-language Dockerfiles + shell scripts in .github/workflows/polyglot-validation/) with C# test classes inside tests/Aspire.Cli.EndToEnd.Tests. Each polyglot language now has its own test class running in its own Docker container under the standard tests runsheet, so polyglot validation gets the same .trx, summary, rerun, and quarantine integration as every other test in the repo.

New test classes (all under Aspire.Cli.EndToEnd.Tests)

Class Mirrors Behavior
PythonPolyglotTests test-python.sh aspire init --language pythonaspire add Aspire.Hosting.Redis → write modified apphost.py → background aspire run → poll docker ps for the Redis container
PythonCodegenValidationTests test-python-playground.sh scaffold AppHost, add Redis + SqlServer, aspire restore, validate .modules/aspire_app.py exists and python3 py_compile passes
GoPolyglotTests test-go.sh analogous to Python; apphost.go
GoCodegenValidationTests test-go-playground.sh analogous; go build -buildvcs=false ./...
RustPolyglotTests test-rust.sh analogous; edits src/main.rs (the apphost.rs marker file is left untouched). Quarantined.

Java (JavaPolyglotTests, JavaCodegenValidationTests) and TypeScript (TypeScriptPolyglotTests, TypeScriptCodegenValidationTests) already lived in this project — unchanged.

Reused mechanism (the "we already have support for running them in separate containers" the issue refers to)

  • CliE2ETestHelpers.CreateDockerTestTerminal(... variant: DockerfileVariant.X) per test → per-class container.
  • New variants PolyglotPython, PolyglotGo, PolyglotRust added to DockerfileVariant. They extend aspire-e2e-polyglot-base (same pattern as the pre-existing PolyglotJava) and live in tests/Shared/Docker/Dockerfile.e2e-polyglot-{python,go,rust}. Python uses apt-get install python3 + uv; Go and Rust multi-stage COPY their toolchains from the official golang:1 / rust:1 images so we get current stable versions.
  • Aspire.Cli.EndToEnd.Tests.csproj already declares SplitTestsOnCI=true, RequiresCliArchive=true, RequiresNugets=true, TestArchiveTestsDir, etc., so each new class becomes its own runsheet entry, runs on its own GitHub runner, and uploads its own .trx / asciinema recording.
  • New ASPIRE_E2E_POLYGLOT_{PYTHON,GO,RUST}_IMAGE env vars + matching REQUIRE_* gates mirror the existing Java prebuilt-image plumbing.

Workflow changes

  • .github/workflows/build-cli-e2e-image.yml: builds and uploads the 3 new images alongside the existing dotnet / polyglot-base / polyglot-java tarballs.
  • .github/workflows/run-tests.yml: downloads each new image gated by contains(inputs.testShortName, 'Python' | 'Go' | 'Rust') (mirroring the Java gate). Passes the matching --require-python / --require-go / --require-rust flags to the image loader.
  • eng/scripts/load-cli-e2e-images.sh: new --require-python / --require-go / --require-rust modes; load the tarballs and export the env vars the test harness consumes.
  • .github/workflows/reproduce-flaky-tests.yml: downloads all 3 new images so the reproducer can target any polyglot test class.
  • Deleted: .github/workflows/polyglot-validation.yml and the entire .github/workflows/polyglot-validation/ directory (5 Dockerfiles + 9 shell scripts + helpers). Removed the polyglot_validation invocation from .github/workflows/tests.yml and its mention in the Fail if any dependency failed gate.
  • Updated docs/ci/TestingOnCI.md and docs/ci/ci-pipeline-optimizations.md to reflect the change and document the new "Polyglot SDK validation tests" section. Updated AutoRerunTransientCiFailuresTests so the deleted workflow no longer appears in the workflow-text expectations.

Behavior change to be aware of

The legacy Rust SDK Validation job was continue-on-error: true — it ran on every PR and reported a non-blocking failure. RustPolyglotTests is decorated with [QuarantinedTest("…/issues/16701")], which means it does not run on PRs and only runs in the scheduled Quarantined Tests workflow. The xUnit framework has no exact equivalent of "runs on PR but never blocks." If maintainers want to preserve the PR-time non-blocking Rust signal, the alternative is to not quarantine and instead manage failure budget through another mechanism — happy to switch if desired.

What you'll see in the PR check list

  • Removed: Python SDK Validation, Go SDK Validation, Java SDK Validation, Rust SDK Validation, TypeScript SDK Validation, Polyglot Validation Results (6 jobs).
  • Added: per-class entries under the regular tests runsheet — l: Cli.EndToEnd-PythonPolyglotTests, l: Cli.EndToEnd-PythonCodegenValidationTests, l: Cli.EndToEnd-GoPolyglotTests, l: Cli.EndToEnd-GoCodegenValidationTests. Plus the already-existing Java / TypeScript polyglot class entries.
  • Quarantined (not on PR): Cli.EndToEnd-RustPolyglotTests. Runs in Quarantined Tests.

Running locally

Docker required. From the repo root:

./build.sh /p:SkipNativeBuild=true   # one-time
dotnet test --project tests/Aspire.Cli.EndToEnd.Tests/Aspire.Cli.EndToEnd.Tests.csproj \
  --no-launch-profile -- \
  --filter-class "*.PythonPolyglotTests" \
  --filter-not-trait "quarantined=true" --filter-not-trait "outerloop=true"

Substitute *.GoPolyglotTests, *.GoCodegenValidationTests, *.PythonCodegenValidationTests, or (with --filter-trait quarantined=true) *.RustPolyglotTests. Each test builds its Docker image on first run (subsequent runs reuse it).

Why this matches the issue

We should convert these tests into unit tests in C#. If we put them in separate classes then we already have support for running them in separate containers. We can pick the custom runner for the tests too from the project file. This would bring these tests into the same framework as others with generating .trx files, summary generation, and reruns. Each SDK has a different docker image (some share a base).

  • ✅ Unit tests in C#: PythonPolyglotTests, PythonCodegenValidationTests, GoPolyglotTests, GoCodegenValidationTests, RustPolyglotTests.
  • ✅ Separate classes → separate containers: SplitTestsOnCI=true + DockerfileVariant per class.
  • ✅ Custom runner from the project file: the existing Aspire.Cli.EndToEnd.Tests.csproj properties (SplitTestsOnCI, RequiresCliArchive, TestArchiveTestsDir, XunitRunnerJson).
  • .trx, summary, reruns: inherited from the standard tests pipeline.
  • ✅ Each SDK has a different docker image, some share a base: Dockerfile.e2e-polyglot-{java,python,go,rust} all extend aspire-e2e-polyglot-base.

cc @sebastienros @radical

The polyglot SDK validation that lived under .github/workflows/polyglot-validation.yml
ran the Python/Go/Java/Rust/TypeScript AppHost smoke and codegen checks as standalone
GitHub Actions jobs invoking per-language Dockerfiles + shell scripts. The harness
produced no .trx, no rerun support, no summary, no quarantine integration; Rust was
hardcoded continue-on-error: true.

Aspire.Cli.EndToEnd.Tests already runs each test class in its own Docker container
(via CliE2ETestHelpers.CreateDockerTestTerminal + DockerfileVariant), and Java +
TypeScript polyglot tests already lived there. The project sets SplitTestsOnCI=true
so every class becomes its own runsheet entry under tests_requires_cli_archive.

Fill the Python/Go/Rust gap inside Aspire.Cli.EndToEnd.Tests, then decommission the
parallel shell harness:

- Add tests/Shared/Docker/Dockerfile.e2e-polyglot-{python,go,rust} layered on the
  shared aspire-e2e-polyglot-base image, mirroring Dockerfile.e2e-polyglot-java.
  Python installs python3 + uv; Go and Rust multi-stage COPY their toolchains from
  the official golang:1 / rust:1 images.
- Extend CliE2ETestHelpers with PolyglotPython/PolyglotGo/PolyglotRust variants and
  matching ASPIRE_E2E_POLYGLOT_{PYTHON,GO,RUST}_IMAGE env vars; generalize the
  polyglot-base prebuild gate so all extended variants build the base first.
- Add EnableExperimentalPython/Go/RustSupportAsync helpers next to the existing
  EnableExperimentalJavaSupportAsync.
- Add Python/Go/Rust polyglot smoke tests and Python/Go codegen validation tests.
  The smoke tests run aspire run in the background and poll docker ps for a Redis
  container via a shared PolyglotRedisAssertions helper, preserving the Redis
  container assertion that the old shell scripts had (vs the weaker prompt-only
  signal in the existing Java/TypeScript polyglot tests). The codegen tests follow
  the existing JavaCodegenValidationTests / TypeScriptCodegenValidationTests
  scenario pattern: scaffold an AppHost, add two integrations, aspire restore,
  validate the generated .modules sources compile.
- RustPolyglotTests is marked [QuarantinedTest] to preserve the prior
  continue-on-error: true semantics (non-PR-blocking). Note: this is a behavior
  change vs the old workflow, where the Rust job ran on every PR and reported a
  non-blocking failure; under the new harness the Rust test runs only in the
  scheduled Quarantined Tests workflow.
- Wire the new prebuilt images through build-cli-e2e-image.yml, run-tests.yml,
  reproduce-flaky-tests.yml, and eng/scripts/load-cli-e2e-images.sh (new
  --require-python/--require-go/--require-rust flags). Per-class download is
  gated by contains(testShortName, 'Python'|'Go'|'Rust') to mirror the existing
  Java gating.
- Delete .github/workflows/polyglot-validation.yml and the associated
  .github/workflows/polyglot-validation/ Dockerfiles and shell scripts. Remove
  the polyglot_validation job from tests.yml and update the workflow-rerun
  expectations test plus the TestingOnCI / ci-pipeline-optimizations docs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 11, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 16965

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 16965"

@github-actions
Copy link
Copy Markdown
Contributor

Re-running the failed jobs in the CI workflow for this pull request because 1 job was identified as retry-safe transient failures in the CI run attempt.
GitHub was asked to rerun all failed jobs for that attempt, and the rerun is being tracked in the rerun attempt.
The job links below point to the failed attempt jobs that matched the retry-safe transient failure rules.

radical and others added 6 commits May 11, 2026 21:55
…te tightening, GHA cache

Apply review findings from the GPT-5.5 + Opus 4.7 dual-model pre-merge review on microsoft#16965.

1. Codegen fixture enumeration (review finding #1, both reviewers, High):
   The deleted test-{python,go,java,typescript}-playground.sh scripts iterated every
   tests/PolyglotAppHosts/<Integration>/<Language>/ fixture (50 Python, 49 Go, 50 Java,
   50 TypeScript) and ran 'aspire restore --apphost' + per-language compile against
   each, catching codegen regressions specific to individual integrations (Kafka,
   MongoDB, every Aspire.Hosting.Azure.*, etc.). The initial replacement only ran
   one Redis+SqlServer scenario per language and dropped that per-integration coverage.

   Add a RestoreAndCompileAllValidationFixtures [Fact] method to each of the four
   *CodegenValidationTests classes (Python/Go/Java/TypeScript). Each new test mounts
   tests/PolyglotAppHosts read-only into the test container via additionalVolumes,
   then runs a shared bash loop (PolyglotFixtureValidation.RunFixtureLoopAsync) that
   copies each fixture to a writable temp dir, runs 'aspire restore', and runs the
   per-language compile (py_compile / go build / javac @sources.txt / npm install +
   tsc --noEmit). The loop aggregates per-fixture pass/fail and exits non-zero if any
   fixture fails or if zero fixtures are found.

   Because SplitTestsOnCI=true splits at the class level, the new [Fact]s share a
   runner with the existing single-scenario tests rather than spinning up four new
   runners.

2. Redis pulls from Docker Hub (review finding #2, GPT-5.5, Medium):
   The deleted bash smoke scripts called .with_image_registry('netaspireci.azurecr.io')
   on the Redis resource to avoid Docker Hub anonymous-pull rate limits in CI; the
   initial C# rewrite omitted the override. Re-add it in
   PythonPolyglotTests/GoPolyglotTests/RustPolyglotTests so CI continues to pull
   Redis from the Aspire CI registry mirror.

3. Per-image gates were case-insensitive substring matches (review finding #3, Opus, Medium):
   GitHub Actions 'contains()' is case-insensitive, so
   'contains(testShortName, "Go")' matched every KubernetesDeployWithMongoDB* /
   KubernetesDeployWithPostgres* class, downloading the Go image (and load-ing it
   into docker) for every unrelated job. The Java gate had the same problem against
   JavaScriptPublishTests. Replace the four 'contains()' gates in run-tests.yml with
   explicit endsWith() enumerations of the polyglot test classes that actually need
   each image (PolyglotTests, CodegenValidationTests, plus JavaEmptyAppHostTemplateTests).
   The --require-* flags passed to load-cli-e2e-images.sh use the same gating.

4. New polyglot images skipped GHA buildx cache (review finding #4, Opus, Medium):
   build_extended_polyglot_image used plain 'DOCKER_BUILDKIT=1 docker build' with no
   --cache-from / --cache-to, so the multi-stage 'FROM golang:1' / 'FROM rust:1'
   stages re-pulled ~800MB from Docker Hub on every CI build. Route the helper
   through 'docker buildx build --load --cache-from type=gha,scope=...
   --cache-to type=gha,scope=...,mode=max,ignore-error=true' with per-image cache
   scopes (cli-e2e-polyglot-{java,python,go,rust}). The existing Java helper is
   migrated to the same pattern.

Review findings deferred to follow-up (preserve existing bash-script behavior; not
regressions introduced by this PR):

 - #6 'docker ps | grep redis' false-positive risk on developer machines.
 - #7 SIGKILL of aspire run bypasses graceful cleanup.
 - #8 [QuarantinedTest] URL on RustPolyglotTests points at the feature issue.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ceful shutdown, un-quarantine Rust

- #6 (Medium): the Redis-container poll in PolyglotRedisAssertions used
  'docker ps | grep -i redis', which matches any container containing 'redis'
  in its image or name. On developer machines (where personal Redis containers
  are common) or on a runner that has an orphaned Redis from a prior [Fact]
  on the same class, that would produce a false positive. Tighten the regex
  to match the Aspire CI registry mirror path that #2 added: only containers
  whose image starts with netaspireci.azurecr.io/...redis are accepted.

- #7 (Medium): the legacy bash scripts SIGKILLed 'aspire run' the moment
  Redis was detected, which prevented DCP from tearing down the spawned
  Redis container. The host-side container then leaked, polluting subsequent
  [Fact]s on the same runner (now relevant because #6 narrowed detection to
  the netaspireci image, so a leaked container would still match). Replace
  the SIGKILL with SIGINT + a 30-second graceful wait, falling back to
  SIGKILL only if the AppHost fails to shut down cleanly.

- #5 + #8: the legacy polyglot-validation.yml Rust SDK Validation job was
  marked 'continue-on-error: true' as a paranoid safety net. Inspecting recent
  main CI runs (6/6 most recent: success) shows the job has been green
  consistently. The [QuarantinedTest] attribute also pointed at the feature
  issue (microsoft#16701) rather than a flaky-test tracking issue (which is the
  convention elsewhere in the repo), so de-quarantining is the right move:
  run RustPolyglotTests as a normal PR-blocking test. If it ever regresses
  the failure will be visible and tractable; we can quarantine then with a
  proper failure-tracking issue.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Collapse the four near-duplicate build helper functions
(build_image / build_java_image / build_extended_polyglot_image /
build_with_mirror_retry / build_extended_polyglot_with_mirror_retry)
into a single build_image function that takes a display name,
dockerfile, cache scope, source-build flag, and one-or-more tags,
and handles the apt-mirror fallback internally.

Drive the four extended polyglot images (java, python, go, rust)
from a small pipe-delimited table instead of separate calls so the
variant-derived names (tag, cache scope, dockerfile, tarball) live
in one place.

Inline the previously env: block constants — they were only
referenced once each after this refactor, so the indirection
obscured rather than clarified.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Continue the variant-derived refactor of the CLI E2E image plumbing so
the set of polyglot variants is declared in as few places as possible.

build-cli-e2e-image.yml
- Replace the 4-entry pipe-delimited `local_extended` array with a
  trivial `for lang in java python go rust` loop. Every other field
  (display name, dockerfile path, image tag, cache scope, tarball name)
  derives from the language token.

run-tests.yml / reproduce-flaky-tests.yml
- Use actions/download-artifact's pattern + merge-multiple features.
  reproduce-flaky-tests.yml (downloads all six images) collapses six
  steps into one wildcard 'cli-e2e-*-image' step that picks up new
  variants automatically. run-tests.yml collapses the always-required
  dotnet + polyglot base into a single 'cli-e2e-{dotnet,polyglot}-image'
  brace-glob step; the four language-gated steps stay separate because
  each has a different endsWith() condition on testShortName.

eng/scripts/load-cli-e2e-images.sh
- Declare a single VARIANTS list (dotnet, polyglot, polyglot-{java,
  python,go,rust}) and drive arg parsing, env-var derivation, and
  image loading off it. Variant-derived names replace 6× hardcoded
  tarball/tag/env-var triples.
- Rename CLI flags so the flag suffix matches the variant exactly:
  --require-java -> --require-polyglot-java (and similarly for python,
  go, rust). dotnet and polyglot are unchanged.
- Adopt a generic --opt val / --opt=val argument splitter, replacing
  the 6 hand-rolled per-flag case branches.
- Use indirect variable names (printf -v + ${!var}) rather than an
  associative array so the script stays portable to bash 3.2 (macOS
  default). Memory note about `declare -A` not being available there
  is the reason for this choice.
- Fix a latent bug surfaced by the rewrite: `exit 2` inside
  $(normalize_mode …) only kills the subshell, so an invalid
  --require-X bogus silently became empty and fell through. Now the
  normalize_mode failure is captured and propagated explicitly.

run-tests.yml load step
- Replace the 4 needs_java/needs_python/needs_go/needs_rust bash
  variables with 'auto' mode on the corresponding --require-polyglot-*
  flags. The loader treats a missing tarball as 'not required', which
  exactly mirrors the per-language download gates above the load
  step. This removes the duplication between download-step gates and
  load-step gates.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The 'cli-e2e-builder' buildx instance created in this workflow uses the
docker-container driver, which runs BuildKit inside a container with
its own image store separate from the host Docker daemon. When the
polyglot base is built with 'docker buildx build --load', the
resulting image is exported to the host Docker daemon's image store
but the tag is NOT registered in the buildkit container's internal
store.

So a naked 'FROM aspire-e2e-polyglot-base' in the extended polyglot
dockerfiles (Java/Python/Go/Rust) fails:

  #3 [internal] load metadata for docker.io/library/aspire-e2e-polyglot-base:latest
  #3 ERROR: pull access denied, repository does not exist or may
     require authorization: server message: insufficient_scope:
     authorization failed

This regression was introduced by commit 4846d7a, which switched
the extended polyglot builds from 'DOCKER_BUILDKIT=1 docker build'
(which uses the host daemon's BuildKit and sees locally-loaded
images) to 'docker buildx build --load' (which goes through the
container builder). The pre-regression code on origin/main still
uses 'docker build' for these images, which is why main's scheduled
Quarantined Tests run is green while every PR run on this branch
since 4846d7a has failed at this step.

Use --build-context with the 'docker-image://' scheme to tell buildx
to resolve 'aspire-e2e-polyglot-base' via the host Docker daemon
(which is where the base now lives). Drive it through an optional
EXTRA_BUILD_ARGS array so build_image stays a single generic helper
and only the extended-polyglot loop opts into the override.

Keeps the GHA layer cache for the heavy upstream 'golang:1' /
'rust:1' COPY stages that 4846d7a was trying to enable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
radical and others added 2 commits May 12, 2026 02:36
Two unrelated fixes from PR review:

1. Restore SKIP_SOURCE_BUILD=true for the polyglot base image build.

   The build-cli-e2e-image.yml refactor introduced a per-call skip_source
   parameter but passed 'false' for the polyglot base image, flipping the
   pre-refactor behavior. That triggers the in-container 'dotnet build'
   path in Dockerfile.e2e-polyglot-base, which fails because .dockerignore
   strips .git/ from the build context and arcade can no longer infer
   RepositoryUrl. CI installs the CLI from prebuilt artifacts anyway —
   the in-container source build is only useful for local docker iteration.

   Extended polyglot Dockerfiles FROM aspire-e2e-polyglot-base have no
   source-build stage of their own, so SKIP_SOURCE_BUILD is a no-op there;
   passing 'true' for symmetry to make intent explicit.

2. Quarantine RustPolyglotTests (microsoft#16974).

   Matches the prior polyglot-validation.yml Rust SDK Validation job's
   continue-on-error: true safety net. Docs at TestingOnCI.md already
   describe this; the test code was the only divergence.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tainer

Extended polyglot Dockerfiles use 'FROM aspire-e2e-polyglot-base', which is
loaded into the host docker daemon by the polyglot-base build step (--load).
The cli-e2e-builder buildx instance uses the docker-container driver, whose
internal image store is isolated from the host daemon. A naked FROM lookup
inside that container therefore falls back to docker.io/library/... and 401s,
and the --build-context 'docker-image://aspire-e2e-polyglot-base:latest'
workaround is still resolved by buildkit as a remote registry reference, not
a host-daemon lookup, so it 401s identically.

Switch the extended polyglot loop to the default docker driver (host daemon)
where the base tag is already present. This matches origin/main's pattern
(its build_java_image used 'DOCKER_BUILDKIT=1 docker build' for the same
reason) and drops the GHA cache for these images; that's cheap because they
are thin language-toolchain layers on top of the already-cached base.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Java polyglot image's apt-get step took 24 minutes in CI run
25718644577 because openjdk-25-jdk pulls in the full GUI stack (GTK,
mesa, X11, fonts) as transitive dependencies — 135 packages totalling
219 MB. Combined with intermittent Azure Ubuntu mirror slowdowns this
blew past the 45-minute job timeout before the Go/Rust images could
even start.

Switch to openjdk-25-jdk-headless, which drops the GUI dependency tree.
The polyglot Java tests compile a Java AppHost (server code) and need
no AWT/Swing/JavaFX at all — verified by grepping the test sources.

Also bump the build-cli-e2e-image.yml job timeout from 45 to 60 minutes
as a safety margin for slow apt mirror days; this is the canonical knob
called out by GitHub's reusable-workflow timeout-minutes pattern.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@sebastienros
Copy link
Copy Markdown
Contributor

I need to do changes in some tests scripts, do you intend to merge this soon?

@radical
Copy link
Copy Markdown
Member Author

radical commented May 14, 2026

I need to do changes in some tests scripts, do you intend to merge this soon?

Go ahead with your changes. I'll get to this next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[tests] Convert the polyglot SDK validation tests into unit tests

2 participants