Convert polyglot SDK validation to C# unit tests (#16701)#16965
Draft
radical wants to merge 10 commits into
Draft
Convert polyglot SDK validation to C# unit tests (#16701)#16965radical wants to merge 10 commits into
radical wants to merge 10 commits into
Conversation
The polyglot SDK validation that lived under .github/workflows/polyglot-validation.yml
ran the Python/Go/Java/Rust/TypeScript AppHost smoke and codegen checks as standalone
GitHub Actions jobs invoking per-language Dockerfiles + shell scripts. The harness
produced no .trx, no rerun support, no summary, no quarantine integration; Rust was
hardcoded continue-on-error: true.
Aspire.Cli.EndToEnd.Tests already runs each test class in its own Docker container
(via CliE2ETestHelpers.CreateDockerTestTerminal + DockerfileVariant), and Java +
TypeScript polyglot tests already lived there. The project sets SplitTestsOnCI=true
so every class becomes its own runsheet entry under tests_requires_cli_archive.
Fill the Python/Go/Rust gap inside Aspire.Cli.EndToEnd.Tests, then decommission the
parallel shell harness:
- Add tests/Shared/Docker/Dockerfile.e2e-polyglot-{python,go,rust} layered on the
shared aspire-e2e-polyglot-base image, mirroring Dockerfile.e2e-polyglot-java.
Python installs python3 + uv; Go and Rust multi-stage COPY their toolchains from
the official golang:1 / rust:1 images.
- Extend CliE2ETestHelpers with PolyglotPython/PolyglotGo/PolyglotRust variants and
matching ASPIRE_E2E_POLYGLOT_{PYTHON,GO,RUST}_IMAGE env vars; generalize the
polyglot-base prebuild gate so all extended variants build the base first.
- Add EnableExperimentalPython/Go/RustSupportAsync helpers next to the existing
EnableExperimentalJavaSupportAsync.
- Add Python/Go/Rust polyglot smoke tests and Python/Go codegen validation tests.
The smoke tests run aspire run in the background and poll docker ps for a Redis
container via a shared PolyglotRedisAssertions helper, preserving the Redis
container assertion that the old shell scripts had (vs the weaker prompt-only
signal in the existing Java/TypeScript polyglot tests). The codegen tests follow
the existing JavaCodegenValidationTests / TypeScriptCodegenValidationTests
scenario pattern: scaffold an AppHost, add two integrations, aspire restore,
validate the generated .modules sources compile.
- RustPolyglotTests is marked [QuarantinedTest] to preserve the prior
continue-on-error: true semantics (non-PR-blocking). Note: this is a behavior
change vs the old workflow, where the Rust job ran on every PR and reported a
non-blocking failure; under the new harness the Rust test runs only in the
scheduled Quarantined Tests workflow.
- Wire the new prebuilt images through build-cli-e2e-image.yml, run-tests.yml,
reproduce-flaky-tests.yml, and eng/scripts/load-cli-e2e-images.sh (new
--require-python/--require-go/--require-rust flags). Per-class download is
gated by contains(testShortName, 'Python'|'Go'|'Rust') to mirror the existing
Java gating.
- Delete .github/workflows/polyglot-validation.yml and the associated
.github/workflows/polyglot-validation/ Dockerfiles and shell scripts. Remove
the polyglot_validation job from tests.yml and update the workflow-rerun
expectations test plus the TestingOnCI / ci-pipeline-optimizations docs.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
|
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 16965Or
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 16965" |
Contributor
|
Re-running the failed jobs in the CI workflow for this pull request because 1 job was identified as retry-safe transient failures in the CI run attempt.
|
sebastienros
approved these changes
May 12, 2026
…te tightening, GHA cache Apply review findings from the GPT-5.5 + Opus 4.7 dual-model pre-merge review on microsoft#16965. 1. Codegen fixture enumeration (review finding #1, both reviewers, High): The deleted test-{python,go,java,typescript}-playground.sh scripts iterated every tests/PolyglotAppHosts/<Integration>/<Language>/ fixture (50 Python, 49 Go, 50 Java, 50 TypeScript) and ran 'aspire restore --apphost' + per-language compile against each, catching codegen regressions specific to individual integrations (Kafka, MongoDB, every Aspire.Hosting.Azure.*, etc.). The initial replacement only ran one Redis+SqlServer scenario per language and dropped that per-integration coverage. Add a RestoreAndCompileAllValidationFixtures [Fact] method to each of the four *CodegenValidationTests classes (Python/Go/Java/TypeScript). Each new test mounts tests/PolyglotAppHosts read-only into the test container via additionalVolumes, then runs a shared bash loop (PolyglotFixtureValidation.RunFixtureLoopAsync) that copies each fixture to a writable temp dir, runs 'aspire restore', and runs the per-language compile (py_compile / go build / javac @sources.txt / npm install + tsc --noEmit). The loop aggregates per-fixture pass/fail and exits non-zero if any fixture fails or if zero fixtures are found. Because SplitTestsOnCI=true splits at the class level, the new [Fact]s share a runner with the existing single-scenario tests rather than spinning up four new runners. 2. Redis pulls from Docker Hub (review finding #2, GPT-5.5, Medium): The deleted bash smoke scripts called .with_image_registry('netaspireci.azurecr.io') on the Redis resource to avoid Docker Hub anonymous-pull rate limits in CI; the initial C# rewrite omitted the override. Re-add it in PythonPolyglotTests/GoPolyglotTests/RustPolyglotTests so CI continues to pull Redis from the Aspire CI registry mirror. 3. Per-image gates were case-insensitive substring matches (review finding #3, Opus, Medium): GitHub Actions 'contains()' is case-insensitive, so 'contains(testShortName, "Go")' matched every KubernetesDeployWithMongoDB* / KubernetesDeployWithPostgres* class, downloading the Go image (and load-ing it into docker) for every unrelated job. The Java gate had the same problem against JavaScriptPublishTests. Replace the four 'contains()' gates in run-tests.yml with explicit endsWith() enumerations of the polyglot test classes that actually need each image (PolyglotTests, CodegenValidationTests, plus JavaEmptyAppHostTemplateTests). The --require-* flags passed to load-cli-e2e-images.sh use the same gating. 4. New polyglot images skipped GHA buildx cache (review finding #4, Opus, Medium): build_extended_polyglot_image used plain 'DOCKER_BUILDKIT=1 docker build' with no --cache-from / --cache-to, so the multi-stage 'FROM golang:1' / 'FROM rust:1' stages re-pulled ~800MB from Docker Hub on every CI build. Route the helper through 'docker buildx build --load --cache-from type=gha,scope=... --cache-to type=gha,scope=...,mode=max,ignore-error=true' with per-image cache scopes (cli-e2e-polyglot-{java,python,go,rust}). The existing Java helper is migrated to the same pattern. Review findings deferred to follow-up (preserve existing bash-script behavior; not regressions introduced by this PR): - #6 'docker ps | grep redis' false-positive risk on developer machines. - #7 SIGKILL of aspire run bypasses graceful cleanup. - #8 [QuarantinedTest] URL on RustPolyglotTests points at the feature issue. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ceful shutdown, un-quarantine Rust - #6 (Medium): the Redis-container poll in PolyglotRedisAssertions used 'docker ps | grep -i redis', which matches any container containing 'redis' in its image or name. On developer machines (where personal Redis containers are common) or on a runner that has an orphaned Redis from a prior [Fact] on the same class, that would produce a false positive. Tighten the regex to match the Aspire CI registry mirror path that #2 added: only containers whose image starts with netaspireci.azurecr.io/...redis are accepted. - #7 (Medium): the legacy bash scripts SIGKILLed 'aspire run' the moment Redis was detected, which prevented DCP from tearing down the spawned Redis container. The host-side container then leaked, polluting subsequent [Fact]s on the same runner (now relevant because #6 narrowed detection to the netaspireci image, so a leaked container would still match). Replace the SIGKILL with SIGINT + a 30-second graceful wait, falling back to SIGKILL only if the AppHost fails to shut down cleanly. - #5 + #8: the legacy polyglot-validation.yml Rust SDK Validation job was marked 'continue-on-error: true' as a paranoid safety net. Inspecting recent main CI runs (6/6 most recent: success) shows the job has been green consistently. The [QuarantinedTest] attribute also pointed at the feature issue (microsoft#16701) rather than a flaky-test tracking issue (which is the convention elsewhere in the repo), so de-quarantining is the right move: run RustPolyglotTests as a normal PR-blocking test. If it ever regresses the failure will be visible and tractable; we can quarantine then with a proper failure-tracking issue. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Collapse the four near-duplicate build helper functions (build_image / build_java_image / build_extended_polyglot_image / build_with_mirror_retry / build_extended_polyglot_with_mirror_retry) into a single build_image function that takes a display name, dockerfile, cache scope, source-build flag, and one-or-more tags, and handles the apt-mirror fallback internally. Drive the four extended polyglot images (java, python, go, rust) from a small pipe-delimited table instead of separate calls so the variant-derived names (tag, cache scope, dockerfile, tarball) live in one place. Inline the previously env: block constants — they were only referenced once each after this refactor, so the indirection obscured rather than clarified. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Continue the variant-derived refactor of the CLI E2E image plumbing so
the set of polyglot variants is declared in as few places as possible.
build-cli-e2e-image.yml
- Replace the 4-entry pipe-delimited `local_extended` array with a
trivial `for lang in java python go rust` loop. Every other field
(display name, dockerfile path, image tag, cache scope, tarball name)
derives from the language token.
run-tests.yml / reproduce-flaky-tests.yml
- Use actions/download-artifact's pattern + merge-multiple features.
reproduce-flaky-tests.yml (downloads all six images) collapses six
steps into one wildcard 'cli-e2e-*-image' step that picks up new
variants automatically. run-tests.yml collapses the always-required
dotnet + polyglot base into a single 'cli-e2e-{dotnet,polyglot}-image'
brace-glob step; the four language-gated steps stay separate because
each has a different endsWith() condition on testShortName.
eng/scripts/load-cli-e2e-images.sh
- Declare a single VARIANTS list (dotnet, polyglot, polyglot-{java,
python,go,rust}) and drive arg parsing, env-var derivation, and
image loading off it. Variant-derived names replace 6× hardcoded
tarball/tag/env-var triples.
- Rename CLI flags so the flag suffix matches the variant exactly:
--require-java -> --require-polyglot-java (and similarly for python,
go, rust). dotnet and polyglot are unchanged.
- Adopt a generic --opt val / --opt=val argument splitter, replacing
the 6 hand-rolled per-flag case branches.
- Use indirect variable names (printf -v + ${!var}) rather than an
associative array so the script stays portable to bash 3.2 (macOS
default). Memory note about `declare -A` not being available there
is the reason for this choice.
- Fix a latent bug surfaced by the rewrite: `exit 2` inside
$(normalize_mode …) only kills the subshell, so an invalid
--require-X bogus silently became empty and fell through. Now the
normalize_mode failure is captured and propagated explicitly.
run-tests.yml load step
- Replace the 4 needs_java/needs_python/needs_go/needs_rust bash
variables with 'auto' mode on the corresponding --require-polyglot-*
flags. The loader treats a missing tarball as 'not required', which
exactly mirrors the per-language download gates above the load
step. This removes the duplication between download-step gates and
load-step gates.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The 'cli-e2e-builder' buildx instance created in this workflow uses the docker-container driver, which runs BuildKit inside a container with its own image store separate from the host Docker daemon. When the polyglot base is built with 'docker buildx build --load', the resulting image is exported to the host Docker daemon's image store but the tag is NOT registered in the buildkit container's internal store. So a naked 'FROM aspire-e2e-polyglot-base' in the extended polyglot dockerfiles (Java/Python/Go/Rust) fails: #3 [internal] load metadata for docker.io/library/aspire-e2e-polyglot-base:latest #3 ERROR: pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed This regression was introduced by commit 4846d7a, which switched the extended polyglot builds from 'DOCKER_BUILDKIT=1 docker build' (which uses the host daemon's BuildKit and sees locally-loaded images) to 'docker buildx build --load' (which goes through the container builder). The pre-regression code on origin/main still uses 'docker build' for these images, which is why main's scheduled Quarantined Tests run is green while every PR run on this branch since 4846d7a has failed at this step. Use --build-context with the 'docker-image://' scheme to tell buildx to resolve 'aspire-e2e-polyglot-base' via the host Docker daemon (which is where the base now lives). Drive it through an optional EXTRA_BUILD_ARGS array so build_image stays a single generic helper and only the extended-polyglot loop opts into the override. Keeps the GHA layer cache for the heavy upstream 'golang:1' / 'rust:1' COPY stages that 4846d7a was trying to enable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two unrelated fixes from PR review: 1. Restore SKIP_SOURCE_BUILD=true for the polyglot base image build. The build-cli-e2e-image.yml refactor introduced a per-call skip_source parameter but passed 'false' for the polyglot base image, flipping the pre-refactor behavior. That triggers the in-container 'dotnet build' path in Dockerfile.e2e-polyglot-base, which fails because .dockerignore strips .git/ from the build context and arcade can no longer infer RepositoryUrl. CI installs the CLI from prebuilt artifacts anyway — the in-container source build is only useful for local docker iteration. Extended polyglot Dockerfiles FROM aspire-e2e-polyglot-base have no source-build stage of their own, so SKIP_SOURCE_BUILD is a no-op there; passing 'true' for symmetry to make intent explicit. 2. Quarantine RustPolyglotTests (microsoft#16974). Matches the prior polyglot-validation.yml Rust SDK Validation job's continue-on-error: true safety net. Docs at TestingOnCI.md already describe this; the test code was the only divergence. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tainer Extended polyglot Dockerfiles use 'FROM aspire-e2e-polyglot-base', which is loaded into the host docker daemon by the polyglot-base build step (--load). The cli-e2e-builder buildx instance uses the docker-container driver, whose internal image store is isolated from the host daemon. A naked FROM lookup inside that container therefore falls back to docker.io/library/... and 401s, and the --build-context 'docker-image://aspire-e2e-polyglot-base:latest' workaround is still resolved by buildkit as a remote registry reference, not a host-daemon lookup, so it 401s identically. Switch the extended polyglot loop to the default docker driver (host daemon) where the base tag is already present. This matches origin/main's pattern (its build_java_image used 'DOCKER_BUILDKIT=1 docker build' for the same reason) and drops the GHA cache for these images; that's cheap because they are thin language-toolchain layers on top of the already-cached base. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Java polyglot image's apt-get step took 24 minutes in CI run 25718644577 because openjdk-25-jdk pulls in the full GUI stack (GTK, mesa, X11, fonts) as transitive dependencies — 135 packages totalling 219 MB. Combined with intermittent Azure Ubuntu mirror slowdowns this blew past the 45-minute job timeout before the Go/Rust images could even start. Switch to openjdk-25-jdk-headless, which drops the GUI dependency tree. The polyglot Java tests compile a Java AppHost (server code) and need no AWT/Swing/JavaFX at all — verified by grepping the test sources. Also bump the build-cli-e2e-image.yml job timeout from 45 to 60 minutes as a safety margin for slow apt mirror days; this is the canonical knob called out by GitHub's reusable-workflow timeout-minutes pattern. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
|
I need to do changes in some tests scripts, do you intend to merge this soon? |
Member
Author
Go ahead with your changes. I'll get to this next week. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #16701.
What changed
Replaces the
polyglot-validation.ymlGitHub Actions harness (5 sibling jobs running per-language Dockerfiles + shell scripts in.github/workflows/polyglot-validation/) with C# test classes insidetests/Aspire.Cli.EndToEnd.Tests. Each polyglot language now has its own test class running in its own Docker container under the standard tests runsheet, so polyglot validation gets the same.trx, summary, rerun, and quarantine integration as every other test in the repo.New test classes (all under
Aspire.Cli.EndToEnd.Tests)PythonPolyglotTeststest-python.shaspire init --language python→aspire add Aspire.Hosting.Redis→ write modifiedapphost.py→ backgroundaspire run→ polldocker psfor the Redis containerPythonCodegenValidationTeststest-python-playground.shaspire restore, validate.modules/aspire_app.pyexists andpython3 py_compilepassesGoPolyglotTeststest-go.shapphost.goGoCodegenValidationTeststest-go-playground.shgo build -buildvcs=false ./...RustPolyglotTeststest-rust.shsrc/main.rs(the apphost.rs marker file is left untouched). Quarantined.Java (
JavaPolyglotTests,JavaCodegenValidationTests) and TypeScript (TypeScriptPolyglotTests,TypeScriptCodegenValidationTests) already lived in this project — unchanged.Reused mechanism (the "we already have support for running them in separate containers" the issue refers to)
CliE2ETestHelpers.CreateDockerTestTerminal(... variant: DockerfileVariant.X)per test → per-class container.PolyglotPython,PolyglotGo,PolyglotRustadded toDockerfileVariant. They extendaspire-e2e-polyglot-base(same pattern as the pre-existingPolyglotJava) and live intests/Shared/Docker/Dockerfile.e2e-polyglot-{python,go,rust}. Python usesapt-get install python3+uv; Go and Rust multi-stageCOPYtheir toolchains from the officialgolang:1/rust:1images so we get current stable versions.Aspire.Cli.EndToEnd.Tests.csprojalready declaresSplitTestsOnCI=true,RequiresCliArchive=true,RequiresNugets=true,TestArchiveTestsDir, etc., so each new class becomes its own runsheet entry, runs on its own GitHub runner, and uploads its own .trx / asciinema recording.ASPIRE_E2E_POLYGLOT_{PYTHON,GO,RUST}_IMAGEenv vars + matchingREQUIRE_*gates mirror the existing Java prebuilt-image plumbing.Workflow changes
.github/workflows/build-cli-e2e-image.yml: builds and uploads the 3 new images alongside the existing dotnet / polyglot-base / polyglot-java tarballs..github/workflows/run-tests.yml: downloads each new image gated bycontains(inputs.testShortName, 'Python' | 'Go' | 'Rust')(mirroring the Java gate). Passes the matching--require-python/--require-go/--require-rustflags to the image loader.eng/scripts/load-cli-e2e-images.sh: new--require-python/--require-go/--require-rustmodes; load the tarballs and export the env vars the test harness consumes..github/workflows/reproduce-flaky-tests.yml: downloads all 3 new images so the reproducer can target any polyglot test class..github/workflows/polyglot-validation.ymland the entire.github/workflows/polyglot-validation/directory (5 Dockerfiles + 9 shell scripts + helpers). Removed thepolyglot_validationinvocation from.github/workflows/tests.ymland its mention in theFail if any dependency failedgate.docs/ci/TestingOnCI.mdanddocs/ci/ci-pipeline-optimizations.mdto reflect the change and document the new "Polyglot SDK validation tests" section. UpdatedAutoRerunTransientCiFailuresTestsso the deleted workflow no longer appears in the workflow-text expectations.Behavior change to be aware of
The legacy Rust SDK Validation job was
continue-on-error: true— it ran on every PR and reported a non-blocking failure.RustPolyglotTestsis decorated with[QuarantinedTest("…/issues/16701")], which means it does not run on PRs and only runs in the scheduledQuarantined Testsworkflow. The xUnit framework has no exact equivalent of "runs on PR but never blocks." If maintainers want to preserve the PR-time non-blocking Rust signal, the alternative is to not quarantine and instead manage failure budget through another mechanism — happy to switch if desired.What you'll see in the PR check list
Python SDK Validation,Go SDK Validation,Java SDK Validation,Rust SDK Validation,TypeScript SDK Validation,Polyglot Validation Results(6 jobs).l: Cli.EndToEnd-PythonPolyglotTests,l: Cli.EndToEnd-PythonCodegenValidationTests,l: Cli.EndToEnd-GoPolyglotTests,l: Cli.EndToEnd-GoCodegenValidationTests. Plus the already-existing Java / TypeScript polyglot class entries.Cli.EndToEnd-RustPolyglotTests. Runs inQuarantined Tests.Running locally
Docker required. From the repo root:
Substitute
*.GoPolyglotTests,*.GoCodegenValidationTests,*.PythonCodegenValidationTests, or (with--filter-trait quarantined=true)*.RustPolyglotTests. Each test builds its Docker image on first run (subsequent runs reuse it).Why this matches the issue
PythonPolyglotTests,PythonCodegenValidationTests,GoPolyglotTests,GoCodegenValidationTests,RustPolyglotTests.SplitTestsOnCI=true+DockerfileVariantper class.Aspire.Cli.EndToEnd.Tests.csprojproperties (SplitTestsOnCI,RequiresCliArchive,TestArchiveTestsDir,XunitRunnerJson)..trx, summary, reruns: inherited from the standard tests pipeline.Dockerfile.e2e-polyglot-{java,python,go,rust}all extendaspire-e2e-polyglot-base.cc @sebastienros @radical