feat(image-resolver): pin image tags to content digests at hire time#73
Open
BAI-Yeqi wants to merge 4 commits into
Open
feat(image-resolver): pin image tags to content digests at hire time#73BAI-Yeqi wants to merge 4 commits into
BAI-Yeqi wants to merge 4 commits into
Conversation
…digests
Mutable ACR tags (``:latest`` and even most dated tags unless
content-immutability is enabled) can be re-pushed mid-session, so two
workers in the same workforce can end up running different image bytes.
Long-running schedulers will silently mix builds without pinning.
Adds the resolver as a self-contained module; wiring into ``Workforce``
is a follow-up.
Behaviour:
* Single-flight cache keyed by ``registry/repo:tag``. TTL configurable
(default 1 h). Per-process cache only -- a fresh scheduler always
re-resolves at startup, then stays on that digest for the session.
* On TTL expiry the resolver re-fetches; if the digest changed a WARNING
log names both digests so operators can spot mid-session rebuilds.
* Fail-open by default: if ACR is unreachable / 5xx / no AcrPull, falls
back to the original tag-based URI with a WARNING. Image pinning is
a quality improvement, not a correctness invariant.
* Thread-safe (one mutex around cache + fetch). A single resolver
instance is intended to be shared across all Workforces in a
MultiRegionWorkforce.
* Logging is terminal-only via ``ai4s.jobq.orchestration.image_resolver``.
ACR auth uses the standard two-step OAuth dance: AAD token ->
``/oauth2/exchange`` -> refresh token -> ``/oauth2/token`` with
``repository:{repo}:pull`` scope. Caller's principal needs AcrPull.
Tests cover tag parsing/formatting, cache hit/miss, TTL expiry,
digest-change detection, fail-open vs fail-closed, and thread safety.
Refactor + wire-in commit for the digest resolver landed in the previous
commit. Two changes in one go since both touch the same surface area:
1. Pluggable registry auth.
The manifest endpoint (``GET /v2/{repo}/manifests/{tag}``) is OCI-
standard, so the only registry-specific code is token acquisition.
Extract a ``RegistryAuth`` Protocol and three implementations:
* ``AcrAadAuth`` -- the existing ACR + AAD OAuth dance, now an
opt-in backend. The ``azure.identity`` import is deferred so
non-ACR users don't pay it.
* ``AnonymousAuth`` -- skip the Authorization header entirely; works
for public registries / public repos.
* ``BearerTokenAuth`` -- caller-supplied bearer token; useful for
tests and for callers that fetch tokens out-of-band (Docker Hub,
GHCR, ECR, GCR all fit this pattern).
``AcrAadAuth`` remains the resolver's default because the typical
user is on AML + ACR, but external users with a different registry
pass their own backend.
2. Wire into ``Workforce`` and ``MultiRegionWorkforce``.
* ``Workforce`` accepts an optional ``image_resolver`` and calls
``resolver.resolve(env.image)`` in ``_build_worker``. The
prototype's Environment is not mutated -- a shallow copy is taken
for the per-worker job so concurrent ``_build_worker`` calls don't
race. Registered AML environments (passed as strings, no
``.image`` attribute) are skipped.
* ``MultiRegionWorkforce`` accepts ``image_resolver`` and installs
it on every child workforce via ``set_image_resolver``, sharing
one cache across all regions.
Tests added for: anonymous vs bearer auth backends, ``_fetch_digest``
threading the auth header correctly, and the Workforce integration
path (rewrite happens, prototype not mutated, string envs and already-
pinned images both skip the fetch).
…change Two fixes for the 401s seen against ``msrmoldyn.azurecr.io``: 1. ``DefaultAzureCredential`` picks the host's managed identity first on Azure VMs; that MI typically lacks AcrPull on the registry the workforce is targeting. Switch the ``AcrAadAuth`` default to ``ChainedTokenCredential(AzureCliCredential, DefaultAzureCredential)`` so the operator's ``az login`` identity is preferred. Production service-account use is unaffected — pass an explicit credential to ``AcrAadAuth(credential=...)``. 2. ACR's ``/oauth2/exchange`` endpoint returns 401 on some tenants when the ``tenant`` form field is missing, even when the AAD token's audience is correct. Decode the token's ``iss`` claim to extract the tenant id and include it on the exchange POST. Smoke-tested against ``msrmoldyn.azurecr.io/materials/dft-calculation:2026-04-21``; resolves to ``sha256:cb260046...`` cleanly with the operator's az-login identity.
When the resolver rewrites ``env.image`` to a digest URI, ``_build_worker`` hands every job submission a fresh Environment object with no ARM id. The SDK's ``jobs.create_or_update`` then re-registers the inline env on every submission; concurrent ``parallel_hire`` threads compute the same content-hash and race on the workspace's ``CliV2AnonymousEnvironment/ versions/<hash>`` PUT — the losers crash with ``ResourceExistsError``. Pre-register the digest-pinned Environment once per resolved image (under a thread lock) and substitute the returned ARM id *string* for ``job.environment``. The SDK skips its per-submission registration for string envs, so all concurrent submissions reuse the same registration. ``ResourceExistsError`` from a concurrent caller (or a previous scheduler run) is recovered by parsing the version from the error and ``GET``ing the existing registration. Also extends ``_apply_resolver_to_job`` test helper and ``_bare_workforce`` fixture to cover the new register-then-substitute-ARM-id flow.
3193b4d to
e9dd399
Compare
Contributor
Author
|
Good morning @temporaer , would you like to take a look? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds
ImageDigestResolver: a pluggable component that resolves anACR image tag (
registry/repo:tag) to its immutable content digest(
registry/repo@sha256:...) once at scheduler startup and caches itwith a TTL.
WorkforceandMultiRegionWorkforceaccept anoptional
image_resolverand rewriteenv.imageat hire time soevery worker in a session pulls byte-identical image content even if
the tag is re-pushed mid-run.
Why
When a long-running scheduler hires workers across many regions and
many hours from a moving tag, a tag re-push during the run can leave
some workers on the old image and others on the new one. Pinning to
the digest fixes this without forcing operators to manage digests
manually upstream — they keep using a tag, and the resolver freezes
it for the lifetime of the scheduler process (1h TTL by default).
Pieces
ai4s/jobq/orchestration/image_resolver.py— new module:ImageDigestResolver: TTL cache + thread-saferesolve();fail-open or fail-closed on registry errors; logs loudly when a
digest changes mid-session.
RegistryAuthProtocol withAnonymousAuth,BearerTokenAuth,AcrAadAuth(the latter does theAAD-to-ACR
/oauth2/exchangeflow). External users withnon-ACR registries can plug in their own auth.
Workforce:image_resolver=andset_image_resolver(...)._build_workerrewritesenv.imageto the digest URI when aresolver is configured.
MultiRegionWorkforce: acceptsimage_resolver=andpropagates it to all child workforces.
The race fix in this PR
Anonymous AzureML environments are content-hash-versioned. When the
resolver rewrites
env.image, every_build_workercall producesa fresh
Environmentobject with no ARM id, and the SDK'sjobs.create_or_updatere-runs_environments.create_or_updateon every job. Concurrent
parallel_hirethreads then race on theworkspace's
CliV2AnonymousEnvironment/versions/<hash>PUT and thelosers crash with
ResourceExistsError.Workforcenow pre-registers the digest-pinnedEnvironmentonceper resolved image (under a thread lock) and substitutes the returned
ARM id string for
job.environment. The SDK skips itsper-submission registration for string envs, so all concurrent
submissions reuse the same registration.
ResourceExistsErrorfrom a concurrent caller (or a previousscheduler run that registered the same content hash) is recovered by
parsing the version from the error and
GET-ing the existingregistration.
Test plan
tests/test_image_resolver.py— 23 tests covering tagparsing / digest URI formatting, TTL cache hit/miss/expiry,
digest-change warning, fail-open vs fail-closed, thread safety
under concurrent
resolve(), auth backends, andWorkforce-integration via a stand-in helper that mirrorsthe
_build_workerrewrite + register flow.tests/test_workforce_parallel.py— extended_bare_workforcefixture to cover the new lock + cacheattributes; existing parallel-hire / lay-off / resume
coverage continues to pass.
pre-commit runclean (ruff, mypy, format).image_digest_resolvedonce per resolver, thenanonymous_env_registeredonce per workforce, thensuccessful hires across all regions with no
ResourceExistsError.