Skip to content

feat(image-resolver): pin image tags to content digests at hire time#73

Open
BAI-Yeqi wants to merge 4 commits into
microsoft:mainfrom
BAI-Yeqi:yeqi/feat/image-digest-resolver
Open

feat(image-resolver): pin image tags to content digests at hire time#73
BAI-Yeqi wants to merge 4 commits into
microsoft:mainfrom
BAI-Yeqi:yeqi/feat/image-digest-resolver

Conversation

@BAI-Yeqi

@BAI-Yeqi BAI-Yeqi commented May 6, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds ImageDigestResolver: a pluggable component that resolves an
ACR image tag (registry/repo:tag) to its immutable content digest
(registry/repo@sha256:...) once at scheduler startup and caches it
with a TTL. Workforce and MultiRegionWorkforce accept an
optional image_resolver and rewrite env.image at hire time so
every worker in a session pulls byte-identical image content even if
the tag is re-pushed mid-run.

Why

When a long-running scheduler hires workers across many regions and
many hours from a moving tag, a tag re-push during the run can leave
some workers on the old image and others on the new one. Pinning to
the digest fixes this without forcing operators to manage digests
manually upstream — they keep using a tag, and the resolver freezes
it for the lifetime of the scheduler process (1h TTL by default).

Pieces

  • ai4s/jobq/orchestration/image_resolver.py — new module:
    • ImageDigestResolver: TTL cache + thread-safe resolve();
      fail-open or fail-closed on registry errors; logs loudly when a
      digest changes mid-session.
    • RegistryAuth Protocol with AnonymousAuth,
      BearerTokenAuth, AcrAadAuth (the latter does the
      AAD-to-ACR /oauth2/exchange flow). External users with
      non-ACR registries can plug in their own auth.
  • Workforce:
    • Accepts image_resolver= and set_image_resolver(...).
    • _build_worker rewrites env.image to the digest URI when a
      resolver is configured.
  • MultiRegionWorkforce: accepts image_resolver= and
    propagates it to all child workforces.

The race fix in this PR

Anonymous AzureML environments are content-hash-versioned. When the
resolver rewrites env.image, every _build_worker call produces
a fresh Environment object with no ARM id, and the SDK's
jobs.create_or_update re-runs _environments.create_or_update
on every job. Concurrent parallel_hire threads then race on the
workspace's CliV2AnonymousEnvironment/versions/<hash> PUT and the
losers crash with ResourceExistsError.

Workforce now pre-registers the digest-pinned Environment once
per resolved image (under a thread lock) and substitutes the returned
ARM id string for job.environment. The SDK skips its
per-submission registration for string envs, so all concurrent
submissions reuse the same registration.
ResourceExistsError from a concurrent caller (or a previous
scheduler run that registered the same content hash) is recovered by
parsing the version from the error and GET-ing the existing
registration.

Test plan

  • tests/test_image_resolver.py — 23 tests covering tag
    parsing / digest URI formatting, TTL cache hit/miss/expiry,
    digest-change warning, fail-open vs fail-closed, thread safety
    under concurrent resolve(), auth backends, and
    Workforce-integration via a stand-in helper that mirrors
    the _build_worker rewrite + register flow.
  • tests/test_workforce_parallel.py — extended
    _bare_workforce fixture to cover the new lock + cache
    attributes; existing parallel-hire / lay-off / resume
    coverage continues to pass.
  • pre-commit run clean (ruff, mypy, format).
  • Smoke-tested end-to-end: scheduler logs show
    image_digest_resolved once per resolver, then
    anonymous_env_registered once per workforce, then
    successful hires across all regions with no ResourceExistsError.

BAI-Yeqi added 3 commits May 6, 2026 17:23
…digests

Mutable ACR tags (``:latest`` and even most dated tags unless
content-immutability is enabled) can be re-pushed mid-session, so two
workers in the same workforce can end up running different image bytes.
Long-running schedulers will silently mix builds without pinning.

Adds the resolver as a self-contained module; wiring into ``Workforce``
is a follow-up.

Behaviour:

* Single-flight cache keyed by ``registry/repo:tag``.  TTL configurable
  (default 1 h).  Per-process cache only -- a fresh scheduler always
  re-resolves at startup, then stays on that digest for the session.
* On TTL expiry the resolver re-fetches; if the digest changed a WARNING
  log names both digests so operators can spot mid-session rebuilds.
* Fail-open by default: if ACR is unreachable / 5xx / no AcrPull, falls
  back to the original tag-based URI with a WARNING.  Image pinning is
  a quality improvement, not a correctness invariant.
* Thread-safe (one mutex around cache + fetch).  A single resolver
  instance is intended to be shared across all Workforces in a
  MultiRegionWorkforce.
* Logging is terminal-only via ``ai4s.jobq.orchestration.image_resolver``.

ACR auth uses the standard two-step OAuth dance: AAD token ->
``/oauth2/exchange`` -> refresh token -> ``/oauth2/token`` with
``repository:{repo}:pull`` scope.  Caller's principal needs AcrPull.

Tests cover tag parsing/formatting, cache hit/miss, TTL expiry,
digest-change detection, fail-open vs fail-closed, and thread safety.
Refactor + wire-in commit for the digest resolver landed in the previous
commit.  Two changes in one go since both touch the same surface area:

1. Pluggable registry auth.

   The manifest endpoint (``GET /v2/{repo}/manifests/{tag}``) is OCI-
   standard, so the only registry-specific code is token acquisition.
   Extract a ``RegistryAuth`` Protocol and three implementations:

   * ``AcrAadAuth`` -- the existing ACR + AAD OAuth dance, now an
     opt-in backend.  The ``azure.identity`` import is deferred so
     non-ACR users don't pay it.
   * ``AnonymousAuth`` -- skip the Authorization header entirely; works
     for public registries / public repos.
   * ``BearerTokenAuth`` -- caller-supplied bearer token; useful for
     tests and for callers that fetch tokens out-of-band (Docker Hub,
     GHCR, ECR, GCR all fit this pattern).

   ``AcrAadAuth`` remains the resolver's default because the typical
   user is on AML + ACR, but external users with a different registry
   pass their own backend.

2. Wire into ``Workforce`` and ``MultiRegionWorkforce``.

   * ``Workforce`` accepts an optional ``image_resolver`` and calls
     ``resolver.resolve(env.image)`` in ``_build_worker``.  The
     prototype's Environment is not mutated -- a shallow copy is taken
     for the per-worker job so concurrent ``_build_worker`` calls don't
     race.  Registered AML environments (passed as strings, no
     ``.image`` attribute) are skipped.
   * ``MultiRegionWorkforce`` accepts ``image_resolver`` and installs
     it on every child workforce via ``set_image_resolver``, sharing
     one cache across all regions.

Tests added for: anonymous vs bearer auth backends, ``_fetch_digest``
threading the auth header correctly, and the Workforce integration
path (rewrite happens, prototype not mutated, string envs and already-
pinned images both skip the fetch).
…change

Two fixes for the 401s seen against ``msrmoldyn.azurecr.io``:

1. ``DefaultAzureCredential`` picks the host's managed identity first
   on Azure VMs; that MI typically lacks AcrPull on the registry the
   workforce is targeting.  Switch the ``AcrAadAuth`` default to
   ``ChainedTokenCredential(AzureCliCredential, DefaultAzureCredential)``
   so the operator's ``az login`` identity is preferred.  Production
   service-account use is unaffected — pass an explicit credential to
   ``AcrAadAuth(credential=...)``.

2. ACR's ``/oauth2/exchange`` endpoint returns 401 on some tenants when
   the ``tenant`` form field is missing, even when the AAD token's
   audience is correct.  Decode the token's ``iss`` claim to extract
   the tenant id and include it on the exchange POST.

Smoke-tested against ``msrmoldyn.azurecr.io/materials/dft-calculation:2026-04-21``;
resolves to ``sha256:cb260046...`` cleanly with the operator's az-login
identity.
@BAI-Yeqi BAI-Yeqi requested a review from a team May 6, 2026 18:30
When the resolver rewrites ``env.image`` to a digest URI, ``_build_worker``
hands every job submission a fresh Environment object with no ARM id.
The SDK's ``jobs.create_or_update`` then re-registers the inline env on
every submission; concurrent ``parallel_hire`` threads compute the same
content-hash and race on the workspace's ``CliV2AnonymousEnvironment/
versions/<hash>`` PUT — the losers crash with ``ResourceExistsError``.

Pre-register the digest-pinned Environment once per resolved image
(under a thread lock) and substitute the returned ARM id *string* for
``job.environment``. The SDK skips its per-submission registration for
string envs, so all concurrent submissions reuse the same registration.
``ResourceExistsError`` from a concurrent caller (or a previous
scheduler run) is recovered by parsing the version from the error and
``GET``ing the existing registration.

Also extends ``_apply_resolver_to_job`` test helper and ``_bare_workforce``
fixture to cover the new register-then-substitute-ARM-id flow.
@BAI-Yeqi BAI-Yeqi force-pushed the yeqi/feat/image-digest-resolver branch from 3193b4d to e9dd399 Compare May 6, 2026 18:37
@BAI-Yeqi

BAI-Yeqi commented May 8, 2026

Copy link
Copy Markdown
Contributor Author

Good morning @temporaer , would you like to take a look?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant