Skip to content

feat(image-resolver): pin ACR tags to digests via azure-containerregistry SDK#82

Open
temporaer wants to merge 5 commits into
yeqi/feat/image-digest-resolverfrom
yeqi/feat/image-digest-resolver-v2
Open

feat(image-resolver): pin ACR tags to digests via azure-containerregistry SDK#82
temporaer wants to merge 5 commits into
yeqi/feat/image-digest-resolverfrom
yeqi/feat/image-digest-resolver-v2

Conversation

@temporaer

Copy link
Copy Markdown
Contributor

Replaces the hand-rolled image-digest resolver from the yeqi/feat/image-digest-resolver branch with a simpler implementation using the azure-containerregistry SDK. See PR description for details.

Copilot AI and others added 5 commits May 21, 2026 22:52
… isolated build env (#80)

* Initial plan

* fix: add [testenv:.package] with conda_deps=pip to fix tox isolated build

Agent-Logs-Url: https://github.com/microsoft/ai4s-jobq/sessions/1bf6032c-f49f-4035-af92-db8420272a0e

Co-authored-by: temporaer <27792+temporaer@users.noreply.github.com>

* fix: ensure pip is installed in tox accessibility conda env

Agent-Logs-Url: https://github.com/microsoft/ai4s-jobq/sessions/a11c7495-1a91-4ec9-8c63-fc079be6e5c5

Co-authored-by: temporaer <27792+temporaer@users.noreply.github.com>

* fix: add pip to base testenv conda_deps

The .package env had pip but the test envs (py310/py311/py312) did not,
causing 'No module named pip' when tox tried to install the package.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: temporaer <27792+temporaer@users.noreply.github.com>
Co-authored-by: Hannes Schulz <Hannes.Schulz@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The heartbeat loop retried indefinitely on non-HTTP errors (network
timeouts, DNS failures, connection resets) without ever declaring lock
loss. Now tracks time since last successful heartbeat update; if it
exceeds the visibility timeout, sets lock_lost_event and logs a warning.

This aligns Storage Queue behavior with the Service Bus backend, which
already declares lock lost when no successful renewal occurs within the
lock duration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix: stop log_from_queue retry loop on broken Manager connection

During shutdown (preemption or SIGTERM), the multiprocessing Manager
backing log_msg_queue is killed. The log_from_queue loop then spams
ERROR logs on every iteration as queue.get() raises BrokenPipeError or
ConnectionResetError. Break out of the loop on these fatal connection
errors since the Manager is gone and no more log messages will arrive.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs: add changelog entry for log queue shutdown fix

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* test: repro for subprocess continuing after lock loss

After PR #59 removed the _kill_subprocesses() call from
ProcessPool.submit(), cancelling a task (e.g. due to lock loss) no
longer terminates the pool child subprocess. The asyncio CancelledError
only cancels the await on run_in_executor — the subprocess keeps running
without any message lock, leading to duplicate task execution when
another worker picks up the now-visible message.

Two tests:
- Unit-level: shows subprocess keeps running after task cancellation
- E2E: uses pull_and_execute + Azurite to simulate lock_lost_event

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: terminate subprocess on lock loss via targeted SIGUSR1

ProcessPool.submit() now tracks the pool child's PID via a temp file.
On CancelledError (e.g. lock loss), it sends SIGUSR1 to that specific
process — the existing signal handler forwards SIGTERM to the shell
subprocess.  Other pool children are unaffected.

This fixes the regression from PR #59 where cancelling a task left the
subprocess running without any message lock, leading to duplicate
execution when another worker picked up the same message.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: use SIGKILL to process group on lock loss (no graceful shutdown)

On lock loss there's no point in graceful SIGTERM — the message lock is
gone and checkpointing would conflict with another worker. Changed to:

- SIGUSR2 (new) → pool child sends SIGKILL to subprocess process group
- SIGUSR1 (unchanged) → graceful SIGTERM for preemption/shutdown

The subprocess is killed instantly; no 600s kill-timeout wait.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: replace mktemp with uuid-based paths for CodeQL compliance

tempfile.mktemp is flagged by CodeQL as insecure (race between path
generation and file creation). Replace with os.path.join(gettempdir(),
uuid4().hex) which generates unique paths without the race condition.
Remove S306 suppression from pyproject.toml as it's no longer needed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* ci: add pip to tox conda_deps for CI compatibility

Recent conda/runner changes no longer include pip by default in
tox-conda environments, causing 'No module named pip' failures.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add JOBQ_LOCK_LOST_BEHAVIOR envvar for configurable signal

Allows choosing between SIGKILL (default, immediate process group
termination) and SIGTERM (graceful, allows cleanup) when a message
lock is lost.

Set JOBQ_LOCK_LOST_BEHAVIOR=sigterm to use graceful shutdown on lock
loss instead of the default hard termination.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test+docs: cover SIGTERM lock-loss behavior and add lock-loss docs

- Add test_sigterm_lock_lost_behavior: verifies that with
  JOBQ_LOCK_LOST_BEHAVIOR=sigterm, the subprocess trap handler fires
  (marker transitions to 'SIGTERM') while other tasks complete normally.
- Add docs/misc/15-lock-loss.md documenting lock-loss detection and the
  JOBQ_LOCK_LOST_BEHAVIOR envvar.
- Remove misplaced lock-loss section from preemption docs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs: add message redelivery note to lock-loss page

Explain that SQ redelivers at the end of the queue while SB redelivers
at the beginning, making SQ the safer choice for workloads that do not
tolerate simultaneous delivery.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tainerregistry SDK

Replace the hand-rolled OCI manifest fetch + ACR OAuth token exchange
with the azure-containerregistry SDK's ContainerRegistryClient, which
handles authentication and manifest resolution internally.

Scope: ACR only (*.azurecr.io). Non-ACR registries log a warning and
pass through unchanged.

Key simplifications vs the previous approach:
- No RegistryAuth Protocol / AcrAadAuth / BearerTokenAuth / AnonymousAuth
- No manual /oauth2/exchange + /oauth2/token dance
- No raw requests calls or Accept-header constants
- No JWT parsing for tenant extraction
- ~160 lines (was ~416)

The Workforce integration (env pre-registration to avoid the
ResourceExistsError race on concurrent submissions) is unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@temporaer temporaer requested a review from a team May 29, 2026 12:17
@temporaer temporaer changed the base branch from main to yeqi/feat/image-digest-resolver May 30, 2026 20:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants