feat(image-resolver): pin ACR tags to digests via azure-containerregistry SDK by temporaer · Pull Request #82 · microsoft/ai4s-jobq

temporaer · 2026-05-29T12:17:26Z

Replaces the hand-rolled image-digest resolver from the yeqi/feat/image-digest-resolver branch with a simpler implementation using the azure-containerregistry SDK. See PR description for details.

… isolated build env (#80) * Initial plan * fix: add [testenv:.package] with conda_deps=pip to fix tox isolated build Agent-Logs-Url: https://github.com/microsoft/ai4s-jobq/sessions/1bf6032c-f49f-4035-af92-db8420272a0e Co-authored-by: temporaer <27792+temporaer@users.noreply.github.com> * fix: ensure pip is installed in tox accessibility conda env Agent-Logs-Url: https://github.com/microsoft/ai4s-jobq/sessions/a11c7495-1a91-4ec9-8c63-fc079be6e5c5 Co-authored-by: temporaer <27792+temporaer@users.noreply.github.com> * fix: add pip to base testenv conda_deps The .package env had pip but the test envs (py310/py311/py312) did not, causing 'No module named pip' when tox tried to install the package. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: temporaer <27792+temporaer@users.noreply.github.com> Co-authored-by: Hannes Schulz <Hannes.Schulz@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The heartbeat loop retried indefinitely on non-HTTP errors (network timeouts, DNS failures, connection resets) without ever declaring lock loss. Now tracks time since last successful heartbeat update; if it exceeds the visibility timeout, sets lock_lost_event and logs a warning. This aligns Storage Queue behavior with the Service Bus backend, which already declares lock lost when no successful renewal occurs within the lock duration. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: stop log_from_queue retry loop on broken Manager connection During shutdown (preemption or SIGTERM), the multiprocessing Manager backing log_msg_queue is killed. The log_from_queue loop then spams ERROR logs on every iteration as queue.get() raises BrokenPipeError or ConnectionResetError. Break out of the loop on these fatal connection errors since the Manager is gone and no more log messages will arrive. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * docs: add changelog entry for log queue shutdown fix Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test: repro for subprocess continuing after lock loss After PR #59 removed the _kill_subprocesses() call from ProcessPool.submit(), cancelling a task (e.g. due to lock loss) no longer terminates the pool child subprocess. The asyncio CancelledError only cancels the await on run_in_executor — the subprocess keeps running without any message lock, leading to duplicate task execution when another worker picks up the now-visible message. Two tests: - Unit-level: shows subprocess keeps running after task cancellation - E2E: uses pull_and_execute + Azurite to simulate lock_lost_event Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: terminate subprocess on lock loss via targeted SIGUSR1 ProcessPool.submit() now tracks the pool child's PID via a temp file. On CancelledError (e.g. lock loss), it sends SIGUSR1 to that specific process — the existing signal handler forwards SIGTERM to the shell subprocess. Other pool children are unaffected. This fixes the regression from PR #59 where cancelling a task left the subprocess running without any message lock, leading to duplicate execution when another worker picked up the same message. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: use SIGKILL to process group on lock loss (no graceful shutdown) On lock loss there's no point in graceful SIGTERM — the message lock is gone and checkpointing would conflict with another worker. Changed to: - SIGUSR2 (new) → pool child sends SIGKILL to subprocess process group - SIGUSR1 (unchanged) → graceful SIGTERM for preemption/shutdown The subprocess is killed instantly; no 600s kill-timeout wait. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: replace mktemp with uuid-based paths for CodeQL compliance tempfile.mktemp is flagged by CodeQL as insecure (race between path generation and file creation). Replace with os.path.join(gettempdir(), uuid4().hex) which generates unique paths without the race condition. Remove S306 suppression from pyproject.toml as it's no longer needed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ci: add pip to tox conda_deps for CI compatibility Recent conda/runner changes no longer include pip by default in tox-conda environments, causing 'No module named pip' failures. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add JOBQ_LOCK_LOST_BEHAVIOR envvar for configurable signal Allows choosing between SIGKILL (default, immediate process group termination) and SIGTERM (graceful, allows cleanup) when a message lock is lost. Set JOBQ_LOCK_LOST_BEHAVIOR=sigterm to use graceful shutdown on lock loss instead of the default hard termination. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * test+docs: cover SIGTERM lock-loss behavior and add lock-loss docs - Add test_sigterm_lock_lost_behavior: verifies that with JOBQ_LOCK_LOST_BEHAVIOR=sigterm, the subprocess trap handler fires (marker transitions to 'SIGTERM') while other tasks complete normally. - Add docs/misc/15-lock-loss.md documenting lock-loss detection and the JOBQ_LOCK_LOST_BEHAVIOR envvar. - Remove misplaced lock-loss section from preemption docs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * docs: add message redelivery note to lock-loss page Explain that SQ redelivers at the end of the queue while SB redelivers at the beginning, making SQ the safer choice for workloads that do not tolerate simultaneous delivery. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…tainerregistry SDK Replace the hand-rolled OCI manifest fetch + ACR OAuth token exchange with the azure-containerregistry SDK's ContainerRegistryClient, which handles authentication and manifest resolution internally. Scope: ACR only (*.azurecr.io). Non-ACR registries log a warning and pass through unchanged. Key simplifications vs the previous approach: - No RegistryAuth Protocol / AcrAadAuth / BearerTokenAuth / AnonymousAuth - No manual /oauth2/exchange + /oauth2/token dance - No raw requests calls or Accept-header constants - No JWT parsing for tenant extraction - ~160 lines (was ~416) The Workforce integration (env pre-registration to avoid the ResourceExistsError race on concurrent submissions) is unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI and others added 5 commits May 21, 2026 22:52

temporaer requested a review from a team May 29, 2026 12:17

temporaer changed the base branch from main to yeqi/feat/image-digest-resolver May 30, 2026 20:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(image-resolver): pin ACR tags to digests via azure-containerregistry SDK#82

feat(image-resolver): pin ACR tags to digests via azure-containerregistry SDK#82
temporaer wants to merge 5 commits into
yeqi/feat/image-digest-resolverfrom
yeqi/feat/image-digest-resolver-v2

temporaer commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

temporaer commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants