feat(image-resolver): pin ACR tags to digests via azure-containerregistry SDK#82
Open
temporaer wants to merge 5 commits into
Open
feat(image-resolver): pin ACR tags to digests via azure-containerregistry SDK#82temporaer wants to merge 5 commits into
temporaer wants to merge 5 commits into
Conversation
… isolated build env (#80) * Initial plan * fix: add [testenv:.package] with conda_deps=pip to fix tox isolated build Agent-Logs-Url: https://github.com/microsoft/ai4s-jobq/sessions/1bf6032c-f49f-4035-af92-db8420272a0e Co-authored-by: temporaer <27792+temporaer@users.noreply.github.com> * fix: ensure pip is installed in tox accessibility conda env Agent-Logs-Url: https://github.com/microsoft/ai4s-jobq/sessions/a11c7495-1a91-4ec9-8c63-fc079be6e5c5 Co-authored-by: temporaer <27792+temporaer@users.noreply.github.com> * fix: add pip to base testenv conda_deps The .package env had pip but the test envs (py310/py311/py312) did not, causing 'No module named pip' when tox tried to install the package. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: temporaer <27792+temporaer@users.noreply.github.com> Co-authored-by: Hannes Schulz <Hannes.Schulz@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The heartbeat loop retried indefinitely on non-HTTP errors (network timeouts, DNS failures, connection resets) without ever declaring lock loss. Now tracks time since last successful heartbeat update; if it exceeds the visibility timeout, sets lock_lost_event and logs a warning. This aligns Storage Queue behavior with the Service Bus backend, which already declares lock lost when no successful renewal occurs within the lock duration. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix: stop log_from_queue retry loop on broken Manager connection During shutdown (preemption or SIGTERM), the multiprocessing Manager backing log_msg_queue is killed. The log_from_queue loop then spams ERROR logs on every iteration as queue.get() raises BrokenPipeError or ConnectionResetError. Break out of the loop on these fatal connection errors since the Manager is gone and no more log messages will arrive. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * docs: add changelog entry for log queue shutdown fix Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* test: repro for subprocess continuing after lock loss After PR #59 removed the _kill_subprocesses() call from ProcessPool.submit(), cancelling a task (e.g. due to lock loss) no longer terminates the pool child subprocess. The asyncio CancelledError only cancels the await on run_in_executor — the subprocess keeps running without any message lock, leading to duplicate task execution when another worker picks up the now-visible message. Two tests: - Unit-level: shows subprocess keeps running after task cancellation - E2E: uses pull_and_execute + Azurite to simulate lock_lost_event Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: terminate subprocess on lock loss via targeted SIGUSR1 ProcessPool.submit() now tracks the pool child's PID via a temp file. On CancelledError (e.g. lock loss), it sends SIGUSR1 to that specific process — the existing signal handler forwards SIGTERM to the shell subprocess. Other pool children are unaffected. This fixes the regression from PR #59 where cancelling a task left the subprocess running without any message lock, leading to duplicate execution when another worker picked up the same message. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: use SIGKILL to process group on lock loss (no graceful shutdown) On lock loss there's no point in graceful SIGTERM — the message lock is gone and checkpointing would conflict with another worker. Changed to: - SIGUSR2 (new) → pool child sends SIGKILL to subprocess process group - SIGUSR1 (unchanged) → graceful SIGTERM for preemption/shutdown The subprocess is killed instantly; no 600s kill-timeout wait. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: replace mktemp with uuid-based paths for CodeQL compliance tempfile.mktemp is flagged by CodeQL as insecure (race between path generation and file creation). Replace with os.path.join(gettempdir(), uuid4().hex) which generates unique paths without the race condition. Remove S306 suppression from pyproject.toml as it's no longer needed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ci: add pip to tox conda_deps for CI compatibility Recent conda/runner changes no longer include pip by default in tox-conda environments, causing 'No module named pip' failures. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add JOBQ_LOCK_LOST_BEHAVIOR envvar for configurable signal Allows choosing between SIGKILL (default, immediate process group termination) and SIGTERM (graceful, allows cleanup) when a message lock is lost. Set JOBQ_LOCK_LOST_BEHAVIOR=sigterm to use graceful shutdown on lock loss instead of the default hard termination. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * test+docs: cover SIGTERM lock-loss behavior and add lock-loss docs - Add test_sigterm_lock_lost_behavior: verifies that with JOBQ_LOCK_LOST_BEHAVIOR=sigterm, the subprocess trap handler fires (marker transitions to 'SIGTERM') while other tasks complete normally. - Add docs/misc/15-lock-loss.md documenting lock-loss detection and the JOBQ_LOCK_LOST_BEHAVIOR envvar. - Remove misplaced lock-loss section from preemption docs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * docs: add message redelivery note to lock-loss page Explain that SQ redelivers at the end of the queue while SB redelivers at the beginning, making SQ the safer choice for workloads that do not tolerate simultaneous delivery. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tainerregistry SDK Replace the hand-rolled OCI manifest fetch + ACR OAuth token exchange with the azure-containerregistry SDK's ContainerRegistryClient, which handles authentication and manifest resolution internally. Scope: ACR only (*.azurecr.io). Non-ACR registries log a warning and pass through unchanged. Key simplifications vs the previous approach: - No RegistryAuth Protocol / AcrAadAuth / BearerTokenAuth / AnonymousAuth - No manual /oauth2/exchange + /oauth2/token dance - No raw requests calls or Accept-header constants - No JWT parsing for tenant extraction - ~160 lines (was ~416) The Workforce integration (env pre-registration to avoid the ResourceExistsError race on concurrent submissions) is unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Replaces the hand-rolled image-digest resolver from the yeqi/feat/image-digest-resolver branch with a simpler implementation using the azure-containerregistry SDK. See PR description for details.