Add retry logic for registermanifests, introduce skip-build override for LRT #8495

lakshmimsft · 2025-02-14T01:22:15Z

Description

This pull request addresses findings from the failures in the Long Running Test:
Manifests are not registered successfully because of the error "409 Conflict : The target resource is in Accepted state".
The cause was traced to the following: every PUT request for a resourceprovider/resource type/location/api within a manifest file internally updates a resourceprovidersummary entry. That entry is kept current with the latest changes to the resource provider and is optimized for GET calls that return summarized data for the resource provider. A new update can therefore likely arrive while the previous update to the same entry is still completing.

The system eventually corrects itself: the pods come up with the manifests registered and ucp running. However, the workflow logic fails as designed, so the built-in types are not saved in skip-delete-resources-list.txt. They are then deleted in subsequent runs, and those runs fail as well.
This error is intermittent (the latest fresh build run this evening did not hit it, https://github.com/radius-project/radius/actions/runs/13316073926/job/37190564570), but when it does occur it can lead to 12 subsequent failures.

The PR includes:
- Retry logic with exponential backoff for handling 409 Conflict errors.
- A 'skip-build' override mechanism for workflow_dispatch, so that Long Running Tests can be run against the latest build on demand.
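
As a rough illustration of the retry-with-exponential-backoff idea described above, here is a minimal shell sketch. It is not the Go code in pkg/cli/manifest/registermanifest.go. The function name `retry_on_conflict`, the exit code 9 standing in for a 409 Conflict, and the attempt and delay values are all assumptions made for this example.

```shell
# Assumed convention for this sketch only: the wrapped command exits
# 0 on success, 9 on "409 Conflict", anything else on a non-retryable error.
CONFLICT_EXIT=9

# retry_on_conflict MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# The body runs in a subshell so its variables do not leak to the caller.
retry_on_conflict() (
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    rc=0
    "$@" || rc=$?
    if [ "$rc" -ne "$CONFLICT_EXIT" ]; then
      exit "$rc"            # success, or an error we should not retry
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "still in conflict after $attempt attempts, giving up" >&2
      exit "$rc"
    fi
    echo "conflict on attempt $attempt, retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))    # exponential backoff: double the wait each time
    attempt=$((attempt + 1))
  done
)

# Hypothetical usage (register_manifest is not a real command):
# retry_on_conflict 5 2 register_manifest path/to/manifest.yaml
```

Only the conflict case is retried; any other failure is returned to the caller immediately, so unrelated errors still surface on the first attempt.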

Type of change

This pull request fixes a bug in Radius and has an approved issue (issue link required).

Fixes: #8449

Contributor checklist

Please verify that the PR meets the following requirements, where applicable:

An overview of proposed schema changes is included in a linked GitHub issue.
- Yes
- Not applicable
A design document PR is created in the design-notes repository, if new APIs are being introduced.
- Yes
- Not applicable
The design document has been reviewed and approved by Radius maintainers/approvers.
- Yes
- Not applicable
A PR for the samples repository is created, if existing samples are affected by the changes in this PR.
- Yes
- Not applicable
A PR for the documentation repository is created, if the changes in this PR affect the documentation or any user facing updates are made.
- Yes
- Not applicable
A PR for the recipes repository is created, if existing recipes are affected by the changes in this PR.
- Yes
- Not applicable

codecov · 2025-02-14T01:34:21Z

Codecov Report

Attention: Patch coverage is 68.35443% with 25 lines in your changes missing coverage. Please review.

Project coverage is 59.90%. Comparing base (7709a79) to head (4d27a79).
Report is 1 commit behind head on main.

Files with missing lines	Patch %	Lines
pkg/cli/manifest/registermanifest.go	68.35%	17 Missing and 8 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8495      +/-   ##
==========================================
+ Coverage   59.86%   59.90%   +0.03%     
==========================================
  Files         596      596              
  Lines       40463    40512      +49     
==========================================
+ Hits        24222    24267      +45     
- Misses      14416    14419       +3     
- Partials     1825     1826       +1


ytimocin · 2025-02-14T02:08:32Z

The pull request addresses findings in the failures related to Long Running Test: We see manifests not being registered successfully due to errors "409 Conflict : The target resource is in Accepted state" This was determined due to the fact that every PUT request for a resourceprovider/resource type/location/api within a file will internally make an entry towards a resourceprovidersummary entry which is kept updated with the latest updates for the resourceprovider and is optimized for GET calls which receive summarized data for the resource provider.

I am trying to understand why we see the 409s. I understand that the resources are in the Accepted state, but what is the main reason they are stuck there? Are they still being registered?

I am not sure why we need to retry when we get a 409. To me, a 409 means that an operation is in progress, and retrying during that operation by sending the same CreateOrUpdate requests doesn't make much sense. If we are using the retry just to wait for these operations to finish, then we could add a step to the workflow instead.

But if they are stuck returning 409, then there may be another issue.

ytimocin · 2025-02-14T01:54:34Z

.github/workflows/long-running-azure.yaml

+          # Check override in workflow_dispatch mode
+          if [ "${{ github.event_name }}" = "workflow_dispatch" ] && [ "${{ github.event.inputs.skip-build }}" = "false" ]; then
+            echo "Manual run with skip-build=false, forcing build"
+            SKIP_BUILD="false"
+          fi


This tells me that if the event is workflow_dispatch then SKIP_BUILD will always be false.

lakshmimsft · 2025-02-14

This says that if the event is workflow_dispatch and the input variable skip-build is false, then SKIP_BUILD is set to false as an override. The default value of the input is 'true'.
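
To make the override semantics in this thread concrete, the decision can be sketched as a small shell function. This is a minimal sketch, not the workflow's actual step: the function name `decide_skip_build` and its three arguments are hypothetical, and it assumes that an `in_window` value of "true" (a recent enough build exists) is what sets SKIP_BUILD to true in the first place.

```shell
# decide_skip_build EVENT_NAME INPUT_SKIP_BUILD IN_WINDOW
# Prints "true" or "false". The body runs in a subshell to keep variables local.
decide_skip_build() (
  event_name=$1
  input_skip_build=$2
  in_window=$3

  # Start from an explicit value, as in the diff.
  skip_build="false"

  # Assumption: a build inside the time window means we can skip building.
  if [ "$in_window" = "true" ]; then
    skip_build="true"
  fi

  # Override: a manual run that explicitly asks for skip-build=false forces a build.
  if [ "$event_name" = "workflow_dispatch" ] && [ "$input_skip_build" = "false" ]; then
    skip_build="false"
  fi

  echo "$skip_build"
)

decide_skip_build workflow_dispatch false true   # prints "false" (forced build)
decide_skip_build workflow_dispatch true true    # prints "true" (default input, no override)
decide_skip_build schedule "" true               # prints "true" (override does not apply)
```

The second call shows the point of the reply: a workflow_dispatch run with the default input leaves the earlier decision untouched.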

ytimocin · 2025-02-14T01:55:03Z

.github/workflows/long-running-azure.yaml

        id: skip-build
        run: |
          # check if the last build time to see if we need to build again
+          SKIP_BUILD="false"


Are we adding this because SKIP_BUILD wasn't being set to false by default and we need to do it explicitly?

lakshmimsft · 2025-02-14

I preferred initializing it to an explicit value and then checking the in_window and workflow_dispatch conditions to update it.

ytimocin · 2025-02-14T01:57:23Z

.github/workflows/long-running-azure.yaml

@@ -436,8 +451,8 @@ jobs:
            exit 1
          fi

-          # Poll logs for up to  iterations, 30 seconds each (upto 3 minutes total)
-          for i in {1..6}; do
+          # Poll logs for up to 10 iterations, 30 seconds each (up to 5 minutes total)


409 errors may take more than 5 minutes to disappear. If the resource is in a non-terminal state like Accepted, I'm not sure how long it will take for the worker to mark it as Cancelled, which is a terminal state.

A 409 happens because the resource is not in a terminal state.

lakshmimsft · 2025-02-14

As I described above, I suspect it's more a case of completion/propagation of the current resourceprovidersummary entry while new updates are coming in for the same entry.
With our other workflows, we do not see this error on a kind cluster with 3 minutes.
With the current updates, we retry on 409 codes instead of returning an error immediately, which gives the system time to propagate changes. Since we're using backoff, yes, it's possible for this to exceed 5 minutes. Updating to 10 minutes.
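
The time budgets discussed in this thread follow from iterations multiplied by the sleep interval. Below is a minimal sketch of a fixed-interval poll loop and that arithmetic. The helper names `poll_budget_minutes` and `poll_until` are hypothetical, and the real workflow step may differ in what it checks on each iteration.

```shell
# Worst-case wait of a fixed-interval poll, in whole minutes.
# poll_budget_minutes ITERATIONS INTERVAL_SECONDS
poll_budget_minutes() {
  echo $(( $1 * $2 / 60 ))
}

# poll_until ITERATIONS INTERVAL_SECONDS CHECK_COMMAND [ARGS...]
# Runs the check up to ITERATIONS times, sleeping between attempts.
# Returns 0 as soon as the check succeeds, 1 if it never does.
poll_until() (
  iterations=$1
  interval=$2
  shift 2
  i=1
  while [ "$i" -le "$iterations" ]; do
    if "$@"; then
      exit 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  exit 1
)

poll_budget_minutes 6 30    # prints 3  (the original loop)
poll_budget_minutes 10 30   # prints 5  (first revision)
poll_budget_minutes 20 30   # prints 10 (value after this thread)
```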

pkg/cli/manifest/registermanifest.go

Signed-off-by: lakshmimsft <[email protected]>

radius-functional-tests · 2025-02-14T16:50:37Z

Radius functional test overview


Name	Value
Repository	lakshmimsft/radius-mainfork
Commit ref	`4d27a79`
Unique ID	func0919730db0
Image tag	pr-func0919730db0

Tools in the current test run

gotestsum 1.12.0
KinD: v0.20.0
Dapr:
Azure KeyVault CSI driver: 1.4.2
Azure Workload identity webhook: 1.3.0
Bicep recipe location ghcr.io/radius-project/dev/test/testrecipes/test-bicep-recipes/<name>:pr-func0919730db0
Terraform recipe location http://tf-module-server.radius-test-tf-module-server.svc.cluster.local/<name>.zip (in cluster)
applications-rp test image location: ghcr.io/radius-project/dev/applications-rp:pr-func0919730db0
dynamic-rp test image location: ghcr.io/radius-project/dev/dynamic-rp:pr-func0919730db0
controller test image location: ghcr.io/radius-project/dev/controller:pr-func0919730db0
ucp test image location: ghcr.io/radius-project/dev/ucpd:pr-func0919730db0
deployment-engine test image location: ghcr.io/radius-project/deployment-engine:latest

Test Status

⌛ Building Radius and pushing container images for functional tests...
✅ Container images build succeeded
⌛ Publishing Bicep Recipes for functional tests...
✅ Recipe publishing succeeded
⌛ Starting corerp-cloud functional tests...
⌛ Starting ucp-cloud functional tests...
✅ ucp-cloud functional tests succeeded
✅ corerp-cloud functional tests succeeded

ytimocin · 2025-02-18T15:20:33Z

.github/workflows/long-running-azure.yaml

+          # Poll logs for up to 20 iterations, 30 seconds each (up to 10 minutes total)
+          for i in {1..20}; do


Why don't we just retry this step on failure instead of adding retry logic to the register manifest flow? We use something like this in the samples repo: https://github.com/radius-project/samples/blob/edge/.github/workflows/test.yaml#L397.

lakshmimsft · 2025-02-18

I think we can move to this later, once we see it stabilized. The current approach gives me the exact error from the logs, which is valuable right now.

ytimocin

Discussed offline with @lakshmimsft. I believe that the retry logic should be at the workflow level. We agreed that, to unblock the long running tests, we should merge this one and see whether it fixes the issue. But we should have a follow-up issue/PR to move the retry logic. I'm not sure that having the retry logic at the domain level is the best idea.

lakshmimsft requested review from a team as code owners February 14, 2025 01:22

lakshmimsft temporarily deployed to publish-bicep February 14, 2025 01:22 — with GitHub Actions Inactive

lakshmimsft changed the title ~~add retry and backoff for registermanifest~~ Add retry logic for registermanifests, introduce skip-build override for LRT Feb 14, 2025

lakshmimsft force-pushed the lakshmimsft/addretryregistermanifests branch from 61d1c89 to b2c45cf Compare February 14, 2025 01:43

lakshmimsft temporarily deployed to publish-bicep February 14, 2025 01:43 — with GitHub Actions Inactive

ytimocin reviewed Feb 14, 2025


add check, retry and backoff for registermanifest

4d27a79

Signed-off-by: lakshmimsft <[email protected]>

lakshmimsft force-pushed the lakshmimsft/addretryregistermanifests branch from b2c45cf to 4d27a79 Compare February 14, 2025 04:36

lakshmimsft temporarily deployed to publish-bicep February 14, 2025 04:36 — with GitHub Actions Inactive

lakshmimsft temporarily deployed to functional-tests February 14, 2025 16:49 — with GitHub Actions Inactive

ytimocin reviewed Feb 18, 2025


ytimocin approved these changes Feb 18, 2025


lakshmimsft merged commit 83d8a72 into radius-project:main Feb 18, 2025
29 checks passed
