Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add retry logic for registermanifests, introduce skip-build override for LRT #8495

Conversation

lakshmimsft
Copy link
Contributor

@lakshmimsft lakshmimsft commented Feb 14, 2025

Description

The pull request addresses findings in the failures related to Long Running Test:
We see manifests not being registered successfully due to errors "409 Conflict : The target resource is in Accepted state"
This was determined due to the fact that every PUT request for a resourceprovider/resource type/location/api within a file will internally make an entry towards a resourceprovidersummary entry which is kept updated with the latest updates for the resourceprovider and is optimized for GET calls which receive summarized data for the resource provider.

The system actually corrects itself eventually and the pods are up with the manifests registered and ucp is running but the workflow logic will fail as is designed, the in-built types are not saved in the skip-delete-resources-list.txt and will be deleted in the next subsequent runs and they fail.
This error is intermittent (latest fresh build run this evening did not have this error, https://github.com/radius-project/radius/actions/runs/13316073926/job/37190564570) but when it does error, it can lead to 12 subsequent failures.

The PR includes addition of retry logic with exponential backoff for handling 409 conflict errors.
Introduction of a 'skip-build' override mechism for workflow_dispatch to be able to run Long Running Tests against latest build on demand.

Type of change

  • This pull request fixes a bug in Radius and has an approved issue (issue link required).

Fixes: #8449

Contributor checklist

Please verify that the PR meets the following requirements, where applicable:

  • An overview of proposed schema changes is included in a linked GitHub issue.
    • Yes
    • Not applicable
  • A design document PR is created in the design-notes repository, if new APIs are being introduced.
    • Yes
    • Not applicable
  • The design document has been reviewed and approved by Radius maintainers/approvers.
    • Yes
    • Not applicable
  • A PR for the samples repository is created, if existing samples are affected by the changes in this PR.
    • Yes
    • Not applicable
  • A PR for the documentation repository is created, if the changes in this PR affect the documentation or any user facing updates are made.
    • Yes
    • Not applicable
  • A PR for the recipes repository is created, if existing recipes are affected by the changes in this PR.
    • Yes
    • Not applicable

@lakshmimsft lakshmimsft requested review from a team as code owners February 14, 2025 01:22
@lakshmimsft lakshmimsft changed the title add retry and backoff for registermanifest Add retry logic for registermanifests, introduce skip-build override for LRT Feb 14, 2025
Copy link

codecov bot commented Feb 14, 2025

Codecov Report

Attention: Patch coverage is 68.35443% with 25 lines in your changes missing coverage. Please review.

Project coverage is 59.90%. Comparing base (7709a79) to head (4d27a79).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/cli/manifest/registermanifest.go 68.35% 17 Missing and 8 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8495      +/-   ##
==========================================
+ Coverage   59.86%   59.90%   +0.03%     
==========================================
  Files         596      596              
  Lines       40463    40512      +49     
==========================================
+ Hits        24222    24267      +45     
- Misses      14416    14419       +3     
- Partials     1825     1826       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@lakshmimsft lakshmimsft force-pushed the lakshmimsft/addretryregistermanifests branch from 61d1c89 to b2c45cf Compare February 14, 2025 01:43
@ytimocin
Copy link
Contributor

The pull request addresses findings in the failures related to Long Running Test: We see manifests not being registered successfully due to errors "409 Conflict : The target resource is in Accepted state" This was determined due to the fact that every PUT request for a resourceprovider/resource type/location/api within a file will internally make an entry towards a resourceprovidersummary entry which is kept updated with the latest updates for the resourceprovider and is optimized for GET calls which receive summarized data for the resource provider.

I am trying to understand why we see the 409s? I understand that the resources are in Accepted state but what is the main reason why they are stuck in Accepted state? Are they still being registered?

I am not sure why we need to retry when we get a 409? To me, 409 means that there is an operation going on and retrying while it is going on by sending the same CreateOrUpdate requests doesn't make much sense. If we are using retry just to wait for these operations to finish, then we may add a step to the workflow.

But, if they are stuck in 409 then there may be another issue.

Comment on lines +138 to +142
# Check override in workflow_dispatch mode
if [ "${{ github.event_name }}" = "workflow_dispatch" ] && [ "${{ github.event.inputs.skip-build }}" = "false" ]; then
echo "Manual run with skip-build=false, forcing build"
SKIP_BUILD="false"
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tells me that if the event is workflow dispatch then SKIP_BUILD will always be false.

Copy link
Contributor Author

@lakshmimsft lakshmimsft Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is saying if the event is workflow_dispatch And the input variable skip-build = false, then it will be set to false as an override. Default value for the variable is 'true'.

id: skip-build
run: |
# check if the last build time to see if we need to build again
SKIP_BUILD="false"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we adding this because by default SKIP_BUILD wasn't being set to false and we need to do it explicitly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I preferred initializing to a explicit value and check for in_window and workflow_dispatch conditions to update it .

@@ -436,8 +451,8 @@ jobs:
exit 1
fi

# Poll logs for up to iterations, 30 seconds each (upto 3 minutes total)
for i in {1..6}; do
# Poll logs for up to 10 iterations, 30 seconds each (up to 5 minutes total)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

409 errors may take more than 5 minutes to disappear. If the resource in a Non-Terminal state like Accepted, not sure how long it is going to take for worker to mark it as Cancelled which is a Terminal state.

409 happens because the resource is not in a Terminal state.

Copy link
Contributor Author

@lakshmimsft lakshmimsft Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I descibed above I suspect it's more a case of completion/propogation of current resourceprovidersummary entry while new updates are coming in for the same entry.
With our other workflows, we do not see this error on kind cluster with 3 mins.
With the current updates, we're retrying on 409 codes and not returning error immediately which will give the system time to propogate changes. Since we're using backoff, yes, it's possible we see this cross 5 minutes Updating to 10mins.

@lakshmimsft lakshmimsft force-pushed the lakshmimsft/addretryregistermanifests branch from b2c45cf to 4d27a79 Compare February 14, 2025 04:36
@radius-functional-tests
Copy link

radius-functional-tests bot commented Feb 14, 2025

Radius functional test overview

🔍 Go to test action run

Name Value
Repository lakshmimsft/radius-mainfork
Commit ref 4d27a79
Unique ID func0919730db0
Image tag pr-func0919730db0
Click here to see the list of tools in the current test run
  • gotestsum 1.12.0
  • KinD: v0.20.0
  • Dapr:
  • Azure KeyVault CSI driver: 1.4.2
  • Azure Workload identity webhook: 1.3.0
  • Bicep recipe location ghcr.io/radius-project/dev/test/testrecipes/test-bicep-recipes/<name>:pr-func0919730db0
  • Terraform recipe location http://tf-module-server.radius-test-tf-module-server.svc.cluster.local/<name>.zip (in cluster)
  • applications-rp test image location: ghcr.io/radius-project/dev/applications-rp:pr-func0919730db0
  • dynamic-rp test image location: ghcr.io/radius-project/dev/dynamic-rp:pr-func0919730db0
  • controller test image location: ghcr.io/radius-project/dev/controller:pr-func0919730db0
  • ucp test image location: ghcr.io/radius-project/dev/ucpd:pr-func0919730db0
  • deployment-engine test image location: ghcr.io/radius-project/deployment-engine:latest

Test Status

⌛ Building Radius and pushing container images for functional tests...
✅ Container images build succeeded
⌛ Publishing Bicep Recipes for functional tests...
✅ Recipe publishing succeeded
⌛ Starting corerp-cloud functional tests...
⌛ Starting ucp-cloud functional tests...
✅ ucp-cloud functional tests succeeded
✅ corerp-cloud functional tests succeeded

Comment on lines +454 to +455
# Poll logs for up to 20 iterations, 30 seconds each (up to 10 minutes total)
for i in {1..20}; do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we just retry this step on failure instead of adding retry logic to the register manifest flow? We use something like this in samples repo: https://github.com/radius-project/samples/blob/edge/.github/workflows/test.yaml#L397.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can move to this later once we see it stabilized. The current approach gives me the exact error of the logs which is valuable right now.

Copy link
Contributor

@ytimocin ytimocin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline with @lakshmimsft. I believe that the retry logic should be in the workflow level. We agreed that to unblock the long running tests, we should merge this one in and see if this fixes the issue. But, we should have a follow up issue/PR to move the retry logic. Not sure if the retry logic being in the domain level is the best idea.

@lakshmimsft lakshmimsft merged commit 83d8a72 into radius-project:main Feb 18, 2025
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Investigate 409 error in Long Running tests
2 participants