Add retry logic for registermanifests, introduce skip-build override for LRT #8495
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

```
@@           Coverage Diff            @@
##             main    #8495    +/-  ##
==========================================
+ Coverage   59.86%   59.90%   +0.03%
==========================================
  Files         596      596
  Lines       40463    40512      +49
==========================================
+ Hits        24222    24267      +45
- Misses      14416    14419       +3
- Partials     1825     1826       +1
```

View full report in Codecov by Sentry.
I am trying to understand why we see the 409s. I understand that the resources are in Accepted state, but what is the main reason they are stuck in Accepted state? Are they still being registered? I am not sure why we need to retry when we get a 409. To me, 409 means that there is an operation going on, and retrying while it is going on by sending the same CreateOrUpdate requests doesn't make much sense. If we are using retry just to wait for these operations to finish, then we may add a step to the workflow. But if they are stuck in 409, then there may be another issue.
```shell
# Check override in workflow_dispatch mode
if [ "${{ github.event_name }}" = "workflow_dispatch" ] && [ "${{ github.event.inputs.skip-build }}" = "false" ]; then
  echo "Manual run with skip-build=false, forcing build"
  SKIP_BUILD="false"
fi
```
This tells me that if the event is workflow_dispatch, then SKIP_BUILD will always be false.
This is saying: if the event is workflow_dispatch AND the input variable skip-build = false, then SKIP_BUILD will be set to false as an override. The default value for the input is 'true'.
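For illustration, the decision being discussed can be sketched as a standalone function. This is a minimal sketch: `decide_skip_build` and its parameters are hypothetical stand-ins for the workflow's `${{ github.event_name }}`, `${{ github.event.inputs.skip-build }}`, and in-window check, not code from the PR.

```shell
# Hypothetical reproduction of the skip-build decision logic.
decide_skip_build() {
  local event_name="$1" skip_build_input="$2" in_window="$3"
  local skip_build="false"
  # Inside the scheduled window with a recent build, skip rebuilding.
  if [ "$in_window" = "true" ]; then
    skip_build="true"
  fi
  # Manual dispatch with skip-build=false always forces a build.
  if [ "$event_name" = "workflow_dispatch" ] && [ "$skip_build_input" = "false" ]; then
    skip_build="false"
  fi
  echo "$skip_build"
}
```

With the default input of 'true', a manual dispatch leaves the in-window result untouched; only an explicit skip-build=false forces the build.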
```yaml
id: skip-build
run: |
  # Check the last build time to see if we need to build again
  SKIP_BUILD="false"
```
Are we adding this because by default SKIP_BUILD wasn't being set to false and we need to do it explicitly?
I preferred initializing to an explicit value and then checking the in_window and workflow_dispatch conditions to update it.
```diff
@@ -436,8 +451,8 @@ jobs:
             exit 1
           fi

-          # Poll logs for up to iterations, 30 seconds each (upto 3 minutes total)
-          for i in {1..6}; do
+          # Poll logs for up to 10 iterations, 30 seconds each (up to 5 minutes total)
```
409 errors may take more than 5 minutes to disappear. If the resource is in a Non-Terminal state like Accepted, I'm not sure how long it will take for the worker to mark it as Cancelled, which is a Terminal state.
409 happens because the resource is not in a Terminal state.
As I described above, I suspect it's more a case of completion/propagation of the current resourceprovidersummary entry while new updates are coming in for the same entry.
With our other workflows, we do not see this error on a kind cluster within 3 mins.
With the current updates, we're retrying on 409 codes rather than returning an error immediately, which gives the system time to propagate the changes. Since we're using backoff, yes, it's possible we cross 5 minutes. Updating to 10 mins.
Signed-off-by: lakshmimsft <[email protected]>
Radius functional test overview

Test Status: ⌛ Building Radius and pushing container images for functional tests...
```shell
# Poll logs for up to 20 iterations, 30 seconds each (up to 10 minutes total)
for i in {1..20}; do
```
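The fixed-iteration loop above can be generalized into a reusable helper. A minimal sketch under stated assumptions: `poll_until` is a hypothetical name, not a helper from the workflow, and the predicate command stands in for the actual grep over pod logs.

```shell
# Hypothetical generalization of the polling loop: run a predicate
# command up to `iterations` times, sleeping `interval` seconds between
# attempts, and fail if it never succeeds.
poll_until() {
  local iterations="$1" interval="$2"; shift 2
  local i
  for i in $(seq 1 "$iterations"); do
    if "$@"; then
      return 0
    fi
    sleep "$interval"
  done
  echo "condition not met after $((iterations * interval))s" >&2
  return 1
}
```

The 20-iteration, 30-second loop in the diff corresponds to `poll_until 20 30 <check>`, i.e. a 600-second (10-minute) budget.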
Why don't we just retry this step on failure instead of adding retry logic to the register manifest flow? We use something like this in samples repo: https://github.com/radius-project/samples/blob/edge/.github/workflows/test.yaml#L397.
I think we can move to this later once we see it stabilize. The current approach gives me the exact error from the logs, which is valuable right now.
Discussed offline with @lakshmimsft. I believe that the retry logic should be at the workflow level. We agreed that, to unblock the long running tests, we should merge this one in and see if it fixes the issue. But we should have a follow-up issue/PR to move the retry logic. Not sure if having the retry logic at the domain level is the best idea.
Description
The pull request addresses findings in the failures related to the Long Running Test:
We see manifests not being registered successfully due to the error "409 Conflict: The target resource is in Accepted state".
This was traced to the fact that every PUT request for a resourceprovider/resource type/location/api within a file internally updates a resourceprovidersummary entry, which is kept current with the latest updates for the resource provider and is optimized for GET calls that return summarized data for the resource provider.
The system eventually corrects itself: the pods come up with the manifests registered and UCP running. The workflow logic, however, fails as designed: the built-in types are not saved in skip-delete-resources-list.txt, so they are deleted in subsequent runs, which then fail.
This error is intermittent (the latest fresh build run this evening did not have it: https://github.com/radius-project/radius/actions/runs/13316073926/job/37190564570), but when it does occur it can lead to 12 subsequent failures.
The PR adds retry logic with exponential backoff for handling 409 Conflict errors.
It also introduces a 'skip-build' override mechanism for workflow_dispatch, to allow running the Long Running Tests against the latest build on demand.
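As a rough illustration of the retry-with-exponential-backoff behavior described above: the sketch below is hedged and hypothetical, not the actual implementation in the PR (which lives in the register-manifests flow and retries specifically on 409 Conflict); `retry_with_backoff` and `RETRY_INITIAL_DELAY` are invented names, and for simplicity this version retries on any failure.

```shell
# Hypothetical sketch: retry a command with exponential backoff.
# Usage: retry_with_backoff <max_attempts> <command> [args...]
retry_with_backoff() {
  local max_attempts="$1"; shift
  # Initial delay in seconds; doubles after each failed attempt.
  local delay="${RETRY_INITIAL_DELAY:-2}"
  local attempt
  for attempt in $(seq 1 "$max_attempts"); do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -lt "$max_attempts" ]; then
      echo "attempt $attempt failed (e.g. 409 Conflict), retrying in ${delay}s" >&2
      sleep "$delay"
      delay=$((delay * 2))  # exponential backoff: 2s, 4s, 8s, ...
    fi
  done
  return 1
}
```

With a 2-second initial delay and doubling, five attempts span roughly 2+4+8+16 = 30 seconds of waiting, which is why a backoff-based retry can push the overall polling window past the original 5-minute budget.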
Type of change
Fixes: #8449
Contributor checklist
Please verify that the PR meets the following requirements, where applicable: