Skip to content

Conversation

@2uasimojo
Copy link
Member

After a CD is Installed, we try to set the API and console URIs in the status. The latter is gleaned from the console Route on the remote cluster. It has recently come to my attention that it is possible for a cluster not to have that Route at all. But the code path that was looking for it was doing an immediate requeue, so we would end up hot looping, ultimately going into backoff. In scenarios where the absence is transient, this would resolve itself; but if that Route was never expected to exist, this was causing unnecessary churn in the controller, especially at scale.

With this change, we:

  • Split the error path when we fail to retrieve the console Route. When the error is a 404, we requeue with a static delay of 10m. Otherwise, we requeue "immediately" as is the usual pattern for such errors.
  • Carry on updating the API URL even if the console URL can't be determined (previously it was both or neither). (Extra logic was needed here to update only if something changed.)
  • Use error returns rather than info logs & non-error returns. Not sure why it was doing the latter previously; I'm only guessing it's because that code was from a bygone era where that made more sense for some reason. This modernization should make the error paths pop out of the logs more readily, which would have helped in diagnosing the scaling performance issue in the referenced card.

After a CD is Installed, we try to set the API and console URIs in the
status. The latter is gleaned from the `console` Route on the remote
cluster. It has recently come to my attention that it is possible for a
cluster not to have that Route at all. But the code path that was
looking for it was doing an immediate requeue, so we would end up hot
looping, ultimately going into backoff. In scenarios where the absence
is transient, this would resolve itself; but if that Route was never
expected to exist, this was causing unnecessary churn in the controller,
especially at scale.

With this change, we:
- Split the error path when we fail to retrieve the `console` Route.
  When the error is a 404, we requeue with a static delay of 10m.
  Otherwise, we requeue "immediately" as is the usual pattern for such
  errors.
- Carry on updating the API URL even if the console URL can't be
  determined (previously it was both or neither). (Extra logic was
  needed here to update only if something changed.)
- Use error returns rather than info logs & non-error returns. Not sure
  why it was doing the latter previously; I'm only guessing it's because
  that code was from a bygone era where that made more sense for some
  reason. This modernization should make the error paths pop out of the
  logs more readily, which would have helped in diagnosing the scaling
  performance issue in the referenced card.
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 14, 2025
@openshift-ci-robot
Copy link

openshift-ci-robot commented Nov 14, 2025

@2uasimojo: This pull request references ACM-20933 which is a valid jira issue.

In response to this:

After a CD is Installed, we try to set the API and console URIs in the status. The latter is gleaned from the console Route on the remote cluster. It has recently come to my attention that it is possible for a cluster not to have that Route at all. But the code path that was looking for it was doing an immediate requeue, so we would end up hot looping, ultimately going into backoff. In scenarios where the absence is transient, this would resolve itself; but if that Route was never expected to exist, this was causing unnecessary churn in the controller, especially at scale.

With this change, we:

  • Split the error path when we fail to retrieve the console Route. When the error is a 404, we requeue with a static delay of 10m. Otherwise, we requeue "immediately" as is the usual pattern for such errors.
  • Carry on updating the API URL even if the console URL can't be determined (previously it was both or neither). (Extra logic was needed here to update only if something changed.)
  • Use error returns rather than info logs & non-error returns. Not sure why it was doing the latter previously; I'm only guessing it's because that code was from a bygone era where that made more sense for some reason. This modernization should make the error paths pop out of the logs more readily, which would have helped in diagnosing the scaling performance issue in the referenced card.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from dlom and suhanime November 14, 2025 21:14
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 14, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 14, 2025
@2uasimojo
Copy link
Member Author

/assign @dlom
/cc @akrzos

@openshift-ci openshift-ci bot requested a review from akrzos November 14, 2025 21:17
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 14, 2025

@2uasimojo: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@codecov
Copy link

codecov bot commented Nov 14, 2025

Codecov Report

❌ Patch coverage is 61.90476% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.35%. Comparing base (fe9c66b) to head (9d33c41).

Files with missing lines Patch % Lines
.../clusterdeployment/clusterdeployment_controller.go 61.90% 7 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2782      +/-   ##
==========================================
+ Coverage   50.34%   50.35%   +0.01%     
==========================================
  Files         279      279              
  Lines       34167    34178      +11     
==========================================
+ Hits        17201    17210       +9     
- Misses      15612    15615       +3     
+ Partials     1354     1353       -1     
Files with missing lines Coverage Δ
.../clusterdeployment/clusterdeployment_controller.go 62.10% <61.90%> (+0.11%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants