Skip to content

Conversation

@kyrtapz
Copy link
Contributor

@kyrtapz kyrtapz commented Oct 22, 2025

The previous capacity calculation incorrectly only counted CloudPrivateIPConfigs that had the status set.
This is racy and can lead to artificially reduced capacity.
Fix it by accounting for all CloudPrivateIPConfigs that are or were assigned to the matching node.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 22, 2025
@openshift-ci openshift-ci bot requested review from arkadeepsen and tssurya October 22, 2025 08:03
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 22, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 22, 2025
@kyrtapz kyrtapz force-pushed the fix_capacity_calc branch 2 times, most recently from 8379c29 to 06577fe Compare October 23, 2025 07:54
@kyrtapz kyrtapz changed the title Fix capacity calculation OCPBUGS-63348: Fix capacity calculation Oct 23, 2025
@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 23, 2025
@openshift-ci-robot
Copy link

@kyrtapz: This pull request references Jira Issue OCPBUGS-63348, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

The previous capacity calculation incorrectly only counted CloudPrivateIPConfigs that had the status set.
This is racy and can lead to artificially reduced capacity.
Fix it by accounting for all CloudPrivateIPConfigs that are or were assigned to the matching node.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@kyrtapz kyrtapz force-pushed the fix_capacity_calc branch 2 times, most recently from b71c9a0 to 394540b Compare October 24, 2025 12:51
@kyrtapz
Copy link
Contributor Author

kyrtapz commented Nov 3, 2025

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 3, 2025
@openshift-ci-robot
Copy link

@kyrtapz: This pull request references Jira Issue OCPBUGS-63348, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @huiran0826

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from huiran0826 November 3, 2025 07:51
@huiran0826
Copy link

/verified by 'Pre-merge testing'

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Nov 3, 2025
@openshift-ci-robot
Copy link

@huiran0826: This PR has been marked as verified by 'Pre-merge testing'.

In response to this:

/verified by 'Pre-merge testing'

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

// Include CPICs where spec.node OR status.node matches to handle transitions
cpicIPs := sets.New[string]()
for _, cloudPrivateIPConfig := range cloudPrivateIPConfigs {
if isAssignedCloudPrivateIPConfigOnNode(cloudPrivateIPConfig, key) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may remove isAssignedCloudPrivateIPConfigOnNode method which is no longer needed.

if isAssignedCloudPrivateIPConfigOnNode(cloudPrivateIPConfig, key) {
cloudPrivateIPConfigs[index] = cloudPrivateIPConfig
index++
if cloudPrivateIPConfig.Spec.Node == key || cloudPrivateIPConfig.Status.Node == key {
Copy link
Member

@pperiyasamy pperiyasamy Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: If capacity is just about excluding only IPs_consumed_by_non_CPIC_sources from the limit, why can't we just pass all cloudPrivateIPConfigs to GetNodeEgressIPConfiguration method for simplicity ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would mean the same logic would have to be copied across all of the implementations. Doing it the way we have here avoids it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i meant to check assignedIP with all cpicIPs, but anyway this also looks fine.

g.nodeMapLock.Lock()
defer g.nodeMapLock.Unlock()
if _, ok := g.nodeLockMap[nodeName]; !ok {
g.nodeLockMap[nodeName] = &sync.Mutex{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't cleanup the lock when node is deleted ?

Copy link
Contributor Author

@kyrtapz kyrtapz Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a valid point. Done(added it for azure as well).

The previous capacity calculation incorrectly only counted CloudPrivateIPConfigs
that had the status set. This is racy and can lead to artificially reduced capacity.
Fix it by accounting for all CloudPrivateIPConfigs that are or were assigned to
the matching node.

Change the calculation in openstack to only consider IPs that were assigned to the
port by using `allowed_address_pairs`.

Signed-off-by: Patryk Diak <[email protected]>
To avoid the following error being returned
when trying to add/remove multiple IPs at once:
  err: googleapi: Error 412: Invalid fingerprint.

From the docs:
  The request will fail with error 400 Bad Request if the fingerprint is not provided,
  or 412 Precondition Failed if the fingerprint is out of date.

Signed-off-by: Patryk Diak <[email protected]>
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Nov 6, 2025
@coderabbitai
Copy link

coderabbitai bot commented Nov 6, 2025

Walkthrough

This pull request refactors CloudPrivateIPConfig parameter handling across cloud provider implementations by replacing slice-based inputs with set-based IP collections. The changes include parameter signature updates in AWS, Azure, GCP, and OpenStack providers, addition of CleanupNode() lifecycle methods, GCP per-node lock synchronization, and controller-level updates to collect and pass CPIC IPs as sets.

Changes

Cohort / File(s) Summary
Interface & Mock Implementation
pkg/cloudprovider/cloudprovider.go, pkg/cloudprovider/cloudprovider_fake.go
Updated GetNodeEgressIPConfiguration signature to accept sets.Set[string] instead of []*v1.CloudPrivateIPConfig; added CleanupNode method to interface; updated imports to add sets package and remove v1 cloudnetwork dependency.
AWS Provider
pkg/cloudprovider/aws.go
Changed GetNodeEgressIPConfiguration and getCapacity to accept cpicIPs sets.Set[string]; removed error return from getCapacity; added CleanupNode method; updated capacity calculation logic to use set membership checks instead of slice iteration.
Azure Provider
pkg/cloudprovider/azure.go
Updated GetNodeEgressIPConfiguration and getCapacity signatures to accept cpicIPs sets.Set[string]; added CleanupNode method; refactored capacity calculation to count current IP usage from the provided set; updated imports for sets package.
GCP Provider
pkg/cloudprovider/gcp.go
Added per-node concurrency control with nodeMapLock and nodeLockMap fields; implemented getNodeLock() helper for thread-safe per-node operations; updated GetNodeEgressIPConfiguration to accept cpicIPs sets.Set[string]; added locking to AssignPrivateIP and ReleasePrivateIP; added CleanupNode method; refactored capacity calculation with unified IP usage counter.
OpenStack Provider
pkg/cloudprovider/openstack.go
Updated GetNodeEgressIPConfiguration and getNeutronPortNodeEgressIPConfiguration to accept cpicIPs sets.Set[string] instead of CloudPrivateIPConfig slice; refactored capacity counting to use set-based lookup; added CleanupNode method; removed v1 import.
OpenStack Tests
pkg/cloudprovider/openstack_test.go
Changed test data types from []*v1.CloudPrivateIPConfig to []string; updated test cases to use sets.New[string]() for CPIC IPs; extended port test data with additional IPv4/IPv6 addresses; adjusted capacity calculations and test expectations to reflect new data structures.
Node Controller
pkg/controller/node/node_controller.go
Replaced isAssignedCloudPrivateIPConfigOnNode logic with set-based collection of CPIC IPs using NameToIP; updated GetNodeEgressIPConfiguration call to pass new cpicIPs set; added CleanupNode invocation on node deletion; added DeleteFunc to event handler; updated imports for sets and cloudprivateipconfig packages.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • GCP concurrency implementation: Review per-node lock acquisition/release patterns in AssignPrivateIP, ReleasePrivateIP, and getNodeLock() helper for potential race conditions or deadlock scenarios
  • Capacity calculation consistency: Verify capacity calculation logic is correctly implemented identically across AWS, Azure, and OpenStack providers (all switched from counting-based to set-based exclusion)
  • Set-based parameter propagation: Trace cpicIPs set construction and usage throughout controller and all provider implementations to ensure consistency
  • Test coverage: Validate that test case adjustments in openstack_test.go properly reflect the new set-based data structures and capacity expectations
  • Cleanup lifecycle: Verify CleanupNode invocation flow in controller deletion path and proper lock cleanup in GCP implementation
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 8384756 and f6b12fd.

📒 Files selected for processing (8)
  • pkg/cloudprovider/aws.go (4 hunks)
  • pkg/cloudprovider/azure.go (5 hunks)
  • pkg/cloudprovider/cloudprovider.go (3 hunks)
  • pkg/cloudprovider/cloudprovider_fake.go (2 hunks)
  • pkg/cloudprovider/gcp.go (8 hunks)
  • pkg/cloudprovider/openstack.go (6 hunks)
  • pkg/cloudprovider/openstack_test.go (7 hunks)
  • pkg/controller/node/node_controller.go (5 hunks)
🔇 Additional comments (1)
pkg/controller/node/node_controller.go (1)

124-143: Set-based CPIC filtering handles moves cleanly

Collecting CPIC IPs by spec/status match and feeding the set into the provider plugs the race where transitional assignments skewed capacity. The cleanup on node disappearance ties the lifecycle together nicely.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

@huiran0826
Copy link

/verified

@openshift-ci-robot
Copy link

@huiran0826: The /verified command must be used with one of the following actions: by, later, remove, or bypass. See https://docs.ci.openshift.org/docs/architecture/jira/#premerge-verification for more information.

In response to this:

/verified

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@huiran0826
Copy link

/verified by @huiran0826

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Nov 6, 2025
@openshift-ci-robot
Copy link

@huiran0826: This PR has been marked as verified by @huiran0826.

In response to this:

/verified by @huiran0826

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@kyrtapz
Copy link
Contributor Author

kyrtapz commented Nov 6, 2025

/test hypershift-e2e-aks
Failure looks unrelated.

@pperiyasamy
Copy link
Member

/lgtm
//hold waiting for CI to complete.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 6, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 6, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kyrtapz, pperiyasamy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [kyrtapz,pperiyasamy]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kyrtapz
Copy link
Contributor Author

kyrtapz commented Nov 6, 2025

/override ci/prow/hypershift-e2e-aks
https://redhat-external.slack.com/archives/C01C8502FMM/p1762445069386699

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 6, 2025

@kyrtapz: Overrode contexts on behalf of kyrtapz: ci/prow/hypershift-e2e-aks

In response to this:

/override ci/prow/hypershift-e2e-aks
https://redhat-external.slack.com/archives/C01C8502FMM/p1762445069386699

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 6, 2025

@kyrtapz: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/security f6b12fd link false /test security

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit a179997 into openshift:main Nov 6, 2025
13 of 14 checks passed
@openshift-ci-robot
Copy link

@kyrtapz: Jira Issue Verification Checks: Jira Issue OCPBUGS-63348
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-63348 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

In response to this:

The previous capacity calculation incorrectly only counted CloudPrivateIPConfigs that had the status set.
This is racy and can lead to artificially reduced capacity.
Fix it by accounting for all CloudPrivateIPConfigs that are or were assigned to the matching node.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@kyrtapz
Copy link
Contributor Author

kyrtapz commented Nov 6, 2025

/cherry-pick release-4.20

@openshift-cherrypick-robot

@kyrtapz: new pull request created: #188

In response to this:

/cherry-pick release-4.20

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants