MON-4437: Migrate windows-exporter ServiceMonitor to EndpointSlice #3583

slashpai · 2025-11-24T09:01:04Z

related-to openshift/cluster-network-operator#2839

Related Epic: https://issues.redhat.com/browse/MON-4216

openshift-ci-robot · 2025-11-24T09:01:09Z

@slashpai: This pull request references MON-4432 which is a valid jira issue.

In response to this:

related-to openshift/cluster-network-operator#2839

Related Epic: https://issues.redhat.com/browse/MON-4216

cc: @simonpasquier

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2025-11-24T09:04:48Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: slashpai
Once this PR has been reviewed and has the lgtm label, please assign sebsoto for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jrvaldes

@slashpai thanks for working on this, PTAL at the comments

controllers/metric_controller.go

openshift-ci-robot · 2025-11-25T03:37:48Z

@slashpai: This pull request references MON-4437 which is a valid jira issue.


Update ServiceMonitor configuration to use EndpointSlice discovery instead of the Endpoints API for improved scalability. Also bumps the prometheus-operator lib to v0.86.2, which introduces the new serviceDiscoveryRole field in the ServiceMonitor spec.

- Set serviceDiscoveryRole to EndpointSlice
- Update relabel metadata label for EndpointSlice
- Add endpointslices RBAC permissions

Signed-off-by: Jayapriya Pai <[email protected]>
Assisted-By: Cursor AI
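To illustrate the "update relabel metadata label" step in the commit message above: Prometheus exposes the object an address points at under a different meta label prefix depending on the discovery role. The sketch below is only an assumption about the kind of mapping involved; the `DiscoveryRole` type and `targetNameLabel` helper are hypothetical local names, not the prometheus-operator API, and the exact labels touched in this PR are not shown in the conversation.

```go
package main

import "fmt"

// DiscoveryRole is a hypothetical local stand-in for the role a
// ServiceMonitor asks Prometheus to use. It is not the prometheus-operator type.
type DiscoveryRole string

const (
	RoleEndpoints     DiscoveryRole = "Endpoints"
	RoleEndpointSlice DiscoveryRole = "EndpointSlice"
)

// targetNameLabel returns the Prometheus Kubernetes SD meta label that holds
// the name of the object an address points at, for the given discovery role.
func targetNameLabel(role DiscoveryRole) (string, error) {
	switch role {
	case RoleEndpoints:
		return "__meta_kubernetes_endpoint_address_target_name", nil
	case RoleEndpointSlice:
		return "__meta_kubernetes_endpointslice_address_target_name", nil
	default:
		return "", fmt.Errorf("unsupported discovery role %q", role)
	}
}

func main() {
	for _, role := range []DiscoveryRole{RoleEndpoints, RoleEndpointSlice} {
		label, err := targetNameLabel(role)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s -> %s\n", role, label)
	}
}
```

Relabel rules that still read the `endpoint`-prefixed label after the role switches to EndpointSlice would match nothing, which is presumably why the label update ships in the same commit as the role change.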

slashpai · 2025-11-25T05:51:33Z

/test azure-e2e-operator

slashpai · 2025-11-25T08:34:36Z

The e2e test failure looks legit; I might also need to update the e2e test.

slashpai · 2025-11-26T12:07:19Z

/test aws-e2e-operator
/test vsphere-disconnected-e2e-operator

slashpai · 2025-11-27T08:34:53Z

looking into this

=== RUN   TestWMCO/destroy/Deletion/BYOH_node_removal/ip-10-0-105-91.us-west-2.compute.internal/AWS_metadata_endpoint
    delete_test.go:199: 
        	Error Trace:	/go/src/github.com/openshift/windows-machine-config-operator/test/e2e/delete_test.go:199
        	            				/go/src/github.com/openshift/windows-machine-config-operator/test/e2e/delete_test.go:102
        	Error:      	Received unexpected error:
        	            	context deadline exceeded
        	Test:       	TestWMCO/destroy/Deletion/BYOH_node_removal/ip-10-0-105-91.us-west-2.compute.internal/AWS_metadata_endpoint
        	Messages:   	metadata endpoint route was not restored within timeout
2025/11/26 14:42:13 waiting (timeout: 5m0s) for 0 Windows Machines to reach phase ""
2025/11/26 14:42:18 5.067970045s time is required for 0 Machines without the ignore label to reach phase deleted
2025/11/26 14:42:18 waiting (timeout: 5m0s) for 0 Windows Machines to reach phase ""
2025/11/26 14:42:23 waiting for -1/0 Windows Machines
2025/11/26 14:42:28 waiting for -1/0 Windows Machines
2025/11/26 14:42:33 waiting for -1/0 Windows Machines
2025/11/26 14:42:38 waiting for -1/0 Windows Machines
2025/11/26 14:42:43 waiting for -1/0 Windows Machines
2025/11/26 14:42:48 waiting for -1/0 Windows Machines
2025/11/26 14:42:53 waiting for -1/0 Windows Machines
2025/11/26 14:42:58 waiting for -1/0 Windows Machines
2025/11/26 14:43:03 waiting for -1/0 Windows Machines
2025/11/26 14:43:08 waiting for -1/0 Windows Machines
2025/11/26 14:43:13 waiting for -1/0 Windows Machines
2025/11/26 14:43:18 1m0.067311082s time is required for 0 Machines with the ignore label to reach phase deleted
=== RUN   TestWMCO/destroy/Deletion/Prometheus_configuration
2025/11/26 14:43:18 test failed, attempting to gather Node logs
--- FAIL: TestWMCO (771.28s)
    --- FAIL: TestWMCO/destroy (770.57s)
        --- PASS: TestWMCO/destroy/Mirror_settings_cleared_from_nodes (0.19s)
        --- FAIL: TestWMCO/destroy/Deletion (770.24s)
            --- FAIL: TestWMCO/destroy/Deletion/BYOH_node_removal (598.70s)
                --- FAIL: TestWMCO/destroy/Deletion/BYOH_node_removal/ip-10-0-105-91.us-west-2.compute.internal (253.32s)


openshift-ci · 2025-11-27T17:34:38Z

@slashpai: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/aws-e2e-operator	`a66ff46`	link	true	`/test aws-e2e-operator`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

simonpasquier · 2025-11-27T17:50:01Z

test/e2e/metrics_test.go

+	for _, slice := range endpointSlices.Items {
+		for _, endpoint := range slice.Endpoints {
+			// Check if this endpoint references a node
+			if endpoint.TargetRef != nil && endpoint.TargetRef.Kind == "Node" {


(nit) we can go to the next item directly if it's not a node ref.

Suggested change

-			if endpoint.TargetRef != nil && endpoint.TargetRef.Kind == "Node" {
+			if endpoint.TargetRef == nil || endpoint.TargetRef.Kind != "Node" {
+				continue
+			}

simonpasquier · 2025-11-27T17:51:22Z

test/e2e/metrics_test.go

+				// Try to get the node to check its OS
+				node, err := tc.client.K8s.CoreV1().Nodes().Get(context.TODO(),
+					endpoint.TargetRef.Name, metav1.GetOptions{})
+				if err == nil {


shouldn't we catch non-nil errors? Even a node not being found is a problem since it means that the endpointslice isn't synced.

jrvaldes · 2025-12-01T15:16:33Z

@slashpai there is an ongoing issue with the AWS job in CI, so please ignore this failure.

https://issues.redhat.com/browse/OCPBUGS-66070

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 24, 2025

openshift-ci bot requested review from jrvaldes and sebsoto November 24, 2025 09:04

jrvaldes reviewed Nov 24, 2025

slashpai changed the title ~~MON-4432: Migrate windows-exporter ServiceMonitor to EndpointSlice~~ MON-4437: Migrate windows-exporter ServiceMonitor to EndpointSlice Nov 25, 2025

slashpai force-pushed the endpointslice branch from 4b98550 to acea385 Compare November 25, 2025 03:44

tests: update e2e test for endpointslice compatibility

a66ff46

Assisted-By: Cursor AI Signed-off-by: Jayapriya Pai <[email protected]>

slashpai force-pushed the endpointslice branch from 111feea to a66ff46 Compare November 27, 2025 14:21

simonpasquier reviewed Nov 27, 2025
