Conversation

fmount
Contributor

@fmount fmount commented Sep 5, 2025

This patch introduces the notificationsBusInstance (optional) parameter at the API level.
When no override is defined at the service level, the top-level value is propagated to the underlying storage components where this field is implemented.

Based on: #1402
Jira: https://issues.redhat.com/browse/OSPRH-15389

@fmount fmount requested review from abays and stuggi September 5, 2025 09:09
@openshift-ci openshift-ci bot requested review from fultonj and rabi September 5, 2025 09:09
Contributor

openshift-ci bot commented Sep 5, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: fmount
Once this PR has been reviewed and has the lgtm label, please assign rabi for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fmount
Contributor Author

fmount commented Sep 5, 2025

Note that I found #1424 that can be considered a follow up of this patch, where a dedicated rabbitmq instance is introduced.

Contributor

@gibizer gibizer left a comment


Would it be possible to include the propagation to the nova level field too? I hope it is not too big of a complexity increase to take.

// Bus Service instance used by all services that produce or consume notifications.
// Avoid colocating it with the RabbitMQ instance used for RPC.
// This instance is propagated to the services, unless overridden in their templates.
NotificationsBusInstance *string `json:"notificationsBusInstance,omitempty"`
Contributor

OK so the usual considerations:

  • if this is nil then it means nothing to propagate to the service level values. If the service level value is also nil then it means notifications are disabled. And this is our default.
  • if this is set to the name of a rabbitmqcluster CR and the service level value is nil, then the rabbitmqcluster name is propagated to that service. This should be our normal way of enabling notifications for ceilometer across the whole cluster.
  • if this is set to the name of a rabbitmqcluster CR and the service level value is also set to a different rabbitmqcluster CR name, then the service level value is kept for that service. This is the complicated case where ceilometer is not needed or only partially needed, and for some reason this specific service needs to use an independent rabbitmqcluster.
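The three cases above can be sketched as a small resolution helper (a hedged sketch; the function and variable names are illustrative, not the operator's actual code):

```go
package main

import "fmt"

// propagateBusInstance applies the three cases discussed above:
//   - top nil            -> service value unchanged (nil keeps notifications off)
//   - top set, svc nil   -> the top-level name is pushed down to the service
//   - top set, svc set   -> the service-level override wins
func propagateBusInstance(top, svc *string) *string {
	if svc == nil {
		return top
	}
	return svc
}

func main() {
	top := "rabbitmq-notifications"
	override := "rabbitmq-cinder"
	fmt.Println(propagateBusInstance(nil, nil) == nil)  // true: notifications stay disabled
	fmt.Println(*propagateBusInstance(&top, nil))       // rabbitmq-notifications
	fmt.Println(*propagateBusInstance(&top, &override)) // rabbitmq-cinder
}
```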

Contributor

We need to check if we need both "" and nil values here to mean different things. If not, then we can either opt to use "" only, or define that "" and nil mean the same for this field.

Contributor

@bogdando bogdando Sep 5, 2025

I believe the linked top-scope implementation patch follows the above (except that we reached no agreement on the meaning of empty values).

Contributor Author

OK so the usual considerations:

* if this is nil then it means nothing to propagate to the service level values. If the service level value is also nil then it means notifications are disabled. And this is our default.

Correct, this is consistent with the current implementation for storage operators.

* if this is set to the name of a rabbitmqcluster CR and the service level value is nil, then the rabbitmqcluster name is propagated to that service. This should be our normal way of enabling notifications for ceilometer across the whole cluster.

Correct, this is an assumption consistent with the top-down propagation model.

* if this is set to the name of a rabbitmqcluster CR and the service level value is also set to a different rabbitmqcluster CR name, then the service level value is kept for that service. This is the complicated case where ceilometer is not needed or only partially needed, and for some reason this specific service needs to use an independent rabbitmqcluster.

Correct, from an API perspective this is how we usually handle overrides at the service level. I think the "complication" is something that should be handled by the service operators, unless you envision some logic at the openstack-operator level. I think we want to keep things simple here and defer any further logic (based on the input we receive) to the service operator level.

Contributor Author

We need to check if we need both "" and nil values here to mean different things. If not, then we can either opt to use "" only, or define that "" and nil mean the same for this field.

I think checking that what the user sets is not an empty string is something we can catch in a webhook, so we can cover updates of the field (while kubebuilder validation would only help us at CR creation time), and we could decide to default it to nil. However, I think that raising an error (the usual webhook flow) is better, as it lets a human operator fix the input or remove the parameter entirely.
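A minimal sketch of the check described above, assuming the webhook simply rejects an explicit empty string rather than defaulting it (function and error names are hypothetical, not the actual webhook implementation):

```go
package main

import (
	"errors"
	"fmt"
)

// errEmptyBusInstance mirrors the behavior discussed above: an explicit ""
// is rejected with an error instead of being silently treated like nil.
var errEmptyBusInstance = errors.New(
	"notificationsBusInstance: must be a RabbitmqCluster name or be omitted")

// validateNotificationsBus accepts an unset field (nil) or a non-empty
// name, and returns an error for an explicit empty string.
func validateNotificationsBus(busInstance *string) error {
	if busInstance != nil && *busInstance == "" {
		return errEmptyBusInstance
	}
	return nil
}

func main() {
	name := "rabbitmq-notifications"
	empty := ""
	fmt.Println(validateNotificationsBus(nil))    // <nil>: unset is valid
	fmt.Println(validateNotificationsBus(&name))  // <nil>: a named instance is valid
	fmt.Println(validateNotificationsBus(&empty)) // error: "" is rejected
}
```

In the real operator this check would run from both the create and update paths of the validating webhook, which is what covers field updates that kubebuilder markers alone cannot.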

// When no NotificationsBusInstance is referenced in the subCR (override),
// try to inject the top-level one if defined
if instance.Spec.Cinder.Template.NotificationsBusInstance == nil {
	instance.Spec.Cinder.Template.NotificationsBusInstance = instance.Spec.NotificationsBusInstance
}
Contributor

this inherits an "" value down to the service level. We need to see if that is what we want.

Contributor

@bogdando bogdando Sep 5, 2025

Yes, how to handle empty values remained a gray area in the design. See the #1402 description:

Note about a special handling expected for an empty value by the services
that will be supporting this interface. It should provide backwards
compatibility during oscp and services CRDs upgrades.

There is no empty-value handling at the top scope (notifications cannot be disabled cluster-wide there), however. It may only take a default value of 'rabbitmq'. Use the service templates to override it with an empty value, if needed.

Contributor Author

If you look at the cinder/glance/manila implementation, for example [1], the parameter is the same, and the openstack-operator only performs parameter inheritance, if required, so I don't see any problem with set/update/delete for storage CRs (it is basically the same thing we did for topologies).

[1] https://github.com/openstack-k8s-operators/manila-operator/blob/main/api/v1beta1/manila_types.go#L135

Contributor

@bogdando bogdando Sep 19, 2025

I think this situation is no longer possible, as empty strings are now rejected by the webhook.


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/95c8ffa28bb84df6841fd4a84713376c

✔️ openstack-k8s-operators-content-provider SUCCESS in 3h 25m 15s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 41m 46s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 31m 11s
✔️ adoption-standalone-to-crc-ceph-provider SUCCESS in 3h 06m 33s
openstack-operator-tempest-multinode RETRY_LIMIT in 13m 16s

@fmount fmount force-pushed the notif branch 3 times, most recently from a593d78 to 71ce410 on September 7, 2025 12:28
@fmount
Contributor Author

fmount commented Sep 7, 2025

Would it be possible to include the propagation to the nova level field too? I hope it is not too big of a complexity increase to take.

Of course, done. Added nova propagation as well.

@fmount
Contributor Author

fmount commented Sep 17, 2025

/retest

@amoralej
Contributor

The Watcher section also supports NotificationsBusInstance. If merging this, it should also be covered for consistency.

Also, neutron supports it, although I'm not sure if there is any reason to skip it in this patch.

@fmount
Contributor Author

fmount commented Sep 17, 2025

The Watcher section also supports NotificationsBusInstance. If merging this, it should also be covered for consistency.

Also, neutron supports it, although I'm not sure if there is any reason to skip it in this patch.

Interesting, thanks @amoralej, I wasn't aware of the current status. I started with storage, but I was going to extend this patch as needed. Let me check the current status and add both neutron and watcher!

@auniyal61
Contributor

How the existing flow works for nova and neutron: taking nova as an example, right now the user creates a new rabbit and wires it into the nova spec via notificationsBusInstance. The nova operator then recognizes the new rabbit, creates a new transporturl, and updates the confs so nova is aware of where to send notifications.

A consumer can then listen/read on the new rabbit.

Can you please explain how this will be used from the user's perspective?

My understanding is that this will work as a single top-level switch to enable/disable external notifications for the respective service operators: the user creates a new rabbit and wires it to the top-level notificationsBusInstance, and for all supported services/operators an individual transporturl is created, so they start publishing external notifications to the single dedicated new rabbit.

Is the above correct?

@fmount
Contributor Author

fmount commented Sep 19, 2025

How the existing flow works for nova and neutron: taking nova as an example, right now the user creates a new rabbit and wires it into the nova spec via notificationsBusInstance. The nova operator then recognizes the new rabbit, creates a new transporturl, and updates the confs so nova is aware of where to send notifications.

A consumer can then listen/read on the new rabbit.

Can you please explain how this will be used from the user's perspective?

My understanding is that this will work as a single top-level switch to enable/disable external notifications for the respective service operators: the user creates a new rabbit and wires it to the top-level notificationsBusInstance, and for all supported services/operators an individual transporturl is created, so they start publishing external notifications to the single dedicated new rabbit.

Is the above correct?

Correct. It is perfectly fine to create a dedicated rabbitmq instance and make this parameter point to the new rabbitmq instance name. By doing this, all the underlying services will point to it and emit notifications. From a service perspective this mechanism is no different from the regular rabbitmq/transportURL request that is already implemented as a pattern across the board. This part is handled in each service (with a dedicated NotificationBusInstanceReady condition [1] that is evaluated by each service).
What this patch does is:

  • prevent adding a rabbitmq name that refers to an instance that doesn't exist
  • propagate the top-level field to the underlying components if it is not present as an override (which is yet another pattern we follow for most of the service fields)

The processing logic already exists in the service operators to properly reconcile and configure the services.

[1] https://github.com/openstack-k8s-operators/lib-common/blob/main/modules/common/condition/conditions.go#L72
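From the user's perspective, the flow described above might look like the following sketch of an OpenStackControlPlane CR (the instance name `rabbitmq-notifications` and the exact template layout are illustrative assumptions, not taken from this PR):

```yaml
# Illustrative sketch only: field paths other than notificationsBusInstance
# may differ from the actual OpenStackControlPlane schema.
apiVersion: core.openstack.org/v1beta1
kind: OpenStackControlPlane
metadata:
  name: openstack
spec:
  # Top-level notifications bus; propagated to every service that
  # implements the field and does not override it in its template.
  notificationsBusInstance: rabbitmq-notifications
  cinder:
    template: {}  # inherits rabbitmq-notifications
  glance:
    template:
      # a service-level override wins over the top-level value
      notificationsBusInstance: rabbitmq-glance
```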


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/885e5285a3054f3a83223613d66fb5c2

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 01m 03s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 24m 05s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 45m 42s
adoption-standalone-to-crc-ceph-provider RETRY_LIMIT in 20m 25s
✔️ openstack-operator-tempest-multinode SUCCESS in 1h 39m 36s

@fmount
Contributor Author

fmount commented Sep 20, 2025

/test openstack-operator-build-deploy-kuttl-4-18

@fmount
Contributor Author

fmount commented Sep 20, 2025

recheck

@stuggi
Contributor

stuggi commented Sep 20, 2025

FYI, there is an issue with the adoption test, which makes it fail.


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/0551c0c5f8b54a11aa8cfb52af05c29f

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 07m 40s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 19m 05s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 26m 40s
adoption-standalone-to-crc-ceph-provider RETRY_LIMIT in 15m 04s
openstack-operator-tempest-multinode FAILURE in 1h 53m 16s

@fmount
Contributor Author

fmount commented Sep 22, 2025

recheck


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/4840aa8f77ec4372b20e30509a9ccaf5

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 48m 57s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 16m 11s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 33m 42s
adoption-standalone-to-crc-ceph-provider FAILURE in 1h 25m 44s
✔️ openstack-operator-tempest-multinode SUCCESS in 1h 35m 31s

@fmount
Contributor Author

fmount commented Sep 23, 2025

rebased and good to go.

@fmount
Contributor Author

fmount commented Sep 23, 2025

/test openstack-operator-build-deploy-kuttl

1 similar comment
@fmount
Contributor Author

fmount commented Sep 24, 2025

/test openstack-operator-build-deploy-kuttl

@fmount
Contributor Author

fmount commented Sep 24, 2025

/retest

@fmount
Contributor Author

fmount commented Sep 25, 2025

/test openstack-operator-build-deploy-kuttl

1 similar comment
@fmount
Contributor Author

fmount commented Sep 25, 2025

/test openstack-operator-build-deploy-kuttl

@fmount
Contributor Author

fmount commented Sep 25, 2025

still can't get a cluster claim to run kuttl tests in Prow, retrying in a bit

This patch introduces the notificationsBusInstance parameter at the API
level. When an override is not defined, it is propagated to the
underlying storage components where this field is implemented.

Signed-off-by: Francesco Pantano <[email protected]>
Because we can't control user input, this patch introduces a webhook
function to validate what we expect users to set as the
notificationsBusInstance top-level parameter.
It also adds the associated webhook envTest to validate the use cases
covered by the new function.

Signed-off-by: Francesco Pantano <[email protected]>
Add comprehensive envTests covering the "notificationsBusInstance"
parameter lifecycle and its propagation to implementing services.

The tests validate three key scenarios:

- Base use case: parameter is properly referenced and used
- Override use case: service overrides with a custom/dedicated rabbitmq
  instance
- Removal use case: services disable notifications when parameter is
  removed

This demonstrates the flexibility of referencing non-default rabbit
instances and ensures consistent interface implementation across
different services.

Signed-off-by: Francesco Pantano <[email protected]>
@fmount
Contributor Author

fmount commented Sep 26, 2025

OK, finally green CI, I assume this is good to go. Can we land this patch at this point? (/cc @abays @stuggi @amoralej @bogdando @gibizer)

Contributor

@bogdando bogdando left a comment

LGTM

@amoralej
Contributor

lgtm
