Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Low: libpacemaker: Don't assert on injecting nonexistent nodes. #3808

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

clumens
Copy link
Contributor

@clumens clumens commented Jan 23, 2025

If crm_simulate tries to inject a node up or node fail action for a nonexistent node, it'll just assert. Instead, log an error message and return NULL.

That NULL then needs to be propagated up a couple layers to functions that actually expect and know how to handle errors.

Fixes T945

@clumens clumens force-pushed the crm_simulate-clean branch from 8fe1757 to 18f3048 Compare January 23, 2025 21:04
Copy link
Contributor

@nrwahl2 nrwahl2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My inclination is to fix this in an even simpler way. Currently, for --node-up, if the node doesn't exist then we simply inject it. We could do the same thing for --node-down (which also asserts) and --node-fail.

We haven't done that in the past, and there may be only little or no legitimate use case for it. But it seems at least as reasonable as ignoring the nonexistent node for --node-down and --node-up. And it seems unlikely that anyone is relying on not injecting the nonexistent node, because it's probably been asserting for a long time/forever.

To do the injection, all we have to do is set pcmk__simulate_node_config = true to node_down_cb() and node_fail_cb().

What do you think?

@nrwahl2
Copy link
Contributor

nrwahl2 commented Feb 3, 2025

To do the injection, all we have to do is set pcmk__simulate_node_config = true to node_down_cb() and node_fail_cb().

What do you think?

Actually, injecting an action that refers to a nonexistent node would run into a similar issue. There may be other commands that do too. So, two options:

  1. Do it the way you're doing it currently in this PR (returning NULL and letting the caller handle it as appropriate).
  2. Get rid of pcmk__simulate_node_config and treat it as always true. Inject the node if it didn't already exist. Currently we assert in those situations, so no one should be relying on old behavior. I haven't yet looked deeply into whether this would have other unintended consequences. I've only glanced at this PR; still focused on the D-Bus stuff.

If crm_simulate tries to inject a node up or node fail action for a
nonexistent node, it'll just assert.  Instead, log an error message and
return NULL.

That NULL then needs to be propagated up a couple layers to functions
that actually expect and know how to handle errors.

Fixes T945
@clumens clumens force-pushed the crm_simulate-clean branch from 18f3048 to 4f727f0 Compare March 5, 2025 19:51
@clumens
Copy link
Contributor Author

clumens commented Mar 5, 2025

Rebased on main just to take a look at this again - no code changes, and not worth looking at again yet.

@clumens
Copy link
Contributor Author

clumens commented Mar 5, 2025

2. Get rid of `pcmk__simulate_node_config` and treat it as always true. Inject the node if it didn't already exist. Currently we assert in those situations, so no one should be relying on old behavior. I haven't yet looked deeply into whether this would have other unintended consequences. I've only glanced at this PR; still focused on the D-Bus stuff.

The way it feels like this should work to me is:

(1) If pcmk__simulate_node_config is true, inject unknown nodes.
(2) If it's false, error on unknown nodes.

I guess I don't much see the point of a flag that's only useful if it's set and crashes sometimes if it's not set. I am inclined to do what you've outlined here, though I can see the value of erroring out on unknown nodes as a way of catching typos on the command line.

@clumens
Copy link
Contributor Author

clumens commented Mar 5, 2025

Well, doing the obvious thing of getting rid of pcmk__simulate_node_config causes a lot of changes in scheduler regression output, specifically around bundle resources. For example:

--- a/cts/scheduler/summary/bundle-connection-with-container.summary
+++ b/cts/scheduler/summary/bundle-connection-with-container.summary
@@ -46,10 +46,11 @@ Using the original execution date of: 2022-07-13 22:13:26Z
 
 Revised Cluster Status:
   * Node List:
+    * Node httpd-bundle-0: pending
     * Online: [ rhel8-1 rhel8-3 rhel8-4 rhel8-5 ]
     * OFFLINE: [ rhel8-2 ]
     * RemoteOnline: [ remote-rhel8-2 ]
-    * GuestOnline: [ httpd-bundle-0 httpd-bundle-1 httpd-bundle-2 ]
+    * GuestOnline: [ httpd-bundle-1 httpd-bundle-2 ]
 
   * Full List of Resources:
     * Fencing  (stonith:fence_xvm):     Started rhel8-3

@nrwahl2
Copy link
Contributor

nrwahl2 commented Mar 5, 2025

Well, doing the obvious thing of getting rid of pcmk__simulate_node_config causes a lot of changes in scheduler regression output, specifically around bundle resources. For example:

At a glance, it seems to be because we're now creating a node object for guest nodes when we shouldn't be. Looks like during unpack_nodes() instead of during unpack_remote_nodes(). Again, only glanced at trace output.

This still might be a decent path forward but we'd have to handle guest nodes differently somehow.

Obviously pcmk__inject_node() is the only function that changes if we set pcmk__simulate_node_config = true unconditionally, so start there and its callers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants