Skip to content

Conversation

@nrwahl2
Copy link
Contributor

@nrwahl2 nrwahl2 commented Sep 18, 2025

Most of this seems fairly straightforward, but the fencer changes related to pcmk_delay_base and pcmk_host_map could use some testing. The existing code was unintelligible to me at first, so it required a LOT of small, incremental changes.

@nrwahl2 nrwahl2 requested a review from clumens September 18, 2025 21:07
@nrwahl2
Copy link
Contributor Author

nrwahl2 commented Sep 18, 2025

This is pretty big regardless. So depending on the speed of reviewing it, we could break off the fencer commits into a later PR, and maybe work on sharing code with build_port_aliases() as mentioned in the TODO comment.

I don't love the idea of pulling this parsing code into libcrmcommon or libpacemaker, since only the fencer uses this syntax. However, that would allow for unit testing if we did.

@nrwahl2
Copy link
Contributor Author

nrwahl2 commented Sep 18, 2025

One cts-fencing test fails currently:

Pattern Not Matched = 'Delaying 'off' action targeting node3 using true2 for 1s | timeout=120s requested_delay=0s base=1s max=2s'
FAILURE - 'topology_delays' failed. 1 patterns out of 8 not matched. 0 negative matches.

Edit: Added a commit to fix this in the test

@nrwahl2 nrwahl2 force-pushed the nrwahl2-strtok branch 2 times, most recently from bf6eb20 to b936817 Compare September 19, 2025 03:40
* into fewer than two mappings avoids a "Malformed mapping" error message
* below.
*/
if (g_strv_length(mappings) < 2) {
Copy link
Contributor Author

@nrwahl2 nrwahl2 Sep 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong. There could be a single mapping -- though if there are no delimiters, we don't want to throw an error if it's an interval instead of a mapping.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this still need to be addressed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

g_string_append_c(buf, '"');
g_string_append(buf, value);
g_string_append(buf, *value);
g_string_append_c(buf, '"');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These calls could be replaced with:

g_string_append_printf(buf, "\"%s\"", *value);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, they can. It's a question of whether we want to.

If you look around at all the places where we build GStrings, we almost never use g_string_append_printf(). We basically use it only when we need to stringify a number or something. Otherwise, we use g_string_append(), g_string_append_c(), and pcmk__g_strcat() (a macro that chains together g_string_append() calls).

The reason? The usual. Performance. When we first decided to start moving to GStrings to avoid static buffer size limitations, we obsessively ran performance tests using crm_simulate and pcmk_simtimes. We found that using string formatting functions like g_string_append_printf() or even crm_strdup_printf() carried a lot of overhead. They reliably made a difference of a few percentage points in the runtimes of simulations on certain scheduler inputs. So we decided to avoid them wherever possible. A string or a character can be appended to a GString without any formatting.

With all of that said... I'm not convinced that a 5% slowdown (for example) is important enough to matter, if the printf() alternatives are more readable. The flip side of course is that if we make several changes that each cause a 5% slowdown, it will start to matter.

Copy link
Contributor Author

@nrwahl2 nrwahl2 Oct 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should explicitly say: let me know your thoughts on the above.

In any case, performance shouldn't matter much when displaying options. I continued to follow that "rule" here more for consistency.


for (char *dir = strtok(dirs, ":"); !found && (dir != NULL);
dir = strtok(NULL, ":")) {
dirs = g_strsplit(PCMK__OCF_RA_PATH, ":", 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that if PCMK__OCF_RA_PATH is NULL, this will assert instead of return false like we previously did. I have not gone back and checked previous patches to see if this is taken into consideration in those.

Copy link
Contributor Author

@nrwahl2 nrwahl2 Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like it can be NULL, since it's set by the configure script. I can add a comment to that effect. I'm afraid Coverity will complain about an always-true condition if I check whether it's NULL.

Edit: On second thought it feels silly to add a comment, since we generally assume our macros are defined in a reasonable way. I can add assertions for dirs here and elsewhere though.

* into fewer than two mappings avoids a "Malformed mapping" error message
* below.
*/
if (g_strv_length(mappings) < 2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this still need to be addressed?

}
}
value[k] = '\0';

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs (doc/sphinx/Pacemaker_Explained/fencing.rst) say this about pcmk_host_map:

     - A mapping of node names to ports for devices that do not understand the
       node names. For example, ``node1:1;node2:2,3`` tells the cluster to use
       port 1 for ``node1`` and ports 2 and 3 for ``node2``. If
       ``pcmk_host_check`` is explicitly set to ``static-list``, either this or
       ``pcmk_host_list`` must be set. The port portion of the map may contain
       special characters such as spaces if preceded by a backslash *(since 2.1.2)*.

I'm unclear on what a "port" is in this context, and what characters are valid in the definition of a port. I'd like to make sure that if we are expected to support characters such as spaces in port names, that the new parsing code handles that without backslashes. Also, since this has been present since 2.1.2, I'm a little concerned about breaking whoever is using this right now regardless of how little sense it makes or how it is currently mangled in the XML.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Port" comes from a time when it was more common for fencing to target a shared power supply or network switch. For example:

the nodes each have two independent Power Supply Units (PSUs) connected to two independent Power Distribution Units (PDUs) reachable at 198.51.100.1 (port 10 and port 11) and 203.0.113.1 (port 10 and port 11)

Ref https://clusterlabs.org/projects/pacemaker/doc/3.0/Pacemaker_Explained/singlehtml/#example-dual-layer-dual-device-fencing-topologies

It has since been overloaded -- for example, fence_vmware_rest uses it for the VM name. Ultimately, the meaning of a port, and thus the characters that make sense, are up to the fence agent. If the fence agent advertises support for a --plug or --port option, and pcmk_host_map is configured, then the "ports" get passed as that option by default (or as pcmk_host_argument if set).


I'd like to make sure that if we are expected to support characters such as spaces in port names, that the new parsing code handles that without backslashes

Previously, the backslash was just getting ignored. Wouldn't hurt to do some testing (or re-testing... not sure what I did, if anything) to be sure. But the reason I felt comfortable making this change in the first place, is that it doesn't appear to change the end result; we simply no longer skip a backslash if present.

FWIW Oyvind has no idea why e95198f was done.


Also, since this has been present since 2.1.2, I'm a little concerned about breaking whoever is using this right now regardless of how little sense it makes or how it is currently mangled in the XML.

Hopefully no one... but I understand your concern. This is a case where I'd be willing to take my chances in order to make the code look sane. But I'm not the one who has to do the z-stream builds.


if (pcmk__str_empty(nvpair[0]) || pcmk__str_empty(nvpair[1])) {
crm_err(PCMK_FENCING_HOST_MAP ": Malformed mapping '%s'", *mapping);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're changing behavior... I wonder if it makes sense to return an empty mapping in the case where it's malformed. I can see going either way with it. If there's a parse error, who knows what they screwed up so maybe the entire thing is invalid. On the other hand, maybe we should try to use what we can from before the error occurred (what we're doing right now) to implement the user's configuration as much as possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not even just "before the error occurred," but also after the error occurred. This is a sanity check to catch mappings with an empty name (node) or value (port). Note that this is happening after splitting on semicolons and whitespace. So consider the following (colons are equivalent to equal signs):

node1:port1 node2: node3:port3 :port4 node5:port5 :

The mappings for node1, node3, and node5 would get used. The mappings node2:, :port4, and : would get ignored because either the name or value was empty. I lean against throwing away the entire set of mappings in that case. But I am open to it.

@clumens clumens added the review: in progress PRs that are currently being reviewed label Oct 9, 2025
For readability. Similarly for the use of pcmk__str_eq() -- using
strcmp() in add_possible_values_default() was not a bug in this case, as
both arguments were guaranteed to be non-NULL, but this was not obvious.

Signed-off-by: Reid Wahl <[email protected]>
For readability and reliability. strtok() modifies its argument string,
but according to the C11 standard (per the following Stack Overflow
answer), the setlocale() return value should not be modified by the
program.
* https://stackoverflow.com/a/29116455

Signed-off-by: Reid Wahl <[email protected]>
The name is overly long and uses the public API prefix ("services_" with
a single underscore). Also rename the lpc variable to "i" and scope it
to the loop, rename the "root" argument to "dir", and use bool instead
of gboolean.

I'm guessing that the "os" prefix, as well as the separate
"services_linux.c" file, were meant to allow different implementations
of certain functions depending on the OS. GLib does similar things.
However, the commit that introduced it all (cc2110d) sheds no light on
this question.

Signed-off-by: Reid Wahl <[email protected]>
* namelist was not being freed when entries == 0, indicating an empty
  directory.
* A namelist entry was not being freed if stat() failed.

Signed-off-by: Reid Wahl <[email protected]>
The name is overly long and uses the public API prefix ("services_" with
a single underscore).

Use bool instead of gboolean, and rename the "root" argument to "dirs".
"dirs" is a colon-delimited list of directories.

I'm guessing that the "os" prefix, as well as the separate
"services_linux.c" file, were meant to allow different implementations
of certain functions depending on the OS. GLib does similar things.
However, the commit that introduced it all (cc2110d) sheds no light on
this question.

Signed-off-by: Reid Wahl <[email protected]>
For readability versus strtok().

Signed-off-by: Reid Wahl <[email protected]>
Previously, the libcrmservice code supported initdir being a
colon-delimited list of directories. However, this was not documented in
the configure help text or comments. All of the documentation referred
to initdir being a single directory.

The default behavior, if not building with "--with-initdir=<dir>", was
to check a list of directories and set INITDIR to the first one that
existed.

Signed-off-by: Reid Wahl <[email protected]>
Its sole remaining caller did nothing except call it. Move the code
there.

Signed-off-by: Reid Wahl <[email protected]>
The resources_os_list_ocf_providers() is listing top-level
subdirectories of each directory in PCMK__OCF_RA_PATH.

Signed-off-by: Reid Wahl <[email protected]>
...up.

* Give it a shorter and clearer name that doesn't look like public API.
* Drop arguments that always take the same value. (There is only a
  single caller.)
* Avoid a fixed-size buffer where we don't control the input.
* Use g_strsplit() for readability instead of strtok().

Signed-off-by: Reid Wahl <[email protected]>
...in get_directory_list(). To do this, copy the current
services__list_dir() function into a static helper called gdl_helper().

This is the only call to services__list_dir() that may take a false
value for the executable argument.

Signed-off-by: Reid Wahl <[email protected]>
Previously, files of type other than directory and regular file would be
listed unconditionally.

Now, list only executable regular files if exec_files is true, and list
only directories if exec_files is false.

Signed-off-by: Reid Wahl <[email protected]>
To services__list_ocf_providers(), in order to follow the library's
naming conventions.

Signed-off-by: Reid Wahl <[email protected]>
To services__list_ocf_agents(), in order to follow the library's
naming conventions.

Also defunctionize list_provider_agents(). It seems more straightforward
to have it inline.

Signed-off-by: Reid Wahl <[email protected]>
And return bool instead of gboolean. More to come.

Signed-off-by: Reid Wahl <[email protected]>
And compare strchr() return value against NULL instead of 0.

Signed-off-by: Reid Wahl <[email protected]>
It's still confusing, but hopefully a little bit less so.

When delay_base_s is set, it means we've found our mapping. So we use
that as a loop exit condition.

We no longer need valptr, since we never change what value points to.

Now that value isn't changing, it will be easier to feel confident about
using g_strsplit_set() in place of strtok().

Note the similarity to pcmk__scan_nvpair().

Signed-off-by: Reid Wahl <[email protected]>
Instead of strtok(). This seems clearer.

Skip empty mappings so that we don't log a "malformed" error. strtok()
skipped contiguous sequences of delimiters, so we wouldn't try to parse
an empty mapping. g_strsplit_set() returns an empty string between
contiguous delimiters.

Strip leading and trailing whitespace before parsing the param value,
and don't try to parse as a sequence of mappings (delimited by ';', ' ',
and '\t') if there are no more delimiters after stripping. This avoids
logging an error when the value is formatted like a single interval: no
colon, no semicolon, no spaces or tabs except optionally at the
beginning or end of the string.

It seems reasonable not to strip leading or trailing semicolons. We can
ignore leading or trailing whitespace, but a leading or trailing
semicolon seems like a clear indicator that this is supposed to be a
mapping. This will result an error for values like ";3s " or " 3s;" but
not for " 3s ".

Signed-off-by: Reid Wahl <[email protected]>
Seems cleaner since it isolates the "mappings" variable and
g_strsplit_set() call.

Signed-off-by: Reid Wahl <[email protected]>
An ISO 8601 interval can include a colon character. I can't think of any
valid reason why a user would specify an interval this way for a fencing
delay, but there doesn't seem to be any reason to exclude mapping values
that have a colon. The "make sure there's no colon" check seemed to be
an artifact of the way the parsing code worked before we started using
g_strsplit(), etc.

Signed-off-by: Reid Wahl <[email protected]>
Commit f057778 fixed an off-by-one error that a test relied on to avoid
randomness in the output.

Also add a new test to ensure that pcmk_delay_base is capped by
pcmk_delay_max.

Signed-off-by: Reid Wahl <[email protected]>
...to i. Also rename max to len.

Signed-off-by: Reid Wahl <[email protected]>
This is a static function with only one caller, and that caller always
passes the address of a struct member.

Signed-off-by: Reid Wahl <[email protected]>
It's about to go out of scope, so it doesn't get reused. It also doesn't
get freed because the aliases table has taken ownership of it.

Signed-off-by: Reid Wahl <[email protected]>
This should make absolutely no difference but uses the correct
semantics.

Signed-off-by: Reid Wahl <[email protected]>
This reverts e95198f.

Backslash escapes are useful in C and on the command line, but it's not
clear how they would be useful in a pcmk_host_map value. Backslash
escapes are not used in XML, where the configuration is stored.

The removed code did the following upon finding a backslash:
* If we're not inside the value part of a host-to-port mapping, skip
  both a backslash and the character after it.
* Otherwise, skip the backslash and keep the character after it (unless
  that character is also a backslash).

This doesn't seem to make sense. Any characters that are special in XML
attribute values would need to be escaped using XML entities. Other
characters, including backslashes, are allowed as literals.
* https://www.w3.org/TR/REC-xml/#NT-AttValue

Signed-off-by: Reid Wahl <[email protected]>
Previously, if we hit a parsing error in pcmk_host_map, we logged a
debug message and continued trying to parse the rest of the value. It's
hard to be confident about the resulting behavior.

Now, we stop parsing if we encounter an error.

Also as part of this commit, we use g_strsplit_set() twice. At the first
level, we use it to split the list of node-name-to-port mappings in
pcmk_host_map, using ';', ' ', and '\t' as delimiters. At the second
level, we use it to split each mapping, using '=' and ':' as delimiters.
Support for '=' is likely to be removed in a future release.

I greatly struggled to reason about the existing code based on
"last = i + 1". It felt as if there were lots of potential edge cases.
The behavior may not be identical as of this commit, but it should be
more robust and predictable, and the code should be easier to maintain.

Signed-off-by: Reid Wahl <[email protected]>
@nrwahl2
Copy link
Contributor Author

nrwahl2 commented Oct 28, 2025

Rebased on main, no other changes yet

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review: in progress PRs that are currently being reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants