Use dbus to detect completion of systemd resource start/stop actions #3818
|
@clumens This is an alternate version of #3805 to show an idea. I HAVE NOT TESTED THIS AT ALL YET. All I know is that it compiles. I just got it put together and am going to bed. If it works (including with minor tweaks), great. This PR also implements almost all of my review comments. One note: I dropped the "remove the timer for start/stop" part, but I didn't change the timeout handling at all otherwise. We had some discussion in #3805 about deficiencies in our current timeout handling. |
7b4a18d to d3e2f85
a31d3e5 to 002647f
002647f to 108f582
|
@clumens Testing has been very limited so far. I'm marking this ready for review, but bear that in mind. |
ef1d140 to 12cebe0
12cebe0 to b39b822
This looks fine to me, aside from still not having any confidence in how timeouts interact with waiting for systemd to tell us something has happened.
|
One paperwork thing... I still come up as the author on the last two commits. That may be true on the subscribe commit, but I think the last commit is now both of us. You can add a |
b39b822 to 7ab5c38
|
Rebased on current main; no other changes in this push |
7ab5c38 to 5e4d0b6
|
Added |
5e4d0b6 to c4e3224
|
Hopefully fixed this segfault: #3818 (comment) |
c4e3224 to 77f3bea
|
Only thing left (besides testing, obviously) is timeout logic. Thoughts:
|
In general, I'd prefer to use the systemd timeouts over our own for systemd services. We have so, so many timers around all this stuff and offloading it would be easier to understand, I think. Plus, I think using the systemd timers might fix up some of your past comments about eating away at timers with setup/teardown tasks.
Introduced without comment in fccd046. My hunch is that it's just some slop time to guarantee that pacemaker's timeout is long enough for us to tell systemd what to do, systemd getting around to executing it, the actual timeout value to pass, and then systemd getting around to notifying us about what it did. |
The issue is that we still need to track the overall timeout. We do the So I was thinking we keep the Pacemaker timer until the start job is enqueued. To be most correct, we should enqueue it with the systemd job timeouts ( Hopefully none of that will ever matter. I don't like to rely on hope.
Yep, sorry for making you track that down. I've got that in a comment in a WIP commit that I haven't pushed yet. That hunch makes sense and is my best guess as well. I don't love having arbitrary slop time. |
77f3bea to dc58e11
I figured out how we would do this, but I think I've changed my mind (before I got too far along). It's still true that there are several asynchronous steps of variable duration before the start is enqueued. These steps occur after Pacemaker initiates the action. However, when a user configures an action timeout, they're not thinking about overhead like this. They are (or should be) thinking about how long the start/stop itself takes. It feels nice when I can convince myself that the simpler approach is probably the better one. Let me know if you agree. |
Sigh. I just remembered that we don't support JUST systemd services, but also sockets, mounts, timers, and paths. To me it seems silly... I struggle to imagine a use case for managing these directly that couldn't be better achieved in some other way. But whatever. Someone wanted them. Added via #1508. These are probably already broken to some extent, because our override files assume we're working with a service. Other unit types don't support I think our best bet for a portable override is to set This still leaves the problem of creating an override file in the correct directory, whose syntax is valid for the unit type we're dealing with. The current one has a |
|
retest this please |
|
@clumens What testing have you already done on this? I just kicked off CI but haven't made any code changes since your last review. There are enough complications with systemd timeouts that we should postpone looking into that and try to get the D-Bus signal stuff in. |
Ref T25 Signed-off-by: Reid Wahl <[email protected]>
We will use this later for D-Bus signal message filtering. Ref T25 Signed-off-by: Reid Wahl <[email protected]>
We'll add a filter later. This does not meaningfully change behavior. Ref T25 Signed-off-by: Reid Wahl <[email protected]>
When systemd receives a StartUnit() or StopUnit() method call, it returns almost immediately, as soon as a start/stop job is enqueued. A successful return code does NOT indicate that the start/stop has finished.

Previously, we worked around this in action_complete() with a hack that scheduled a follow-up monitor after a successful start/stop method call, which polled the service after 2 seconds to see whether it was actually running. However, this was not a robust solution. Timing issues could result in Pacemaker having an incorrect view of the resource's status or prematurely declaring the action as failed.

Now, we follow the best practice as documented in the systemd D-Bus API doc (see StartUnit()):
https://www.freedesktop.org/software/systemd/man/latest/org.freedesktop.systemd1.html#Methods

After kicking off a systemd start/stop action, we make note of the job's D-Bus object path. Then we register a D-Bus message filter that looks for a JobRemoved signal whose bus path matches. This signal indicates that the job has completed and includes its result.

When we find the matching signal, we set the action's result. We then remove the filter, which causes the action to be finalized and freed. In the case of the executor daemon, the action has a callback (action_complete()) that runs during finalization and sets the executor's view of the action result.

Monitor actions still need much of the existing workaround code in action_complete(), so we keep it for now. We bail out for start/stop actions after setting the result as described above.

Ref T25

Co-authored-by: Reid Wahl <[email protected]>
Signed-off-by: Reid Wahl <[email protected]>
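For readers who want the shape of the JobRemoved matching described in this commit message, here is a minimal Python sketch. It is not the Pacemaker code (which is C and uses a real D-Bus connection). `SignalBus`, `PendingAction`, and `watch_job` are hypothetical names invented for illustration, and the bus is a toy in-process dispatcher. The only parts taken from systemd's documented API are the JobRemoved fields (id, job path, unit, result) and the use of "done" as the success result.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class JobRemoved:
    """Fields carried by systemd's Manager.JobRemoved signal."""
    job_id: int
    job_path: str
    unit: str
    result: str  # e.g. "done", "failed", "timeout", "canceled"


@dataclass
class PendingAction:
    """Hypothetical stand-in for an in-flight start/stop action."""
    unit: str
    job_path: str  # object path returned by StartUnit()/StopUnit()
    result: Optional[str] = None


class SignalBus:
    """Toy dispatcher standing in for a D-Bus connection's filter list."""

    def __init__(self) -> None:
        self._filters: List[Callable[[JobRemoved], None]] = []

    def add_filter(self, func: Callable[[JobRemoved], None]) -> None:
        self._filters.append(func)

    def remove_filter(self, func: Callable[[JobRemoved], None]) -> None:
        self._filters.remove(func)

    def dispatch(self, signal: JobRemoved) -> None:
        # Iterate over a copy so a filter can remove itself while running.
        for func in list(self._filters):
            func(signal)


def watch_job(bus: SignalBus, action: PendingAction,
              on_done: Callable[[PendingAction], None]) -> None:
    """Record the job result when the matching JobRemoved signal arrives."""

    def job_filter(signal: JobRemoved) -> None:
        if signal.job_path != action.job_path:
            return  # some other job finished; keep waiting
        action.result = signal.result
        bus.remove_filter(job_filter)  # one-shot: stop listening
        on_done(action)                # finalize the action

    bus.add_filter(job_filter)


def succeeded(action: PendingAction) -> bool:
    """Only the "done" result means the job completed successfully."""
    return action.result == "done"
```

Signals for other jobs are ignored, and the filter removes itself after the first match, so a later signal cannot overwrite the recorded result.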
dc58e11 to 6e70bb6
|
Rebased on main to resolve conflict |
|
Merging after some discussion with clumens. Additional testing was done, and if there are bugs, we'll benefit from other users exercising this code. |