More mdcheck fixes: Rework mdcheck service logic (2nd attempt) #198
PR on top of #189. Relaunch of #190 after @XiaoNi87 indicated to me that my changes would be acceptable to him.
This PR changes the logic of the `mdcheck` tool and the related systemd services `mdcheck_start.service` and `mdcheck_continue.service`.

The current behavior is like this:
- `mdcheck` without arguments starts a RAID check on all arrays on the system, starting at position 0. This is started from `mdcheck_start.service`, which a systemd timer starts once a month.
- `mdcheck --continue` looks for files `/var/lib/mdcheck/MD_UUID_$UUID`, reads the start position from them, and starts a check from that position on the array with the respective UUID. This is started from a systemd timer every night.

In either case, `mdcheck` won't do anything if the kernel is already running a `sync_action` on a given array. The check runs for a given period of time (default 6h) and saves the last position in the `MD_UUID_$UUID` file, to be taken up when `mdcheck --continue` is called next time. When the entire array has been checked, the `MD_UUID_$UUID` file is deleted. When all checks are finished, `mdcheck_continue.timer` is stopped, to be restarted when `mdcheck_start.timer` expires next time.

Before the recent commit 8aa4ea9 ("systemd: start mdcheck_continue.timer before mdcheck_start.timer"), this could lead to a race condition when the check for a given array didn't complete within the month.
`mdcheck_start.service` would start and reset the check position to 0 before `mdcheck_continue.service` could pick up at the last saved position. 8aa4ea9 works around this by starting `mdcheck_continue.service` a few minutes before `mdcheck_start.timer`.

Yet the general problem still exists: both services trigger checks in the kernel, which they can only monitor passively. So if users change the timer settings (which they are within their rights to do), a similar race might happen again.
This patch set changes the behavior as follows:
- Only `mdcheck_continue.service` actually starts and stops kernel-based sync actions. This service will continue at the saved start position if an `MD_UUID*` file exists, or start a new check at position 0 otherwise. Starting at 0 can be inhibited by creating a file `/var/lib/mdcheck/Checked_$UUID`. These files will be created by `mdcheck` when it finishes checking a given array. Thus future invocations of `mdcheck_continue.service` will not restart the check on this array.
- `mdcheck_start.service` runs `mdcheck --restart`, which simply removes all `Checked_*` markers from `/var/lib/mdcheck`, so that the next invocation of `mdcheck_continue.service` will start new checks on all arrays which don't have already running checks.

The general behavior of the systemd timers and services is like before, but the mentioned race condition is avoided, even if the user modifies the timer settings arbitrarily.
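The per-array decision described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the patch set: the function names `next_action`, `mark_checked`, and `restart_all` are hypothetical, the state directory is passed in as an argument instead of being fixed to `/var/lib/mdcheck`, and the case of a check already running in the kernel is left out.

```shell
#!/bin/sh
# Hypothetical helpers modelling the new marker-file logic.

# Prints "continue <pos>", "skip", or "start 0" for one array.
next_action() {
    state_dir=$1
    uuid=$2
    if [ -f "$state_dir/MD_UUID_$uuid" ]; then
        # A saved position exists: resume there.
        echo "continue $(cat "$state_dir/MD_UUID_$uuid")"
    elif [ -f "$state_dir/Checked_$uuid" ]; then
        # Array already checked in this cycle: do not start over.
        echo "skip"
    else
        echo "start 0"
    fi
}

# Model of a finished check: leave a Checked_ marker.
# Assumption: the position file is removed at the same time.
mark_checked() {
    rm -f "$1/MD_UUID_$2"
    : > "$1/Checked_$2"
}

# Model of "mdcheck --restart": remove all Checked_ markers.
restart_all() {
    rm -f "$1"/Checked_*
}
```

Because only the marker files change when the monthly timer fires, the order in which the two timers expire no longer affects the position at which a check resumes.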
Unlike #190, this PR preserves the behavior of the `mdcheck` script when called without arguments. `mdcheck` historically had just two modes of operation: default (no arguments) and `--continue`. This set introduces new modes `--restart` (was in #190 already), `--maybe-start` (the behavior of `mdcheck` without args in #190), and `--force-start` (a complete restart of checks on all arrays from 0, like traditional `mdcheck`). For backward compatibility reasons, `--force-start` becomes the default behavior.

More details in the commit descriptions.
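As a quick summary of the modes listed above, the mapping from command-line argument to mode could be pictured like this. The function `mdcheck_mode` is hypothetical and exists only for illustration; the actual script's option parsing may look quite different.

```shell
#!/bin/sh
# Hypothetical summary of the mode selection; not the real option parser.
mdcheck_mode() {
    case ${1:-} in
        "")            echo force-start ;;  # no arguments: backward-compatible default
        --force-start) echo force-start ;;
        --maybe-start) echo maybe-start ;;
        --continue)    echo continue ;;
        --restart)     echo restart ;;
        *)             echo unknown; return 1 ;;
    esac
}
```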
Differences to #190:
mdcheck