kpd-daemon bot commented Sep 17, 2025

Pull request for series with
subject: Don't set MD_BROKEN on failfast bio failure
version: 4
url: https://patchwork.kernel.org/project/linux-raid/list/?series=1002295

Currently, the LastDev flag is set on an rdev that failed a failfast
metadata write and called md_error, but did not become Faulty. It is
cleared when the metadata write retry succeeds. This has problems for
the following reasons:

* Despite its name, the flag is only set during a metadata write window.
* Unlike when LastDev and Failfast were introduced, md_error on the last
  rdev of a RAID1/10 array now sets MD_BROKEN. Thus when LastDev is set,
  the array is already unwritable.

A following commit will prevent failfast bios from breaking the array,
which requires knowing from outside the personality whether an rdev is
the last one. For that purpose, LastDev should be set on rdevs that must
not be lost.

This commit ensures that LastDev is set on the indispensable rdev in a
degraded RAID1/10 array.

Signed-off-by: Kenta Akagi <[email protected]>
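
The patch hunks are not included in this message, so the intended invariant can only be sketched. Below is a minimal, self-contained illustration assuming a RAID1-style array; the structures and the update_last_dev_flag() helper are invented for the example and are not the kernel code.

    #include <stdbool.h>
    #include <stddef.h>

    struct sketch_rdev {
        bool in_sync;
        bool faulty;
        bool last_dev;   /* stand-in for the LastDev rdev flag */
    };

    struct sketch_conf {
        struct sketch_rdev *rdevs;
        int raid_disks;
    };

    /*
     * After any rdev state change, mark LastDev on the single rdev the
     * array cannot afford to lose (only one usable rdev left) and keep
     * it clear otherwise.
     */
    static void update_last_dev_flag(struct sketch_conf *conf)
    {
        int usable = 0;
        struct sketch_rdev *last = NULL;

        for (int i = 0; i < conf->raid_disks; i++) {
            struct sketch_rdev *rdev = &conf->rdevs[i];

            rdev->last_dev = false;
            if (rdev->in_sync && !rdev->faulty) {
                usable++;
                last = rdev;
            }
        }
        if (usable == 1)
            last->last_dev = true;
    }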
md_error is mainly called when a bio fails, so multiple invocations can
run in parallel. Each personality's error_handler takes device_lock, so
concurrent calls are safe.

However, the Failfast bio error handling changes in RAID1 and RAID10 need
a special helper alongside md_error. For that helper to work, the regular
md_error must also be serialized.

The helper function md_bio_failure_error for failfast will be introduced
in a subsequent commit.

This commit serializes md_error for all RAID personalities. While
unnecessary for RAID levels other than 1 and 10, it has no performance
impact as it is a cold path.

Signed-off-by: Kenta Akagi <[email protected]>
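
Since the diff is not shown here, the following is only a sketch of the serialization being described, using a userspace pthread mutex in place of whatever lock the actual patch takes inside md; all names are illustrative stand-ins.

    #include <pthread.h>
    #include <stdbool.h>

    /* one lock shared by every caller of the error path */
    static pthread_mutex_t error_lock = PTHREAD_MUTEX_INITIALIZER;

    struct sketch_mddev { bool broken; };  /* stand-in for MD_BROKEN */
    struct sketch_rdev  { bool faulty; };

    /* stand-in for the personality's error_handler callback */
    static void sketch_error_handler(struct sketch_mddev *mddev,
                                     struct sketch_rdev *rdev)
    {
        rdev->faulty = true;   /* may also set mddev->broken */
    }

    static void sketch_md_error(struct sketch_mddev *mddev,
                                struct sketch_rdev *rdev)
    {
        /*
         * Serialized: bio completions that fail at the same time cannot
         * interleave their error handling; the upcoming failfast helper
         * takes the same lock.
         */
        pthread_mutex_lock(&error_lock);
        sketch_error_handler(mddev, rdev);
        pthread_mutex_unlock(&error_lock);
    }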
Add a new helper function md_bio_failure_error().
It is serialized with md_error() under the same lock and works
almost the same, but with two differences:

* Takes the failed bio as an argument
* If MD_FAILFAST is set in bi_opf and the target rdev is LastDev,
  it does not mark the rdev faulty

Failfast bios must not break the array, but in the current implementation
this can happen. This is because MD_BROKEN was introduced in RAID1/RAID10
and is set when md_error() is called on an rdev required for mddev
operation. At the time failfast was introduced, this was not the case.

By the time of this commit, md_error() is already serialized, and
RAID1/RAID10 mark rdevs that must not be set Faulty by failfast with the
LastDev flag.

The actual change in bio error handling will follow in a later commit.

Signed-off-by: Kenta Akagi <[email protected]>
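
As a rough, standalone sketch of the two differences listed above (the flag value, struct layouts, and helper names are invented for illustration, not the kernel definitions):

    #include <stdbool.h>
    #include <stdint.h>

    #define SKETCH_FAILFAST (1u << 0)  /* stand-in for MD_FAILFAST in bi_opf */

    struct sketch_bio  { uint32_t bi_opf; };
    struct sketch_rdev { bool last_dev; bool faulty; };

    /* what md_error() would do: mark the rdev Faulty */
    static void sketch_fault_rdev(struct sketch_rdev *rdev)
    {
        rdev->faulty = true;
    }

    /*
     * Bio-aware variant, called under the same serialization as the
     * regular error path: a failfast failure on the indispensable rdev
     * is not allowed to fault it.
     */
    static void sketch_bio_failure_error(struct sketch_rdev *rdev,
                                         struct sketch_bio *bio)
    {
        if ((bio->bi_opf & SKETCH_FAILFAST) && rdev->last_dev)
            return;  /* keep the array writable; the bio gets retried */

        sketch_fault_rdev(rdev);
    }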
Failfast is a feature implemented only for RAID1 and RAID10. It instructs
the block device providing the rdev to immediately return a bio error
without retrying if any issue occurs. This allows quickly detaching a
problematic rdev and minimizes IO latency.

Due to its nature, failfast bios can fail easily, and md must not mark
an essential rdev as Faulty or set MD_BROKEN on the array just because
a failfast bio failed.

When failfast was introduced, RAID1 and RAID10 were designed to continue
operating normally even if md_error was called for the last rdev. However,
with the introduction of MD_BROKEN in RAID1/RAID10
in commit 9631abd ("md: Set MD_BROKEN for RAID1 and RAID10"), calling
md_error for the last rdev now prevents further writes to the array.
Despite this, the current failfast error handler still assumes that
calling md_error will not break the array.

Normally, this is not an issue because MD_FAILFAST is not set when a bio
is issued to the last rdev. However, if the array is not degraded and a
bio with MD_FAILFAST has been issued, simultaneous failures could
potentially break the array. This is unusual but can happen; for example,
this can occur when using NVMe over TCP if all rdevs depend on
a single Ethernet link.

In other words, this becomes a problem under the following conditions:
Preconditions:
* Failfast is enabled on all rdevs.
* All rdevs are In_sync - this is a requirement for a bio to be submitted
  with MD_FAILFAST.
* At least one bio has been submitted but has not yet completed.

Trigger condition:
* All underlying devices of the rdevs return an error for their failfast
  bios.

Whether the bio is a read or a write makes little difference to the
outcome.
In the write case, md_error is invoked on each rdev through its bi_end_io
handler.
In the read case, losing the first rdev triggers a metadata
update. Then md_super_write, unlike raid1_write_request, issues the bio
with MD_FAILFAST if the rdev supports it, so the bio fails immediately
(before this patchset, LastDev was set only by the failure path in
super_written). Consequently, super_written calls md_error on the
remaining rdev.

Prior to this commit, the following changes were introduced:
* The helper function md_bio_failure_error() that skips the error handler
  if a failfast bio targets the last rdev.
* Serialization of md_error() and md_bio_failure_error().
* Setting the LastDev flag for rdevs that must not be lost.

This commit uses md_bio_failure_error() instead of md_error() for failfast
bio failures, ensuring that failfast bios do not stop array operations.

Fixes: 9631abd ("md: Set MD_BROKEN for RAID1 and RAID10")
Signed-off-by: Kenta Akagi <[email protected]>
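
To make the trigger condition concrete, here is a tiny standalone simulation of two In_sync rdevs whose failfast writes fail at the same time under the new behaviour. It is not kernel code; the structure, the bio_aware flag, and the point at which LastDev is recomputed are simplifying assumptions for illustration.

    #include <stdbool.h>
    #include <stdio.h>

    struct rdev_sim { bool in_sync, faulty, last_dev; };

    /* bio_aware=false models the old md_error() call, true the new helper */
    static void fail_failfast_write(struct rdev_sim *rdev, bool bio_aware)
    {
        if (bio_aware && rdev->last_dev)
            return;            /* LastDev veto */
        rdev->faulty = true;
    }

    int main(void)
    {
        struct rdev_sim a = { .in_sync = true }, b = { .in_sync = true };

        fail_failfast_write(&a, true);  /* first failfast failure faults 'a' */
        b.last_dev = true;              /* 'b' is now the indispensable rdev */
        fail_failfast_write(&b, true);  /* second failure must not fault 'b' */

        printf("a faulty=%d b faulty=%d -> array %s\n", a.faulty, b.faulty,
               b.faulty ? "broken" : "still writable");
        return 0;
    }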
…ailed bio

In the current implementation, when a write bio fails, the retry flow
is as follows:
* In bi_end_io, e.g. raid1_end_write_request, R1BIO_WriteError is
  set on the r1bio.
* The md thread calls handle_write_finished corresponding to this r1bio.
* Inside handle_write_finished, narrow_write_error is invoked.
* narrow_write_error rewrites the r1bio on a per-sector basis, marking
  any failed sectors as badblocks. It returns true if all sectors succeed,
  or if the failed sectors are successfully recorded via rdev_set_badblocks.
  It returns false if rdev_set_badblocks fails or if badblocks are disabled.
* handle_write_finished faults the rdev if it receives false from
  narrow_write_error. Otherwise, it does nothing.

This can cause a problem where an r1bio that succeeded on retry is
incorrectly reported as failed to the higher layer, for example in the
following case:
* Only one In_sync rdev exists, and
* The write bio initially failed but all retries in
  narrow_write_error succeed.

This commit ensures that if a write initially fails but all retries in
narrow_write_error succeed, R1BIO_Uptodate or R10BIO_Uptodate is set
and the higher layer receives a successful write status.

Signed-off-by: Kenta Akagi <[email protected]>
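
A rough sketch of the status handling described above; the types, parameters, and the split between "all retries succeeded" and "recorded as bad blocks" are simplifications for illustration, not the actual r1bio/r10bio handling.

    #include <stdbool.h>

    struct sketch_r1bio { bool uptodate; bool fault_rdev; };

    /*
     * Stand-in for handle_write_finished() after the per-sector retries
     * in the narrow_write_error-style path have completed.
     */
    static void sketch_handle_write_finished(struct sketch_r1bio *r1bio,
                                             int failed_sectors,
                                             int retried_ok,
                                             bool badblocks_recorded)
    {
        bool all_retries_ok = (retried_ok == failed_sectors);

        if (all_retries_ok) {
            /*
             * The fix: every failed sector was rewritten successfully,
             * so the higher layer must see a successful write status
             * (R1BIO_Uptodate / R10BIO_Uptodate in the real code).
             */
            r1bio->uptodate = true;
        }
        if (!all_retries_ok && !badblocks_recorded) {
            /* unchanged behaviour: the rdev is marked Faulty */
            r1bio->fault_rdev = true;
        }
    }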
In the current implementation, write failures are not retried on rdevs
with badblocks disabled. This is because narrow_write_error, which issues
retry bios, immediately returns when badblocks are disabled. As a result,
a single write failure on such an rdev will immediately mark it as Faulty.

The retry mechanism appears to have been implemented under the assumption
that a bad block is involved in the failure. However, the retry after
MD_FAILFAST write failure depends on this code, and a Failfast write request
may fail for reasons unrelated to bad blocks.

Consequently, if failfast is enabled and badblocks are disabled on all
rdevs, and all rdevs encounter a failfast write bio failure at the same
time, no retries will occur and the entire array can be lost.

This commit adds a path in narrow_write_error to retry writes even on rdevs
where bad blocks are disabled, and failed bios marked with MD_FAILFAST will
use this path. For non-failfast cases, the behavior remains unchanged: no
retry writes are attempted to rdevs with bad blocks disabled.

Fixes: 1919cbb ("md/raid10: add failfast handling for writes.")
Fixes: 212e7eb ("md/raid1: add failfast handling for writes.")
Signed-off-by: Kenta Akagi <[email protected]>
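
A minimal sketch of the new branch, assuming it sits at the entry check of the narrow_write_error-style path; the helper names and the boolean return are illustrative stand-ins, not the kernel code.

    #include <stdbool.h>
    #include <stdint.h>

    #define SKETCH_FAILFAST (1u << 0)

    struct sketch_bio { uint32_t bi_opf; };

    /* reissue the failed range without failfast; result from the device */
    static bool sketch_resubmit_write(struct sketch_bio *bio)
    {
        (void)bio;
        return true;
    }

    /* stand-in for the narrow_write_error() entry check */
    static bool sketch_narrow_write_error(struct sketch_bio *bio,
                                          bool badblocks_enabled)
    {
        if (!badblocks_enabled) {
            if (!(bio->bi_opf & SKETCH_FAILFAST))
                return false;  /* unchanged: no retry, rdev gets faulted */
            /*
             * New path: a failfast failure may have nothing to do with
             * bad blocks, so retry the write anyway.
             */
            return sketch_resubmit_write(bio);
        }
        /* existing per-sector retry + rdev_set_badblocks() path ... */
        return true;
    }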
raid10_end_read_request lacks a path to retry when a failfast IO fails.
As a result, when failfast read IOs fail on all rdevs, the upper layer
receives EIO without the read being rescheduled.

Looking at the two commits below, it seems only raid10_end_read_request
lacks the failfast read retry handling, while raid1_end_read_request has
it. In RAID1, the retry works as expected.
* commit 8d3ca83 ("md/raid10: add failfast handling for reads.")
* commit 2e52d44 ("md/raid1: add failfast handling for reads.")

I don't know why raid10_end_read_request lacks this, but it is probably
just a simple oversight.

This commit makes the failfast read bio for the last rdev in RAID10 be
retried if it fails.

Fixes: 8d3ca83 ("md/raid10: add failfast handling for reads.")
Signed-off-by: Kenta Akagi <[email protected]>
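
Sketched below is the decision the read completion path should be making, mirroring what raid1 already does according to the commits above; the enum, struct layouts, and names are invented for the example, not the kernel structures.

    #include <stdbool.h>
    #include <stdint.h>

    #define SKETCH_FAILFAST (1u << 0)

    struct sketch_bio  { uint32_t bi_opf; bool failed; };
    struct sketch_rdev { bool last_dev; };

    enum sketch_read_outcome { READ_OK, READ_RETRY, READ_FAULT_RDEV };

    /* stand-in for the decision raid10_end_read_request should make */
    static enum sketch_read_outcome
    sketch_end_read_request(struct sketch_rdev *rdev, struct sketch_bio *bio)
    {
        if (!bio->failed)
            return READ_OK;

        if ((bio->bi_opf & SKETCH_FAILFAST) && rdev->last_dev)
            /*
             * The missing case: reschedule the read (without failfast)
             * instead of handing EIO to the upper layer.
             */
            return READ_RETRY;

        /* otherwise fault the rdev and redirect the read elsewhere */
        return READ_FAULT_RDEV;
    }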
Once MD_BROKEN is set on an array, no further writes can be
performed to it.
The user must be informed that the array cannot continue operation.

Signed-off-by: Kenta Akagi <[email protected]>
Since commit 9a56784 ("md: allow last device to be forcibly
removed from RAID1/RAID10."), RAID1/10 arrays can now lose all rdevs.

Before that commit, losing the array's last rdev, or reaching the end of
raid{1,10}_error without an early return, never occurred. However, both
situations can occur in the current implementation.

As a result, when mddev->fail_last_dev is set, a spurious pr_crit
message can be printed.

This patch prevents the "Operation continuing" message from being printed
if the array is not operational.

root@fedora:~# mdadm --create --verbose /dev/md0 --level=1 \
--raid-devices=2  /dev/loop0 /dev/loop1
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: size set to 1046528K
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
root@fedora:~# echo 1 > /sys/block/md0/md/fail_last_dev
root@fedora:~# mdadm --fail /dev/md0 loop0
mdadm: set loop0 faulty in /dev/md0
root@fedora:~# mdadm --fail /dev/md0 loop1
mdadm: set device faulty failed for loop1:  Device or resource busy
root@fedora:~# dmesg | tail -n 4
[ 1314.359674] md/raid1:md0: Disk failure on loop0, disabling device.
               md/raid1:md0: Operation continuing on 1 devices.
[ 1315.506633] md/raid1:md0: Disk failure on loop1, disabling device.
               md/raid1:md0: Operation continuing on 0 devices.
root@fedora:~#

Fixes: 9a56784 ("md: allow last device to be forcibly removed from RAID1/RAID10.")
Signed-off-by: Kenta Akagi <[email protected]>
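
A minimal sketch of the condition this patch adds, assuming the fix simply gates the second pr_crit line on the number of remaining working disks; the helper and its parameters are invented for illustration.

    #include <stdio.h>

    /* stand-in for the pr_crit() in raid{1,10}_error() */
    static void sketch_report_disk_failure(const char *array, const char *disk,
                                           int working_disks)
    {
        printf("md/raid1:%s: Disk failure on %s, disabling device.\n",
               array, disk);
        if (working_disks > 0)
            printf("md/raid1:%s: Operation continuing on %d devices.\n",
                   array, working_disks);
        /* with zero working disks the misleading second line is skipped */
    }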

kpd-daemon bot commented Sep 17, 2025

Upstream branch: c17fb54
series: https://patchwork.kernel.org/project/linux-raid/list/?series=1002295
version: 4
