mdadm: Fix IMSM Raid assembly after disk link failure and reboot #179
Conversation
This patch addresses a scenario observed in production where disk links go down. After a system reboot, depending on which disk becomes available first, the IMSM RAID array may either fully assemble or come up with missing disks.

Below is an example of the production case, simulating disk link failures and a subsequent system reboot. (Note: "echo "1" | sudo tee /sys/class/scsi_device/x:x:x:x/device/delete" is used here to fail/unplug/disconnect disks.)

RAID configuration: IMSM RAID1 with two disks.

- When sda is unplugged first, then sdb, and after reboot sdb is reconnected first followed by sda, the container (/dev/md127) and subarrays (/dev/md125, /dev/md126) correctly assemble and become active.
- However, when sda is reconnected first, then sdb, the subarrays fail to fully reconstruct: sda remains missing from the assembled subarrays, due to stale metadata.

The above behaviors are influenced by udev event handling:

- When a disk disconnects, the rule ACTION=="remove", ENV{ID_PATH}=="?*", RUN+="/usr/sbin/mdadm -If $devnode --path $env{ID_PATH}" is triggered to inform mdadm of the removal.
- When a disk reconnects (i.e., ACTION!="remove"), the rule IMPORT{program}="/usr/sbin/mdadm --incremental --export $devnode --offroot $env{DEVLINKS}" is triggered to incrementally assemble the RAID arrays.

During array assembly, the array may not be fully assembled because of disks with stale metadata. This patch adds a udev-triggered script that detects this failure and brings the missing disks back into the array. It inspects the RAID configuration reported by /usr/sbin/mdadm --detail --scan --export, identifies disks that belong to a container array but are missing from their corresponding member (sub)arrays, and restores them by performing a hot remove-and-re-add cycle.

The patch improves resilience by ensuring consistent array reconstruction regardless of disk detection order. This aligns system behavior with expected RAID redundancy and reduces the risk of unnecessary manual recovery steps after reboots in degraded hardware environments.

Signed-off-by: Richard Li <[email protected]>
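For illustration, here is a minimal, hedged sketch of the remove-and-re-add logic described above; the device and array names are placeholders, and the actual script in this patch may derive them differently from the mdadm --detail --scan --export output and use different mdadm invocations.

#!/bin/bash
# Hedged sketch only: all values below are examples.

CONTAINER=/dev/md127      # IMSM container, as reported by mdadm --detail --scan --export
MISSING_DISK=/dev/sda     # disk listed in the container but absent from md125/md126

# Hot remove-and-re-add cycle: drop the stale member from the container,
# then add it back so the member arrays can be rebuilt onto it.
mdadm "$CONTAINER" --remove "$MISSING_DISK"
mdadm "$CONTAINER" --add "$MISSING_DISK"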
Need IMSM developers' feedback here. @bkucman please take a look.
Not sure why the CI test (upstream tests) isn't triggered.
I think that if it's someone's first PR to the repo, GitHub Actions never start; they need to be confirmed by the maintainer.
This requires more analysis from me; such a script can mess with the flow a bit.
Why can't it be integrated with mdadm and the assemble functionality? I think that mdadm should be able to handle this and start the array. Probably I know the reason. To consider a script, I need some confidence that this cannot be fixed in mdadm directly, so please @bkucman and @richl9 provide me something!
@mtkaczyk Thanks for your feedback, Mariusz. This issue can indeed be fixed on the mdadm side: the problem can be resolved by checking the array's health in mdadm.c. If a failed device is detected, it can be identified and an Incremental_remove operation performed, followed by a call to Incremental to reassemble the array. However, to avoid affecting workflows unnecessarily, we chose not to integrate this directly into mdadm and instead developed a script for this purpose. Please let me know what you think.
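For context, the command-line equivalent of that internal flow would look roughly like this (device name is illustrative; Incremental_remove is the function behind -If and Incremental is behind -I):

# Hedged illustration of the remove/re-assemble cycle described above.
mdadm --incremental --fail /dev/sda   # -If: tell mdadm the stale member is gone
mdadm --incremental /dev/sda          # -I: re-read the disk and incrementally re-add it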
I think we are chickening out of doing it the right way because we are scared of introducing a regression in one of the most critical functionalities, which is of course understandable, but it is not the right way of doing things. From the upstream and product point of view I have to ask you to at least try to integrate it with the code. I understand that it might be more time consuming, but here quality and maintenance cost matter overall.

A separate script significantly increases debug time and solution complexity; something like that is the last thing people expect, and even --verbose logs from mdadm would not be helpful because you are going to execute it from udev. You are basing it on mdadm's response, and in the middle mdadm might be executed elsewhere; that is why I think it is dangerous.

Your description doesn't specify why mdadm behaves this way or where exactly the problem is. It might require looking into the IMSM metadata in particular. If you reach an obstacle that can be pointed out, then I would start considering something like this as a workaround, but it definitely requires a solid argument. Hope it helps, I'm always here in case of questions!
@mtkaczyk Thanks a lot for the detailed feedback. To elaborate on the issue: when sda is both unplugged first and reconnected first, the array fails to fully assemble. The reason is as follows. At the time of unplug, sda's metadata still indicated that both sda and sdb were active members. After reboot, when only sda is present during the initial scan, mdadm reads this metadata and determines that the array should have two active devices. Since only one is available at this point, mdadm does not consider the array to have sufficient members for a safe assembly. Internally, this is reflected by info.container_enough = 0; as a result, the md device is not started and the subarrays remain unassembled.

This behavior is by design: mdadm refuses to assemble an incomplete IMSM container if the number of available members is insufficient per the metadata. Importantly, this is not a metadata corruption issue, as the metadata on sda is still technically valid. The failure arises from the state of the system at the time of scanning, not from a flaw in the data. Therefore, I personally think this case is hard to resolve from the metadata side, as there is no indication that the metadata is "stale" or incorrect; it simply reflects the state of the array before disconnection.

What I propose, as mentioned in the previous comment, is to perform an Incremental_remove and Incremental cycle on failed devices in mdadm. This cycle would be triggered only in edge cases involving IMSM RAID, and only after thorough verification that a device is genuinely missing, so it should not impact normal workflows. Would love to hear your feedback!
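For reference, the state described above can be inspected with standard commands (output omitted; device and array names follow the example in the description):

cat /proc/mdstat            # only the container appears; md125/md126 are not started
mdadm --examine /dev/sda    # IMSM metadata on sda still lists both disks as active members
mdadm --detail /dev/md127   # container view after the partial incremental assembly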
If a drive unplug happened, whatever the cause, it should result in a udev "remove" event and degradation of the array in the metadata. Why is that not happening in this case?
I'm missing something... So the issue happens only if you are connecting the drive that was disconnected first, right? And the issue is that
Yes, that is correct. When we connect the device that was disconnected first (sda), only the container (md127) is created, with sda in it. After the second device (sdb) is plugged back in, the array is started and md125/md126 are created with sdb included, but sda remains missing from them.
I am not exactly sure what you are referring to here. The array is started after sdb is back, just with sda missing from the member arrays.
Please share. Simply do your steps and call these commands at the end; you will collect the system state and metadata. I would like to fully understand what is going on.
Please see the output above. Thanks!
First of all, I need to mention that the array state is
and here:
The question is why the rebuild is not started in this case, and I know why: in the past I was an IMSM developer.
Later, when you add sdb, mdadm decides to start the array as degraded and leaves sda in the container. This is correct. So the right fix is to look into the mdmon implementation and determine whether this case can be handled somehow. Therefore, sadly, I'm tending to reject this PR; it must be fixed on the mdmon side. We never fixed this because we (I think fairly) assumed that it is only a test scenario: in the real world, if you remove a drive once, you should never consider adding it back (and especially not put it in the booting queue again). You can avoid it by erasing the metadata from the drive sda if you would like to simulate adding a new spare.
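For completeness, a hedged example of the metadata-erase approach mentioned above (device and container names are illustrative; only do this to a disk you intend to treat as brand new):

# Wipe the stale IMSM metadata so sda looks like a new disk, then add it
# back to the container as a spare; mdmon should then start a rebuild.
mdadm --zero-superblock /dev/sda    # or: wipefs -a /dev/sda
mdadm /dev/md127 --add /dev/sda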
Hi @mtkaczyk, thanks a lot for your review and feedback.
No matter which option we choose, at the end of the day we want the failed disk to be reused for rebuilding the array. Please let me know what you think :)
In mdadm terms, "re-add" means that we know the disk was part of the array and we don't need to fully reconstruct it, but this is not supported by IMSM. It looks like we hit a bug in code that was meant to keep room for supporting it. Before making the decision, could you please also test how a regular spare behaves?
If I remember correctly, in this case the rebuild won't be started either, so I'm wondering whether any of this fixes that problem too.
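For reference, a hedged sketch of the "regular spare" experiment suggested above (the third disk /dev/sdc is hypothetical; as noted in the next comment, it could not be run for lack of a free slot):

# Add a fresh third disk as a spare at the container level, then fail a
# member of a subarray and watch whether mdmon rebuilds onto the spare.
mdadm /dev/md127 --add /dev/sdc
mdadm /dev/md126 --fail /dev/sda
cat /proc/mdstat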
Hi @mtkaczyk, thanks for the suggestion. Unfortunately, we currently only have two M.2 slots on our server, so a third disk cannot be added. Are there any other experiments you would suggest?
@richl9 We have a test suite that is run on nullblk in GitHub Actions; you can explore it and simulate the scenario with no real hardware. Maybe you can add a test for this scenario, because I don't find any. Maybe I'm totally wrong and the scenario I mentioned is working correctly. Keep this in mind!

/*
 * OK, this device needs recovery. Try to re-add the
 * previous occupant of this slot, if this fails see if
 * we can continue the assimilation of a spare that was
 * partially assimilated, finally try to activate a new
 * spare.
 */
dl = imsm_readd(super, i, a);
if (!dl)

According to the comment, it looks like the code is not working as expected, so I would like to suggest moving forward with the first option. In case other issues are reported, we have a good explanation of why we changed the logic. I hope this makes sense; if not, please let me know. We have Slack if you would like to be added so you can talk with the developers directly. I can add you, just share your e-mail :) Generally, you should receive feedback faster over Slack.
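As a possible starting point for such a test, here is a hedged sketch of creating a small IMSM RAID1 on two scratch nullblk devices (plain mdadm CLI with illustrative device paths, not the repository's test-suite helper syntax):

# Bypass the Intel platform/OROM checks on test-only block devices.
export IMSM_NO_PLATFORM=1
# Create an IMSM container on two scratch devices, then a RAID1 volume inside it.
mdadm --create /dev/md/imsm0 --metadata=imsm --raid-devices=2 /dev/nullb0 /dev/nullb1
mdadm --create /dev/md/vol0 --level=1 --raid-devices=2 /dev/md/imsm0
# The unplug/replug ordering from the PR description can then be simulated and
# the resulting assembly state checked via /proc/mdstat and mdadm --detail.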