
mdadm: Fix IMSM Raid assembly after disk link failure and reboot #179

Status: Open. Wants to merge 1 commit into main.

Conversation

@richl9 commented May 7, 2025

This patch addresses a scenario observed in production where disk links go down. After a system reboot, depending on which disk becomes available first, the IMSM RAID array may either fully assemble or come up with missing disks.

Below is an example of the production case, simulating disk link failures and a subsequent system reboot.

(Note: echo 1 | sudo tee /sys/class/scsi_device/x:x:x:x/device/delete is used here to fail/unplug/disconnect disks.)

RAID configuration: IMSM RAID1 with two disks

  • When sda is unplugged first, then sdb, and after reboot sdb is reconnected first followed by sda, the container (/dev/md127) and subarrays (/dev/md125, /dev/md126) correctly assemble and become active.
  • However, when sda is reconnected first, then sdb, the subarrays fail to fully reconstruct: sda remains missing from the assembled subarrays due to its stale metadata (see the reproduction sketch after this list).
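
A minimal shell sketch of the reproduction steps, assuming sda and sdb are the two IMSM RAID1 members (the SCSI addresses below are examples and have to be adjusted):

```
# Simulate link failure by deleting the SCSI devices: sda first, then sdb.
echo 1 | sudo tee /sys/class/scsi_device/0:0:0:0/device/delete   # sda (example address)
echo 1 | sudo tee /sys/class/scsi_device/1:0:0:0/device/delete   # sdb (example address)

sudo reboot

# After reboot, bring the disks back in the order under test; the failing case
# is the one where sda (unplugged first) becomes available before sdb.
# Then inspect the result:
cat /proc/mdstat
mdadm --detail --scan --export
```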

The behaviors above are driven by udev event handling (the relevant rules are shown as a snippet after this list):

  • When a disk disconnects, the rule ACTION=="remove", ENV{ID_PATH}=="?*", RUN+="/usr/sbin/mdadm -If $devnode --path $env{ID_PATH}" is triggered to inform mdadm of the removal.
  • When a disk reconnects (i.e., ACTION!="remove"), the rule IMPORT{program}="/usr/sbin/mdadm --incremental --export $devnode --offroot $env{DEVLINKS}" is triggered to incrementally assemble the RAID arrays.
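
For reference, the two rules described above as they would appear in a udev rules file (paraphrased from this description; the rules file name and the mdadm path vary by distribution):

```
# Disk removed: tell mdadm to fail/remove it from any array it belongs to.
ACTION=="remove", ENV{ID_PATH}=="?*", RUN+="/usr/sbin/mdadm -If $devnode --path $env{ID_PATH}"

# Disk (re)appears: incrementally assemble whatever arrays it belongs to.
ACTION!="remove", IMPORT{program}="/usr/sbin/mdadm --incremental --export $devnode --offroot $env{DEVLINKS}"
```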

During assembly, the array may not be fully assembled because of disks carrying stale metadata.

This patch adds a udev-triggered script that detects this failure and brings the missing disks back into the array. It inspects the RAID configuration reported by /usr/sbin/mdadm --detail --scan --export, identifies disks that belong to a container array but are missing from their corresponding member (sub)arrays, and restores them by performing a hot remove-and-re-add cycle (sketched below).
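
A rough sketch of what such a udev-triggered script does. This is not the exact script from this patch; the container/member device names and the output parsing are illustrative:

```
#!/bin/bash
# Illustrative sketch: re-add container disks that are missing from the
# IMSM member (sub)arrays via a hot remove-and-re-add cycle.

container=/dev/md127                    # IMSM container (example name)
members="/dev/md125 /dev/md126"         # member (sub)arrays (example names)

# Disks currently sitting in the container.
container_disks=$(mdadm --detail "$container" | grep -o '/dev/sd[a-z]*' | sort -u)

for disk in $container_disks; do
    for member in $members; do
        # If a container disk is absent from a member array, cycle it
        # through the container so mdmon picks it up for rebuild.
        if ! mdadm --detail "$member" | grep -q "$disk"; then
            mdadm --manage "$container" --remove "$disk"
            mdadm --manage "$container" --add "$disk"
            break
        fi
    done
done
```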

The patch improves resilience by ensuring consistent array reconstruction regardless of disk detection order. This aligns system behavior with the expected RAID redundancy and reduces the risk of unnecessary manual recovery steps after reboots in degraded hardware environments.


Signed-off-by: Richard Li <[email protected]>
@mtkaczyk (Member) commented May 8, 2025

We need IMSM developers' feedback here. @bkucman please take a look.

@XiaoNi87 (Collaborator) commented May 9, 2025

Not sure why the CI (upstream tests) isn't triggered.

@bkucman (Collaborator) commented May 9, 2025

> Not sure why the CI (upstream tests) isn't triggered.

I think that if it's someone's first PR to the repo, GitHub Actions never start automatically; the run needs to be approved by a maintainer.

@bkucman (Collaborator) commented May 9, 2025

> We need IMSM developers' feedback here. @bkucman please take a look.

This requires more analysis from me; such a script can mess with the flow a bit. I can't get into it right now, so it will have to wait about a week and a half. Sorry for the inconvenience.

@mtkaczyk (Member) commented:

> During assembly, the array may not be fully assembled because of disks carrying stale metadata.

Why can't this be integrated with mdadm and the assemble functionality? I think mdadm should be able to handle it and start the array. I can probably guess the reason: imsm_thunderdome and how complicated it is.

To consider a script I need some confidence that this cannot be fixed in mdadm directly, so please @bkucman and @richl9 provide me something!

@richl9 (Author) commented May 19, 2025

@mtkaczyk Thanks for your feedback Mariusz. This issue can indeed be fixed on the mdadm side: the problem can be resolved by checking the array's health in mdadm.c. If a failed device is detected, it can be identified and an Incremental_remove operation performed, followed by an Incremental call to reassemble the array (see the sketch below). However, to avoid affecting existing workflows unnecessarily, we chose not to integrate this directly into mdadm and instead developed a script for this purpose. Please let me know what you think.
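
For reference, the manual command-line equivalent of that cycle would be roughly the following (illustrative; /dev/sda stands for whichever device was found missing from the member arrays):

```
# Incremental_remove path: fail/remove the device from the arrays it is in.
mdadm --incremental --fail /dev/sda
# Incremental path: re-run incremental assembly on the device.
mdadm --incremental /dev/sda
```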

@mtkaczyk (Member) commented May 20, 2025

I think that we are chickening out of doing it the right way because we are scared of introducing a regression in one of the most critical functionalities, which is of course understandable but is not the right way of doing things. From an upstream and product point of view I have to ask you to at least try to integrate it with the code. I understand that it might be more time consuming, but quality and maintenance cost matter most here.

A solution like this significantly increases debug time and complexity; something like that is the last thing people expect. Even --verbose logs from mdadm would not be helpful, because you are going to execute it from udev. You are relying on mdadm's response, and in the middle mdadm might be executed elsewhere, which is why I think it is dangerous.

Your description didn't specify why mdadm is doing this, i.e. where exactly the problem is. It might require looking into the IMSM metadata in particular.

mdadm incremental is executed for every drive with exclusive access to the RAID array (incremental processes are not racing). I know cases in which it even removes non-matching drives at runtime, which is why I think this can be integrated with the incremental command.

If you reach an obstacle that can be pointed out, then I would start considering something like this as a workaround, but it definitely requires a solid argument.

Hope it helps; I'm always here in case of questions!

@richl9 (Author) commented May 21, 2025

@mtkaczyk Thanks a lot for the detailed feedback. To elaborate on the issue: when sda is both unplugged first and reconnected first, the array fails to fully assemble. The reason is as follows. At the time of unplug, sda's metadata still indicated that both sda and sdb were active members. After reboot, when only sda is present during the initial scan, mdadm reads this metadata and determines that the array should have two active devices. Since only one is available at this point, mdadm does not consider the array to have sufficient members for a safe assembly. Internally, this is reflected by info.container_enough = 0; as a result, the md is not started and the subarrays remain unassembled.

This behavior is by design: mdadm refuses to assemble an incomplete IMSM container if the number of available members is insufficient per the metadata. Importantly, this is not a metadata corruption issue, as the metadata on sda is still technically valid. The failure arises from the state of the system at the time of scanning, not from a flaw in the data. Therefore, I personally think this case is hard to resolve from the metadata side, as there is no indication that the metadata is "stale" or incorrect; it simply reflects the state of the array before disconnection.

Therefore, what I propose, as mentioned in the previous comment, is to do an Incremental_remove and Incremental cycle on failed devices in mdadm. This cycle would be triggered only in edge cases involving IMSM RAID, and only after thorough verification that a device is genuinely missing, so it shouldn't impact normal workflows. Would love to hear your feedback!

@mtkaczyk (Member) commented:

> [...] At the time of unplug, sda's metadata still indicated that both sda and sdb were active members. [...] Internally, this is reflected by info.container_enough = 0; as a result, the md is not started and the subarrays remain unassembled.

If a drive unplug happened, whatever the cause, it should result in a udev "remove" event and degradation of the array in the metadata. Why is that not happening in this case?

@mtkaczyk (Member) commented:

> • When sda is unplugged first, then sdb, and after reboot sdb is reconnected first followed by sda, the container (/dev/md127) and subarrays (/dev/md125, /dev/md126) correctly assemble and become active.
> • However, when sda is reconnected first, then sdb, the subarrays fail to fully reconstruct: sda remains missing from the assembled subarrays due to stale metadata.

I'm missing something...

So the issue happens only if you are connecting the drive that was disconnected first, right? And the issue is that the rebuild is not started in this case? Am I correct?

@richl9 (Author) commented May 22, 2025

> So the issue happens only if you are connecting the drive that was disconnected first

Yes, that is correct. When we connect the device that was disconnected first (sda), only the container (md127) is created, with sda in it. After the second device (sdb) is plugged back in, the array is started and md125 and md126 are created with sdb included, but sda remains missing from them.

@richl9 (Author) commented May 22, 2025

> And the issue is that rebuild is not started in this case

I am not exactly sure what you are referring to here. The array is started after sdb is back, just with sda missing from the member arrays.

@mtkaczyk (Member) commented:

Please share cat /proc/mdstat and mdadm -E /dev/sda; mdadm -E /dev/sdb from both scenarios.

Simply do your steps and call these commands at the end; that will collect the system state and metadata. I would like to fully understand what is going on.

@richl9 (Author) commented May 23, 2025

[root@xxx richard]# cat /proc/mdstat
Personalities : [raid1] 
md125 : active raid1 sdb[0]
      1048576 blocks super external:/md127/0 [2/1] [_U]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md126 : active raid1 sdb[0]
      1048576 blocks super external:/md127/1 [2/1] [_U]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md127 : inactive sdb[1](S) sda[0](S)
      10402 blocks super external:imsm
       
unused devices: <none>
[root@xxx richard]# mdadm -E /dev/sda
/dev/sda:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.3.00
    Orig Family : xxxxxxxx
         Family : xxxxxxxx
     Generation : 000001e8
  Creation Time : Tue Apr 29 10:55:15 2025
     Attributes : 80000002 (supported)
           UUID : caa06757:77c311a7:36f8e5ce:a72031f3
       Checksum : d8e2a9a7 correct
    MPB Sectors : 2
          Disks : 2
   RAID Devices : 2

  Disk01 Serial : xxxxxxxxxx
          State : active
             Id : 04000000
    Usable Size : 468851726 (223.57 GiB 240.05 GB)
mdadm: imsm_component_size_alignment_check: (Level: 1, chunk_size = 65536, component_size = 2097152), component_size_alignment = 0

[125]:
       Subarray : 0
           UUID : 634150d2:e8d1853f:b9530bfa:66382d6b
     RAID Level : 1
        Members : 2
          Slots : [UU]
    Failed disk : none
      This Slot : 0
    Sector Size : 512
     Array Size : 2097152 (1024.00 MiB 1073.74 MB)
   Per Dev Size : 2099200 (1025.00 MiB 1074.79 MB)
  Sector Offset : 0
    Num Stripes : 8192
     Chunk Size : 64 KiB
       Reserved : 0
  Migrate State : idle
      Map State : normal
    Dirty State : clean
     RWH Policy : Write-intent bitmap
      Volume ID : 1
mdadm: imsm_component_size_alignment_check: (Level: 1, chunk_size = 65536, component_size = 2097152), component_size_alignment = 0

[126]:
       Subarray : 1
           UUID : 20f86ce7:dabd6415:9dbb23a6:026c8a98
     RAID Level : 1
        Members : 2
          Slots : [UU]
    Failed disk : none
      This Slot : 0
    Sector Size : 512
     Array Size : 2097152 (1024.00 MiB 1073.74 MB)
   Per Dev Size : 2099200 (1025.00 MiB 1074.79 MB)
  Sector Offset : 2105344
    Num Stripes : 8192
     Chunk Size : 64 KiB
       Reserved : 0
  Migrate State : idle
      Map State : normal
    Dirty State : clean
     RWH Policy : Write-intent bitmap
      Volume ID : 2

  Disk00 Serial : xxxxxxxxxx
          State : active
             Id : 06000000
    Usable Size : 468851726 (223.57 GiB 240.05 GB)
[root@xxx richard]# mdadm -E /dev/sdb
/dev/sdb:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.3.00
    Orig Family : xxxxxxxx
         Family : xxxxxxxx
     Generation : 000001f0
  Creation Time : Tue Apr 29 10:55:15 2025
     Attributes : 80000002 (supported)
           UUID : caa06757:77c311a7:36f8e5ce:a72031f3
       Checksum : 537dc4b6 correct
    MPB Sectors : 2
          Disks : 2
   RAID Devices : 2

  Disk00 Serial : xxxxxxxxxx
          State : active
             Id : 06000000
    Usable Size : 468851726 (223.57 GiB 240.05 GB)
mdadm: imsm_component_size_alignment_check: (Level: 1, chunk_size = 65536, component_size = 2097152), component_size_alignment = 0

[125]:
       Subarray : 0
           UUID : 634150d2:e8d1853f:b9530bfa:66382d6b
     RAID Level : 1
        Members : 2
          Slots : [_U]
    Failed disk : 0
      This Slot : 1
    Sector Size : 512
     Array Size : 2097152 (1024.00 MiB 1073.74 MB)
   Per Dev Size : 2099200 (1025.00 MiB 1074.79 MB)
  Sector Offset : 0
    Num Stripes : 8192
     Chunk Size : 64 KiB
       Reserved : 0
  Migrate State : idle
      Map State : degraded
    Dirty State : clean
     RWH Policy : Write-intent bitmap
      Volume ID : 1
mdadm: imsm_component_size_alignment_check: (Level: 1, chunk_size = 65536, component_size = 2097152), component_size_alignment = 0

[126]:
       Subarray : 1
           UUID : 20f86ce7:dabd6415:9dbb23a6:026c8a98
     RAID Level : 1
        Members : 2
          Slots : [_U]
    Failed disk : 0
      This Slot : 1
    Sector Size : 512
     Array Size : 2097152 (1024.00 MiB 1073.74 MB)
   Per Dev Size : 2099200 (1025.00 MiB 1074.79 MB)
  Sector Offset : 2105344
    Num Stripes : 8192
     Chunk Size : 64 KiB
       Reserved : 0
  Migrate State : idle
      Map State : degraded
    Dirty State : clean
     RWH Policy : Write-intent bitmap
      Volume ID : 2

  Disk01 Serial : xxxxxxxxxxx:x:x
          State : active failed
             Id : ffffffff
    Usable Size : 468851726 (223.57 GiB 240.05 GB)

Please see the output above. Thanks!

@mtkaczyk (Member) commented May 26, 2025

> [root@xxx richard]# cat /proc/mdstat
> Personalities : [raid1]
> md125 : active raid1 sdb[0]
>       1048576 blocks super external:/md127/0 [2/1] [_U]
>       bitmap: 0/1 pages [0KB], 65536KB chunk
>
> md126 : active raid1 sdb[0]
>       1048576 blocks super external:/md127/1 [2/1] [_U]
>       bitmap: 0/1 pages [0KB], 65536KB chunk

First of all, I need to mention that the array state is active, which means that after the removal mdadm correctly degraded the RAID array. Most importantly, you still have access to your data. You can see that sda is recorded as offline in the metadata on sdb, here:

> Slots : [_U]

and here:

>   Disk01 Serial : xxxxxxxxxxx:x:x
>           State : active failed
>              Id : ffffffff
>     Usable Size : 468851726 (223.57 GiB 240.05 GB)

The question is why the rebuild is not started in this case, and I know why; in the past I was an IMSM developer.
I know that mdmon (not mdadm) has a bug in this area.
If you add sda first, it lands in the container md127:

md127 : inactive sda[0](S)
      10402 blocks super external:imsm

Later, when you add sdb, mdadm decides to start the array as degraded and leaves sda in the container. This is correct.
In such a case, mdmon should determine that there is an unused spare (sda) in the container and should start a rebuild onto it. All I know is that if the spare is already in the container when mdmon starts, it will never be used for a rebuild. Naturally, the workaround for this issue is the remove/add cycle your script is doing (sketched below).
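
For clarity, that remove/add workaround boils down to something like this (using the device names from this thread):

```
# Remove the idle spare from the container, then add it back so that the
# running mdmon notices a new spare and starts the rebuild.
mdadm --manage /dev/md127 --remove /dev/sda
mdadm --manage /dev/md127 --add /dev/sda
```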

So the right fix is to look into the mdmon implementation and determine whether this case can be handled somehow. Therefore, sadly, I'm tending to reject this PR; it must be fixed on the mdmon side.

We never fixed this because we (I think fairly) assumed that it is only a test scenario: in the real world, if you remove a drive once, you should never consider adding it back (and especially not put it back into the boot queue). You can avoid it by erasing the metadata from sda if you would like to simulate adding a new spare (see the example below).
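
One way to do that metadata erase, assuming /dev/sda really is the disk to be wiped (this destroys the IMSM metadata on it):

```
# Wipe the stale IMSM superblock so the disk looks like a brand-new spare,
# then add it to the container.
mdadm --zero-superblock /dev/sda
mdadm --manage /dev/md127 --add /dev/sda
```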

@richl9 (Author) commented Jun 3, 2025

Hi @mtkaczyk, thanks a lot for your review and feedback.
After reviewing the code more, I have two proposals for the fix:

  1. In imsm_activate_spare, when calling imsm_readd, if a failed disk is found, instead of setting dl = NULL, we clear its metadata and return it, allowing it to be re-added.
  2. Alternatively, we remove the failed disk from the array so that it can be detected and picked up later by imsm_add_spare.

No matter which option we choose, at the end of the day we want the failed disk to be reused for rebuilding the array. Please let me know what you think :)

@mtkaczyk (Member) commented Jun 4, 2025

In mdadm terms, "re-add" means that we know the disk was part of the array and we don't need to fully reconstruct it, but this is not supported by IMSM. It looks like we hit a bug in code that was written to keep room for that support.

Before making the decision, could you please also test how a regular spare behaves? I.e.:

# mdadm -CR imsm -e imsm -n 3 /dev/sd[abc]
# mdadm -CR vol -l1 -n2 /dev/sd[ab]

# mdadm -If /dev/sda

(see if rebuild started)

If I remember correctly, in this case the rebuild won't be started either, so I'm wondering whether any of these options fixes that problem too (a quick way to check is shown below).
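
To check whether the rebuild started in that experiment, something like this should be enough (assuming the volume is exposed as /dev/md/vol):

```
# A rebuild in progress shows up as a "recovery" line here ...
cat /proc/mdstat
# ... and as a "Rebuild Status" line in the member array details.
mdadm --detail /dev/md/vol
```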

@richl9 (Author) commented Jun 4, 2025

Hi @mtkaczyk, thanks for the suggestion. Unfortunately, we currently only have two M.2 slots on our server, so a third disk cannot be added. Are there any other experiments you would suggest?

@mtkaczyk (Member) commented Jun 5, 2025

@richl9 we have a test suite that is run on nullblk devices in GitHub Actions; you can explore it and simulate the scenario with no real hardware. Maybe you can add a test for this scenario, because I can't find one.

Maybe I'm totally wrong and the scenario I mentioned is working correctly. Keep this in mind!
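
If it helps, the suite can also be run locally from the mdadm source tree, roughly like this (options and requirements may differ between versions; the test name below is a placeholder):

```
# Build mdadm, then run the test suite (as root), or a selected test.
make
sudo ./test
sudo ./test --tests=<test-name>
```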

		/*
		 * OK, this device needs recovery.  Try to re-add the
		 * previous occupant of this slot, if this fails see if
		 * we can continue the assimilation of a spare that was
		 * partially assimilated, finally try to activate a new
		 * spare.
		 */
		dl = imsm_readd(super, i, a);
		if (!dl)

According to the comment, it looks like the code is not working as expected, so I would suggest going with the first option. If other issues are reported as a result, we have a good explanation of why we changed the logic. I hope this makes sense; if not, please let me know.

We have Slack if you would like to be added, so you can talk with the developers directly; I can add you, just share your e-mail :) Generally, you should receive feedback faster over Slack.
