
mdadm: Fix IMSM Raid assembly after disk link failure and reboot #179

Status: Open. Wants to merge 1 commit into main.

Conversation

@richl9 commented May 7, 2025

This patch addresses a scenario observed in production where disk links go down. After a system reboot, depending on which disk becomes available first, the IMSM RAID array may either fully assemble or come up with missing disks.

Below is an example of the production case, simulating disk link failures and a subsequent system reboot.

(Note: echo 1 | sudo tee /sys/class/scsi_device/x:x:x:x/device/delete is used here to fail/unplug/disconnect disks.)

RAID configuration: IMSM RAID1 with two disks

  • When sda is unplugged first, then sdb, and after reboot sdb is reconnected first followed by sda, the container (/dev/md127) and subarrays (/dev/md125, /dev/md126) correctly assemble and become active.
  • However, when sda is reconnected first, then sdb, the subarrays fail to fully reconstruct: sda remains missing from the assembled subarrays due to its stale metadata (see the reproduction sketch after this list).
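
A minimal shell sketch of the reproduction steps, assuming sda and sdb are the two IMSM RAID1 members (the SCSI addresses below are examples and have to be adjusted):

```
# Simulate link failure by deleting the SCSI devices: sda first, then sdb.
echo 1 | sudo tee /sys/class/scsi_device/0:0:0:0/device/delete   # sda (example address)
echo 1 | sudo tee /sys/class/scsi_device/1:0:0:0/device/delete   # sdb (example address)

sudo reboot

# After reboot, bring the disks back in the order under test; the failing case
# is the one where sda (unplugged first) becomes available before sdb.
# Then inspect the result:
cat /proc/mdstat
mdadm --detail --scan --export
```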

The behaviors above are driven by udev event handling (the relevant rules are shown as a snippet after this list):

  • When a disk disconnects, the rule ACTION=="remove", ENV{ID_PATH}=="?*", RUN+="/usr/sbin/mdadm -If $devnode --path $env{ID_PATH}" is triggered to inform mdadm of the removal.
  • When a disk reconnects (i.e., ACTION!="remove"), the rule IMPORT{program}="/usr/sbin/mdadm --incremental --export $devnode --offroot $env{DEVLINKS}" is triggered to incrementally assemble the RAID arrays.
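
For reference, the two rules described above as they would appear in a udev rules file (paraphrased from this description; the rules file name and the mdadm path vary by distribution):

```
# Disk removed: tell mdadm to fail/remove it from any array it belongs to.
ACTION=="remove", ENV{ID_PATH}=="?*", RUN+="/usr/sbin/mdadm -If $devnode --path $env{ID_PATH}"

# Disk (re)appears: incrementally assemble whatever arrays it belongs to.
ACTION!="remove", IMPORT{program}="/usr/sbin/mdadm --incremental --export $devnode --offroot $env{DEVLINKS}"
```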

During assembly, the array may not be fully assembled because of disks carrying stale metadata.

This patch adds a udev-triggered script that detects this failure and brings the missing disks back into the array. It inspects the RAID configuration reported by /usr/sbin/mdadm --detail --scan --export, identifies disks that belong to a container array but are missing from their corresponding member (sub)arrays, and restores them by performing a hot remove-and-re-add cycle (sketched below).
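
A rough sketch of what such a udev-triggered script does. This is not the exact script from this patch; the container/member device names and the output parsing are illustrative:

```
#!/bin/bash
# Illustrative sketch: re-add container disks that are missing from the
# IMSM member (sub)arrays via a hot remove-and-re-add cycle.

container=/dev/md127                    # IMSM container (example name)
members="/dev/md125 /dev/md126"         # member (sub)arrays (example names)

# Disks currently sitting in the container.
container_disks=$(mdadm --detail "$container" | grep -o '/dev/sd[a-z]*' | sort -u)

for disk in $container_disks; do
    for member in $members; do
        # If a container disk is absent from a member array, cycle it
        # through the container so mdmon picks it up for rebuild.
        if ! mdadm --detail "$member" | grep -q "$disk"; then
            mdadm --manage "$container" --remove "$disk"
            mdadm --manage "$container" --add "$disk"
            break
        fi
    done
done
```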

The patch improves resilience by ensuring consistent array reconstruction regardless of disk detection order. This aligns system behavior with the expected RAID redundancy and reduces the risk of unnecessary manual recovery steps after reboots in degraded hardware environments.


Signed-off-by: Richard Li <[email protected]>
@mtkaczyk (Member) commented May 8, 2025

We need IMSM developers' feedback here. @bkucman please take a look.

@XiaoNi87 (Collaborator) commented May 9, 2025

Not sure why the CI (upstream tests) isn't triggered.

@bkucman (Collaborator) commented May 9, 2025

> Not sure why the CI (upstream tests) isn't triggered.

I think that if it's someone's first PR to the repo, GitHub Actions never start automatically; the run needs to be approved by a maintainer.

@bkucman (Collaborator) commented May 9, 2025

> We need IMSM developers' feedback here. @bkucman please take a look.

This requires more analysis from me; such a script can mess with the flow a bit. I can't get into it right now, so it will have to wait about a week and a half. Sorry for the inconvenience.

@mtkaczyk (Member) commented:

> During assembly, the array may not be fully assembled because of disks carrying stale metadata.

Why can't this be integrated with mdadm and the assemble functionality? I think mdadm should be able to handle it and start the array. I can probably guess the reason: imsm_thunderdome and how complicated it is.

To consider a script I need some confidence that this cannot be fixed in mdadm directly, so please @bkucman and @richl9 provide me something!

@richl9 (Author) commented May 19, 2025

@mtkaczyk Thanks for your feedback Mariusz. This issue can indeed be fixed on the mdadm side: the problem can be resolved by checking the array's health in mdadm.c. If a failed device is detected, it can be identified and an Incremental_remove operation performed, followed by an Incremental call to reassemble the array (see the sketch below). However, to avoid affecting existing workflows unnecessarily, we chose not to integrate this directly into mdadm and instead developed a script for this purpose. Please let me know what you think.
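
For reference, the manual command-line equivalent of that cycle would be roughly the following (illustrative; /dev/sda stands for whichever device was found missing from the member arrays):

```
# Incremental_remove path: fail/remove the device from the arrays it is in.
mdadm --incremental --fail /dev/sda
# Incremental path: re-run incremental assembly on the device.
mdadm --incremental /dev/sda
```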

@mtkaczyk (Member) commented May 20, 2025

I think that we are chickening out of doing it the right way because we are scared of introducing a regression in one of the most critical functionalities, which is of course understandable but is not the right way of doing things. From an upstream and product point of view I have to ask you to at least try to integrate it with the code. I understand that it might be more time consuming, but quality and maintenance cost matter most here.

A solution like this significantly increases debug time and complexity; something like that is the last thing people expect. Even --verbose logs from mdadm would not be helpful, because you are going to execute it from udev. You are relying on mdadm's response, and in the middle mdadm might be executed elsewhere, which is why I think it is dangerous.

Your description didn't specify why mdadm is doing this, i.e. where exactly the problem is. It might require looking into the IMSM metadata in particular.

mdadm incremental is executed for every drive with exclusive access to the RAID array (incremental processes are not racing). I know cases in which it even removes non-matching drives at runtime, which is why I think this can be integrated with the incremental command.

If you reach an obstacle that can be pointed out, then I would start considering something like this as a workaround, but it definitely requires a solid argument.

Hope it helps; I'm always here in case of questions!

@richl9 (Author) commented May 21, 2025

@mtkaczyk Thanks a lot for the detailed feedback. To elaborate on the issue: when sda is both unplugged first and reconnected first, the array fails to fully assemble. The reason is as follows. At the time of unplug, sda's metadata still indicated that both sda and sdb were active members. After reboot, when only sda is present during the initial scan, mdadm reads this metadata and determines that the array should have two active devices. Since only one is available at this point, mdadm does not consider the array to have sufficient members for a safe assembly. Internally, this is reflected by info.container_enough = 0; as a result, the md is not started and the subarrays remain unassembled.

This behavior is by design: mdadm refuses to assemble an incomplete IMSM container if the number of available members is insufficient per the metadata. Importantly, this is not a metadata corruption issue, as the metadata on sda is still technically valid. The failure arises from the state of the system at the time of scanning, not from a flaw in the data. Therefore, I personally think this case is hard to resolve from the metadata side, as there is no indication that the metadata is "stale" or incorrect; it simply reflects the state of the array before disconnection.

Therefore, what I propose, as mentioned in the previous comment, is to do an Incremental_remove and Incremental cycle on failed devices in mdadm. This cycle would be triggered only in edge cases involving IMSM RAID, and only after thorough verification that a device is genuinely missing, so it shouldn't impact normal workflows. Would love to hear your feedback!

@mtkaczyk (Member) commented:

> [...] At the time of unplug, sda's metadata still indicated that both sda and sdb were active members. [...] Internally, this is reflected by info.container_enough = 0; as a result, the md is not started and the subarrays remain unassembled.

If a drive unplug happened, whatever the cause, it should result in a udev "remove" event and degradation of the array in the metadata. Why is that not happening in this case?

@mtkaczyk (Member) commented:

> • When sda is unplugged first, then sdb, and after reboot sdb is reconnected first followed by sda, the container (/dev/md127) and subarrays (/dev/md125, /dev/md126) correctly assemble and become active.
> • However, when sda is reconnected first, then sdb, the subarrays fail to fully reconstruct: sda remains missing from the assembled subarrays due to stale metadata.

I'm missing something...

So the issue happens only if you are connecting the drive that was disconnected first, right? And the issue is that the rebuild is not started in this case? Am I correct?

@richl9 (Author) commented May 22, 2025

> So the issue happens only if you are connecting the drive that was disconnected first

Yes, that is correct. When we connect the device that was disconnected first (sda), only the container (md127) is created, with sda in it. After the second device (sdb) is plugged back in, the array is started and md125 and md126 are created with sdb included, but sda remains missing from them.

@richl9 (Author) commented May 22, 2025

> And the issue is that rebuild is not started in this case

I am not exactly sure what you are referring to here. The array is started after sdb is back, just with sda missing from the member arrays.

@mtkaczyk (Member) commented:

Please share cat /proc/mdstat and mdadm -E /dev/sda; mdadm -E /dev/sdb from both scenarios.

Simply do your steps and call these commands at the end; that will collect the system state and metadata. I would like to fully understand what is going on.

@richl9 (Author) commented May 23, 2025

[root@xxx richard]# cat /proc/mdstat
Personalities : [raid1] 
md125 : active raid1 sdb[0]
      1048576 blocks super external:/md127/0 [2/1] [_U]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md126 : active raid1 sdb[0]
      1048576 blocks super external:/md127/1 [2/1] [_U]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md127 : inactive sdb[1](S) sda[0](S)
      10402 blocks super external:imsm
       
unused devices: <none>
[root@xxx richard]# mdadm -E /dev/sda
/dev/sda:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.3.00
    Orig Family : xxxxxxxx
         Family : xxxxxxxx
     Generation : 000001e8
  Creation Time : Tue Apr 29 10:55:15 2025
     Attributes : 80000002 (supported)
           UUID : caa06757:77c311a7:36f8e5ce:a72031f3
       Checksum : d8e2a9a7 correct
    MPB Sectors : 2
          Disks : 2
   RAID Devices : 2

  Disk01 Serial : xxxxxxxxxx
          State : active
             Id : 04000000
    Usable Size : 468851726 (223.57 GiB 240.05 GB)
mdadm: imsm_component_size_alignment_check: (Level: 1, chunk_size = 65536, component_size = 2097152), component_size_alignment = 0

[125]:
       Subarray : 0
           UUID : 634150d2:e8d1853f:b9530bfa:66382d6b
     RAID Level : 1
        Members : 2
          Slots : [UU]
    Failed disk : none
      This Slot : 0
    Sector Size : 512
     Array Size : 2097152 (1024.00 MiB 1073.74 MB)
   Per Dev Size : 2099200 (1025.00 MiB 1074.79 MB)
  Sector Offset : 0
    Num Stripes : 8192
     Chunk Size : 64 KiB
       Reserved : 0
  Migrate State : idle
      Map State : normal
    Dirty State : clean
     RWH Policy : Write-intent bitmap
      Volume ID : 1
mdadm: imsm_component_size_alignment_check: (Level: 1, chunk_size = 65536, component_size = 2097152), component_size_alignment = 0

[126]:
       Subarray : 1
           UUID : 20f86ce7:dabd6415:9dbb23a6:026c8a98
     RAID Level : 1
        Members : 2
          Slots : [UU]
    Failed disk : none
      This Slot : 0
    Sector Size : 512
     Array Size : 2097152 (1024.00 MiB 1073.74 MB)
   Per Dev Size : 2099200 (1025.00 MiB 1074.79 MB)
  Sector Offset : 2105344
    Num Stripes : 8192
     Chunk Size : 64 KiB
       Reserved : 0
  Migrate State : idle
      Map State : normal
    Dirty State : clean
     RWH Policy : Write-intent bitmap
      Volume ID : 2

  Disk00 Serial : xxxxxxxxxx
          State : active
             Id : 06000000
    Usable Size : 468851726 (223.57 GiB 240.05 GB)
[root@xxx richard]# mdadm -E /dev/sdb
/dev/sdb:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.3.00
    Orig Family : xxxxxxxx
         Family : xxxxxxxx
     Generation : 000001f0
  Creation Time : Tue Apr 29 10:55:15 2025
     Attributes : 80000002 (supported)
           UUID : caa06757:77c311a7:36f8e5ce:a72031f3
       Checksum : 537dc4b6 correct
    MPB Sectors : 2
          Disks : 2
   RAID Devices : 2

  Disk00 Serial : xxxxxxxxxx
          State : active
             Id : 06000000
    Usable Size : 468851726 (223.57 GiB 240.05 GB)
mdadm: imsm_component_size_alignment_check: (Level: 1, chunk_size = 65536, component_size = 2097152), component_size_alignment = 0

[125]:
       Subarray : 0
           UUID : 634150d2:e8d1853f:b9530bfa:66382d6b
     RAID Level : 1
        Members : 2
          Slots : [_U]
    Failed disk : 0
      This Slot : 1
    Sector Size : 512
     Array Size : 2097152 (1024.00 MiB 1073.74 MB)
   Per Dev Size : 2099200 (1025.00 MiB 1074.79 MB)
  Sector Offset : 0
    Num Stripes : 8192
     Chunk Size : 64 KiB
       Reserved : 0
  Migrate State : idle
      Map State : degraded
    Dirty State : clean
     RWH Policy : Write-intent bitmap
      Volume ID : 1
mdadm: imsm_component_size_alignment_check: (Level: 1, chunk_size = 65536, component_size = 2097152), component_size_alignment = 0

[126]:
       Subarray : 1
           UUID : 20f86ce7:dabd6415:9dbb23a6:026c8a98
     RAID Level : 1
        Members : 2
          Slots : [_U]
    Failed disk : 0
      This Slot : 1
    Sector Size : 512
     Array Size : 2097152 (1024.00 MiB 1073.74 MB)
   Per Dev Size : 2099200 (1025.00 MiB 1074.79 MB)
  Sector Offset : 2105344
    Num Stripes : 8192
     Chunk Size : 64 KiB
       Reserved : 0
  Migrate State : idle
      Map State : degraded
    Dirty State : clean
     RWH Policy : Write-intent bitmap
      Volume ID : 2

  Disk01 Serial : xxxxxxxxxxx:x:x
          State : active failed
             Id : ffffffff
    Usable Size : 468851726 (223.57 GiB 240.05 GB)

Please see the output above. Thanks!

@mtkaczyk (Member) commented May 26, 2025

> [root@xxx richard]# cat /proc/mdstat
> Personalities : [raid1]
> md125 : active raid1 sdb[0]
>       1048576 blocks super external:/md127/0 [2/1] [_U]
>       bitmap: 0/1 pages [0KB], 65536KB chunk
>
> md126 : active raid1 sdb[0]
>       1048576 blocks super external:/md127/1 [2/1] [_U]
>       bitmap: 0/1 pages [0KB], 65536KB chunk

First of all, I need to mention that the array state is active, which means that after the removal mdadm correctly degraded the RAID array. Most importantly, you still have access to your data. You can see that sda is recorded as offline in the metadata on sdb, here:

> Slots : [_U]

and here:

>   Disk01 Serial : xxxxxxxxxxx:x:x
>           State : active failed
>              Id : ffffffff
>     Usable Size : 468851726 (223.57 GiB 240.05 GB)

The question is why the rebuild is not started in this case, and I know why; in the past I was an IMSM developer.
I know that mdmon (not mdadm) has a bug in this area.
If you add sda first, it lands in the container md127:

md127 : inactive sda[0](S)
      10402 blocks super external:imsm

Later, when you add sdb, mdadm decides to start the array as degraded and leaves sda in the container. This is correct.
In such a case, mdmon should determine that there is an unused spare (sda) in the container and should start a rebuild onto it. All I know is that if the spare is already in the container when mdmon starts, it will never be used for a rebuild. Naturally, the workaround for this issue is the remove/add cycle your script is doing (sketched below).
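
For clarity, that remove/add workaround boils down to something like this (using the device names from this thread):

```
# Remove the idle spare from the container, then add it back so that the
# running mdmon notices a new spare and starts the rebuild.
mdadm --manage /dev/md127 --remove /dev/sda
mdadm --manage /dev/md127 --add /dev/sda
```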

So the right fix is to look into the mdmon implementation and determine whether this case can be handled somehow. Therefore, sadly, I'm tending to reject this PR; it must be fixed on the mdmon side.

We never fixed this because we (I think fairly) assumed that it is only a test scenario: in the real world, if you remove a drive once, you should never consider adding it back (and especially not put it back into the boot queue). You can avoid it by erasing the metadata from sda if you would like to simulate adding a new spare (see the example below).
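
One way to do that metadata erase, assuming /dev/sda really is the disk to be wiped (this destroys the IMSM metadata on it):

```
# Wipe the stale IMSM superblock so the disk looks like a brand-new spare,
# then add it to the container.
mdadm --zero-superblock /dev/sda
mdadm --manage /dev/md127 --add /dev/sda
```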

@richl9 (Author) commented Jun 3, 2025

Hi @mtkaczyk, thanks a lot for your review and feedback.
After reviewing the code more, I have two proposals for the fix:

  1. In imsm_activate_spare, when calling imsm_readd, if a failed disk is found, instead of setting dl = NULL, we clear its metadata and return it, allowing it to be re-added.
  2. Alternatively, we remove the failed disk from the array so that it can be detected and picked up later by imsm_add_spare.

No matter which option we choose, at the end of the day we want the failed disk to be reused for rebuilding the array. Please let me know what you think :)

@mtkaczyk (Member) commented Jun 4, 2025

In mdadm terms, "re-add" means that we know the disk was part of the array and we don't need to fully reconstruct it, but this is not supported by IMSM. It looks like we hit a bug in code that was written to keep room for that support.

Before making the decision, could you please also test how a regular spare behaves? I.e.:

# mdadm -CR imsm -e imsm -n 3 /dev/sd[abc]
# mdadm -CR vol -l1 -n2 /dev/sd[ab]

# mdadm -If /dev/sda

(see if rebuild started)

If I remember correctly, in this case the rebuild won't be started either, so I'm wondering whether any of these options fixes that problem too (a quick way to check is shown below).
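
To check whether the rebuild started in that experiment, something like this should be enough (assuming the volume is exposed as /dev/md/vol):

```
# A rebuild in progress shows up as a "recovery" line here ...
cat /proc/mdstat
# ... and as a "Rebuild Status" line in the member array details.
mdadm --detail /dev/md/vol
```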

@richl9 (Author) commented Jun 4, 2025

Hi @mtkaczyk, thanks for the suggestion. Unfortunately, we currently only have two M.2 slots on our server, so a third disk cannot be added. Are there any other experiments you would suggest?

@mtkaczyk (Member) commented Jun 5, 2025

@richl9 we have a test suite that is run on nullblk devices in GitHub Actions; you can explore it and simulate the scenario with no real hardware. Maybe you can add a test for this scenario, because I can't find one.

Maybe I'm totally wrong and the scenario I mentioned is working correctly. Keep this in mind!
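
If it helps, the suite can also be run locally from the mdadm source tree, roughly like this (options and requirements may differ between versions; the test name below is a placeholder):

```
# Build mdadm, then run the test suite (as root), or a selected test.
make
sudo ./test
sudo ./test --tests=<test-name>
```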

		/*
		 * OK, this device needs recovery.  Try to re-add the
		 * previous occupant of this slot, if this fails see if
		 * we can continue the assimilation of a spare that was
		 * partially assimilated, finally try to activate a new
		 * spare.
		 */
		dl = imsm_readd(super, i, a);
		if (!dl)

According to the comment, it looks like the code is not working as expected, so I would suggest going with the first option. If other issues are reported as a result, we have a good explanation of why we changed the logic. I hope this makes sense; if not, please let me know.

We have Slack if you would like to be added, so you can talk with the developers directly; I can add you, just share your e-mail :) Generally, you should receive feedback faster over Slack.
