Skip to content

Multihost Improvements (MMP)#18177

Open
behlendorf wants to merge 7 commits intoopenzfs:masterfrom
behlendorf:mmp-uberblock-claim
Open

Multihost Improvements (MMP)#18177
behlendorf wants to merge 7 commits intoopenzfs:masterfrom
behlendorf:mmp-uberblock-claim

Conversation

@behlendorf
Copy link
Contributor

@behlendorf behlendorf commented Feb 4, 2026

Motivation and Context

The multihost pool property in OpenZFS is a safety feature used when a pool might be visible from more than one system, typically in shared storage setup. When enabled it adds an activity check to the import path which monitors the pool for activity. It will then refuse to import the pool if it looks like another node still has it imported.

This check has worked reliably for years for the intended use case of preventing an import when the pool is active on another system. However, if a pool is simultaneously imported on two different nodes there exists a small window where both systems may determine the pool is inactive and proceed with the import. There is a long standing mitigation which adds a small random delay to the import to make this scenario even less likely, regardless it is still technically possible. This PR aims to improve the multihost feature and close that narrow window.

Description

The above race is possible because the multihost feature was designed to perform a read-only activity check. In practice, this protects against the vast majority of cases and it has served us well. This PR proposes to prevent the simultaneous import case, where both systems determine the pool to be idle (because it is), by adding a second read/write phase to claim an inactive pool. This phase is done before the import is allowed to proceed and start syncing TXGs. The extra safety check operates as follows:

  1. For pools with multihost=on which have already been determined to be idle perform a second "claim" check. The idea is to make a targeted safe modification can be detected by another system trying to import the pool. To do this write out an MMP uberblock to all the labels which is identical to the best uberblock but with a random sequence id.
  2. Verify a minimum number of good writes to ensures that any pool with enough visible vdevs to be imported on a remote system will also see at least one of the updated MMP uberblocks.
  3. Wait for the MMP interval. A racing import performing a claim on this pool would overwrite the same MMP uberblock slot with a similarly modified uberblock but with a different random sequence id.
  4. Call vdev_uberblock_load() to determine the best uberblock to use, the txg and sequence number in this uberblock must match what we just wrote or we fail the claim and abort the import.
  5. Repeat steps 1-4 for 10 iterations with random sequence numbers.

This change also includes some refactoring and minor bug fixes for corner cases discovered during development. These are discussed in the commit comments, but to briefly summarize the highlights.

  • For a cleanly exported pools there exists a small window where both systems may determine it's safe to import the pool and skip the activity check. Only allow the check to be skipped when the last imported hostid matches the systems hostid and the pool was cleanly exported.

  • Never try loading earlier txgs during import when the import fails with EREMOTEIO or EINTER. These errors don't indicate the txg is damaged but instead that its either in use on a remote host or the import was interactively cancelled. No rewind is also performed for EBADF which can result from a stale trusted config when doing a verbatim import.

  • Added ZFS_LOAD_INFO_DEBUG environment variable which can be set to log to dump to stdout the spa_load_info nvlist returned during import. This is used by the updated mmp test cases to determine if an activity check was run and its result.

  • Refactor the code for consistent logging of the multihost activity check using spa_load_note() / spa_load_fail().

  • Add console messages indicating when the activity check was triggered and the result. This provides a helpful timeline of events in the logs for any postmortem analysis.

  • Added MMP_*_MASK and MMP_SEQ_CLEAR() macros to allow easier modification of the sequence number in an uberblock.

  • Updated ZTS MMP test cases accordingly, including updated tools to make it easier to write reliably tests for the CI.

How Has This Been Tested?

Performed 9,801 simultaneous imports from two systems attached to a single JBOD with shared disks. This is the absolute worst case scenario. With this PR applied there were zero instances where the pool was accidentally imported on both systems at the same time. It always detected the issue and either the import would fail on both systems or one, depending on the exact timing. Looking more carefully at the exact scenarios:

  • 7030 - Detected during tryimport (existing check)
  • 1530 - Detected during import verify (existing check)
  • 1241 - Detected during claim (the new check)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Copilot AI review requested due to automatic review settings February 4, 2026 01:52
@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Feb 4, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request implements significant improvements to the OpenZFS multihost pool (MMP) feature to prevent simultaneous pool imports in shared storage configurations. The changes add a second "claim" phase to the activity check that writes unique MMP uberblocks and verifies them after a delay, effectively closing the narrow window where two systems could simultaneously determine a pool is inactive and both proceed with import.

Changes:

  • Added a read/write claim phase during import that writes MMP uberblocks with random sequence IDs and verifies them after an MMP interval, repeated 10 times
  • Refactored activity check logic to distinguish between tryimport, import/open, and recover load states with appropriate verification methods
  • Enhanced MMP test infrastructure by migrating from ztest to zhack for simulating active pools and implementing new helper functions to verify activity checks ran
  • Added comprehensive logging and diagnostics including new load_info nvlist entries for tryimport/import/claim durations and results

Reviewed changes

Copilot reviewed 28 out of 29 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
module/zfs/spa.c Core implementation of new multi-phase activity check logic including tryimport verification, claim phase, and enhanced logging
module/zfs/mmp.c New mmp_claim_uberblock() function to write and verify claim uberblocks across all vdev leaves
module/zfs/spa_misc.c Added spa_load_name() helper to properly log pool names during tryimport
module/zfs/vdev_label.c Made vdev_uberblock_compare() externally visible and improved logging
tests/zfs-tests/tests/functional/mmp/*.ksh Updated test scripts to use new import verification functions and migrate from ztest to zhack
tests/zfs-tests/tests/functional/mmp/mmp.kshlib Refactored activity check helpers to use ZFS_LOAD_INFO_DEBUG environment variable
tests/zfs-tests/tests/functional/mmp/mmp.cfg Updated configuration including reducing MMP_INTERVAL_DEFAULT from 1000 to 500ms
cmd/zhack.c Added new "action idle" subcommand to simulate active pool for testing
lib/libzfs/libzfs_pool.c Added ZFS_LOAD_INFO_DEBUG support for debugging import activity checks
include/sys/*.h New function declarations, constants, and data structures to support claim phase
tests/test-runner/bin/zts-report.py.in Removed MMP tests from known failures list

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings February 4, 2026 21:57
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 31 out of 32 changed files in this pull request and generated 13 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings February 5, 2026 02:08
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 34 out of 35 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings February 5, 2026 02:50

This comment was marked as resolved.

Copilot AI review requested due to automatic review settings February 5, 2026 17:42
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 33 out of 34 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

@ofaaland ofaaland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I see no functional problems

For a cleanly exported pools there exists a small window where
both systems may determine it's safe to import the pool and skip
the activity check.  Only allow the check to be skipped when the
last imported hostid matches the systems hostid and the pool was
cleanly exported.

Signed-off-by: Brian Behlendorf <[email protected]>
Move the "Starting import" log message in to the import block so
it's matched with the "Fiinshed importing" debug message.

Signed-off-by: Brian Behlendorf <[email protected]>
Tryimport adds a unique prefix to the pool name to avoid name
collisions.  This makes it awkward to log user-friendly info
during a tryimport.  Add a spa_load_name() function which can
be used to report the unmodified pool name.
Copilot AI review requested due to automatic review settings February 7, 2026 01:24
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 32 out of 33 changed files in this pull request and generated 9 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

As part of SPA_LOAD_IMPORT add an additional activity check to
detect simultaneous imports from different hosts.  This check is
only required when the timing is such that there's no activity
for the the read-only tryimport check to detect.  This extra
safety chceck operates as follows:

1. Repeats the following MMP check 10 times:
  a. Write out an MMP uberblock with the best txg and a random
     sequence id to all primary pool vdevs.
  b. Verify a minimum number of good writes such that even if
     the pool appears degraded on the remote host it will see
     at least one of the updated MMP uberblocks.
  c. Wait for the MMP interval this leaves a window for other
     racing hosts to make similar modifications which can be
     detected.
  d. Call vdev_uberblock_load() to determine the best uberblock
     to use, this should be the MMP uberblock just written.
  e. Verify the txg and random sequeunce number match the MMP
     uberblock written in 1a.

2. Restore the original MMP uberblocks.  This allows the check
   to be performed again if the pool fails to import for an
   unrelated reason.

This change also includes some refactoring and minor improvements.

- Never try loading earlier txgs during import when the import
  fails with EREMOTEIO or EINTER.  These errors don't indicate
  the txg is damaged but instead that its either in use on a
  remote host or the import was interactively cancelled.  No
  rewind is also performed for EBADD which can result from a
  stale trusted config when doing a verbatim import.

- Refactor the code for consistent logging of the multihost
  activity check using spa_load_note() and console messages
  indicating when the activity check was trigger and the result.

- Added MMP_*_MASK and MMP_SEQ_CLEAR() macros to allow easier
  modification of the sequence number in an uberblock.

- Added ZFS_LOAD_INFO_DEBUG environment variable which can be
  set to log to dump to stdout the spa_load_info nvlist returned
  during import.  This is used by the updated mmp test cases
  to determine if an activity check was run and its result.

- Standardize the mmp messages similarly to make it easier to
  find all the relevent mmp lines in the debug log.

Signed-off-by: Brian Behlendorf <[email protected]>
Add a -G option to zhack to dump the internal debug buffer on exit.
We were able to use the same code from zdb for this which was nice.

Signed-off-by: Brian Behlendorf <[email protected]>
In order to reliably test the multihost protection we need two (or more)
systems attempting to import the pool at the same time.  Historically, we've
used ztest running in userspace to simulate an active pool and attempted to
import the pool with the kernel modules.  This works but ztest is a bit
unwieldy for this and if it crashes for unrelated reasons it can result
in false positives.

All we really need is the pool imported in userspace so the MMP thread is
active and writing out uberblocks.  We can extend zhack which already knows
how to import the pool read/write and add an option to leave the pool open
and idle.

Signed-off-by: Brian Behlendorf <[email protected]>
- mmp_concurrent_import: added test case to verify that concurrent
  import correctness.  The pool may only be imported once.

- mmp_exported_import: an activity check is now required for pools
  which were cleanly exported if the system and pool hostids don't
  match.

- mmp_inactive_import: an activity check is now required for any
  pool which wasn't cleanly exported, even if the system and pool
  hostids match.

- mmp_on_uberblocks: updated expected uberblocks to take in to account
  the value MMP_INTERVAL_DEFAULT is set too.

- mmp_reset_interval: reduce the number of iterations from 10 to 3.
  This is sufficient to verify functionality and significantly speeds
  up the test.

- mmp_on_uberblocks: adjust the thresholds and increase the runtime
  to avoid false positives observed in CI.

- Update tests to use 'zhack action idle' instead of ztest to improve
  the reliability of the tests.

- Add additional log_note messages to test cases which have multiple
  verification steps to make it clear which portion of a test failed
  when reviewing the logs.

- Replace default_setup/cleanup_noexit calls with 'zpool create' and
  'zpool destroy' calls to avoid additional unnecessary dataset
  creation work.

- Update activity/noactivity check helper functions to use the
  ZFS_LOAD_INFO_DEBUG information now available from 'zpool import'
  to determine if this activity check ran and why.  This is more
  reliable in the CI than measuring the runtime.

- Removed all mmp tests from the zts-report.py exceptions list.

Signed-off-by: Brian Behlendorf <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Status: Code Review Needed Ready for review and testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants