Conversation
There was a problem hiding this comment.
Pull request overview
This pull request implements significant improvements to the OpenZFS multihost pool (MMP) feature to prevent simultaneous pool imports in shared storage configurations. The changes add a second "claim" phase to the activity check that writes unique MMP uberblocks and verifies them after a delay, effectively closing the narrow window where two systems could simultaneously determine a pool is inactive and both proceed with import.
Changes:
- Added a read/write claim phase during import that writes MMP uberblocks with random sequence IDs and verifies them after an MMP interval, repeated 10 times
- Refactored activity check logic to distinguish between tryimport, import/open, and recover load states with appropriate verification methods
- Enhanced MMP test infrastructure by migrating from ztest to zhack for simulating active pools and implementing new helper functions to verify activity checks ran
- Added comprehensive logging and diagnostics including new load_info nvlist entries for tryimport/import/claim durations and results
Reviewed changes
Copilot reviewed 28 out of 29 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| module/zfs/spa.c | Core implementation of new multi-phase activity check logic including tryimport verification, claim phase, and enhanced logging |
| module/zfs/mmp.c | New mmp_claim_uberblock() function to write and verify claim uberblocks across all vdev leaves |
| module/zfs/spa_misc.c | Added spa_load_name() helper to properly log pool names during tryimport |
| module/zfs/vdev_label.c | Made vdev_uberblock_compare() externally visible and improved logging |
| tests/zfs-tests/tests/functional/mmp/*.ksh | Updated test scripts to use new import verification functions and migrate from ztest to zhack |
| tests/zfs-tests/tests/functional/mmp/mmp.kshlib | Refactored activity check helpers to use ZFS_LOAD_INFO_DEBUG environment variable |
| tests/zfs-tests/tests/functional/mmp/mmp.cfg | Updated configuration including reducing MMP_INTERVAL_DEFAULT from 1000 to 500ms |
| cmd/zhack.c | Added new "action idle" subcommand to simulate active pool for testing |
| lib/libzfs/libzfs_pool.c | Added ZFS_LOAD_INFO_DEBUG support for debugging import activity checks |
| include/sys/*.h | New function declarations, constants, and data structures to support claim phase |
| tests/test-runner/bin/zts-report.py.in | Removed MMP tests from known failures list |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
77a543e to
7df392a
Compare
7df392a to
c7c89de
Compare
c7c89de to
2c4345e
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 31 out of 32 changed files in this pull request and generated 13 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
2c4345e to
129398d
Compare
129398d to
4986369
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 34 out of 35 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
4986369 to
91fbdea
Compare
91fbdea to
e582885
Compare
e582885 to
74cd13a
Compare
74cd13a to
c31b70a
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 33 out of 34 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
c31b70a to
ad4acdb
Compare
ofaaland
left a comment
There was a problem hiding this comment.
LGTM, I see no functional problems
For a cleanly exported pools there exists a small window where both systems may determine it's safe to import the pool and skip the activity check. Only allow the check to be skipped when the last imported hostid matches the systems hostid and the pool was cleanly exported. Signed-off-by: Brian Behlendorf <[email protected]>
Move the "Starting import" log message in to the import block so it's matched with the "Fiinshed importing" debug message. Signed-off-by: Brian Behlendorf <[email protected]>
Tryimport adds a unique prefix to the pool name to avoid name collisions. This makes it awkward to log user-friendly info during a tryimport. Add a spa_load_name() function which can be used to report the unmodified pool name.
ad4acdb to
51788d4
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 32 out of 33 changed files in this pull request and generated 9 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
As part of SPA_LOAD_IMPORT add an additional activity check to
detect simultaneous imports from different hosts. This check is
only required when the timing is such that there's no activity
for the the read-only tryimport check to detect. This extra
safety chceck operates as follows:
1. Repeats the following MMP check 10 times:
a. Write out an MMP uberblock with the best txg and a random
sequence id to all primary pool vdevs.
b. Verify a minimum number of good writes such that even if
the pool appears degraded on the remote host it will see
at least one of the updated MMP uberblocks.
c. Wait for the MMP interval this leaves a window for other
racing hosts to make similar modifications which can be
detected.
d. Call vdev_uberblock_load() to determine the best uberblock
to use, this should be the MMP uberblock just written.
e. Verify the txg and random sequeunce number match the MMP
uberblock written in 1a.
2. Restore the original MMP uberblocks. This allows the check
to be performed again if the pool fails to import for an
unrelated reason.
This change also includes some refactoring and minor improvements.
- Never try loading earlier txgs during import when the import
fails with EREMOTEIO or EINTER. These errors don't indicate
the txg is damaged but instead that its either in use on a
remote host or the import was interactively cancelled. No
rewind is also performed for EBADD which can result from a
stale trusted config when doing a verbatim import.
- Refactor the code for consistent logging of the multihost
activity check using spa_load_note() and console messages
indicating when the activity check was trigger and the result.
- Added MMP_*_MASK and MMP_SEQ_CLEAR() macros to allow easier
modification of the sequence number in an uberblock.
- Added ZFS_LOAD_INFO_DEBUG environment variable which can be
set to log to dump to stdout the spa_load_info nvlist returned
during import. This is used by the updated mmp test cases
to determine if an activity check was run and its result.
- Standardize the mmp messages similarly to make it easier to
find all the relevent mmp lines in the debug log.
Signed-off-by: Brian Behlendorf <[email protected]>
Add a -G option to zhack to dump the internal debug buffer on exit. We were able to use the same code from zdb for this which was nice. Signed-off-by: Brian Behlendorf <[email protected]>
In order to reliably test the multihost protection we need two (or more) systems attempting to import the pool at the same time. Historically, we've used ztest running in userspace to simulate an active pool and attempted to import the pool with the kernel modules. This works but ztest is a bit unwieldy for this and if it crashes for unrelated reasons it can result in false positives. All we really need is the pool imported in userspace so the MMP thread is active and writing out uberblocks. We can extend zhack which already knows how to import the pool read/write and add an option to leave the pool open and idle. Signed-off-by: Brian Behlendorf <[email protected]>
- mmp_concurrent_import: added test case to verify that concurrent import correctness. The pool may only be imported once. - mmp_exported_import: an activity check is now required for pools which were cleanly exported if the system and pool hostids don't match. - mmp_inactive_import: an activity check is now required for any pool which wasn't cleanly exported, even if the system and pool hostids match. - mmp_on_uberblocks: updated expected uberblocks to take in to account the value MMP_INTERVAL_DEFAULT is set too. - mmp_reset_interval: reduce the number of iterations from 10 to 3. This is sufficient to verify functionality and significantly speeds up the test. - mmp_on_uberblocks: adjust the thresholds and increase the runtime to avoid false positives observed in CI. - Update tests to use 'zhack action idle' instead of ztest to improve the reliability of the tests. - Add additional log_note messages to test cases which have multiple verification steps to make it clear which portion of a test failed when reviewing the logs. - Replace default_setup/cleanup_noexit calls with 'zpool create' and 'zpool destroy' calls to avoid additional unnecessary dataset creation work. - Update activity/noactivity check helper functions to use the ZFS_LOAD_INFO_DEBUG information now available from 'zpool import' to determine if this activity check ran and why. This is more reliable in the CI than measuring the runtime. - Removed all mmp tests from the zts-report.py exceptions list. Signed-off-by: Brian Behlendorf <[email protected]>
51788d4 to
f502c9b
Compare
Motivation and Context
The multihost pool property in OpenZFS is a safety feature used when a pool might be visible from more than one system, typically in shared storage setup. When enabled it adds an activity check to the import path which monitors the pool for activity. It will then refuse to import the pool if it looks like another node still has it imported.
This check has worked reliably for years for the intended use case of preventing an import when the pool is active on another system. However, if a pool is simultaneously imported on two different nodes there exists a small window where both systems may determine the pool is inactive and proceed with the import. There is a long standing mitigation which adds a small random delay to the import to make this scenario even less likely, regardless it is still technically possible. This PR aims to improve the
multihostfeature and close that narrow window.Description
The above race is possible because the multihost feature was designed to perform a read-only activity check. In practice, this protects against the vast majority of cases and it has served us well. This PR proposes to prevent the simultaneous import case, where both systems determine the pool to be idle (because it is), by adding a second read/write phase to claim an inactive pool. This phase is done before the import is allowed to proceed and start syncing TXGs. The extra safety check operates as follows:
multihost=onwhich have already been determined to be idle perform a second "claim" check. The idea is to make a targeted safe modification can be detected by another system trying to import the pool. To do this write out an MMP uberblock to all the labels which is identical to the best uberblock but with a random sequence id.This change also includes some refactoring and minor bug fixes for corner cases discovered during development. These are discussed in the commit comments, but to briefly summarize the highlights.
For a cleanly exported pools there exists a small window where both systems may determine it's safe to import the pool and skip the activity check. Only allow the check to be skipped when the last imported hostid matches the systems hostid and the pool was cleanly exported.
Never try loading earlier txgs during import when the import fails with EREMOTEIO or EINTER. These errors don't indicate the txg is damaged but instead that its either in use on a remote host or the import was interactively cancelled. No rewind is also performed for EBADF which can result from a stale trusted config when doing a verbatim import.
Added ZFS_LOAD_INFO_DEBUG environment variable which can be set to log to dump to stdout the spa_load_info nvlist returned during import. This is used by the updated mmp test cases to determine if an activity check was run and its result.
Refactor the code for consistent logging of the multihost activity check using spa_load_note() / spa_load_fail().
Add console messages indicating when the activity check was triggered and the result. This provides a helpful timeline of events in the logs for any postmortem analysis.
Added MMP_*_MASK and MMP_SEQ_CLEAR() macros to allow easier modification of the sequence number in an uberblock.
Updated ZTS MMP test cases accordingly, including updated tools to make it easier to write reliably tests for the CI.
How Has This Been Tested?
Performed 9,801 simultaneous imports from two systems attached to a single JBOD with shared disks. This is the absolute worst case scenario. With this PR applied there were zero instances where the pool was accidentally imported on both systems at the same time. It always detected the issue and either the import would fail on both systems or one, depending on the exact timing. Looking more carefully at the exact scenarios:
Types of changes
Checklist:
Signed-off-by.