@davepacheco davepacheco commented Oct 22, 2025

Fixes #2318 and #9327.

@davepacheco davepacheco self-assigned this Oct 22, 2025
      ) -> Result<Generation, Error> {
          for (_sled_id, zone_config, nexus_config) in
-             self.all_nexus_zones(BlueprintZoneDisposition::is_in_service)
+             self.all_nexus_zones(BlueprintZoneDisposition::could_be_running)
Collaborator

Why make this shift? From what I can tell, could_be_running is the same as is_in_service, but also includes zones which are "expunged, but not ready for cleanup".

If a Nexus zone is expunged, do we want to be including it here? (If we needed this above, wouldn't it only be used when re-assigning sagas from one expunged Nexus to another expunged Nexus?)

Collaborator Author

Basically just because it's possible for an expunged, not-yet-cleaned-up Nexus to still be running and call this function, and this way it can correctly determine its generation. You're right that in this case, it only enables it to assign sagas to itself, which isn't useful. Those will wind up re-assigned to another Nexus after this one becomes ready for cleanup.
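
To make the distinction in this thread concrete, here is a minimal sketch under simplified, hypothetical types. `Disposition`, `Zone`, and the free function `find_generation_for_self` below are stand-ins for illustration and are not Omicron's real definitions or signatures. The sketch models `could_be_running` as `is_in_service` plus expunged zones that are not yet ready for cleanup, and shows why the broader filter lets an expunged-but-still-running Nexus find its own generation.

```rust
/// Simplified stand-in for a zone disposition (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Disposition {
    InService,
    Expunged { ready_for_cleanup: bool },
}

impl Disposition {
    /// True only for zones that are still in service.
    fn is_in_service(self) -> bool {
        matches!(self, Disposition::InService)
    }

    /// True for in-service zones and for expunged zones that have not yet
    /// been marked ready for cleanup (they may still be running).
    fn could_be_running(self) -> bool {
        match self {
            Disposition::InService => true,
            Disposition::Expunged { ready_for_cleanup } => !ready_for_cleanup,
        }
    }
}

/// Hypothetical, pared-down Nexus zone record.
#[derive(Clone, Copy, Debug)]
struct Zone {
    id: u32,
    generation: u64,
    disposition: Disposition,
}

/// Look up the caller's generation among the zones accepted by `filter`.
fn find_generation_for_self(
    zones: &[Zone],
    self_id: u32,
    filter: fn(Disposition) -> bool,
) -> Option<u64> {
    zones
        .iter()
        .filter(|z| filter(z.disposition))
        .find(|z| z.id == self_id)
        .map(|z| z.generation)
}

/// Usage: zone 2 is expunged but may still be running.
fn example_usage() {
    let zones = [
        Zone { id: 1, generation: 2, disposition: Disposition::InService },
        Zone {
            id: 2,
            generation: 2,
            disposition: Disposition::Expunged { ready_for_cleanup: false },
        },
    ];
    // The narrower filter hides the caller from itself.
    assert_eq!(
        find_generation_for_self(&zones, 2, Disposition::is_in_service),
        None
    );
    // The broader filter lets it determine its own generation.
    assert_eq!(
        find_generation_for_self(&zones, 2, Disposition::could_be_running),
        Some(2)
    );
}
```

In this toy model, the only behavioral difference between the two filters is the expunged, not-ready-for-cleanup case, which matches the reading given in the comments above.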

// want some expunged zones and some non-expunged zones in each of two
// different generations.

// Frst, create a basic blueprint with several Nexus zones in generation 1.
Collaborator

Suggested change
- // Frst, create a basic blueprint with several Nexus zones in generation 1.
+ // First, create a basic blueprint with several Nexus zones in generation 1.

Comment on lines +257 to +266
// It is possible for the expunged and not-yet-ready-for-cleanup zone in
// generation 2 to wind up calling this function. It should not find
// itself!
let matched = find_expunged_same_generation(
    &blueprint3,
    SecId(g2_expunged_not_cleaned_up.into_untyped_uuid()),
)
.unwrap();
assert_eq!(matched.len(), 1);
assert_eq!(matched[0], g2_matched);
Collaborator

Agreed that we don't want it to find itself, but why let it find the other zones either? This lets us do re-assignment from one expunged zone to another expunged zone, which is pointless.

Contributor

This is finding the non-expunged Nexus from g2, right?

Collaborator

g2_matched is set on line 246 to g2_expunged_cleaned_up, so I think this is returning an expunged Nexus.

Contributor

Yeah, you're right. I was confused by this (and the related question about the change to find_generation_for_self() above). I picked Dave's brain for a few minutes and came away convinced the changes here are correct.

Blueprint::find_generation_for_self() seems like it can (and should) succeed in returning the generation of any running Nexus that calls it, even if that Nexus is expunged-but-still-running. In the specific case of reassign_sagas_from_expunged(), it seems like it would be fine for find_generation_for_self() to fail if called by an expunged-but-still-running Nexus, but I think that'd be surprising behavior for the method more generally.

reassign_sagas_from_expunged() certainly could check whether the nexus_id it's been given belongs to an expunged-but-still-running Nexus, and refuse to reassign sagas to itself in that case. But this could only be an optimization, not something that matters for correctness: this method is always inherently racing against the target blueprint changing, so it's always possible that the blueprint we're looking at shows the calling Nexus is not-expunged, then we reassign sagas to ourselves, then a new target blueprint is set in which we're expunged-but-still-running (even before we start executing any of the sagas we just reassigned). I don't feel strongly about whether this particular optimization is warranted: it would be pretty rare for it to come up, I think, and if it does the cost is presumably pretty small? If we had reason to believe those weren't true it'd be worth adding an explicit check here and bailing out, but I can't come up with such a reason.
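
As a rough illustration of the optimization weighed above, the sketch below uses hypothetical, simplified types. `Zone`, `Disposition`, `should_skip_reassignment`, and this version of `find_expunged_same_generation` are stand-ins and do not reproduce Omicron's actual signatures. It assumes, based on the test under review, that the lookup returns expunged, ready-for-cleanup zones in the caller's generation and never the caller itself. The guard shows where an "am I expunged?" check could sit; per the discussion, it would be an optimization only, because the target blueprint can change right after the check.

```rust
/// Simplified stand-in for a zone disposition (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Disposition {
    InService,
    Expunged { ready_for_cleanup: bool },
}

/// Hypothetical, pared-down Nexus zone record.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Zone {
    id: u32,
    generation: u64,
    disposition: Disposition,
}

/// Returns the ids of expunged, ready-for-cleanup zones that share the
/// caller's generation, excluding the caller. Returns `None` if the caller
/// is not present in `zones` at all.
fn find_expunged_same_generation(zones: &[Zone], self_id: u32) -> Option<Vec<u32>> {
    let me = zones.iter().find(|z| z.id == self_id)?;
    Some(
        zones
            .iter()
            .filter(|z| z.id != self_id && z.generation == me.generation)
            .filter(|z| {
                matches!(
                    z.disposition,
                    Disposition::Expunged { ready_for_cleanup: true }
                )
            })
            .map(|z| z.id)
            .collect(),
    )
}

/// Optional guard (an optimization, not a correctness requirement): a caller
/// that is itself expunged could decline to take on more sagas.
fn should_skip_reassignment(zones: &[Zone], self_id: u32) -> bool {
    zones.iter().any(|z| {
        z.id == self_id && matches!(z.disposition, Disposition::Expunged { .. })
    })
}

/// Usage: zone 22 is expunged but not yet ready for cleanup, so it may
/// still be running and may call the lookup.
fn example_usage() {
    let zones = [
        Zone { id: 10, generation: 1, disposition: Disposition::Expunged { ready_for_cleanup: true } },
        Zone { id: 20, generation: 2, disposition: Disposition::InService },
        Zone { id: 21, generation: 2, disposition: Disposition::Expunged { ready_for_cleanup: true } },
        Zone { id: 22, generation: 2, disposition: Disposition::Expunged { ready_for_cleanup: false } },
    ];
    // The caller finds the cleaned-up peer in its generation, not itself.
    assert_eq!(find_expunged_same_generation(&zones, 22), Some(vec![21]));
    // With the guard, the expunged caller would bail out early instead.
    assert!(should_skip_reassignment(&zones, 22));
    assert!(!should_skip_reassignment(&zones, 20));
}
```

Even with such a guard, a Nexus could pass the check and then become expunged under a newer target blueprint, so later re-assignment after cleanup still has to handle that case.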

@davepacheco
Collaborator Author

With Omicron deployed from f3267ee and live tests built from d5739b9, the live tests pass:

root@oxz_switch:~# TMPDIR=/var/tmp ./cargo-nextest nextest run --profile=live-tests --archive-file live-tests-archive/omicron-live-tests.tar.zst --workspace-remap live-tests-archive
  Extracting 2 binaries, 1 build script output directory, and 5 linked paths to /var/tmp/nextest-archive-bd7M8b
   Extracted 79 files to /var/tmp/nextest-archive-bd7M8b in 2.28s
warning: this repository recommends nextest version 0.9.110, but the current version is 0.9.98
info: experimental features enabled: setup-scripts
------------
 Nextest run ID 0a36087c-a83a-4eee-9c3e-4fb36a419a6d with nextest profile: live-tests
    Starting 2 tests across 2 binaries
        SLOW [> 60.000s] omicron-live-tests::test_nexus_add_remove test_nexus_add_remove
        SLOW [>120.000s] omicron-live-tests::test_nexus_add_remove test_nexus_add_remove
        PASS [ 125.188s] omicron-live-tests::test_nexus_add_remove test_nexus_add_remove
        SLOW [> 60.000s] omicron-live-tests::test_nexus_handoff test_nexus_handoff
        SLOW [>120.000s] omicron-live-tests::test_nexus_handoff test_nexus_handoff
        SLOW [>180.000s] omicron-live-tests::test_nexus_handoff test_nexus_handoff
        SLOW [>240.000s] omicron-live-tests::test_nexus_handoff test_nexus_handoff
        SLOW [>300.000s] omicron-live-tests::test_nexus_handoff test_nexus_handoff
        PASS [ 305.112s] omicron-live-tests::test_nexus_handoff test_nexus_handoff
------------
     Summary [ 430.314s] 2 tests run: 2 passed (2 slow), 0 skipped
warning: this repository recommends nextest version 0.9.110, but the current version is 0.9.98
info: update nextest with cargo nextest self update, or bypass check with --override-version-check

@davepacheco davepacheco linked an issue Nov 4, 2025 that may be closed by this pull request
@davepacheco davepacheco enabled auto-merge (squash) November 4, 2025 21:19
@davepacheco davepacheco merged commit 97528cd into main Nov 4, 2025
16 checks passed
@davepacheco davepacheco deleted the dap/steno-update branch November 4, 2025 22:08
Development

Successfully merging this pull request may close these issues.

live tests broken after nexus lockstep API change
steno upgrade

4 participants