-
Notifications
You must be signed in to change notification settings - Fork 61
saga re-assignment should only grab sagas from Nexus of same generation #9270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| ) -> Result<Generation, Error> { | ||
| for (_sled_id, zone_config, nexus_config) in | ||
| self.all_nexus_zones(BlueprintZoneDisposition::is_in_service) | ||
| self.all_nexus_zones(BlueprintZoneDisposition::could_be_running) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why make this shift? From what I can tell, could_be_running is the same as is_in_service, but also includes zones which are "expunged, but not ready for cleanup".
If a Nexus zone is expunged, do we want to be including it here? (If we needed this above, wouldn't it only be used when re-assigning sagas from one expunged Nexus to another expunged Nexus?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically just because it's possible for an expunged, not-yet-cleaned-up Nexus to still be running and call this function and it can correctly determine its generation. You're right that in this case, it only enables it to assign sagas to itself, which isn't useful. Those will wind up re-assigned to another after this one becomes ready for cleanup.
| // want some expunged zones and some non-expunged zones in each of two | ||
| // different generations. | ||
|
|
||
| // Frst, create a basic blueprint with several Nexus zones in generation 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Frst, create a basic blueprint with several Nexus zones in generation 1. | |
| // First, create a basic blueprint with several Nexus zones in generation 1. |
| // It is possible for the expunged and not-yet-ready-for-cleanup zone in | ||
| // generation 2 to wind up calling this function. It should not find | ||
| // itself! | ||
| let matched = find_expunged_same_generation( | ||
| &blueprint3, | ||
| SecId(g2_expunged_not_cleaned_up.into_untyped_uuid()), | ||
| ) | ||
| .unwrap(); | ||
| assert_eq!(matched.len(), 1); | ||
| assert_eq!(matched[0], g2_matched); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that we don't want it to find itself, but why let it find the other zones either? This lets us do re-assignment from expunged zone -> expunged zone, which is pointless
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is finding the not expunged Nexus from g2, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
g2_matched is set on line 246 to g2_expunged_cleaned_up - I think this is returning an expunged Nexus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you're right. I was confused by this (and the related question about the change to find_generation_for_self() above). I picked Dave's brain for a few minutes and came away convinced the changes here are correct.
Blueprint::find_generation_for_self() seems like it can (and should) succeed in returning the generation of any running Nexus that calls it, even if that Nexus is expunged-but-still-running. In the specific case of reassign_sagas_from_expunged(), it seems like it would be fine for find_generation_for_self() to fail if called by an expunged-but-still-running Nexus, but I think that'd be surprising behavior for the method more generally.
reassign_sagas_from_expunged() certainly could check whether the nexus_id it's been given belongs to an expunged-but-still-running Nexus, and refuse to reassign sagas to itself in that case. But this could only be an optimization, not something that matters for correctness: this method is always inherently racing against the target blueprint changing, so it's always possible that the blueprint we're looking at shows the calling Nexus is not-expunged, then we reassign sagas to ourselves, then a new target blueprint is set in which we're expunged-but-still-running (even before we start executing any of the sagas we just reassigned). I don't feel strongly about whether this particular optimization is warranted: it would be pretty rare for it to come up, I think, and if it does the cost is presumably pretty small? If we had reason to believe those weren't true it'd be worth adding an explicit check here and bailing out, but I can't come up with such a reason.
| // It is possible for the expunged and not-yet-ready-for-cleanup zone in | ||
| // generation 2 to wind up calling this function. It should not find | ||
| // itself! | ||
| let matched = find_expunged_same_generation( | ||
| &blueprint3, | ||
| SecId(g2_expunged_not_cleaned_up.into_untyped_uuid()), | ||
| ) | ||
| .unwrap(); | ||
| assert_eq!(matched.len(), 1); | ||
| assert_eq!(matched[0], g2_matched); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is finding the not expunged Nexus from g2, right?
|
Deploying Omicron from f3267ee and live tests from d5739b9, the live tests pass: |
Fixes #2318 and #9327.