SEP75 Automatic Workload Migration #850
base: main
Conversation
Force-pushed from 6f842cc to 1f95a6b
Codecov Report ✅ All modified and coverable lines are covered by tests.
Coverage diff vs. main (#850): Coverage 76.70% → 77.48% (+0.78%). Files: 44 → 44. Lines: 2640 → 2834 (+194). Hits: 2025 → 2196 (+171). Misses: 529 → 529. Partials: 86 → 109 (+23).
Force-pushed from 1f95a6b to 7a66e00
// The highest version that the Sail Operator will update to. Updates to versions later than MaximumVersion
// will not trigger workload migration.
// If unset, the operator will only trigger workload migration for new patch versions on the current minor version stream.
MaximumVersion *string `json:"maximumVersion,omitempty"`
Q: Does this mean the maximum version could be set to a version that is several versions ahead of the currently running one, not just the next one?
Or, looked at from the other side, is it meant to set a specific version that I don't want to go beyond?
What would be a use case for this? To avoid automatic workload migration when upgrading to a new minor version?
Yes - basically, by default it will only update patch versions unless you set a higher MaximumVersion (now that I'm reading it again, maybe MaxVersion is a better name).
I added a bunch of explanations for this field (and also renamed it maxVersion)
- **Phase 1**: Namespace label updates (no restarts required yet)
- **Phase 2**: Deployment restarts in configurable batches

4. **Validation**: Each batch waits for readiness before proceeding to the next batch
If the validation fails, does the whole migration fail, or does it wait for users to manually fix the problem and then continue with the other workloads? If it fails, can it be triggered again manually once the problematic workload has been migrated by hand?
Undefined right now. I'm still not sure how to do it - as you say, manual intervention is probably required, but it's kind of tricky with CRDs. We don't want to force the user to "make some unrelated change to your Istio resource to restart the process". So maybe the solution will be to roll back if something goes wrong?
I'm not sure it makes sense to roll back workloads that were already migrated successfully. If the validation fails, I think the safest option is to stop the migration at that point and notify users to finish the migration manually. The question is whether we want to try to migrate all workloads and report at the end which ones failed, or to fail fast after the first failure.
> The question is whether we want to try to migrate all workloads and report at the end which ones failed, or to fail fast after the first failure.
Users will likely want to control this. Some kind of FailurePolicy?
How does the operator "Report failure"? Will the Istio status be updated? Will it show all the workloads that failed?
Good point, I'll add a section about status. Yes, the idea would be that we have a status condition for this. I guess it would suffice if we list the workloads that failed the migration?
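For illustration, here is a minimal sketch of the per-batch loop being discussed: restart a batch, wait for readiness, and collect failures for reporting in the status rather than aborting. The function and parameter names (`migrateInBatches`, `restart`, `waitReady`) are hypothetical and not part of the proposal.

```go
package migration

import "time"

// Sketch only: restart each batch of deployments, wait for readiness, and
// collect failures so they can be reported at the end (e.g. in a status
// condition) instead of failing fast.
func migrateInBatches(
	batches [][]string,
	restart func(deployment string),
	waitReady func(deployment string, timeout time.Duration) bool,
	readinessTimeout time.Duration,
	delayBetweenBatches time.Duration,
) (failed []string) {
	for i, batch := range batches {
		if i > 0 {
			time.Sleep(delayBetweenBatches) // honor the configured delay between batches
		}
		for _, d := range batch {
			restart(d) // e.g. trigger a rollout restart of the deployment
		}
		for _, d := range batch {
			if !waitReady(d, readinessTimeout) {
				// Continue with the remaining batches and report at the end.
				failed = append(failed, d)
			}
		}
	}
	return failed
}
```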
Update the doc to now use an enum (
Force-pushed from 7a66e00 to 1e6c512
// Defines how workloads should be moved from one control plane instance to another automatically.
// Defaults to "". If not set, workload migration is disabled.
// +kubebuilder:default=""
Strategy WorkloadMigrationStrategy `json:"strategy,omitempty"`
The other fields become irrelevant when you set this to "", if "" means workload migration is disabled. What do you think of making the WorkloadMigration field in IstioUpdateStrategy a pointer, so that defining it is what enables workload migration, e.g.
updateStrategy:
  workloadMigration: {}
Then we can default the Strategy field to something like batched.
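A minimal sketch of what that suggestion could look like in the API types, assuming the field and type names used in this thread (`WorkloadMigration`, `WorkloadMigrationStrategy`); not the final shape:

```go
// Sketch only: workload migration is enabled by the presence of the
// workloadMigration block, and Strategy defaults to Batched.
type IstioUpdateStrategy struct {
	// ... existing fields ...

	// If set, workloads are migrated automatically; if nil, migration is disabled.
	// +optional
	WorkloadMigration *WorkloadMigration `json:"workloadMigration,omitempty"`
}

type WorkloadMigration struct {
	// Defaults to Batched when the workloadMigration block is present.
	// +kubebuilder:default=Batched
	Strategy WorkloadMigrationStrategy `json:"strategy,omitempty"`
}
```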
#### IstioUpdateStrategy Enhancement

The existing `IstioUpdateStrategy` type is extended with a new `WorkloadMigration` field:
What happens to the existing updateWorkloads field?
sail-operator/api/v1/istio_types.go, lines 94 to 101 in fbb1809:
// Defines whether the workloads should be moved from one control plane instance to another
// automatically. If updateWorkloads is true, the operator moves the workloads from the old
// control plane instance to the new one after the new control plane is ready.
// If updateWorkloads is false, the user must move the workloads manually by updating the
// istio.io/rev labels on the namespace and/or the pods.
// Defaults to false.
// +operator-sdk:csv:customresourcedefinitions:type=spec,order=3,displayName="Update Workloads Automatically",xDescriptors={"urn:alm:descriptor:com.tectonic.ui:booleanSwitch"}
UpdateWorkloads bool `json:"updateWorkloads,omitempty"`
oh no, I didn't realize we had this already 🙁 I guess we'll need to use it then. And if we use UpdateWorkloads in one place, we can't really use WorkloadMigration elsewhere
I guess we could move the settings into updateStrategy.workloads then. It would fit with updateStrategy.updateWorkloads
I don't think this field is used for anything today? It was added for this feature, but the feature never got implemented. We could also mark this field as deprecated and use workloadMigration instead.
I added a paragraph that we're removing that old field.
// Maximum number of deployments to restart concurrently during migration.
// Defaults to 1.
// +kubebuilder:default=1
// +kubebuilder:validation:Minimum=1
BatchSize *int32 `json:"batchSize,omitempty"`

// Time to wait between deployment restart batches.
// Defaults to 30s.
// +kubebuilder:default="30s"
DelayBetweenBatches *metav1.Duration `json:"delayBetweenBatches,omitempty"`
Will these fields be relevant for strategies other than batched? Should they be under their own struct that only holds fields for the batched strategy, similar to how volume sources are defined? Maybe you can do something more clever with a discriminated union type based on the Strategy field.
Absolutely. Moved under batched
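One possible shape for that, sketched here with hypothetical type names (`BatchedMigration`), grouping the batch-specific knobs under the strategy they belong to:

```go
// Sketch only: batch-specific settings grouped under their own struct, so they
// only apply when Strategy is Batched (hypothetical names, not the final API).
// metav1 refers to k8s.io/apimachinery/pkg/apis/meta/v1.
type WorkloadMigration struct {
	Strategy WorkloadMigrationStrategy `json:"strategy,omitempty"`

	// Settings that only apply to the Batched strategy.
	// +optional
	Batched *BatchedMigration `json:"batched,omitempty"`
}

type BatchedMigration struct {
	// Maximum number of deployments to restart concurrently during migration.
	// +kubebuilder:default=1
	// +kubebuilder:validation:Minimum=1
	BatchSize *int32 `json:"batchSize,omitempty"`

	// Time to wait between deployment restart batches.
	// +kubebuilder:default="30s"
	DelayBetweenBatches *metav1.Duration `json:"delayBetweenBatches,omitempty"`
}
```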
// The highest version that the Sail Operator will update to. Updates to versions later than MaximumVersion
// will not trigger workload migration.
// If unset, the operator will only trigger workload migration for new patch versions on the current minor version stream.
MaximumVersion *string `json:"maximumVersion,omitempty"`
If you want the operator to upgrade your workloads when going from say 1.25 --> 1.26, you'd need to set spec.version = 1.26 and spec.updateStrategy.maximumVersion = 1.26? For each minor upgrade you'd need to keep setting spec.updateStrategy.maximumVersion?
Yes, effectively. Which is a good thing I believe, as minor version upgrades could break your config. Or you set it to something like 1.999.0 to always upgrade
I actually did change this as I think it was confusing. Now, if you don't set maxVersion, there simply is no maximum, which I think is more what you would expect as a user.
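To make those semantics concrete, here is a small sketch of the gating check under the changed behavior (unset maxVersion means no upper bound). The function name `shouldMigrate` is hypothetical; the comparison uses `golang.org/x/mod/semver`, which expects a leading "v" on version strings.

```go
package migration

import "golang.org/x/mod/semver"

// shouldMigrate sketches the maxVersion gating discussed above: if maxVersion
// is unset (nil), any new version triggers workload migration; otherwise
// migration is only triggered when the target version does not exceed it.
func shouldMigrate(targetVersion string, maxVersion *string) bool {
	if maxVersion == nil {
		// No maximum configured: there simply is no upper bound.
		return true
	}
	// Migrate only if the target version is not newer than the configured maximum.
	return semver.Compare("v"+targetVersion, "v"+*maxVersion) <= 0
}
```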
// Maximum time to wait for a deployment to become ready after restart.
// Defaults to 5m.
// +kubebuilder:default="5m"
ReadinessTimeout *metav1.Duration `json:"readinessTimeout,omitempty"`
Will there be labels or annotations to control this per workload? e.g. sail.operator.io/readiness-timeout: 30m
You mean overriding the readinessTimeout for specific pods?
Interesting idea, but maybe something we could add later if needed.
+1. Let's keep it simple. We can consider it if there is a strong use case.
I added a whole section on annotations under Future Enhancements. I think once we have an implementation, this can really help make it production-ready
Force-pushed from 1e6c512 to 266f40e
Force-pushed from e1e8399 to e029cc3
Had a few thoughts, mainly around the idea of supporting more of a canary upgrade/migration process. You could make the argument that it could be a separate strategy (canary instead of batched) with its own configuration. From my perspective it would probably be pretty appealing to users. In general, though, this seems like a pretty compelling feature and a great differentiator for the operator!
### Failure Policy

The initial implementation will continue processing all batches even if individual workloads fail to migrate, reporting all failures at the end. Some users may want different behavior when migrations fail. We could implement a `failurePolicy` field that allows users to specify what the operator should do in case a workload does not become ready.
The initial implementation is somewhat at odds with a "canary upgrade" where you'd want to ensure a small subset of upgraded workloads are successful before proceeding, so IMO adding something like a failurePolicy will likely be pretty important for users, to be able to have a controlled upgrade / limit blast radius of any issues.
Potential `failurePolicy` values:
* `ContinueOnError`: Continue migrating all workloads, report failures at end
* `FailFast`: Stop immediately on first failure
A configurable % of workloads that are allowed to fail before stopping the migration might be nice as it would give finer-grained control over success/failure conditions (i.e. "continue migration as long as no more than 5% of workloads failed upgrade").
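One way that combination could be expressed, sketched with hypothetical field names (`FailurePolicy`, `MaxFailurePercentage`); not part of the proposal:

```go
// Sketch only: failure handling for workload migration.
type WorkloadMigrationFailurePolicy string

const (
	// Continue migrating all workloads and report failures at the end.
	ContinueOnError WorkloadMigrationFailurePolicy = "ContinueOnError"
	// Stop immediately on the first failure.
	FailFast WorkloadMigrationFailurePolicy = "FailFast"
)

type WorkloadMigrationFailureHandling struct {
	// +kubebuilder:validation:Enum=ContinueOnError;FailFast
	// +kubebuilder:default=ContinueOnError
	FailurePolicy WorkloadMigrationFailurePolicy `json:"failurePolicy,omitempty"`

	// Percentage of workloads allowed to fail before the migration is stopped.
	// Only meaningful with ContinueOnError; unset means no limit.
	// +kubebuilder:validation:Minimum=0
	// +kubebuilder:validation:Maximum=100
	MaxFailurePercentage *int32 `json:"maxFailurePercentage,omitempty"`
}
```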
**Potential annotations:**
- `sailoperator.io/readiness-timeout: 10s`: Override the readiness timeout for specific workloads
- `sailoperator.io/skip-migration: true`: Exclude specific workloads from automatic migration
- `sailoperator.io/migration-batch: 3`: Assign workload to a specific migration priority (integer, lower values migrate first)
Something like this would be very helpful for supporting canary-style upgrades as it would allow the user to identify the initial set of workloads (i.e. less critical) to try to migrate. Otherwise, it's possible that critical workloads could be ~randomly selected for migration first which would not be ideal.
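For illustration, a sketch of how the operator could honor these annotations when selecting and ordering workloads; the annotation keys are the ones listed above, while the helper names and defaults are hypothetical:

```go
package migration

import (
	"strconv"
	"time"

	appsv1 "k8s.io/api/apps/v1"
)

// skipMigration reports whether a Deployment opted out of automatic migration.
func skipMigration(d *appsv1.Deployment) bool {
	return d.Annotations["sailoperator.io/skip-migration"] == "true"
}

// readinessTimeout returns the per-workload override, or the operator default.
func readinessTimeout(d *appsv1.Deployment, defaultTimeout time.Duration) time.Duration {
	if v, ok := d.Annotations["sailoperator.io/readiness-timeout"]; ok {
		if t, err := time.ParseDuration(v); err == nil {
			return t
		}
	}
	return defaultTimeout
}

// migrationBatch returns the requested priority; lower values migrate first.
func migrationBatch(d *appsv1.Deployment) int {
	if v, ok := d.Annotations["sailoperator.io/migration-batch"]; ok {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return 0 // hypothetical default: unannotated workloads go in the default batch
}
```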
This adds an enhancement proposal on Automatic Workload Migration, which helps users by automatically restarting their workloads when updating their mesh. This feature will only be available in sidecar mode.
Signed-off-by: Daniel Grimm <[email protected]>
Force-pushed from e029cc3 to df43c23
type WorkloadMigrationStatus struct {
	// State represents the current state of the migration process.
	// +kubebuilder:validation:Enum=Idle;InProgress;Completed;Failed
What would be a difference between Idle and Completed?
maybe NotStarted is a better name
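That could look like the following sketch, using the `NotStarted` name suggested here (constant names are hypothetical):

```go
// Sketch only: possible states for the migration process, using NotStarted
// instead of Idle to make the "nothing has run yet" case explicit.
type WorkloadMigrationState string

const (
	MigrationNotStarted WorkloadMigrationState = "NotStarted"
	MigrationInProgress WorkloadMigrationState = "InProgress"
	MigrationCompleted  WorkloadMigrationState = "Completed"
	MigrationFailed     WorkloadMigrationState = "Failed"
)
```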
// Name of the failed workload.
Name string `json:"name"`

// Kind of the failed workload (e.g., "Deployment").
Is this needed if we only migrate Deployments and nothing else?
Good call. We can leave it out for now and only add it once we support other kinds.
First draft