
Conversation

@dgn
Collaborator

@dgn dgn commented May 26, 2025

First draft

@dgn dgn requested a review from a team as a code owner May 26, 2025 13:11
@dgn dgn force-pushed the sep-workload-migration branch from 6f842cc to 1f95a6b on May 26, 2025 13:13
@dgn dgn changed the title from SEP74 Automatic Workload Migration to SEP75 Automatic Workload Migration on May 26, 2025
@codecov

codecov bot commented May 26, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.48%. Comparing base (eb6389f) to head (df43c23).
⚠️ Report is 196 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #850      +/-   ##
==========================================
+ Coverage   76.70%   77.48%   +0.78%     
==========================================
  Files          44       44              
  Lines        2640     2834     +194     
==========================================
+ Hits         2025     2196     +171     
  Misses        529      529              
- Partials       86      109      +23     


@dgn dgn force-pushed the sep-workload-migration branch from 1f95a6b to 7a66e00 on May 26, 2025 13:25
// The highest version that the Sail Operator will update to. Updates to versions later than MaximumVersion
// will not trigger workload migration.
// If unset, the operator will only trigger workload migration for new patch versions on the current minor version stream.
MaximumVersion *string `json:"maximumVersion,omitempty"`
Collaborator

@MaxBab MaxBab May 26, 2025


Q: Does this mean that the maximum version could be set to a version that is several versions ahead of the one currently running, not just the next one?
Or, looking at it the other way, is it meant to pin a specific version that I don't want to go past?

Collaborator


What would be a use case for this? To avoid automatic workload migration when upgrading to a new minor version?

Collaborator Author


Yes - basically, by default it will only update patch versions, unless you set a higher MaximumVersion (now that I'm reading it again, maybe MaxVersion is better).

Collaborator Author


I added a bunch of explanations for this field (and also renamed it maxVersion)


- **Phase 1**: Namespace label updates (no restarts required yet)
- **Phase 2**: Deployment restarts in configurable batches

4. **Validation**: Each batch waits for readiness before proceeding to the next batch
Collaborator


If the validation fails, does the whole migration fail, or does it wait for users to manually fix the problem and then continue with the other workloads? And if it fails, can the migration be triggered again manually once the problematic workload has been migrated by hand?

Collaborator Author


Undefined right now. I'm still not sure how to do it - as you say, manual intervention is probably required, but it's kind of tricky with CRDs. We don't want to force the user to "make some unrelated change to your Istio resource to restart the process". So maybe the solution will be to roll back if something goes wrong?

Collaborator


I'm not sure it makes sense to roll back already successfully migrated workloads. If the validation fails, I think the safest option is to stop the migration at that point and notify users to finish the migration manually. Question is if we want to try to migrate all workloads and report at the end which workloads failed to be migrated or to fail fast after first failure.

Contributor


Question is if we want to try to migrate all workloads and report at the end which workloads failed to be migrated or to fail fast after first failure.

Users will likely want to control this. Some kind of FailurePolicy?

Contributor


How does the operator "Report failure"? Will the Istio status be updated? Will it show all the workloads that failed?

Collaborator Author


Good point, I'll add a section about status. Yes, the idea would be that we have a status condition for this. I guess it would suffice if we list the workloads that failed the migration?

@dgn
Collaborator Author

dgn commented May 27, 2025

Updated the doc to use an enum (strategy) instead of a bool (enabled) for enablement. Not sure I like the stuttering (updateStrategy.workloadMigration.strategy is not great), but I couldn't come up with a better term right now.

@dgn dgn force-pushed the sep-workload-migration branch from 7a66e00 to 1e6c512 on May 28, 2025 08:41
Comment on lines +50 to +58
// Defines how workloads should be moved from one control plane instance to another automatically.
// Defaults to "". If not set, workload migration is disabled.
// +kubebuilder:default=""
Strategy WorkloadMigrationStrategy `json:"strategy,omitempty"`
Contributor


The other fields become irrelevant when you set this to "" if "" means workload migration is disabled. What do you think of making the WorkloadMigration field in IstioUpdateStrategy a pointer, so that defining it enables workload migration? e.g.

updateStrategy:
  workloadMigration: {}

Then we can default the Strategy field to something like batched.
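As a rough illustration of that pointer-based shape (the type names, field names, and the Batched default here are assumptions from this thread, not the final API):

// Sketch only: WorkloadMigration is nil by default; setting it (even to an
// empty object) enables automatic workload migration.
type IstioUpdateStrategy struct {
	// ...existing fields...

	// Configures automatic workload migration. Migration is enabled
	// whenever this field is set.
	// +optional
	WorkloadMigration *WorkloadMigrationConfig `json:"workloadMigration,omitempty"`
}

type WorkloadMigrationConfig struct {
	// Defines how workloads are moved to the new control plane instance.
	// Defaults to Batched when WorkloadMigration is set.
	// +kubebuilder:default=Batched
	Strategy WorkloadMigrationStrategy `json:"strategy,omitempty"`
}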


#### IstioUpdateStrategy Enhancement

The existing `IstioUpdateStrategy` type is extended with a new `WorkloadMigration` field:
Contributor


What happens to the existing updateWorkloads field?

// Defines whether the workloads should be moved from one control plane instance to another
// automatically. If updateWorkloads is true, the operator moves the workloads from the old
// control plane instance to the new one after the new control plane is ready.
// If updateWorkloads is false, the user must move the workloads manually by updating the
// istio.io/rev labels on the namespace and/or the pods.
// Defaults to false.
// +operator-sdk:csv:customresourcedefinitions:type=spec,order=3,displayName="Update Workloads Automatically",xDescriptors={"urn:alm:descriptor:com.tectonic.ui:booleanSwitch"}
UpdateWorkloads bool `json:"updateWorkloads,omitempty"`

Collaborator Author


oh no, I didn't realize we had this already 🙁 I guess we'll need to use it then. And if we use UpdateWorkloads in one place, we can't really use WorkloadMigration elsewhere

Collaborator Author


I guess we could move the settings into updateStrategy.workloads then. It would fit with updateStrategy.updateWorkloads

Contributor


I don't think this field is used for anything today? It was added for this feature, but the feature never got implemented. We could also mark this field as deprecated and instead use workloadMigration.

Collaborator Author


I added a paragraph noting that we're removing that old field.

Comment on lines +55 to +76
// Maximum number of deployments to restart concurrently during migration.
// Defaults to 1.
// +kubebuilder:default=1
// +kubebuilder:validation:Minimum=1
BatchSize *int32 `json:"batchSize,omitempty"`

// Time to wait between deployment restart batches.
// Defaults to 30s.
// +kubebuilder:default="30s"
DelayBetweenBatches *metav1.Duration `json:"delayBetweenBatches,omitempty"`
Contributor


Will these fields be relevant for strategies other than batched? Should these be under their own struct that only holds fields for the batched strategy? Similar to how volume sources are defined. Maybe you can do something more clever with a discriminated union type based on the Strategy field.

Collaborator Author


Absolutely. Moved under batched
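For illustration, the nesting could look roughly like this (struct and field names are assumptions; the batch fields mirror the ones quoted above, and metav1 refers to k8s.io/apimachinery/pkg/apis/meta/v1):

type WorkloadMigrationConfig struct {
	// Selects the migration strategy; only Batched exists initially.
	Strategy WorkloadMigrationStrategy `json:"strategy,omitempty"`

	// Settings that only apply when Strategy is Batched.
	// +optional
	Batched *BatchedMigrationConfig `json:"batched,omitempty"`
}

type BatchedMigrationConfig struct {
	// Maximum number of deployments to restart concurrently during migration.
	// Defaults to 1.
	// +kubebuilder:default=1
	// +kubebuilder:validation:Minimum=1
	BatchSize *int32 `json:"batchSize,omitempty"`

	// Time to wait between deployment restart batches.
	// Defaults to 30s.
	// +kubebuilder:default="30s"
	DelayBetweenBatches *metav1.Duration `json:"delayBetweenBatches,omitempty"`
}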

Comment on lines 66 to 69
// The highest version that the Sail Operator will update to. Updates to versions later than MaximumVersion
// will not trigger workload migration.
// If unset, the operator will only trigger workload migration for new patch versions on the current minor version stream.
MaximumVersion *string `json:"maximumVersion,omitempty"`
Contributor


If you want the operator to upgrade your workloads when going from say 1.25 --> 1.26, you'd need to set spec.version = 1.26 and spec.updateStrategy.maximumVersion = 1.26? For each minor upgrade you'd need to keep setting spec.updateStrategy.maximumVersion?

Collaborator Author


Yes, effectively. Which is a good thing I believe, as minor version upgrades could break your config. Or you set it to something like 1.999.0 to always upgrade

Collaborator Author


I actually did change this as I think it was confusing. Now, if you don't set maxVersion, there simply is no maximum, which I think is more what you would expect as a user.
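To make those semantics concrete, here is a minimal sketch of the gating check under that behavior (the helper names are hypothetical, and real Istio versions may carry suffixes that need extra handling):

package main

import "fmt"

// parseVersion extracts major, minor and patch from a version like "1.26.0".
func parseVersion(v string) (major, minor, patch int, err error) {
	_, err = fmt.Sscanf(v, "%d.%d.%d", &major, &minor, &patch)
	return
}

// shouldMigrate reports whether updating to targetVersion should trigger
// workload migration. An empty maxVersion means there is no upper bound.
func shouldMigrate(targetVersion, maxVersion string) (bool, error) {
	if maxVersion == "" {
		return true, nil // no maximum configured
	}
	tMaj, tMin, tPat, err := parseVersion(targetVersion)
	if err != nil {
		return false, err
	}
	mMaj, mMin, mPat, err := parseVersion(maxVersion)
	if err != nil {
		return false, err
	}
	// Migrate only if target <= max, compared component by component.
	if tMaj != mMaj {
		return tMaj < mMaj, nil
	}
	if tMin != mMin {
		return tMin < mMin, nil
	}
	return tPat <= mPat, nil
}

func main() {
	ok, _ := shouldMigrate("1.26.0", "1.25.3")
	fmt.Println(ok) // false: 1.26.0 is later than the configured maximum
}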

Comment on lines +71 to +99
// Maximum time to wait for a deployment to become ready after restart.
// Defaults to 5m.
// +kubebuilder:default="5m"
ReadinessTimeout *metav1.Duration `json:"readinessTimeout,omitempty"`
Contributor


Will there be labels or annotations to control this per workload? e.g. sail.operator.io/readiness-timeout: 30m

Contributor


You mean overriding the readinessTimeout for specific pods?

Collaborator Author


Interesting idea, but maybe something we could add later if needed.

Contributor


+1. Let's keep it simple. We can consider it if there is a strong use case.

Collaborator Author


I added a whole section on annotations under Future Enhancements. I think once we have an implementation, this can really help make it production-ready

@dgn dgn force-pushed the sep-workload-migration branch from 1e6c512 to 266f40e on October 22, 2025 11:58
@dgn dgn force-pushed the sep-workload-migration branch 4 times, most recently from e1e8399 to e029cc3 on October 22, 2025 12:12
Collaborator

@skriss skriss left a comment


Had a few thoughts, mainly around the idea of supporting more of a canary upgrade/migration process. You could argue that this should be a separate strategy (canary instead of batched) with its own configuration. From my perspective it would probably be pretty appealing to users. In general though, this seems like a pretty compelling feature and a great differentiator for the operator!


### Failure Policy

The initial implementation will continue processing all batches even if individual workloads fail to migrate, reporting all failures at the end. Some users may want different behavior when migrations fail. We could implement a `failurePolicy` field that allows users to specify what the operator should do in case a workload does not become ready.
Collaborator


The initial implementation is somewhat at odds with a "canary upgrade" where you'd want to ensure a small subset of upgraded workloads are successful before proceeding, so IMO adding something like a failurePolicy will likely be pretty important for users, to be able to have a controlled upgrade / limit blast radius of any issues.

Comment on lines 591 to 593
Potential `failurePolicy` values:
* `ContinueOnError`: Continue migrating all workloads, report failures at end
* `FailFast`: Stop immediately on first failure
Collaborator


A configurable % of workloads that are allowed to fail before stopping the migration might be nice as it would give finer-grained control over success/failure conditions (i.e. "continue migration as long as no more than 5% of workloads failed upgrade").
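Combining the proposed enum with that suggestion, a rough API sketch might look like this (the field names, defaults, and the percentage field are assumptions from this thread, not part of the proposal yet):

type WorkloadMigrationFailurePolicy string

const (
	// ContinueOnError keeps migrating all workloads and reports failures at the end.
	ContinueOnError WorkloadMigrationFailurePolicy = "ContinueOnError"
	// FailFast stops the migration on the first workload that fails to become ready.
	FailFast WorkloadMigrationFailurePolicy = "FailFast"
)

type WorkloadMigrationConfig struct {
	// What to do when a workload does not become ready after migration.
	// Defaults to ContinueOnError.
	// +kubebuilder:default=ContinueOnError
	FailurePolicy WorkloadMigrationFailurePolicy `json:"failurePolicy,omitempty"`

	// Maximum percentage of workloads that may fail before the migration is
	// stopped. Only meaningful with ContinueOnError; unset means no limit.
	// +kubebuilder:validation:Minimum=0
	// +kubebuilder:validation:Maximum=100
	// +optional
	MaxFailurePercentage *int32 `json:"maxFailurePercentage,omitempty"`
}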

**Potential annotations:**
- `sailoperator.io/readiness-timeout: 10s`: Override the readiness timeout for specific workloads
- `sailoperator.io/skip-migration: true`: Exclude specific workloads from automatic migration
- `sailoperator.io/migration-batch: 3`: Assign workload to a specific migration priority (integer, lower values migrate first)
Collaborator


Something like this would be very helpful for supporting canary-style upgrades as it would allow the user to identify the initial set of workloads (i.e. less critical) to try to migrate. Otherwise, it's possible that critical workloads could be ~randomly selected for migration first which would not be ideal.

This adds an enhancement proposal on Automatic Workload Migration, which
helps users by automatically restarting their workloads when updating
their mesh. This feature will only be available in sidecar mode.

Signed-off-by: Daniel Grimm <[email protected]>
@dgn dgn force-pushed the sep-workload-migration branch from e029cc3 to df43c23 on October 23, 2025 09:26

type WorkloadMigrationStatus struct {
	// State represents the current state of the migration process.
	// +kubebuilder:validation:Enum=Idle;InProgress;Completed;Failed
Collaborator


What would be the difference between Idle and Completed?

Collaborator Author


maybe NotStarted is a better name
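A rough sketch of what the renamed state set could look like (the constant names are assumptions based on this thread):

type WorkloadMigrationState string

const (
	// MigrationNotStarted: no migration has been triggered for the current version yet.
	MigrationNotStarted WorkloadMigrationState = "NotStarted"
	// MigrationInProgress: workloads are currently being restarted in batches.
	MigrationInProgress WorkloadMigrationState = "InProgress"
	// MigrationCompleted: all workloads have been moved to the new control plane.
	MigrationCompleted WorkloadMigrationState = "Completed"
	// MigrationFailed: one or more workloads could not be migrated.
	MigrationFailed WorkloadMigrationState = "Failed"
)

type WorkloadMigrationStatus struct {
	// State represents the current state of the migration process.
	// +kubebuilder:validation:Enum=NotStarted;InProgress;Completed;Failed
	State WorkloadMigrationState `json:"state,omitempty"`
}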

// Name of the failed workload.
Name string `json:"name"`

// Kind of the failed workload (e.g., "Deployment").
Collaborator


Is this needed if we only migrate Deployments and nothing else?

Collaborator Author


Good call. We can leave it out for now and only add it once we support other kinds.
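For reference, the trimmed entry could then look roughly like this (the namespace and message fields are assumptions added only for illustration):

// FailedWorkload records a Deployment that did not finish migration.
type FailedWorkload struct {
	// Name of the failed workload.
	Name string `json:"name"`

	// Namespace of the failed workload.
	Namespace string `json:"namespace"`

	// Message describes why the migration of this workload failed.
	// +optional
	Message string `json:"message,omitempty"`
}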
