
Conversation

@dgn
Collaborator

@dgn dgn commented May 26, 2025

First draft

@dgn dgn requested a review from a team as a code owner May 26, 2025 13:11
@dgn dgn force-pushed the sep-workload-migration branch from 6f842cc to 1f95a6b on May 26, 2025 13:13
@dgn dgn changed the title from SEP74 Automatic Workload Migration to SEP75 Automatic Workload Migration on May 26, 2025
@codecov

codecov bot commented May 26, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.48%. Comparing base (eb6389f) to head (df43c23).
⚠️ Report is 196 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #850      +/-   ##
==========================================
+ Coverage   76.70%   77.48%   +0.78%     
==========================================
  Files          44       44              
  Lines        2640     2834     +194     
==========================================
+ Hits         2025     2196     +171     
  Misses        529      529              
- Partials       86      109      +23     


@dgn dgn force-pushed the sep-workload-migration branch from 1f95a6b to 7a66e00 on May 26, 2025 13:25
// The highest version that the Sail Operator will update to. Updates to versions later than MaximumVersion
// will not trigger workload migration.
// If unset, the operator will only trigger workload migration for new patch versions on the current minor version stream.
MaximumVersion *string `json:"maximumVersion,omitempty"`
Collaborator

@MaxBab MaxBab May 26, 2025


Q: Does this mean that the maximum version could be set to a version that is several versions ahead of the one currently running, not just the next one?
Or, looking at it the other way, is it meant to pin a specific version that I don't want to go past?

Collaborator


What would be a use case for this? To avoid automatic workload migration when upgrading to a new minor version?

Collaborator Author


Yes - basically, by default it will only update patch versions, unless you set a higher MaximumVersion (now that I'm reading it again, maybe MaxVersion is better).

Collaborator Author


I added a bunch of explanations for this field (and also renamed it maxVersion)


- **Phase 1**: Namespace label updates (no restarts required yet)
- **Phase 2**: Deployment restarts in configurable batches

4. **Validation**: Each batch waits for readiness before proceeding to the next batch
Collaborator


If the validation fails, does the whole migration fail, or does it wait for users to manually fix the problem and then continue with the other workloads? And if it fails, can the migration be triggered again manually once the problematic workload has been migrated by hand?

Collaborator Author


Undefined right now. I'm still not sure how to do it - as you say, manual intervention is probably required, but it's kind of tricky with CRDs. We don't want to force the user to "make some unrelated change to your Istio resource to restart the process". So maybe the solution will be to roll back if something goes wrong?

Collaborator


I'm not sure it makes sense to roll back already successfully migrated workloads. If the validation fails, I think the safest option is to stop the migration at that point and notify users to finish the migration manually. Question is if we want to try to migrate all workloads and report at the end which workloads failed to be migrated or to fail fast after first failure.

Contributor


Question is if we want to try to migrate all workloads and report at the end which workloads failed to be migrated or to fail fast after first failure.

Users will likely want to control this. Some kind of FailurePolicy?

Contributor


How does the operator "Report failure"? Will the Istio status be updated? Will it show all the workloads that failed?

Collaborator Author


Good point, I'll add a section about status. Yes, the idea would be that we have a status condition for this. I guess it would suffice if we list the workloads that failed the migration?

@dgn
Collaborator Author

dgn commented May 27, 2025

Updated the doc to use an enum (strategy) instead of a bool (enabled) for enablement. Not sure I like the stuttering (updateStrategy.workloadMigration.strategy is not great), but I couldn't come up with a better term right now.

@dgn dgn force-pushed the sep-workload-migration branch from 7a66e00 to 1e6c512 on May 28, 2025 08:41
Comment on lines +50 to +58
// Defines how workloads should be moved from one control plane instance to another automatically.
// Defaults to "". If not set, workload migration is disabled.
// +kubebuilder:default=""
Strategy WorkloadMigrationStrategy `json:"strategy,omitempty"`
Contributor


The other fields become irrelevant when you set this to "" if "" means workload migration is disabled. What do you think of making the WorkloadMigration field in IstioUpdateStrategy a pointer, so that defining it enables workload migration? e.g.

updateStrategy:
  workloadMigration: {}

Then we can default the Strategy field to something like batched.
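As a rough illustration of that pointer-based shape (the type names, field names, and the Batched default here are assumptions from this thread, not the final API):

// Sketch only: WorkloadMigration is nil by default; setting it (even to an
// empty object) enables automatic workload migration.
type IstioUpdateStrategy struct {
	// ...existing fields...

	// Configures automatic workload migration. Migration is enabled
	// whenever this field is set.
	// +optional
	WorkloadMigration *WorkloadMigrationConfig `json:"workloadMigration,omitempty"`
}

type WorkloadMigrationConfig struct {
	// Defines how workloads are moved to the new control plane instance.
	// Defaults to Batched when WorkloadMigration is set.
	// +kubebuilder:default=Batched
	Strategy WorkloadMigrationStrategy `json:"strategy,omitempty"`
}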


#### IstioUpdateStrategy Enhancement

The existing `IstioUpdateStrategy` type is extended with a new `WorkloadMigration` field:
Contributor


What happens to the existing updateWorkloads field?

// Defines whether the workloads should be moved from one control plane instance to another
// automatically. If updateWorkloads is true, the operator moves the workloads from the old
// control plane instance to the new one after the new control plane is ready.
// If updateWorkloads is false, the user must move the workloads manually by updating the
// istio.io/rev labels on the namespace and/or the pods.
// Defaults to false.
// +operator-sdk:csv:customresourcedefinitions:type=spec,order=3,displayName="Update Workloads Automatically",xDescriptors={"urn:alm:descriptor:com.tectonic.ui:booleanSwitch"}
UpdateWorkloads bool `json:"updateWorkloads,omitempty"`

Collaborator Author


oh no, I didn't realize we had this already 🙁 I guess we'll need to use it then. And if we use UpdateWorkloads in one place, we can't really use WorkloadMigration elsewhere

Collaborator Author


I guess we could move the settings into updateStrategy.workloads then. It would fit with updateStrategy.updateWorkloads

Contributor


I don't think this field is used for anything today? It was added for this feature, but the feature never got implemented. We could also mark this field as deprecated and instead use workloadMigration.

Collaborator Author


I added a paragraph noting that we're removing that old field.

Comment on lines +55 to +76
// Maximum number of deployments to restart concurrently during migration.
// Defaults to 1.
// +kubebuilder:default=1
// +kubebuilder:validation:Minimum=1
BatchSize *int32 `json:"batchSize,omitempty"`

// Time to wait between deployment restart batches.
// Defaults to 30s.
// +kubebuilder:default="30s"
DelayBetweenBatches *metav1.Duration `json:"delayBetweenBatches,omitempty"`
Contributor


Will these fields be relevant for strategies other than batched? Should these be under their own struct that only holds fields for the batched strategy? Similar to how volume sources are defined. Maybe you can do something more clever with a discriminated union type based on the Strategy field.

Collaborator Author


Absolutely. Moved under batched
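For illustration, the nesting could look roughly like this (struct and field names are assumptions; the batch fields mirror the ones quoted above, and metav1 refers to k8s.io/apimachinery/pkg/apis/meta/v1):

type WorkloadMigrationConfig struct {
	// Selects the migration strategy; only Batched exists initially.
	Strategy WorkloadMigrationStrategy `json:"strategy,omitempty"`

	// Settings that only apply when Strategy is Batched.
	// +optional
	Batched *BatchedMigrationConfig `json:"batched,omitempty"`
}

type BatchedMigrationConfig struct {
	// Maximum number of deployments to restart concurrently during migration.
	// Defaults to 1.
	// +kubebuilder:default=1
	// +kubebuilder:validation:Minimum=1
	BatchSize *int32 `json:"batchSize,omitempty"`

	// Time to wait between deployment restart batches.
	// Defaults to 30s.
	// +kubebuilder:default="30s"
	DelayBetweenBatches *metav1.Duration `json:"delayBetweenBatches,omitempty"`
}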

Comment on lines 66 to 69
// The highest version that the Sail Operator will update to. Updates to versions later than MaximumVersion
// will not trigger workload migration.
// If unset, the operator will only trigger workload migration for new patch versions on the current minor version stream.
MaximumVersion *string `json:"maximumVersion,omitempty"`
Contributor


If you want the operator to upgrade your workloads when going from say 1.25 --> 1.26, you'd need to set spec.version = 1.26 and spec.updateStrategy.maximumVersion = 1.26? For each minor upgrade you'd need to keep setting spec.updateStrategy.maximumVersion?

Collaborator Author


Yes, effectively. Which is a good thing I believe, as minor version upgrades could break your config. Or you set it to something like 1.999.0 to always upgrade

Collaborator Author


I actually did change this as I think it was confusing. Now, if you don't set maxVersion, there simply is no maximum, which I think is more what you would expect as a user.
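To make those semantics concrete, here is a minimal sketch of the gating check under that behavior (the helper names are hypothetical, and real Istio versions may carry suffixes that need extra handling):

package main

import "fmt"

// parseVersion extracts major, minor and patch from a version like "1.26.0".
func parseVersion(v string) (major, minor, patch int, err error) {
	_, err = fmt.Sscanf(v, "%d.%d.%d", &major, &minor, &patch)
	return
}

// shouldMigrate reports whether updating to targetVersion should trigger
// workload migration. An empty maxVersion means there is no upper bound.
func shouldMigrate(targetVersion, maxVersion string) (bool, error) {
	if maxVersion == "" {
		return true, nil // no maximum configured
	}
	tMaj, tMin, tPat, err := parseVersion(targetVersion)
	if err != nil {
		return false, err
	}
	mMaj, mMin, mPat, err := parseVersion(maxVersion)
	if err != nil {
		return false, err
	}
	// Migrate only if target <= max, compared component by component.
	if tMaj != mMaj {
		return tMaj < mMaj, nil
	}
	if tMin != mMin {
		return tMin < mMin, nil
	}
	return tPat <= mPat, nil
}

func main() {
	ok, _ := shouldMigrate("1.26.0", "1.25.3")
	fmt.Println(ok) // false: 1.26.0 is later than the configured maximum
}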

Comment on lines +71 to +99
// Maximum time to wait for a deployment to become ready after restart.
// Defaults to 5m.
// +kubebuilder:default="5m"
ReadinessTimeout *metav1.Duration `json:"readinessTimeout,omitempty"`
Contributor


Will there be labels or annotations to control this per workload? e.g. sail.operator.io/readiness-timeout: 30m

Contributor


You mean overriding the readinessTimeout for specific pods?

Collaborator Author


Interesting idea, but maybe something we could add later if needed.

Contributor


+1. Let's keep it simple. We can consider it if there is a strong use case.

Collaborator Author


I added a whole section on annotations under Future Enhancements. I think once we have an implementation, this can really help make it production-ready

@dgn dgn force-pushed the sep-workload-migration branch from 1e6c512 to 266f40e on October 22, 2025 11:58
@dgn dgn force-pushed the sep-workload-migration branch 4 times, most recently from e1e8399 to e029cc3 on October 22, 2025 12:12
Collaborator

@skriss skriss left a comment


Had a few thoughts, mainly around the idea of supporting more of a canary upgrade/migration process. You could argue that this should be a separate strategy (canary instead of batched) with its own configuration. From my perspective it would probably be pretty appealing to users. In general though, this seems like a pretty compelling feature and a great differentiator for the operator!


### Failure Policy

The initial implementation will continue processing all batches even if individual workloads fail to migrate, reporting all failures at the end. Some users may want different behavior when migrations fail. We could implement a `failurePolicy` field that allows users to specify what the operator should do in case a workload does not become ready.
Collaborator


The initial implementation is somewhat at odds with a "canary upgrade" where you'd want to ensure a small subset of upgraded workloads are successful before proceeding, so IMO adding something like a failurePolicy will likely be pretty important for users, to be able to have a controlled upgrade / limit blast radius of any issues.

Comment on lines 591 to 593
Potential `failurePolicy` values:
* `ContinueOnError`: Continue migrating all workloads, report failures at end
* `FailFast`: Stop immediately on first failure
Collaborator


A configurable % of workloads that are allowed to fail before stopping the migration might be nice as it would give finer-grained control over success/failure conditions (i.e. "continue migration as long as no more than 5% of workloads failed upgrade").
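Combining the proposed enum with that suggestion, a rough API sketch might look like this (the field names, defaults, and the percentage field are assumptions from this thread, not part of the proposal yet):

type WorkloadMigrationFailurePolicy string

const (
	// ContinueOnError keeps migrating all workloads and reports failures at the end.
	ContinueOnError WorkloadMigrationFailurePolicy = "ContinueOnError"
	// FailFast stops the migration on the first workload that fails to become ready.
	FailFast WorkloadMigrationFailurePolicy = "FailFast"
)

type WorkloadMigrationConfig struct {
	// What to do when a workload does not become ready after migration.
	// Defaults to ContinueOnError.
	// +kubebuilder:default=ContinueOnError
	FailurePolicy WorkloadMigrationFailurePolicy `json:"failurePolicy,omitempty"`

	// Maximum percentage of workloads that may fail before the migration is
	// stopped. Only meaningful with ContinueOnError; unset means no limit.
	// +kubebuilder:validation:Minimum=0
	// +kubebuilder:validation:Maximum=100
	// +optional
	MaxFailurePercentage *int32 `json:"maxFailurePercentage,omitempty"`
}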

**Potential annotations:**
- `sailoperator.io/readiness-timeout: 10s`: Override the readiness timeout for specific workloads
- `sailoperator.io/skip-migration: true`: Exclude specific workloads from automatic migration
- `sailoperator.io/migration-batch: 3`: Assign workload to a specific migration priority (integer, lower values migrate first)
Collaborator


Something like this would be very helpful for supporting canary-style upgrades as it would allow the user to identify the initial set of workloads (i.e. less critical) to try to migrate. Otherwise, it's possible that critical workloads could be ~randomly selected for migration first which would not be ideal.

This adds an enhancement proposal on Automatic Workload Migration, which
helps users by automatically restarting their workloads when updating
their mesh. This feature will only be available in sidecar mode.

Signed-off-by: Daniel Grimm <[email protected]>
@dgn dgn force-pushed the sep-workload-migration branch from e029cc3 to df43c23 on October 23, 2025 09:26

type WorkloadMigrationStatus struct {
	// State represents the current state of the migration process.
	// +kubebuilder:validation:Enum=Idle;InProgress;Completed;Failed
Collaborator


What would be the difference between Idle and Completed?

Collaborator Author


maybe NotStarted is a better name
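A rough sketch of what the renamed state set could look like (the constant names are assumptions based on this thread):

type WorkloadMigrationState string

const (
	// MigrationNotStarted: no migration has been triggered for the current version yet.
	MigrationNotStarted WorkloadMigrationState = "NotStarted"
	// MigrationInProgress: workloads are currently being restarted in batches.
	MigrationInProgress WorkloadMigrationState = "InProgress"
	// MigrationCompleted: all workloads have been moved to the new control plane.
	MigrationCompleted WorkloadMigrationState = "Completed"
	// MigrationFailed: one or more workloads could not be migrated.
	MigrationFailed WorkloadMigrationState = "Failed"
)

type WorkloadMigrationStatus struct {
	// State represents the current state of the migration process.
	// +kubebuilder:validation:Enum=NotStarted;InProgress;Completed;Failed
	State WorkloadMigrationState `json:"state,omitempty"`
}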

// Name of the failed workload.
Name string `json:"name"`

// Kind of the failed workload (e.g., "Deployment").
Collaborator


Is this needed if we only migrate Deployments and nothing else?

Collaborator Author


Good call. We can leave it out for now and only add it once we support other kinds.
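For reference, the trimmed entry could then look roughly like this (the namespace and message fields are assumptions added only for illustration):

// FailedWorkload records a Deployment that did not finish migration.
type FailedWorkload struct {
	// Name of the failed workload.
	Name string `json:"name"`

	// Namespace of the failed workload.
	Namespace string `json:"namespace"`

	// Message describes why the migration of this workload failed.
	// +optional
	Message string `json:"message,omitempty"`
}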
