Commit 2acc219

ryanaoleary, Future-Outlier, and rueian authored
[RayService] Support Incremental Zero-Downtime Upgrades (#3166)
* Add incremental upgrade API changes to KubeRay
  Update go mod dependencies for gateway v1
  Add reconcile Gateway and HTTPRoute
  Add TargetCapacity and TrafficRoutedPercent to RayServiceStatus
  Add controller logic initial commit
  Add IncrementalUpgrade check to ShouldUpdate
  Update controller logic to reconcile incremental upgrade
  TrafficRoutedPercent should not set default value
  Remove test changes to TPU manifest
  Move helper function to utils
  Fix lint
  Fix field alignment
  Fix bad merge
  Fix CRDs and add validation test case
  Test create HTTPRoute and create Gateway
  Add reconcile tests for Gateway and HTTPRoute
  Fix lint
  Add tests for util functions and fix golangci-lint
  Add basic e2e test case
  Fix GetGatewayListeners logic and test
  Add gatewayv1 scheme to util runtime
  Check if IncrementalUpgrade is enabled before checking Gateway
  Fix reconcile logic for Gateway and HTTPRoute
  Add feature gate
  Always create Gateway and HTTPRoute for IncrementalUpgrade
  Fix target_capacity reconciliation logic
  Add additional unit tests
  Move e2e test and add another unit test
* Fix some tests and create Gateway for pending cluster
* Fix merge errors
* Manually sync rbac for gateway
* Fix bugs and e2e test
* Add Makefile command
* Run 'make sync'
* Run 'make generate'
* Fix comments
* Run 'make api-docs'
* Fix tests after merge conflicts
* Update ray-operator/controllers/ray/rayservice_controller.go
  Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
* Update ray-operator/controllers/ray/rayservice_controller.go
  Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
* Update ray-operator/controllers/ray/rayservice_controller.go
  Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
* Fix error return
* Add RayServiceIncrementalUpgrade feature gate option to helm
* Remove unnecessary perms
* Remove delete perm and run lint
* Fix helm roles
* add back required perms
* Update ray-operator/controllers/ray/utils/validation.go
  Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
* Update ray-operator/controllers/ray/utils/util.go
  Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
* Update ray-operator/controllers/ray/rayservice_controller.go
  Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
* Change controller to use two serve services during upgrade
* Remove Gateway and HTTPRoute API fields
* Fix port errors
* Fix comments and build issues
* fix helm-chart-verify-rbac
* Refactor tests and create HTTPRoute to be clearer
* Use time &now
* Update ray-operator/controllers/ray/rayservice_controller.go
  Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
* Add function comments
* Fix bad merge
* Add more comments
* Update ray-operator/controllers/ray/rayservice_controller.go
  Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
* Add Ray Serve hostname and serve port logic
* Update ray-operator/controllers/ray/rayservice_controller.go
  Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
* Update ray-operator/controllers/ray/common/service.go
  Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
* Update ray-operator/controllers/ray/common/service.go
  Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
* Update ray-operator/controllers/ray/rayservice_controller.go
  Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
* Update ray-operator/controllers/ray/rayservice_controller.go
  Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
* Update ray-operator/controllers/ray/rayservice_controller.go
  Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
* Update ray-operator/controllers/ray/rayservice_controller.go
  Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
* Fix dropped requests and old cluster config not being served
* Resolve readability comments and improve structure
* Refactor based on comments
* Update ray-operator/controllers/ray/common/service.go
  Co-authored-by: Rueian <[email protected]>
* Remove hostname from listener
* ensure pending cluster scales from 0 target_capacity
* Run make generate after rebase
* rename upgrade type
* Clean up utils and add more comments
* reconcileHTTPRoute should pass created object to calculate status
* lint
* Update ray-operator/controllers/ray/rayservice_controller.go
  Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
* Fix test after suggested fix

---------

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Co-authored-by: Rueian <[email protected]>
1 parent 2d52001 commit 2acc219

36 files changed, +3174 -110 lines changed

docs/reference/api.md

Lines changed: 20 additions & 0 deletions
@@ -55,6 +55,25 @@ _Appears in:_
 
 
 
+#### ClusterUpgradeOptions
+
+
+
+These options are currently only supported for the IncrementalUpgrade type.
+
+
+
+_Appears in:_
+- [RayServiceUpgradeStrategy](#rayserviceupgradestrategy)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `maxSurgePercent` _integer_ | The capacity of serve requests the upgraded cluster should scale to handle each interval.<br />Defaults to 100%. | 100 | |
+| `stepSizePercent` _integer_ | The percentage of traffic to switch to the upgraded RayCluster at a set interval after scaling by MaxSurgePercent. | | |
+| `intervalSeconds` _integer_ | The interval in seconds between transferring StepSize traffic from the old to new RayCluster. | | |
+| `gatewayClassName` _string_ | The name of the Gateway Class installed by the Kubernetes Cluster admin. | | |
+
+
 #### DeletionCondition
 
 
@@ -377,6 +396,7 @@ _Appears in:_
 | Field | Description | Default | Validation |
 | --- | --- | --- | --- |
 | `type` _[RayServiceUpgradeType](#rayserviceupgradetype)_ | Type represents the strategy used when upgrading the RayService. Currently supports `NewCluster` and `None`. | | |
+| `clusterUpgradeOptions` _[ClusterUpgradeOptions](#clusterupgradeoptions)_ | ClusterUpgradeOptions defines the behavior of a NewClusterWithIncrementalUpgrade type.<br />RayServiceIncrementalUpgrade feature gate must be enabled to set ClusterUpgradeOptions. | | |
 
 
 #### RayServiceUpgradeType

go.mod

Lines changed: 6 additions & 5 deletions
@@ -73,7 +73,7 @@ require (
 	github.com/liggitt/tabwriter v0.0.0-20181228230101-89fcab3d43de // indirect
 	github.com/mailru/easyjson v0.9.0 // indirect
 	github.com/mattn/go-colorable v0.1.13 // indirect
-	github.com/mattn/go-isatty v0.0.19 // indirect
+	github.com/mattn/go-isatty v0.0.20 // indirect
 	github.com/mitchellh/go-wordwrap v1.0.1 // indirect
 	github.com/moby/spdystream v0.5.0 // indirect
 	github.com/moby/term v0.5.0 // indirect
@@ -95,12 +95,12 @@ require (
 	go.uber.org/automaxprocs v1.6.0 // indirect
 	go.uber.org/multierr v1.11.0 // indirect
 	go.uber.org/zap v1.27.0 // indirect
-	golang.org/x/net v0.38.0 // indirect
+	golang.org/x/net v0.39.0 // indirect
 	golang.org/x/oauth2 v0.27.0 // indirect
-	golang.org/x/sync v0.12.0 // indirect
+	golang.org/x/sync v0.13.0 // indirect
 	golang.org/x/sys v0.32.0 // indirect
-	golang.org/x/term v0.30.0 // indirect
-	golang.org/x/text v0.23.0 // indirect
+	golang.org/x/term v0.31.0 // indirect
+	golang.org/x/text v0.24.0 // indirect
 	golang.org/x/time v0.10.0 // indirect
 	golang.org/x/tools v0.31.0 // indirect
 	gomodules.xyz/jsonpatch/v2 v2.4.0 // indirect
@@ -112,6 +112,7 @@ require (
 	k8s.io/component-base v0.33.1 // indirect
 	k8s.io/component-helpers v0.33.1 // indirect
 	k8s.io/kube-openapi v0.0.0-20250318190949-c8a335a9a2ff // indirect
+	sigs.k8s.io/gateway-api v1.3.0 // indirect
 	sigs.k8s.io/json v0.0.0-20241014173422-cfa47c3a1cc8 // indirect
 	sigs.k8s.io/kustomize/api v0.19.0 // indirect
 	sigs.k8s.io/kustomize/kyaml v0.19.0 // indirect

go.sum

Lines changed: 12 additions & 9 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

helm-chart/kuberay-operator/README.md

Lines changed: 2 additions & 0 deletions
@@ -174,6 +174,8 @@ spec:
 | featureGates[1].enabled | bool | `false` | |
 | featureGates[2].name | string | `"RayMultiHostIndexing"` | |
 | featureGates[2].enabled | bool | `false` | |
+| featureGates[3].name | string | `"RayServiceIncrementalUpgrade"` | |
+| featureGates[3].enabled | bool | `false` | |
 | metrics.enabled | bool | `true` | Whether KubeRay operator should emit control plane metrics. |
 | metrics.serviceMonitor.enabled | bool | `false` | Enable a prometheus ServiceMonitor |
 | metrics.serviceMonitor.interval | string | `"30s"` | Prometheus ServiceMonitor interval |

helm-chart/kuberay-operator/crds/ray.io_rayservices.yaml

Lines changed: 37 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

helm-chart/kuberay-operator/templates/_helpers.tpl

Lines changed: 11 additions & 0 deletions
@@ -222,6 +222,17 @@ rules:
   - patch
   - update
   - watch
+- apiGroups:
+  - gateway.networking.k8s.io
+  resources:
+  - gateways
+  - httproutes
+  verbs:
+  - create
+  - get
+  - list
+  - update
+  - watch
 - apiGroups:
   - networking.k8s.io
   resources:

helm-chart/kuberay-operator/values.yaml

Lines changed: 2 additions & 0 deletions
@@ -119,6 +119,8 @@ featureGates:
     enabled: false
   - name: RayMultiHostIndexing
     enabled: false
+  - name: RayServiceIncrementalUpgrade
+    enabled: false
 
 # Configurations for KubeRay operator metrics.
 metrics:
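
As a rough illustration of how a `featureGates` list like the one above could be turned into a single operator argument, the sketch below joins name and enabled pairs into the Kubernetes-style `Name=bool` form. The chart template that consumes these values is not part of this diff, so the `--feature-gates` flag name, the `FeatureGate` type, and the `FeatureGatesFlag` function are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"strings"
)

// FeatureGate mirrors one entry of the featureGates list in values.yaml.
// Hypothetical type for illustration.
type FeatureGate struct {
	Name    string
	Enabled bool
}

// FeatureGatesFlag renders the gates as a comma-separated Name=bool list,
// preserving the order of the input slice.
func FeatureGatesFlag(gates []FeatureGate) string {
	parts := make([]string, 0, len(gates))
	for _, g := range gates {
		parts = append(parts, fmt.Sprintf("%s=%t", g.Name, g.Enabled))
	}
	// Assumption: the operator accepts a Kubernetes-style feature gate flag.
	return "--feature-gates=" + strings.Join(parts, ",")
}

func main() {
	gates := []FeatureGate{
		{Name: "RayMultiHostIndexing", Enabled: false},
		{Name: "RayServiceIncrementalUpgrade", Enabled: true},
	}
	fmt.Println(FeatureGatesFlag(gates))
}
```

Because the gate defaults to `false`, a RayService that sets `clusterUpgradeOptions` would only be accepted after the gate is switched on, as the API reference in this commit states.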

ray-operator/Makefile

Lines changed: 9 additions & 1 deletion
@@ -75,8 +75,16 @@ test-e2e-autoscaler: WHAT ?= ./test/e2eautoscaler
 test-e2e-autoscaler: manifests fmt vet ## Run e2e autoscaler tests.
 	go test -timeout 30m -v $(WHAT)
 
+test-e2e-rayservice: WHAT ?= ./test/e2erayservice
+test-e2e-rayservice: manifests fmt vet ## Run e2e RayService tests.
+	go test -timeout 30m -v $(WHAT)
+
 test-e2e-upgrade: WHAT ?= ./test/e2eupgrade
-test-e2e-upgrade: manifests fmt vet ## Run e2e tests.
+test-e2e-upgrade: manifests fmt vet ## Run e2e operator upgrade tests.
+	go test -timeout 30m -v $(WHAT)
+
+test-e2e-incremental-upgrade: WHAT ?= ./test/e2eincrementalupgrade
+test-e2e-incremental-upgrade: manifests fmt vet ## Run e2e RayService incremental upgrade tests.
 	go test -timeout 30m -v $(WHAT)
 
 test-e2e-rayjob-submitter: WHAT ?= ./test/e2erayjobsubmitter

ray-operator/apis/ray/v1/rayservice_types.go

Lines changed: 35 additions & 2 deletions
@@ -22,6 +22,9 @@ const (
 type RayServiceUpgradeType string
 
 const (
+	// During upgrade, NewClusterWithIncrementalUpgrade strategy will create an upgraded cluster to gradually scale
+	// and migrate traffic to using Gateway API.
+	NewClusterWithIncrementalUpgrade RayServiceUpgradeType = "NewClusterWithIncrementalUpgrade"
 	// During upgrade, NewCluster strategy will create new upgraded cluster and switch to it when it becomes ready
 	NewCluster RayServiceUpgradeType = "NewCluster"
 	// No new cluster will be created while the strategy is set to None
@@ -57,10 +60,27 @@ var DeploymentStatusEnum = struct {
 	UNHEALTHY: "UNHEALTHY",
 }
 
+// These options are currently only supported for the IncrementalUpgrade type.
+type ClusterUpgradeOptions struct {
+	// The capacity of serve requests the upgraded cluster should scale to handle each interval.
+	// Defaults to 100%.
+	// +kubebuilder:default:=100
+	MaxSurgePercent *int32 `json:"maxSurgePercent,omitempty"`
+	// The percentage of traffic to switch to the upgraded RayCluster at a set interval after scaling by MaxSurgePercent.
+	StepSizePercent *int32 `json:"stepSizePercent"`
+	// The interval in seconds between transferring StepSize traffic from the old to new RayCluster.
+	IntervalSeconds *int32 `json:"intervalSeconds"`
+	// The name of the Gateway Class installed by the Kubernetes Cluster admin.
+	GatewayClassName string `json:"gatewayClassName"`
+}
+
 type RayServiceUpgradeStrategy struct {
 	// Type represents the strategy used when upgrading the RayService. Currently supports `NewCluster` and `None`.
 	// +optional
 	Type *RayServiceUpgradeType `json:"type,omitempty"`
+	// ClusterUpgradeOptions defines the behavior of a NewClusterWithIncrementalUpgrade type.
+	// RayServiceIncrementalUpgrade feature gate must be enabled to set ClusterUpgradeOptions.
+	ClusterUpgradeOptions *ClusterUpgradeOptions `json:"clusterUpgradeOptions,omitempty"`
 }
 
 // RayServiceSpec defines the desired state of RayService
@@ -129,6 +149,20 @@ type RayServiceStatus struct {
 	// Important: Run "make" to regenerate code after modifying this file
 	// +optional
 	Applications map[string]AppStatus `json:"applicationStatuses,omitempty"`
+	// TargetCapacity is the `target_capacity` percentage for all Serve replicas
+	// across the cluster for this RayService. The `num_replicas`, `min_replicas`, `max_replicas`,
+	// and `initial_replicas` for each deployment will be scaled by this percentage."
+	// +optional
+	TargetCapacity *int32 `json:"targetCapacity,omitempty"`
+	// TrafficRoutedPercent is the percentage of traffic that is routed to the Serve service
+	// for this RayService. TrafficRoutedPercent is updated to reflect the weight on the HTTPRoute
+	// created for this RayService during incremental upgrades to a new cluster.
+	// +optional
+	TrafficRoutedPercent *int32 `json:"trafficRoutedPercent,omitempty"`
+	// LastTrafficMigratedTime is the last time that TrafficRoutedPercent was updated to a new value
+	// for this RayService.
+	// +optional
+	LastTrafficMigratedTime *metav1.Time `json:"lastTrafficMigratedTime,omitempty"`
 	// +optional
 	RayClusterName string `json:"rayClusterName,omitempty"`
 	// +optional
@@ -184,8 +218,7 @@ const (
 type RayService struct {
 	metav1.TypeMeta `json:",inline"`
 	metav1.ObjectMeta `json:"metadata,omitempty"`
-
-	Spec RayServiceSpec `json:"spec,omitempty"`
+	Spec RayServiceSpec `json:"spec,omitempty"`
 	// +optional
 	Status RayServiceStatuses `json:"status,omitempty"`
 }

ray-operator/apis/ray/v1/zz_generated.deepcopy.go

Lines changed: 49 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.
