
Conversation

@ryanaoleary (Contributor) commented on Oct 29, 2025

Description

This PR adds a guide for the new zero-downtime incremental upgrade feature in KubeRay v1.5. This feature was implemented in this PR: ray-project/kuberay#3166.

Related issues

ray-project/kuberay#3209

Docs link

https://anyscale-ray--58293.com.readthedocs.build/en/58293/serve/advanced-guides/incremental-upgrade.html#rayservice-zero-downtime-incremental-upgrades

@ryanaoleary requested review from a team as code owners on October 29, 2025, 22:28
@ryanaoleary (Contributor, Author) commented:

cc: @Future-Outlier @rueian @andrewsykim and @angelinalg @dstrodtman for review from the docs team.

@gemini-code-assist (bot) left a comment:

Code Review

This pull request adds a new, comprehensive guide for the RayService Zero-Downtime Incremental Upgrades feature. The documentation is well-written and covers the prerequisites, mechanics, configuration, monitoring, and API of the new feature. My review includes a critical fix for a command with an incorrect version number and a suggestion to improve grammatical correctness and clarity.

ray-gardener (bot) added the serve, core, and community-contribution labels on Oct 30, 2025.
Future-Outlier self-assigned this on Oct 30, 2025.
@Future-Outlier (Member) left a comment:

Hi, @ryanaoleary

Can we use kind as an example in the doc, and add an [optional] marker for steps that only need to be done on kind?
IMO,

  1. this will be great for developers like us,
  2. it's also great for engineers who need to do a POC at their company, and
  3. we can reduce the maintenance burden in the future.

This is my successful script to reproduce:

  1. Create the kind cluster (1.29.0 is needed because of Istio's version requirements):
kind create cluster --image=kindest/node:v1.29.0
  2. Install the Gateway API CRDs:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml
  3. [optional] Install MetalLB:
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.7/config/manifests/metallb-native.yaml
  4. [optional] Configure a MetalLB address pool:
echo "apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: kind-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.8.200-192.168.8.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system
spec:
  ipAddressPools:
  - kind-pool" | kubectl apply -f -
  5. Create the Istio GatewayClass:
echo "apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: istio
spec:
  controllerName: istio.io/gateway-controller" | kubectl apply -f -
  6. istioctl install --set profile=demo -y
  7. Install the KubeRay operator + CRDs.
  8. Apply a RayService CR (old cluster).
  9. Apply a RayService CR (new cluster); a minimal sketch follows below.
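
A minimal RayService sketch for steps 8 and 9 might look like the following (a hedged sketch only: the metadata name, image tag, and Serve app are placeholders; the `upgradeStrategy` fields follow the API reference quoted later in this thread, and `gatewayClassName: istio` matches the GatewayClass created in step 5):

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: example-rayservice            # placeholder name
spec:
  upgradeStrategy:
    type: NewClusterWithIncrementalUpgrade
    clusterUpgradeOptions:
      gatewayClassName: istio         # GatewayClass created in step 5
      maxSurgePercent: 20             # grow the pending cluster's capacity in 20% steps
      stepSizePercent: 5              # shift 5% of traffic per interval
      intervalSeconds: 30             # wait 30 seconds between traffic shifts
  serveConfigV2: |
    applications:
      - name: example_app
        import_path: example.module:app   # placeholder import path
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.46.0   # placeholder image tag
```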

@Future-Outlier (Member) left a comment:

Is it possible to add a section that teaches users how to calculate the minimum resources needed when using this feature?
For example, this is my comment:
ray-project/kuberay#3209 (comment)

Comment on lines 166 to 175
### 5. Rollback Support

To roll back a failing or poorly performing upgrade, simply **update the `RayService` manifest back to the original configuration** (e.g., change the `image` back to the old tag).

KubeRay's controller will detect that the "goal state" now matches the *active* (old) cluster. It will reverse the process:
1. Scale the active cluster's `target_capacity` back to 100%.
2. Shift all traffic back to the active cluster.
3. Scale down and terminate the *pending* (new) cluster.

---
Member:

Is this supported now, or will it be supported in the future?
I guess this is related to this PR, right?

ray-project/kuberay#4109

@ryanaoleary (Contributor, Author):

Removed in 809c76e. It's not currently supported, but I'll add documentation for it when I fix ray-project/kuberay#4109 and can get it merged.

Comment on lines 4 to 8
This guide details how to configure and use the `NewClusterWithIncrementalUpgrade` strategy for a `RayService` with KubeRay. This feature was proposed in a [Ray Enhancement Proposal (REP)](https://github.com/ray-project/enhancements/blob/main/reps/2024-12-4-ray-service-incr-upgrade.md) and implemented with alpha support in KubeRay v1.5.0. If you're unfamiliar with RayService and KubeRay, see the [RayService Quickstart](https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/rayservice-quick-start.html).

In previous versions of KubeRay, zero-downtime upgrades were supported only through the `NewCluster` strategy. That strategy scales up a pending RayCluster with capacity equal to the active cluster, waits until the updated Serve applications are healthy, and then switches traffic to the new RayCluster. While reliable, it requires provisioning 200% of the original cluster's compute resources, which can be prohibitive when dealing with expensive accelerator resources.

The `NewClusterWithIncrementalUpgrade` strategy is designed for large-scale deployments, such as LLM serving, where duplicating resources for a standard blue/green deployment isn't feasible due to resource constraints. Rather than creating a new `RayCluster` at 100% capacity, this strategy creates a new cluster and gradually scales up its capacity while simultaneously shifting user traffic from the old cluster to the new one. This gradual traffic migration lets users safely scale their updated RayService while the old cluster autoscales down, saving expensive compute resources and giving fine-grained control over the pace of the upgrade. The process relies on the Kubernetes Gateway API for fine-grained traffic splitting.
Member:

Can we add this sentence on top of this paragraph?

> This feature minimizes resource usage during RayService CR upgrades while maintaining service availability. Below we explain the design and usage.

@ryanaoleary (Contributor, Author):

Done in 809c76e.

Comment on lines +177 to +207
## API Overview (Reference)

This section details the new and updated fields in the `RayService` CRD.

### `RayService.spec.upgradeStrategy`

| Field | Type | Description | Required | Default |
| :--- | :--- | :--- | :--- | :--- |
| `type` | `string` | The strategy to use for upgrades. Can be `NewCluster`, `None`, or `NewClusterWithIncrementalUpgrade`. | No | `NewCluster` |
| `clusterUpgradeOptions` | `object` | Container for incremental upgrade settings. **Required if `type` is `NewClusterWithIncrementalUpgrade`.** The `RayServiceIncrementalUpgrade` feature gate must be enabled. | No | `nil` |

### `RayService.spec.upgradeStrategy.clusterUpgradeOptions`

This block is required *only* if `type` is set to `NewClusterWithIncrementalUpgrade`.

| Field | Type | Description | Required | Default |
| :--- | :--- | :--- | :--- | :--- |
| `maxSurgePercent` | `int32` | The percentage of *capacity* (Serve replicas) to add to the new cluster in each scaling step. For example, a value of `20` means the new cluster's `target_capacity` will increase in 20% increments (0% -> 20% -> 40%...). Must be between 0 and 100. | No | `100` |
| `stepSizePercent` | `int32` | The percentage of *traffic* to shift from the old to the new cluster during each interval. Must be between 0 and 100. | **Yes** | N/A |
| `intervalSeconds` | `int32` | The time in seconds to wait between shifting traffic by `stepSizePercent`. | **Yes** | N/A |
| `gatewayClassName` | `string` | The `metadata.name` of the `GatewayClass` resource KubeRay should use to create `Gateway` and `HTTPRoute` objects. | **Yes** | N/A |

### `RayService.status.activeServiceStatus` & `RayService.status.pendingServiceStatus`

Three new fields are added to both the `activeServiceStatus` and `pendingServiceStatus` blocks to provide visibility into the upgrade process.

| Field | Type | Description |
| :--- | :--- | :--- |
| `targetCapacity` | `int32` | The target percentage of Serve replicas this cluster is *configured* to handle (from 0 to 100). This is controlled by KubeRay based on `maxSurgePercent`. |
| `trafficRoutedPercent` | `int32` | The *actual* percentage of traffic (from 0 to 100) currently being routed to this cluster's endpoint. This is controlled by KubeRay during an upgrade based on `stepSizePercent` and `intervalSeconds`. |
| `lastTrafficMigratedTime` | `metav1.Time` | A timestamp indicating the last time `trafficRoutedPercent` was updated. |
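
For example, partway through an upgrade, `kubectl get rayservice <name> -o yaml` might report a status similar to this (a hedged sketch; the values are hypothetical and depend on your configuration):

```yaml
status:
  activeServiceStatus:
    targetCapacity: 100              # active cluster still configured for full capacity
    trafficRoutedPercent: 85         # traffic remaining on the active cluster
    lastTrafficMigratedTime: "2025-10-29T22:28:00Z"
  pendingServiceStatus:
    targetCapacity: 20               # pending cluster scaled to its first 20% step
    trafficRoutedPercent: 15         # traffic already shifted to the pending cluster
    lastTrafficMigratedTime: "2025-10-29T22:28:00Z"
```
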
Member:

I'm wondering whether we should delete this?
cc @rueian @andrewsykim for a decision

@ryanaoleary (Contributor, Author):

I don't have much of a preference. We could instead link to https://ray-project.github.io/kuberay/reference/api/, but that page doesn't seem to include the RayService status for some reason, so I thought it might be useful to describe the new API here.


Understanding the lifecycle of an incremental upgrade helps in monitoring and configuration.

1. **Trigger:** You trigger an upgrade by updating the `RayService` spec, such as changing the container `image` or updating the `serveConfigV2`.
Member:

In my experience, changing `serveConfigV2` will not trigger an incremental upgrade, but updating the image will.

@ryanaoleary (Contributor, Author):

Fixed in 809c76e: changes under `rayClusterConfig`, except for `replicas`, trigger the upgrade.
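
For example, changing the container image under `rayClusterConfig` and re-applying the manifest starts the incremental upgrade (a hedged sketch; the field path is abbreviated and the image tags and filename are placeholders):

```yaml
spec:
  rayClusterConfig:
    headGroupSpec:
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.47.0   # changed from rayproject/ray:2.46.0; triggers the upgrade
```

Re-applying with `kubectl apply -f rayservice.yaml` then kicks off the upgrade, while edits to `replicas` alone do not.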

* Active `target_capacity`: 100%
* Pending `target_capacity`: 0% $\rightarrow$ **20%**
* **Total Capacity: 120%**
* The Ray Autoscaler begins provisioning pods for the pending cluster to handle 20% of the target load.
Member:

This is my understanding, is it correct?
If possible, I think it's better to let users know there are two autoscalers in this upgrade process: the Ray Serve autoscaler and the Ray autoscaler.

  1. The Ray Serve autoscaler will update `target_capacity`, and the Serve application deployments' `num_replicas` will change.
  2. Based on 1, the Ray autoscaler will scale up or scale down if needed.

@ryanaoleary (Contributor, Author):

Added a sentence detailing the interaction between the two autoscalers in 809c76e.

  1. If the Ray Serve autoscaler is enabled (it isn't required, so this won't be the case for all users), `num_replicas` scales up from `min_replicas` when `target_capacity` is updated; this is what causes the drop in RPS.
  2. Based on the new value of `num_replicas` from step 1, the Ray autoscaler considers the new resource request and scales pods accordingly to schedule the Serve replicas on (see the worked sketch below).
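
As a rough worked example of how the two autoscalers interact (assuming Serve scales replica counts roughly in proportion to `target_capacity`; the deployment and numbers below are hypothetical):

```yaml
# Hypothetical Serve deployment from serveConfigV2:
deployments:
  - name: ExampleDeployment
    num_replicas: 10
# With maxSurgePercent: 20, the pending cluster's target_capacity steps
# 0% -> 20% -> 40% -> ... At target_capacity 20%, Serve aims for roughly
# ceil(10 * 20 / 100) = 2 replicas on the pending cluster, and the Ray
# autoscaler then provisions enough worker pods to place those replicas.
```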

@Future-Outlier (Member) left a comment:

Overall LGTM, thank you!!
@ryanaoleary, you execute really, really fast. I'm really grateful to work with you. Thank you for all the hard work.

Comment on lines 68 to 71
* KubeRay waits for the pending cluster's new pods to be ready.
* Once ready, it begins to *gradually* shift traffic. Every `intervalSeconds`, it updates the `HTTPRoute` weights, moving `stepSizePercent` (5%) of traffic from the active to the pending cluster.
* This continues until the *actual* traffic (`trafficRoutedPercent`) "catches up" to the *pending* cluster's `target_capacity` (20% in this example).
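
For intuition, the weighted routing described in the quoted lines above might look roughly like the following `HTTPRoute` (a hedged sketch: the route, Gateway, and Service names are placeholders for the objects KubeRay manages, and 8000 is assumed to be the Serve port):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-rayservice-route          # placeholder name
spec:
  parentRefs:
    - name: example-rayservice-gateway    # placeholder Gateway created by KubeRay
  rules:
    - backendRefs:
        - name: active-cluster-serve-svc  # placeholder: active cluster's serve service
          port: 8000
          weight: 95                      # traffic staying on the active cluster
        - name: pending-cluster-serve-svc # placeholder: pending cluster's serve service
          port: 8000
          weight: 5                       # stepSizePercent (5%) shifted this interval
```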

Member:

According to
ray-project/kuberay#3166 (comment)
and
ray-project/kuberay#3166 (comment),

you can see that the RayService controller will not wait for the replica updates in the Ray Serve application to finish before updating the traffic percentage.

However, Rueian, Kai-Hsun, and I will discuss how to solve this issue this week.

I just want to say we should inform users that there may be temporary RPS drops because worker pods (worker nodes) are still being created.

@ryanaoleary (Contributor, Author):

I added a sentence detailing this caveat in 809c76e so that users are aware.

@ryanaoleary (Contributor, Author) commented:

> Hi, @ryanaoleary
> Can we take kind as an example in the doc, and add [optional] field if this only needs to be done in kind? [...] This is my successful script to reproduce. [...]

Done in 809c76e. I added these steps, and I created a PR in KubeRay with a sample YAML that users can apply.

@ryanaoleary (Contributor, Author) commented:

> overall LGTM, thank you!! @ryanaoleary you execute really really fast, I am really grateful to work with you, thank you for all the hard work.

Thank you!! I really appreciate all the reviews and help as well.
