
Conversation

@ryanaoleary (Contributor) commented on Oct 29, 2025

Description

This PR adds a guide for the new zero-downtime incremental upgrade feature in KubeRay v1.5. This feature was implemented in this PR: ray-project/kuberay#3166.

Related issues

ray-project/kuberay#3209

Docs link

https://anyscale-ray--58293.com.readthedocs.build/en/58293/serve/advanced-guides/incremental-upgrade.html#rayservice-zero-downtime-incremental-upgrades

@ryanaoleary requested review from a team as code owners on October 29, 2025, 22:28
@ryanaoleary (Contributor, Author) commented:

cc: @Future-Outlier @rueian @andrewsykim and @angelinalg @dstrodtman for review from the docs team.

@gemini-code-assist (bot) left a comment:

Code Review

This pull request adds a new, comprehensive guide for the RayService Zero-Downtime Incremental Upgrades feature. The documentation is well-written and covers the prerequisites, mechanics, configuration, monitoring, and API of the new feature. My review includes a critical fix for a command with an incorrect version number and a suggestion to improve grammatical correctness and clarity.

ray-gardener (bot) added the serve, core, and community-contribution labels on Oct 30, 2025.
Future-Outlier self-assigned this on Oct 30, 2025.
@Future-Outlier (Member) left a comment:

Hi, @ryanaoleary

Can we use kind as an example in the doc, and add an [optional] marker for steps that only need to be done on kind?
IMO,

  1. this will be great for developers like us,
  2. it's also great for engineers who need to do a POC at their company, and
  3. we can reduce the maintenance burden in the future.

This is my successful script to reproduce:

  1. Create the kind cluster (1.29.0 is needed because of Istio's version requirements):
kind create cluster --image=kindest/node:v1.29.0
  2. Install the Gateway API CRDs:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml
  3. [optional] Install MetalLB:
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.7/config/manifests/metallb-native.yaml
  4. [optional] Configure a MetalLB address pool:
echo "apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: kind-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.8.200-192.168.8.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system
spec:
  ipAddressPools:
  - kind-pool" | kubectl apply -f -
  5. Create the Istio GatewayClass:
echo "apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: istio
spec:
  controllerName: istio.io/gateway-controller" | kubectl apply -f -
  6. istioctl install --set profile=demo -y
  7. Install the KubeRay operator + CRDs.
  8. Apply a RayService CR (old cluster).
  9. Apply a RayService CR (new cluster); a minimal sketch follows below.
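
A minimal RayService sketch for steps 8 and 9 might look like the following (a hedged sketch only: the metadata name, image tag, and Serve app are placeholders; the `upgradeStrategy` fields follow the API reference quoted later in this thread, and `gatewayClassName: istio` matches the GatewayClass created in step 5):

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: example-rayservice            # placeholder name
spec:
  upgradeStrategy:
    type: NewClusterWithIncrementalUpgrade
    clusterUpgradeOptions:
      gatewayClassName: istio         # GatewayClass created in step 5
      maxSurgePercent: 20             # grow the pending cluster's capacity in 20% steps
      stepSizePercent: 5              # shift 5% of traffic per interval
      intervalSeconds: 30             # wait 30 seconds between traffic shifts
  serveConfigV2: |
    applications:
      - name: example_app
        import_path: example.module:app   # placeholder import path
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.46.0   # placeholder image tag
```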

@Future-Outlier (Member) left a comment:

Is it possible to add a section that teaches users how to calculate the minimum resources needed when using this feature?
For example, this is my comment:
ray-project/kuberay#3209 (comment)

Comment on lines 166 to 175
### 5. Rollback Support

To roll back a failing or poorly performing upgrade, simply **update the `RayService` manifest back to the original configuration** (e.g., change the `image` back to the old tag).

KubeRay's controller will detect that the "goal state" now matches the *active* (old) cluster. It will reverse the process:
1. Scale the active cluster's `target_capacity` back to 100%.
2. Shift all traffic back to the active cluster.
3. Scale down and terminate the *pending* (new) cluster.

---
Member:

Is this supported now, or will it be supported in the future?
I guess this is related to this PR, right?

ray-project/kuberay#4109

@ryanaoleary (Contributor, Author):

Removed in 809c76e. It's not currently supported, but I'll add documentation for it when I fix ray-project/kuberay#4109 and can get it merged.

Comment on lines 4 to 8
This guide details how to configure and use the `NewClusterWithIncrementalUpgrade` strategy for a `RayService` with KubeRay. This feature was proposed in a [Ray Enhancement Proposal (REP)](https://github.com/ray-project/enhancements/blob/main/reps/2024-12-4-ray-service-incr-upgrade.md) and implemented with alpha support in KubeRay v1.5.0. If you're unfamiliar with RayService and KubeRay, see the [RayService Quickstart](https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/rayservice-quick-start.html).

In previous versions of KubeRay, zero-downtime upgrades were supported only through the `NewCluster` strategy. That strategy scales up a pending RayCluster with capacity equal to the active cluster, waits until the updated Serve applications are healthy, and then switches traffic to the new RayCluster. While reliable, it requires provisioning 200% of the original cluster's compute resources, which can be prohibitive when dealing with expensive accelerator resources.

The `NewClusterWithIncrementalUpgrade` strategy is designed for large-scale deployments, such as LLM serving, where duplicating resources for a standard blue/green deployment isn't feasible due to resource constraints. Rather than creating a new `RayCluster` at 100% capacity, this strategy creates a new cluster and gradually scales up its capacity while simultaneously shifting user traffic from the old cluster to the new one. This gradual traffic migration lets users safely scale their updated RayService while the old cluster autoscales down, saving expensive compute resources and giving fine-grained control over the pace of the upgrade. The process relies on the Kubernetes Gateway API for fine-grained traffic splitting.
Member:

Can we add this sentence on top of this paragraph?

> This feature minimizes resource usage during RayService CR upgrades while maintaining service availability. Below we explain the design and usage.

@ryanaoleary (Contributor, Author):

Done in 809c76e.

Comment on lines +177 to +207
## API Overview (Reference)

This section details the new and updated fields in the `RayService` CRD.

### `RayService.spec.upgradeStrategy`

| Field | Type | Description | Required | Default |
| :--- | :--- | :--- | :--- | :--- |
| `type` | `string` | The strategy to use for upgrades. Can be `NewCluster`, `None`, or `NewClusterWithIncrementalUpgrade`. | No | `NewCluster` |
| `clusterUpgradeOptions` | `object` | Container for incremental upgrade settings. **Required if `type` is `NewClusterWithIncrementalUpgrade`.** The `RayServiceIncrementalUpgrade` feature gate must be enabled. | No | `nil` |

### `RayService.spec.upgradeStrategy.clusterUpgradeOptions`

This block is required *only* if `type` is set to `NewClusterWithIncrementalUpgrade`.

| Field | Type | Description | Required | Default |
| :--- | :--- | :--- | :--- | :--- |
| `maxSurgePercent` | `int32` | The percentage of *capacity* (Serve replicas) to add to the new cluster in each scaling step. For example, a value of `20` means the new cluster's `target_capacity` will increase in 20% increments (0% -> 20% -> 40%...). Must be between 0 and 100. | No | `100` |
| `stepSizePercent` | `int32` | The percentage of *traffic* to shift from the old to the new cluster during each interval. Must be between 0 and 100. | **Yes** | N/A |
| `intervalSeconds` | `int32` | The time in seconds to wait between shifting traffic by `stepSizePercent`. | **Yes** | N/A |
| `gatewayClassName` | `string` | The `metadata.name` of the `GatewayClass` resource KubeRay should use to create `Gateway` and `HTTPRoute` objects. | **Yes** | N/A |

### `RayService.status.activeServiceStatus` & `RayService.status.pendingServiceStatus`

Three new fields are added to both the `activeServiceStatus` and `pendingServiceStatus` blocks to provide visibility into the upgrade process.

| Field | Type | Description |
| :--- | :--- | :--- |
| `targetCapacity` | `int32` | The target percentage of Serve replicas this cluster is *configured* to handle (from 0 to 100). This is controlled by KubeRay based on `maxSurgePercent`. |
| `trafficRoutedPercent` | `int32` | The *actual* percentage of traffic (from 0 to 100) currently being routed to this cluster's endpoint. This is controlled by KubeRay during an upgrade based on `stepSizePercent` and `intervalSeconds`. |
| `lastTrafficMigratedTime` | `metav1.Time` | A timestamp indicating the last time `trafficRoutedPercent` was updated. |
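
For example, partway through an upgrade, `kubectl get rayservice <name> -o yaml` might report a status similar to this (a hedged sketch; the values are hypothetical and depend on your configuration):

```yaml
status:
  activeServiceStatus:
    targetCapacity: 100              # active cluster still configured for full capacity
    trafficRoutedPercent: 85         # traffic remaining on the active cluster
    lastTrafficMigratedTime: "2025-10-29T22:28:00Z"
  pendingServiceStatus:
    targetCapacity: 20               # pending cluster scaled to its first 20% step
    trafficRoutedPercent: 15         # traffic already shifted to the pending cluster
    lastTrafficMigratedTime: "2025-10-29T22:28:00Z"
```
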
Member:

I'm wondering whether we should delete this?
cc @rueian @andrewsykim for a decision

@ryanaoleary (Contributor, Author):

I don't have much of a preference. We could instead link to https://ray-project.github.io/kuberay/reference/api/, but that page doesn't seem to include the RayService status for some reason, so I thought it might be useful to describe the new API here.


Understanding the lifecycle of an incremental upgrade helps in monitoring and configuration.

1. **Trigger:** You trigger an upgrade by updating the `RayService` spec, such as changing the container `image` or updating the `serveConfigV2`.
Member:

In my experience, changing `serveConfigV2` will not trigger an incremental upgrade, but updating the image will.

@ryanaoleary (Contributor, Author):

Fixed in 809c76e: changes under `rayClusterConfig`, except for `replicas`, trigger the upgrade.
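
For example, changing the container image under `rayClusterConfig` and re-applying the manifest starts the incremental upgrade (a hedged sketch; the field path is abbreviated and the image tags and filename are placeholders):

```yaml
spec:
  rayClusterConfig:
    headGroupSpec:
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.47.0   # changed from rayproject/ray:2.46.0; triggers the upgrade
```

Re-applying with `kubectl apply -f rayservice.yaml` then kicks off the upgrade, while edits to `replicas` alone do not.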

* Active `target_capacity`: 100%
* Pending `target_capacity`: 0% $\rightarrow$ **20%**
* **Total Capacity: 120%**
* The Ray Autoscaler begins provisioning pods for the pending cluster to handle 20% of the target load.
Member:

This is my understanding, is it correct?
If possible, I think it's better to let users know there are two autoscalers in this upgrade process: the Ray Serve autoscaler and the Ray autoscaler.

  1. The Ray Serve autoscaler will update `target_capacity`, and the Serve application deployments' `num_replicas` will change.
  2. Based on 1, the Ray autoscaler will scale up or scale down if needed.

@ryanaoleary (Contributor, Author):

Added a sentence detailing the interaction between the two autoscalers in 809c76e.

  1. If the Ray Serve autoscaler is enabled (it isn't required, so this won't be the case for all users), `num_replicas` scales up from `min_replicas` when `target_capacity` is updated; this is what causes the drop in RPS.
  2. Based on the new value of `num_replicas` from step 1, the Ray autoscaler considers the new resource request and scales pods accordingly to schedule the Serve replicas on (see the worked sketch below).
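
As a rough worked example of how the two autoscalers interact (assuming Serve scales replica counts roughly in proportion to `target_capacity`; the deployment and numbers below are hypothetical):

```yaml
# Hypothetical Serve deployment from serveConfigV2:
deployments:
  - name: ExampleDeployment
    num_replicas: 10
# With maxSurgePercent: 20, the pending cluster's target_capacity steps
# 0% -> 20% -> 40% -> ... At target_capacity 20%, Serve aims for roughly
# ceil(10 * 20 / 100) = 2 replicas on the pending cluster, and the Ray
# autoscaler then provisions enough worker pods to place those replicas.
```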

@Future-Outlier (Member) left a comment:

Overall LGTM, thank you!!
@ryanaoleary, you execute really, really fast. I'm really grateful to work with you. Thank you for all the hard work.

Comment on lines 68 to 71
* KubeRay waits for the pending cluster's new pods to be ready.
* Once ready, it begins to *gradually* shift traffic. Every `intervalSeconds`, it updates the `HTTPRoute` weights, moving `stepSizePercent` (5%) of traffic from the active to the pending cluster.
* This continues until the *actual* traffic (`trafficRoutedPercent`) "catches up" to the *pending* cluster's `target_capacity` (20% in this example).
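
For intuition, the weighted routing described in the quoted lines above might look roughly like the following `HTTPRoute` (a hedged sketch: the route, Gateway, and Service names are placeholders for the objects KubeRay manages, and 8000 is assumed to be the Serve port):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-rayservice-route          # placeholder name
spec:
  parentRefs:
    - name: example-rayservice-gateway    # placeholder Gateway created by KubeRay
  rules:
    - backendRefs:
        - name: active-cluster-serve-svc  # placeholder: active cluster's serve service
          port: 8000
          weight: 95                      # traffic staying on the active cluster
        - name: pending-cluster-serve-svc # placeholder: pending cluster's serve service
          port: 8000
          weight: 5                       # stepSizePercent (5%) shifted this interval
```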

Member:

According to
ray-project/kuberay#3166 (comment)
and
ray-project/kuberay#3166 (comment),

you can see that the RayService controller will not wait for the replica updates in the Ray Serve application to finish before updating the traffic percentage.

However, Rueian, Kai-Hsun, and I will discuss how to solve this issue this week.

I just want to say we should inform users that there may be temporary RPS drops because worker pods (worker nodes) are still being created.

@ryanaoleary (Contributor, Author):

I added a sentence detailing this caveat in 809c76e so that users are aware.

@ryanaoleary (Contributor, Author) commented:

> Hi, @ryanaoleary
> Can we take kind as an example in the doc, and add [optional] field if this only needs to be done in kind? [...] This is my successful script to reproduce. [...]

Done in 809c76e. I added these steps, and I created a PR in KubeRay with a sample YAML that users can apply.

@ryanaoleary (Contributor, Author) commented:

> overall LGTM, thank you!! @ryanaoleary you execute really really fast, I am really grateful to work with you, thank you for all the hard work.

Thank you!! I really appreciate all the reviews and help as well.
