[core][autoscaler] make cluster constraints clear in the autoscaler status report #52278

Merged
jjyao merged 6 commits into ray-project:master from rueian:better-autoscaler-observability
Apr 22, 2025

Conversation

@rueian
Contributor

@rueian rueian commented Apr 12, 2025

Why are these changes needed?

As requested in #37959. This makes cluster constraints clear in the autoscaler status report by moving them out of the normal demands list and into their own explicit section.

Before

Resources
---------------------------------------------------------------
Usage:
 0B/8.21GiB memory
 0B/2.00GiB object_store_memory

Demands:
 {'CPU': 3.0}: 3+ pending tasks/actors
 {'CPU': 2.0}: 3+ pending tasks/actors
 {'CPU': 1}: 1000+ from request_resources()

After

Resources
---------------------------------------------------------------
Usage:
 0B/8.21GiB memory
 0B/2.00GiB object_store_memory

Constraints:
 {'CPU': 1}: 1000 from request_resources()
Demands:
 {'CPU': 3.0}: 3+ pending tasks/actors
 {'CPU': 2.0}: 3+ pending tasks/actors
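The split shown in the "After" report can be sketched in Python. This is a simplified illustration of the layout only; `format_status` and its parameters are hypothetical names, not Ray's actual internals:

```python
# Hypothetical sketch of the new report layout: shapes pinned via
# request_resources() get their own "Constraints:" section instead of
# being mixed into "Demands:". Counts for constraints are exact (no "+"),
# while task/actor demand counts remain lower bounds.
from collections import Counter
from typing import Dict, List


def format_status(
    demands: List[Dict[str, float]],      # pending task/actor shapes
    constraints: List[Dict[str, float]],  # shapes from request_resources()
) -> str:
    lines = []
    if constraints:
        lines.append("Constraints:")
        for shape, count in Counter(
            tuple(sorted(c.items())) for c in constraints
        ).items():
            lines.append(f" {dict(shape)}: {count} from request_resources()")
    lines.append("Demands:")
    for shape, count in Counter(
        tuple(sorted(d.items())) for d in demands
    ).items():
        lines.append(f" {dict(shape)}: {count}+ pending tasks/actors")
    return "\n".join(lines)


report = format_status(
    demands=[{"CPU": 3.0}] * 3 + [{"CPU": 2.0}] * 3,
    constraints=[{"CPU": 1}] * 1000,
)
print(report)
```

Grouping identical shapes with a Counter mirrors how the report collapses repeated bundles into one line with a count.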

Note that this change is also reflected in the dashboard:
[dashboard screenshot]

Related issue number

Closes #37959

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rueian rueian force-pushed the better-autoscaler-observability branch 5 times, most recently from 18bc935 to 2033ebc on April 13, 2025 20:02
@rueian rueian force-pushed the better-autoscaler-observability branch from 2033ebc to 704c731 on April 15, 2025 05:04
@rueian rueian changed the title from "WIP [core][autoscaler] make constraints and numbers of ready/infeasible/backlog clear in the status report" to "[core][autoscaler] make cluster constraints clear in the autoscaler status report" on Apr 15, 2025
@rueian rueian marked this pull request as ready for review April 15, 2025 06:02
@rueian rueian requested a review from a team as a code owner April 15, 2025 06:02
@rueian rueian force-pushed the better-autoscaler-observability branch from 6e8ce94 to d3b041c on April 15, 2025 15:05
@@ -914,6 +924,8 @@ def format_info_string(
{separator}
{"Total " if verbose else ""}Usage:
Member

Could you add a screenshot with verbose set to true? What is the difference between the output with "Total" and the output without "Total"?

Contributor Author

Here is an example:

without verbose

Resources
--------------------------------------------------------
Usage:
 530.0/544.0 CPU
 2/2 GPU
 2.00GiB/8.00GiB memory
 3.14GiB/16.00GiB object_store_memory

Constraints:
 {'CPU': 16}: 100 from request_resources()
Demands:
 {'CPU': 1}: 150+ pending tasks/actors
 {'CPU': 4} * 5 (PACK): 420+ pending placement groups

with verbose

Resources
--------------------------------------------------------
Total Usage:
 530.0/544.0 CPU
 2/2 GPU
 1/2 accelerator_type:V100
 2.00GiB/8.00GiB memory
 3.14GiB/16.00GiB object_store_memory

Total Constraints:
 {'CPU': 16}: 100 from request_resources()
Total Demands:
 {'CPU': 1}: 150+ pending tasks/actors
 {'CPU': 4} * 5 (PACK): 420+ pending placement groups

Node: 192.168.1.1
 Usage:
  5.0/20.0 CPU
  0.7/1 GPU
  0.1/1 accelerator_type:V100
  1.00GiB/4.00GiB memory
  3.14GiB/4.00GiB object_store_memory
 Activity:
  CPU in use.
  GPU in use.
  Active workers.

Node: 192.168.1.2
 Usage:
  15.0/20.0 CPU
  0.3/1 GPU
  0.9/1 accelerator_type:V100
  1.00GiB/12.00GiB memory
  0B/4.00GiB object_store_memory
 Activity:
  GPU in use.
  Active workers.

For the usage section, accelerator_type resources are shown in verbose mode, and per-node usage is appended at the end of the report.

For the other sections, the only difference is that their titles are prefixed with "Total".
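The title prefixing shown in the diff hunk above (`{"Total " if verbose else ""}Usage:`) boils down to a one-liner; `section_title` is a hypothetical helper name used here only to make the behavior concrete:

```python
# Verbose mode prefixes cluster-wide section titles with "Total " to
# distinguish them from the per-node "Usage:" sections that follow.
def section_title(name: str, verbose: bool) -> str:
    return f'{"Total " if verbose else ""}{name}:'


print(section_title("Usage", verbose=False))       # Usage:
print(section_title("Constraints", verbose=True))  # Total Constraints:
```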

Member

For the other sections, there is no difference except that their title will be added with Total prefixes.

I guess "Total" is used to distinguish between "Total Usage" and a node's "Usage". How about always adding "Total"? It doesn't have to be in this PR; we can update it in a follow-up PR.

Member

Can you open an issue to track the progress of the follow-up and add it to the umbrella issue in the KubeRay repo?


def get_demand_report(lm_summary: LoadMetricsSummary):
demand_lines = []
if lm_summary.resource_demand:
Member

Can you help me understand the difference between

demand_lines.extend(format_resource_demand_summary(lm_summary.resource_demand)) (L730) and lm_summary.request_demand (L736)?

Contributor Author

resource_demand consists of the ready/infeasible/backlog task shapes that the autoscaler uses to scale the cluster.
request_demand comes from direct request_resources() calls made by the user.
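The distinction between the two demand sources can be sketched as follows. The field names mirror the `LoadMetricsSummary` fields discussed above, but this `Summary` dataclass is a simplified stand-in, not Ray's actual class:

```python
# Sketch of the two demand sources: resource_demand feeds the "Demands:"
# section, request_demand feeds the new "Constraints:" section.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Shape = Dict[str, float]


@dataclass
class Summary:
    # Shapes of ready/infeasible/backlogged tasks and actors that the
    # autoscaler tries to satisfy by adding nodes, with lower-bound counts.
    resource_demand: List[Tuple[Shape, int]] = field(default_factory=list)
    # Shapes the user pinned explicitly via request_resources(), with
    # exact counts.
    request_demand: List[Tuple[Shape, int]] = field(default_factory=list)


s = Summary(
    resource_demand=[({"CPU": 1}, 150)],   # -> "Demands:" section
    request_demand=[({"CPU": 16}, 100)],   # -> "Constraints:" section
)
```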

@rueian rueian force-pushed the better-autoscaler-observability branch from 3e92173 to fe3fecc on April 16, 2025 03:08

@kevin85421 kevin85421 added the go add ONLY when ready to merge, run all tests label Apr 16, 2025
@jcotant1 jcotant1 added the core Issues that should be addressed in Ray Core label Apr 16, 2025
@rueian rueian force-pushed the better-autoscaler-observability branch from fe3fecc to d88dddb on April 16, 2025 20:38
@rueian
Contributor Author

rueian commented Apr 17, 2025

Gently pinging @jjyao for review again. All tests have passed.

@rueian
Contributor Author

rueian commented Apr 21, 2025

Gently ping @kevin85421 and @jjyao for reviews.

Constraints:
{'CPU': 16}: 100 from request_resources()
Demands:
{'CPU': 1}: 150+ pending tasks/actors
Collaborator

Could you do a follow-up to remove the + for demands as well when the count is accurate?

Collaborator

@jjyao jjyao left a comment

LG

2.00GiB/8.00GiB memory
3.14GiB/16.00GiB object_store_memory

Constraints:
Collaborator

Could you add a test where we have multiple bundle shapes so this contains multiple lines?

Contributor Author

Sure. A new test containing multiple lines of constraints was added in the latest commit.

@rueian rueian requested a review from a team as a code owner April 22, 2025 00:01
@rueian rueian force-pushed the better-autoscaler-observability branch from f7f2e56 to d88dddb on April 22, 2025 00:06
@kevin85421
Member

Gently ping @kevin85421 and @jjyao for reviews.

I have already approved this PR. Are there any other major changes that need to be reviewed? If not, I will leave this PR to @jjyao.

@jjyao jjyao merged commit 2159324 into ray-project:master Apr 22, 2025
5 checks passed

Labels

community-backlog
core (Issues that should be addressed in Ray Core)
go (add ONLY when ready to merge, run all tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[core][autoscaler] Better observability for request resources

5 participants