[core][autoscaler] make cluster constraints clear in the autoscaler status report #52278

Merged
jjyao merged 6 commits into ray-project:master from rueian:better-autoscaler-observability
Apr 22, 2025

Conversation

@rueian
Contributor

@rueian rueian commented Apr 12, 2025

Why are these changes needed?

As requested in #37959. This makes cluster constraints clear in the autoscaler status report by moving them out of the normal demands list and into their own explicit section.

Before

Resources
---------------------------------------------------------------
Usage:
 0B/8.21GiB memory
 0B/2.00GiB object_store_memory

Demands:
 {'CPU': 3.0}: 3+ pending tasks/actors
 {'CPU': 2.0}: 3+ pending tasks/actors
 {'CPU': 1}: 1000+ from request_resources()

After

Resources
---------------------------------------------------------------
Usage:
 0B/8.21GiB memory
 0B/2.00GiB object_store_memory

Constraints:
 {'CPU': 1}: 1000 from request_resources()
Demands:
 {'CPU': 3.0}: 3+ pending tasks/actors
 {'CPU': 2.0}: 3+ pending tasks/actors
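The split shown in the "After" report can be sketched in Python. This is a simplified illustration of the layout only; `format_status` and its parameters are hypothetical names, not Ray's actual internals:

```python
# Hypothetical sketch of the new report layout: shapes pinned via
# request_resources() get their own "Constraints:" section instead of
# being mixed into "Demands:". Counts for constraints are exact (no "+"),
# while task/actor demand counts remain lower bounds.
from collections import Counter
from typing import Dict, List


def format_status(
    demands: List[Dict[str, float]],      # pending task/actor shapes
    constraints: List[Dict[str, float]],  # shapes from request_resources()
) -> str:
    lines = []
    if constraints:
        lines.append("Constraints:")
        for shape, count in Counter(
            tuple(sorted(c.items())) for c in constraints
        ).items():
            lines.append(f" {dict(shape)}: {count} from request_resources()")
    lines.append("Demands:")
    for shape, count in Counter(
        tuple(sorted(d.items())) for d in demands
    ).items():
        lines.append(f" {dict(shape)}: {count}+ pending tasks/actors")
    return "\n".join(lines)


report = format_status(
    demands=[{"CPU": 3.0}] * 3 + [{"CPU": 2.0}] * 3,
    constraints=[{"CPU": 1}] * 1000,
)
print(report)
```

Grouping identical shapes with a Counter mirrors how the report collapses repeated bundles into one line with a count.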

Note that this change is also reflected in the dashboard:
[dashboard screenshot]

Related issue number

Closes #37959

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rueian rueian force-pushed the better-autoscaler-observability branch 5 times, most recently from 18bc935 to 2033ebc on April 13, 2025 20:02
@rueian rueian force-pushed the better-autoscaler-observability branch from 2033ebc to 704c731 on April 15, 2025 05:04
@rueian rueian changed the title from "WIP [core][autoscaler] make constraints and numbers of ready/infeasible/backlog clear in the status report" to "[core][autoscaler] make cluster constraints clear in the autoscaler status report" on Apr 15, 2025
@rueian rueian marked this pull request as ready for review April 15, 2025 06:02
@rueian rueian requested a review from a team as a code owner April 15, 2025 06:02
@rueian rueian force-pushed the better-autoscaler-observability branch from 6e8ce94 to d3b041c on April 15, 2025 15:05
@@ -914,6 +924,8 @@ def format_info_string(
{separator}
{"Total " if verbose else ""}Usage:
Member

Could you add a screenshot with verbose set to true? What is the difference between the output with "Total" and the output without "Total"?

Contributor Author

Here is an example:

without verbose

Resources
--------------------------------------------------------
Usage:
 530.0/544.0 CPU
 2/2 GPU
 2.00GiB/8.00GiB memory
 3.14GiB/16.00GiB object_store_memory

Constraints:
 {'CPU': 16}: 100 from request_resources()
Demands:
 {'CPU': 1}: 150+ pending tasks/actors
 {'CPU': 4} * 5 (PACK): 420+ pending placement groups

with verbose

Resources
--------------------------------------------------------
Total Usage:
 530.0/544.0 CPU
 2/2 GPU
 1/2 accelerator_type:V100
 2.00GiB/8.00GiB memory
 3.14GiB/16.00GiB object_store_memory

Total Constraints:
 {'CPU': 16}: 100 from request_resources()
Total Demands:
 {'CPU': 1}: 150+ pending tasks/actors
 {'CPU': 4} * 5 (PACK): 420+ pending placement groups

Node: 192.168.1.1
 Usage:
  5.0/20.0 CPU
  0.7/1 GPU
  0.1/1 accelerator_type:V100
  1.00GiB/4.00GiB memory
  3.14GiB/4.00GiB object_store_memory
 Activity:
  CPU in use.
  GPU in use.
  Active workers.

Node: 192.168.1.2
 Usage:
  15.0/20.0 CPU
  0.3/1 GPU
  0.9/1 accelerator_type:V100
  1.00GiB/12.00GiB memory
  0B/4.00GiB object_store_memory
 Activity:
  GPU in use.
  Active workers.

For the usage section, accelerator_type resources are shown in verbose mode, and per-node usage is appended at the end of the report.

For the other sections, the only difference is that their titles are prefixed with "Total".
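The title prefixing shown in the diff hunk above (`{"Total " if verbose else ""}Usage:`) boils down to a one-liner; `section_title` is a hypothetical helper name used here only to make the behavior concrete:

```python
# Verbose mode prefixes cluster-wide section titles with "Total " to
# distinguish them from the per-node "Usage:" sections that follow.
def section_title(name: str, verbose: bool) -> str:
    return f'{"Total " if verbose else ""}{name}:'


print(section_title("Usage", verbose=False))       # Usage:
print(section_title("Constraints", verbose=True))  # Total Constraints:
```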

Member

For the other sections, there is no difference except that their title will be added with Total prefixes.

I guess "Total" is used to distinguish between "Total Usage" and a node's "Usage". How about always adding "Total"? It doesn't have to be in this PR; we can update it in a follow-up PR.

Member

Can you open an issue to track the progress of the follow-up and add it to the umbrella issue in the KubeRay repo?


def get_demand_report(lm_summary: LoadMetricsSummary):
demand_lines = []
if lm_summary.resource_demand:
Member

Can you help me understand the difference between

demand_lines.extend(format_resource_demand_summary(lm_summary.resource_demand)) (L730) and lm_summary.request_demand (L736)?

Contributor Author

resource_demand consists of the ready/infeasible/backlog task shapes that the autoscaler uses to scale the cluster.
request_demand comes from direct request_resources() calls made by the user.
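The distinction between the two demand sources can be sketched as follows. The field names mirror the `LoadMetricsSummary` fields discussed above, but this `Summary` dataclass is a simplified stand-in, not Ray's actual class:

```python
# Sketch of the two demand sources: resource_demand feeds the "Demands:"
# section, request_demand feeds the new "Constraints:" section.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Shape = Dict[str, float]


@dataclass
class Summary:
    # Shapes of ready/infeasible/backlogged tasks and actors that the
    # autoscaler tries to satisfy by adding nodes, with lower-bound counts.
    resource_demand: List[Tuple[Shape, int]] = field(default_factory=list)
    # Shapes the user pinned explicitly via request_resources(), with
    # exact counts.
    request_demand: List[Tuple[Shape, int]] = field(default_factory=list)


s = Summary(
    resource_demand=[({"CPU": 1}, 150)],   # -> "Demands:" section
    request_demand=[({"CPU": 16}, 100)],   # -> "Constraints:" section
)
```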

@rueian rueian force-pushed the better-autoscaler-observability branch from 3e92173 to fe3fecc on April 16, 2025 03:08

@kevin85421 kevin85421 added the go add ONLY when ready to merge, run all tests label Apr 16, 2025
@jcotant1 jcotant1 added the core Issues that should be addressed in Ray Core label Apr 16, 2025
@rueian rueian force-pushed the better-autoscaler-observability branch from fe3fecc to d88dddb on April 16, 2025 20:38
@rueian
Contributor Author

rueian commented Apr 17, 2025

Gently pinging @jjyao for review again. All tests have passed.

@rueian
Contributor Author

rueian commented Apr 21, 2025

Gently ping @kevin85421 and @jjyao for reviews.

Constraints:
{'CPU': 16}: 100 from request_resources()
Demands:
{'CPU': 1}: 150+ pending tasks/actors
Collaborator

Could you do a follow-up to remove the + for demands as well when the count is accurate?

Collaborator

@jjyao jjyao left a comment

LG

2.00GiB/8.00GiB memory
3.14GiB/16.00GiB object_store_memory

Constraints:
Collaborator

Could you add a test where we have multiple bundle shapes so this contains multiple lines?

Contributor Author

Sure. A new test containing multiple lines of constraints was added in the latest commit.

@rueian rueian requested a review from a team as a code owner April 22, 2025 00:01
@rueian rueian force-pushed the better-autoscaler-observability branch from f7f2e56 to d88dddb on April 22, 2025 00:06
@kevin85421
Member

Gently ping @kevin85421 and @jjyao for reviews.

I have already approved this PR. Are there any other major changes that need to be reviewed? If not, I will leave this PR to @jjyao.

@jjyao jjyao merged commit 2159324 into ray-project:master Apr 22, 2025
5 checks passed

Labels

community-backlog
core (Issues that should be addressed in Ray Core)
go (add ONLY when ready to merge, run all tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[core][autoscaler] Better observability for request resources

5 participants