[core][autoscaler] make cluster constraints clear in the autoscaler status report#52278
…report clear Signed-off-by: Rueian <[email protected]>
@@ -914,6 +924,8 @@ def format_info_string(
{separator}
{"Total " if verbose else ""}Usage:
Could you add a screenshot with verbose set to true? What is the difference between the output with "Total" and the output without "Total"?
Here is an example:
without verbose
Resources
--------------------------------------------------------
Usage:
530.0/544.0 CPU
2/2 GPU
2.00GiB/8.00GiB memory
3.14GiB/16.00GiB object_store_memory
Constraints:
{'CPU': 16}: 100 from request_resources()
Demands:
{'CPU': 1}: 150+ pending tasks/actors
{'CPU': 4} * 5 (PACK): 420+ pending placement groups

with verbose
Resources
--------------------------------------------------------
Total Usage:
530.0/544.0 CPU
2/2 GPU
1/2 accelerator_type:V100
2.00GiB/8.00GiB memory
3.14GiB/16.00GiB object_store_memory
Total Constraints:
{'CPU': 16}: 100 from request_resources()
Total Demands:
{'CPU': 1}: 150+ pending tasks/actors
{'CPU': 4} * 5 (PACK): 420+ pending placement groups
Node: 192.168.1.1
Usage:
5.0/20.0 CPU
0.7/1 GPU
0.1/1 accelerator_type:V100
1.00GiB/4.00GiB memory
3.14GiB/4.00GiB object_store_memory
Activity:
CPU in use.
GPU in use.
Active workers.
Node: 192.168.1.2
Usage:
15.0/20.0 CPU
0.3/1 GPU
0.9/1 accelerator_type:V100
1.00GiB/12.00GiB memory
0B/4.00GiB object_store_memory
Activity:
GPU in use.
Active workers.

In the usage section, accelerator_type resources are shown only in verbose mode, and per-node usage is listed for each node at the end of the report.
For the other sections, the only difference is that their titles get a "Total " prefix.
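To make the verbose/non-verbose difference concrete, here is a minimal sketch of the behavior described above. It is not the actual `format_info_string` implementation: `section_title` and `format_report` are hypothetical helper names, and filtering on the `accelerator_type:` substring is an assumption made only for illustration.

```python
from typing import List


def section_title(name: str, verbose: bool) -> str:
    # Same idea as the f-string in the diff: {"Total " if verbose else ""}Usage:
    return f"{'Total ' if verbose else ''}{name}:"


def format_report(
    usage: List[str],
    constraints: List[str],
    demands: List[str],
    node_reports: List[str],
    verbose: bool = False,
) -> str:
    # Assumption for this sketch: accelerator_type entries are hidden
    # unless verbose mode is on.
    if not verbose:
        usage = [line for line in usage if "accelerator_type:" not in line]

    lines = ["Resources", "-" * 56]
    sections = (("Usage", usage), ("Constraints", constraints), ("Demands", demands))
    for name, body in sections:
        lines.append(section_title(name, verbose))
        lines.extend(body)

    # Per-node usage is appended only in verbose mode.
    if verbose:
        lines.extend(node_reports)
    return "\n".join(lines)
```

Calling `format_report(..., verbose=True)` yields the "Total Usage:", "Total Constraints:", and "Total Demands:" headers followed by the per-node blocks, while the default call yields the shorter report with unprefixed headers.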
> For the other sections, the only difference is that their titles get a "Total " prefix.

I guess "Total" is used to distinguish the cluster-wide "Total Usage" from a node's "Usage". How about always adding "Total"? It doesn't have to be in this PR; we can update it in a follow-up PR.
Can you open an issue to track the follow-up and add it to the umbrella issue in the KubeRay repo?
def get_demand_report(lm_summary: LoadMetricsSummary):
    demand_lines = []
    if lm_summary.resource_demand:
Can you help me understand the difference between demand_lines.extend(format_resource_demand_summary(lm_summary.resource_demand)) (L730) and lm_summary.request_demand (L736)?
resource_demand consists of the ready/infeasible/backlog task shapes that the autoscaler uses to scale the cluster.
request_demand comes from a direct request_resources() call made by the user.
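A rough sketch of how the two fields could map onto the two report sections, in case it helps. `DemandSummary` is a hypothetical stand-in that only borrows the two field names from the discussion; it is not the real `LoadMetricsSummary`, and the `(shape, count)` tuple layout and the `constraint_lines` / `demand_lines` helpers are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

ResourceShape = Dict[str, float]


@dataclass
class DemandSummary:
    """Hypothetical stand-in; only the two field names come from the PR."""

    # Shapes of ready/infeasible/backlog tasks and actors, with counts.
    resource_demand: List[Tuple[ResourceShape, int]] = field(default_factory=list)
    # Shapes requested explicitly through request_resources(), with counts.
    request_demand: List[Tuple[ResourceShape, int]] = field(default_factory=list)


def constraint_lines(summary: DemandSummary) -> List[str]:
    # Explicit requests carry an exact count, so no "+" suffix is printed.
    return [
        f"{shape}: {count} from request_resources()"
        for shape, count in summary.request_demand
    ]


def demand_lines(summary: DemandSummary) -> List[str]:
    # Task/actor demand is reported as a lower bound, hence the "+" suffix.
    return [
        f"{shape}: {count}+ pending tasks/actors"
        for shape, count in summary.resource_demand
    ]
```

With `resource_demand=[({"CPU": 3.0}, 3)]` and `request_demand=[({"CPU": 1}, 1000)]`, the first helper feeds the Constraints section and the second feeds the Demands section, matching the "After" output in the PR description.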
Gently pinging @jjyao for review again. All tests have passed.

Gently pinging @kevin85421 and @jjyao for review.
Constraints:
{'CPU': 16}: 100 from request_resources()
Demands:
{'CPU': 1}: 150+ pending tasks/actors
Could you do a follow-up to remove the + for demands as well, if the count is accurate?
2.00GiB/8.00GiB memory
3.14GiB/16.00GiB object_store_memory

Constraints:
Could you add a test where we have multiple bundle shapes so this contains multiple lines?
Sure. A new test with multiple constraint lines has been added in the latest commit.
I have already approved this PR. Are there any other major changes that need to be reviewed? If not, I will leave this PR to @jjyao.
Why are these changes needed?
As requested in #37959, this PR makes cluster constraints clear in the autoscaler status report by showing them in an explicit Constraints section instead of mixing them with normal demands.
Before

Resources
---------------------------------------------------------------
Usage:
0B/8.21GiB memory
0B/2.00GiB object_store_memory
Demands:
{'CPU': 3.0}: 3+ pending tasks/actors
{'CPU': 2.0}: 3+ pending tasks/actors
{'CPU': 1}: 1000+ from request_resources()

After

Resources
---------------------------------------------------------------
Usage:
0B/8.21GiB memory
0B/2.00GiB object_store_memory
Constraints:
{'CPU': 1}: 1000 from request_resources()
Demands:
{'CPU': 3.0}: 3+ pending tasks/actors
{'CPU': 2.0}: 3+ pending tasks/actors

Note that this change is also reflected in the dashboard:

Related issue number
Closes #37959