Draft
13 changes: 11 additions & 2 deletions microsoft/testsuites/gpu/gpusuite.py
@@ -124,11 +124,20 @@ def verify_gpu_provision(self, node: Node, log: Logger) -> None:
         timeout=TIMEOUT,
         # min_gpu_count is 8 since it is current
         # max GPU count available in Azure
-        requirement=simple_requirement(min_gpu_count=8),
+        requirement=simple_requirement(min_gpu_count=2),
         priority=3,
     )
     def verify_max_gpu_provision(self, node: Node, log: Logger) -> None:
-        _gpu_provision_check(8, node, log)
+        actual_gpu_count = node.capability.gpu_count
Member

This test case is meant to validate a VM with the maximum GPU configuration. If it's modified as shown, it won't perform the verification correctly. If VM sizes with 8 GPUs are unavailable but still get scheduled, please check for a bug in the VM size capability calculation. If the GPU information is missing in some policy, please set it to 0 to skip this test case.

Collaborator Author

Sure, thanks @squirrelsc. I'll look into this further, but it seems that if Azure returns no capability for a VM, the value given in the requirement is assumed and the test proceeds.

In either case, can we have a check at the start of the test to skip the execution if actual GPU count is less than 8?
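
For illustration, a minimal sketch of the kind of up-front guard being asked about. It is not code from this PR: `SkippedException` is defined locally as a stand-in for the framework's exception, and `ensure_max_gpu_node` and `MAX_GPU_COUNT` are hypothetical names. The value 8 comes from the code comment in the diff.

```python
from typing import Optional


class SkippedException(Exception):
    """Local stand-in for the framework's skip signal (assumption)."""


# Taken from the diff's code comment: 8 is the current max GPU count in Azure.
MAX_GPU_COUNT = 8


def ensure_max_gpu_node(
    actual_gpu_count: Optional[int], required: int = MAX_GPU_COUNT
) -> int:
    """Return the GPU count if the node qualifies, else raise SkippedException."""
    # bool is a subclass of int, so reject it explicitly along with None.
    if isinstance(actual_gpu_count, bool) or not isinstance(actual_gpu_count, int):
        raise SkippedException("GPU count is not available")
    if actual_gpu_count < required:
        raise SkippedException(
            f"Test requires at least {required} GPUs, "
            f"current node only has {actual_gpu_count}."
        )
    return actual_gpu_count
```

Called at the top of the test body, a guard like this would skip on any node with fewer than 8 GPUs regardless of how the node was matched to the requirement.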

Member

I'm not sure I understand the request. If there's no capacity with 8 GPUs, the test case will be skipped—is that the check you're referring to?

+        if not isinstance(actual_gpu_count, int):
+            raise SkippedException("GPU count is not available")
+        # For "max" GPU test, we want to test with high GPU counts
+        if actual_gpu_count < 2:
+            raise SkippedException(
+                f"Test is for scenarios with more than 2 GPUs, "
+                f" current Node only has {actual_gpu_count} GPUs."
+            )
+        _gpu_provision_check(actual_gpu_count, node, log)
 
     @TestCaseMetadata(
         description="""