Description
gpu_allocated field in resource usage calculations should include all *.device and *.shares resource slots, also considering MIG variants like cuda.device:20g-mig.
The current implementation (in both legacy manager/models/resource_usage.py and the new group/user repository functions in manager/repositories/group/repository.py and manager/repositories/user/repository.py) has two issues:
- Typo: cuda.devices should be cuda.device (while cuda.shares is still plural…)
- It should include non-CUDA accelerators like ATOM, ROCm, etc.
An immediate fix would be to introduce a regular expression (e.g., ^[^.]+\.(device|shares)(:[-\w]+)?$) to filter resource slot names. A more robust fix would be to make SlotName a proper class (not just a str subtype declaration) that provides structured methods for parsing resource slot names, such as splitting them on . and :, for future use.
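Both options above can be sketched roughly as follows. This is only an illustration, not the actual manager code: `ACCEL_SLOT_RE` and `ParsedSlotName` are hypothetical names, and the real `SlotName` type may need a different shape.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Option 1: the regular expression proposed in this issue.
ACCEL_SLOT_RE = re.compile(r"^[^.]+\.(device|shares)(:[-\w]+)?$")


# Option 2: a structured slot name (hypothetical stand-in for a richer SlotName).
@dataclass(frozen=True)
class ParsedSlotName:
    device_name: str        # e.g. "cuda", "rocm", "atom", or "cpu"
    kind: str               # e.g. "device" or "shares"; empty for "cpu" / "mem"
    variant: Optional[str]  # e.g. "20g-mig"; None when there is no ":" part

    @classmethod
    def parse(cls, raw: str) -> "ParsedSlotName":
        # Split off the optional ":variant" suffix first, then "name.kind".
        major, _, variant = raw.partition(":")
        device_name, _, kind = major.partition(".")
        return cls(device_name, kind, variant or None)

    @property
    def is_accelerator(self) -> bool:
        return self.kind in ("device", "shares")
```

For example, `ParsedSlotName.parse("cuda.device:20g-mig")` yields `device_name="cuda"`, `kind="device"`, `variant="20g-mig"`, while the misspelled `cuda.devices` is rejected by both the regular expression and `is_accelerator`.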
Although the gpu_allocated field was meant to be removed as a legacy of the time when we only supported NVIDIA GPUs, the webui still has some references to it. (cc: @63240f6729083bbe8cc4d07d ) Since there are practically no cases where a single session allocates two or more different types of accelerators, let's generalize the backend logic to take the sum of all accelerator slots.
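A minimal sketch of the generalized summation, assuming the occupied slots are available as a plain mapping from slot name to amount. The function name `sum_accelerator_slots` and the input shape are hypothetical, not taken from the existing repository code.

```python
import re
from decimal import Decimal
from typing import Mapping, Union

# Same pattern as proposed in this issue: any "<name>.device" or "<name>.shares"
# slot, with an optional ":variant" suffix such as ":20g-mig".
ACCEL_SLOT_RE = re.compile(r"^[^.]+\.(device|shares)(:[-\w]+)?$")


def sum_accelerator_slots(
    occupied_slots: Mapping[str, Union[str, int, Decimal]],
) -> Decimal:
    """Add up every accelerator slot; cpu, mem and other slots are skipped."""
    total = Decimal(0)
    for slot_name, amount in occupied_slots.items():
        if ACCEL_SLOT_RE.match(slot_name):
            total += Decimal(amount)
    return total
```

Note that this adds device counts and fractional shares into one number, which is only meaningful under the assumption stated above that a session uses a single accelerator type.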
JIRA Issue: BA-2404