gpu_allocated field should include all accelerators #5926

@achimnol

Description

The gpu_allocated field in resource usage calculations should include all *.device and *.shares resource slots, including MIG variants such as cuda.device:20g-mig.

The current implementation (in both the legacy manager/models/resource_usage.py and the new group/user repository functions in manager/repositories/group/repository.py and manager/repositories/user/repository.py) has two issues:

  • Typo: cuda.devices should be cuda.device (while cuda.shares is still plural…)
  • It should include non-CUDA accelerators like ATOM, ROCm, etc.

An immediate fix would be to introduce a regular expression (e.g., ^[^.]+\.(device|shares)(:[-\w]+)?$) to filter resource slot names. A more robust fix would be to turn SlotName into a proper class (not just a str subtype declaration) that provides structured methods for parsing resource slot names, such as splitting on . and :, for future use.
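As a rough illustration of both options, here is a minimal Python sketch. The regular expression is the one quoted above. Everything else is an assumption made for illustration: the names is_accelerator_slot and ParsedSlotName are hypothetical, ParsedSlotName is not the existing SlotName type, and the sample slot names other than cuda.device:20g-mig and cuda.shares are guesses.

```python
import re
from dataclasses import dataclass
from typing import Optional

# The pattern quoted in the issue text: "<device>.(device|shares)" with an
# optional ":<variant>" suffix such as ":20g-mig".
ACCELERATOR_SLOT_RE = re.compile(r"^[^.]+\.(device|shares)(:[-\w]+)?$")


def is_accelerator_slot(slot_name: str) -> bool:
    """Quick fix: filter slot names with the regular expression."""
    return ACCELERATOR_SLOT_RE.match(slot_name) is not None


@dataclass(frozen=True)
class ParsedSlotName:
    """Hypothetical structured form of a slot name (not the real SlotName)."""

    device_name: str               # e.g. "cuda"
    unit: str                      # e.g. "device" or "shares"; "" for "cpu"/"mem"
    variant: Optional[str] = None  # e.g. "20g-mig"

    @classmethod
    def parse(cls, raw: str) -> "ParsedSlotName":
        # Split off the variant at ":" first, then split the base at ".".
        base, _, variant = raw.partition(":")
        device_name, _, unit = base.partition(".")
        return cls(device_name, unit, variant or None)

    @property
    def is_accelerator(self) -> bool:
        return self.unit in ("device", "shares")
```

With this sketch, cuda.device:20g-mig and cuda.shares pass the filter, while the misspelled cuda.devices and plain slots such as mem do not.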

Although this field was meant to be removed as a legacy of the time when we supported only NVIDIA GPUs, the webui still has some references to gpu_allocated. (cc: @63240f6729083bbe8cc4d07d ) Since there are practically no cases where a single session is allocated two or more different types of accelerators, let’s generalize the backend logic to take the sum of all accelerator slots.
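The generalized summation could look roughly like the sketch below. It is only an assumption about the shape of the fix: the function name sum_accelerator_slots and the plain mapping of slot names to amounts are hypothetical stand-ins for whatever the repository functions actually receive. It also adds device counts and shares together without unit conversion, which is reasonable only under the premise above that a session uses a single accelerator type.

```python
import re
from decimal import Decimal
from typing import Mapping, Union

# Same pattern as proposed in the issue text.
ACCELERATOR_SLOT_RE = re.compile(r"^[^.]+\.(device|shares)(:[-\w]+)?$")


def sum_accelerator_slots(
    occupied_slots: Mapping[str, Union[str, int, Decimal]],
) -> Decimal:
    """Sum the amounts of every *.device / *.shares slot, ignoring the rest."""
    total = Decimal(0)
    for slot_name, amount in occupied_slots.items():
        if ACCELERATOR_SLOT_RE.match(slot_name):
            # Go through str() so ints and Decimals are handled uniformly.
            total += Decimal(str(amount))
    return total


# Hypothetical session allocation: cpu and mem are skipped.
example = {"cpu": "4", "mem": "8589934592", "cuda.device:20g-mig": "1"}
gpu_allocated = sum_accelerator_slots(example)
```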

JIRA Issue: BA-2404
