feat(BA-3071): add support for DGX spark devices #6809
base: main
Pull Request Overview
This PR adds support for DGX Spark devices by introducing a new "unified" slot type that allows implicit attachment of certain accelerator devices to all created kernels.
Key changes:
- Added `UNIFIED` slot type enum value to distinguish unified accelerators
- Modified kernel creation logic to automatically attach unified-type accelerators to every kernel
- Removed x86_64 platform-specific wheel build configuration from build script
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/ai/backend/common/types.py | Added UNIFIED enum value to SlotTypes for unified accelerator devices |
| src/ai/backend/agent/agent.py | Added logic in create_kernel to automatically attach unified accelerators to all kernels |
| scripts/build-wheels.sh | Removed x86_64 platform-specific wheel build configuration |
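
The implicit attachment described above can be illustrated with a minimal, self-contained sketch. It is not the code in `agent.py`: the `Accelerator` class, the `devices_for_kernel` helper, and the string values of the enum members are hypothetical names chosen for illustration, and only the idea (every unified-type accelerator is attached to every kernel, whether or not it was requested) comes from the summary.

```python
import enum
from dataclasses import dataclass


class SlotTypes(enum.Enum):
    # Stand-in enum; the member values here are assumptions for illustration.
    COUNT = "count"
    BYTES = "bytes"
    UNIQUE = "unique"
    UNIFIED = "unified"


@dataclass(frozen=True)
class Accelerator:
    # Hypothetical stand-in for a device reported by a compute plugin.
    device_id: str
    slot_type: SlotTypes


def devices_for_kernel(requested, available):
    """Return the requested devices plus every unified-type device.

    Unified devices are appended even when the kernel did not ask for them,
    and a device is never listed twice.
    """
    attached = list(requested)
    seen = {dev.device_id for dev in attached}
    for dev in available:
        if dev.slot_type is SlotTypes.UNIFIED and dev.device_id not in seen:
            attached.append(dev)
            seen.add(dev.device_id)
    return attached
```

With this shape, a kernel that requests no accelerators still receives the unified ones, while ordinary count-based devices are attached only on request.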
Pull Request Overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
@copilot Update PR description to abstract changes introduced in this PR.
@kyujin-cho I've opened a new pull request, #6850, to work on those changes. Once the pull request is ready, I'll request review from you.
Co-authored-by: Copilot <[email protected]>
Force-pushed from 1a10632 to 2d8c9ea.
resolves #6798 (BA-3071).
This pull request refactors how device allocation and unified device management are handled in the agent resource specification and kernel lifecycle. The main improvements are the introduction of the `DeviceView` dataclass and the `device_list` property, which provide a unified and clearer way to access all devices (including unified devices) attached to a kernel. This change simplifies device iteration logic throughout the codebase and ensures unified devices are consistently included in resource specifications and related operations.

Device allocation and unified device management:
- Added the `DeviceView` dataclass and the `device_list` property to `KernelResourceSpec`, enabling a unified view of all devices (including unified devices) attached to a kernel. Unified devices are now tracked via a new `unified_devices` attribute, and the resource spec serialization/deserialization logic is updated to include this information. (`src/ai/backend/agent/resources.py`) [1] [2] [3] [4]
- Added the `SlotTypes.UNIFIED` enum value to distinguish unified device slots, supporting the new unified device logic. (`src/ai/backend/common/types.py`)

API and method refactoring:
- Added the `generate_resource_spec` async method to the agent, which builds a complete resource spec including unified devices, replacing direct calls to `prepare_resource_spec` throughout the codebase. (`src/ai/backend/agent/agent.py`) [1] [2]
- Device iteration now uses the `device_list` property instead of iterating over allocations and manually checking for nonzero allocations. This improves clarity and ensures unified devices are handled correctly. (`src/ai/backend/agent/agent.py`, `src/ai/backend/agent/stage/kernel_lifecycle/docker/environ.py`, `src/ai/backend/agent/stage/kernel_lifecycle/docker/mount/krunner.py`) [1] [2] [3]

Documentation and typing improvements:
- (`src/ai/backend/agent/agent.py`, `src/ai/backend/agent/resources.py`) [1] [2]

These changes collectively improve device management consistency, reduce code duplication, and make the agent's resource handling logic easier to maintain and extend.
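
As a rough illustration of the `device_list` idea, the sketch below merges devices with nonzero allocations and unified devices into a single list of `DeviceView` objects. It is a simplified stand-in rather than the code in `resources.py`: the field names on `DeviceView`, the nested shape of `allocations`, and the shape of `unified_devices` are assumptions made for this example.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass(frozen=True)
class DeviceView:
    # Field names are assumptions; the real dataclass may carry other data.
    device_name: str
    device_id: str


@dataclass
class KernelResourceSpec:
    # Assumed shape: device name -> slot name -> device id -> allocated amount.
    allocations: dict[str, dict[str, dict[str, Decimal]]] = field(default_factory=dict)
    # Assumed shape: device name -> ids of implicitly attached unified devices.
    unified_devices: dict[str, list[str]] = field(default_factory=dict)

    @property
    def device_list(self) -> list[DeviceView]:
        """All devices attached to the kernel, each listed once."""
        views: list[DeviceView] = []
        seen: set[tuple[str, str]] = set()

        def add(device_name: str, device_id: str) -> None:
            key = (device_name, device_id)
            if key not in seen:
                seen.add(key)
                views.append(DeviceView(device_name, device_id))

        # Devices with a nonzero allocation in any slot.
        for device_name, slots in self.allocations.items():
            for per_device in slots.values():
                for device_id, amount in per_device.items():
                    if amount > 0:
                        add(device_name, device_id)
        # Unified devices are included regardless of allocations.
        for device_name, device_ids in self.unified_devices.items():
            for device_id in device_ids:
                add(device_name, device_id)
        return views
```

Callers can then loop over `spec.device_list` once instead of walking the nested allocation maps and filtering out zero allocations at each call site.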