
Initial support for Blackwell #747


Open · wants to merge 5 commits into main
Conversation

johnnynunez

10.0: Blackwell B100/B200 (sm_100)
12.0: Blackwell RTX 50 series (sm_120)
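
For context, a minimal sketch of what these targets look like in a PyTorch-style `TORCH_CUDA_ARCH_LIST` build flow; the function name and the default list below are illustrative, not flashinfer's actual build code:

```python
# Illustrative sketch of a PyTorch-style CUDA arch list with the two
# Blackwell targets this PR adds; not flashinfer's actual build code.
import os

def get_cuda_arch_list() -> list[str]:
    # Respect an explicit user override first, as TORCH_CUDA_ARCH_LIST builds do.
    env = os.environ.get("TORCH_CUDA_ARCH_LIST")
    if env:
        return env.replace(";", " ").split()
    # Existing targets plus the new compute capabilities:
    # 10.0 -> B100/B200 (sm_100), 12.0 -> GeForce RTX 50 series (sm_120).
    return ["8.0", "8.9", "9.0", "10.0", "12.0"]

print(" ".join(get_cuda_arch_list()))
```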

@yzh119
Collaborator

yzh119 commented Jan 23, 2025

Hi @johnnynunez , thanks for bringing this up! Could we hold this PR and wait for the official release of torch 2.6 and blackwell software stack?

@johnnynunez
Author

Hi @johnnynunez , thanks for bringing this up! Could we hold this PR and wait for the official release of torch 2.6 and blackwell software stack?

Yeah, for sure! I added codegen for the whole Blackwell family in PyTorch.
Also, you have references here:
NVIDIA/cccl#3493
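
As a sketch of what that codegen enables on the consumer side, runtime dispatch could detect the new compute capabilities through torch's public API; `is_blackwell` is an illustrative helper, not an existing flashinfer function:

```python
# A minimal sketch, assuming a CUDA-enabled torch build; `is_blackwell`
# is an illustrative helper, not an existing flashinfer function.
import torch

def is_blackwell(device: int = 0) -> bool:
    major, _minor = torch.cuda.get_device_capability(device)
    # Compute capability 10.x covers B100/B200; 12.x covers the RTX 50 series.
    return major in (10, 12)

if torch.cuda.is_available():
    print("Blackwell GPU:", is_blackwell())
```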

@johnnynunez
Author

Hi @johnnynunez , thanks for bringing this up! Could we hold this PR and wait for the official release of torch 2.6 and blackwell software stack?

FYI: pytorch/pytorch#145436

@johnnynunez
Author

johnnynunez commented Jan 23, 2025

FYI: https://docs.nvidia.com/cuda/pdf/ptx_isa_8.7.pdf [image]

@yzh119
Collaborator

yzh119 commented Jan 23, 2025

FYI: https://docs.nvidia.com/cuda/pdf/ptx_isa_8.7.pdf [image]

This is huge!

@johnnynunez
Author

@yzh119 can you merge?

@zhyncs
Member

zhyncs commented Jan 25, 2025

@yzh119 can you merge?

@johnnynunez a reminder: #747 (comment)

@johnnynunez
Author

johnnynunez commented Jan 25, 2025

Well, sure... PyTorch 2.6 is coming this week: M6: Release Day (1/29/25).
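
Once that release lands, gating the new arch flags could look something like this sketch; the helper name and the `(2, 6)` threshold encode this thread's expectation, not a confirmed support matrix:

```python
# Sketch only: gate the new arch flags on the torch 2.6 release this
# thread is waiting for. The >= (2, 6) threshold reflects the thread's
# expectation, not a confirmed support matrix.
import torch

def torch_supports_blackwell_codegen() -> bool:
    base = torch.__version__.split("+")[0]  # drop local tags like "+cu126"
    major, minor = (int(x) for x in base.split(".")[:2])
    return (major, minor) >= (2, 6)

print("enable sm_100/sm_120 codegen:", torch_supports_blackwell_codegen())
```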

@ghostplant

Is there a prebuilt that works on B200?

@YavorGIvanov

YavorGIvanov commented Apr 16, 2025

What performance improvement should we expect out of the box on B200 compared to H100 SXM5 for different model sizes (8B, 70B, 400B)? I expected some benefit even at 8B (e.g. 30% at low batch sizes), but I am seeing no benefit with Llama 8B.

Also, is there any planned or in-progress work on flashinfer utilizing B200-specific capabilities (e.g. the Tensor Memory Accelerator)?
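
For anyone trying to reproduce the comparison, a generic attention microbenchmark along these lines (torch SDPA, not flashinfer's own kernels; the shapes are assumptions loosely sized to one Llama-8B attention layer) can be run unchanged on both H100 and B200:

```python
# Generic attention microbenchmark (torch SDPA), not flashinfer's kernels;
# shapes are assumptions loosely sized to one Llama-8B attention layer.
import time
import torch
import torch.nn.functional as F

def bench_sdpa(batch=8, heads=32, seq=2048, head_dim=128, iters=50) -> float:
    q = torch.randn(batch, heads, seq, head_dim, device="cuda", dtype=torch.bfloat16)
    k, v = torch.randn_like(q), torch.randn_like(q)
    for _ in range(5):  # warmup
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3  # mean ms per call

if torch.cuda.is_available():
    print(f"SDPA mean latency: {bench_sdpa():.3f} ms")
```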
