Fix bug for asort kernel & faster sampler with GPU sorting #2730

Open

guoqingbao wants to merge 1 commit into main
Conversation

guoqingbao
Contributor

This PR fixes an issue in the asort kernel that could cause a CUDA invalid-argument error when the last dimension to be sorted exceeded the maximum number of threads allowed per CUDA block. With the revised asort kernel, part of the logit sampling (specifically, the sorting step) can now be performed on the GPU. This optimization improves text generation performance by up to 25%, depending on the model and the number of tokens generated.
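For context, here is a minimal, hypothetical sketch of the CPU side of such a sampler. It assumes the probabilities have already been sorted in descending order on the GPU (the step the fixed asort kernel enables) and then applies the top-p cut and the final draw on the host. The function name `sample_top_p`, the std-only `uniform` parameter, and the overall structure are illustrative assumptions, not this PR's actual implementation.

```rust
/// Sketch of CPU-side top-p (nucleus) selection over probabilities that were
/// already sorted in descending order (in this PR, the sort itself runs on the
/// GPU via the fixed asort kernel). Names and structure are illustrative only.
///
/// `sorted_probs` is descending; `sorted_ids[i]` is the original token id of
/// `sorted_probs[i]`; `uniform` is a random draw in [0, 1).
fn sample_top_p(sorted_probs: &[f32], sorted_ids: &[u32], top_p: f32, uniform: f32) -> u32 {
    // Keep the smallest prefix whose cumulative probability reaches top_p.
    let mut cumulative = 0.0f32;
    let mut cutoff = sorted_probs.len();
    for (i, &p) in sorted_probs.iter().enumerate() {
        cumulative += p;
        if cumulative >= top_p {
            cutoff = i + 1;
            break;
        }
    }

    // Renormalize the kept prefix and pick the token where the scaled
    // uniform draw falls.
    let kept = &sorted_probs[..cutoff];
    let total: f32 = kept.iter().sum();
    let mut draw = uniform * total;
    for (i, &p) in kept.iter().enumerate() {
        if draw < p {
            return sorted_ids[i];
        }
        draw -= p;
    }
    sorted_ids[cutoff - 1]
}

fn main() {
    // Toy input: four tokens already sorted by probability on the GPU side.
    let sorted_probs = [0.5f32, 0.3, 0.15, 0.05];
    let sorted_ids = [42u32, 7, 3, 99];
    // With top_p = 0.8, only the first two tokens remain candidates.
    println!("{}", sample_top_p(&sorted_probs, &sorted_ids, 0.8, 0.9)); // prints 7
    println!("{}", sample_top_p(&sorted_probs, &sorted_ids, 0.8, 0.1)); // prints 42
}
```

The expensive part is the sort over the vocabulary-sized logits, which is what moves to the GPU here; the remaining prefix scan and draw are linear and cheap on the host.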

Test Cases

I conducted tests using the following models and commands:

GLM4-9B

cargo run --release --example glm4 --features cuda -- \
  --weight-path /home/glm-4-9b-chat/ \
  --prompt "Please talk about deep learning." \
  --sample-len 2000

Note: By default, the GLM4 model uses sampling parameters of temperature=0.8 and top_p=0.8.

LLaMa3.1 8B

cargo run --release --example llama --features cuda -- \
  --weight-path /home/Meta-Llama-3.1-8B-Instruct/ \
  --prompt "Please talk about deep learning." \
  --temperature 0.7 \
  --top-p 0.8 \
  --sample-len 2000

Performance Results

Original Implementation (Sampling Entirely on CPU)

GLM4-9B:

  • 100 tokens generated (26.52 tokens/s)
  • 500 tokens generated (25.43 tokens/s)
  • 2000 tokens generated (20.41 tokens/s)

LLaMa3.1 8B:

  • 100 tokens generated (40.14 tokens/s)
  • 500 tokens generated (38.62 tokens/s)
  • 607 tokens generated (38.28 tokens/s)
  • 2000 tokens generated (34.12 tokens/s)

After asort Kernel Fix & Faster Sampler with GPU Sort

GLM4-9B:

  • 100 tokens generated (29.93 tokens/s)
  • 500 tokens generated (28.52 tokens/s)
  • 2000 tokens generated (22.30 tokens/s)

LLaMa3.1 8B:

  • 100 tokens generated (50.68 tokens/s)
  • 500 tokens generated (48.81 tokens/s)
  • 712 tokens generated (47.88 tokens/s)
  • 2000 tokens generated (41.84 tokens/s)

Note: You can adjust the temperature parameter to influence the model's behavior, and increase sample-len to generate more than 2000 tokens for the above comparisons. While sample-len sets the maximum number of tokens that can be generated, the actual output length may vary depending on the model, the sampling process, and the sampling parameters.
