Support Qwen3 w/ fp32 on GPU #634


Merged: 4 commits merged into huggingface:main on Jun 13, 2025

Conversation

@kozistr (Contributor) commented on Jun 12, 2025

What does this PR do?

The previous PRs introduced Qwen3 with FA2 support and with CPU + MPS support. To fully cover the remaining use cases, I've implemented fp32 support on GPU, so that a relatively small model, such as Qwen3-Embedding-0.6B, can run on GPUs where FA2 cannot be utilized.

  • Support Qwen3 w/ fp32 on GPU.
  • Update the MTEB rank to the latest (based on MTEB multilingual v2, as of 2025-06-13).
  • Due to the large maximum sequence length (32K), TEI can fail to load Qwen3 w/ fp32: TEI warms up the model with sequences of the maximum sequence length, so the warm-up itself can run out of GPU memory. Be aware of this behavior and cap --max-batch-tokens accordingly; see the rough estimate below.
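
A rough estimate of why the full-length warm-up fails (assuming the non-FA2 attention path materializes the full seq_len x seq_len score matrix, which is exactly what FA2 avoids): at the 32K maximum sequence length, a single fp32 score matrix is 32768 * 32768 * 4 bytes ≈ 4 GiB per attention head, before counting weights or any other activations, so warming up at full length is infeasible on most GPUs. Capping --max-batch-tokens (1024 in the runs below) keeps the warm-up batch small enough to load: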
$ ./target/release/text-embeddings-router --model-id ../Qwen3-Embedding-0.6B --port 8080 --dtype float32 --max-batch-tokens 1024
2025-06-12T15:56:39.414201Z  INFO text_embeddings_router: router/src/main.rs:189: Args { model_id: "../Qwe**-*********-0.6B", revision: None, tokenization_workers: None, dtype: Some(Float32), pooling: None, max_concurrent_requests: 512, max_batch_tokens: 1024, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hf_token: Some("hf_z******************************ORb"), hostname: "r-kozistr-grant-org-tei-qbn72b3s-b7796-utf65", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-06-12T15:56:39.711754Z  WARN text_embeddings_router: router/src/lib.rs:189: Could not find a Sentence Transformers config
2025-06-12T15:56:39.711782Z  INFO text_embeddings_router: router/src/lib.rs:193: Maximum number of tokens per request: 32768
2025-06-12T15:56:39.712013Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 8 tokenization workers
2025-06-12T15:56:39.950145Z  INFO text_embeddings_router: router/src/lib.rs:235: Starting model backend
2025-06-12T15:56:40.296936Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:460: Starting Qwen3 model on Cuda(CudaDevice(DeviceId(1)))
2025-06-12T15:56:40.702414Z  INFO text_embeddings_router: router/src/lib.rs:252: Warming up model
2025-06-12T15:56:40.833718Z  WARN text_embeddings_router: router/src/lib.rs:311: Invalid hostname, defaulting to 0.0.0.0
2025-06-12T15:56:40.835600Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1847: Starting HTTP server: 0.0.0.0:8080
2025-06-12T15:56:40.835614Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1848: Ready
2025-06-12T15:57:32.054742Z  INFO embed{total_time="18.453313ms" tokenization_time="288.823µs" queue_time="274.272µs" inference_time="17.822086ms"}: text_embeddings_router::http::server: router/src/http/server.rs:730: Success
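
For reference, the Success line above came from a plain embed request. A minimal sketch of such a request against the running server (the input string here is arbitrary; the payload shape follows TEI's /embed HTTP API):

$ curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": "What is Deep Learning?"}' \
    -H 'Content-Type: application/json'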

Double-checked that it also works on MPS:

  text-embeddings-inference git:(feature/qwen3-for-cuda-fp32) ./target/release/text-embeddings-router --model-id ../Qwen/Qwen3-Embedding-0.6B --dtype float32 --port 8080 --max-batch-tokens 1024
2025-06-13T01:22:27.836386Z  INFO text_embeddings_router: router/src/main.rs:189: Args { model_id: "../Qwe*/*****-*********-0.6B", revision: None, tokenization_workers: None, dtype: Some(Float32), pooling: None, max_concurrent_requests: 512, max_batch_tokens: 1024, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hf_token: None, hostname: "0.0.0.0", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-06-13T01:22:27.839102Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:20: Starting download
2025-06-13T01:22:27.839112Z  INFO download_artifacts:download_pool_config: text_embeddings_core::download: core/src/download.rs:53: Downloading `1_Pooling/config.json`
2025-06-13T01:22:30.060741Z  INFO download_artifacts:download_new_st_config: text_embeddings_core::download: core/src/download.rs:77: Downloading `config_sentence_transformers.json`
2025-06-13T01:22:30.541269Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:40: Downloading `config.json`
2025-06-13T01:22:30.979441Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:43: Downloading `tokenizer.json`
2025-06-13T01:22:32.367068Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:47: Model artifacts downloaded in 4.527977792s
2025-06-13T01:22:32.481927Z  WARN text_embeddings_router: router/src/lib.rs:189: Could not find a Sentence Transformers config
2025-06-13T01:22:32.481940Z  INFO text_embeddings_router: router/src/lib.rs:193: Maximum number of tokens per request: 32768
2025-06-13T01:22:32.482171Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 16 tokenization workers
2025-06-13T01:22:32.553883Z  INFO text_embeddings_router: router/src/lib.rs:235: Starting model backend
2025-06-13T01:22:32.553898Z  INFO text_embeddings_backend: backends/src/lib.rs:493: Downloading `model.safetensors`
2025-06-13T01:23:52.047395Z  INFO text_embeddings_backend: backends/src/lib.rs:377: Model weights downloaded in 79.493807s
2025-06-13T01:23:52.110863Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:279: Starting Qwen3 model on Metal(MetalDevice(DeviceId(1)))
2025-06-13T01:23:54.308240Z  INFO text_embeddings_router: router/src/lib.rs:252: Warming up model
2025-06-13T01:23:54.985485Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1847: Starting HTTP server: 0.0.0.0:8080
2025-06-13T01:23:54.985495Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1848: Ready
2025-06-13T01:31:41.042091Z  INFO embed{total_time="110.074166ms" tokenization_time="711.958µs" queue_time="130.25µs" inference_time="109.175958ms"}: text_embeddings_router::http::server: router/src/http/server.rs:730: Success

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@Narsil, @alvarobartt

@kozistr marked this pull request as ready for review on Jun 12, 2025 16:02
@kozistr changed the title from "Support Qwen3 for CUDA w/ fp32" to "Support Qwen3 w/ fp32 on GPU" on Jun 12, 2025
@Narsil (Collaborator) left a comment:

LGTM, thanks for this!

@Narsil merged commit 60f0378 into huggingface:main on Jun 13, 2025
@kozistr deleted the feature/qwen3-for-cuda-fp32 branch on Jun 13, 2025 12:39