Support Qwen3 w/ fp32 on GPU #634


Merged: 4 commits merged into huggingface:main on Jun 13, 2025

Conversation

@kozistr (Contributor) commented on Jun 12, 2025

What does this PR do?

The previous PRs introduced Qwen3 with FA2 support and with CPU + MPS support. To fully cover the remaining use cases, I've implemented fp32 support on GPU, so that a relatively small model, such as Qwen3-Embedding-0.6B, can run on GPUs where FA2 cannot be utilized.

  • Support Qwen3 w/ fp32 on GPU.
  • Update the MTEB rank to the latest (based on MTEB multilingual v2, as of 2025-06-13).
  • Due to the large maximum sequence length (32K), TEI can fail to load Qwen3 w/ fp32: TEI warms up the model with sequences of the maximum sequence length, so the warm-up itself can run out of GPU memory. Be aware of this behavior and cap --max-batch-tokens accordingly; see the rough estimate below.
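
A rough estimate of why the full-length warm-up fails (assuming the non-FA2 attention path materializes the full seq_len x seq_len score matrix, which is exactly what FA2 avoids): at the 32K maximum sequence length, a single fp32 score matrix is 32768 * 32768 * 4 bytes ≈ 4 GiB per attention head, before counting weights or any other activations, so warming up at full length is infeasible on most GPUs. Capping --max-batch-tokens (1024 in the runs below) keeps the warm-up batch small enough to load: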
$ ./target/release/text-embeddings-router --model-id ../Qwen3-Embedding-0.6B --port 8080 --dtype float32 --max-batch-tokens 1024
2025-06-12T15:56:39.414201Z  INFO text_embeddings_router: router/src/main.rs:189: Args { model_id: "../Qwe**-*********-0.6B", revision: None, tokenization_workers: None, dtype: Some(Float32), pooling: None, max_concurrent_requests: 512, max_batch_tokens: 1024, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hf_token: Some("hf_z******************************ORb"), hostname: "r-kozistr-grant-org-tei-qbn72b3s-b7796-utf65", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-06-12T15:56:39.711754Z  WARN text_embeddings_router: router/src/lib.rs:189: Could not find a Sentence Transformers config
2025-06-12T15:56:39.711782Z  INFO text_embeddings_router: router/src/lib.rs:193: Maximum number of tokens per request: 32768
2025-06-12T15:56:39.712013Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 8 tokenization workers
2025-06-12T15:56:39.950145Z  INFO text_embeddings_router: router/src/lib.rs:235: Starting model backend
2025-06-12T15:56:40.296936Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:460: Starting Qwen3 model on Cuda(CudaDevice(DeviceId(1)))
2025-06-12T15:56:40.702414Z  INFO text_embeddings_router: router/src/lib.rs:252: Warming up model
2025-06-12T15:56:40.833718Z  WARN text_embeddings_router: router/src/lib.rs:311: Invalid hostname, defaulting to 0.0.0.0
2025-06-12T15:56:40.835600Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1847: Starting HTTP server: 0.0.0.0:8080
2025-06-12T15:56:40.835614Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1848: Ready
2025-06-12T15:57:32.054742Z  INFO embed{total_time="18.453313ms" tokenization_time="288.823µs" queue_time="274.272µs" inference_time="17.822086ms"}: text_embeddings_router::http::server: router/src/http/server.rs:730: Success
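
For reference, the Success line above came from a plain embed request. A minimal sketch of such a request against the running server (the input string here is arbitrary; the payload shape follows TEI's /embed HTTP API):

$ curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": "What is Deep Learning?"}' \
    -H 'Content-Type: application/json'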

Double-checked that it also works on MPS:

  text-embeddings-inference git:(feature/qwen3-for-cuda-fp32) ./target/release/text-embeddings-router --model-id ../Qwen/Qwen3-Embedding-0.6B --dtype float32 --port 8080 --max-batch-tokens 1024
2025-06-13T01:22:27.836386Z  INFO text_embeddings_router: router/src/main.rs:189: Args { model_id: "../Qwe*/*****-*********-0.6B", revision: None, tokenization_workers: None, dtype: Some(Float32), pooling: None, max_concurrent_requests: 512, max_batch_tokens: 1024, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hf_token: None, hostname: "0.0.0.0", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-06-13T01:22:27.839102Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:20: Starting download
2025-06-13T01:22:27.839112Z  INFO download_artifacts:download_pool_config: text_embeddings_core::download: core/src/download.rs:53: Downloading `1_Pooling/config.json`
2025-06-13T01:22:30.060741Z  INFO download_artifacts:download_new_st_config: text_embeddings_core::download: core/src/download.rs:77: Downloading `config_sentence_transformers.json`
2025-06-13T01:22:30.541269Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:40: Downloading `config.json`
2025-06-13T01:22:30.979441Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:43: Downloading `tokenizer.json`
2025-06-13T01:22:32.367068Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:47: Model artifacts downloaded in 4.527977792s
2025-06-13T01:22:32.481927Z  WARN text_embeddings_router: router/src/lib.rs:189: Could not find a Sentence Transformers config
2025-06-13T01:22:32.481940Z  INFO text_embeddings_router: router/src/lib.rs:193: Maximum number of tokens per request: 32768
2025-06-13T01:22:32.482171Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 16 tokenization workers
2025-06-13T01:22:32.553883Z  INFO text_embeddings_router: router/src/lib.rs:235: Starting model backend
2025-06-13T01:22:32.553898Z  INFO text_embeddings_backend: backends/src/lib.rs:493: Downloading `model.safetensors`
2025-06-13T01:23:52.047395Z  INFO text_embeddings_backend: backends/src/lib.rs:377: Model weights downloaded in 79.493807s
2025-06-13T01:23:52.110863Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:279: Starting Qwen3 model on Metal(MetalDevice(DeviceId(1)))
2025-06-13T01:23:54.308240Z  INFO text_embeddings_router: router/src/lib.rs:252: Warming up model
2025-06-13T01:23:54.985485Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1847: Starting HTTP server: 0.0.0.0:8080
2025-06-13T01:23:54.985495Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1848: Ready
2025-06-13T01:31:41.042091Z  INFO embed{total_time="110.074166ms" tokenization_time="711.958µs" queue_time="130.25µs" inference_time="109.175958ms"}: text_embeddings_router::http::server: router/src/http/server.rs:730: Success

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@Narsil, @alvarobartt

@kozistr marked this pull request as ready for review on Jun 12, 2025 16:02
@kozistr changed the title from "Support Qwen3 for CUDA w/ fp32" to "Support Qwen3 w/ fp32 on GPU" on Jun 12, 2025
@Narsil (Collaborator) left a comment:

LGTM, thanks for this!

@Narsil merged commit 60f0378 into huggingface:main on Jun 13, 2025
@kozistr deleted the feature/qwen3-for-cuda-fp32 branch on Jun 13, 2025 12:39