docs/supported_models/multimodal_language_models.md (+1 −1)
@@ -37,5 +37,5 @@ in the GitHub search bar.
|**Gemma 3 (Multimodal)**|`google/gemma-3-4b-it`|`gemma-it`| Gemma 3's larger models (4B, 12B, 27B) accept images (each image encoded as 256 tokens) alongside text in a combined 128K-token context. |
|**Kimi-VL** (A3B) |`moonshotai/Kimi-VL-A3B-Instruct`|`kimi-vl`| Kimi-VL is a multimodal model that can understand and generate text from images. |
|**Mistral-Small-3.1-24B**|`mistralai/Mistral-Small-3.1-24B-Instruct-2503`|`mistral`| Mistral 3.1 is a multimodal model that can generate text from text or image input. It also supports tool calling and structured output. |
-|**Phi-4-multimodal-instruct**|`microsoft/Phi-4-multimodal-instruct`|`phi-4-mm`| Phi-4-multimodal-instruct is the multimodal variant of the Phi-4-mini model, enhanced with LoRA for improved multimodal capabilities. Currently, it supports only textand vision modalities in SGLang. |
+|**Phi-4-multimodal-instruct**|`microsoft/Phi-4-multimodal-instruct`|`phi-4-mm`| Phi-4-multimodal-instruct is the multimodal variant of the Phi-4-mini model, enhanced with LoRA for improved multimodal capabilities. It supports text, vision and audio modalities in SGLang. |
|**MiMo-VL** (7B) |`XiaomiMiMo/MiMo-VL-7B-RL`|`mimo-vl`| Xiaomi's compact yet powerful vision-language model featuring a native resolution ViT encoder for fine-grained visual details, an MLP projector for cross-modal alignment, and the MiMo-7B language model optimized for complex reasoning tasks. |
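As a rough illustration of how a model from this table might be served (a sketch only: the model path and chat-template value are taken from the Gemma 3 row above, and exact flags can vary across SGLang versions):

```shell
# Launch an SGLang server for a multimodal model from the table.
# The --chat-template value corresponds to the table's third column.
python -m sglang.launch_server \
  --model-path google/gemma-3-4b-it \
  --chat-template gemma-it \
  --port 30000
```

Once the server is up, image-plus-text requests can be sent to its OpenAI-compatible endpoint.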