---
title: "Get your VLM running in 3 simple steps"
thumbnail: /blog/assets/optimum_intel/intel_thumbnail.png
authors:
- user: ezelanza
  guest: true
  org: Intel
- user: echarlaix
- user: helenai
  guest: true
  org: Intel
- user: nikita-savelyev-intel
  guest: true
  org: Intel
---

# Get your VLM running in 3 simple steps

Teaser: Run a Vision Language Model (VLM) locally in three steps, with no need for expensive cloud infrastructure or high-end compute devices. SmolVLM + Optimum Intel + OpenVINO makes it possible, with optional acceleration on an iGPU or an NPU.

As large language models (LLMs) and chatbots become more capable, AI is moving beyond text; it's now interpreting images and videos as well. This is where Vision Language Models (VLMs) come in, enabling tasks like describing scenes, generating captions, or answering questions about images.

Early models like [Flamingo](https://arxiv.org/abs/2204.14198) and [Idefics](https://huggingface.co/blog/idefics) showed what was possible, both demonstrating strong capabilities at around 80B parameters. More recently, we’ve seen smaller models emerge, such as [PaliGemma 3B](https://huggingface.co/google/paligemma-3b-pt-896?utm_source=chatgpt.com), [moondream2](https://www.analyticsvidhya.com/blog/2024/03/introducing-moondream2-a-tiny-vision-language-model/?utm_source=chatgpt.com), and the [Qwen2-VL models](https://nodeshift.com/blog/how-to-install-qwen2-5-vl-7b-instruct-locally?utm_source=chatgpt.com), but even these “small” versions can be tough to run locally because they still carry a lot of the memory and compute demands of their larger predecessors.

That’s why running AI models locally is still a challenge, but also a huge opportunity. Local inference keeps your data private, gives you fast responses without internet latency, avoids cloud costs, and lets you run and tweak models offline, with full control.

That’s where tools like [Hugging Face Optimum](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/llm-inference-hf.html), OpenVINO, and the lightweight [SmolVLM](https://huggingface.co/blog/smolvlm) model come in. In this post, we’ll show you how to get a VLM running locally in just three simple steps, with no expensive hardware or GPUs needed (though it can also run on Intel GPUs).

## What is a VLM

Let’s first recap: A Vision Language Model (VLM) can understand both text and images. Instead of just reading or writing text, it can also “see” pictures, so you can ask it to describe a photo, answer a question about an image, or generate a caption. It’s like giving your LLM eyes.

<figure style="width: 700px; margin: 0 auto;">
  <img src="https://huggingface.co/datasets/openvino/documentation/resolve/main/blog/openvino_vlm/chat1.png">
</figure>

It’s impressive, but not exactly accessible. Take [CogVLM](https://github.com/THUDM/CogVLM), for example: a powerful open-source vision-language model with around 17 billion parameters (a 10B vision encoder plus a 7B language model) that can require [about 80GB of RAM](https://inference.roboflow.com/foundation/cogvlm/) to run in full precision. Inference is also relatively slow: captioning a single image takes 10 to 13 seconds on an NVIDIA T4 GPU ([Roboflow benchmark](https://inference.roboflow.com/foundation/cogvlm/?utm_source=chatgpt.com)). Users attempting to run CogVLM on CPUs have reported crashes or memory errors even with 64 GB of RAM, highlighting its impracticality for typical local deployment ([GitHub issue](https://github.com/THUDM/CogVLM/issues/162)). CogVLM is just one example, but most VLMs, even the smaller ones, face similar challenges.

In contrast, SmolVLM is purpose-built for low-resource environments, making it a highly efficient option for deploying vision-language models on laptops or edge devices.
Launched by Hugging Face in 2024, SmolVLM addresses the growing need for multimodal AI that runs locally without requiring high-end GPUs or cloud infrastructure. As vision-language models become essential in areas like accessibility, robotics, and on-device assistants, SmolVLM offers a path to efficient, privacy-preserving inference at the edge.

Architecturally, SmolVLM pairs a lightweight vision encoder with a compact language decoder. This modular design enables it to interpret both images and text.

<figure style="width: 700px; margin: 0 auto;">
  <img src="https://huggingface.co/datasets/openvino/documentation/resolve/main/blog/openvino_vlm/smolvlm.png" width=700>
  <figcaption style="text-align: center;">
    SmolVLM architecture (<b><i>Source: <a href="https://huggingface.co/blog/smolvlm#what-is-smolvlm">SmolVLM - small yet mighty Vision Language Model</a></i></b>).
  </figcaption>
</figure>

It offers a lightweight, efficient solution for running image-and-text models directly on laptops or edge devices.

## Hugging Face Optimum

As mentioned, SmolVLM already offers a strong starting point for running multimodal models efficiently, but there’s still room for improvement: these models can be further compressed and optimized to run even better on local devices. If you’ve ever tried optimizing a model yourself, you know it’s not a trivial task.

This is where [Optimum Intel for OpenVINO](https://huggingface.co/docs/optimum-intel/en/index) ([repo](https://github.com/huggingface/optimum-intel)) comes in.
It acts as a bridge between Hugging Face libraries like [Transformers](https://huggingface.co/docs/transformers/en/index), [Diffusers](https://huggingface.co/docs/diffusers/index), [timm](https://huggingface.co/docs/timm/index), and [sentence-transformers](https://huggingface.co/sentence-transformers) and Intel’s optimization tools, making it easy to accelerate end-to-end pipelines on Intel hardware.

The very first step is to install the library:

```bash
pip install optimum-intel[openvino]
```

By using Optimum with OpenVINO, you get several benefits out of the box, like faster inference and lower memory and storage usage. But you can go even further: quantization can reduce the model size and resource consumption even more. While quantization often requires deep expertise, Optimum simplifies the process, making it much more accessible.

Let’s see how you can run SmolVLM.

## Step 1: Convert your model to the OpenVINO IR

First, you will need to convert your model to the OpenVINO Intermediate Representation (IR). There are multiple ways to do it:

1. You can use the [Optimum CLI](https://huggingface.co/docs/optimum-intel/en/openvino/export#using-the-cli):

```bash
optimum-cli export openvino -m HuggingFaceTB/SmolVLM2-256M-Video-Instruct smolvlm_ov/
```

2. Or you can convert it [on the fly](https://huggingface.co/docs/optimum-intel/en/openvino/export#when-loading-your-model) when loading your model:

```python
from optimum.intel import OVModelForVisualCausalLM

model_id = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
# The checkpoint is converted to the OpenVINO IR when loaded
model = OVModelForVisualCausalLM.from_pretrained(model_id)
model.save_pretrained("smolvlm_ov")
```

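Once the export is finished, the `smolvlm_ov/` folder contains the OpenVINO model files, and you can reload them later without converting again. A minimal sketch:

```python
from optimum.intel import OVModelForVisualCausalLM

# Reload the previously exported OpenVINO model from the local folder
model = OVModelForVisualCausalLM.from_pretrained("smolvlm_ov")
```
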
## Step 2: Quantization

Now it’s time to optimize the model for efficient execution using **quantization**. Quantization reduces the precision of the model weights and/or activations, leading to smaller, faster models.

Essentially, it's a way to map values from a high-precision data type, such as 32-bit floating-point numbers (FP32), to a lower-precision format, typically 8-bit integers (INT8). While this process offers several key benefits, it can also come at the cost of some accuracy.

<figure style="width: 800px; margin: 0 auto;">
  <img src="https://huggingface.co/datasets/openvino/documentation/resolve/main/blog/openvino_vlm/quantization.png">
</figure>

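To make the mapping concrete, here is a toy, hand-rolled illustration of the affine FP32-to-INT8 scheme (a scale and a zero-point). It is only a sketch of the principle; OpenVINO computes calibrated scales per tensor or per channel rather than this naive global one:

```python
import numpy as np

# Toy example: map a handful of FP32 values onto the signed INT8 grid
weights_fp32 = np.array([-1.7, -0.3, 0.0, 0.8, 2.4], dtype=np.float32)

scale = (weights_fp32.max() - weights_fp32.min()) / 255   # size of one INT8 step
zero_point = np.round(-128 - weights_fp32.min() / scale)  # integer that FP32 0.0 maps to

weights_int8 = np.clip(np.round(weights_fp32 / scale + zero_point), -128, 127).astype(np.int8)
dequantized = (weights_int8.astype(np.float32) - zero_point) * scale

print(weights_int8)   # 4x smaller storage than FP32
print(dequantized)    # close to, but not exactly, the original values
```
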
Optimum supports two main post-training quantization methods:

- Weight Only Quantization
- Static Quantization

Let’s explore each of them.

### Option 1: Weight Only Quantization

Weight-only quantization means that only the weights are quantized, while the activations are left in their original precision. To explain this process, let’s imagine preparing for a long backpacking trip. To reduce weight, you replace bulky items like full-size shampoo bottles with compact travel-sized versions. This is like weight-only quantization, where the model’s weights are compressed from 32-bit floating-point numbers to 8-bit integers, reducing the model’s memory footprint.

However, the “interactions” during the trip, like drinking water, remain unchanged. This is similar to what happens to activations, which stay in high precision (FP32 or BF16) to preserve accuracy during computation.

As a result, the model becomes smaller and more memory-efficient, improving loading times. But since activations are not quantized, inference speed gains are limited. Since OpenVINO 2024.3, if the model's weights have been quantized, the corresponding activations will also be quantized at runtime, leading to additional speedup depending on the device.

Weight-only quantization is a simple first step since it usually doesn’t result in significant accuracy degradation.
To apply it, create a quantization configuration using Optimum’s `OVWeightQuantizationConfig` as follows:

```python
from optimum.intel import OVModelForVisualCausalLM, OVWeightQuantizationConfig

# 8-bit weight-only quantization configuration
q_config = OVWeightQuantizationConfig(bits=8)
# Apply quantization and save the new model (model_id defined in Step 1)
q_model = OVModelForVisualCausalLM.from_pretrained(model_id, quantization_config=q_config)
q_model.save_pretrained("smolvlm_int8")
```

or equivalently using the CLI:

```bash
optimum-cli export openvino -m HuggingFaceTB/SmolVLM2-256M-Video-Instruct --weight-format int8 smolvlm_int8/
```

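A quick way to see what weight-only quantization buys you is to compare the size of the exported folders on disk. A small sketch, assuming you exported both `smolvlm_ov/` (Step 1) and `smolvlm_int8/` as above:

```python
from pathlib import Path

def folder_size_mb(path: str) -> float:
    # Total size of all files in an export folder, in megabytes
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file()) / 1e6

print(f"Default export: {folder_size_mb('smolvlm_ov'):.0f} MB")
print(f"INT8 export:    {folder_size_mb('smolvlm_int8'):.0f} MB")
```
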
### Option 2: Static Quantization

With static quantization, both the weights and the activations are quantized. This requires a calibration step, during which a subset of a dataset is used to estimate the activation ranges; ideally, the calibration samples should be as close as possible to your target task. In the following example, we use 50 samples from the [contextual dataset](https://huggingface.co/datasets/ucla-contextual/contextual_test) to perform this calibration step.

```python
from optimum.intel import OVModelForVisualCausalLM, OVQuantizationConfig

# Calibrate activation ranges on 50 samples of the "contextual" dataset
q_config = OVQuantizationConfig(bits=8, dataset="contextual", num_samples=50)
q_model = OVModelForVisualCausalLM.from_pretrained(model_id, quantization_config=q_config)
q_model.save_pretrained("smolvlm_static_int8")
```

or equivalently using the CLI:

```bash
optimum-cli export openvino -m HuggingFaceTB/SmolVLM2-256M-Video-Instruct --quant-mode int8 --dataset contextual --num-samples 50 smolvlm_static_int8/
```

Quantizing activations adds small errors that can build up and affect accuracy, so careful testing afterward is important. More information and examples can be found in [our documentation](https://huggingface.co/docs/optimum-intel/en/openvino/optimization#pipeline-quantization).

## Step 3: Run inference

You can now run inference with your quantized model:

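The `generate` call below assumes that `inputs` has been prepared with the model’s processor. Here is a minimal sketch of that preparation; the image URL is only an example, and any image and question will do:

```python
import requests
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-256M-Video-Instruct")

# Example image; replace with your own
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style prompt containing one image and one question
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Can you describe this image?"},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
```
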
```python
# Generate an answer for the image and prompt prepared above
generated_ids = q_model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])
```

If you have a recent Intel laptop, Intel AI PC, or Intel discrete GPU, you can load the model on the GPU by adding `device="gpu"` when loading your model:

```python
model = OVModelForVisualCausalLM.from_pretrained(model_id, device="gpu")
```

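If you are not sure which devices OpenVINO can use on your machine, you can list them with a quick check (this uses the `openvino` runtime installed alongside Optimum):

```python
import openvino as ov

# Prints the devices OpenVINO detects, e.g. ['CPU', 'GPU', 'NPU']
print(ov.Core().available_devices)
```
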
Try the complete notebook [here](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/vision_language_quantization.ipynb).

## Conclusion

Multimodal AI is becoming more accessible thanks to smaller, optimized models like **SmolVLM**, along with tools such as **Hugging Face Optimum** and **OpenVINO**. While deploying vision-language models locally still comes with challenges, this workflow shows that it's possible to run lightweight image-and-text models on modest hardware.

By combining quantization techniques with OpenVINO's inference engine, you can reduce memory and compute requirements significantly, making local deployment feasible for a wide range of applications. Whether you're experimenting, prototyping, or looking to deploy offline, this setup gives you a practical starting point.

As models and tooling continue to improve, so will the ability to run powerful multimodal systems without relying on the cloud.

## Useful Links & Resources

- [Notebook](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/vision_language_quantization.ipynb)
- [Try our Space](https://huggingface.co/spaces/echarlaix/vision-langage-openvino)
- Watch the webinar recording
- [Optimum Intel Documentation](https://huggingface.co/docs/optimum-intel/en/openvino/inference)

#### Notices and Disclaimers

Performance varies by use, configuration, and other factors. Learn more on the Performance Index site. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.