add openvino VLM blog post #3071
---
title: "Get your VLM running in 3 simple steps"
thumbnail: /blog/assets/optimum_intel/intel_thumbnail.png
authors:
- user: ezelanza
  guest: true
  org: Intel
- user: helenai
  guest: true
  org: Intel
- user: nikita-savelyev-intel
  guest: true
  org: Intel
- user: echarlaix
- user: IlyasMoutawwakil
---
# Get your VLM running in 3 simple steps on Intel CPU/GPU
Teaser: Run a Vision Language Model (VLM) locally in three steps, with no need for expensive cloud infrastructure or high-end compute devices. SmolVLM + Optimum Intel + OpenVINO make it possible, and even let you accelerate on an iGPU or a discrete Intel GPU.
As large language models (LLMs) and chatbots become more capable, AI is moving beyond text and can now understand images and videos as well. Vision Language Models (VLMs) enable tasks like describing scenes, generating captions, answering questions about images, and many others.
Early models like [Flamingo](https://arxiv.org/abs/2204.14198) and [Idefics](https://huggingface.co/blog/idefics) showed what was possible, but both weighed in at around 80B parameters. More recently, much smaller models have emerged, like [PaliGemma 2 (3B)](https://huggingface.co/google/paligemma2-3b-pt-224), [moondream2 (2B)](https://huggingface.co/vikhyatk/moondream2), or [Qwen2.5-VL (7B)](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), but even these “small” versions can be tough to run locally because they still inherit much of the memory and compute demands of their larger predecessors.
Running AI models locally is still a challenge, but also a huge opportunity. Local inference keeps your data private, gives you fast responses without internet latency, avoids cloud costs, and lets you run and tweak models offline, with full control.
That’s where tools like [Hugging Face Optimum](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/llm-inference-hf.html), [OpenVINO](https://github.com/openvinotoolkit/openvino), and the lightweight [SmolVLM](https://huggingface.co/blog/smolvlm) model come in. In this post, we’ll show you how to get a VLM running locally in just three simple steps, with no expensive hardware or discrete GPU required (though it can also take advantage of Intel GPUs).
## What is a VLM?
Let’s first recap: A Vision Language Model (VLM) can understand both text and images. Instead of just reading or writing text, it can also “see” pictures, so you can ask it to describe a photo, answer a question about an image, or generate a caption. It’s like giving your LLM eyes.
<figure style="width: 700px; margin: 0 auto;">
  <img src="https://huggingface.co/datasets/openvino/documentation/resolve/main/blog/openvino_vlm/chat1.png">
</figure>
It’s impressive, but not exactly accessible to run. Take [CogVLM](https://github.com/THUDM/CogVLM), for example: a powerful open-source vision-language model with around 17 billion parameters (a 10B vision encoder plus a 7B language model) that can require [about 80 GB of RAM](https://inference.roboflow.com/foundation/cogvlm/) to run in full precision. Inference is also relatively slow: captioning a single image takes 10 to 13 seconds on an NVIDIA T4 GPU ([Roboflow benchmark](https://inference.roboflow.com/foundation/cogvlm/)). Users attempting to run CogVLM on CPUs have reported crashes or memory errors even with 64 GB of RAM, highlighting its impracticality for typical local deployment ([GitHub issue](https://github.com/THUDM/CogVLM/issues/162)). CogVLM is just one example, but most recent VLMs, even the smaller ones, face the same challenge.
In contrast, SmolVLM is purpose-built for low-resource environments, making it a highly efficient option for deploying vision-language models on laptops or edge devices. Released by Hugging Face in 2024, SmolVLM addresses the growing need for multimodal AI that runs locally without requiring high-end GPUs or cloud infrastructure. As vision-language models become essential in areas like accessibility, robotics, and on-device assistants, SmolVLM offers a path to efficient, privacy-preserving inference at the edge.
Architecturally, SmolVLM pairs a lightweight vision encoder with a compact language decoder. This modular design enables it to interpret both images and text.
<figure style="width: 700px; margin: 0 auto;">
  <img src="https://huggingface.co/datasets/openvino/documentation/resolve/main/blog/openvino_vlm/smolvlm.png" width=700>
  <figcaption style="text-align: center;">
    SmolVLM architecture (<b><i>Source: <a href="https://huggingface.co/blog/smolvlm#what-is-smolvlm">SmolVLM - small yet mighty Vision Language Model</a></i></b>).
  </figcaption>
</figure>
It offers a lightweight, efficient solution for running image-and-text models directly on laptops or edge devices.
## Hugging Face Optimum
Even though SmolVLM was designed for low resource consumption, there’s still room for improvement: the model can be further compressed and optimized for your own hardware. However, if you’ve tried to optimize a model yourself, you probably know it’s not a trivial task.

This is where [Optimum Intel for OpenVINO](https://huggingface.co/docs/optimum-intel/en/index) ([repo](https://github.com/huggingface/optimum-intel)) comes in. It acts as a bridge between Hugging Face libraries (including [Transformers](https://huggingface.co/docs/transformers/en/index), [Diffusers](https://huggingface.co/docs/diffusers/index), [timm](https://huggingface.co/docs/timm/index), and [sentence-transformers](https://huggingface.co/sentence-transformers)) and Intel’s optimization tools, making it easy to accelerate end-to-end pipelines on Intel hardware.
Before using it, the very first step is to install the library:

```bash
pip install optimum-intel[openvino]
```
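Optionally, you can check which devices the OpenVINO runtime can see on your machine before going further. This is a small sanity-check sketch using the standard `openvino` Python API, not part of the three steps themselves:

```python
import openvino as ov

# List the devices OpenVINO can target on this machine, e.g. ['CPU', 'GPU'].
core = ov.Core()
print(core.available_devices)
```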
By using Optimum with OpenVINO, you gain several benefits out of the box, such as faster inference and lower memory and storage usage. But you can go even further: quantization can reduce the model size and resource consumption even more. While quantization often requires deep expertise, Optimum simplifies the process, making it much more accessible.
Now let’s see how to run SmolVLM.
## Step 1: Convert your model to the OpenVINO IR

First, you will need to convert your model to the OpenVINO IR (Intermediate Representation). There are multiple ways to do it:
1. You can use the [Optimum CLI](https://huggingface.co/docs/optimum-intel/en/openvino/export#using-the-cli):

```bash
optimum-cli export openvino -m HuggingFaceTB/SmolVLM2-256M-Video-Instruct smolvlm_ov/
```
2. Or you can convert it [on the fly](https://huggingface.co/docs/optimum-intel/en/openvino/export#when-loading-your-model) when loading your model:

```python
from optimum.intel import OVModelForVisualCausalLM

model_id = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
model = OVModelForVisualCausalLM.from_pretrained(model_id)
model.save_pretrained("smolvlm_ov")
```
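In both cases, the OpenVINO IR files end up in the `smolvlm_ov/` directory. Once exported, you can reload the converted model directly from that folder, so the conversion only happens once; a minimal sketch:

```python
from optimum.intel import OVModelForVisualCausalLM

# Loading from a directory that already contains OpenVINO IR files
# skips the conversion step entirely.
model = OVModelForVisualCausalLM.from_pretrained("smolvlm_ov")
```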
## Step 2: Quantization

Now it’s time to optimize the model for efficient execution using **quantization**. Quantization reduces the precision of the model weights and/or activations, leading to smaller, faster models.

Essentially, it’s a way to map values from a high-precision data type, such as 32-bit floating-point numbers (FP32), to a lower-precision format, typically 8-bit integers (INT8). While this process offers several key benefits, it can also lead to a potential loss of accuracy.
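To make this mapping concrete, here is a toy sketch of symmetric INT8 quantization of a few weight values. The scale formula is the common textbook choice and is only meant to illustrate the idea, not the exact scheme OpenVINO applies internally:

```python
import numpy as np

# A few toy FP32 weights
w = np.array([0.42, -1.30, 0.07, 2.55], dtype=np.float32)

# Symmetric quantization: one scale maps the FP32 range onto [-127, 127]
scale = np.abs(w).max() / 127.0
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantizing shows the small rounding error that quantization introduces
w_dequant = w_int8.astype(np.float32) * scale
print(w_int8)     # [ 21 -65   3 127]
print(w_dequant)  # close to the original values, but not identical
```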
<figure style="width: 800px; margin: 0 auto;">
  <img src="https://huggingface.co/datasets/openvino/documentation/resolve/main/blog/openvino_vlm/quantization.png">
</figure>
Optimum supports two main post-training quantization methods:

- Weight Only Quantization
- Static Quantization

Let’s explore each of them.
### Option 1: Weight Only Quantization

Weight-only quantization means that only the weights are quantized while activations remain in their original precision. To explain this process, let’s imagine preparing for a long backpacking trip. To reduce weight, you replace bulky items like full-size shampoo bottles with compact travel-sized versions. This is like weight-only quantization, where the model’s weights are compressed from 32-bit floating-point numbers to 8-bit integers, reducing the model’s memory footprint.

However, the “interactions” during the trip, like drinking water, remain unchanged. This is similar to what happens to activations, which stay in high precision (FP32 or BF16) to preserve accuracy during computation.
As a result, the model becomes smaller and more memory-efficient, improving loading times. But since activations are not quantized, inference speed gains are limited. That said, since OpenVINO 2024.3, if the model's weights have been quantized, the corresponding activations will also be quantized at runtime, leading to an additional speedup depending on the device.
Weight-only quantization is a simple first step since it usually doesn’t result in significant accuracy degradation. To run it, you will need to create a quantization configuration using Optimum’s `OVWeightQuantizationConfig` as follows:
```python
from optimum.intel import OVModelForVisualCausalLM, OVWeightQuantizationConfig

q_config = OVWeightQuantizationConfig(bits=8)
# Apply quantization and save the new model
q_model = OVModelForVisualCausalLM.from_pretrained(model_id, quantization_config=q_config)
q_model.save_pretrained("smolvlm_int8")
```
or equivalently using the CLI:

```bash
optimum-cli export openvino -m HuggingFaceTB/SmolVLM2-256M-Video-Instruct --weight-format int8 smolvlm_int8/
```
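If you want to compress the model further, the same configuration also supports 4-bit weight compression (`bits=4`, or `--weight-format int4` with the CLI). Treat this as an optional experiment rather than a default, since the accuracy/size trade-off depends on the model; a sketch:

```python
from optimum.intel import OVModelForVisualCausalLM, OVWeightQuantizationConfig

# 4-bit weight compression: smaller than INT8, but with a higher risk of
# accuracy degradation, so evaluate the result on your own task.
q4_config = OVWeightQuantizationConfig(bits=4)
q4_model = OVModelForVisualCausalLM.from_pretrained(model_id, quantization_config=q4_config)
q4_model.save_pretrained("smolvlm_int4")
```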
### Option 2: Static Quantization

With static quantization, the activations are quantized in addition to the weights. To achieve the best estimate for the activation quantization parameters, we perform a calibration step. This involves running inference on a small subset of our dataset, in our case using 50 samples of the [contextual dataset](https://huggingface.co/datasets/ucla-contextual/contextual_test).
```python
from optimum.intel import OVModelForVisualCausalLM, OVQuantizationConfig

q_config = OVQuantizationConfig(bits=8, dataset="contextual", num_samples=50)
q_model = OVModelForVisualCausalLM.from_pretrained(model_id, quantization_config=q_config)
q_model.save_pretrained("smolvlm_static_int8")
```
or equivalently with the CLI:

```bash
optimum-cli export openvino -m HuggingFaceTB/SmolVLM2-256M-Video-Instruct --quant-mode int8 --dataset contextual --num-samples 50 smolvlm_static_int8/
```
Quantizing activations adds small errors that can build up and affect accuracy, so careful testing afterward is important. It also helps to calibrate on a dataset as close as possible to your target task. More information and examples can be found in [our documentation](https://huggingface.co/docs/optimum-intel/en/openvino/optimization#pipeline-quantization).
## Step 3: Run inference

You can now run inference with your quantized model.
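The generation call below assumes you already have a `processor` and `inputs` prepared with the model’s Transformers processor. Here is a minimal sketch of that preparation; the image URL and prompt are placeholders to adapt to your own use case:

```python
import requests
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-256M-Video-Instruct")

# Placeholder image: any local file or URL works here.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style prompt containing one image and one question.
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Can you describe this image?"}]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
```

With `inputs` in place, generation with the quantized model looks like this: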
```python
generated_ids = q_model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])
```
If you have a recent Intel laptop, Intel AI PC, or Intel discrete GPU, you can load the model on GPU by adding `device="gpu"` when loading your model:

```python
model = OVModelForVisualCausalLM.from_pretrained(model_id, device="gpu")
```
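Since quantization can introduce small errors, a quick sanity check is to compare the quantized model’s answer with the original model’s answer on the same inputs. A rough sketch, reusing `model` from Step 1 and `q_model` from Step 2 (not a substitute for a proper evaluation):

```python
# Generate with both the full-precision and the quantized model
ref_ids = model.generate(**inputs, max_new_tokens=100)
quant_ids = q_model.generate(**inputs, max_new_tokens=100)

print("Original :", processor.batch_decode(ref_ids, skip_special_tokens=True)[0])
print("Quantized:", processor.batch_decode(quant_ids, skip_special_tokens=True)[0])
```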
Try the complete notebook [here](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/vision_language_quantization.ipynb).
## Evaluation and Conclusion

Multimodal AI is becoming more accessible thanks to smaller, optimized models like SmolVLM and tools such as Hugging Face Optimum and OpenVINO. While deploying vision-language models locally still presents challenges, this workflow shows that it’s possible to run lightweight image-and-text models on a wide range of hardware.
We ran a benchmark to show the impact of weight-only quantization on a [SmolVLM2-256M](https://huggingface.co/HuggingFaceTB/SmolVLM2-256M-Video-Instruct) model and how it performs on different Intel hardware. For this test, we used a single image.

We measured the following metrics to evaluate the model’s performance:
- Model size: how much storage space the model requires.
- Latency: the average time it took for the model to process an input.
- Image throughput: the rate at which the model can process images.
- Token throughput: the rate at which the model can process tokens.
<p align="center">
  <img src="https://huggingface.co/datasets/OpenVINO/documentation/resolve/main/blog/openvino_vlm/flower.png" alt="Pink flower with bee" width="700"/>
</p>
Here are the results across different Intel hardware:
| Device | Model Size (MB) (Before/After) | Images Throughput (im/s) (Before/After) | First Token Throughput (t/s) (Before/After) | Second Token Throughput (t/s) (Before/After) | Latency (s) (Before/After) |
|-------------|-------------------------------|-----------------------------------------|--------------------------------------------|---------------------------------------------|-----------------------------|
| CPU | 980.61 / 248 | 0.33 / 0.55 | 2.69 / 3.94 | 83.25 / 146.1 | 3.5249 / 2.1548 |
| iGPU | 980.61 / 248 | 0.58 / 0.53 | 5.01 / 5.26 | 51.62 / 49.56 | 2.1386 / 2.3182 |
| GPU (B580) | 980.61 / 248 | 15.75 / 15.01 | 34.51 / 27.54 | 149.79 / 120.91 | 0.2074 / 0.2376 |
| GPU (A770) | 980.61 / 248 | 10.68 / 10.89 | 16.57 / 15.79 | 83.01 / 69.1 | 0.3321 / 0.3403 |

The model size (980.61 MB before quantization, 248 MB after) is the same across all devices.
This benchmark shows that small, optimized multimodal models like [SmolVLM2-256M](https://huggingface.co/HuggingFaceTB/SmolVLM2-256M-Video-Instruct) can run efficiently on a variety of Intel hardware. Weight-only quantization significantly reduces model size, improving efficiency without majorly impacting throughput. GPUs deliver the highest image and token processing speeds, while CPUs and iGPUs remain viable for lighter workloads. Overall, this shows that lightweight vision-language models can be deployed locally with reasonable performance, making multimodal AI more accessible.
## Useful Links & Resources

- [Notebook](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/vision_language_quantization.ipynb)
- [Try our Space](https://huggingface.co/spaces/echarlaix/vision-langage-openvino)
- [Watch the webinar recording](https://web.cvent.com/event/d550a2a7-04f2-4a28-b641-3af228e318ca/regProcessStep1?utm_campaign=speakers4&utm_medium=organic&utm_source=Community)
- [Optimum Intel Documentation](https://huggingface.co/docs/optimum-intel/en/openvino/inference)