huggingface · echarlaix · Oct 15, 2025 · Sep 12, 2025 · Sep 16, 2025 · Sep 16, 2025
diff --git a/_blog.yml b/_blog.yml
@@ -6636,4 +6636,15 @@
     - transformers
     - pytorch
     - optimization
-    - guide
+    - guide
+
+- local: openvino-vlm
+  title: "Get your VLM running in 3 simple steps"
+  author: ezelanza
+  thumbnail: /blog/assets/optimum_intel/intel_thumbnail.png
+  date: Sep 15, 2025
+  tags:
+    - intel
+    - optimum
+    - quantization
+    - inference
diff --git a/openvino-vlm.md b/openvino-vlm.md
@@ -0,0 +1,191 @@
+---
+title: "Get your VLM running in 3 simple steps"
+thumbnail: /blog/assets/optimum_intel/intel_thumbnail.png
+authors:
+- user: ezelanza
+  guest: true
+  org: Intel
+- user: echarlaix
+- user: helenai
+  guest: true
+  org: Intel
+- user: nikita-savelyev-intel
+  guest: true
+  org: Intel
+---
+
+# Get your VLM running in 3 simple steps
+
+Teaser: Run a Vision Language Model (VLM) locally in three steps, no need for expensive cloud infrastructure or high-end compute devices. SmolVLM + Intel Optimum + OpenVINO makes it possible to accelerate on an iGPU or an NPU.
+
+As large language models (LLMs) and chatbots become more capable, AI is moving beyond text; it's now interpreting images and videos as well. This is where Vision Language Models (VLMs) come in, enabling tasks like describing scenes, generating captions, or answering questions about images.
+
+Early models like [Flamingo](https://arxiv.org/abs/2204.14198) and [Idefics](https://huggingface.co/blog/idefics) showed what was possible, both demonstrated capabilities with 80B parameters. More recently, we’ve seen small models emerge, [PaliGemma 3B](https://huggingface.co/google/paligemma-3b-pt-896?utm_source=chatgpt.com) , [moondream2](https://www.analyticsvidhya.com/blog/2024/03/introducing-moondream2-a-tiny-vision-language-model/?utm_source=chatgpt.com) , and [Qwen2-VL models](https://nodeshift.com/blog/how-to-install-qwen2-5-vl-7b-instruct-locally?utm_source=chatgpt.com)   but even these “small” versions can be tough to run locally because they still carry a lot of the memory and compute demands from their larger predecessors.
+
+That’s why running AI models locally is still a challenge, but also a huge opportunity. Local inference keeps your data private, gives you fast responses without internet latency, avoids cloud costs, and lets you run and tweak models offline, with full control.
+
+That’s where tools like Intel [Hugging Face Optimum](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/llm-inference-hf.html), OpenVINO, and the lightweight [SmolVLM](https://huggingface.co/blog/smolvlm) model come in. In this post, we’ll show you how to get a VLM running locally in just three simple steps, with no expensive hardware or GPUs needed (though it can also run on Intel GPUs).
+
+## What is a VLM
+
+Let’s first recap: A Vision Language Model (VLM) can understand both text and images. Instead of just reading or writing text, it can also “see” pictures, so you can ask it to describe a photo, answer a question about an image, or generate a caption. It’s like giving your LLM eyes.  
+
+<figure style="width: 700px; margin: 0 auto;">
+  <img src="https://huggingface.co/datasets/openvino/documentation/resolve/main/blog/openvino_vlm/chat1.png">
+</figure>
+
+It’s impressive, but not exactly accessible to use. Let’s take [CogVLM](https://github.com/THUDM/CogVLM), for example, it is a powerful open source vision-language model with around 17 billion parameters (10B vision encoder \+ 7B language model)  which can require [about 80GB of RAM](https://inference.roboflow.com/foundation/cogvlm/) to run the model in full precision. Inference is still relatively slow: captioning a single image takes 10 to 13 seconds on an NVIDIA T4 GPU ([RoboflowBenchmark](https://inference.roboflow.com/foundation/cogvlm/?utm_source=chatgpt.com)). Users attempting to run CogVLM on CPUs have reported crashes or memory errors even with 64 GB of RAM, highlighting its impracticality for typical local deployment ([GitHub Issue](https://github.com/THUDM/CogVLM/issues/162)), just to mention one model, this is the challenge faced recently with most small VLMs.
+
+In contrast, SmolVLM is purpose-built for low-resource environments, and it becomes a highly efficient solution for deploying vision-language models on laptops or edge devices.  
+Launched by Hugging Face in July 2024, SmolVLM addresses the growing need for multimodal AI that runs locally without requiring high-end GPUs or cloud infrastructure. As vision-language models become essential in areas like accessibility, robotics, and on-device assistants, SmolVLM offers a path to efficient, privacy-preserving inference at the edge.  
+Architecturally, SmolVLM pairs a lightweight vision encoder with a compact language decoder. This modular design enables it to interpret both images and text.
+
+<figure style="width: 700px; margin: 0 auto;">
+  <img src="https://huggingface.co/datasets/openvino/documentation/resolve/main/blog/openvino_vlm/smolvlm.png" width=700>
+  <figcaption style="text-align: center;">
+    SmolVLM architecture (<b><i>Source: <a href="https://huggingface.co/blog/smolvlm#what-is-smolvlm">SmolVLM - small yet mighty Vision Language Model</i></b></a>).
+  </figcaption>
+</figure>
+
+It offers a lightweight, efficient solution for running image-and-text models directly on laptops or edge devices.
+
+## Hugging Face Optimum
+
+As mentioned, SmolVLM offers a strong advantage for running multimodal models efficiently, but there’s still room for improvement. These models can be further compressed or optimized to run even more effectively on local devices. If you’ve tried optimizing a model by yourself, you probably know it’s not a trivial task.  
+This is where [Optimum Intel for OpenVINO](https://huggingface.co/docs/optimum-intel/en/index) ([repo](https://github.com/huggingface/optimum-intel)) comes in.  
+It acts as a bridge between Hugging Face libraries like [**Transformers**](https://huggingface.co/docs/transformers/en/index)**, [Diffusers](https://huggingface.co/docs/diffusers/index), [timm](https://huggingface.co/docs/timm/index), and [sentence-transformers](https://huggingface.co/sentence-transformers)**, and Intel’s optimization tools, making it easy to accelerate end-to-end pipelines on Intel hardware
+
+Before using it the very first step is to install the library.  
+```bash  
+pip install optimum-intel[openvino]  
+```
+
+By using Optimum with OpenVINO, you gain several benefits, like improving the inference time and lower memory/storage usage out of the box. But you can go even further: quantization can reduce the model size and resource consumption even more. While quantization often requires deep expertise, Optimum simplifies the process, making it much more accessible.
+
+Let’s see how you can run SmolVLM then.
+
+## Step 1: Convert your model to the OpenVINO IR
+
+First, you will need to convert your model to the OpenVINO IR. There are multiple options to do it:
+
+1. You can use the [Optimum CLI](https://huggingface.co/docs/optimum-intel/en/openvino/export#using-the-cli)
+
+```bash
+optimum-cli export openvino -m HuggingFaceTB/SmolVLM2-256M-Video-Instruct smolvlm_ov/
+```
+
+2. Or you can convert it [on the fly](https://huggingface.co/docs/optimum-intel/en/openvino/export#when-loading-your-model) when loading your model:
+
+
+```python
+from optimum.intel import OVModelForVisualCausalLM
+
+model_id = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
+model = OVModelForVisualCausalLM.from_pretrained(model_id)
+model.save_pretrained("smolvlm_ov")
+```
+
+## Step 2: Quantization
+
+Now it’s time to optimize the model for efficient execution using **quantization**. Quantization reduces the precision of the model weights and/or activations, leading to smaller, faster models.
+
+Essentially, it's a way to map values from a high-precision data type, such as 32-bit floating-point numbers (FP32), to a lower-precision format, typically 8-bit integers (INT8). While this process offers several key benefits, it can also impact in a potential loss of accuracy.
+
+<figure style="width: 800px; margin: 0 auto;">
+  <img src="https://huggingface.co/datasets/openvino/documentation/resolve/main/blog/openvino_vlm/quantization.png">
+</figure>
+
+Optimum supports two main post-training quantization:
+
+- Weight Only Quantization  
+- Static Quantization
+
+Let’s explore each of them.
+
+### Option 1: Weight Only Quantization
+
+Weight-only quantization means that only the weights are being quantized and leaving the activation in their original precisions. To explain this process, let’s imagine preparing for a long backpacking trip. To reduce weight, you replace bulky items like full-size shampoo bottles with compact travel-sized versions. This is like weight-only quantization, where the model’s weights are compressed from 32-bit floating-point numbers to 8-bit integers, reducing the model’s memory footprint.
+
+However, the “interactions” during the trip, like drinking water, remain unchanged. This is similar to what happens to activations, which stay in high precision (FP32 or BF16) to preserve accuracy during computation.
+
+As a result, the model becomes smaller and more memory-efficient, improving loading times. But since activations are not quantized, inference speed gains are limited. Since OpenVINO 2024.3, if the model's weight have been quantized, the corresponding activations will also be quantized at runtime, leading to additional speedup depending on the device.
+
+Weight-only quantization is a simple first step since it usually doesn’t result in significant accuracy degradation.  
+In order to run it, you will need to create a quantization configuration using Optimum \`OVWeightQuantizationConfig\` as follows
+
+
+```python
+from optimum.intel import OVModelForVisualCausalLM, OVWeightQuantizationConfig
+
+q_config = OVWeightQuantizationConfig(bits=8)
+# Apply quantization and save the new model
+q_model = OVModelForVisualCausalLM.from_pretrained(model_id, quantization_config=q_config)
+q_model.save_pretrained("smolvlm_int8")
+```
+
+or quivalently using the CLI:
+
+
+```bash
+optimum-cli export openvino -m HuggingFaceTB/SmolVLM2-256M-Video-Instruct --weight-format int8 smolvlm_int8/
+
+```
+
+## Option 2: Static Quantization
+
+When applying static quantization, quantization is applied on both weights and activations. For this a calibration step is needed in which a dataset subset is used in order to estimate the activations ranges. In the following example we are using 50 samples of the [contextual dataset](https://huggingface.co/datasets/ucla-contextual/contextual_test) to perform this calibration step.
+
+```python
+from optimum.intel import OVModelForVisualCausalLM, OVQuantizationConfig
+
+q_config = OVQuantizationConfig(bits=8, dataset="contextual", num_samples=50)
+q_model = OVModelForVisualCausalLM.from_pretrained(model_id, quantization_config=q_config)
+q_model.save_pretrained("smolvlm_static_int8")
+```
+
+or quivalently using the CLI:
+
+```bash
+optimum-cli export openvino -m HuggingFaceTB/SmolVLM2-256M-Video-Instruct --quant-mode int8 --dataset contextual --num-samples 50 smolvlm_static_int8/
+```
+
+Quantizing activations adds small errors that can build up and affect accuracy, so careful testing afterward is important. More information and examples can be found in [our documentation](https://huggingface.co/docs/optimum-intel/en/openvino/optimization#pipeline-quantization).
+
+### Step 3: Run inference
+
+You can now run inference with your quantized model :
+
+```python
+generated_ids = q_model.generate(**inputs, max_new_tokens=500)
+generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
+print(generated_texts[0])
+```
+
+If you have a recent Intel laptop, Intel AI PC, or Intel discrete GPU, you can load the model on GPU by adding `device="gpu"` when loading your model:
+
+```python
+model = OVModelForVisualCausalLM.from_pretrained(model_id, device="gpu")
+```
+
+Try the complete notebook [here](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/vision_language_quantization.ipynb).
+
+## Conclusion
+
+Multimodal AI is becoming more accessible thanks to smaller, optimized models like **SmolVLM**, along with tools such as **Hugging Face Optimum** and **OpenVINO**. While deploying vision-language models locally still comes with challenges, this workflow shows that it's possible to run lightweight image-and-text models on modest hardware.
+
+By combining quantization techniques with OpenVINO's inference engine, you can reduce memory and compute requirements significantly, making local deployment feasible for a wide range of applications. Whether you're experimenting, prototyping, or looking to deploy offline, this setup gives you a practical starting point.
+
+As models and tooling continue to improve, so will the ability to run powerful multimodal systems without relying on the cloud.
+
+## Useful Links & Resources
+- [Notebook](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/vision_language_quantization.ipynb)
+- [Try our Space](https://huggingface.co/spaces/echarlaix/vision-langage-openvino)
+- Watch the webinar recording
+- [Optimum Intel Documentation](https://huggingface.co/docs/optimum-intel/en/openvino/inference)
+
+#### Notices and Disclaimers
+Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.
+Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.
+© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.
+
+