
Conversation

echarlaix
Contributor

@echarlaix commented Sep 12, 2025

echarlaix and others added 4 commits September 16, 2025 13:38
Co-authored-by: Helena Kloosterman <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>
echarlaix and others added 11 commits September 16, 2025 13:48
@echarlaix
Contributor Author

Thanks a lot for the great review @pcuenca! I didn't have time to include everything, but will do so in a second pass. The blog post is not ready for publication yet; once it is, I'll let you know.

Contributor

@merveenoyan left a comment

super cool!

@echarlaix
Contributor Author

Thanks a lot for your reviews @pcuenca @merveenoyan! The blog post is not ready yet (it was set to draft, but I should have clarified that in the description). A lot is likely to change in the coming days, so I don't want you to waste time on corrections that may not make it into the final post. I'll let you know once it's ready!

Co-authored-by: Nikita Savelyev <[email protected]>
openvino-vlm.md Outdated
| openvino-8bit-woq| 0.247 | 0.016 | 0.482 | 63.928 |


This benchmark shows that small, optimized multimodal models like [SmolVLM2-256M](https://huggingface.co/HuggingFaceTB/SmolVLM2-256M-Video-Instruct) can run efficiently on Intel CPUs. Weight-only quantization significantly reduces model size and improves efficiency without a major impact on throughput.
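For reference, loading such a model with OpenVINO and 8-bit weight-only quantization would look roughly like this (a minimal sketch assuming optimum-intel's `OVModelForVisualCausalLM` API; `load_quantized` is a hypothetical helper, and the exact export flags may differ):

```python
MODEL_ID = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"

def load_quantized(model_id: str):
    # Assumed optimum-intel API: exports the model to OpenVINO IR and
    # applies INT8 weight-only quantization on load.
    from optimum.intel import OVModelForVisualCausalLM
    return OVModelForVisualCausalLM.from_pretrained(model_id, load_in_8bit=True)

if __name__ == "__main__":
    # Downloads the model; requires optimum-intel with the OpenVINO extras.
    model = load_quantized(MODEL_ID)
```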
Contributor Author

cc @ezelanza, would you mind updating this once the benchmark is validated on your side?

Co-authored-by: Nikita Savelyev <[email protected]>
Co-authored-by: Eze Lanza (Eze) <[email protected]>
Comment on lines +152 to +156
| Configuration | Time To First Token (TTFT, s) | Time Per Output Token (TPOT, s) | End-to-End Latency (s) | Decoding Throughput (tokens/s) |
|------------------|--------------------------|----------------------------|-----------------------|-------------------------------|
| pytorch | 5.150 | 1.385 | 25.927 | 0.722 |
| openvino | 0.420 | 0.021 | 0.738 | 47.237 |
| openvino-8bit-woq| 0.247 | 0.016 | 0.482 | 63.928 |
Member

@ezelanza are these numbers from the notebook or from optimum-benchmark? CPU vs CPU? I'm asking because the ~65x acceleration seems too good to be true 😅
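The ~65x figure can at least be checked against the table itself; a quick sanity check of the implied speedups (numbers copied from the benchmark table above; a sketch, not the benchmark code):

```python
# Speedups implied by the benchmark table (PyTorch vs OpenVINO on CPU)
pytorch_tpot, openvino_tpot = 1.385, 0.021  # time per output token
pytorch_thr, openvino_thr = 0.722, 47.237   # decoding throughput

tpot_speedup = pytorch_tpot / openvino_tpot      # ~66x lower per-token latency
throughput_speedup = openvino_thr / pytorch_thr  # ~65x higher decoding throughput
```

So the ~65x claim is internally consistent between the TPOT and throughput columns, whatever its source.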

Member

It would be great to have a reference to the benchmark code.
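In the meantime, a rough sketch of how TTFT, TPOT, and end-to-end latency can be measured with a per-token timing loop (the `measure_latency` helper is hypothetical and stands in for the actual benchmark code; any iterable of streamed tokens, e.g. a transformers `TextIteratorStreamer`, would plug in):

```python
import time

def measure_latency(token_stream, n_tokens=32):
    """Time a token stream: returns (ttft, tpot, end_to_end) in seconds."""
    start = time.perf_counter()
    stamps = []
    for i, _ in enumerate(token_stream):
        stamps.append(time.perf_counter())
        if i + 1 >= n_tokens:
            break
    ttft = stamps[0] - start  # time to first token
    # TPOT: mean gap between consecutive tokens after the first
    tpot = (stamps[-1] - stamps[0]) / max(len(stamps) - 1, 1)
    end_to_end = stamps[-1] - start
    return ttft, tpot, end_to_end

# Dummy stream standing in for model.generate(...) with a streamer:
ttft, tpot, e2e = measure_latency(iter(range(8)), n_tokens=8)
```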

Member

@IlyasMoutawwakil left a comment

LGTM, great work everyone! I left one question about the benchmark numbers / reproduction.
