Add Granite 4.1 Vision (granite4_vision) #45597
Open: artem-spector wants to merge 48 commits into huggingface:main from artem-spector:add-gv41 (+3,153 −0)
Commits (48)
e3bf724 Add Granite 4.1 Vision model (granite4_vision)
6a04fa9 Fix auto-registration after upstream auto_mappings refactor
de314a8 Fix conflict marker in image_processing_auto.py
29f88f9 Fix check-repo: remove spatial_stride (unused in modeling), fix auto_…
d0093d9 Fix duplicate legacy key in conversion_mapping.py
ed4d9ef Fix check-repo failures for granite4_vision
84374b3 Regenerate auto_mappings.py after rebase onto upstream/main
cc97911 Fix CI failures after upstream rebase
703d34e Fix Sam3 auto_mappings.py entries corrupted by rebase
7a815b1 Restore hy_v3, openai_privacy_filter, slanet entries dropped by bad r…
61a8e87 Revert dependency_versions_table.py to match setup.py (upstream state)
2427c19 Fix bad rebase: remove hy-v3/slanet/openai-privacy-filter entries fro…
1af5f3d Remove merge_lora_adapters and PEFT adapter-toggling generate override
4ed8c0c Add Granite4VisionTextModel with deepstack injection, replace manual …
ace7e02 Add Granite4VisionTextConfig, fix missing TextConfig import in genera…
7edc402 Fix Granite4VisionTextConfig to inherit PreTrainedConfig, add missing…
81fe3fb Fix class ordering: define Granite4VisionPreTrainedModel before TextM…
d4a6b42 Fix inv_freq corruption in Granite4VisionTextRotaryEmbedding during f…
7a38510 Inline downsampling into modular, add qformer_config sub-config, conv…
a49cc54 Return Granite4VisionImageFeaturesOutput from get_image_features
e7bb139 Drop Granite4Vision image processor re-definitions, delegate to Llava…
a63064c Address medium PR review items 8-14
9c7d33a Address nit PR review items 15-23
103e842 Address remaining review items 20 and 29
0868a8d Fix test failures found by Granite4VisionModelTest
488552f Use lazy import for Blip2QFormerConfig; qformer_config sub_configs us…
421b478 Add Granite4VisionTextModel to check_repo ignore lists; document miss…
476e8a7 Fix missing imports in modular: math, AutoConfig, select_best_resolution
c5f63ce Pop output_attentions/output_hidden_states from **kwargs in Granite4V…
25d4c14 Fix check_modeling_structure violations (TRF002, TRF009, TRF010)
e435343 Fix ruff I001 import ordering and processing consistency check
305cded Remove converter-regenerated files that should not exist
d84a896 Remove autodoc entries for ImageProcessor classes that don't exist in…
e9268a2 Fix model card date for add_dates.py check
ee05b3e Fix dependency_versions_table.py: sync mlinter version with setup.py
88be1f0 Address review round 3: config, capture_outputs, hidden states, proje…
24c101a Nits: rename one-letter vars, AttributeError() for unused inherited c…
d7f134d Fix qformer_config: build fully-specified at init, no post-super fiel…
dc32af6 Move _can_record_outputs and _deepstack_inject to Granite4VisionPreTr…
b99968d Address remaining review items: capture_outputs, output class, Dynami…
71202cf Regenerate modeling/processing from updated modular; fix copyright he…
98e29ea Fix ruff formatting in modular_granite4_vision.py
0848d92 Fix add_dates.py: update granite4_vision model card date to 2026-04-30
ff2d78e Remove incorrect skip from test_can_init_all_missing_weights
4dcdf3e Fix add_dates.py: update granite4_vision model card date to 2026-05-03
662d2c1 Fix _init_weights: add nn.Embedding, nn.LayerNorm, Granite4VisionText…
80bcb8e Fix ruff F821: detect RMSNorm by attribute pattern instead of class name
3224b69 Fix ruff formatting in modular_granite4_vision.py
<!--Copyright 2026 IBM and The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
*This model was released on 2026-03-27 and added to Hugging Face Transformers on 2026-05-03.*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
        <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# Granite4Vision

[Granite Vision 4.1](https://huggingface.co/ibm-granite/granite-vision-4.1-4b) is a vision-language model from IBM Research designed for enterprise-grade document data extraction. It specializes in chart extraction (Chart2CSV, Chart2Summary, Chart2Code), table extraction (JSON, HTML, OTSL), and semantic key-value pair extraction.

The model builds on [LLaVA-NeXT](llava_next) with several architectural innovations:

1. **SigLIP2 Vision Encoder** ([`google/siglip2-so400m-patch16-384`](https://huggingface.co/google/siglip2-so400m-patch16-384)): images are tiled into 384x384 patches.
2. **Window Q-Former Projectors**: compress visual features 4x using windowed cross-attention over 4x4 patch windows into 2x2 tokens (see the token arithmetic sketched after this list).
3. **DeepStack Feature Injection** with 8 vision-to-LLM injection points:
   - *LayerDeepstack*: features from 4 vision encoder depths are projected into different early LLM layers.
   - *SpatialDeepstack*: the deepest vision features are split into 4 spatial groups and injected at later LLM layers.
4. **Language Model**: [Granite 4.1](https://huggingface.co/ibm-granite/granite-4.1-4b-base) (4B params) with LoRA adapters (rank 256) across all self-attention and MLP layers.
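As a rough sanity check, the tile size, patch size, and Q-Former compression above pin down the visual token budget per tile. The arithmetic below is illustrative only; the actual processor may add separator or global-view tokens on top.

```python
# Back-of-the-envelope visual-token count for one tile (illustrative sketch).
image_size = 384   # each tile is 384x384
patch_size = 16    # siglip2-so400m-patch16-384

patches_per_side = image_size // patch_size   # 24
patches_per_tile = patches_per_side**2        # 576 encoder patches per tile

# Window Q-Former: every 4x4 patch window collapses into 2x2 tokens, a 4x reduction
compression = (4 * 4) // (2 * 2)              # 4
tokens_per_tile = patches_per_tile // compression

print(patches_per_tile, tokens_per_tile)      # 576 144
```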
|
|
The model is delivered as a LoRA adapter on top of the base LLM, enabling a single deployment to support both multimodal and text-only workloads. The total parameter count is ~4B.
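Because the multimodal weights sit on the text backbone as a LoRA adapter, the same checkpoint can also serve plain text turns. A minimal sketch, assuming the chat template accepts a user turn with no image entry:

```python
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "ibm-granite/granite-vision-4.1-4b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)

# Text-only turn: the content list carries no image entry
conversation = [
    {"role": "user", "content": [{"type": "text", "text": "List three uses of a document-extraction model."}]},
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```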
|
|
> [!TIP]
> This model was contributed by the [IBM Granite Vision Team](https://github.com/ibm-granite).

## Usage Tips

- Set `padding_side="left"` during batched generation for more accurate results.

```py
processor.tokenizer.padding_side = "left"
```

- The model supports specialized task tags for document extraction: `<chart2csv>`, `<chart2summary>`, `<chart2code>`, `<tables_html>`, `<tables_otsl>`, `<tables_json>`. Pass these as the text prompt along with a document image (see the sketch after this list).
- For key-value pair extraction, provide a JSON schema describing the fields to extract. The model returns structured JSON matching the schema.
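For example, chart-to-CSV extraction passes the task tag as the entire text prompt. A minimal sketch of that flow; the chart URL below is a hypothetical placeholder, so substitute your own document image:

```python
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "ibm-granite/granite-vision-4.1-4b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/quarterly-revenue-chart.png"},  # placeholder URL
            {"type": "text", "text": "<chart2csv>"},  # task tag is the whole prompt
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))  # CSV rows extracted from the chart
```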
|
|
The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.

<hfoptions id="usage">

<hfoption id="Pipeline">

```python
from transformers import pipeline

pipe = pipeline(
    task="image-text-to-text",
    model="ibm-granite/granite-vision-4.1-4b",
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
pipe(text=messages, max_new_tokens=100, return_full_text=False)
```

</hfoption>

<hfoption id="AutoModel">

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "ibm-granite/granite-vision-4.1-4b"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

</hfoption>

</hfoptions>

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for other available quantization backends.

The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize only the weights to int4.

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

model_id = "ibm-granite/granite-vision-4.1-4b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

## Granite4VisionConfig

[[autodoc]] Granite4VisionConfig

## Granite4VisionTextConfig

[[autodoc]] Granite4VisionTextConfig

## Granite4VisionProcessor

[[autodoc]] Granite4VisionProcessor
    - __call__

## Granite4VisionModel

[[autodoc]] Granite4VisionModel

## Granite4VisionTextModel

[[autodoc]] Granite4VisionTextModel

## Granite4VisionForConditionalGeneration

[[autodoc]] Granite4VisionForConditionalGeneration
    - forward
    - get_image_features
The new package `__init__.py` follows Transformers' standard lazy-import pattern, so the heavy submodules are only imported when first accessed:

```python
# Copyright 2025 IBM. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_granite4_vision import *
    from .modeling_granite4_vision import *
    from .processing_granite4_vision import *
else:
    import sys

    # Swap this module for a _LazyModule shim that defers the imports above
    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
```
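A quick way to see the laziness in action (assuming a build of this branch is installed): the package stays a `_LazyModule` shim until one of its names is touched.

```python
import sys

import transformers.models.granite4_vision as g4v

# The package object is the _LazyModule shim installed above
print(type(g4v).__name__)  # _LazyModule

# Touching a symbol triggers the real import of the configuration submodule
_ = g4v.Granite4VisionConfig
print("transformers.models.granite4_vision.configuration_granite4_vision" in sys.modules)  # True
```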
Review comment: this belongs in `image_processing_auto.py`. Was it auto-generated or added manually?