Changes from all commits (48 commits)
e3bf724  Apr 21, 2026  Add Granite 4.1 Vision model (granite4_vision)
6a04fa9  Apr 21, 2026  Fix auto-registration after upstream auto_mappings refactor
de314a8  Apr 21, 2026  Fix conflict marker in image_processing_auto.py
29f88f9  Apr 21, 2026  Fix check-repo: remove spatial_stride (unused in modeling), fix auto_…
d0093d9  Apr 21, 2026  Fix duplicate legacy key in conversion_mapping.py
ed4d9ef  Apr 23, 2026  Fix check-repo failures for granite4_vision
84374b3  Apr 23, 2026  Regenerate auto_mappings.py after rebase onto upstream/main
cc97911  Apr 23, 2026  Fix CI failures after upstream rebase
703d34e  Apr 23, 2026  Fix Sam3 auto_mappings.py entries corrupted by rebase
7a815b1  Apr 23, 2026  Restore hy_v3, openai_privacy_filter, slanet entries dropped by bad r…
61a8e87  Apr 23, 2026  Revert dependency_versions_table.py to match setup.py (upstream state)
2427c19  Apr 27, 2026  Fix bad rebase: remove hy-v3/slanet/openai-privacy-filter entries fro…
1af5f3d  Apr 27, 2026  Remove merge_lora_adapters and PEFT adapter-toggling generate override
4ed8c0c  Apr 27, 2026  Add Granite4VisionTextModel with deepstack injection, replace manual …
ace7e02  Apr 27, 2026  Add Granite4VisionTextConfig, fix missing TextConfig import in genera…
7edc402  Apr 27, 2026  Fix Granite4VisionTextConfig to inherit PreTrainedConfig, add missing…
81fe3fb  Apr 27, 2026  Fix class ordering: define Granite4VisionPreTrainedModel before TextM…
d4a6b42  Apr 27, 2026  Fix inv_freq corruption in Granite4VisionTextRotaryEmbedding during f…
7a38510  Apr 27, 2026  Inline downsampling into modular, add qformer_config sub-config, conv…
a49cc54  Apr 27, 2026  Return Granite4VisionImageFeaturesOutput from get_image_features
e7bb139  Apr 27, 2026  Drop Granite4Vision image processor re-definitions, delegate to Llava…
a63064c  Apr 27, 2026  Address medium PR review items 8-14
9c7d33a  Apr 27, 2026  Address nit PR review items 15-23
103e842  Apr 27, 2026  Address remaining review items 20 and 29
0868a8d  Apr 27, 2026  Fix test failures found by Granite4VisionModelTest
488552f  Apr 28, 2026  Use lazy import for Blip2QFormerConfig; qformer_config sub_configs us…
421b478  Apr 28, 2026  Add Granite4VisionTextModel to check_repo ignore lists; document miss…
476e8a7  Apr 28, 2026  Fix missing imports in modular: math, AutoConfig, select_best_resolution
c5f63ce  Apr 28, 2026  Pop output_attentions/output_hidden_states from **kwargs in Granite4V…
25d4c14  Apr 28, 2026  Fix check_modeling_structure violations (TRF002, TRF009, TRF010)
e435343  Apr 28, 2026  Fix ruff I001 import ordering and processing consistency check
305cded  Apr 28, 2026  Remove converter-regenerated files that should not exist
d84a896  Apr 28, 2026  Remove autodoc entries for ImageProcessor classes that don't exist in…
e9268a2  Apr 28, 2026  Fix model card date for add_dates.py check
ee05b3e  Apr 29, 2026  Fix dependency_versions_table.py: sync mlinter version with setup.py
88be1f0  Apr 29, 2026  Address review round 3: config, capture_outputs, hidden states, proje…
24c101a  Apr 29, 2026  Nits: rename one-letter vars, AttributeError() for unused inherited c…
d7f134d  Apr 29, 2026  Fix qformer_config: build fully-specified at init, no post-super fiel…
dc32af6  Apr 30, 2026  Move _can_record_outputs and _deepstack_inject to Granite4VisionPreTr…
b99968d  Apr 30, 2026  Address remaining review items: capture_outputs, output class, Dynami…
71202cf  Apr 30, 2026  Regenerate modeling/processing from updated modular; fix copyright he…
98e29ea  Apr 30, 2026  Fix ruff formatting in modular_granite4_vision.py
0848d92  Apr 30, 2026  Fix add_dates.py: update granite4_vision model card date to 2026-04-30
ff2d78e  May 3, 2026   Remove incorrect skip from test_can_init_all_missing_weights
4dcdf3e  May 3, 2026   Fix add_dates.py: update granite4_vision model card date to 2026-05-03
662d2c1  May 3, 2026   Fix _init_weights: add nn.Embedding, nn.LayerNorm, Granite4VisionText…
80bcb8e  May 3, 2026   Fix ruff F821: detect RMSNorm by attribute pattern instead of class name
3224b69  May 3, 2026   Fix ruff formatting in modular_granite4_vision.py
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -1235,6 +1235,8 @@
     title: GlmOcr
   - local: model_doc/got_ocr2
     title: GOT-OCR2
+  - local: model_doc/granite4_vision
+    title: Granite4Vision
   - local: model_doc/granitevision
     title: GraniteVision
   - local: model_doc/grounding-dino
185 changes: 185 additions & 0 deletions docs/source/en/model_doc/granite4_vision.md
@@ -0,0 +1,185 @@
<!--Copyright 2026 IBM and The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
*This model was released on 2026-03-27 and added to Hugging Face Transformers on 2026-05-03.*

<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>

# Granite4Vision

[Granite Vision 4.1](https://huggingface.co/ibm-granite/granite-vision-4.1-4b) is a vision-language model from IBM Research designed for enterprise-grade document data extraction. It specializes in chart extraction (Chart2CSV, Chart2Summary, Chart2Code), table extraction (JSON, HTML, OTSL), and semantic key-value pair extraction.

The model builds on [LLaVA-NeXT](llava_next) with several architectural innovations:

1. **SigLIP2 Vision Encoder** ([`google/siglip2-so400m-patch16-384`](https://huggingface.co/google/siglip2-so400m-patch16-384)): images are split into 384x384 tiles before encoding.
2. **Window Q-Former Projectors**: compress visual features 4x with windowed cross-attention, mapping each 4x4 window of patch features to 2x2 query tokens.
3. **DeepStack Feature Injection** with 8 vision-to-LLM injection points (see the sketch below):
   - *LayerDeepstack*: features from 4 vision encoder depths are projected into different early LLM layers.
   - *SpatialDeepstack*: the deepest vision features are split into 4 spatial groups and injected at later LLM layers.
4. **Language Model**: [Granite 4.1](https://huggingface.co/ibm-granite/granite-4.1-4b-base) (4B parameters) with LoRA adapters (rank 256) across all self-attention and MLP layers.

The model is delivered as a LoRA adapter on top of the base LLM, so a single deployment can serve both multimodal and text-only workloads. The total parameter count is ~4B.
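
To make the injection mechanism concrete, the following is a minimal sketch of DeepStack-style feature injection. All names here are hypothetical illustrations; the actual logic lives in `modeling_granite4_vision.py` and may differ in detail.

```python
import torch


def inject_deepstack_features(
    hidden_states: torch.Tensor,     # (batch, seq_len, hidden) decoder-layer activations
    vision_features: torch.Tensor,   # (num_image_tokens, hidden) projected vision features
    image_token_mask: torch.Tensor,  # (batch, seq_len) bool, True at image-token positions
) -> torch.Tensor:
    """Add projected vision features to the hidden states at image-token positions."""
    hidden_states = hidden_states.clone()
    hidden_states[image_token_mask] = hidden_states[image_token_mask] + vision_features
    return hidden_states
```

Each of the 8 injection points would apply a step like this before one of the selected decoder layers runs: *LayerDeepstack* points receive features from different vision encoder depths, while *SpatialDeepstack* points receive one of the 4 spatial groups of the deepest vision features.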

> [!TIP]
> This model was contributed by the [IBM Granite Vision Team](https://github.com/ibm-granite).

## Usage Tips

- Set `padding_side="left"` during batched generation for more accurate results.

```py
processor.tokenizer.padding_side = "left"
```
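
  For example, here is a minimal batched-generation sketch. It assumes the processor accepts a batch of conversations through `apply_chat_template` (as other LLaVA-style processors do) and reuses the `model` and `processor` loaded in the examples further below; the image URL is the sample image used throughout this page.

  ```py
  processor.tokenizer.padding_side = "left"

  url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
  conversations = [
      [{"role": "user", "content": [
          {"type": "image", "url": url},
          {"type": "text", "text": prompt},
      ]}]
      for prompt in ["Describe this image.", "What animal is shown?"]
  ]
  inputs = processor.apply_chat_template(
      conversations,
      add_generation_prompt=True,
      tokenize=True,
      return_dict=True,
      padding=True,  # left padding, as configured above
      return_tensors="pt",
  ).to(model.device)
  outputs = model.generate(**inputs, max_new_tokens=100)
  print(processor.batch_decode(outputs, skip_special_tokens=True))
  ```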

- The model supports specialized task tags for document extraction: `<chart2csv>`, `<chart2summary>`, `<chart2code>`, `<tables_html>`, `<tables_otsl>`, `<tables_json>`. Pass these as the text prompt along with a document image (see the task-tag sketch after the generation examples below).

- For key-value pair extraction, provide a JSON schema describing the fields to extract. The model returns structured JSON matching the schema.

The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.

<hfoptions id="usage">

<hfoption id="Pipeline">

```python
from transformers import pipeline

pipe = pipeline(
    task="image-text-to-text",
    model="ibm-granite/granite-vision-4.1-4b",
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
pipe(text=messages, max_new_tokens=100, return_full_text=False)
```

</hfoption>

<hfoption id="AutoModel">

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "ibm-granite/granite-vision-4.1-4b"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

</hfoption>

</hfoptions>
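
The specialized task tags from the usage tips are passed directly as the text prompt. The snippet below is a sketch for chart-to-CSV extraction: the chart image URL is a placeholder, and `model` and `processor` are the ones loaded in the [`AutoModel`] example above.

```python
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/bar_chart.png"},  # placeholder chart image
            {"type": "text", "text": "<chart2csv>"},  # task tag as the prompt
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))  # CSV rows describing the chart
```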

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize only the weights to int4.

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

model_id = "ibm-granite/granite-vision-4.1-4b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

## Granite4VisionConfig

[[autodoc]] Granite4VisionConfig

## Granite4VisionTextConfig

[[autodoc]] Granite4VisionTextConfig

## Granite4VisionProcessor

[[autodoc]] Granite4VisionProcessor
- __call__

## Granite4VisionModel

[[autodoc]] Granite4VisionModel

## Granite4VisionTextModel

[[autodoc]] Granite4VisionTextModel

## Granite4VisionForConditionalGeneration

[[autodoc]] Granite4VisionForConditionalGeneration
- forward
- get_image_features
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -178,6 +178,7 @@
 from .gpt_sw3 import *
 from .gptj import *
 from .granite import *
+from .granite4_vision import *
 from .granite_speech import *
 from .granite_speech_plus import *
 from .granitemoe import *
2 changes: 2 additions & 0 deletions src/transformers/models/auto/auto_mappings.py
@@ -235,6 +235,7 @@
     ("gpt_oss", "GptOssConfig"),
     ("gptj", "GPTJConfig"),
     ("granite", "GraniteConfig"),
+    ("granite4_vision", "Granite4VisionConfig"),
     ("granite_speech", "GraniteSpeechConfig"),
     ("granite_speech_encoder", "GraniteSpeechEncoderConfig"),
     ("granite_speech_plus", "GraniteSpeechPlusConfig"),
@@ -892,6 +893,7 @@
     ("glm_image", {"pil": "GlmImageImageProcessorPil", "torchvision": "GlmImageImageProcessor"}),
     ("glpn", {"pil": "GLPNImageProcessorPil", "torchvision": "GLPNImageProcessor"}),
     ("got_ocr2", {"pil": "GotOcr2ImageProcessorPil", "torchvision": "GotOcr2ImageProcessor"}),
+    ("granite4_vision", {"pil": "LlavaNextImageProcessorPil", "torchvision": "LlavaNextImageProcessor"}),

Review comment (Member): this belongs in image_processing_auto.py. Was it auto-generated or added manually?

("grounding-dino", {"pil": "GroundingDinoImageProcessorPil", "torchvision": "GroundingDinoImageProcessor"}),
("idefics", {"pil": "IdeficsImageProcessorPil", "torchvision": "IdeficsImageProcessor"}),
("idefics2", {"pil": "Idefics2ImageProcessorPil", "torchvision": "Idefics2ImageProcessor"}),
Expand Down
1 change: 1 addition & 0 deletions src/transformers/models/auto/image_processing_auto.py
@@ -88,6 +88,7 @@
     ("focalnet", {"torchvision": "BitImageProcessor", "pil": "BitImageProcessorPil"}),
     ("gemma3n", {"torchvision": "SiglipImageProcessor", "pil": "SiglipImageProcessorPil"}),
     ("git", {"torchvision": "CLIPImageProcessor", "pil": "CLIPImageProcessorPil"}),
+    ("granite4_vision", {"torchvision": "LlavaNextImageProcessor", "pil": "LlavaNextImageProcessorPil"}),
     ("groupvit", {"torchvision": "CLIPImageProcessor", "pil": "CLIPImageProcessorPil"}),
     ("hiera", {"torchvision": "BitImageProcessor", "pil": "BitImageProcessorPil"}),
     ("ijepa", {"torchvision": "ViTImageProcessor", "pil": "ViTImageProcessorPil"}),
2 changes: 2 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -211,6 +211,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
     ("gpt_oss", "GptOssModel"),
     ("gptj", "GPTJModel"),
     ("granite", "GraniteModel"),
+    ("granite4_vision", "Granite4VisionModel"),
     ("granite_speech", "GraniteSpeechForConditionalGeneration"),
     ("granitemoe", "GraniteMoeModel"),
     ("granitemoehybrid", "GraniteMoeHybridModel"),
@@ -995,6 +996,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
     ("glm4v_moe", "Glm4vMoeForConditionalGeneration"),
     ("glm_ocr", "GlmOcrForConditionalGeneration"),
     ("got_ocr2", "GotOcr2ForConditionalGeneration"),
+    ("granite4_vision", "Granite4VisionForConditionalGeneration"),
     ("idefics", "IdeficsForVisionText2Text"),
     ("idefics2", "Idefics2ForConditionalGeneration"),
     ("idefics3", "Idefics3ForConditionalGeneration"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -89,6 +89,7 @@
     ("glm_image", "Glm4vProcessor"),
     ("glmasr", "GlmAsrProcessor"),
     ("got_ocr2", "GotOcr2Processor"),
+    ("granite4_vision", "Granite4VisionProcessor"),
     ("granite_speech", "GraniteSpeechProcessor"),
     ("granite_speech_plus", "GraniteSpeechProcessor"),
     ("grounding-dino", "GroundingDinoProcessor"),
28 changes: 28 additions & 0 deletions src/transformers/models/granite4_vision/__init__.py
@@ -0,0 +1,28 @@
# Copyright 2025 IBM. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_granite4_vision import *
    from .modeling_granite4_vision import *
    from .processing_granite4_vision import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
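
A quick note on the `_LazyModule` pattern above: submodules are imported only on first attribute access. A small sketch of the effect, using the classes registered in this PR:

```python
# Importing a single name triggers only the submodule that defines it;
# modeling_granite4_vision stays unimported until one of its classes is touched.
from transformers.models.granite4_vision import Granite4VisionConfig

config = Granite4VisionConfig()
```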