Uniform kwargs for processors

### Feature request


We want to standardize the logic flow through Processor classes. Since processors can have different kwargs depending on the model and modality, we are adding a `TypedDict` for each modality to keep track of which kwargs are accepted. 

The initial design has been merged, and an example model was modified to follow the new uniform processor kwargs in https://github.com/huggingface/transformers/pull/31198. https://github.com/huggingface/transformers/pull/31197 has two more examples that use the standardized API.
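For anyone picking up a model, here is a rough, standalone sketch of the idea: one `TypedDict` per modality, plus a helper that routes flat call-time kwargs into per-modality groups on top of model defaults. All names here (`TextKwargsSketch`, `ImagesKwargsSketch`, `MODALITY_SCHEMAS`, `split_kwargs_by_modality`) are hypothetical and exist only for illustration. This is not the code from the linked PRs, so please treat those PRs as the reference for the actual classes and merging logic.

```python
from typing import Any, Dict, TypedDict


# Hypothetical per-modality schemas; total=False makes every key optional.
class TextKwargsSketch(TypedDict, total=False):
    padding: bool
    max_length: int


class ImagesKwargsSketch(TypedDict, total=False):
    do_resize: bool
    size: Dict[str, int]


# Hypothetical registry mapping a group name to the schema that owns it.
MODALITY_SCHEMAS = {
    "text_kwargs": TextKwargsSketch,
    "images_kwargs": ImagesKwargsSketch,
}


def split_kwargs_by_modality(
    defaults: Dict[str, Dict[str, Any]], **kwargs: Any
) -> Dict[str, Dict[str, Any]]:
    """Route flat kwargs into per-modality dicts, starting from defaults."""
    # Copy the defaults so the caller's dicts are never mutated.
    merged = {name: dict(defaults.get(name, {})) for name in MODALITY_SCHEMAS}
    for key, value in kwargs.items():
        # A key belongs to every modality whose TypedDict declares it.
        owners = [
            name
            for name, schema in MODALITY_SCHEMAS.items()
            if key in schema.__annotations__
        ]
        if not owners:
            raise TypeError(f"Unexpected processor kwarg: {key!r}")
        for name in owners:
            merged[name][key] = value  # call-time value overrides the default
    return merged


# Example: model defaults plus call-time overrides.
model_defaults = {"text_kwargs": {"padding": False}}
grouped = split_kwargs_by_modality(model_defaults, padding=True, do_resize=True)
# grouped == {"text_kwargs": {"padding": True}, "images_kwargs": {"do_resize": True}}
```

In this sketch the `TypedDict` classes double as documentation of which kwargs each modality accepts, and unknown kwargs fail loudly instead of being silently dropped.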

This design has to be rolled out to all the processors in Transformers, and we would appreciate contributions.
Below is an incomplete list of models that need standardization. Feel free to add a model if it's missing:

- [x] Align #31368 
- [x] AltClip #31368 
- [x] BLIP #31368
- [x] BLIP-2  #31368 
- [x] Bridgetower #31368 
- [x] Chameleon -> https://github.com/huggingface/transformers/pull/32181
- [x] Chinese CLIP -> #31368 
- [x] CLIP -> in progress by @davidgxue 
- [x] ClipSeg -> https://github.com/huggingface/transformers/pull/32841
- [x] Donut #31368
- [x] Flava -> https://github.com/huggingface/transformers/pull/32845
- [x] Fuyu -> https://github.com/huggingface/transformers/pull/32544
- [x] GIT https://github.com/huggingface/transformers/pull/33668
- [x] Grounding DINO #31964
- [x] Idefics -> https://github.com/huggingface/transformers/pull/32568
- [x] Idefics-2 -> https://github.com/huggingface/transformers/pull/32568
- [x] InstructBlip -> https://github.com/huggingface/transformers/pull/32544
- [x] InstructBlipVideo https://github.com/huggingface/transformers/pull/32845
- [x] Kosmos-2 -> https://github.com/huggingface/transformers/pull/32544
- [x] LayoutLM (1, 2, 3) -> https://github.com/huggingface/transformers/pull/32180
- [x] LLaVa -> https://github.com/huggingface/transformers/pull/32858
- [x] LLaVa-NeXT -> https://github.com/huggingface/transformers/pull/32544
- [x] LLaVa-NeXT-Video https://github.com/huggingface/transformers/pull/35613
- [x] MGP-STR https://github.com/huggingface/transformers/pull/32845
- [x] Nougat -> https://github.com/huggingface/transformers/pull/32841
- [x] OneFormer -> https://github.com/huggingface/transformers/pull/34547
- [x] Owlv2 https://github.com/huggingface/transformers/pull/35700
- [x] OwlVIT https://github.com/huggingface/transformers/pull/35700
- [x] Paligemma -> https://github.com/huggingface/transformers/pull/33571
- [x] Pix2Struct -> https://github.com/huggingface/transformers/pull/32544
- [x] Pixtral -> https://github.com/huggingface/transformers/pull/33521
- [x] SAM -> https://github.com/huggingface/transformers/pull/34578
- [x] SigLip -> https://github.com/huggingface/transformers/pull/32845
- [x] TrOCR -> https://github.com/huggingface/transformers/pull/34587
- [x] TVP -> https://github.com/huggingface/transformers/pull/32845
- [x] Udop -> https://github.com/huggingface/transformers/pull/33628
- [x] VideoLLaVa -> https://github.com/huggingface/transformers/pull/32845
- [x] VILT -> https://github.com/huggingface/transformers/pull/32845
- [x] VisionTextDualEncoder -> https://github.com/huggingface/transformers/pull/34563
- [x] X-CLIP -> https://github.com/huggingface/transformers/pull/32845


Note: For now we'll start with image or image+text processors. https://github.com/huggingface/transformers/pull/31368 is an ongoing PR that also standardizes audio processors.

### Motivation

.

### Your contribution

.

Uniform kwargs for processors #31911
