mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. (EMNLP 2022)
https://arxiv.org/abs/2205.12005
We present mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from computational inefficiency and from the linguistic signal being overwhelmed by long visual sequences during cross-modal alignment. To address both problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections. mPLUG achieves state-of-the-art results on a wide range of vision-language downstream tasks, including image captioning, image-text retrieval, visual grounding, and visual question answering.
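For intuition, here is a simplified, PyTorch-style sketch of what a cross-modal skip-connected fusion block can look like: a few asymmetric co-attention layers update only the text stream against fixed visual tokens, and a connected-attention layer then re-injects the original visual tokens through a skip connection. Layer counts, dimensions, and module names below are illustrative assumptions, not the repository's implementation.

```python
# Illustrative sketch of a cross-modal skip-connected fusion block
# (simplified; layer names, dimensions and the number of asymmetric
# co-attention layers per block are assumptions, not the repo's code).
import torch
import torch.nn as nn


class AsymmetricCoAttention(nn.Module):
    """Updates only the text stream: text self-attention + cross-attention to vision."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, text, vision):
        t = self.norm1(text)
        text = text + self.self_attn(t, t, t)[0]
        text = text + self.cross_attn(self.norm2(text), vision, vision)[0]
        return text + self.ffn(self.norm3(text))


class ConnectedAttention(nn.Module):
    """Joint self-attention over the concatenated [vision; text] sequence."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vision, text):
        x = torch.cat([vision, text], dim=1)
        n = self.norm(x)
        x = x + self.attn(n, n, n)[0]
        return x[:, : vision.size(1)], x[:, vision.size(1):]


class SkipConnectedFusionBlock(nn.Module):
    """S cheap asymmetric co-attention layers, then one connected-attention layer
    that re-injects the original (skipped) visual tokens."""

    def __init__(self, dim=768, heads=12, s=3):
        super().__init__()
        self.co_attn = nn.ModuleList(AsymmetricCoAttention(dim, heads) for _ in range(s))
        self.connected = ConnectedAttention(dim, heads)

    def forward(self, vision, text):
        for layer in self.co_attn:
            text = layer(text, vision)        # vision stays fixed: cheaper, keeps the text signal
        return self.connected(vision, text)   # skip-connection: original vision re-enters here


if __name__ == "__main__":
    v = torch.randn(2, 197, 768)   # ViT patch tokens
    t = torch.randn(2, 30, 768)    # text tokens
    v_out, t_out = SkipConnectedFusionBlock()(v, t)
    print(v_out.shape, t_out.shape)
```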
- 2023.5.08: Moved from AliceMind repo for further update.
 - 2022.8.28: Released mPLUG downstream tasks!
 
- Pre-trained models
 
For the VQA and image captioning tasks, we perform additional continued pre-training on 4M image-text pairs, starting from mplug.en.large, to obtain mplug.en.large.v2. A minimal checkpoint-loading sketch follows the table below.
| Model | Visual Backbone | Text Enc Layers | Fusion Layers | Text Dec Layers | #params | Download | 
|---|---|---|---|---|---|---|
| mplug.en.base | vit-b-16 | 6 | 6 | 12 | 350M | mplug.en.base | 
| mplug.en.large | vit-l-14 | 6 | 6 | 12 | 600M | mplug.en.large | 
| mplug.en.large.v2 | vit-l-14 | 6 | 6 | 12 | 600M | mplug.en.large.v2 | 
| mplug.en.huge | vit-l-14 | 24 | 6 | 12 | 1.1B | coming soon | 
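The snippet below is a minimal loading sketch, assuming the released files are ordinary PyTorch checkpoints saved with `torch.save`; the file name and the `'model'` key are assumptions and may differ from the actual release.

```python
# Minimal checkpoint inspection sketch (file name and the 'model' key are
# assumptions about how the released weights are packaged).
import torch

checkpoint = torch.load("mplug.en.base.pth", map_location="cpu")
state_dict = checkpoint.get("model", checkpoint)  # some checkpoints nest weights under 'model'
num_params = sum(p.numel() for p in state_dict.values())
print(f"{len(state_dict)} tensors, ~{num_params / 1e6:.0f}M parameters")
# Load into a model built from the matching config, e.g.:
# model.load_state_dict(state_dict, strict=False)
```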
- Pre-train Datasets
 
| | COCO | VG | SBU | CC3M | CC13M | 
|---|---|---|---|---|---|
| image | 113K | 100K | 860K | 3M | 10M | 
| text | 567K | 769K | 860K | 3M | 10M | 
- Image-text
 
| Task | VQA | Image Captioning | Retrieval | Retrieval | Referring Expression Comprehension | Referring Expression Comprehension | Referring Expression Comprehension | Visual Entailment | Visual Reasoning | 
|---|---|---|---|---|---|---|---|---|---|
| Dataset | VQA v2 | COCO | MSCOCO | Flickr30K | RefCOCO | RefCOCO+ | RefCOCOg | SNLI-VE | NLVR2 | 
| Split | test-dev/test-std | Karpathy test (CE/CIDEr) | 5k test (TR/IR) | 1k test (TR/IR) | val/test-a/test-b | val/test-a/test-b | val-u/test-u | val/test | dev/test-P | 
| Metric | Acc. | CIDEr | R@1 | R@1 | Acc. | Acc. | Acc. | Acc. | Acc. | 
| mPLUG-Base | 79.79/79.98 | 137.5/150.4 | -/- | -/- | -/- | -/- | -/- | -/- | -/- | 
| mPLUG-Large | 81.27/81.26 | 141.0/155.1 | 82.8/65.8 | 97.6/88.4 | 92.40/94.51/88.42 | 86.02/90.17/78.17 | 85.88/86.42 | 89.45/89.29 | 84.58/84.95 | 
| mPLUG-Huge | 82.27/82.41 | 142.3/158.7 | -/- | -/- | -/-/- | -/-/- | -/- | -/- | -/-/- | 
- Video-text
 
| Task | Video Retrieval | Video QA | Video QA | Video Captioning | 
|---|---|---|---|---|
| Dataset | MSRVTT | MSRVTT-QA | MSVD-QA | VATEX | 
| Split | test | test | test | test(CE) | 
| Metric | R@1 | Acc. | Acc. | CIDEr | 
| mPLUG | 38.1 | 21.1 | 37.2 | 42.0 | 
- PyTorch version >= 1.11.0
- Install other libraries via

pip install -r requirements.txt
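A quick, optional sanity check of the PyTorch requirement (the `packaging` helper used here normally ships alongside pip, but that is an assumption):

```python
# Check the PyTorch >= 1.11.0 requirement and CUDA visibility.
import torch
from packaging import version

assert version.parse(torch.__version__.split("+")[0]) >= version.parse("1.11.0"), \
    f"PyTorch {torch.__version__} found; >= 1.11.0 is required"
print("PyTorch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```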
Pre-training: coming soon.
Download json files of downstream tasks
- Download the VQA v2 dataset and the Visual Genome dataset from the original websites (VQA 2.0).
 - Download and extract the provided dataset json files.
 - In configs/vqa_mplug_base.yaml, set the paths for the json files and the image paths.
 - Finetune the pre-trained mplug_base or large model using 8 A100 GPUs:
 
sh scripts/vqa_mplug_base.sh
sh scripts/vqa_mplug_large.sh
 - Evaluate the results using the official evaluation server (the expected submission format is sketched below).
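For reference, the official server expects a JSON list of `{"question_id", "answer"}` records; the snippet below is an illustrative dump with made-up predictions.

```python
# Illustrative dump of VQA predictions in the JSON format the official
# evaluation server expects; the question ids and answers are made up.
import json

predictions = [
    {"question_id": 262148000, "answer": "yes"},
    {"question_id": 262148001, "answer": "2"},
]

with open("vqa_result.json", "w") as f:
    json.dump(predictions, f)
```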
 
- Download COCO Caption dataset from the original websites.
 - Download and extract the provided dataset json files.
 - Download the language evaluation tool (language_evaluation); a sketch of the COCO-style caption result format follows the fine-tuning commands below.
 - In configs/caption_mplug_base.yaml, set the paths for the json files and the image paths.
 - Finetune the pre-trained mplug_base or large model using 8 A100 GPUs:
 
sh scripts/caption_mplug_base.sh
sh scripts/caption_mplug_large.sh
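For reference, generated captions are usually serialized in the standard COCO result format (`image_id`/`caption` pairs) before scoring with CIDEr and related metrics; the entries below are made up for illustration.

```python
# Illustrative dump of generated captions in the standard COCO result format;
# the image ids and captions are made up, not model output.
import json

results = [
    {"image_id": 391895, "caption": "a man riding a motorcycle on a dirt road"},
    {"image_id": 522418, "caption": "a woman cutting a large cake with a knife"},
]

with open("coco_caption_result.json", "w") as f:
    json.dump(results, f)
```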
- Download MSCOCO or Flickr30k datasets from the original websites.
 - Download and extract the provided dataset json files.
 - In configs/retrieval_flickr30k_mplug_large.yaml or configs/retrieval_coco_mplug_large.yaml, set the paths for the json files and the image path.
 - Finetune the pre-trained checkpoint using 8 A100 GPUs (a Recall@K evaluation sketch follows the commands below):
 
sh scripts/retrieval_flickr30k_mplug_large.sh
sh scripts/retrieval_coco_mplug_large.sh
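As a reference for the TR/IR numbers above, the sketch below computes Recall@K from a query-by-candidate similarity matrix, assuming a one-to-one pairing on the diagonal (COCO and Flickr30K actually have five captions per image, which is handled analogously).

```python
# Recall@K from a similarity matrix; the random matrix stands in for
# real image/text embedding similarities.
import torch

def recall_at_k(sim: torch.Tensor, k: int = 1) -> float:
    """sim[i, j] = similarity of query i and candidate j; ground truth is the diagonal."""
    topk = sim.topk(k, dim=1).indices                    # (N, k) retrieved candidate indices
    targets = torch.arange(sim.size(0)).unsqueeze(1)     # (N, 1) ground-truth index per query
    return (topk == targets).any(dim=1).float().mean().item()

sim = torch.randn(100, 100)
print("R@1:", recall_at_k(sim, 1), "R@5:", recall_at_k(sim, 5))
```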
- Download RefCOCO datasets from the original websites.
 - Download and extract the provided dataset json files.
 - In configs/grounding_mplug_large.yaml, set the paths for the json files and the image path. Data preparation can follow TransVG.
 - Finetune the pre-trained checkpoint using 8 A100 GPUs (an IoU-based accuracy sketch follows the command below):
 
sh scripts/grounding_mplug_base.sh
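As a reference for the grounding accuracy metric, the sketch below counts a prediction as correct when its IoU with the ground-truth box exceeds 0.5; the boxes are made-up stand-ins for model output.

```python
# Referring expression comprehension accuracy via IoU > 0.5,
# with boxes in [x1, y1, x2, y2] format.
import torch
from torchvision.ops import box_iou

pred_boxes = torch.tensor([[10., 10., 50., 50.], [0., 0., 30., 30.]])
gt_boxes   = torch.tensor([[12., 12., 48., 52.], [40., 40., 80., 80.]])

ious = box_iou(pred_boxes, gt_boxes).diagonal()   # IoU of each prediction with its own target
accuracy = (ious > 0.5).float().mean().item()
print(f"grounding accuracy: {accuracy:.2%}")
```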
- Download MSRVTT datasets from the original websites.
 - In configs/retrieval_msrvtt_mplug_large.yaml, set the paths for the json files and the video paths.
 - To perform zero-shot evaluation, run the command below (a frame-sampling sketch follows it):
 
sh scripts/retrieval_msrvtt_mplug_large.sh
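The sketch below illustrates one common way to sample a fixed number of frames uniformly from a clip with OpenCV before encoding them; the frame count, resize size, and clip name are assumptions, not the repository's exact preprocessing.

```python
# Illustrative uniform frame sampling with OpenCV.
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 16, size: int = 224) -> np.ndarray:
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        frame = cv2.cvtColor(cv2.resize(frame, (size, size)), cv2.COLOR_BGR2RGB)
        frames.append(frame)
    cap.release()
    if not frames:
        raise RuntimeError(f"could not read frames from {video_path}")
    return np.stack(frames)  # (num_frames, size, size, 3), RGB

frames = sample_frames("video7010.mp4")  # hypothetical MSRVTT clip
print(frames.shape)
```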
- Download MSRVTT-QA datasets from the original websites.
 - In configs/videoqa_msrvtt_mplug_base.yaml, set the paths for the json files and the video paths.
 - To perform zero-shot evaluation, run:
 
sh scripts/videoqa_msrvtt_mplug_base.sh
- Download VATEX datasets from the original websites.
 - In configs/videocap_vatex_mplug_large.yaml, set the paths for the json files and the video paths.
 - To perform zero-shot evaluation, run:
 
sh scripts/videocap_vatex_mplug_large.sh
If you use our work, please cite:
@article{li2022mplug,
  title={mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections},
  author={Li, Chenliang and Xu, Haiyang and Tian, Junfeng and Wang, Wei and Yan, Ming and Bi, Bin and Ye, Jiabo and Chen, Hehong and Xu, Guohai and Cao, Zheng and others},
  journal={arXiv preprint arXiv:2205.12005},
  year={2022}
}
The implementation of mPLUG relies on resources from ALBEF, BLIP, and timm. We thank the original authors for open-sourcing their work.
