mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. (EMNLP 2022)
https://arxiv.org/abs/2205.12005
We present mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from computational inefficiency and from the linguistic signal being overwhelmed by long visual sequences during cross-modal alignment. To address both problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections. mPLUG achieves state-of-the-art results on a wide range of vision-language downstream tasks, including image captioning, image-text retrieval, visual grounding, and visual question answering.
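For intuition, here is a simplified, PyTorch-style sketch of what a cross-modal skip-connected fusion block can look like: a few asymmetric co-attention layers update only the text stream against fixed visual tokens, and a connected-attention layer then re-injects the original visual tokens through a skip connection. Layer counts, dimensions, and module names below are illustrative assumptions, not the repository's implementation.

```python
# Illustrative sketch of a cross-modal skip-connected fusion block
# (simplified; layer names, dimensions and the number of asymmetric
# co-attention layers per block are assumptions, not the repo's code).
import torch
import torch.nn as nn


class AsymmetricCoAttention(nn.Module):
    """Updates only the text stream: text self-attention + cross-attention to vision."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, text, vision):
        t = self.norm1(text)
        text = text + self.self_attn(t, t, t)[0]
        text = text + self.cross_attn(self.norm2(text), vision, vision)[0]
        return text + self.ffn(self.norm3(text))


class ConnectedAttention(nn.Module):
    """Joint self-attention over the concatenated [vision; text] sequence."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vision, text):
        x = torch.cat([vision, text], dim=1)
        n = self.norm(x)
        x = x + self.attn(n, n, n)[0]
        return x[:, : vision.size(1)], x[:, vision.size(1):]


class SkipConnectedFusionBlock(nn.Module):
    """S cheap asymmetric co-attention layers, then one connected-attention layer
    that re-injects the original (skipped) visual tokens."""

    def __init__(self, dim=768, heads=12, s=3):
        super().__init__()
        self.co_attn = nn.ModuleList(AsymmetricCoAttention(dim, heads) for _ in range(s))
        self.connected = ConnectedAttention(dim, heads)

    def forward(self, vision, text):
        for layer in self.co_attn:
            text = layer(text, vision)        # vision stays fixed: cheaper, keeps the text signal
        return self.connected(vision, text)   # skip-connection: original vision re-enters here


if __name__ == "__main__":
    v = torch.randn(2, 197, 768)   # ViT patch tokens
    t = torch.randn(2, 30, 768)    # text tokens
    v_out, t_out = SkipConnectedFusionBlock()(v, t)
    print(v_out.shape, t_out.shape)
```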
- 2023.5.08: Moved from AliceMind repo for further update.
 - 2022.8.28: Released mPLUG downstream tasks!
 
- Pre-trained models
 
For the VQA and image captioning tasks, we perform additional continued pre-training on 4M image-text pairs, starting from mplug.en.large, to obtain mplug.en.large.v2. A minimal checkpoint-loading sketch follows the table below.
| Model | Visual Backbone | Text Enc Layers | Fusion Layers | Text Dec Layers | #params | Download | 
|---|---|---|---|---|---|---|
| mplug.en.base | vit-b-16 | 6 | 6 | 12 | 350M | mplug.en.base | 
| mplug.en.large | vit-l-14 | 6 | 6 | 12 | 600M | mplug.en.large | 
| mplug.en.large.v2 | vit-l-14 | 6 | 6 | 12 | 600M | mplug.en.large.v2 | 
| mplug.en.huge | vit-l-14 | 24 | 6 | 12 | 1.1B | coming soon | 
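The snippet below is a minimal loading sketch, assuming the released files are ordinary PyTorch checkpoints saved with `torch.save`; the file name and the `'model'` key are assumptions and may differ from the actual release.

```python
# Minimal checkpoint inspection sketch (file name and the 'model' key are
# assumptions about how the released weights are packaged).
import torch

checkpoint = torch.load("mplug.en.base.pth", map_location="cpu")
state_dict = checkpoint.get("model", checkpoint)  # some checkpoints nest weights under 'model'
num_params = sum(p.numel() for p in state_dict.values())
print(f"{len(state_dict)} tensors, ~{num_params / 1e6:.0f}M parameters")
# Load into a model built from the matching config, e.g.:
# model.load_state_dict(state_dict, strict=False)
```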
- Pre-train Datasets
 
| | COCO | VG | SBU | CC3M | CC13M | 
|---|---|---|---|---|---|
| image | 113K | 100K | 860K | 3M | 10M | 
| text | 567K | 769K | 860K | 3M | 10M | 
- Image-text
 
| Task | VQA | Image Captioning | Retrieval | Retrieval | Referring Expression Comprehension | Referring Expression Comprehension | Referring Expression Comprehension | Visual Entailment | Visual Reasoning | 
|---|---|---|---|---|---|---|---|---|---|
| Dataset | VQA v2 | COCO | MSCOCO | Flickr30K | RefCOCO | RefCOCO+ | RefCOCOg | SNLI-VE | NLVR2 | 
| Split | test-dev/test-std | Karpathy test (CE/CIDEr) | 5k test (TR/IR) | 1k test (TR/IR) | val/test-a/test-b | val/test-a/test-b | val-u/test-u | val/test | dev/test-P | 
| Metric | Acc. | CIDEr | R@1 | R@1 | Acc. | Acc. | Acc. | Acc. | Acc. | 
| mPLUG-Base | 79.79/79.98 | 137.5/150.4 | -/- | -/- | -/- | -/- | -/- | -/- | -/- | 
| mPLUG-Large | 81.27/81.26 | 141.0/155.1 | 82.8/65.8 | 97.6/88.4 | 92.40/94.51/88.42 | 86.02/90.17/78.17 | 85.88/86.42 | 89.45/89.29 | 84.58/84.95 | 
| mPLUG-Huge | 82.27/82.41 | 142.3/158.7 | -/- | -/- | -/-/- | -/-/- | -/- | -/- | -/-/- | 
- Video-text
 
| Task | Video Retrieval | Video QA | Video QA | Video Captioning | 
|---|---|---|---|---|
| Dataset | MSRVTT | MSRVTT-QA | MSVD-QA | VATEX | 
| Split | test | test | test | test(CE) | 
| Metric | R@1 | Acc. | Acc. | CIDEr | 
| mPLUG | 38.1 | 21.1 | 37.2 | 42.0 | 
- PyTorch version >= 1.11.0
- Install other libraries via

pip install -r requirements.txt
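A quick, optional sanity check of the PyTorch requirement (the `packaging` helper used here normally ships alongside pip, but that is an assumption):

```python
# Check the PyTorch >= 1.11.0 requirement and CUDA visibility.
import torch
from packaging import version

assert version.parse(torch.__version__.split("+")[0]) >= version.parse("1.11.0"), \
    f"PyTorch {torch.__version__} found; >= 1.11.0 is required"
print("PyTorch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```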
Pre-training: coming soon.
Download json files of downstream tasks
- Download the VQA v2 dataset and the Visual Genome dataset from the original websites (VQA 2.0).
 - Download and extract the provided dataset json files.
 - In configs/vqa_mplug_base.yaml, set the paths for the json files and the image paths.
 - Finetune the pre-trained mplug_base or large model using 8 A100 GPUs:
 
sh scripts/vqa_mplug_base.sh
sh scripts/vqa_mplug_large.sh
 - Evaluate the results using the official evaluation server (the expected submission format is sketched below).
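For reference, the official server expects a JSON list of `{"question_id", "answer"}` records; the snippet below is an illustrative dump with made-up predictions.

```python
# Illustrative dump of VQA predictions in the JSON format the official
# evaluation server expects; the question ids and answers are made up.
import json

predictions = [
    {"question_id": 262148000, "answer": "yes"},
    {"question_id": 262148001, "answer": "2"},
]

with open("vqa_result.json", "w") as f:
    json.dump(predictions, f)
```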
 
- Download COCO Caption dataset from the original websites.
 - Download and extract the provided dataset json files.
 - Download the language evaluation tool (language_evaluation); a sketch of the COCO-style caption result format follows the fine-tuning commands below.
 - In configs/caption_mplug_base.yaml, set the paths for the json files and the image paths.
 - Finetune the pre-trained mplug_base or large model using 8 A100 GPUs:
 
sh scripts/caption_mplug_base.sh
sh scripts/caption_mplug_large.sh
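For reference, generated captions are usually serialized in the standard COCO result format (`image_id`/`caption` pairs) before scoring with CIDEr and related metrics; the entries below are made up for illustration.

```python
# Illustrative dump of generated captions in the standard COCO result format;
# the image ids and captions are made up, not model output.
import json

results = [
    {"image_id": 391895, "caption": "a man riding a motorcycle on a dirt road"},
    {"image_id": 522418, "caption": "a woman cutting a large cake with a knife"},
]

with open("coco_caption_result.json", "w") as f:
    json.dump(results, f)
```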
- Download MSCOCO or Flickr30k datasets from the original websites.
 - Download and extract the provided dataset json files.
 - In configs/retrieval_flickr30k_mplug_large.yaml or configs/retrieval_coco_mplug_large.yaml, set the paths for the json files and the image path.
 - Finetune the pre-trained checkpoint using 8 A100 GPUs (a Recall@K evaluation sketch follows the commands below):
 
sh scripts/retrieval_flickr30k_mplug_large.sh
sh scripts/retrieval_coco_mplug_large.sh
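As a reference for the TR/IR numbers above, the sketch below computes Recall@K from a query-by-candidate similarity matrix, assuming a one-to-one pairing on the diagonal (COCO and Flickr30K actually have five captions per image, which is handled analogously).

```python
# Recall@K from a similarity matrix; the random matrix stands in for
# real image/text embedding similarities.
import torch

def recall_at_k(sim: torch.Tensor, k: int = 1) -> float:
    """sim[i, j] = similarity of query i and candidate j; ground truth is the diagonal."""
    topk = sim.topk(k, dim=1).indices                    # (N, k) retrieved candidate indices
    targets = torch.arange(sim.size(0)).unsqueeze(1)     # (N, 1) ground-truth index per query
    return (topk == targets).any(dim=1).float().mean().item()

sim = torch.randn(100, 100)
print("R@1:", recall_at_k(sim, 1), "R@5:", recall_at_k(sim, 5))
```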
- Download RefCOCO datasets from the original websites.
 - Download and extract the provided dataset json files.
 - In configs/grounding_mplug_large.yaml, set the paths for the json files and the image path. Data preparation can follow TransVG.
 - Finetune the pre-trained checkpoint using 8 A100 GPUs (an IoU-based accuracy sketch follows the command below):
 
sh scripts/grounding_mplug_base.sh
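As a reference for the grounding accuracy metric, the sketch below counts a prediction as correct when its IoU with the ground-truth box exceeds 0.5; the boxes are made-up stand-ins for model output.

```python
# Referring expression comprehension accuracy via IoU > 0.5,
# with boxes in [x1, y1, x2, y2] format.
import torch
from torchvision.ops import box_iou

pred_boxes = torch.tensor([[10., 10., 50., 50.], [0., 0., 30., 30.]])
gt_boxes   = torch.tensor([[12., 12., 48., 52.], [40., 40., 80., 80.]])

ious = box_iou(pred_boxes, gt_boxes).diagonal()   # IoU of each prediction with its own target
accuracy = (ious > 0.5).float().mean().item()
print(f"grounding accuracy: {accuracy:.2%}")
```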
- Download MSRVTT datasets from the original websites.
 - In configs/retrieval_msrvtt_mplug_large.yaml, set the paths for the json files and the video paths.
 - To perform zero-shot evaluation, run the command below (a frame-sampling sketch follows it):
 
sh scripts/retrieval_msrvtt_mplug_large.sh
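The sketch below illustrates one common way to sample a fixed number of frames uniformly from a clip with OpenCV before encoding them; the frame count, resize size, and clip name are assumptions, not the repository's exact preprocessing.

```python
# Illustrative uniform frame sampling with OpenCV.
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 16, size: int = 224) -> np.ndarray:
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        frame = cv2.cvtColor(cv2.resize(frame, (size, size)), cv2.COLOR_BGR2RGB)
        frames.append(frame)
    cap.release()
    if not frames:
        raise RuntimeError(f"could not read frames from {video_path}")
    return np.stack(frames)  # (num_frames, size, size, 3), RGB

frames = sample_frames("video7010.mp4")  # hypothetical MSRVTT clip
print(frames.shape)
```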
- Download MSRVTT-QA datasets from the original websites.
 - In configs/videoqa_msrvtt_mplug_base.yaml, set the paths for the json files and the video paths.
 - To perform zero-shot evaluation, run:
 
sh scripts/videoqa_msrvtt_mplug_base.sh
- Download VATEX datasets from the original websites.
 - In configs/videocap_vatex_mplug_large.yaml, set the paths for the json files and the video paths.
 - To perform zero-shot evaluation, run:
 
sh scripts/videocap_vatex_mplug_large.sh
If you use our work, please cite:
@article{li2022mplug,
  title={mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections},
  author={Li, Chenliang and Xu, Haiyang and Tian, Junfeng and Wang, Wei and Yan, Ming and Bi, Bin and Ye, Jiabo and Chen, Hehong and Xu, Guohai and Cao, Zheng and others},
  journal={arXiv preprint arXiv:2205.12005},
  year={2022}
}
The implementation of mPLUG relies on resources from ALBEF, BLIP, and timm. We thank the original authors for open-sourcing their work.
