Commit f1a33af

Deyao Zhu committed: first commit
0 parents  commit f1a33af

File tree: 111 files changed, +8792 -0 lines


LICENSE.md

+14 lines

BSD 3-Clause License

Copyright 2023 Deyao Zhu
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

LICENSE_Lavis.md

+14 lines

BSD 3-Clause License

Copyright (c) 2022 Salesforce, Inc.
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of Salesforce.com nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

MiniGPT_4.pdf

6.31 MB
Binary file not shown.

README.md

+145 lines

# MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
[Deyao Zhu](https://tsutikgiau.github.io/)* (On Job Market!), [Jun Chen](https://junchen14.github.io/)* (On Job Market!), [Xiaoqian Shen](https://xiaoqian-shen.github.io), Xiang Li, and Mohamed Elhoseiny. *Equal Contribution

**King Abdullah University of Science and Technology**

<a href='https://minigpt-4.github.io'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='MiniGPT_4.pdf'><img src='https://img.shields.io/badge/Paper-PDF-red'></a>


## Online Demo

Click the image to chat with MiniGPT-4 about your images.

[![demo](figs/online_demo.png)](https://minigpt-4.github.io)


## Examples

|   |   |
:-------------------------:|:-------------------------:
![find wild](figs/examples/wop_2.png) | ![write story](figs/examples/ad_2.png)
![solve problem](figs/examples/fix_1.png) | ![write Poem](figs/examples/rhyme_1.png)

More examples can be found on the [project page](https://minigpt-4.github.io).


## Introduction
- MiniGPT-4 aligns a frozen visual encoder from BLIP-2 with a frozen LLM, Vicuna, using just one projection layer.
- The training of MiniGPT-4 consists of a first pretraining stage on roughly 5 million aligned image-text pairs (10 hours on 4 A100s) and a second finetuning stage on an additional 3,500 carefully curated high-quality pairs (7 minutes on 1 A100).
- MiniGPT-4 possesses many emerging vision-language capabilities similar to those exhibited by GPT-4.

![overview](figs/overview.png)
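
To make the "one projection layer" idea concrete, here is a minimal sketch of such an alignment module (illustrative only, not the repository's actual implementation; the feature dimensions are assumptions):

```python
import torch
import torch.nn as nn

class VisionToLLMProjection(nn.Module):
    """Sketch: a single trainable linear layer bridging a frozen
    visual encoder and a frozen LLM. Dimensions are illustrative."""

    def __init__(self, vision_dim: int = 768, llm_dim: int = 5120):
        super().__init__()
        # the only trainable component: visual features -> LLM embedding space
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_tokens, vision_dim) from the frozen encoder
        # returns:        (batch, num_tokens, llm_dim), fed to the frozen LLM
        return self.proj(image_features)

# example: 32 visual tokens projected into the LLM embedding space
tokens = VisionToLLMProjection()(torch.randn(1, 32, 768))
print(tokens.shape)  # torch.Size([1, 32, 5120])
```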

## Getting Started
### Installation

**1. Prepare the code and the environment**

Git clone our repository, create a python environment, and activate it via the following commands:

```bash
git clone https://github.com/Vision-CAIR/MiniGPT-4.git
cd MiniGPT-4
conda env create -f environment.yml
conda activate minigpt4
```
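
As an optional sanity check that the new environment sees a GPU (a small sketch; it assumes `environment.yml` installs PyTorch, which the training commands below rely on):

```python
# run inside the activated minigpt4 environment
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```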

**2. Prepare the pretrained Vicuna weights**

The current version of MiniGPT-4 is built on the v0 version of Vicuna-13B.
Please refer to the instructions [here](https://huggingface.co/lmsys/vicuna-13b-delta-v0) to obtain the weights.
The final weights should be in a single folder with the following structure:

```
vicuna_weights
├── config.json
├── generation_config.json
├── pytorch_model.bin.index.json
├── pytorch_model-00001-of-00003.bin
...
```
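
To confirm the assembled weights load, a quick hedged check like the following can be used (it assumes the `transformers` library is installed; note that it loads the full 13B model into CPU memory):

```python
# hedged sanity check: load the assembled Vicuna weights with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vicuna_weights", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("vicuna_weights")
print(model.config.hidden_size)  # Vicuna-13B should report 5120
```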

Then, set the path to the vicuna weights in the model config file
[here](minigpt4/configs/models/minigpt4.yaml#L16) at Line 16.

**3. Prepare the pretrained MiniGPT-4 checkpoint**

To play with our pretrained model, download the pretrained checkpoint
[here](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view?usp=share_link).
Then, set the path to the pretrained checkpoint in the evaluation config file
[eval_configs/minigpt4_eval.yaml](eval_configs/minigpt4_eval.yaml#L10) at Line 10.


### Launching Demo Locally

Try out our demo [demo.py](demo.py) on your local machine by running

```
python demo.py --cfg-path eval_configs/minigpt4_eval.yaml
```
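
The demo relies on both paths set above: the Vicuna weights referenced by the model config and the pretrained MiniGPT-4 checkpoint referenced by the evaluation config, so make sure both are filled in before launching.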


### Training
The training of MiniGPT-4 contains two alignment stages.

**1. First pretraining stage**

In the first pretraining stage, the model is trained on image-text pairs from the LAION and CC datasets
to align the vision and language models. To download and prepare the datasets, please check
our [first stage dataset preparation instruction](dataset/README_1_STAGE.md).
After the first stage, the visual features are mapped into a space the language model can understand.
To launch the first stage training, run the following command, replacing NUM_GPU with the number of GPUs (in our experiments, we use 4 A100s).
You can change the save path in the config file
[train_configs/minigpt4_stage1_pretrain.yaml](train_configs/minigpt4_stage1_pretrain.yaml).

```bash
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage1_pretrain.yaml
```

**2. Second finetuning stage**

In the second stage, we use a small, high-quality image-text pair dataset created by ourselves,
converted to a conversation format, to further align MiniGPT-4.
To download and prepare our second stage dataset, please check our
[second stage dataset preparation instruction](dataset/README_2_STAGE.md).
To launch the second stage alignment,
first specify the path to the checkpoint file trained in stage 1 in
[train_configs/minigpt4_stage2_finetune.yaml](train_configs/minigpt4_stage2_finetune.yaml).
You can also specify the output path there.
Then, run the following command (in our experiments, we use 1 A100).

```bash
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml
```

After the second stage alignment, MiniGPT-4 is able to talk about an image coherently and in a user-friendly way.


## Acknowledgement

+ [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2)
+ [Vicuna](https://github.com/lm-sys/FastChat)


If you're using MiniGPT-4 in your research or applications, please cite using this BibTeX:
```bibtex
@misc{zhu2022minigpt4,
      title={MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models},
      author={Deyao Zhu and Jun Chen and Xiaoqian Shen and Xiang Li and Mohamed Elhoseiny},
      year={2023},
}
```

## License
This repository is under the [BSD 3-Clause License](LICENSE.md).
Much of the code is based on [Lavis](https://github.com/salesforce/LAVIS),
which is licensed under the BSD 3-Clause License [here](LICENSE_Lavis.md).

dataset/README_1_STAGE.md

+96 lines

## Download the filtered Conceptual Captions, SBU, LAION datasets

### Pre-training datasets download:
We use the filtered synthetic captions prepared by BLIP. For more details about the dataset, please refer to [BLIP](https://github.com/salesforce/BLIP).

It requires ~2.3 TB to store the LAION and CC3M+CC12M+SBU datasets.

Image source | Filtered synthetic caption by ViT-L
--- | :---:
CC3M+CC12M+SBU | <a href="https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/ccs_synthetic_filtered_large.json">Download</a>
LAION115M | <a href="https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/laion_synthetic_filtered_large.json">Download</a>

This will download two json files:
```
ccs_synthetic_filtered_large.json
laion_synthetic_filtered_large.json
```
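
The download scripts below read `url` and `caption` columns, so each json file should be a list of url-caption records; a quick peek to confirm:

```python
# inspect the annotation format (fields inferred from the download scripts)
import json

with open("ccs_synthetic_filtered_large.json") as f:
    data = json.load(f)

print(len(data), "records")
print(data[0])  # expected to contain at least "url" and "caption"
```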

## Prepare the data step-by-step

### Set up the dataset folder and move the annotation files to the data storage folder
```
export MINIGPT4_DATASET=/YOUR/PATH/FOR/LARGE/DATASET/
mkdir -p ${MINIGPT4_DATASET}/cc_sbu
mkdir -p ${MINIGPT4_DATASET}/laion
mv ccs_synthetic_filtered_large.json ${MINIGPT4_DATASET}/cc_sbu
mv laion_synthetic_filtered_large.json ${MINIGPT4_DATASET}/laion
```

### Copy the conversion scripts to the data storage folder
```
cp convert_cc_sbu.py ${MINIGPT4_DATASET}/cc_sbu
cp download_cc_sbu.sh ${MINIGPT4_DATASET}/cc_sbu
cp convert_laion.py ${MINIGPT4_DATASET}/laion
cp download_laion.sh ${MINIGPT4_DATASET}/laion
```

### Convert the laion and cc_sbu annotation file format to the img2dataset format
```
cd ${MINIGPT4_DATASET}/cc_sbu
python convert_cc_sbu.py

cd ${MINIGPT4_DATASET}/laion
python convert_laion.py
```

### Download the datasets with img2dataset
```
cd ${MINIGPT4_DATASET}/cc_sbu
sh download_cc_sbu.sh
cd ${MINIGPT4_DATASET}/laion
sh download_laion.sh
```

The final dataset structure:

```
.
├── ${MINIGPT4_DATASET}
│   ├── cc_sbu
│   │   ├── convert_cc_sbu.py
│   │   ├── download_cc_sbu.sh
│   │   ├── ccs_synthetic_filtered_large.json
│   │   ├── ccs_synthetic_filtered_large.tsv
│   │   └── cc_sbu_dataset
│   │       ├── 00000.tar
│   │       ├── 00000.parquet
│   │       ...
│   ├── laion
│   │   ├── convert_laion.py
│   │   ├── download_laion.sh
│   │   ├── laion_synthetic_filtered_large.json
│   │   ├── laion_synthetic_filtered_large.tsv
│   │   └── laion_dataset
│   │       ├── 00000.tar
│   │       ├── 00000.parquet
│   │       ...
...
```
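
To spot-check the downloaded shards, a sketch like the following can iterate a few samples (it assumes the `webdataset` package and img2dataset's default webdataset layout, where each sample carries `.jpg`, `.txt`, and `.json` entries):

```python
# hedged spot-check of one downloaded shard
import webdataset as wds

ds = wds.WebDataset("cc_sbu_dataset/00000.tar").decode("pil").to_tuple("jpg", "txt")
for i, (image, caption) in enumerate(ds):
    print(image.size, caption[:60])
    if i >= 2:  # inspect only the first three samples
        break
```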

## Set up the dataset configuration files

Then, set the LAION dataset loading path
[here](../minigpt4/configs/datasets/laion/defaults.yaml#L5) at Line 5 to

${MINIGPT4_DATASET}/laion/laion_dataset/{00000..10488}.tar

and the Conceptual Caption and SBU datasets loading path
[here](../minigpt4/configs/datasets/cc_sbu/defaults.yaml#L5) at Line 5 to

${MINIGPT4_DATASET}/cc_sbu/cc_sbu_dataset/{00000..01255}.tar

dataset/README_2_STAGE.md

+19 lines

## Second Stage Data Preparation

Our second stage dataset can be downloaded from
[here](https://drive.google.com/file/d/1nJXhoEcy3KTExr17I7BXqY5Y9Lx_-n-9/view?usp=share_link).
After extraction, you will get a data folder with the following structure:

```
cc_sbu_align
├── filter_cap.json
└── image
    ├── 2.jpg
    ├── 3.jpg
    ...
```

Put the folder at any path you want.
Then, set up the dataset path in the dataset config file
[here](../minigpt4/configs/datasets/cc_sbu/align.yaml#L5) at Line 5.
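
For a quick look at the annotations after extraction, a hedged sketch (the exact schema of `filter_cap.json` is not documented here, so no field names are assumed):

```python
# peek at the second-stage annotation file without assuming its schema
import json

with open("cc_sbu_align/filter_cap.json") as f:
    anns = json.load(f)

print(type(anns).__name__)
print(list(anns)[:3] if isinstance(anns, dict) else anns[:1])
```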

dataset/convert_cc_sbu.py

+20 lines

import json
import csv

# specify input and output file paths
input_file = 'ccs_synthetic_filtered_large.json'
output_file = 'ccs_synthetic_filtered_large.tsv'

# load JSON data from input file
with open(input_file, 'r') as f:
    data = json.load(f)

# extract header and data from JSON (all records share the same keys)
header = data[0].keys()
rows = [x.values() for x in data]

# write data to TSV file; newline='' is the documented way to open files
# for the csv module and avoids doubled line endings on Windows
with open(output_file, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(header)
    writer.writerows(rows)

dataset/convert_laion.py

+20 lines

import json
import csv

# specify input and output file paths
input_file = 'laion_synthetic_filtered_large.json'
output_file = 'laion_synthetic_filtered_large.tsv'

# load JSON data from input file
with open(input_file, 'r') as f:
    data = json.load(f)

# extract header and data from JSON (all records share the same keys)
header = data[0].keys()
rows = [x.values() for x in data]

# write data to TSV file; newline='' is the documented way to open files
# for the csv module and avoids doubled line endings on Windows
with open(output_file, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(header)
    writer.writerows(rows)

dataset/download_cc_sbu.sh

+6 lines

#!/bin/bash

# a space before each trailing backslash keeps the line continuations
# from gluing adjacent arguments together
img2dataset --url_list ccs_synthetic_filtered_large.tsv --input_format "tsv" \
    --url_col "url" --caption_col "caption" --output_format webdataset \
    --output_folder cc_sbu_dataset --processes_count 16 --thread_count 128 --image_size 256 \
    --enable_wandb True

dataset/download_laion.sh

+6 lines

#!/bin/bash

# a space before each trailing backslash keeps the line continuations
# from gluing adjacent arguments together
img2dataset --url_list laion_synthetic_filtered_large.tsv --input_format "tsv" \
    --url_col "url" --caption_col "caption" --output_format webdataset \
    --output_folder laion_dataset --processes_count 16 --thread_count 128 --image_size 256 \
    --enable_wandb True
