# MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
[Deyao Zhu](https://tsutikgiau.github.io/)* (On Job Market!), [Jun Chen](https://junchen14.github.io/)* (On Job Market!), [Xiaoqian Shen](https://xiaoqian-shen.github.io), Xiang Li, and Mohamed Elhoseiny. *Equal Contribution

**King Abdullah University of Science and Technology**

<a href='https://minigpt-4.github.io'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='MiniGPT_4.pdf'><img src='https://img.shields.io/badge/Paper-PDF-red'></a>


## Online Demo

Chat with MiniGPT-4 about your images in our [online demo](https://minigpt-4.github.io).


## Examples

More examples can be found on the [project page](https://minigpt-4.github.io).



## Introduction
- MiniGPT-4 aligns a frozen visual encoder from BLIP-2 with a frozen LLM, Vicuna, using just one projection layer.
- The training of MiniGPT-4 consists of a first pretraining stage on roughly 5 million aligned image-text pairs (about 10 hours on 4 A100s) and a second finetuning stage on an additional 3,500 carefully curated, high-quality pairs (about 7 minutes on a single A100).
- MiniGPT-4 possesses many emerging vision-language capabilities similar to those exhibited by GPT-4.


## Getting Started
### Installation

**1. Prepare the code and the environment**

Git clone our repository, create a Python environment, and activate it with the following commands:

```bash
git clone https://github.com/Vision-CAIR/MiniGPT-4.git
cd MiniGPT-4
conda env create -f environment.yml
conda activate minigpt4
```
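
Optionally, you can sanity-check that the environment was created correctly and that PyTorch can see your GPU before moving on (this assumes a CUDA-capable machine; skip the CUDA check if you run on CPU only):

```bash
# Optional sanity check: print the PyTorch version and whether CUDA is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```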


**2. Prepare the pretrained Vicuna weights**

The current version of MiniGPT-4 is built on the v0 version of Vicuna-13B.
Please refer to their instructions [here](https://huggingface.co/lmsys/vicuna-13b-delta-v0) to obtain the weights.
The final weights should be in a single folder with the following structure:

```
vicuna_weights
├── config.json
├── generation_config.json
├── pytorch_model.bin.index.json
├── pytorch_model-00001-of-00003.bin
...
```
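
For reference, the Vicuna v0 weights are distributed as delta weights, so they are typically reconstructed by applying the delta to the original LLaMA-13B weights with FastChat. The sketch below illustrates that step; the paths are placeholders and the exact flag names can differ across FastChat versions, so follow the linked instructions if anything does not match:

```bash
# Install a FastChat release compatible with the v0 delta weights
pip install fschat

# Apply the Vicuna v0 delta to the original LLaMA-13B weights
# (paths are placeholders; flag names may differ between FastChat versions)
python -m fastchat.model.apply_delta \
    --base /path/to/llama-13b-hf \
    --target /path/to/save/vicuna_weights \
    --delta lmsys/vicuna-13b-delta-v0
```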

Then, set the path to the Vicuna weights in the model config file
[here](minigpt4/configs/models/minigpt4.yaml#L16) at Line 16.
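
The relevant entry looks roughly like the excerpt below; treat it as a sketch (the key name is assumed here) and check the file itself for the exact field:

```
# minigpt4/configs/models/minigpt4.yaml (excerpt; key name assumed)
llama_model: "/path/to/vicuna_weights/"
```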

**3. Prepare the pretrained MiniGPT-4 checkpoint**

To play with our pretrained model, download the pretrained checkpoint
[here](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view?usp=share_link).
Then, set the path to the pretrained checkpoint in the evaluation config file
in [eval_configs/minigpt4_eval.yaml](eval_configs/minigpt4_eval.yaml#L10) at Line 10.
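
As with the Vicuna path, this is a single-line edit that should look roughly like the following (key name assumed; verify it against the file):

```
# eval_configs/minigpt4_eval.yaml (excerpt; key name assumed)
ckpt: "/path/to/pretrained_minigpt4.pth"
```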


### Launching Demo Locally

Try out our demo [demo.py](demo.py) on your local machine by running

```
python demo.py --cfg-path eval_configs/minigpt4_eval.yaml
```
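
If your machine has several GPUs, you can pin the demo to one of them with the standard CUDA environment variable, for example:

```bash
# Example: run the demo on GPU 0 only
CUDA_VISIBLE_DEVICES=0 python demo.py --cfg-path eval_configs/minigpt4_eval.yaml
```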


### Training
The training of MiniGPT-4 contains two alignment stages.

**1. First pretraining stage**

In the first pretraining stage, the model is trained on image-text pairs from the Laion and CC datasets
to align the vision and language models. To download and prepare the datasets, please check
our [first stage dataset preparation instruction](dataset/README_1_STAGE.md).
After the first stage, the visual features are mapped into a representation that the language
model can understand.
To launch the first stage training, run the following command. In our experiments, we use 4 A100s.
You can change the save path in the config file
[train_configs/minigpt4_stage1_pretrain.yaml](train_configs/minigpt4_stage1_pretrain.yaml)

```bash
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage1_pretrain.yaml
```
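
For example, a single-node run on 4 GPUs (matching our setup) becomes:

```bash
# Example: first-stage pretraining on 4 GPUs of a single node
torchrun --nproc-per-node 4 train.py --cfg-path train_configs/minigpt4_stage1_pretrain.yaml
```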

**2. Second finetuning stage**

In the second stage, we use a small, high-quality image-text dataset curated by ourselves
and convert it into a conversation format to further align MiniGPT-4.
To download and prepare our second stage dataset, please check our
[second stage dataset preparation instruction](dataset/README_2_STAGE.md).
To launch the second stage alignment,
first specify the path to the checkpoint file trained in stage 1 in
[train_configs/minigpt4_stage2_finetune.yaml](train_configs/minigpt4_stage2_finetune.yaml).
You can also specify the output path there.
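
The two fields to touch look roughly like the excerpt below (field names assumed; check the finetune config itself for the exact keys):

```
# train_configs/minigpt4_stage2_finetune.yaml (excerpt; key names assumed)
ckpt: "/path/to/stage1/checkpoint.pth"
output_dir: "output/minigpt4_stage2_finetune"
```
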
Then, run the following command. In our experiments, we use 1 A100.

```bash
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml
```
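
With a single GPU, this reduces to:

```bash
# Example: second-stage finetuning on a single GPU
torchrun --nproc-per-node 1 train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml
```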

After the second stage alignment, MiniGPT-4 is able to talk about images coherently and in a user-friendly manner.



## Acknowledgement

+ [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2)
+ [Vicuna](https://github.com/lm-sys/FastChat)


If you're using MiniGPT-4 in your research or applications, please cite it using this BibTeX:
```bibtex
@misc{zhu2023minigpt4,
      title={MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models},
      author={Deyao Zhu and Jun Chen and Xiaoqian Shen and Xiang Li and Mohamed Elhoseiny},
      year={2023},
}
```

## License
This repository is released under the [BSD 3-Clause License](LICENSE.md).
Much of the code is based on [Lavis](https://github.com/salesforce/LAVIS), which is
also licensed under the BSD 3-Clause License ([here](LICENSE_Lavis.md)).