[TOOLS] Process llava ov data by cpu #415

Open. Wants to merge 1 commit into base: main.
21 changes: 21 additions & 0 deletions tools/datasets/llava_onevision/README.md
@@ -7,7 +7,9 @@ When training the LLava-one-vision model with FlagScale, the original LLava-one-
## Preparation

1. Download https://github.com/LLaVA-VL/LLaVA-NeXT into Path_Of_LLaVA-NeXT.

2. Download google/siglip-so400m-patch14-384 into VISION_MODEL_PATH.

3. Write a hostfile with one node per line (its IP followed by `slots=N`), like the example below:
```
1.2.3.4 slots=8
...
```
@@ -16,9 +18,28 @@ When training the LLava-one-vision model with FlagScale, the original LLava-one-
4. Prepare a dataset input compatible with the LLava-one-vision library, like next_ov_stage_july21.yaml.
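
The hostfile from step 3 can also be generated rather than written by hand. The following is a minimal sketch, assuming every node uses the same `slots` value as in the example above. `write_hostfile` is a hypothetical helper, not part of FlagScale, and `1.2.3.5` is a placeholder address.

```shell
#!/usr/bin/env bash
# Sketch: write a hostfile with one "IP slots=N" entry per line.
# write_hostfile is a hypothetical helper, not part of FlagScale.
write_hostfile() {
    local out="$1" slots="$2"
    shift 2
    : > "$out"                      # create the file, or truncate an old one
    local ip
    for ip in "$@"; do
        printf '%s slots=%s\n' "$ip" "$slots" >> "$out"
    done
}

# Placeholder addresses; replace them with the IPs of your nodes.
write_hostfile hostfile 8 1.2.3.4 1.2.3.5
```

The resulting `hostfile` can then be passed as `HOSTFILE` to the scripts in the Example section.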

## Example
The data can be processed directly in a single step. Note, however, that trainer initialization uses the GPU while the data preprocessing itself does not, so GPU resources are wasted.
```
DATA_PATH=next_ov_stage_july21.yaml
EXPNAME_PATH=*PathOfOutputWebDatasets*
HOSTFILE=hostfile
bash make_llava_ov_wds.sh $DATA_PATH $EXPNAME_PATH $HOSTFILE
```

We currently recommend a more efficient two-stage method. Stage 1 still uses the GPU, but only for trainer initialization, and saves the index data of each card directly. Stage 2 processes the saved index data with CPU multiprocessing and occupies no GPU resources.

Stage 1: Saving index data with GPU
```
DATA_PATH=next_ov_stage_july21.yaml
EXPNAME_PATH=*PathOfOutputWebDatasets*
HOSTFILE=hostfile
bash make_llava_ov_index.sh $DATA_PATH $EXPNAME_PATH $HOSTFILE
```

Stage 2: Processing index data with CPU
```
DATA_PATH=next_ov_stage_july21.yaml
EXPNAME_PATH=*PathOfOutputWebDatasets*
HOSTFILE=hostfile
bash make_llava_ov_wds_by_CPU.sh $DATA_PATH $EXPNAME_PATH $HOSTFILE
```
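
If both stages are run from the same machine, they can be chained so that Stage 2 starts only when Stage 1 succeeds. This is a sketch under assumptions, not part of this change: `run_two_stage` and `SCRIPT_DIR` are hypothetical names, and the two scripts are assumed to take the same three arguments shown in the examples above.

```shell
#!/usr/bin/env bash
# Sketch: run the two stages back to back, stopping if Stage 1 fails.
# run_two_stage and SCRIPT_DIR are hypothetical names used for illustration.
run_two_stage() {
    local data_path="$1" expname_path="$2" hostfile="$3"
    local script_dir="${SCRIPT_DIR:-.}"   # directory holding the two scripts

    # Stage 1: the GPU is used only for trainer initialization; index data is saved.
    bash "$script_dir/make_llava_ov_index.sh" \
        "$data_path" "$expname_path" "$hostfile" || return 1

    # Stage 2: CPU multiprocessing handles the saved index data.
    bash "$script_dir/make_llava_ov_wds_by_CPU.sh" \
        "$data_path" "$expname_path" "$hostfile"
}

# Usage (placeholders as in the examples above):
# run_two_stage next_ov_stage_july21.yaml PathOfOutputWebDatasets hostfile
```

Quoting the arguments keeps paths with spaces or glob characters intact when they are passed to the scripts.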