This document describes the steps to run the GPT-NeoX model on FasterTransformer. GPT-NeoX is a model developed by EleutherAI, available publicly on their GitHub repository. For the time being, only the 20B parameter version has been tested.
More details are listed in gptj_guide.md.
Optimizations in GPT-NeoX are similar to the optimizations in GPT, which are described in gpt_guide.md.
The following environment variables can be used to tune behavior for specific use cases.
| Name | Description | Default | Values accepted |
|---|---|---|---|
| `FMHA_ENABLE` | Enable the fused multi-head attention kernels (fp16 accumulation) | disabled | `ON` = enable FMHA, otherwise disabled |
| `CONTEXT_ATTENTION_BMM1_HALF_ACCUM` | Use fp16 accumulation for the QK GEMM; only affects the unfused multi-head attention kernels | fp32 accumulation | `ON` = fp16 accumulation, otherwise fp32 accumulation |
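Because these switches are read from the process environment, they can be set in the shell or handed to a launcher. The following is a minimal Python sketch of the second approach. The helper name `ft_env` and its parameters are hypothetical and not part of FasterTransformer; the only facts taken from the table are the two variable names and the value `ON`.

```python
import os


def ft_env(fmha=False, bmm1_half_accum=False, base=None):
    """Return a copy of an environment with the two tuning variables applied.

    Hypothetical helper: a variable is set to "ON" when requested and removed
    otherwise, since the table lists "ON" as the only value with an effect.
    """
    env = dict(os.environ if base is None else base)
    switches = {
        "FMHA_ENABLE": fmha,
        "CONTEXT_ATTENTION_BMM1_HALF_ACCUM": bmm1_half_accum,
    }
    for name, enabled in switches.items():
        if enabled:
            env[name] = "ON"
        else:
            env.pop(name, None)  # leave the variable unset to keep the default
    return env


# Example with an explicit base environment so the result is reproducible.
env = ft_env(fmha=True, base={"PATH": "/usr/bin"})
print(sorted(env.items()))  # [('FMHA_ENABLE', 'ON'), ('PATH', '/usr/bin')]
```

The returned mapping can be passed as the `env=` argument of `subprocess.run` when starting one of the example programs from this guide.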
- Checkpoint converter
  - EleutherAI
- Data type
  - FP32
  - FP16
- Feature
  - Multi-GPU multi-node inference
  - Dynamic random seed
  - Stop tokens
  - Bad words list
  - Beam search and sampling are both supported
See the common requirements in gptj_guide.md.
First download a PyTorch checkpoint, as provided by EleutherAI:

    wget --cut-dirs=5 -nH -r --no-parent --reject "index.html*" https://mystic.the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/ -P 20B_checkpoints

Then use the script provided by FasterTransformer to convert the checkpoint into the raw weight format that FT reads:

    python ../examples/pytorch/gptneox/utils/eleutherai_gpt_neox_convert.py 20B_checkpoints ../models/gptneox -t 2

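You may download the tokenizer config (`20B_tokenizer.json`) from https://mystic.the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/20B_tokenizer.json.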
To tokenize/detokenize files, use the script found in `examples/pytorch/gptneox/utils/hftokenizer.py`. You may need to pass the path to the tokenizer config with the `--tokenizer` flag.
- Generate the `gemm_config.in` file.

  Usage, where `<data_type>` is 0 (FP32), 1 (FP16), or 2 (BF16):

      ./bin/gpt_gemm <batch_size> <beam_width> <max_input_len> <head_number> <size_per_head> <inter_size> <vocab_size> <data_type> <tensor_para_size>

  For example:

      ./bin/gpt_gemm 8 1 32 64 96 24576 50432 1 2

- Run GPT-NeoX on C++

  The arguments are documented in `examples/cpp/gptneox/gptneox_config.ini`, which controls the model path, model size, tensor parallelism size, and some hyper-parameters.

      mpirun -n 2 --allow-run-as-root ./bin/gptneox_example

  For example, setting `data_type` in `gptneox_config.ini` to `fp16` runs the model in FP16.

  You can then decode the `out` file with the tokenizer:

  ```bash
  wget https://mystic.the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/20B_tokenizer.json
  ../examples/pytorch/gptneox/utils/hftokenizer.py out --tokenizer 20B_tokenizer.json
  ```

- Run GPT-NeoX on PyTorch

  `gptneox_example.py` shows how to declare a model, load a checkpoint, forward context inputs, and get generated outputs in PyTorch.

  To generate outputs from context inputs, create a text file containing the context inputs (one per line) and set `--sample_input_file` to the path of that file. (By default, the script generates outputs without context inputs.) Run with `-h` to see more settings.

  Run GPT-NeoX with tensor parallelism (TP) and pipeline parallelism (PP) on a single node. Note that the number of processes must equal `tensor_para_size * pipeline_para_size`.

      # No parallelism (tensor_para_size=1, pipeline_para_size=1)
      python ../examples/pytorch/gptneox/gptneox_example.py

      # TP (tensor_para_size=2, pipeline_para_size=1)
      mpirun -n 2 --allow-run-as-root python ../examples/pytorch/gptneox/gptneox_example.py --tensor_para_size=2 --pipeline_para_size=1 --ckpt_path="/path/to/your/model/2-gpu"

      # PP (tensor_para_size=1, pipeline_para_size=2)
      mpirun -n 2 --allow-run-as-root python ../examples/pytorch/gptneox/gptneox_example.py --tensor_para_size=1 --pipeline_para_size=2 --ckpt_path="/path/to/your/model/1-gpu"

      # TP and PP (tensor_para_size=2, pipeline_para_size=2)
      mpirun -n 4 --allow-run-as-root python ../examples/pytorch/gptneox/gptneox_example.py --tensor_para_size=2 --pipeline_para_size=2 --ckpt_path="/path/to/your/model/2-gpu"

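The numeric arguments in the steps above depend on each other. In the `gpt_gemm` example, 64 heads of size 96 give a hidden size of 6144, and the `inter_size` of 24576 is four times that. For the launch commands, the `mpirun -n` value must equal `tensor_para_size * pipeline_para_size`. The sketch below assembles both command lines from one set of parameters so these relationships are checked in one place. The helper names `gemm_command` and `launch_command` are hypothetical, and the `4 * hidden` default for `inter_size` is an assumption taken from the example values, not a FasterTransformer rule.

```python
# Data type codes as listed for gpt_gemm in this guide.
DATA_TYPES = {"fp32": 0, "fp16": 1, "bf16": 2}


def gemm_command(batch_size, beam_width, max_input_len, head_number,
                 size_per_head, vocab_size, data_type, tensor_para_size,
                 inter_size=None):
    """Build the ./bin/gpt_gemm argument list in the documented order."""
    if inter_size is None:
        # Assumption: feed-forward width is 4x the hidden size, which matches
        # the example in this guide (4 * 64 * 96 == 24576).
        inter_size = 4 * head_number * size_per_head
    args = [batch_size, beam_width, max_input_len, head_number, size_per_head,
            inter_size, vocab_size, DATA_TYPES[data_type], tensor_para_size]
    return ["./bin/gpt_gemm"] + [str(a) for a in args]


def launch_command(program, tensor_para_size=1, pipeline_para_size=1):
    """Prefix `program` (a list of strings) with mpirun when needed.

    The process count is tensor_para_size * pipeline_para_size; a single
    process is started directly, as in the "no parallelism" example.
    """
    world_size = tensor_para_size * pipeline_para_size
    if world_size == 1:
        return list(program)
    return ["mpirun", "-n", str(world_size), "--allow-run-as-root"] + list(program)


print(" ".join(gemm_command(8, 1, 32, 64, 96, 50432, "fp16", 2)))
# ./bin/gpt_gemm 8 1 32 64 96 24576 50432 1 2

script = ["python", "../examples/pytorch/gptneox/gptneox_example.py",
          "--tensor_para_size=2", "--pipeline_para_size=2"]
print(" ".join(launch_command(script, tensor_para_size=2, pipeline_para_size=2)))
```

The first printed line reproduces the `gpt_gemm` example from this guide. The second starts four processes, which matches the combined TP and PP case.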