
Commit 2aa7e42

Update links

1 parent 76ddbcf

1 file changed: +8 -8 lines changed


README.md (+8 -8)
@@ -1,8 +1,8 @@
# FlexGen

-FlexGen is a high-throughput generation engine for running large language models with limited GPU memory. FlexGen allows **high-throughput** generation by IO-efficient offloading, compression and **large effective batch sizes**.
+FlexGen is a high-throughput generation engine for running large language models with limited GPU memory. FlexGen allows **high-throughput** generation by IO-efficient offloading, compression, and **large effective batch sizes**.

-## Throughput-Oriented Inference for Large Langugage Models
+## Throughput-Oriented Inference for Large Language Models

In recent years, large language models (LLMs) have shown great performance across a
wide range of tasks. Increasingly, LLMs have been applied not only to interactive
@@ -14,15 +14,15 @@ running LLM inferences over millions of tokens in batches, e.g., all the private
corpus, or all the tasks in the [HELM](https://crfm.stanford.edu/helm/latest/) benchmark.
These workloads are less sensitive to latency - the user starts up a job and lets it run overnight -
but increasing throughput is critical for reducing costs.
-Thoughput is a measure of tokens processed per second over the job's entire runtime (which can be hours).
-Throughput-oriented workloads provide opportunities to trading off latency for higher throughput, which
+Throughput is a measure of tokens processed per second over the job's entire runtime (which can be hours).
+Throughput-oriented workloads provide opportunities to trade off latency for higher throughput, which
makes it easier to take advantage of low-cost commodity GPUs.

The goal of FlexGen is to create a high-throughput system to enable new and exciting applications of
foundation models to throughput-oriented tasks on low-cost hardware, such as a single commodity GPU
instead of expensive systems.

-See [examples](#examples) for we can run _on a single commodity GPU_ with FlexGen, such as benchmarking and data wrangling.
+Check out the [examples](#examples) of what you can run on a single commodity GPU with FlexGen, including benchmarking and data wrangling.

**Limitation**. As an offloading-based system running on weak GPUs, FlexGen also has its limitations.
FlexGen can be significantly slower than the case when you have enough powerful GPUs to hold the whole model, especially for small-batch cases.
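As a back-of-the-envelope illustration of the throughput metric defined in the hunk above (a minimal sketch; the job size and runtime below are hypothetical, not FlexGen measurements):

```
# Throughput = tokens processed / total job runtime.
# Hypothetical overnight batch job; all figures are illustrative only.
generated_tokens = 1_000_000          # tokens produced across all batches
runtime_hours = 12                    # the job runs overnight
throughput = generated_tokens / (runtime_hours * 3600)
print(f"{throughput:.2f} tokens/s")   # ~23.15 tokens/s
```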
@@ -59,7 +59,7 @@ You can use the commands below to run a Massive Multitask Language Understanding
```
python3 -m flexgen.apps.helm_run --description mmlu:model=text,subject=abstract_algebra,data_augmentation=canonical --pad-to-seq-len 512 --model facebook/opt-30b --percent 20 80 0 100 0 100 --gpu-batch-size 48 --num-gpu-batches 3 --max-eval-instance 100
```
-Note that only a subset of HELM scenarios is tested.
+Note that only a subset of HELM scenarios is tested. See more tested scenarios [here](flexgen/apps/helm_passed_30b.sh).

### Data Wrangling
You can run the examples in this paper, ['Can Foundation Models Wrangle Your Data?'](https://arxiv.org/abs/2205.09911), by following the instructions [here](flexgen/apps/data_wrangle).
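A note on the `--percent` flag in the command above: per FlexGen's documented convention (an assumption here, since this diff does not define it), the six numbers are the GPU/CPU percentage splits for weights, attention (KV) cache, and activations, with any remainder offloaded to disk. A minimal sketch of reading the `20 80 0 100 0 100` placement under that assumption:

```
# Assumed field order: weights GPU/CPU, attention cache GPU/CPU,
# activations GPU/CPU; whatever is not on GPU or CPU spills to disk.
percent = [20, 80, 0, 100, 0, 100]  # from the helm_run command above

for name, gpu, cpu in zip(["weights", "attention cache", "activations"],
                          percent[::2], percent[1::2]):
    print(f"{name}: {gpu}% GPU, {cpu}% CPU, {100 - gpu - cpu}% disk")
```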
@@ -69,9 +69,9 @@ You can run the examples in this paper, ['Can Foundation Models Wrangle Your Dat
The corresponding effective batch sizes are in parentheses. Please see [here](benchmark/batch_size_table.md) for more details.
| System | OPT-6.7B | OPT-30B | OPT-175B |
| ------ | -------- | ------- | -------- |
-| Hugging Face Accelerate | 25.12 (2 on GPU) | 0.62 (8 on CPU) | 0.01 (2 on disk) |
+| Hugging Face Accelerate | 25.12 (2 on GPU) | 0.62 (8 on CPU) | 0.01 (2 on disk) |
| DeepSpeed ZeRO-Inference | 9.28 (16 on CPU) | 0.60 (4 on CPU) | 0.01 (1 on disk) |
-| Petals\* | - | - | 0.05 |
+| Petals\* | - | - | 0.05 |
| FlexGen | 25.26 (2 on GPU) | 7.32 (144 on CPU) | 0.69 (256 on disk) |
| FlexGen with Compression | **29.12** (72 on GPU) | **8.38** (512 on CPU) | **1.12** (144 on CPU) |
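The effective batch sizes in parentheses appear consistent with the generation flags: assuming effective batch size equals `gpu-batch-size` times `num-gpu-batches` (inferred from the numbers, not stated in this commit), the helm_run command above reproduces the 144 reported for FlexGen on OPT-30B:

```
# Effective batch size, assuming it equals gpu_batch_size * num_gpu_batches
# (this matches the "7.32 (144 on CPU)" entry above but is an inference,
# not a definition given in this commit).
gpu_batch_size = 48    # --gpu-batch-size 48
num_gpu_batches = 3    # --num-gpu-batches 3
print(gpu_batch_size * num_gpu_batches)  # 144
```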
