This repository was archived by the owner on Dec 1, 2024. It is now read-only.
README.md (+8 −8)
@@ -1,8 +1,8 @@
 # FlexGen
 
-FlexGen is a high-throughput generation engine for running large language models with limited GPU memory. FlexGen allows **high-throughput** generation by IO-efficient offloading, compression and **large effective batch sizes**.
+FlexGen is a high-throughput generation engine for running large language models with limited GPU memory. FlexGen allows **high-throughput** generation by IO-efficient offloading, compression, and **large effective batch sizes**.
 
-## Throughput-Oriented Inference for Large Langugage Models
+## Throughput-Oriented Inference for Large Language Models
 
 In recent years, large language models (LLMs) have shown great performance across a
 wide range of tasks. Increasingly, LLMs have been applied not only to interactive
@@ -14,15 +14,15 @@ running LLM inferences over millions of tokens in batches, e.g., all the private
 corpus, or all the tasks in the [HELM](https://crfm.stanford.edu/helm/latest/) benchmark.
 These workloads are less sensitive to latency - the user starts up a job and lets it run overnight -
 but increasing throughput is critical for reducing costs.
-Thoughput is a measure of tokens processed per second over the job's entire runtime (which can be hours).
-Throughput-oriented workloads provide opportunities to trading off latency for higher throughput, which
+Throughput is a measure of tokens processed per second over the job's entire runtime (which can be hours).
+Throughput-oriented workloads provide opportunities to trade off latency for higher throughput, which
 makes it easier to take advantage of low-cost commodity GPUs.
 
 The goal of FlexGen is to create a high-throughput system to enable new and exciting applications of
 foundation models to throughput-oriented tasks on low-cost hardware, such as a single commodity GPU
 instead of expensive systems.
 
-See [examples](#examples)for we can run _on a single commodity GPU_ with FlexGen, such as benchmarking and data wrangling.
+Check out the [examples](#examples) of what you can run on a single commodity GPU with FlexGen, including benchmarking and data wrangling.
 
 ❌ **Limitation**. As an offloading-based system running on weak GPUs, FlexGen also has its limitations.
 FlexGen can be significantly slower than the case when you have enough powerful GPUs to hold the whole model, especially for small-batch cases.
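Throughput, as defined above, is simply generated tokens divided by total wall-clock runtime, including all offloading I/O. A minimal sketch of the metric (the function name and the sample numbers are illustrative, not part of FlexGen):

```python
def generation_throughput(num_prompts: int, tokens_per_prompt: int,
                          runtime_seconds: float) -> float:
    """Tokens processed per second over the job's entire runtime."""
    return num_prompts * tokens_per_prompt / runtime_seconds

# An overnight batch job: 10,000 prompts, 32 generated tokens each, 8 hours.
print(f"{generation_throughput(10_000, 32, 8 * 3600):.2f} token/s")  # 11.11 token/s
```

Because the runtime in the denominator covers the whole job, slow per-token latency is acceptable as long as many tokens are processed overall.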
@@ -59,7 +59,7 @@ You can use the commands below to run a Massive Multitask Language Understanding
-Note that only a subset of HELM scenarios is tested.
+Note that only a subset of HELM scenarios is tested. See more tested scenarios [here](flexgen/apps/helm_passed_30b.sh).
 
 ### Data Wrangling
 You can run the examples in this paper, ['Can Foundation Models Wrangle Your Data?'](https://arxiv.org/abs/2205.09911), by following the instructions [here](flexgen/apps/data_wrangle).
@@ -69,9 +69,9 @@ You can run the examples in this paper, ['Can Foundation Models Wrangle Your Dat
 The corresponding effective batch sizes are in parentheses. Please see [here](benchmark/batch_size_table.md) for more details.
 | System | OPT-6.7B | OPT-30B | OPT-175B |
 | ------ | -------- | ------- | -------- |
-| Hugging Face Accelerate |25.12 (2 on GPU) | 0.62 (8 on CPU) | 0.01 (2 on disk) |
+| Hugging Face Accelerate | 25.12 (2 on GPU) | 0.62 (8 on CPU) | 0.01 (2 on disk) |
 | DeepSpeed ZeRO-Inference | 9.28 (16 on CPU) | 0.60 (4 on CPU) | 0.01 (1 on disk) |
-| Petals\*| - | - | 0.05 |
+| Petals\* | - | - | 0.05 |
 | FlexGen | 25.26 (2 on GPU) | 7.32 (144 on CPU) | 0.69 (256 on disk) |
 | FlexGen with Compression | **29.12** (72 on GPU) | **8.38** (512 on CPU) | **1.12** (144 on CPU) |
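A toy cost model (illustrative numbers only, not measurements from the table above) shows why the large effective batch sizes in parentheses matter for offloading: streaming weights from CPU or disk costs roughly the same per batch regardless of batch size, so a larger batch amortizes that fixed I/O cost over more generated tokens:

```python
def toy_offload_throughput(batch_size: int,
                           weight_io_seconds: float = 60.0,
                           compute_seconds_per_seq: float = 0.5,
                           tokens_per_seq: int = 32) -> float:
    """Toy model: one fixed weight-streaming cost per batch,
    plus per-sequence compute time (all numbers are made up)."""
    total_time = weight_io_seconds + batch_size * compute_seconds_per_seq
    return batch_size * tokens_per_seq / total_time

# Throughput rises sharply with effective batch size, then saturates
# once compute, rather than weight I/O, dominates the runtime.
for bs in (1, 16, 256):
    print(f"batch {bs:3d}: {toy_offload_throughput(bs):.2f} token/s")
```

This is the intuition behind the FlexGen rows: trading per-request latency for batches of hundreds of sequences keeps the GPU busy between weight transfers.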