feat: re-allocate pages dynamically #2024
Conversation
@@ -80,9 +124,13 @@ async fn block_allocator_task(
match cmd {
Might be unrelated, but why are we using this channel + command pattern when the routine is such a simple function? What's the advantage over just using a Mutex or RwLock? Current solution seems unnecessarily complicated to me.
In the rest of the code it's because there is a lot of contention.
In the specific case of this struct there is none, so I agree a Mutex would be a better idea here.
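As a rough illustration of the Mutex-based alternative being discussed, here is a minimal sketch assuming a hypothetical free-list allocator (the actual BlockAllocator in this PR has different fields and methods): the shared state lives behind an Arc<Mutex<...>> and callers lock it directly instead of sending commands to a background task over a channel.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Hypothetical free-list state; the real allocator in the repo differs.
struct BlockAllocatorState {
    free_blocks: VecDeque<u32>,
}

#[derive(Clone)]
struct BlockAllocator {
    state: Arc<Mutex<BlockAllocatorState>>,
}

impl BlockAllocator {
    fn new(num_blocks: u32) -> Self {
        Self {
            state: Arc::new(Mutex::new(BlockAllocatorState {
                free_blocks: (0..num_blocks).collect(),
            })),
        }
    }

    // Lock, pop the requested number of blocks, return None if not enough are free.
    fn allocate(&self, n: usize) -> Option<Vec<u32>> {
        let mut state = self.state.lock().unwrap();
        if state.free_blocks.len() < n {
            return None;
        }
        Some(state.free_blocks.drain(..n).collect())
    }

    // Lock and push blocks back onto the free list.
    fn free(&self, blocks: Vec<u32>) {
        let mut state = self.state.lock().unwrap();
        state.free_blocks.extend(blocks);
    }
}

fn main() {
    let allocator = BlockAllocator::new(8);
    let blocks = allocator.allocate(3).expect("enough free blocks");
    assert_eq!(blocks.len(), 3);
    allocator.free(blocks);
}
```

In the scheduler's async context, a tokio::sync::Mutex (or a std Mutex held only for these short critical sections) would play the same role.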
router/src/infer/v3/scheduler.rs
Outdated
.block_allocation
.as_ref()
.map(|alloc| (alloc.blocks.clone(), alloc.slots.clone()))
.unwrap_or((Vec::new(), Vec::new()));
Suggested change:
-    .unwrap_or((Vec::new(), Vec::new()));
+    .unwrap_or_default();
router/src/infer/v3/scheduler.rs
Outdated
None
}
})
.unwrap_or(None)
map + unwrap_or(None) is called and_then.
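As a minimal illustration of that equivalence, on a made-up option value rather than the scheduler's actual code:

```rust
fn main() {
    let maybe_len: Option<usize> = Some(4);

    // map produces Option<Option<_>>, which unwrap_or(None) then flattens...
    let a: Option<usize> = maybe_len
        .map(|len| if len > 2 { Some(len * 2) } else { None })
        .unwrap_or(None);

    // ...which is exactly what and_then does in one step.
    let b: Option<usize> =
        maybe_len.and_then(|len| if len > 2 { Some(len * 2) } else { None });

    assert_eq!(a, b);
}
```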
Yes, all this code will be refactored; it's just something I did yesterday evening to test the idea.
The PR should be marked as draft.
router/src/infer/v3/scheduler.rs
Outdated
})
.collect();

for id in ids.iter() {
Suggested change:
-for id in ids.iter() {
+for id in &ids {
router/src/infer/v3/scheduler.rs
Outdated
tracing::error!("{err}");

// unwrap_or is valid here as we don't care if the receiver is gone.
entry.response_tx.send(Err(err)).unwrap_or(());
If the intent is just to ignore the return value, you can use a let _ binding.
Suggested change:
-entry.response_tx.send(Err(err)).unwrap_or(());
+let _ = entry.response_tx.send(Err(err));
@@ -201,6 +195,9 @@ def from_tokenized(
input_length = len(tokenized_input)
input_lengths.append(input_length)

speculative_length = get_speculate()
speculative_length = 0 if speculative_length is None else speculative_length
Suggested change:
-speculative_length = 0 if speculative_length is None else speculative_length
+speculative_length = get_speculate() or 0
}

pub(crate) async fn extend(&mut self, current_length: u32) -> Result<(), AllocationError> {
    let remaining_tokens = max(self.prompt_tokens + self.decode_tokens - current_length, 1);
Why a minimum of 1 instead of just returning? And since you are using u32, the subtraction might overflow and produce a very big result. I suggest using a signed integer for any numeric calculation.
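To make the underflow concern concrete, here is a small sketch with made-up token counts (not the scheduler's real values); saturating_sub or doing the arithmetic in a signed type are the usual ways to avoid it:

```rust
use std::cmp::max;

fn main() {
    let prompt_tokens: u32 = 10;
    let decode_tokens: u32 = 20;
    // A current_length larger than the token budget makes the unsigned subtraction wrap.
    let current_length: u32 = 40;

    // Debug builds panic here; release builds silently wrap to a huge number.
    // let remaining = max(prompt_tokens + decode_tokens - current_length, 1);

    // Saturating arithmetic clamps at 0 instead of wrapping.
    let remaining = max((prompt_tokens + decode_tokens).saturating_sub(current_length), 1);
    assert_eq!(remaining, 1);

    // Alternatively, compute in a signed type and clamp afterwards.
    let signed = i64::from(prompt_tokens) + i64::from(decode_tokens) - i64::from(current_length);
    assert_eq!(signed.max(1), 1);
}
```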
It's a quick hack, but it will not be present in the final code.
response_sender,
} => {
    let decode_tokens = min(decode_tokens, block_size);
What's the rationale for this?
@zirconium-n Thanks for your input. Can you provide a bit of background on yourself? Who are you, and how are you trying to help here? Your comments definitely seem on point on some aspects, but it feels very off on our side to have someone we have no connection with barge in and comment on the code as authoritatively as you are doing. Starting with an introduction on where you come from and what your goal is will go a long way towards us replying in a positive manner.
proto/v3/generate.proto
Outdated
@@ -198,13 +198,24 @@ message Generation {
optional GeneratedText generated_text = 4;
/// Top tokens
repeated Tokens top_tokens = 5;
/// Current length of the request: prompt tokens + number of generated tokens until this point
uint32 current_length = 6;
This should be cached_tokens instead.
proto/v3/generate.proto
Outdated
}

message FilterBatchRequest {
/// Batch ID
uint64 batch_id = 1;
/// Requests to keep
repeated uint64 request_ids = 2;
repeated UpdatedRequest updated_requests = 2;
@Narsil, not particularly happy about this name. Do you have a better idea?
The way it works is that we send a list of requests <= the current requests in the batch.
The requests that are not part of this list are dropped from the cached batch.
We take the blocks and slots in this request as the source of truth and re-allocate the slots and block_table tensors in the cached batch (allowing blocks and slots to be updated).
KeptRequests?
s/blocks/new_blocks/? (They are only defined when new blocks are being allocated, right?)
}

pub(crate) async fn extend(&mut self, current_length: u32) -> Result<(), AllocationError> {
    let remaining_tokens = max(self.prompt_tokens + self.decode_tokens - current_length, 1);
current_length is wrong here; we need to use cached_tokens instead.
This still works because of the max(1), but we should remove it.
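A hypothetical, simplified sketch of what that fix could look like, with stand-in types that are not the PR's actual ones: pass cached_tokens instead of the raw current length and use saturating_sub, which also removes the need for the max(.., 1) workaround.

```rust
#[derive(Debug)]
struct AllocationError;

// Simplified stand-in for the PR's allocation state.
struct BlockAllocation {
    prompt_tokens: u32,
    decode_tokens: u32,
}

impl BlockAllocation {
    fn extend(&mut self, cached_tokens: u32) -> Result<(), AllocationError> {
        // saturating_sub clamps at 0, so no max(.., 1) workaround is needed.
        let remaining_tokens =
            (self.prompt_tokens + self.decode_tokens).saturating_sub(cached_tokens);
        if remaining_tokens == 0 {
            // Nothing left to allocate for this request.
            return Ok(());
        }
        // ... allocate enough additional blocks to cover remaining_tokens ...
        Ok(())
    }
}

fn main() {
    let mut alloc = BlockAllocation { prompt_tokens: 10, decode_tokens: 20 };
    assert!(alloc.extend(40).is_ok());
}
```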
response_sender,
} => {
    let decode_tokens = min(decode_tokens, block_size);
Allocate prompt tokens + min(decode_tokens, block_size), so prompt tokens + 1 or 2 blocks.
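To illustrate that initial allocation policy with hypothetical numbers (block_size, token counts, and the ceiling division below are assumptions, not the repo's code):

```rust
fn main() {
    let block_size: u32 = 16;
    let prompt_tokens: u32 = 130;
    let decode_tokens: u32 = 2048;

    // Only reserve up to one block's worth of decode tokens up front; the rest
    // is allocated later, page by page, as the request actually needs it.
    let initial_tokens = prompt_tokens + decode_tokens.min(block_size);

    // Ceiling division turns a token count into a block count.
    let initial_blocks = (initial_tokens + block_size - 1) / block_size;

    println!("allocate {initial_blocks} blocks ({initial_tokens} tokens) up front");
}
```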
Ah. Sorry if the comments bothered you. I'm playing with a fork of this repo myself and have been messing with this particular part of the code recently (and may eventually open a PR myself). I noticed there are changes happening upstream and want to keep up with the latest changes, so I thought I might as well provide some help. By no means am I trying to be rude or sound authoritative; I'm just trying to provide some ergonomic nits and ask some questions. I will not engage further if this is unwanted.
No, this is fine, you can continue; just bear in mind that we might not know all this beforehand :). As for the core of the changes here, it's about becoming optimistic in memory allocation (instead of the current pessimistic approach): allocating all possible memory for a given request up front vs allocating later and having to deal with potential OOM situations.
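A rough sketch of the contrast between the two strategies, using a hypothetical fixed-size block pool (none of these names come from the actual text-generation-inference code):

```rust
// Ceiling division: how many blocks are needed for a given token count.
fn blocks_needed(tokens: u32, block_size: u32) -> u32 {
    (tokens + block_size - 1) / block_size
}

fn main() {
    let block_size = 16u32;
    let prompt_tokens = 100u32;
    let max_new_tokens = 2048u32;

    // Pessimistic: reserve enough blocks for the worst case (max_new_tokens)
    // when the request is first scheduled. Nothing can OOM later, but most
    // requests stop early at an EOS token and the reserved pages sit unused.
    let pessimistic = blocks_needed(prompt_tokens + max_new_tokens, block_size);

    // Optimistic: reserve the prompt plus roughly one block of decode tokens,
    // then extend the allocation each time the request fills a page. The pool
    // is used much more densely, but an extension can now fail (OOM) and the
    // scheduler has to handle that case, e.g. by re-queueing the request.
    let optimistic = blocks_needed(prompt_tokens + block_size.min(max_new_tokens), block_size);

    println!("pessimistic up-front blocks: {pessimistic}"); // 135
    println!("optimistic up-front blocks:  {optimistic}"); // 8
}
```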
Force-pushed from 0dbd6b3 to 73c3903.
I think this could be interesting, especially in the context of this PR: https://buildkite.com/vllm/performance-benchmark/builds/4068
Today we allocate all pages at once when first scheduling the request.
This can lead to under-utilisation, as a lot of requests terminate with an EOS token before reaching max_new_tokens (see the p50 in prod).
This PR re-allocates pages dynamically each time a request finishes a page.
Benchmark: ShareGPT with max_new_tokens forced to 2048 tokens.
Queue time is greatly improved (p99: 32s => 1.34s) and throughput is more than doubled (1.49 => 3.37).
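A schematic sketch of that mechanism, with hypothetical types and numbers (not the PR's code): extend a request's allocation only when it is about to cross a page boundary, and surface the case where no block is free so the scheduler can handle it.

```rust
struct Request {
    current_tokens: u32,
    allocated_blocks: u32,
}

fn decode_step(req: &mut Request, block_size: u32, free_blocks: &mut u32) -> Result<(), &'static str> {
    // The next token would not fit in the currently allocated pages:
    // try to grab one more block before generating it.
    if req.current_tokens + 1 > req.allocated_blocks * block_size {
        if *free_blocks == 0 {
            // Out of memory: the scheduler must evict / re-queue this request.
            return Err("no free blocks");
        }
        *free_blocks -= 1;
        req.allocated_blocks += 1;
    }
    req.current_tokens += 1;
    Ok(())
}

fn main() {
    let block_size = 16u32;
    let mut free_blocks = 2u32;
    // Start with the prompt's pages already allocated (2 blocks for 20 tokens).
    let mut req = Request { current_tokens: 20, allocated_blocks: 2 };

    // Generate tokens until the pool runs out of blocks.
    while decode_step(&mut req, block_size, &mut free_blocks).is_ok() {}
    println!("held {} tokens before running out of blocks", req.current_tokens);
}
```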