feat: re-allocate pages dynamically #2024

Closed
wants to merge 19 commits into from

Conversation

OlivierDehaene
Member

Today we allocate all pages at once when the request is first scheduled.
This can lead to under-utilisation, as a lot of requests terminate on an EOS token well before reaching max_new_tokens (see the p50 in prod).

This PR re-allocates pages dynamically each time a request finishes a page.
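
In outline, the decode loop would check after each step whether the request has filled its last allocated page and, if so, ask the allocator for one more block. A minimal sketch of that check (the type and field names here are hypothetical, not the PR's actual API):

```rust
/// Hypothetical per-request allocation state; names are illustrative only.
struct Allocation {
    blocks: Vec<u32>,
    block_size: u32,
}

impl Allocation {
    /// Called after each decode step with the request's current length
    /// (prompt tokens + tokens generated so far). Returns true once the
    /// allocated pages are full and one more block must be requested.
    fn needs_extension(&self, current_length: u32) -> bool {
        current_length >= self.blocks.len() as u32 * self.block_size
    }
}

fn main() {
    let alloc = Allocation { blocks: vec![0, 1], block_size: 16 };
    assert!(!alloc.needs_extension(31)); // still fits in the 2 allocated blocks
    assert!(alloc.needs_extension(32));  // last page is full: extend by one block
}
```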

Benchmark = ShareGPT with max_new_tokens forced to 2048 tokens.

  1. Today: without cache extension == allocate all pages at once
     ✓ Post status is 200

     checks.........................: 100.00% ✓ 91         ✗ 0
     data_received..................: 985 kB  16 kB/s
     data_sent......................: 218 kB  3.6 kB/s
     dropped_iterations.............: 410     6.721098/s
     generated_tokens...............: 13388   219.468448/s
     http_req_blocked...............: avg=134.68µs min=1.98µs  med=139.44µs max=252.97µs p(90)=160.32µs p(95)=187.01µs
     http_req_connecting............: avg=87.16µs  min=0s      med=88.77µs  max=170.74µs p(90)=107.65µs p(95)=126.97µs
     http_req_duration..............: avg=20.01s   min=44.28ms med=19.3s    max=53.21s   p(90)=42.29s   p(95)=47.46s
       { expected_response:true }...: avg=20.01s   min=44.28ms med=19.3s    max=53.21s   p(90)=42.29s   p(95)=47.46s
   ✓ http_req_failed................: 0.00%   ✓ 0          ✗ 91
     http_req_receiving.............: avg=68.66µs  min=27.79µs med=62.52µs  max=151.18µs p(90)=100.17µs p(95)=132.87µs
     http_req_sending...............: avg=39.49µs  min=18.96µs med=39.5µs   max=82.64µs  p(90)=51.91µs  p(95)=59.98µs
     http_req_tls_handshaking.......: avg=0s       min=0s      med=0s       max=0s       p(90)=0s       p(95)=0s
     http_req_waiting...............: avg=20.01s   min=44.05ms med=19.3s    max=53.21s   p(90)=42.29s   p(95)=47.46s
     http_reqs......................: 91      1.491756/s
     inference_time.................: avg=6.67s    min=42ms    med=2.96s    max=41.45s   p(90)=17.26s   p(95)=19.74s
     iteration_duration.............: avg=20.01s   min=45.18ms med=19.3s    max=53.21s   p(90)=42.29s   p(95)=47.46s
     iterations.....................: 91      1.491756/s
     queue_time.....................: avg=13.33s   min=1ms     med=11.75s   max=47.45s   p(90)=30.87s   p(95)=32.49s
     time_per_token.................: avg=107.59ms min=39ms    med=47ms     max=366ms    p(90)=260ms    p(95)=338ms
     total_time.....................: avg=20.01s   min=42ms    med=19.3s    max=53.21s   p(90)=42.29s   p(95)=47.46s
     validation_time................: avg=1ms      min=1ms     med=1ms      max=1ms      p(90)=1ms      p(95)=1ms
     vus............................: 100     min=7        max=100
     vus_max........................: 100     min=100      max=100
  2. With this PR
     ✓ Post status is 200

     checks.........................: 100.00% ✓ 206        ✗ 0
     data_received..................: 2.0 MB  32 kB/s
     data_sent......................: 332 kB  5.4 kB/s
     dropped_iterations.............: 296     4.852316/s
     generated_tokens...............: 26491   434.265927/s
     http_req_blocked...............: avg=58.13µs  min=1.5µs   med=3.6µs   max=250.73µs p(90)=145.47µs p(95)=151.59µs
     http_req_connecting............: avg=36.95µs  min=0s      med=0s      max=174.92µs p(90)=92.87µs  p(95)=98.04µs
     http_req_duration..............: avg=11.43s   min=43.66ms med=6.82s   max=51.01s   p(90)=29.03s   p(95)=34.9s
       { expected_response:true }...: avg=11.43s   min=43.66ms med=6.82s   max=51.01s   p(90)=29.03s   p(95)=34.9s
   ✓ http_req_failed................: 0.00%   ✓ 0          ✗ 206
     http_req_receiving.............: avg=64.4µs   min=14.6µs  med=58.8µs  max=196.64µs p(90)=99.13µs  p(95)=124.75µs
     http_req_sending...............: avg=30.86µs  min=19.16µs med=24.53µs max=404.66µs p(90)=41.19µs  p(95)=44.9µs
     http_req_tls_handshaking.......: avg=0s       min=0s      med=0s      max=0s       p(90)=0s       p(95)=0s
     http_req_waiting...............: avg=11.43s   min=43.5ms  med=6.82s   max=51.01s   p(90)=29.03s   p(95)=34.9s
     http_reqs......................: 206     3.37695/s
     inference_time.................: avg=10.88s   min=42ms    med=5.92s   max=50.7s    p(90)=28.16s   p(95)=34.71s
     iteration_duration.............: avg=11.43s   min=44.34ms med=6.82s   max=51.01s   p(90)=29.04s   p(95)=34.9s
     iterations.....................: 206     3.37695/s
     queue_time.....................: avg=543.15ms min=1ms     med=433.5ms max=2.62s    p(90)=1.13s    p(95)=1.34s
     time_per_token.................: avg=230.86ms min=42ms    med=97ms    max=900ms    p(90)=715ms    p(95)=730.25ms
     total_time.....................: avg=11.43s   min=42ms    med=6.82s   max=51.01s   p(90)=29.03s   p(95)=34.9s
     validation_time................: avg=1ms      min=1ms     med=1ms     max=1ms      p(90)=1ms      p(95)=1ms
     vus............................: 99      min=7        max=100
     vus_max........................: 100     min=100      max=100

Queue time is greatly improved (p95: 32.49s => 1.34s) and throughput is more than doubled (1.49 => 3.37 req/s).

@@ -80,9 +124,13 @@ async fn block_allocator_task(
match cmd {
Contributor

Might be unrelated, but why are we using this channel + command pattern when the routine is such a simple function? What's the advantage over just using a Mutex or RwLock? Current solution seems unnecessarily complicated to me.

Member Author

In the rest of the code it's because there is a lot of contention.
In the specific case of this struct there is none so I agree a Mutex would be a better idea here.
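
For illustration, a minimal sketch of the Mutex-based alternative (the struct and field names are hypothetical, not the actual TGI types):

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical allocator state; a real one would also track slots per block.
struct BlockAllocatorState {
    free_blocks: Vec<u32>,
}

#[derive(Clone)]
struct BlockAllocator(Arc<Mutex<BlockAllocatorState>>);

impl BlockAllocator {
    /// Pop `n` blocks off the free list, or None if not enough remain.
    fn allocate(&self, n: usize) -> Option<Vec<u32>> {
        let mut state = self.0.lock().unwrap();
        if state.free_blocks.len() < n {
            return None;
        }
        let split_at = state.free_blocks.len() - n;
        Some(state.free_blocks.split_off(split_at))
    }

    /// Return blocks to the free list when a request finishes.
    fn free(&self, blocks: Vec<u32>) {
        self.0.lock().unwrap().free_blocks.extend(blocks);
    }
}

fn main() {
    let allocator = BlockAllocator(Arc::new(Mutex::new(BlockAllocatorState {
        free_blocks: (0..8).collect(),
    })));
    let blocks = allocator.allocate(2).unwrap();
    allocator.free(blocks);
}
```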

.block_allocation
.as_ref()
.map(|alloc| (alloc.blocks.clone(), alloc.slots.clone()))
.unwrap_or((Vec::new(), Vec::new()));
Contributor

Suggested change
.unwrap_or((Vec::new(), Vec::new()));
.unwrap_or_default();

None
}
})
.unwrap_or(None)
Contributor

map + unwrap_or(None) is called and_then
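
A generic illustration of the equivalence (not the actual closure from this PR):

```rust
fn main() {
    let x: Option<u32> = Some(4);

    // map produces Option<Option<_>>, which unwrap_or(None) then flattens...
    let a = x.map(|v| if v > 2 { Some(v * 10) } else { None }).unwrap_or(None);
    // ...which is exactly what and_then does in a single step.
    let b = x.and_then(|v| if v > 2 { Some(v * 10) } else { None });

    assert_eq!(a, b);
    assert_eq!(a, Some(40));
}
```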

Member Author

Yes, all this code will be refactored; it's just something I did yesterday evening to test the idea.
The PR should be marked as draft.

})
.collect();

for id in ids.iter() {
Contributor

Suggested change
for id in ids.iter() {
for id in &ids {

tracing::error!("{err}");

// unwrap_or is valid here as we don't care if the receiver is gone.
entry.response_tx.send(Err(err)).unwrap_or(());
Contributor

If the intent is just to ignore the return value, you can use an empty binding.

Suggested change
entry.response_tx.send(Err(err)).unwrap_or(());
let _ = entry.response_tx.send(Err(err));

@@ -201,6 +195,9 @@ def from_tokenized(
input_length = len(tokenized_input)
input_lengths.append(input_length)

speculative_length = get_speculate()
speculative_length = 0 if speculative_length is None else speculative_length
Contributor

Suggested change
speculative_length = 0 if speculative_length is None else speculative_length
speculative_length = get_speculate() or 0

}

pub(crate) async fn extend(&mut self, current_length: u32) -> Result<(), AllocationError> {
let remaining_tokens = max(self.prompt_tokens + self.decode_tokens - current_length, 1);
Contributor

Why a minimum of 1 instead of just returning? And since you are using u32, the subtraction might overflow and get a very big result. I suggest using signed integer for any numeric calculation.

Member Author

It's a quick hack but it will not be present in the final code.
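
As an aside, a sketch of the overflow concern raised above, using saturating_sub as an alternative to switching to signed integers (the standalone function and its names are hypothetical):

```rust
/// Standalone version of the subtraction in `extend`, with the underflow clamped.
fn remaining_tokens(prompt_tokens: u32, decode_tokens: u32, current_length: u32) -> u32 {
    // saturating_sub stops at 0 instead of wrapping around to a huge u32 value.
    (prompt_tokens + decode_tokens).saturating_sub(current_length)
}

fn main() {
    // Normal case: 20 prompt + 100 decode tokens, 30 already in the cache.
    assert_eq!(remaining_tokens(20, 100, 30), 90);
    // current_length already past the budget: clamps to 0 instead of wrapping.
    assert_eq!(remaining_tokens(20, 100, 130), 0);
}
```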

response_sender,
} => {
let decode_tokens = min(decode_tokens, block_size);
Contributor

What's the rationale for this?

@Narsil (Collaborator) commented Jun 6, 2024

@zirconium-n Thanks for your input.

Can you provide a bit of background on yourself? Who are you and how are you trying to help here?

Your comments definitely seem on point on some aspects, but it feels very off on our side to have someone we have no connection with barge in and comment on code authoritatively like you are doing.

Starting with an introduction on where you come from and what your goal is will go a long way toward us replying in a positive manner.

@@ -198,13 +198,24 @@ message Generation {
optional GeneratedText generated_text = 4;
/// Top tokens
repeated Tokens top_tokens = 5;
/// Current length of the request: prompt tokens + number of generated tokens until this point
uint32 current_length = 6;
Member Author

This should be cached_tokens instead

}

message FilterBatchRequest {
/// Batch ID
uint64 batch_id = 1;
/// Requests to keep
repeated uint64 request_ids = 2;
repeated UpdatedRequest updated_requests = 2;
Member Author

@Narsil, not particularly happy about this name. Do you have a better idea?

The way it works is that we send a list of requests that is a subset of the current requests in the batch.
Requests that are not part of this list are dropped from the cached batch.
We take the blocks and slots in this message as the source of truth and reallocate the slots and block_table tensors in the cached batch (allowing blocks and slots to be updated).

Collaborator

KeptRequests?

s/blocks/new_blocks/? (They are only defined when new blocks are being allocated, right?)
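
A minimal sketch of the semantics described in this thread (the types and field names are hypothetical, not the actual proto or Rust definitions):

```rust
use std::collections::HashMap;

/// Hypothetical mirror of the message entry: one record per request to keep,
/// carrying the blocks and slots the router now considers authoritative.
struct UpdatedRequest {
    id: u64,
    blocks: Vec<u32>,
    slots: Vec<u32>,
}

/// Sketch of the server side of FilterBatch: requests absent from `updated`
/// are dropped; the remaining ones get their blocks and slots overwritten.
fn filter_batch(
    cached: &mut HashMap<u64, (Vec<u32>, Vec<u32>)>, // request id -> (blocks, slots)
    updated: Vec<UpdatedRequest>,
) {
    let kept: HashMap<u64, (Vec<u32>, Vec<u32>)> = updated
        .into_iter()
        .map(|r| (r.id, (r.blocks, r.slots)))
        .collect();
    // Anything not listed is dropped; listed entries take the router's
    // blocks and slots as the new source of truth.
    *cached = kept;
}

fn main() {
    let mut cached = HashMap::from([(1, (vec![0], vec![0])), (2, (vec![1], vec![16]))]);
    // Keep only request 1, with one extra block appended by the router.
    let update = UpdatedRequest { id: 1, blocks: vec![0, 2], slots: vec![0, 32] };
    filter_batch(&mut cached, vec![update]);
    assert!(cached.contains_key(&1) && !cached.contains_key(&2));
}
```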

}

pub(crate) async fn extend(&mut self, current_length: u32) -> Result<(), AllocationError> {
let remaining_tokens = max(self.prompt_tokens + self.decode_tokens - current_length, 1);
Member Author

current_length is wrong here; we need to use cached_tokens instead.
This still works because of the max(1), but we should remove it.

response_sender,
} => {
let decode_tokens = min(decode_tokens, block_size);
Member Author

Allocate prompt tokens + min(decode_tokens, block_size).
So prompt tokens + 1 or 2 blocks.
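
For illustration, a sketch of that initial sizing (the function and its names are hypothetical):

```rust
/// Number of blocks reserved when the request is first scheduled:
/// the full prompt, plus at most one block's worth of decode tokens.
fn initial_allocation_blocks(prompt_tokens: u32, decode_tokens: u32, block_size: u32) -> u32 {
    let tokens = prompt_tokens + decode_tokens.min(block_size);
    // Round up to whole blocks.
    (tokens + block_size - 1) / block_size
}

fn main() {
    // 100 prompt tokens, block_size 16: ceil((100 + 16) / 16) = 8 blocks,
    // instead of reserving blocks for all 2048 possible new tokens up front.
    assert_eq!(initial_allocation_blocks(100, 2048, 16), 8);
}
```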

@zirconium-n (Contributor)

> @zirconium-n Thanks for your input.
>
> Can you provide a bit of background on yourself? Who are you and how are you trying to help here?
>
> Your comments definitely seem on point on some aspects, but it feels very off on our side to have someone we have no connection with barge in and comment on code authoritatively like you are doing.
>
> Starting with an introduction on where you come from and what your goal is will go a long way toward us replying in a positive manner.

Ah. Sorry if the comments bothered you. I'm playing with a fork of this repo myself and have been messing with this particular part of the code recently (and may eventually open a PR myself). I noticed there are changes happening upstream and want to keep up with the latest changes, so I thought I might as well provide some help.

By no means am I trying to be rude or sound authoritative; I'm just trying to provide some ergonomic nits and ask some questions. I will not engage further if this is unwanted.

@Narsil (Collaborator) commented Jun 7, 2024

> I will not engage further if this is unwanted.

No this is fine, you can continue, just bear in mind that we might not know all this beforehand :).
Thanks for your input.

As for the core of the changes here, it's about becoming optimistic in memory allocation (instead of the current pessimistic approach). So allocating all possible memory for a given request vs allocating later and having to deal with potential OOM situations.

@flozi00 (Contributor) commented Jul 15, 2024

I think this could be interesting, especially in the context of this PR:

https://buildkite.com/vllm/performance-benchmark/builds/4068

@github-actions github-actions bot added the Stale label Aug 15, 2024
@github-actions github-actions bot closed this Aug 20, 2024