Conversation

@starpit (Member) commented Sep 12, 2025

The complexity here: we need to crop the assistant messages so that they end on a block boundary. We cannot pad them out, because this is not what the model server will do when generating the assistant message. But it will cache the prefix of full blocks. Hence the need to crop. However, we cannot just crop at the end, as this would also crop off any "end of text" special tokens that the chat template adds to the end of the `self.assistant(m)` token sequence. Therefore, we need to crop just the message part. This logic tries to do all of that in a way that is agnostic to the chat template. However, the logic does currently assume that the chat template will never add special tokens *in the middle* of the given message `m`; it assumes special tokens are only ever added (if at all) to the beginning or end.

DO NOT MERGE
TODO: see the discussion below. We need to crop and also pad out the assistant suffix special token.
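
For concreteness, here is a minimal sketch of the cropping logic described above. It is not the code in this PR; all names are illustrative. Beyond the stated assumption about special tokens, it also assumes the bare message `m` tokenizes to the same ids in isolation as it does inside the template, so it can be located as a contiguous run:

```python
def crop_message_to_block_boundary(turn_tokens: list[int],
                                   message_tokens: list[int],
                                   block_size: int) -> list[int]:
    """Drop tokens from the *message* part so the whole turn ends on a block
    boundary, preserving the chat template's prefix/suffix special tokens
    (e.g. the "end of text" marker)."""
    # Locate the message as a contiguous run inside the templated turn; by
    # assumption, the template never injects special tokens into its middle.
    n = len(message_tokens)
    start = next(i for i in range(len(turn_tokens) - n + 1)
                 if turn_tokens[i:i + n] == message_tokens)
    prefix, suffix = turn_tokens[:start], turn_tokens[start + n:]

    # Tokens beyond the last full block boundary must come out of the
    # message alone, so that the suffix special tokens survive.
    overshoot = len(turn_tokens) % block_size
    assert overshoot <= n, "message too short to absorb the crop"
    return prefix + message_tokens[:n - overshoot] + suffix
```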

Signed-off-by: Nick Mitchell <[email protected]>
@starpit marked this pull request as draft September 12, 2025 17:24
@starpit (Member, Author) commented Sep 15, 2025

Assume below 4 tokens per block. We will visualize a block as something like `[ -- ]`, where `[` marks the beginning of a block, `]` the end, and `-` the tokens in the middle. The game here is to ensure every `+` token lands on a `[`, i.e. occurs at the beginning of a block. This means some combination of padding and cropping may be necessary. `_` represents a padding token.
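
As a quick aid for checking the diagrams, the pad/crop bookkeeping is just modular arithmetic on the running token count. A toy helper (`BLOCK` and the names are illustrative only):

```python
BLOCK = 4  # tokens per block, as assumed above

def pad_count(n_tokens: int, block: int = BLOCK) -> int:
    """How many '_' padding tokens bring n_tokens up to the next block boundary."""
    return (-n_tokens) % block

def crop_count(n_tokens: int, block: int = BLOCK) -> int:
    """How many trailing tokens to crop to fall back to the previous boundary."""
    return n_tokens % block

# e.g. '+<sp>SI<ss>' is 5 tokens, so pad_count(5) == 3 -- hence the three
# '_' after '<ss>' in the pictures below.
```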

Scenario 1: AAAAA and BBB as the two inner generate outputs

In this case, fortune is with us, and we only need to pad to get good inner-outer cache locality. We also send good-quality input to the outer generate, because no cropping is needed.

  1. Cache Entries from outer system prompt
<sp>SO<ss>
 [  -- ]
  2. Cache Entries from inner 1
+<sp>SI<ss>___+<up>UI<us>___+<ap>AAAAA<as>
[ -  -] [  --][ -  -] [  --][ -  -][-- ]
  3. Cache Entries from inner 2
+<sp>SI<ss>___+<up>UI<us>___+<ap>BBB<as>
[ -  -] [  --][ -  -] [  --][ -  -][ -

Outer Input

<sp>SO<ss>+<sp>SI<ss>___+<up>UI<us>___+<ap>AAAAA<as>+<sp>SI<ss>___+<up>UI<us>___+<ap>BBB<as>
 [  -- ]  [ -  -] [  --][ -  -] [  --][ -  -][-- ]  [ -  -] [  --][ -  -] [  --][ -  -][ -

Note how, excluding a short suffix, Outer Input = 1 + 2 + 3. Perfect!

Scenario 2: BBB and AAAAA as the two inner generate outputs

If we swap the order in which the inner outputs are presented to the outer input, then the trailing `[ -` of the BBB output haunts us. This results in a "miss in the middle", which causes vLLM to give up on finding any more cached blocks. So: this one has bad inner-outer cache locality, though it does send good-quality input to the outer generate.

  1. Cache Entries from outer system prompt
<sp>SO<ss>
 [  -- ]
  2. Cache Entries from inner 1
+<sp>SI<ss>___+<up>UI<us>___+<ap>BBB<as>
[ -  -] [  --][ -  -] [  --][ -  -][ -    <-- oops! 
  3. Cache Entries from inner 2
+<sp>SI<ss>___+<up>UI<us>___+<ap>AAAAA<as>
[ -  -] [  --][ -  -] [  --][ -  -][-- ]

Outer Input

<sp>SO<ss>+<sp>SI<ss>___+<up>UI<us>___+<ap>BBB<as>+<sp>SI<ss>___+<up>UI<us>___+<ap>AAAAA<as>
 [  -- ]  [ -  -] [  --][ -  -] [  --][ -  -][ -[ -  -] [  --][ -  -] [  --][ -  -][-- ]
                                               ^^^ ouch! Miss in the middle
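
For intuition about why a single miss poisons everything after it: prefix caches of this kind key each block on its contents *and* on the key of the block before it, so one unaligned block changes every key downstream. A toy model of this (in the spirit of vLLM's automatic prefix caching, not its actual code):

```python
import hashlib

def block_keys(tokens: list[int], block: int = 4) -> list[str]:
    """Chained content hashes for each *full* block, as a stand-in for how a
    prefix cache identifies reusable blocks."""
    keys, prev = [], ""
    for i in range(0, len(tokens) - len(tokens) % block, block):
        payload = prev + "|" + ",".join(map(str, tokens[i:i + block]))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        keys.append(prev)
    return keys

a = block_keys(list(range(12)))
b = block_keys([99] + list(range(12)))  # one extra token up front
assert not set(a) & set(b)  # after the shift, no later block can ever hit
```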

Scenario 3: Crop assistant output

If instead we crop the assistant output, we avoid the miss in the middle, but we have cropped off a special token. I.e. this one has good locality, but we send low-quality input to the outer generate.

<sp>SO<ss>+<sp>SI<ss>___+<up>UI<us>___+<ap>BB+<sp>SI<ss>___+<up>UI<us>___+<ap>AAAAA<as>
 [  -- ]  [ -  -] [  --][ -  -] [  --][ -  -][ -  -] [  --][ -  -] [  --][ -  -][-- ]
                                            ^^^ good now, but we have cropped an <as>

Scenario 4: Crop assistant output with fully padded suffix special token

If instead we crop the assistant output and then add back the assistant suffix special token in a fully padded block, we get good locality and send reasonable-quality input to the outer generate (see the sketch after the diagram below).

<sp>SO<ss>+<sp>SI<ss>___+<up>UI<us>___+<ap>BB<as>___+<sp>SI<ss>___+<up>UI<us>___+<ap>AAAAA<as>
                                              ^^^^^^ fully padded assistant suffix special token
 [  -- ]  [ -  -] [  --][ -  -] [  --][ -  -] [  --][ -  -] [  --][ -  -] [  --][ -  -][-- ]
                                            ^^^ also good now, preserving assistant suffix special token
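
A sketch of this Scenario 4 strategy, using the same illustrative names as above, and assuming each assistant turn starts on a block boundary (which the scheme guarantees inductively, since every turn it emits ends on one):

```python
def crop_and_pad_suffix(turn_tokens: list[int],
                        suffix_tokens: list[int],
                        block_size: int,
                        pad_id: int) -> list[int]:
    """Keep only the full blocks the model server actually cached while
    generating this turn, then re-attach the assistant suffix special
    token(s) in a fresh, fully padded block."""
    keep = len(turn_tokens) - len(turn_tokens) % block_size
    if keep == len(turn_tokens):
        return turn_tokens  # already block-aligned, as in Scenario 1

    # e.g. '+<ap>BBB<as>' (6 tokens) -> '+<ap>BB' (4) + '<as>___' (4)
    padded = suffix_tokens + [pad_id] * ((-len(suffix_tokens)) % block_size)
    return turn_tokens[:keep] + padded
```

The padded `<as>` block will never itself hit the cache (the inner generate never produced those tokens), but it keeps every subsequent turn block-aligned, and the outer generate still sees a well-formed end of turn.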
