Conversation

@starpit (Member) commented Sep 12, 2025

The complexity here: we need to crop the assistant messages so that they end on a block boundary. We cannot pad them out, because this is not what the model server will do when generating the assistant message. But it will cache the prefix of full blocks. Hence the need to crop. However, we cannot just crop at the end, as this would also crop off any "end of text" special tokens that the chat template adds to the end of the `self.assistant(m)` token sequence. Therefore, we need to crop just the message part. This logic tries to do all of that in a way that is agnostic to the chat template. However, the logic does currently assume that the chat template will never add special tokens *in the middle* of the given message `m`; it assumes special tokens are only ever added (if at all) to the beginning or end.

DO NOT MERGE
TODO: see the discussion below. We need to crop and also pad out the assistant suffix special token.
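
For concreteness, here is a minimal sketch of the cropping logic described above. It is not the code in this PR; all names are illustrative. Beyond the stated assumption about special tokens, it also assumes the bare message `m` tokenizes to the same ids in isolation as it does inside the template, so it can be located as a contiguous run:

```python
def crop_message_to_block_boundary(turn_tokens: list[int],
                                   message_tokens: list[int],
                                   block_size: int) -> list[int]:
    """Drop tokens from the *message* part so the whole turn ends on a block
    boundary, preserving the chat template's prefix/suffix special tokens
    (e.g. the "end of text" marker)."""
    # Locate the message as a contiguous run inside the templated turn; by
    # assumption, the template never injects special tokens into its middle.
    n = len(message_tokens)
    start = next(i for i in range(len(turn_tokens) - n + 1)
                 if turn_tokens[i:i + n] == message_tokens)
    prefix, suffix = turn_tokens[:start], turn_tokens[start + n:]

    # Tokens beyond the last full block boundary must come out of the
    # message alone, so that the suffix special tokens survive.
    overshoot = len(turn_tokens) % block_size
    assert overshoot <= n, "message too short to absorb the crop"
    return prefix + message_tokens[:n - overshoot] + suffix
```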

Signed-off-by: Nick Mitchell <[email protected]>
@starpit marked this pull request as draft September 12, 2025 17:24
@starpit (Member, Author) commented Sep 15, 2025

Assume below 4 tokens per block. We will visualize a block as something like `[ -- ]`, where `[` marks the beginning of a block, `]` the end, and `-` the tokens in the middle. The game here is to ensure every `+` token lands on a `[`, i.e. occurs at the beginning of a block. This means some combination of padding and cropping may be necessary. `_` represents a padding token.
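
As a quick aid for checking the diagrams, the pad/crop bookkeeping is just modular arithmetic on the running token count. A toy helper (`BLOCK` and the names are illustrative only):

```python
BLOCK = 4  # tokens per block, as assumed above

def pad_count(n_tokens: int, block: int = BLOCK) -> int:
    """How many '_' padding tokens bring n_tokens up to the next block boundary."""
    return (-n_tokens) % block

def crop_count(n_tokens: int, block: int = BLOCK) -> int:
    """How many trailing tokens to crop to fall back to the previous boundary."""
    return n_tokens % block

# e.g. '+<sp>SI<ss>' is 5 tokens, so pad_count(5) == 3 -- hence the three
# '_' after '<ss>' in the pictures below.
```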

Scenario 1: AAAAA and BBB as the two inner generate outputs

In this case, fortune is with us, and we only need to pad to get good inner-outer cache locality. We also send good-quality input to the outer generate, because no cropping is needed.

  1. Cache Entries from outer system prompt
<sp>SO<ss>
 [  -- ]
  2. Cache Entries from inner 1
+<sp>SI<ss>___+<up>UI<us>___+<ap>AAAAA<as>
[ -  -] [  --][ -  -] [  --][ -  -][-- ]
  3. Cache Entries from inner 2
+<sp>SI<ss>___+<up>UI<us>___+<ap>BBB<as>
[ -  -] [  --][ -  -] [  --][ -  -][ -

Outer Input

<sp>SO<ss>+<sp>SI<ss>___+<up>UI<us>___+<ap>AAAAA<as>+<sp>SI<ss>___+<up>UI<us>___+<ap>BBB<as>
 [  -- ]  [ -  -] [  --][ -  -] [  --][ -  -][-- ]  [ -  -] [  --][ -  -] [  --][ -  -][ -

Note how, excluding a short suffix, Outer Input = 1 + 2 + 3. Perfect!

Scenario 2: BBB and AAAAA as the two inner generate outputs

If we swap the order in which the inner outputs are presented to the outer input, then the trailing `[ -` of the BBB output haunts us. This results in a "miss in the middle", which causes vLLM to give up on finding any more cached blocks. So: this one has bad inner-outer cache locality, though it does send good-quality input to the outer generate.

  1. Cache Entries from outer system prompt
<sp>SO<ss>
 [  -- ]
  2. Cache Entries from inner 1
+<sp>SI<ss>___+<up>UI<us>___+<ap>BBB<as>
[ -  -] [  --][ -  -] [  --][ -  -][ -    <-- oops! 
  3. Cache Entries from inner 2
+<sp>SI<ss>___+<up>UI<us>___+<ap>AAAAA<as>
[ -  -] [  --][ -  -] [  --][ -  -][-- ]

Outer Input

<sp>SO<ss>+<sp>SI<ss>___+<up>UI<us>___+<ap>BBB<as>+<sp>SI<ss>___+<up>UI<us>___+<ap>AAAAA<as>
 [  -- ]  [ -  -] [  --][ -  -] [  --][ -  -][ -[ -  -] [  --][ -  -] [  --][ -  -][-- ]
                                               ^^^ ouch! Miss in the middle
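
For intuition about why a single miss poisons everything after it: prefix caches of this kind key each block on its contents *and* on the key of the block before it, so one unaligned block changes every key downstream. A toy model of this (in the spirit of vLLM's automatic prefix caching, not its actual code):

```python
import hashlib

def block_keys(tokens: list[int], block: int = 4) -> list[str]:
    """Chained content hashes for each *full* block, as a stand-in for how a
    prefix cache identifies reusable blocks."""
    keys, prev = [], ""
    for i in range(0, len(tokens) - len(tokens) % block, block):
        payload = prev + "|" + ",".join(map(str, tokens[i:i + block]))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        keys.append(prev)
    return keys

a = block_keys(list(range(12)))
b = block_keys([99] + list(range(12)))  # one extra token up front
assert not set(a) & set(b)  # after the shift, no later block can ever hit
```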

Scenario 3: Crop assistant output

If instead we crop the assistant output, we avoid the miss in the middle, but we have cropped off a special token. I.e. this one has good locality, but we send low-quality input to the outer generate.

<sp>SO<ss>+<sp>SI<ss>___+<up>UI<us>___+<ap>BB+<sp>SI<ss>___+<up>UI<us>___+<ap>AAAAA<as>
 [  -- ]  [ -  -] [  --][ -  -] [  --][ -  -][ -  -] [  --][ -  -] [  --][ -  -][-- ]
                                            ^^^ good now, but we have cropped an <as>

Scenario 4: Crop assistant output with fully padded suffix special token

If instead we crop the assistant output and then add back the assistant suffix special token in a fully padded block, we get good locality and send reasonable-quality input to the outer generate (see the sketch after the diagram below).

<sp>SO<ss>+<sp>SI<ss>___+<up>UI<us>___+<ap>BB<as>___+<sp>SI<ss>___+<up>UI<us>___+<ap>AAAAA<as>
                                              ^^^^^^ fully padded assistant suffix special token
 [  -- ]  [ -  -] [  --][ -  -] [  --][ -  -] [  --][ -  -] [  --][ -  -] [  --][ -  -][-- ]
                                            ^^^ also good now, preserving assistant suffix special token
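
A sketch of this Scenario 4 strategy, using the same illustrative names as above, and assuming each assistant turn starts on a block boundary (which the scheme guarantees inductively, since every turn it emits ends on one):

```python
def crop_and_pad_suffix(turn_tokens: list[int],
                        suffix_tokens: list[int],
                        block_size: int,
                        pad_id: int) -> list[int]:
    """Keep only the full blocks the model server actually cached while
    generating this turn, then re-attach the assistant suffix special
    token(s) in a fresh, fully padded block."""
    keep = len(turn_tokens) - len(turn_tokens) % block_size
    if keep == len(turn_tokens):
        return turn_tokens  # already block-aligned, as in Scenario 1

    # e.g. '+<ap>BBB<as>' (6 tokens) -> '+<ap>BB' (4) + '<as>___' (4)
    padded = suffix_tokens + [pad_id] * ((-len(suffix_tokens)) % block_size)
    return turn_tokens[:keep] + padded
```

The padded `<as>` block will never itself hit the cache (the inner generate never produced those tokens), but it keeps every subsequent turn block-aligned, and the outer generate still sees a well-formed end of turn.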
