@starpit commented Sep 18, 2025

This intersects with #329. That PR covers output quality: we should find a way to retain the assistant message suffix. This PR addresses cache-locality performance: in some cases (when the message happens to fill n blocks exactly, i.e. no cropping is needed), we were still including the chat template suffix, which is not in the KV cache.

This PR makes sure the suffix is always cropped out. We will still need the real fix for #329 to get good-quality output.
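
To illustrate the idea, here is a minimal sketch, not the code in this PR. It assumes the tokenized message ends with a chat template suffix of known length and that the KV cache is keyed on fixed-size token blocks. The names `crop_for_cache`, `suffix_len`, and `block_size` are hypothetical and chosen only for this example.

```python
from typing import List


def crop_for_cache(tokens: List[int], suffix_len: int, block_size: int) -> List[int]:
    """Return the prefix of `tokens` to use for a block-based KV cache lookup.

    Hypothetical helper: `suffix_len` is the assumed number of trailing
    tokens contributed by the chat template suffix.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if suffix_len < 0:
        raise ValueError("suffix_len must be non-negative")

    # Always drop the chat template suffix first, even when the full
    # sequence already lands on a block boundary.
    body_len = max(0, len(tokens) - suffix_len)
    body = tokens[:body_len]

    # Then keep only whole blocks, since a partial block cannot be a cache hit.
    usable = (len(body) // block_size) * block_size
    return body[:usable]


# 8 tokens with block_size 4 fill two blocks exactly, so a crop that only
# rounds down to a block boundary would keep the 2 suffix tokens.
print(crop_for_cache(list(range(8)), suffix_len=2, block_size=4))
```

In this sketch the suffix is removed unconditionally before rounding down to a block boundary, so the result never depends on whether the original length happened to be block-aligned.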

@starpit merged commit 8ed8981 into IBM:main Sep 18, 2025
12 checks passed
@starpit deleted the crop-fix branch September 18, 2025 21:54