Extract Input/Output token usage from request. #111
Conversation
Currently only the total_tokens usage from the response body is pushed to dynamicMetadata. This PR updates that logic to include input and output token usage as well. Signed-off-by: Sukumar Gaonkar <[email protected]>
…useage

# Conflicts:
#	internal/extproc/translator/openai_awsbedrock.go
#	internal/extproc/translator/openai_openai.go

Signed-off-by: Sukumar Gaonkar <[email protected]>
```go
// MonitorContinuousUsageStats controls whether the external processor monitors every
// response-body chunk for usage stats.
// When true, it monitors for token usage metadata in every response-body chunk received
// during a streaming request; compatible with vllm's 'continuous_usage_stats' flag.
// When false, it stops monitoring after finding token usage metadata for the first time;
// compatible with OpenAI's streaming response
// (https://platform.openai.com/docs/api-reference/chat/streaming#chat/streaming-usage).
// Only affects requests in streaming mode.
MonitorContinuousUsageStats bool `yaml:"monitorContinuousUsageStats,omitempty"`
```
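For reference, a minimal sketch of how a configuration carrying this flag could be loaded. Only the field name and its yaml tag come from the diff above; the surrounding `extProcConfig` struct and the loader are illustrative assumptions, not code from the PR.

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// extProcConfig is a hypothetical container for the flag; only the field
// name and its yaml tag are taken from the diff above.
type extProcConfig struct {
	MonitorContinuousUsageStats bool `yaml:"monitorContinuousUsageStats,omitempty"`
}

func main() {
	raw := []byte("monitorContinuousUsageStats: true\n")

	var cfg extProcConfig
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	fmt.Println(cfg.MonitorContinuousUsageStats) // prints: true
}
```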
could you remove the change related to this? I think this is a separate issue, and the metadata is not cumulative, so it would basically override previous values if it's emitted mid-stream.
```diff
@@ -65,7 +71,7 @@ type Translator interface {
 	ResponseBody(body io.Reader, endOfStream bool) (
 		headerMutation *extprocv3.HeaderMutation,
 		bodyMutation *extprocv3.BodyMutation,
-		usedToken uint32,
+		tokenUsage *TokenUsage,
```
i don't think these three uint32s would be worth the allocation, so could you return the value instead?

```diff
-		tokenUsage *TokenUsage,
+		tokenUsage TokenUsage,
```
I kept it as a pointer so that a nil value can indicate the absence of tokenUsage data in the responseBody:

```go
if tokenUsage == nil {...}
```

With a plain struct we would have to check the zero value of one of its members:

```go
if tokenUsage.TotalTokens == 0 {...}
```

thoughts?
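To make the trade-off concrete, here is a self-contained sketch of the two conventions under discussion. `withPointer` and `withValue` are hypothetical stand-ins for the translator's ResponseBody return, not code from the PR.

```go
package main

import "fmt"

// TokenUsage mirrors the struct from the diff below.
type TokenUsage struct {
	InputTokens  uint32
	OutputTokens uint32
	TotalTokens  uint32
}

// withPointer models the pointer convention: absence is an explicit nil.
func withPointer(found bool) *TokenUsage {
	if !found {
		return nil
	}
	return &TokenUsage{InputTokens: 3, OutputTokens: 5, TotalTokens: 8}
}

// withValue models the value convention: absence collapses into the zero value.
func withValue(found bool) TokenUsage {
	if !found {
		return TokenUsage{}
	}
	return TokenUsage{InputTokens: 3, OutputTokens: 5, TotalTokens: 8}
}

func main() {
	if u := withPointer(false); u == nil {
		fmt.Println("pointer: no usage data in this chunk")
	}
	if u := withValue(false); u.TotalTokens == 0 {
		// Ambiguous: the zero value could also be a genuine zero-token response.
		fmt.Println("value: no usage data (or zero tokens) in this chunk")
	}
}
```

The pointer makes "no usage data" distinguishable from a genuine zero-token response, at the cost of one heap allocation per response chunk; the value return avoids the allocation but folds both cases into the zero value.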
```go
type TokenUsage struct {
	InputTokens  uint32
	OutputTokens uint32
	TotalTokens  uint32
}
```
Any public structs/fields/methods would need comments.
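A sketch of what the requested doc comments could look like; the exact wording is an assumption, not from the PR.

```go
// TokenUsage holds the token counts extracted from a response body.
// NOTE: the comment wording here is illustrative, not from the PR.
type TokenUsage struct {
	// InputTokens is the number of tokens consumed by the request prompt.
	InputTokens uint32
	// OutputTokens is the number of tokens generated in the response.
	OutputTokens uint32
	// TotalTokens is the total reported by the upstream model, typically
	// the sum of InputTokens and OutputTokens.
	TotalTokens uint32
}
```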
```diff
@@ -29,3 +29,4 @@
 site/yarn-debug.log*
 site/yarn-error.log*
 site/static/.DS_Store
 site/temp
+/.idea/
```
unnecessary change - already in here
also, you can have a global gitignore in your OS... that's where people usually place ignore rules for files that aren't generated by the project itself.
sorry, this feels like a conflict with #103. I would appreciate it if you could hold off until that lands - I think that PR will supersede this one, besides the vllm stuff.
@sukumargaonkar thank you for waiting - #103 has landed, so could you rework the PR and focus on the vllm stuff?
ping |
Currently only the total_tokens usage from the response body is pushed to dynamicMetadata. This PR updates that logic to include input and output token usage as well.

This PR also introduces a monitorContinuousUsageStats flag in the config for the external process. The flag controls whether the external process monitors every response-body chunk for usage stats (a sketch of the two behaviours follows this list):

- when true, it monitors for token usage metadata in every response-body chunk received during a streaming request (compatible with vllm's 'continuous_usage_stats' flag)
- when false, it stops monitoring after finding token usage metadata for the first time (compatible with OpenAI's streaming response: https://platform.openai.com/docs/api-reference/chat/streaming#chat/streaming-usage)

The flag only affects requests in streaming mode.
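A minimal sketch of the two behaviours the flag selects between, assuming hypothetical extractTokenUsage and publishToDynamicMetadata helpers; this illustrates the semantics described above, not the PR's actual implementation.

```go
package main

import "fmt"

// TokenUsage mirrors the struct from the diff above.
type TokenUsage struct {
	InputTokens, OutputTokens, TotalTokens uint32
}

// extractTokenUsage is a hypothetical stand-in for the processor's real
// response-body parsing; here any non-empty chunk "contains" usage data.
func extractTokenUsage(chunk []byte) (TokenUsage, bool) {
	if len(chunk) == 0 {
		return TokenUsage{}, false
	}
	return TokenUsage{InputTokens: 3, OutputTokens: 5, TotalTokens: 8}, true
}

// publishToDynamicMetadata is a hypothetical stand-in for pushing the
// usage counts to Envoy's dynamic metadata.
func publishToDynamicMetadata(u TokenUsage) {
	fmt.Printf("input=%d output=%d total=%d\n", u.InputTokens, u.OutputTokens, u.TotalTokens)
}

// handleChunk shows the branch the flag controls in streaming mode.
func handleChunk(monitorContinuous bool, chunk []byte, seenUsage *bool) {
	if *seenUsage && !monitorContinuous {
		// OpenAI-style streaming: usage is emitted once, so stop scanning
		// further chunks after the first hit.
		return
	}
	if usage, ok := extractTokenUsage(chunk); ok {
		// vllm's continuous_usage_stats emits usage on every chunk, so with
		// monitorContinuous=true this keeps updating the metadata.
		publishToDynamicMetadata(usage)
		*seenUsage = true
	}
}

func main() {
	seen := false
	for _, chunk := range [][]byte{[]byte("chunk-1"), []byte("chunk-2")} {
		handleChunk(false, chunk, &seen) // with true, every chunk is scanned
	}
}
```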