#27 was introduced a while ago and committed (albeit into another repo, which is now used for ai.hackclub.com, AFAIK), but it has a major flaw: it makes longer/more complex requests effectively unusable and adds a very large delay before streaming begins.
For example, I set GPT-OSS 120B to high reasoning_effort and ran the same two queries through ai.hackclub.com and directly against Groq.
[Screenshot: ai.hackclub.com response time]
And that barely scratches the surface: with more complex queries that involve tool use, I have seen ai.hackclub.com take 59.90s to return the very first token, while Groq responds near-instantly.
Given that the delay grows proportionally with the response size, I suspect the API collects every token into a buffer before responding at all. The fix would simply be to drop that wait and stream tokens to the client the moment they arrive from upstream. If you want to log full completions, I would suggest either logging asynchronously (if that's possible) or keeping the buffer purely for logging: accumulate tokens for the log entry and write it once the message completes, but never hold back the client-facing response.
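
A minimal sketch of that pattern, assuming a fetch-based TypeScript proxy on a runtime with WHATWG streams (Node 18+, Bun, Deno, Workers). The `UPSTREAM_URL`, header forwarding, and `console.log` sink here are illustrative placeholders, not the repo's actual code:

```typescript
// Hypothetical sketch: pass the upstream SSE stream through to the client
// unmodified, while buffering a copy in the background only for logging.
const UPSTREAM_URL = "https://api.groq.com/openai/v1/chat/completions"; // placeholder

async function handleChatCompletion(req: Request): Promise<Response> {
  const upstream = await fetch(UPSTREAM_URL, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: req.headers.get("Authorization") ?? "",
    },
    body: await req.text(), // forward the client's JSON payload as-is
  });

  // tee() splits the stream into two independent branches: one goes straight
  // to the client, the other is drained in the background for logging.
  const [toClient, toLog] = upstream.body!.tee();

  // Fire-and-forget logger: this is the only place the full completion is
  // buffered, and it never delays the client-facing response.
  (async () => {
    const reader = toLog.getReader();
    const decoder = new TextDecoder();
    let text = "";
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      text += decoder.decode(value, { stream: true });
    }
    text += decoder.decode(); // flush any trailing multi-byte sequence
    console.log(text); // swap in the real logging sink here
  })();

  // Assumes the client requested a streaming (SSE) completion.
  return new Response(toClient, {
    status: upstream.status,
    headers: { "Content-Type": "text/event-stream" },
  });
}
```

Because `tee()`'s two branches are consumed independently, the logging branch never gates the client branch, so time-to-first-token stays whatever the upstream provider delivers.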