#27 was introduced a while ago and committed (albeit into another repo, which is now used for ai.hackclub.com, AFAIK), but it has a major flaw: it makes longer/more complex requests effectively unusable and adds a very large delay before streaming begins.
For example, I set GPT-OSS 120B to high reasoning_effort and ran the same two queries through ai.hackclub.com and directly against Groq.
[Screenshot: ai.hackclub.com response time]
And that barely scratches the surface: with more complex queries that involve tool use, I have seen ai.hackclub.com take 59.90s to return the very first token, while Groq responds near-instantly.
Given that the delay grows proportionally with the response size, I suspect the API collects every token into a buffer before responding at all. The fix would simply be to drop that wait and stream tokens to the client the moment they arrive from upstream. If you want to log full completions, I would suggest either logging asynchronously (if that's possible) or keeping the buffer purely for logging: accumulate tokens for the log entry and write it once the message completes, but never hold back the client-facing response.
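
A minimal sketch of that pattern, assuming a fetch-based TypeScript proxy on a runtime with WHATWG streams (Node 18+, Bun, Deno, Workers). The `UPSTREAM_URL`, header forwarding, and `console.log` sink here are illustrative placeholders, not the repo's actual code:

```typescript
// Hypothetical sketch: pass the upstream SSE stream through to the client
// unmodified, while buffering a copy in the background only for logging.
const UPSTREAM_URL = "https://api.groq.com/openai/v1/chat/completions"; // placeholder

async function handleChatCompletion(req: Request): Promise<Response> {
  const upstream = await fetch(UPSTREAM_URL, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: req.headers.get("Authorization") ?? "",
    },
    body: await req.text(), // forward the client's JSON payload as-is
  });

  // tee() splits the stream into two independent branches: one goes straight
  // to the client, the other is drained in the background for logging.
  const [toClient, toLog] = upstream.body!.tee();

  // Fire-and-forget logger: this is the only place the full completion is
  // buffered, and it never delays the client-facing response.
  (async () => {
    const reader = toLog.getReader();
    const decoder = new TextDecoder();
    let text = "";
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      text += decoder.decode(value, { stream: true });
    }
    text += decoder.decode(); // flush any trailing multi-byte sequence
    console.log(text); // swap in the real logging sink here
  })();

  // Assumes the client requested a streaming (SSE) completion.
  return new Response(toClient, {
    status: upstream.status,
    headers: { "Content-Type": "text/event-stream" },
  });
}
```

Because `tee()`'s two branches are consumed independently, the logging branch never gates the client branch, so time-to-first-token stays whatever the upstream provider delivers.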