
Major response delay #30


Description

@Mostlime12195

#27 was introduced a while ago and has since been committed (albeit into another repo, which is now used for ai.hackclub.com AFAIK), but it has a major flaw: it severely limits longer/more complex requests and introduces a very large delay before streaming begins.

For example, I set GPT-OSS 120B to high reasoning_effort and sent the same two queries to both endpoints.

ai.hackclub.com response time: (screenshot)

Groq response time: (screenshot)

And that barely scratches the surface: with more complex queries that involve tool use, I have seen ai.hackclub.com take 59.90s to return the very first token, while Groq responds instantly.
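For reference, a comparison like this can be reproduced by timing the arrival of the first streamed chunk from each endpoint. This is only a sketch: the ai.hackclub.com endpoint path, the model id, and the `timeToFirstToken` helper are my assumptions, not taken from either API's docs or the repo.

```ts
// Sketch: measure time-to-first-token (TTFT) for an OpenAI-compatible
// streaming endpoint. Runs on any runtime with global fetch and top-level
// await (Node 18+ ESM, Bun). Endpoint paths and model id are assumed.
async function timeToFirstToken(url: string, apiKey?: string): Promise<number> {
  const start = performance.now();
  const res = await fetch(url, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      ...(apiKey ? { Authorization: `Bearer ${apiKey}` } : {}),
    },
    body: JSON.stringify({
      model: "openai/gpt-oss-120b", // assumed model id
      reasoning_effort: "high",
      stream: true,
      messages: [{ role: "user", content: "Explain TCP slow start." }],
    }),
  });
  const reader = res.body!.getReader();
  await reader.read();                   // blocks until the first chunk arrives
  const elapsedMs = performance.now() - start;
  await reader.cancel();                 // skip the rest of the stream
  return elapsedMs;
}

console.log("ai.hackclub.com TTFT:", await timeToFirstToken("https://ai.hackclub.com/chat/completions"));
console.log("Groq TTFT:", await timeToFirstToken("https://api.groq.com/openai/v1/chat/completions", process.env.GROQ_API_KEY));
```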

Given that the delay increases proportionally with the response size, I suspect that the API gathers all tokens into a buffer before responding. The fix would simply be to remove that buffer and forward each token to the client as soon as it comes in. If you want to log all tokens, I would highly suggest either logging asynchronously (if that's possible) or waiting for the message to complete before writing the log (keep the buffer for logging, but don't wait until the end to send out the response).
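A minimal sketch of what I mean, assuming a fetch-based proxy in front of an OpenAI-compatible SSE upstream (the names `UPSTREAM_URL`, `handleChat`, and `logCompletion` are hypothetical, not from the actual repo):

```ts
// Sketch only: forward each upstream chunk to the client immediately,
// and keep a copy so the full response can be logged after completion.
// Uses web-standard Request/Response/ReadableStream (Node 18+, Deno, Bun).
const UPSTREAM_URL = "https://api.groq.com/openai/v1/chat/completions"; // assumed

async function handleChat(req: Request): Promise<Response> {
  const upstream = await fetch(UPSTREAM_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" }, // auth forwarding omitted for brevity
    body: await req.text(),                          // pass the client's JSON body through
  });

  const chunks: Uint8Array[] = []; // used only for logging, never to delay the client

  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      const reader = upstream.body!.getReader();
      for (;;) {
        const { done, value } = await reader.read();
        if (done) break;
        chunks.push(value);        // retain a copy for the log
        controller.enqueue(value); // stream to the client right away
      }
      controller.close();
      // Log after the stream has finished; the client is not blocked on this.
      logCompletion(chunks).catch(console.error);
    },
  });

  return new Response(stream, {
    headers: { "Content-Type": "text/event-stream" },
  });
}

// Hypothetical async logger: reassemble the buffered chunks and persist them.
async function logCompletion(chunks: Uint8Array[]): Promise<void> {
  const decoder = new TextDecoder();
  const text =
    chunks.map((c) => decoder.decode(c, { stream: true })).join("") + decoder.decode();
  console.log(`logged ${text.length} chars of completed response`);
}
```

Alternatively, the upstream body's built-in `tee()` would split it into two branches, one returned to the client and one consumed by the logger, with the same effect.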
