Replies: 1 comment
🤖 Hello,

Based on the context provided, it seems like you're trying to implement a streaming response with Flask using LlamaIndex, but the response is not streaming as expected.

Firstly, please ensure that you're using a version of LlamaIndex that supports streaming responses. Streaming responses are supported as of version 0.8.26.post1; you're currently using version 0.8.7, which might not have this feature.

Secondly, to enable streaming in LlamaIndex, you need to use an LLM that supports streaming and set `streaming=True` when building the query engine. Here is how you can do it:

```python
query_engine = index.as_query_engine(
    streaming=True,
    similarity_top_k=1
)
```

Or, if you are using the low-level API:

```python
from llama_index import get_response_synthesizer

synth = get_response_synthesizer(streaming=True, ...)
query_engine = RetrieverQueryEngine(response_synthesizer=synth, ...)
```

After properly configuring both the LLM and the query engine, calling `query_engine.query(...)` will return a streaming response that yields tokens as they are generated rather than the full text at once.

Please ensure that you have correctly set up the streaming feature in your LlamaIndex configuration. If you're still facing issues, please provide more details about the error you're encountering, and I'll be glad to assist you further.

Best,
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
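As a supplement to the answer above, here is a minimal sketch of how the streaming query engine might be wired into a Flask route. The index construction, route name, and `q` query parameter are illustrative assumptions rather than the asker's actual code (which isn't shown); the essential pieces are `streaming=True` on the query engine and returning a generator so Flask can send the body chunk by chunk:

```python
from flask import Flask, Response, request, stream_with_context
from llama_index import SimpleDirectoryReader, VectorStoreIndex

app = Flask(__name__)

# Hypothetical index setup; substitute your own documents/index.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(streaming=True, similarity_top_k=1)

@app.route("/query")
def query():
    question = request.args.get("q", "")
    streaming_response = query_engine.query(question)

    def generate():
        # response_gen yields text deltas as the LLM produces them
        for token in streaming_response.response_gen:
            yield token

    # Returning a generator (rather than a fully built string) is what
    # lets Flask stream the body to the client chunk by chunk.
    return Response(stream_with_context(generate()), mimetype="text/plain")
```

Even with this in place, output can still appear all at once if something between Flask and the client buffers the response (for example an nginx proxy with `proxy_buffering` enabled, or an HTTP client that waits for the whole body before printing), so it's worth testing with something like `curl -N`.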
Question
Please help me with an error I encountered while implementing a streaming response with Flask: the AI response is returned all at once instead of being streamed.
Dependencies:
Sample code:
This is the request result:
However, the response is not streaming as expected.