I am in a situation where I need to issue a large number of requests to a reasoning model. The long thinking process can easily cause the requests to time out, leading to a high failure rate.
After discussing this with the platform engineers, they recommended using streaming responses to make extended reasoning requests more stable.
I played with curator's source code for a bit and added streaming support. It works quite well: I used to see a lot of timeout errors, and now they are gone.
The proposed changes are here: e0c63b9. The main addition is the fetch_response_streamed function, which fetches the streaming response chunks and concatenates them so the result looks as if it came from a non-streaming request.
https://github.com/lyuwen/curator/blob/e0c63b9c40d45d60421afb24844842a8d2c411e2/src/bespokelabs/curator/request_processor/online/openai_online_request_processor.py#L71-L121
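For readers who haven't opened the commit, here is a minimal sketch of the concatenation idea. The `fake_stream` generator and the dict-shaped chunks are stand-ins for illustration, not curator's actual types; a real client would yield chunks from the streaming API instead.

```python
import asyncio


async def fake_stream():
    # Stand-in for an API streaming iterator; yields delta chunks
    # shaped loosely like OpenAI chat-completion stream events.
    for piece in ["The answer", " is", " 42."]:
        yield {"choices": [{"delta": {"content": piece}}]}


async def fetch_response_streamed(stream):
    # Accumulate the streamed content deltas, then join them so the
    # caller sees one complete message, as with a non-streaming request.
    parts = []
    async for chunk in stream:
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            parts.append(delta)
    return "".join(parts)


print(asyncio.run(fetch_response_streamed(fake_stream())))
```

Because data arrives continuously, the connection stays active during the model's long thinking phase, which is why the idle timeouts disappear.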