Since the Assistants API is now effectively deprecated, we want to start an epic to add support for the Responses API, in which all future improvements for building RAG chatbots on OpenAI will land.

First we want to measure the latency of the OpenAI Assistants API so we can quantify the performance gains. Anecdotally, the Responses API seems at least 2x as fast, but we need to measure this, so we will build a benchmark in this ticket.
Here is a high-level plan:
Create a set of test assistants
[small-vector-store (2MB), medium-vector-store (10MB), large-vector-store (100MB)]
[kunji-assistant (100MB)] make a copy of the prod assistant
[english-vector-store, hindi-vector-store] × [english-queries, hindi-queries]
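A sketch of how the test assistants could be created with the openai Python SDK's beta Assistants surface. The store names, file paths, and model are placeholders, and `assistant_matrix` is just a hypothetical helper for enumerating the language cross product above:

```python
def assistant_matrix(stores: list[str], query_sets: list[str]) -> list[tuple[str, str]]:
    """Enumerate every (vector store, query set) pair to benchmark,
    e.g. [english-vector-store, hindi-vector-store] x [english-queries, hindi-queries]."""
    return [(s, q) for s in stores for q in query_sets]


def create_test_assistant(name: str, file_paths: list[str]) -> str:
    """Create one test assistant backed by a vector store built from the given files.
    Makes real API calls; requires the openai package and OPENAI_API_KEY."""
    from openai import OpenAI  # imported lazily so the helpers above stay importable offline

    client = OpenAI()
    store = client.beta.vector_stores.create(name=f"{name}-store")
    for path in file_paths:
        # Upload each file and wait until it is processed into the store.
        with open(path, "rb") as f:
            client.beta.vector_stores.files.upload_and_poll(
                vector_store_id=store.id, file=f
            )
    assistant = client.beta.assistants.create(
        name=name,
        model="gpt-4o",  # placeholder; match whatever the prod assistant uses
        tools=[{"type": "file_search"}],
        tool_resources={"file_search": {"vector_store_ids": [store.id]}},
    )
    return assistant.id
```

The 2MB/10MB/100MB buckets would just be different `file_paths` inputs to the same function.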
On the test assistants created above, run a list of 50-100 test queries, measure the latency of each call, and compute the mean latency.
We will set this up as a CLI command rather than as part of the test suite, because we don't want to mock out the OpenAI API (we want to make real calls to the service).
The benchmark will go through the /threads/sync endpoint, which is used only for load testing.
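The measurement loop could look roughly like this. It is a minimal sketch assuming the openai Python SDK's beta Assistants API; the assistant ID and query list are placeholders:

```python
import statistics
import time


def run_query(assistant_id: str, query: str) -> float:
    """Send one query through a fresh thread and return wall-clock seconds.
    Makes a real API call; requires the openai package and OPENAI_API_KEY."""
    from openai import OpenAI  # imported lazily so summarize() works offline

    client = OpenAI()
    start = time.monotonic()
    thread = client.beta.threads.create(
        messages=[{"role": "user", "content": query}]
    )
    # create_and_poll blocks until the run completes, so the elapsed
    # time covers the full request-to-answer latency.
    client.beta.threads.runs.create_and_poll(
        thread_id=thread.id, assistant_id=assistant_id
    )
    return time.monotonic() - start


def summarize(latencies: list[float]) -> dict:
    """Reduce per-call latencies to the summary stats the benchmark reports."""
    return {
        "n": len(latencies),
        "mean_s": statistics.mean(latencies),
        "p95_s": sorted(latencies)[int(0.95 * (len(latencies) - 1))],
    }


# Example CLI usage (real network calls, so not run here):
#   latencies = [run_query("asst_XXX", q) for q in QUERIES]
#   print(summarize(latencies))
```

Reporting a p95 alongside the mean is optional, but it guards against a few slow outliers skewing the comparison with the Responses API later.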
CI
We can run the benchmark in GitHub Actions, and eventually we could post the results to a spreadsheet.
Next steps
This prepares for a separate ticket where we will add a separate set of endpoints for the synchronous Responses API and re-run the benchmark to measure the performance improvements we see.