Streaming API is very slow, is it a bug or a user error? #1066

plebedev · 2025-01-27T18:04:42Z

plebedev
Jan 27, 2025

Hello!

We have a product that is using Speech To Text, and we're evaluating Deepgram APIs to potentially use it as a primary STT provider.

We have our own Voice Activity Detection, and we buffer up audio before transcribing. The length of buffered audio is under 10 sec (most I believe is even under 5 sec).

First, we tried pre-recorded audio API since it made more sense to us, and we get fairly good performance. However, this post https://github.com/orgs/deepgram/discussions/751 says
This pre-recorded endpoint is designed to be fast, but it is not expected to be real time or have a maximum latency below 20 seconds. If you need consistently low latency times, please utilize our Streaming speech to text services, which process audio in real time and may be better suited for your use case.

that suggests there are no performance guarantees, and the streaming API would be a better choice.

So, I tried the streaming API by sending chunks of our buffered audio (in 4KB chunks) over, and then finalizing, and closing the web socket connection. The results were disappointing.

While the first response comes back in about 300ms (as promised), the full transcription takes 10X longer than pre-recorded API for the same audio file. Here is one example. You can see that the first response comes back in about 316ms, the second comes back in over 1.6 sec, the time between third and forth responses is ~2.5 sec. The total time between all the data being sent, and the final message received is 5.2sec, that seems too long for a 5 sec audio

2025-01-27 12:24:35.160 DEBUG 86025 --- [           main] o.s.w.s.c.s.StandardWebSocketClient      : Connecting to wss://api.deepgram.com/v1/listen?model=nova-2-phonecall
2025-01-27 12:24:35.567  INFO 86025 --- [cTaskExecutor-1] c.j.a.s.v.DeepgramVoiceService           : Sending audio data...
2025-01-27 12:24:35.571  INFO 86025 --- [cTaskExecutor-1] c.j.a.s.v.DeepgramVoiceService           : Audio data sent
2025-01-27 12:24:35.887  INFO 86025 --- [ient-SecureIO-2] c.j.a.s.v.DeepgramVoiceService           : Received message: {"type":"Results","channel_index":[0,1],"duration":0.2558125,"start":0.0,"is_final":true,"speech_final":false,"channel":{"alternatives":[{"transcript":"pay","confidence":0.7685547,"words":[{"word":"pay","start":0.24,"end":0.2558125,"confidence":0.7685547}]}]},"metadata":{"request_id":"d4c860df-b2c4-4ef6-9bfe-303ff85a755e","model_info":{"name":"2-phonecall-nova","version":"2024-02-07.20824","arch":"nova-2"},"model_uuid":"7e3b5bdf-85ed-4fd2-9f7a-7721bbcad97b"},"from_finalize":true}
2025-01-27 12:24:37.510  INFO 86025 --- [ient-SecureIO-2] c.j.a.s.v.DeepgramVoiceService           : Received message: {"type":"Results","channel_index":[0,1],"duration":1.83,"start":0.2558125,"is_final":true,"speech_final":true,"channel":{"alternatives":[{"transcript":"the sockets in the wall","confidence":1.0,"words":[{"word":"the","start":0.6558125,"end":0.81581247,"confidence":0.98828125},{"word":"sockets","start":0.81581247,"end":1.1358125,"confidence":1.0},{"word":"in","start":1.1358125,"end":1.3758125,"confidence":1.0},{"word":"the","start":1.3758125,"end":1.6258125,"confidence":1.0},{"word":"wall","start":1.6258125,"end":2.0858126,"confidence":0.9194336}]}]},"metadata":{"request_id":"d4c860df-b2c4-4ef6-9bfe-303ff85a755e","model_info":{"name":"2-phonecall-nova","version":"2024-02-07.20824","arch":"nova-2"},"model_uuid":"7e3b5bdf-85ed-4fd2-9f7a-7721bbcad97b"},"from_finalize":false}
2025-01-27 12:24:38.059  INFO 86025 --- [ient-SecureIO-2] c.j.a.s.v.DeepgramVoiceService           : Received message: {"type":"Results","channel_index":[0,1],"duration":0.8699999,"start":2.0858126,"is_final":true,"speech_final":true,"channel":{"alternatives":[{"transcript":"dol green","confidence":0.7302246,"words":[{"word":"dol","start":2.2458127,"end":2.3258126,"confidence":0.3137207},{"word":"green","start":2.3258126,"end":2.8258126,"confidence":0.7302246}]}]},"metadata":{"request_id":"d4c860df-b2c4-4ef6-9bfe-303ff85a755e","model_info":{"name":"2-phonecall-nova","version":"2024-02-07.20824","arch":"nova-2"},"model_uuid":"7e3b5bdf-85ed-4fd2-9f7a-7721bbcad97b"},"from_finalize":false}
2025-01-27 12:24:40.583  INFO 86025 --- [ient-SecureIO-2] c.j.a.s.v.DeepgramVoiceService           : Received message: {"type":"Results","channel_index":[0,1],"duration":3.17,"start":2.9558125,"is_final":true,"speech_final":true,"channel":{"alternatives":[{"transcript":"the child crawled into the dense grass","confidence":1.0,"words":[{"word":"the","start":3.6758125,"end":3.8358126,"confidence":1.0},{"word":"child","start":3.8358126,"end":4.3158126,"confidence":1.0},{"word":"crawled","start":4.3158126,"end":4.6358123,"confidence":1.0},{"word":"into","start":4.6358123,"end":4.9558125,"confidence":1.0},{"word":"the","start":4.9558125,"end":5.1158123,"confidence":1.0},{"word":"dense","start":5.1158123,"end":5.5158124,"confidence":1.0},{"word":"grass","start":5.5158124,"end":6.0158124,"confidence":0.9863281}]}]},"metadata":{"request_id":"d4c860df-b2c4-4ef6-9bfe-303ff85a755e","model_info":{"name":"2-phonecall-nova","version":"2024-02-07.20824","arch":"nova-2"},"model_uuid":"7e3b5bdf-85ed-4fd2-9f7a-7721bbcad97b"},"from_finalize":false}
2025-01-27 12:24:40.762  INFO 86025 --- [ient-SecureIO-2] c.j.a.s.v.DeepgramVoiceService           : Received message: {"type":"Results","channel_index":[0,1],"duration":0.1107502,"start":6.1258125,"is_final":true,"speech_final":false,"channel":{"alternatives":[{"transcript":"","confidence":0.0,"words":[]}]},"metadata":{"request_id":"d4c860df-b2c4-4ef6-9bfe-303ff85a755e","model_info":{"name":"2-phonecall-nova","version":"2024-02-07.20824","arch":"nova-2"},"model_uuid":"7e3b5bdf-85ed-4fd2-9f7a-7721bbcad97b"},"from_finalize":false}
2025-01-27 12:24:40.763  INFO 86025 --- [ient-SecureIO-2] c.j.a.s.v.DeepgramVoiceService           : Received message: {"type":"Metadata","transaction_key":"deprecated","request_id":"d4c860df-b2c4-4ef6-9bfe-303ff85a755e","sha256":"0cd1d314b7ba9215ab0d9a0fc99932b90a3242ad8db886cc13f42ecaa7090348","created":"2025-01-27T17:24:35.583Z","duration":6.2365627,"channels":1,"models":["7e3b5bdf-85ed-4fd2-9f7a-7721bbcad97b"],"model_info":{"7e3b5bdf-85ed-4fd2-9f7a-7721bbcad97b":{"name":"2-phonecall-nova","version":"2024-02-07.20824","arch":"nova-2"}}}
Deepgram WebSocket Transcription Response: pay the sockets in the wall dol green the child crawled into the dense grass
Iteration Time: 5606 ms

Also, the transcription is not accurate. The pre-recorded API returns the correct transcription in about 500ms, and this is one of the longer responses.

Deepgram REST API Transcription Response: paint the sockets in the wall dull green the child crawled into the dense grass
Iteration time: 518 ms

Note the first word pay vs paint, and dol vs dull.

I've ran 100 iterations of each version, and these are the stats:

Deepgram REST API Statistics:
Min: 147 ms
Max: 1235 ms
Average: 375.14 ms
P75: 466 ms
P95: 869 ms

Deepgram WebSocket Statistics:
Min: 5323 ms
Max: 6566 ms
Average: 5557.57 ms
P75: 5645 ms
P95: 5791 ms

While WebSocket/Streaming APIs are very consistent in their response times, they are much slower than the streaming APIs.

I'm using Java, and the code is fairly straight forward:

public String transcribeAudioWebSocket(byte[] audioData) throws IOException {
    assert audioData != null;

    CompletableFuture<String> transcriptionFuture = new CompletableFuture<>();

    try {
      StandardWebSocketClient client = new StandardWebSocketClient();

      WebSocketHandler handler = new AbstractWebSocketHandler() {
        private final StringBuilder transcriptionBuilder = new StringBuilder();

        @Override
        public void handleTextMessage(WebSocketSession session, TextMessage message) {
          try {
            JsonNode responseNode =
                DeepgramVoiceService.this.objectMapper.readTree(message.getPayload());
            log.info("Received message: " + responseNode);
            if (responseNode.has("type") && "Results".equals(responseNode.get("type").asText())) {
              String transcript = responseNode
                  .path("channel")
                  .path("alternatives")
                  .path(0)
                  .path("transcript")
                  .asText("");

              this.transcriptionBuilder.append(transcript).append(' ');
            }
          } catch (Exception e) {
            log.error("Error processing transcription message", e);
            transcriptionFuture.completeExceptionally(
                new DeepGramVoiceServiceException("Failed to process transcription", e));
          }
        }

        @Override
        public void afterConnectionEstablished(WebSocketSession session) {
          try {
            // Send audio data in chunks
            log.info("Sending audio data...");
            for (int i = 0; i < audioData.length; i += CHUNK_SIZE) {
              int end = Math.min(i + CHUNK_SIZE, audioData.length);
              byte[] chunk = Arrays.copyOfRange(audioData, i, end);
              session.sendMessage(new BinaryMessage(chunk));
            }

            // Finalize audio data
            session.sendMessage(new TextMessage("{\"type\": \"Finalize\"}"));
            // Send close message
            session.sendMessage(new TextMessage("{\"type\": \"CloseStream\"}"));
            log.info("Audio data sent");
          } catch (Exception e) {
            log.error("Error sending audio data", e);
            transcriptionFuture.completeExceptionally(
                new DeepGramVoiceServiceException("Failed to send audio data", e));
          }
        }

        @Override
        public void afterConnectionClosed(WebSocketSession session, CloseStatus status)
            throws Exception {
          transcriptionFuture.complete(this.transcriptionBuilder.toString().trim());
          log.info("Final data received");
        }
      };

      // Convert HTTP URL to WebSocket URL and add query parameters
      String wsUrl = this.deepgramApiUrl.replace("https", "wss") + "/listen?model=nova-2-phonecall";

      // Create WebSocket headers
      WebSocketHttpHeaders wsHeaders = new WebSocketHttpHeaders();
      wsHeaders.add("Authorization", "Token " + this.deepgramApiKey);

      client.doHandshake(handler, wsHeaders, URI.create(wsUrl)).get();

      // Wait for the transcription to complete with a timeout
      return transcriptionFuture.get();

    } catch (Exception e) {
      throw new DeepGramVoiceServiceException("Failed to transcribe audio using WebSocket", e);
    }
  }

I tried to increase the chunk size from 4KB to 8KB, and I also tried to remove Finalize message but neither made any difference.

Is it the expected performance, or I'm doing something wrong, and the streaming API is not supposed to be used that way?

2025-01-27T18:04:45Z

deepgram-community[bot]
bot Jan 27, 2025

Thanks for asking your question. Please be sure to reply with as much detail as possible so the community can assist you efficiently.
_{Consider joining our Discord community for more opportunity to engage with your fellow Deepgram users. You can earn points which can be redeemed for cool stuff by being active in our communities!}

0 replies

plebedev · 2025-01-27T18:06:43Z

deepgram-community[bot]
bot Jan 27, 2025

Hey there! It looks like you haven't connected your GitHub account to your Deepgram account. You can do this at https://community.deepgram.com - being verified through this process will allow our team to help you in a much more streamlined fashion.

1 reply

plebedev Jan 27, 2025
Author

Hey there! It looks like you haven't connected your GitHub account to your Deepgram account. You can do this at https://community.deepgram.com - being verified through this process will allow our team to help you in a much more streamlined fashion.

I'm constantly getting

Failed to log in
Please try again.

error. I also have an account with by work e-mail (not connected to my current github profile).

plebedev · 2025-01-27T20:16:57Z

plebedev
Jan 27, 2025
Author

One more thing. I'm doing a load test, and the docs say up to 100 concurrent connections to pre-recorded API. With 50 threads I get 100% success rate. However, once I go above 50 threads, I start getting 429 even though I'm not making more than 100 concurrent requests. How is the rate limit calculated?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deepgram

Streaming API is very slow, is it a bug or a user error? #1066

{{title}}

Replies: 3 comments 1 reply

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Deepgram

Streaming API is very slow, is it a bug or a user error? #1066

plebedev Jan 27, 2025

Replies: 3 comments · 1 reply

deepgram-community[bot] bot Jan 27, 2025

deepgram-community[bot] bot Jan 27, 2025

plebedev Jan 27, 2025 Author

plebedev Jan 27, 2025 Author

plebedev
Jan 27, 2025

Replies: 3 comments 1 reply

deepgram-community[bot]
bot Jan 27, 2025

deepgram-community[bot]
bot Jan 27, 2025

plebedev Jan 27, 2025
Author

plebedev
Jan 27, 2025
Author