
Conversation

@SoumyaRaikwar
Contributor

Description

This PR implements automatic request splitting and retry logic for the elasticsearchexporter. When the Elasticsearch server responds with an HTTP 413 (Payload Too Large) error, the exporter now intercepts the response, splits the large bulk request body (NDJSON) into two smaller chunks, and retries them sequentially. This prevents data loss for batches that exceed http.max_content_length.
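
For illustration, here is a minimal sketch of the split-on-413 flow described above. It is not the PR's code: the package name, the performWithSplit helper, and the send callback are assumptions, and it deliberately ignores gzip re-compression and the need to keep _bulk action/document line pairs together.

package sketch

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// performWithSplit sends the NDJSON body; on a 413 it splits the body roughly
// in half at a line boundary and retries each half sequentially. The base
// case is a body with fewer than two lines, which cannot be split further.
func performWithSplit(send func(body []byte) (*http.Response, error), body []byte) error {
	resp, err := send(body)
	if err != nil {
		return err
	}
	_, _ = io.Copy(io.Discard, resp.Body)
	_ = resp.Body.Close()
	if resp.StatusCode != http.StatusRequestEntityTooLarge {
		return nil // success and non-413 failures are left to the caller
	}

	lines := bytes.Split(body, []byte("\n"))
	if len(lines) < 2 {
		return fmt.Errorf("single document exceeds the server's size limit")
	}
	mid := len(lines) / 2
	first := append(bytes.Join(lines[:mid], []byte("\n")), '\n') // _bulk bodies must end with a newline
	second := bytes.Join(lines[mid:], []byte("\n"))
	if err := performWithSplit(send, first); err != nil {
		return err
	}
	return performWithSplit(send, second)
}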

Link to tracking issue

Fixes #45834

Testing

  • Added a new unit test TestEsClient_Perform_413_Splitting in exporter/elasticsearchexporter/esclient_test.go.
  • The test mocks an Elasticsearch backend that returns 413 Request Entity Too Large for the initial request (a minimal sketch of such a mock follows this list).
  • Verified that the client correctly splits the payload, retries the chunks, and ultimately succeeds with 200 OK.
  • Ran make lint and make test locally to ensure no regressions.
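
For illustration, a minimal sketch of the shape of that mock backend, under assumed names (the real test exercises the exporter's client rather than plain http.Post calls):

package sketch

import (
	"net/http"
	"net/http/httptest"
	"strings"
	"sync/atomic"
	"testing"
)

// TestMockBulkBackend413 shows the mock's behavior: the first _bulk request is
// rejected with 413, and subsequent, smaller requests succeed with 200.
func TestMockBulkBackend413(t *testing.T) {
	var calls atomic.Int64
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if calls.Add(1) == 1 {
			http.Error(w, "Request Entity Too Large", http.StatusRequestEntityTooLarge)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		_, _ = w.Write([]byte(`{"errors":false,"items":[]}`))
	}))
	defer srv.Close()

	post := func(body string) int {
		resp, err := http.Post(srv.URL+"/_bulk", "application/x-ndjson", strings.NewReader(body))
		if err != nil {
			t.Fatal(err)
		}
		defer resp.Body.Close()
		return resp.StatusCode
	}

	if got := post("{}\n{}\n"); got != http.StatusRequestEntityTooLarge {
		t.Fatalf("expected 413 on the first request, got %d", got)
	}
	if got := post("{}\n"); got != http.StatusOK {
		t.Fatalf("expected 200 on the retried chunk, got %d", got)
	}
}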

Documentation

  • Added a changelog entry in .chloggen/fix-elasticsearch-413.yaml.

@mauri870
Member

mauri870 commented Feb 11, 2026

Thank you for working on this! I have a couple questions:

  • What happens with the metrics? Do they work out of the box with this approach, or would we need to aggregate the metrics from each split batch?
  • What if one half of the split request fails while the other succeeds? We don’t seem to have retries here beyond the transport-level ones. How would we handle document-level errors from one of the _bulk requests?
  • If we split the batch and it’s still too large, should we split it again? If so, how many times should we retry splitting before giving up?

@SoumyaRaikwar
Contributor Author

@mauri870

  • The metrics should work out of the box. We aggregate the JSON bodies from the split bulkResponses into a single bulkResponse structure before returning. The upstream caller parses this aggregated response, so it correctly counts the total number of successful and failed items (a sketch of this aggregation follows this list).
  • If one chunk succeeds and the other fails with a transport error, we return an error for the whole operation. This triggers the Collector's standard retry mechanism for the entire batch. This ensures no data is lost ("at-least-once" delivery).
  • The recursion is naturally limited. We explicitly stop splitting when a chunk contains fewer than 2 lines. Since the batch size is halved at each step, we reach this base case very quickly, preventing infinite recursion.
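
For illustration, a minimal sketch of the response aggregation mentioned in the first bullet; bulkResult is a hypothetical, trimmed-down stand-in for the exporter's bulkResponse structure:

package sketch

import "encoding/json"

// bulkResult is a trimmed-down view of an Elasticsearch _bulk response body,
// used only to illustrate the aggregation; the real type has more fields.
type bulkResult struct {
	Errors bool              `json:"errors"`
	Items  []json.RawMessage `json:"items"`
}

// mergeBulkResults folds the responses of the two split requests into one, so
// the upstream caller still sees a single response covering every document.
func mergeBulkResults(a, b bulkResult) bulkResult {
	return bulkResult{
		Errors: a.Errors || b.Errors,
		Items:  append(append([]json.RawMessage{}, a.Items...), b.Items...),
	}
}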

Member

@mauri870 mauri870 left a comment

Code-wise, LGTM. I also tested it locally with some custom test cases I had for this, and it seems to work as intended. Unfortunately, I don’t have a deep understanding of the internals, so I'll wait for feedback from the code owners.

Contributor

@carsonip carsonip left a comment

thanks, a few questions


import (
	"bytes"
	"compress/gzip"

q: is the dep swap intentional?

if gzipErr != nil {
	return nil, gzipErr
}
defer gr.Close()

q: can we close gr sooner rather than leaving it open while recursive calls are made?
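
One possible way to address this, sketched with assumed names rather than taken from the PR: read the decompressed body fully, then close the reader immediately instead of deferring the Close across the recursive calls.

package sketch

import (
	"bytes"
	"compress/gzip"
	"io"
)

// decompressBody reads the gzipped request body fully and closes the reader
// right away, so it is not held open across any later recursive retry calls.
func decompressBody(bodyBytes []byte) ([]byte, error) {
	gr, err := gzip.NewReader(bytes.NewReader(bodyBytes))
	if err != nil {
		return nil, err
	}
	content, readErr := io.ReadAll(gr)
	closeErr := gr.Close() // closed before the caller recurses with content
	if readErr != nil {
		return nil, readErr
	}
	if closeErr != nil {
		return nil, closeErr
	}
	return content, nil
}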

	content = bodyBytes
}

lines := bytes.Split(content, []byte("\n"))

The memory footprint of bytes.Split and bytes.Join can be high. Can we walk the bytes and find the index of the middle \n, maybe using a slow-and-fast-pointer algorithm, then split the payload by slicing the byte slice?
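
For illustration, a sketch of that suggestion under assumed names; a real implementation would also need to keep _bulk action/document line pairs together rather than splitting at an arbitrary newline:

package sketch

import "bytes"

// splitAtMiddleNewline avoids bytes.Split and bytes.Join by walking the
// payload once with a slow and a fast cursor over newline positions, then
// splitting by re-slicing the original buffer. The fast cursor consumes two
// newlines for every one the slow cursor consumes, so the slow cursor ends
// on the middle newline.
func splitAtMiddleNewline(content []byte) (first, second []byte, ok bool) {
	slowIdx := -1 // index of the newline the slow cursor points at
	slowOff, fastOff := 0, 0
	for {
		f := bytes.IndexByte(content[fastOff:], '\n')
		if f < 0 {
			break
		}
		fastOff += f + 1

		s := bytes.IndexByte(content[slowOff:], '\n') // never -1: slow trails fast
		slowIdx = slowOff + s
		slowOff = slowIdx + 1

		if f = bytes.IndexByte(content[fastOff:], '\n'); f < 0 {
			break
		}
		fastOff += f + 1
	}
	if slowIdx < 0 || slowIdx == len(content)-1 {
		// No newline, or only a trailing one: nothing left to split.
		return nil, nil, false
	}
	return content[:slowIdx], content[slowIdx+1:], true
}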

@lahsivjar
Member

Just gave a brief look at the PR, and I don't think this is the way we should fix it. There are already conflicts between how the ES exporter does things and how exporterhelper does things (take retries, for example). A better way to fix this is to implement the exporterhelper.Request interface for the bulk indexer documents and rely on batch settings to configure the batch size so that 413 doesn't happen. I believe we don't need anything to handle 413s explicitly, since the exporter can be configured for the ES it is targeting (by configuring the batch sizes).
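
For reference, a hypothetical configuration sketch of the batch-size approach; the batcher key names and the 5000-item value are assumptions that vary across collector versions, so the exporter's README is authoritative for the exact settings:

exporters:
  elasticsearch:
    endpoint: https://elasticsearch.example.com:9200
    # Keep each bulk request well below the target cluster's
    # http.max_content_length (100mb by default) so 413 never occurs.
    batcher:
      enabled: true
      max_size_items: 5000  # illustrative value; tune to typical document sizes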

Contributor

@carsonip carsonip left a comment

I was talking to @lahsivjar about the HTTP 413 issue and it looks more like a misconfiguration (a user issue) in the batching config. Let's discuss the problem further in #45834 before jumping to a fix like this, which may be a footgun if we're splitting requests after merging them in a batcher.

@SoumyaRaikwar
Contributor Author

SoumyaRaikwar commented Feb 11, 2026

@lahsivjar @carsonip
Sorry about that. I agree that the recursive splitting approach in this PR does not follow the standard OpenTelemetry Collector architectural patterns.
I am going to study the exporterhelper.Request interface and the batch-size configuration approach as suggested, and I will discuss it in issue #45834 before proceeding with any further implementation. Thanks.
