Skip to content

Conversation

pzl
Copy link
Member

@pzl pzl commented Sep 9, 2025

What is the problem this PR solves?

Adds a configuration parameter to fleet server for the maximum size of transferred files via actions

How does this PR solve the problem?

Hard coded limit (100MiB previously) is now a config.

Default is also changed to be unlimited. The file transfer feature launched with very conservative limits on request rates and file sizes. File size itself is primarily an impact on Elasticsearch at rest, at does not effect fleet server, aside from the time to transfer the bytes. Requests are still chunked, chunk size is not changing.

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

@pzl pzl added the enhancement New feature or request label Sep 9, 2025
@prodsecmachine
Copy link

prodsecmachine commented Sep 9, 2025

🎉 Snyk checks have passed. No issues have been found so far.

security/snyk check is complete. No issues have been found. (View Details)

license/snyk check is complete. No issues have been found. (View Details)

Copy link
Contributor

mergify bot commented Sep 9, 2025

This pull request does not have a backport label. Could you fix it @pzl? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@pzl pzl marked this pull request as ready for review September 10, 2025 15:10
@pzl pzl requested a review from a team as a code owner September 10, 2025 15:10
@pzl pzl requested review from ycombinator and pchila September 10, 2025 15:10
@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Sep 12, 2025
defaultStatusMax = 50
defaultStatusMaxBody = 0

defaultFileStorageSize = 0 // no limit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we change this to match the previous default, 104857600?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The specific onus for this PR is removing the file size limit. It was initially put there as a precautionary measure during initial development of the feature, but upcoming feature work in the stack will see much larger file sizes in default use cases.

Rather than deleting this control altogether, it seemed sensible to keep it as an optional limit (either as cost-control for the user, or to prevent "oopsies" if they don't intend to use the larger-file features coming). If we'd rather the control just be eliminated as complexity, that seems fine too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pzl has any benchmark been performed to check the behaviour of fleet server when uploading large(r) files? Do we have an idea of the size of files in the upcoming feature ?

Copy link
Member Author

@pzl pzl Sep 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - a modified horde was used to have 8k drones sending >2GiB files (randomized range was 1.6Gib to 16 GiB) through fleet server with disabled file size limit, without issue.

The primary control here that maintains stability (in the face of all that traffic) is the request rate limits (the file ops have their own limit settings), and the architecture of file chunking. An entire 16GiB file is not continuously streamed in one request, but rather individual 4MiB chunks. Those chunk requests have several rate controls on them, as well as limits on how many "upload jobs" may be ongoing (i.e. there will not be 8k agents all trying to upload chunks, fleet server controls how many in-progress file jobs there are). So memory will not be exhausted by too many parallel uploads, nor by buffering large files themselves (the 4MiB chunk data as well, is immediately streamed to elasticsearch with minimal processing, so even that is effectively io.Copyied from network-in to net-out.

Fleet server is then able to handle its "max parallel upload jobs" amount -- essentially continually and indefinitely. A chunk request comes, is copied to elasticsearch, and closes. There is a (still very conservative!) limit on how many chunk requests can be handled at once. So large files is not currently increasing any of that handling in parallel, but only how many chunk calls will happen in sequence until the whole "upload job" has concluded. And then of course, at rest, fleet server does nothing with these files, they just sit as documents in elasticsearch. No load or impact to fleet server there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would users / customers be accustomed to this limiting / preventing certain files from being sent? I'm asking to try to get a handle what would be the "principle of least surprise" for users. Whether that would be "very large files are sent (even if multiple TBs)", or "I need to configure a max if above a threshold", and whether our previous limit of 100 MiB would have set an expectation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feature is only in use in two places:

  • agent diagnostics bundle (very small zips, not going to change here)
  • defend/endpoint file retrieval or distribution.

The defend feature is very much manual user interaction to perform. Users have run into size limits when attempting to inspect larger binaries, and have had to resort to other means to retrieve those files. Defend needs to unblock a lot of those common workflows.

We're not aware of any case of relying on the size limit as a feature, but more cases of users being hampered by it.


size, _ := data.Int64("file", "size")
if size > u.sizeLimit {
if u.sizeLimit != 0 && size > u.sizeLimit {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the u.sizeLimit != 0 clause here is for allowing uploads of any size. So 0 is the sentinel value for signaling to the code that uploads of any size are allowed.

I'm wondering if we should use 0 to signal that uploads are not allowed and -1 (or any other value < 0) to signal that uploads of any size are allowed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disallowing in just this manner would lead to some wacky and hard-to-debug errors in kibana features that rely on the file transfer. Kibana features (especially in Defend) would still appear enabled and allow user actions to take place that would fail when the agent (or endpoint) receives an action and then proceeds to upload.

I do see the arbitrary nature of 0's and negatives as not particularly meaningful to whichever disabled/unlimited mode here. I guess they happen to be flipped here. -1 (or any negative value) will fail to upload, as file size will always be larger than that. Negatives seem to be the disabler here (not by design, just by happenstance). 0 seemed to be a common unset/default/unlimited value elsewhere in these configs (and go) and I went with that by convention. I think the unset/default value of the language probably shouldn't lead to the feature being disabled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we consider making an explicit [something]Enabled: true/false flag (and require the max*Size to unsigned) to differentiate and avoid magic numbers (at least in the user-facing interface)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the suggestion here.

change the config's type to uint64, and then:

// with other the consts
var fileSizeLimitDisabled = 0

...

// when setting defaults
defaultFileSizeLimit = fileSizeLimitDisabled

?

That's good with me, I'm happy with that.


Just to restate the general case here: the end goal is to remove the file size cap, that is the specific need. I'd like to leave it configurable in case some user has strange edge cases I cannot predict and wants to bound it for their own reasons. But it is not essential to keep this a config. We can simplify the PR if we just want to remove the cap. The actual protections in place are the rate limits. If having a sentinel value here at 0 is the blocker, should we go the other way?

Copy link
Member

@pchila pchila Sep 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was about to add a comment about the config type to be changed to uint64, then I saw this comment so I'll try to keep the number of open threads to a minimum.

I don't want to hijack @nkvoll comment and will leave it to him to get back here, I would love to see the uint64 type if negative values are not allowed and a constant that will read something like wrote in your comment.

About the sentinel value/explicit enabled/disabled flag I am not sure: if we have such a flag we could leave the default limit to 100MB as today (still configurable using MaxFileStorageByteSize) and just turn off the limit using the FileStorageLimitEnabled: false (name still open to change 😉 ) but I am not 100% sure that this is what @nkvoll envisioned

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the general-case explanation @pzl, much appreciated!

What I'm thinking is pretty much what @pchila is saying, but maybe with a slight variant to avoid discussing the naming, by having it be a pointer. If the pointer is nil, there is no max.

MaxFileStorageByteSize *int64 `config:"max_file_storage_size"`

If I understand correctly, we want to default to "no maximum configured", and this being a pointer, usually defaults to nil, and is different from the any other legal value, which intuitively would be compared to the file size on disk?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in re: negative numbers being nonsensical values here -- do we want this to be *uint64?

Also what should happen if it is configured and set to 0? Feature disabled?

Copy link

}{
{"MaxSize 0 allows uploads", 0, http.StatusOK, 1000},
{"MaxSize 0 allows large uploads", 0, http.StatusOK, 1024 * 1024 * 1024 * 2},
{"MaxSize 0 does not allow 0-length files", 0, http.StatusBadRequest, 0},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a little odd? If I set MaxSize 1, would I allow 0-length files?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope! This test was just to make sure a size limit of "0" is not interpreted as a literal value, making "0 length files" suddenly succeed, when the limit is 0.

0-length files should always fail, but this makes sure we don't trip on an edge case when the limit is also 0.

# If not present is automatically filled by the tooling finding the PR where this changelog fragment has been added.
# NOTE: the tooling supports backports, so it's able to fill the original PR number instead of the backport PR number.
# Please provide it if you are adding a fragment for a different PR.
#pr: https://github.com/owner/repo/1234
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#pr: https://github.com/owner/repo/1234
pr: https://github.com/elastic/fleet-server/pull/5478


# Issue URL; optional; the GitHub issue related to this changeset (either closes or is part of).
# If not present is automatically filled by the tooling with the issue linked to the PR number.
#issue: https://github.com/owner/repo/1234
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small nit, if there is a parent issue it would be nice to have it here


size, _ := data.Int64("file", "size")
if size > u.sizeLimit {
if u.sizeLimit != 0 && size > u.sizeLimit {
Copy link
Member

@pchila pchila Sep 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was about to add a comment about the config type to be changed to uint64, then I saw this comment so I'll try to keep the number of open threads to a minimum.

I don't want to hijack @nkvoll comment and will leave it to him to get back here, I would love to see the uint64 type if negative values are not allowed and a constant that will read something like wrote in your comment.

About the sentinel value/explicit enabled/disabled flag I am not sure: if we have such a flag we could leave the default limit to 100MB as today (still configurable using MaxFileStorageByteSize) and just turn off the limit using the FileStorageLimitEnabled: false (name still open to change 😉 ) but I am not 100% sure that this is what @nkvoll envisioned

MaxAgents int `config:"max_agents"`
MaxHeaderByteSize int `config:"max_header_byte_size"`
MaxConnections int `config:"max_connections"`
MaxFileStorageByteSize int64 `config:"max_file_storage_size"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, for consistency, if this should have been max_file_storage_byte_size, (even if my preference would have been to have the unit come last, if we include it at all) to align with max_header_byte_size, two lines above, and align with the variable name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming is hard and begs some questions, so please pardon me for asking a question with maybe an obvious answer, but: Why is it not max_file_size (e.g why is storage part of it -- could it be simplified but still describing exactly what it is)? It raises some "storage size vs display size"-connotations to me, which may be completely irrelevant.

Copy link
Member Author

@pzl pzl Sep 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to avoid an overly-broad name in case fleet server ever needs to deal with files in another context (e.g. supporting files on the host its on, cache files, or any other situation involving "files" in fleet). I thought of this feature (file transfer to/from an agent and elasticsearch) as "file storage" to give it a narrower namespace. But it can be whatever, no attachment to the name

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants