How we investigated high container memory with Fluent Bit filesystem buffering #11672
jmtt89 started this conversation in Show and tell
Hi,
I want to share a debugging session from a Kubernetes cluster because the result was surprising and ended up changing our initial hypothesis. It may be useful to others who run Fluent Bit with filesystem buffering and see very high container memory in `kubectl top`.

## TL;DR
We had a Fluent Bit DaemonSet where some pods showed more than `1 GiB` in `kubectl top`, but the Fluent Bit process RSS was small (tens of MiB). After digging into it, we found that:

- Almost all of that memory was kernel/slab, dominated by `dentry` and `xfs_inode` objects.
- The churn driving it came from short-lived chunk files in `/fluent-bit/storage/tail.0`.

In our rollout, this happened because the new collector was still tailing almost the whole node and then dropping most records later in the pipeline with a label-based filter.
So this did not look like a memory leak in Fluent Bit itself. In our case, it was mainly the result of how our current pipeline shape interacted with filesystem buffering.
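A quick way to see this split, before reaching for any tracing tools, is to compare the process RSS with the pod's cgroup accounting. This is a sketch: the cgroup path assumes cgroup v2 mounted at `/sys/fs/cgroup`, and `self` stands in for the real Fluent Bit PID.

```shell
# Process view: heap/RSS of the collector (swap "self" for the real PID).
pid=self
grep VmRSS "/proc/$pid/status"

# Cgroup v2 view of the same pod: total vs slab accounting.
for f in memory.current memory.stat; do
  if [ -r "/sys/fs/cgroup/$f" ]; then
    echo "== $f =="
    # memory.stat has per-category lines; memory.current is a single number.
    grep -E '^slab' "/sys/fs/cgroup/$f" 2>/dev/null || head -n 1 "/sys/fs/cgroup/$f"
  fi
done
```

If `VmRSS` is tiny while `memory.current` is huge and `slab_reclaimable` dominates `memory.stat`, the memory is kernel-side, not in the process heap.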
## Environment

- Container runtime: `containerd`
- Node filesystem: `xfs`
- `tail` input with `storage.type filesystem`

We were comparing two Fluent Bit DaemonSets on the same node: a legacy collector (`v1`) and a new collector (`v2`). Both were tailing the same node log tree, `/var/log/containers/*.log`.

## What looked surprising
On one hot node:
- `v2` collector: about `~1.8Gi`
- `v1` collector: about `~200Mi`

At first glance, this looked like a Fluent Bit memory problem.
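For context, the input shape involved in both collectors is roughly the following. This is a sketch, not our real config: the storage path is inferred from the chunk directory seen later, and the tag is an assumption.

```
[SERVICE]
    storage.path      /fluent-bit/storage

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Tag               kube.*
    storage.type      filesystem
```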
## What we measured

### 1. Process RSS was small
Inside the hot `v2` pod:

- `VmRSS`: about `22Mi`

So the process heap was clearly not where the `~1.8Gi` was going.

### 2. Cgroup memory was almost entirely kernel memory
From the hot `v2` pod cgroup:

- `memory.current`: about `1.98Gi`
- `kernel`: about `1.96Gi`
- `slab_reclaimable`: about `1.959Gi`

On the same node, the legacy collector looked like this:

- `memory.current`: about `222Mi`
- `kernel`: about `183Mi`
- `slab_reclaimable`: about `181Mi`

So both collectors showed the same kind of accounting pattern, but `v2` had roughly `10x` more of it.

### 3. Node-level slab was dominated by filesystem metadata
On hot nodes:
- `dentry`: around `~10M` objects
- `xfs_inode`: hundreds of thousands of objects

On a cool node:

- `dentry`: around `122k`

That pushed us away from a Fluent Bit heap issue and toward filesystem metadata churn.
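For anyone reproducing this, here is a sketch of how the two numbers above can be pulled out. The inline samples are illustrative stand-ins (real reads of `memory.stat` and `/proc/slabinfo` need access on the node, and `/proc/slabinfo` has extra header lines and columns that are simplified away here); the sample values mirror what we saw on the hot pod/node.

```shell
# Pod cgroup (v2): how much of memory.current is reclaimable slab?
# Sample value mirrors the hot pod (~1.959 Gi, expressed in bytes).
cat > /tmp/memory.stat.sample <<'EOF'
slab_reclaimable 2103460233
slab_unreclaimable 9437184
EOF
awk '$1 == "slab_reclaimable" { printf "slab_reclaimable: %.3f Gi\n", $2 / (1024^3) }' \
    /tmp/memory.stat.sample

# Node level: which slab caches hold the most objects?
# Simplified to "name object-count"; sample mirrors a hot node.
cat > /tmp/slabinfo.sample <<'EOF'
dentry 10000000
xfs_inode 430000
kmalloc-64 120000
EOF
sort -k2 -rn /tmp/slabinfo.sample | head -3
```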
### 4. It was not explained by open log files or inotify watchers
On the same hot node, comparing the real PIDs:
- `48` inotify watches
- `48` log file FDs

So the difference was not simply that `v2` was watching more files.

### 5. The key runtime difference was chunk file churn
We traced filesystem syscalls for both processes on the same node.
In a `30s` window:

- `v2`: `1678` ops on `/fluent-bit/storage/tail.0` (`839 openat`, `839 unlink`, `55.9 ops/s`)
- legacy collector: `539` ops on `/fluent-bit/storage/tail.0` (`270 openat`, `269 unlink`, `18.0 ops/s`)

The paths looked like this:
So the excess churn was specifically in chunk file creation and deletion inside the filesystem storage engine.
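The per-second rates above are just the raw counts over the trace window. The exact tracing tool is not shown in this writeup, so treat the collection step as an assumption (something like `strace -f -e trace=openat,unlink -c -p <pid>` would produce comparable counts); the arithmetic itself is:

```shell
# Ops/sec from the 30s trace window (counts from the measurements above).
awk 'BEGIN {
  printf "v2:     %.1f ops/s\n", 1678 / 30   # 839 openat + 839 unlink
  printf "legacy: %.1f ops/s\n",  539 / 30   # 270 openat + 269 unlink
}'
```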
### 6. The `.flb` files themselves looked normal

We inspected chunk files from both collectors:

- Both contained records tagged `kube.var.log.containers.*`
- There was no sign that `v2` was generating a different chunk format

The important difference was not the file format, but the rate: `v2` was creating many more short-lived chunk files.

## Why this is happening in our case
In our current rollout, `v2` still tails almost the whole node, but later drops most records with a `grep` filter based on a Kubernetes label. The current order is effectively:

1. `tail`
2. `kubernetes`
3. `grep` on label

So right now `v2` ingests far more than it keeps. That greatly amplifies chunk churn in `storage/tail.0`.

An important detail here is that this explains our current rollout state, but not necessarily the final steady state once all traffic is moved to `v2`.

## Relevant source code references
These files were useful when mapping behavior to runtime:
- `plugins/in_tail/tail.c`
- `src/flb_input_chunk.c`
- `lib/chunkio/src/cio_file.c`

Two points stood out:

- `input_chunk_append_raw()` applies filters before writing the final chunk content, and if the result is empty the chunk can be destroyed.
- The chunk I/O layer manages the chunk file lifecycle: `mmap`, open/down/up transitions, and deletion on close.
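As a toy illustration of that lifecycle (this is not Fluent Bit code; directory and file naming are made up), the create → write → unlink pattern looks like the loop below. Run at tens of ops per second, this is exactly the pattern that inflates `dentry` slab:

```shell
# Simulate the short-lived chunk pattern: create, write, unlink.
d=$(mktemp -d)
for i in $(seq 1 100); do
  f="$d/$i.flb"
  printf 'payload' > "$f"   # openat + write, like a freshly created chunk
  rm -f "$f"                # unlink, like an emptied chunk being destroyed
done
rmdir "$d"
echo "created and unlinked 100 short-lived files"
```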
That matches the runtime pattern we observed: short-lived `.flb` files, a high `openat`/`unlink` rate, and growing `dentry`/`xfs_inode` slab.

## Practical takeaway
Our main takeaway is that, with `storage.type filesystem`, high container memory in `kubectl top` may be dominated by kernel slab related to filesystem metadata churn, not by Fluent Bit process RSS.

In our case, the biggest amplifier was pipeline shape: the new collector was still tailing almost the whole node and only dropping records later, after chunk creation had already happened. That made the problem look like a Fluent Bit memory issue when it was really a consequence of filesystem buffering plus late filtering.
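To make the shape concrete, the problematic pipeline looks roughly like the sketch below. Names and the label key are assumptions, not our real config; the point is that everything the `grep` filter drops has already been written to a filesystem chunk by the `tail` input:

```
[INPUT]
    Name          tail
    Path          /var/log/containers/*.log
    Tag           kube.*
    storage.type  filesystem

[FILTER]
    Name          kubernetes
    Match         kube.*
    Labels        On

[FILTER]
    Name          grep
    Match         kube.*
    Regex         $kubernetes['labels']['log-collector'] v2
```

One mitigation we plan to test is narrowing what `tail` reads in the first place (e.g. tighter `Path` globs or `Exclude_Path` patterns), so unwanted records are dropped before a chunk is ever created rather than after.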
Our next step is to reduce how much data reaches the storage layer before it gets discarded, and then compare that with the final steady state once `v2` is the only collector on the node.

I'm sharing this in case it helps others debugging similar symptoms, and I'd be interested to hear from maintainers whether this matches the expected behavior of filesystem buffering on XFS/containerd, or whether there are recommended patterns to reduce this kind of chunk churn.