Skip to content

fluent-bit deadlocking when running into Mem_Buf_Limit #3150

Closed
@simonvanbernem

Description

@simonvanbernem

I'm running into an issue, where fluent-bit will deadlock and stop processing logs, when reading 2GB of log-files via tails and sending them to elastic. The memory footprint will steadily grow towards the limit of Mem_Buf_Limit (which is set to 1GB). It will then hover around the limit for maybe a minute while serving logs, while tails is constantly paused and resumed due to hitting Mem_Buf_Limit, until it all of a sudden stops.

The CPU utilization drops to 0, while the memory footprint stays at 1 GB.

I did some debugging that gave me some insights, but not enough to find the issue:

  • The majority of allocations on deadlock comes from the tails input module
  • There are no retries in the elastic output module
  • cb_es_flush simply stops being called
  • The program seems to just wait on an event. There is no infinite-loop or something like this going on.

So I suspect that either the file-contents that were read by tails are not properly cleaned up after the elasticsearch output is done sending them, or that there is some kind of bug where the input read by tails can't be handed over to the elasticsearch output, because there is no memory left because of Mem_Buf_Limit.

I am running on Windows.

This issue seems to be very similar to #3148.

This is my configuration:

[SERVICE]
    flush        1
    daemon       Off
    log_level    info

    http_server  Off
    http_listen  0.0.0.0
    http_port    2020

    storage.metrics on

[INPUT]
    name tail
    Path O:\log_folder\*.log
    Read_From_Head On   
    Mem_Buf_Limit 1g

[OUTPUT]
    name es
    match *
    host 127.0.0.1
    index gb30-fluentd-test-index

This is the log-ouput with info-severity right before it deadlocks:

write: No error
[2021/03/03 13:24:38] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:38] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:38] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:38] [ info] [input] tail.0 resume (mem buf overlimit)
write: No error
[2021/03/03 13:24:39] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:39] [ info] [input] tail.0 resume (mem buf overlimit)
write: No error
[2021/03/03 13:24:39] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:39] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:40] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:40] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:40] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:40] [ info] [input] tail.0 resume (mem buf overlimit)
write: No error
[2021/03/03 13:24:40] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:40] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:41] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:41] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:41] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:41] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:41] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:41] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:42] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:42] [ info] [input] tail.0 resume (mem buf overlimit)

(Note the write: No error messages. I couldn't find where this message is printed in the code. I suspect the No error part to be a bug, where the wrong error code is printed on windows, like mentioned in #3146. So unfortunately, I have no idea where this error happens, or what it is.)

And this is the log-ouput with debug-severity right before it deadlocks (from a different run obviously):

[2021/03/03 14:19:44] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 14:19:44] [debug] [out coro] cb_destroy coro_id=1246
[2021/03/03 14:19:44] [debug] [task] destroy task=00000268C804E500 (task_id=24)
[2021/03/03 14:19:44] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 14:19:44] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2021/03/03 14:19:44] [debug] [output:es:es.0] Elasticsearch response
{"took":300,"errors":false,"items":[{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"SZs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888271,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Sps--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888272,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"S5s--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888273,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"TJs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888274,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"TZs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888276,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Tps--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888279,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"T5s--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888282,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"UJs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888284,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"UZs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888286,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Ups--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888288,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"U5s--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888290,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"VJs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888292,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"VZs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888294,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Vps--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888295,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"V5s--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888297,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"WJs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888299,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"WZs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888301,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Wps--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888303,"_primary_term":1,"status":201}},{"index":{"_index":
[2021/03/03 14:19:44] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2021/03/03 14:19:44] [debug] [output:es:es.0] Elasticsearch response
{"took":290,"errors":false,"items":[{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"8Zs--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888760,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"8ps--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888766,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"85s--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888769,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"9Js--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888773,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"9Zs--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888776,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"9ps--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888779,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"95s--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888781,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"-Js--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888785,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"-Zs--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888789,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"-ps--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888792,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"-5s--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888795,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"_Js--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888799,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"_Zs--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888802,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"_ps--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888805,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"_5s--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888808,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"AJs--HcBeDJvG1NAiRbv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888821,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"AZs--HcBeDJvG1NAiRbv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888824,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Aps--HcBeDJvG1NAiRbv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888828,"_primary_term":1,"status":201}},{"index":{"_index":
[2021/03/03 14:19:44] [debug] [http_client] not using http_proxy for header
[2021/03/03 14:19:44] [debug] [http_client] header=POST /_bulk HTTP/1.1
Host: 127.0.0.1:9200
Content-Length: 2726687
Content-Type: application/x-ndjson
User-Agent: Fluent-Bit


write: No error
[2021/03/03 14:19:44] [debug] [out coro] cb_destroy coro_id=1247
[2021/03/03 14:19:44] [debug] [out coro] cb_destroy coro_id=1248
[2021/03/03 14:19:44] [debug] [task] destroy task=00000268C7A24110 (task_id=0)
[2021/03/03 14:19:45] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 14:19:45] [debug] [task] destroy task=00000268C804EA10 (task_id=163)
[2021/03/03 14:19:45] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 14:19:45] [debug] [task] created task=00000268DB98FC70 id=0 OK
[2021/03/03 14:19:45] [debug] [task] created task=00000268DB98ED40 id=24 OK
[2021/03/03 14:19:45] [debug] [task] created task=00000268DB98E200 id=163 OK

For reproduction: I am reading 5.7 million lines of ndjson-logs for a total of 2.4GB from 267 files into a local elasticsearch (7.x) in default configuration. The deadlock happens after processing approximately 5 million lines.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions