Description
I'm running into an issue, where fluent-bit will deadlock and stop processing logs, when reading 2GB of log-files via tails and sending them to elastic. The memory footprint will steadily grow towards the limit of Mem_Buf_Limit
(which is set to 1GB). It will then hover around the limit for maybe a minute while serving logs, while tails is constantly paused and resumed due to hitting Mem_Buf_Limit
, until it all of a sudden stops.
The CPU utilization drops to 0, while the memory footprint stays at 1 GB.
I did some debugging that gave me some insights, but not enough to find the issue:
- The majority of allocations on deadlock comes from the tails input module
- There are no retries in the elastic output module
cb_es_flush
simply stops being called- The program seems to just wait on an event. There is no infinite-loop or something like this going on.
So I suspect that either the file-contents that were read by tails are not properly cleaned up after the elasticsearch output is done sending them, or that there is some kind of bug where the input read by tails can't be handed over to the elasticsearch output, because there is no memory left because of Mem_Buf_Limit
.
I am running on Windows.
This issue seems to be very similar to #3148.
This is my configuration:
[SERVICE]
flush 1
daemon Off
log_level info
http_server Off
http_listen 0.0.0.0
http_port 2020
storage.metrics on
[INPUT]
name tail
Path O:\log_folder\*.log
Read_From_Head On
Mem_Buf_Limit 1g
[OUTPUT]
name es
match *
host 127.0.0.1
index gb30-fluentd-test-index
This is the log-ouput with info-severity right before it deadlocks:
write: No error
[2021/03/03 13:24:38] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:38] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:38] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:38] [ info] [input] tail.0 resume (mem buf overlimit)
write: No error
[2021/03/03 13:24:39] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:39] [ info] [input] tail.0 resume (mem buf overlimit)
write: No error
[2021/03/03 13:24:39] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:39] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:40] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:40] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:40] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:40] [ info] [input] tail.0 resume (mem buf overlimit)
write: No error
[2021/03/03 13:24:40] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:40] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:41] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:41] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:41] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:41] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:41] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:41] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:42] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:42] [ info] [input] tail.0 resume (mem buf overlimit)
(Note the write: No error
messages. I couldn't find where this message is printed in the code. I suspect the No error
part to be a bug, where the wrong error code is printed on windows, like mentioned in #3146. So unfortunately, I have no idea where this error happens, or what it is.)
And this is the log-ouput with debug-severity right before it deadlocks (from a different run obviously):
[2021/03/03 14:19:44] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 14:19:44] [debug] [out coro] cb_destroy coro_id=1246
[2021/03/03 14:19:44] [debug] [task] destroy task=00000268C804E500 (task_id=24)
[2021/03/03 14:19:44] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 14:19:44] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2021/03/03 14:19:44] [debug] [output:es:es.0] Elasticsearch response
{"took":300,"errors":false,"items":[{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"SZs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888271,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Sps--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888272,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"S5s--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888273,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"TJs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888274,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"TZs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888276,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Tps--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888279,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"T5s--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888282,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"UJs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888284,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"UZs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888286,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Ups--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888288,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"U5s--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888290,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"VJs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888292,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"VZs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888294,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Vps--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888295,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"V5s--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888297,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"WJs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888299,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"WZs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888301,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Wps--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888303,"_primary_term":1,"status":201}},{"index":{"_index":
[2021/03/03 14:19:44] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2021/03/03 14:19:44] [debug] [output:es:es.0] Elasticsearch response
{"took":290,"errors":false,"items":[{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"8Zs--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888760,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"8ps--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888766,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"85s--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888769,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"9Js--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888773,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"9Zs--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888776,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"9ps--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888779,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"95s--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888781,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"-Js--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888785,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"-Zs--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888789,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"-ps--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888792,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"-5s--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888795,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"_Js--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888799,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"_Zs--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888802,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"_ps--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888805,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"_5s--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888808,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"AJs--HcBeDJvG1NAiRbv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888821,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"AZs--HcBeDJvG1NAiRbv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888824,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Aps--HcBeDJvG1NAiRbv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888828,"_primary_term":1,"status":201}},{"index":{"_index":
[2021/03/03 14:19:44] [debug] [http_client] not using http_proxy for header
[2021/03/03 14:19:44] [debug] [http_client] header=POST /_bulk HTTP/1.1
Host: 127.0.0.1:9200
Content-Length: 2726687
Content-Type: application/x-ndjson
User-Agent: Fluent-Bit
write: No error
[2021/03/03 14:19:44] [debug] [out coro] cb_destroy coro_id=1247
[2021/03/03 14:19:44] [debug] [out coro] cb_destroy coro_id=1248
[2021/03/03 14:19:44] [debug] [task] destroy task=00000268C7A24110 (task_id=0)
[2021/03/03 14:19:45] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 14:19:45] [debug] [task] destroy task=00000268C804EA10 (task_id=163)
[2021/03/03 14:19:45] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 14:19:45] [debug] [task] created task=00000268DB98FC70 id=0 OK
[2021/03/03 14:19:45] [debug] [task] created task=00000268DB98ED40 id=24 OK
[2021/03/03 14:19:45] [debug] [task] created task=00000268DB98E200 id=163 OK
For reproduction: I am reading 5.7 million lines of ndjson-logs for a total of 2.4GB from 267 files into a local elasticsearch (7.x) in default configuration. The deadlock happens after processing approximately 5 million lines.