
fluent-bit deadlocking when running into Mem_Buf_Limit #3150

Closed
simonvanbernem opened this issue Mar 3, 2021 · 2 comments
@simonvanbernem

I'm running into an issue where fluent-bit deadlocks and stops processing logs when reading 2 GB of log files via the tail input and sending them to Elasticsearch. The memory footprint steadily grows towards the Mem_Buf_Limit (which is set to 1 GB). It then hovers around the limit for maybe a minute while still shipping logs, with tail constantly being paused and resumed due to hitting Mem_Buf_Limit, until it suddenly stops altogether.

The CPU utilization drops to 0, while the memory footprint stays at 1 GB.

I did some debugging that gave me some insights, but not enough to find the issue:

  • The majority of allocations at the time of the deadlock come from the tail input module
  • There are no retries in the es output module
  • cb_es_flush simply stops being called
  • The program seems to just wait on an event; there is no infinite loop or anything like that going on

So I suspect that either the file contents read by tail are not properly freed after the elasticsearch output is done sending them, or that there is some kind of bug where the input read by tail can't be handed over to the elasticsearch output because Mem_Buf_Limit has left no memory available.
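To make the suspicion concrete, here is a minimal single-threaded model of the pause/resume accounting I think is involved. All names here (TailModel, ingest, flush) are made up for illustration and are not fluent-bit's actual internals:

```python
# Toy model of Mem_Buf_Limit-style backpressure (illustrative only, not
# fluent-bit's real code): the input pauses at the limit and can only
# resume after a successful flush releases memory.

MEM_BUF_LIMIT = 1_000  # stand-in for the configured 1g

class TailModel:
    def __init__(self):
        self.buffered = 0     # bytes currently held in memory
        self.paused = False
        self.events = []      # pause/resume log, like the warn/info lines

    def ingest(self, nbytes):
        """Input side: refuse data while paused, pause on crossing the limit."""
        if self.paused:
            return False
        self.buffered += nbytes
        if self.buffered >= MEM_BUF_LIMIT:
            self.paused = True
            self.events.append("paused (mem buf overlimit)")
        return True

    def flush(self, nbytes):
        """Output side: releasing memory is the ONLY path back to resume.
        If the flush callback stops being invoked (as observed with
        cb_es_flush), buffered never shrinks and the input stays paused."""
        self.buffered -= nbytes
        if self.paused and self.buffered < MEM_BUF_LIMIT:
            self.paused = False
            self.events.append("resume (mem buf overlimit)")

t = TailModel()
t.ingest(600)
t.ingest(600)     # 1200 >= limit -> input pauses
stuck = t.paused  # stays True until some flush succeeds
t.flush(600)      # 600 < limit -> resume fires
```

In this model, a flush that never runs leaves the input paused forever with the buffer pinned near the limit, which matches the 1 GB footprint and zero CPU I observe.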

I am running on Windows.

This issue seems to be very similar to #3148.

This is my configuration:

[SERVICE]
    flush        1
    daemon       Off
    log_level    info

    http_server  Off
    http_listen  0.0.0.0
    http_port    2020

    storage.metrics on

[INPUT]
    name tail
    Path O:\log_folder\*.log
    Read_From_Head On   
    Mem_Buf_Limit 1g

[OUTPUT]
    name es
    match *
    host 127.0.0.1
    index gb30-fluentd-test-index
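(Side note for anyone experimenting with a workaround: filesystem buffering is often suggested for overlimit situations, since it moves the chunk backlog out of memory. I have not verified whether it avoids this deadlock; the snippet below is only a sketch of what that variant of my config would look like, with a placeholder storage path.)

```ini
[SERVICE]
    flush            1
    storage.path     O:\fluent-bit-storage
    storage.metrics  on

[INPUT]
    name             tail
    Path             O:\log_folder\*.log
    Read_From_Head   On
    Mem_Buf_Limit    1g
    storage.type     filesystem
```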

This is the log output at info severity right before the deadlock:

write: No error
[2021/03/03 13:24:38] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:38] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:38] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:38] [ info] [input] tail.0 resume (mem buf overlimit)
write: No error
[2021/03/03 13:24:39] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:39] [ info] [input] tail.0 resume (mem buf overlimit)
write: No error
[2021/03/03 13:24:39] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:39] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:40] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:40] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:40] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:40] [ info] [input] tail.0 resume (mem buf overlimit)
write: No error
[2021/03/03 13:24:40] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:40] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:41] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:41] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:41] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:41] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:41] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:41] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:42] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:42] [ info] [input] tail.0 resume (mem buf overlimit)

(Note the write: No error messages. I couldn't find where this message is printed in the code. I suspect the "No error" part is a bug where the wrong error code is printed on Windows, as mentioned in #3146. So unfortunately I have no idea where this error happens or what it is.)

And this is the log output at debug severity right before the deadlock (from a different run, obviously):

[2021/03/03 14:19:44] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 14:19:44] [debug] [out coro] cb_destroy coro_id=1246
[2021/03/03 14:19:44] [debug] [task] destroy task=00000268C804E500 (task_id=24)
[2021/03/03 14:19:44] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 14:19:44] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2021/03/03 14:19:44] [debug] [output:es:es.0] Elasticsearch response
{"took":300,"errors":false,"items":[{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"SZs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888271,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Sps--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888272,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"S5s--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888273,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"TJs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888274,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"TZs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888276,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Tps--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888279,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"T5s--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888282,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"UJs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888284,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"UZs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888286,"_primary_term":1
,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Ups--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888288,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"U5s--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888290,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"VJs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888292,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"VZs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888294,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Vps--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888295,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"V5s--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888297,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"WJs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888299,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"WZs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888301,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Wps--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888303,"_primary_term":1,"status":201}},{"in
dex":{"_index":
[2021/03/03 14:19:44] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2021/03/03 14:19:44] [debug] [output:es:es.0] Elasticsearch response
{"took":290,"errors":false,"items":[{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"8Zs--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888760,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"8ps--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888766,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"85s--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888769,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"9Js--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888773,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"9Zs--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888776,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"9ps--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888779,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"95s--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888781,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"-Js--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888785,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"-Zs--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888789,"_primary_term":1
,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"-ps--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888792,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"-5s--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888795,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"_Js--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888799,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"_Zs--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888802,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"_ps--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888805,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"_5s--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888808,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"AJs--HcBeDJvG1NAiRbv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888821,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"AZs--HcBeDJvG1NAiRbv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888824,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Aps--HcBeDJvG1NAiRbv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888828,"_primary_term":1,"status":201}},{"in
dex":{"_index":
[2021/03/03 14:19:44] [debug] [http_client] not using http_proxy for header
[2021/03/03 14:19:44] [debug] [http_client] header=POST /_bulk HTTP/1.1
Host: 127.0.0.1:9200
Content-Length: 2726687
Content-Type: application/x-ndjson
User-Agent: Fluent-Bit


write: No error
[2021/03/03 14:19:44] [debug] [out coro] cb_destroy coro_id=1247
[2021/03/03 14:19:44] [debug] [out coro] cb_destroy coro_id=1248
[2021/03/03 14:19:44] [debug] [task] destroy task=00000268C7A24110 (task_id=0)
[2021/03/03 14:19:45] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 14:19:45] [debug] [task] destroy task=00000268C804EA10 (task_id=163)
[2021/03/03 14:19:45] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 14:19:45] [debug] [task] created task=00000268DB98FC70 id=0 OK
[2021/03/03 14:19:45] [debug] [task] created task=00000268DB98ED40 id=24 OK
[2021/03/03 14:19:45] [debug] [task] created task=00000268DB98E200 id=163 OK

For reproduction: I am reading 5.7 million lines of ndjson logs, totalling 2.4 GB across 267 files, into a local Elasticsearch (7.x) in its default configuration. The deadlock happens after processing approximately 5 million lines.
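If it helps anyone reproduce: here is a small generator for a corpus in the same shape (many files of single-line JSON records). The helper name and the counts below are scaled-down placeholders, not my exact 267-file / 2.4 GB dataset:

```python
# Hypothetical helper to generate an ndjson corpus for the tail -> es repro.
# Scale n_files / lines_per_file up to approximate the real dataset.
import json
import os

def write_ndjson_corpus(root, n_files=3, lines_per_file=5):
    os.makedirs(root, exist_ok=True)
    paths = []
    for i in range(n_files):
        path = os.path.join(root, f"app_{i:03d}.log")
        with open(path, "w", encoding="utf-8", newline="\n") as f:
            for n in range(lines_per_file):
                record = {"file": i, "line": n, "msg": "x" * 64}
                f.write(json.dumps(record) + "\n")  # one JSON object per line
        paths.append(path)
    return paths

paths = write_ndjson_corpus("repro_logs")
```

Point the tail input's Path at the generated directory and raise the counts until the total size is well above Mem_Buf_Limit.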

github-actions bot commented Apr 3, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label Apr 3, 2021
github-actions bot commented Apr 8, 2021

This issue was closed because it has been stalled for 5 days with no activity.

github-actions bot closed this as completed Apr 8, 2021