fluent-bit deadlocking when running into Mem_Buf_Limit

I'm running into an issue, where fluent-bit will deadlock and stop processing logs, when reading 2GB of log-files via tails and sending them to elastic. The memory footprint will steadily grow towards the limit of `Mem_Buf_Limit` (which is set to 1GB). It will then hover around the limit for maybe a minute while serving logs, while tails is constantly paused and resumed due to hitting `Mem_Buf_Limit`, until it all of a sudden stops.

The CPU utilization drops to 0, while the memory footprint stays at 1 GB.

I did some debugging that gave me some insights, but not enough to find the issue:
- The majority of allocations on deadlock comes from the tails input module
- There are no retries in the elastic output module
- `cb_es_flush` simply stops being called
- The program seems to just wait on an event. There is no infinite-loop or something like this going on.

So I suspect that either the file-contents that were read by tails are not properly cleaned up after the elasticsearch output is done sending them, or that there is some kind of bug where the input read by tails can't be handed over to the elasticsearch output, because there is no memory left because of `Mem_Buf_Limit`.

I am running on Windows.

This issue seems to be very similar to https://github.com/fluent/fluent-bit/issues/3148.

This is my configuration:

```
[SERVICE]
    flush        1
    daemon       Off
    log_level    info

    http_server  Off
    http_listen  0.0.0.0
    http_port    2020

    storage.metrics on

[INPUT]
    name tail
    Path O:\log_folder\*.log
    Read_From_Head On   
    Mem_Buf_Limit 1g

[OUTPUT]
    name es
    match *
    host 127.0.0.1
    index gb30-fluentd-test-index
```

This is the log-ouput with info-severity right before it deadlocks:

```
write: No error
[2021/03/03 13:24:38] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:38] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:38] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:38] [ info] [input] tail.0 resume (mem buf overlimit)
write: No error
[2021/03/03 13:24:39] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:39] [ info] [input] tail.0 resume (mem buf overlimit)
write: No error
[2021/03/03 13:24:39] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:39] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:40] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:40] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:40] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:40] [ info] [input] tail.0 resume (mem buf overlimit)
write: No error
[2021/03/03 13:24:40] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:40] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:41] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:41] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:41] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:41] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:41] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:41] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 13:24:42] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 13:24:42] [ info] [input] tail.0 resume (mem buf overlimit)

```

(Note the `write: No error` messages. I couldn't find where this message is printed in the code. I suspect the `No error` part to be a bug, where the wrong error code is printed on windows, like mentioned in https://github.com/fluent/fluent-bit/issues/3146. So unfortunately, I have no idea where this error happens, or what it is.)

And this is the log-ouput with debug-severity right before it deadlocks (from a different run obviously):

```
[2021/03/03 14:19:44] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 14:19:44] [debug] [out coro] cb_destroy coro_id=1246
[2021/03/03 14:19:44] [debug] [task] destroy task=00000268C804E500 (task_id=24)
[2021/03/03 14:19:44] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 14:19:44] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2021/03/03 14:19:44] [debug] [output:es:es.0] Elasticsearch response
{"took":300,"errors":false,"items":[{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"SZs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888271,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Sps--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888272,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"S5s--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888273,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"TJs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888274,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"TZs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888276,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Tps--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888279,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"T5s--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888282,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"UJs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888284,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"UZs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888286,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Ups--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888288,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"U5s--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888290,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"VJs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888292,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"VZs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888294,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Vps--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888295,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"V5s--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888297,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"WJs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888299,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"WZs--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888301,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Wps--HcBeDJvG1NAiQXg","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888303,"_primary_term":1,"status":201}},{"index":{"_index":
[2021/03/03 14:19:44] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2021/03/03 14:19:44] [debug] [output:es:es.0] Elasticsearch response
{"took":290,"errors":false,"items":[{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"8Zs--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888760,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"8ps--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888766,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"85s--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888769,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"9Js--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888773,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"9Zs--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888776,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"9ps--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888779,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"95s--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888781,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"-Js--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888785,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"-Zs--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888789,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"-ps--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888792,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"-5s--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888795,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"_Js--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888799,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"_Zs--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888802,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"_ps--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888805,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"_5s--HcBeDJvG1NAiRXv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888808,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"AJs--HcBeDJvG1NAiRbv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888821,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"AZs--HcBeDJvG1NAiRbv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888824,"_primary_term":1,"status":201}},{"index":{"_index":"gb30-fluentd-test-index","_type":"_doc","_id":"Aps--HcBeDJvG1NAiRbv","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4888828,"_primary_term":1,"status":201}},{"index":{"_index":
[2021/03/03 14:19:44] [debug] [http_client] not using http_proxy for header
[2021/03/03 14:19:44] [debug] [http_client] header=POST /_bulk HTTP/1.1
Host: 127.0.0.1:9200
Content-Length: 2726687
Content-Type: application/x-ndjson
User-Agent: Fluent-Bit


write: No error
[2021/03/03 14:19:44] [debug] [out coro] cb_destroy coro_id=1247
[2021/03/03 14:19:44] [debug] [out coro] cb_destroy coro_id=1248
[2021/03/03 14:19:44] [debug] [task] destroy task=00000268C7A24110 (task_id=0)
[2021/03/03 14:19:45] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/03/03 14:19:45] [debug] [task] destroy task=00000268C804EA10 (task_id=163)
[2021/03/03 14:19:45] [ info] [input] tail.0 resume (mem buf overlimit)
[2021/03/03 14:19:45] [debug] [task] created task=00000268DB98FC70 id=0 OK
[2021/03/03 14:19:45] [debug] [task] created task=00000268DB98ED40 id=24 OK
[2021/03/03 14:19:45] [debug] [task] created task=00000268DB98E200 id=163 OK
```

For reproduction: I am reading 5.7 million lines of ndjson-logs for a total of  2.4GB from 267 files into a local elasticsearch (7.x) in default configuration. The deadlock happens after processing approximately 5 million lines.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fluent-bit deadlocking when running into Mem_Buf_Limit #3150

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

fluent-bit deadlocking when running into Mem_Buf_Limit #3150

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions