spikes in requests/sec reporting under high number of requests #4

Open
Cryptophobia opened this issue Mar 13, 2018 · 16 comments


From @felipejfc on June 6, 2017 21:41

[screenshots: requests/sec and messages/sec graphs showing periodic spikes under high request volume]

Deleting all deis-fluentd pods solves the problem for like 10 minutes or so...

My guess is that this is related to the number of metrics/sec being sent from each host to nsq; it's around 2000 metrics/sec.

Or maybe the number of logs fluentd is parsing?

Copied from original issue: deis/fluentd#99


From @jchauncey on June 6, 2017 21:44

Yeah, so we have to parse the nginx logs to determine the requests/sec. We parse the data in a custom fluentd plugin and push that data into nsq, where it's picked up by telegraf and pushed into influx. So it makes sense that those values are pretty close to the same.
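
For illustration, here is a rough sketch of that path in Go (the real plugin is Ruby inside fluentd, and the nsqd address, topic name, and line format below are assumptions): parse router log lines, derive a request count, and publish it to nsq for telegraf to pick up and push into influx.

```go
// Rough sketch only: the real plugin is a custom fluentd (Ruby) output.
// It reads nginx/router log lines, derives request metrics, and publishes
// them to nsq; telegraf then consumes them and writes to influx.
// The nsqd address, topic name, and line-protocol body are assumptions.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
	"time"

	nsq "github.com/nsqio/go-nsq"
)

func main() {
	producer, err := nsq.NewProducer("deis-nsqd:4150", nsq.NewConfig()) // assumed address
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Stop()

	scanner := bufio.NewScanner(os.Stdin) // stand-in for the tailed router log
	count := 0
	last := time.Now()

	for scanner.Scan() {
		if strings.Contains(scanner.Text(), "HTTP/") { // crude request-line check
			count++
		}
		if time.Since(last) >= time.Second {
			// One requests/sec sample per interval, in influx line protocol.
			body := fmt.Sprintf("router.requests value=%d %d", count, time.Now().UnixNano())
			if err := producer.Publish("metrics", []byte(body)); err != nil { // assumed topic
				log.Println("publish failed:", err)
			}
			count, last = 0, time.Now()
		}
	}
}
```

The sketch batches one sample per second; the actual plugin appears to push a metric per parsed request (hence the ~2000 metrics/sec mentioned above), which is why the nsq message rate roughly tracks the request rate.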


From @felipejfc on June 6, 2017 21:57

I don't know if we are on the same page here. What I'm saying is that this is wrong: we don't have this kind of spike in our normal traffic pattern. To illustrate, this is how it looks after restarting all deis-fluentd pods:
[screenshots: the same requests/sec and messages/sec graphs after restarting all deis-fluentd pods, showing no spikes]

This is normal (no spikes).


From @jchauncey on June 6, 2017 22:51

So you're saying that the spikes on the messages/second graph aren't normal?


From @felipejfc on June 7, 2017 0:32

nope, they aren't normal :/


From @felipejfc on June 7, 2017 14:41

@jchauncey looking at the nsq logs, it seems that the peaks are related to these events:

[nsqd] 2017/06/07 14:40:25.428914 PROTOCOL(V2): [100.68.51.128:57864] exiting ioloop
[nsqd] 2017/06/07 14:40:25.428968 ERROR: client(100.68.51.128:57864) - failed to read command - read tcp 100.101.163.126:4150->100.68.51.128:57864: i/o timeout
[nsqd] 2017/06/07 14:40:25.429001 PROTOCOL(V2): [100.68.51.128:57864] exiting messagePump
[nsqd] 2017/06/07 14:40:25.430171 TCP: new client(100.68.51.128:55086)
[nsqd] 2017/06/07 14:40:25.430185 CLIENT(100.68.51.128:55086): desired protocol magic '  V2'
[nsqd] 2017/06/07 14:40:25.430797 [100.68.51.128:55086] IDENTIFY: {ShortID: LongID: ClientID:deis-logger-fluentd-w5glj Hostname:deis-logger-fluentd-w5glj HeartbeatInterval:30000 OutputBufferSize:0 OutputBufferTimeout:250 FeatureNegotiation:true TLSv1:false Deflate:false DeflateLevel:0 Snappy:false SampleRate:0 UserAgent:nsq-ruby/2.1.0 MsgTimeout:60000}
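
For context, the values in that IDENTIFY line are the usual nsq client knobs. A rough sketch of what they correspond to, written with go-nsq purely for illustration (the actual client here is nsq-ruby 2.1.0 inside the fluentd plugin):

```go
// Illustrative only: the fluentd plugin uses nsq-ruby, not Go. This shows
// what the negotiated IDENTIFY values above correspond to in a client config.
package main

import (
	"log"
	"time"

	nsq "github.com/nsqio/go-nsq"
)

func newNSQConfig() *nsq.Config {
	cfg := nsq.NewConfig()
	cfg.HeartbeatInterval = 30 * time.Second         // HeartbeatInterval:30000 (ms)
	cfg.OutputBufferTimeout = 250 * time.Millisecond // OutputBufferTimeout:250 (ms)
	cfg.MsgTimeout = 60 * time.Second                // MsgTimeout:60000 (ms)
	// If a client stops responding to heartbeats for roughly two intervals,
	// nsqd logs "failed to read command ... i/o timeout" and closes the
	// connection, as in the log lines above.
	return cfg
}

func main() {
	cfg := newNSQConfig()
	log.Printf("heartbeat=%s msgTimeout=%s", cfg.HeartbeatInterval, cfg.MsgTimeout)
}
```

In other words, if the Ruby client blocks long enough to miss heartbeats, the i/o-timeout-and-reconnect cycle above would be the expected nsqd behavior.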


From @jchauncey on June 7, 2017 14:50

Are you seeing that set of messages occur frequently in your logs? Or does it just happen once and then you see the spike?


From @felipejfc on June 7, 2017 14:52

It happens frequently, and then I have frequent spikes in the graphs. It seems that a spike occurs right after a message of this kind, which makes sense: I guess the fluentd pod loses its connection to nsq, and when it re-establishes it, it sends the messages that were queued up...


From @jchauncey on June 7, 2017 14:54

That sounds like a pretty good conclusion. Is this worrying to you? Is it causing any issues with the system itself?


From @felipejfc on June 7, 2017 14:56

Not with the system, but the "Requests Per Second" graph is pretty important for us to monitor Deis; we have a television in our room with just that graph on it... We also have alarms set up on the number of requests to detect anomalies, so this kind of spike messes up all our monitoring :/


From @jchauncey on June 7, 2017 14:59

Ah ok I understand. I wonder if there is something we could do in grafana to smooth out the graph?
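
One possibility, purely as a sketch (untested, and the database/measurement/field names below are assumptions, not the actual dashboard query), would be to smooth at query time with InfluxQL's moving_average() in the panel query:

```go
// Hypothetical sketch: smooth the requests/sec panel at query time with
// InfluxQL's moving_average(). Database, measurement, and field names
// are assumptions, not the actual Deis dashboard query.
package main

import (
	"log"

	client "github.com/influxdata/influxdb/client/v2"
)

func main() {
	c, err := client.NewHTTPClient(client.HTTPConfig{Addr: "http://localhost:8086"}) // assumed address
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// A 5-point moving average over 10s buckets for the last hour.
	q := client.NewQuery(
		`SELECT moving_average(mean("value"), 5) FROM "router.requests" WHERE time > now() - 1h GROUP BY time(10s)`,
		"kubernetes", "s")
	resp, err := c.Query(q)
	if err != nil {
		log.Fatal(err)
	}
	if resp.Error() != nil {
		log.Fatal(resp.Error())
	}
	log.Printf("%+v", resp.Results)
}
```

That would only hide the spikes in the panel, though; it wouldn't fix alerting on the raw request counts.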


From @felipejfc on June 7, 2017 15:07

I guess we could, but I'd rather work on the root of the problem, like figuring out why the fluentd pods are losing their connection to nsq, and then solve it...
Maybe we need to scale nsq?
Maybe the problem is with the fluentd output plugin? The problem seems to keep occurring in the same pods, over and over. After I deleted the 3 pods (out of 35) that were losing their connection, the problem vanished for a while and then came back; the interesting part is that the new faulty fluentd pod was scheduled on the same node as the previous faulty pod.


From @jchauncey on June 7, 2017 15:34

Alright, I'm going to take a look and see if I can reproduce this on my cluster and maybe get you a test build to try out.


From @felipejfc on June 7, 2017 16:46

ok! thanks!


From @jchauncey on June 7, 2017 19:01

So one thing we don't do right now is assign timestamps to the metrics when we create them (which happens in fluentd). This is because I could never get influxdb to accept the right precision for the values coming off of nsq, so I decided to let it add that value at write time. This means that if you sent 1000 extra messages to influx that actually happened at an earlier time, they would still get a timestamp of time.Now(). I'm guessing that if I could figure out how to get influx to accept the timestamp I want, it might help here. So I'm going to try and get that working.
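
As a rough sketch of what that could look like with the Go influxdb client/v2 (database, measurement, and tag names here are assumptions), the idea is to set the batch precision and pass the event time into each point instead of letting the write be stamped with "now":

```go
// Minimal sketch (Go influxdb client/v2; database, measurement, and tag
// names are assumptions): declare the batch precision and pass the event
// time into the point, rather than letting the server stamp the write.
package main

import (
	"log"
	"time"

	client "github.com/influxdata/influxdb/client/v2"
)

func writeWithEventTime(c client.Client, eventTime time.Time) error {
	bp, err := client.NewBatchPoints(client.BatchPointsConfig{
		Database:  "kubernetes", // assumed database name
		Precision: "ns",         // must match the resolution of eventTime
	})
	if err != nil {
		return err
	}
	pt, err := client.NewPoint(
		"router.requests",                   // assumed measurement name
		map[string]string{"host": "node-1"}, // tags
		map[string]interface{}{"value": 1},
		eventTime, // the time fluentd parsed from the nginx log line, not time.Now()
	)
	if err != nil {
		return err
	}
	bp.AddPoint(pt)
	return c.Write(bp)
}

func main() {
	c, err := client.NewHTTPClient(client.HTTPConfig{Addr: "http://localhost:8086"}) // assumed address
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()
	if err := writeWithEventTime(c, time.Now().Add(-30*time.Second)); err != nil {
		log.Fatal(err)
	}
}
```

With the precision declared on the batch, influx interprets the supplied timestamps correctly, so late-arriving queued messages would land at their original event time instead of piling up at flush time.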


From @felipejfc on June 7, 2017 19:54

It makes total sense to send the timestamp with each metric, but I'd still like to figure out why some of the fluentd pods are reaching this state (timing out over and over again).


From @jchauncey on June 7, 2017 19:55

The pod network isn't super reliable in my experience (but that doesn't account for much here). That's why we backed all this with a queue, to help with those types of problems.
