Description
Timeout errors can occur when retrieving data from redis, especially when attempting to retrieve a large number of records:
time="Jan 16 10:50:11" level=error msg="Multi command failed: read tcp [::1]:56727->[::1]:6379: i/o timeout" prefix=redis
When resources become insufficient for larger loads, a state where the number of records created increases faster than they are purged out can be reached, so the corresponding timeout errors can be expected.
However, some (very noisy) unexpected additional error logs immediately follow the one above when this state is reached:
time="Jan 16 10:50:11" level=error msg="Couldn't unmarshal analytics data:EOF" analytic_key=tyk-system-analytics prefix=main
time="Jan 16 10:50:11" level=error msg="Couldn't unmarshal analytics data:EOF" analytic_key=tyk-system-analytics prefix=main
time="Jan 16 10:50:11" level=error msg="Couldn't unmarshal analytics data:EOF" analytic_key=tyk-system-analytics prefix=main
(...)
Depending on which pumps are configured, this can result in (also quite noisy) error logs such as:
time="Jan 16 10:50:33" level=error msg="Error decoding analytic record" prefix=resurface-pump
time="Jan 16 10:50:33" level=error msg="Error decoding analytic record" prefix=resurface-pump
(...)
In this case, in the resurfaceio backend the following type assertion is performed on line 217:
```go
decoded, ok := v.(analytics.AnalyticsRecord)
if !ok {
    rp.log.Error("Error decoding analytic record")
    continue
}
```
This assertion fails because the interface value v does not hold an analytics.AnalyticsRecord. Here the failure only produces the noisy logs shown above, but pumps that perform an unchecked type assertion (decoded := v.(analytics.AnalyticsRecord) instead of decoded, ok := v.(analytics.AnalyticsRecord)) would trigger an unhandled runtime panic.
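The difference between the two assertion forms can be shown with a minimal, self-contained sketch (AnalyticsRecord here is a stub standing in for analytics.AnalyticsRecord, and the helper names are hypothetical):

```go
package main

import "fmt"

// AnalyticsRecord is a stub for analytics.AnalyticsRecord.
type AnalyticsRecord struct {
	Path string
}

// safeDecode uses the two-value ("comma ok") form: a type mismatch is
// reported as ok == false and the caller can log and continue.
func safeDecode(v interface{}) (AnalyticsRecord, bool) {
	decoded, ok := v.(AnalyticsRecord)
	return decoded, ok
}

// unsafeDecode uses the single-value form: a type mismatch panics at
// runtime. The recover here only exists to demonstrate the panic.
func unsafeDecode(v interface{}) (rec AnalyticsRecord, panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	rec = v.(AnalyticsRecord)
	return rec, false
}

func main() {
	// An entry that does not wrap an AnalyticsRecord, as happens when
	// empty/corrupt data comes back from Redis.
	var v interface{} = "not a record"

	if _, ok := safeDecode(v); !ok {
		fmt.Println("safe assertion failed: logged and skipped")
	}
	if _, panicked := unsafeDecode(v); panicked {
		fmt.Println("unchecked assertion: unhandled runtime panic")
	}
}
```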
By tracing back the origin of these logs, we can see how:
- the timeout error is logged after attempting to retrieve data from Redis inside the GetAndDeleteSet method
- the set of EOF errors is logged after attempting to unmarshal each record in the retrieved slice
- multiple type assertion errors are logged as shown above (one for each of these empty interfaces)
I believe that even though many empty records cause EOF errors at read time, many others do not, and they end up being passed to the writePumps method as interface-wrapped decoded values, which causes the type assertion errors.
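The suspected flow can be sketched as follows. This is a simplified illustration, not the actual Tyk Pump code: the function name decodeBatch is hypothetical, and encoding/json stands in for the real serialization format, since a json.Decoder on an empty reader also returns io.EOF, matching the logged "EOF":

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// AnalyticsRecord is a stub for analytics.AnalyticsRecord.
type AnalyticsRecord struct {
	Path string
}

// decodeBatch mimics the suspected bug pattern: each raw entry retrieved
// from Redis is decoded, a failure is only logged, and a useless value is
// still forwarded downstream, where pumps later fail the type assertion.
func decodeBatch(raw [][]byte) []interface{} {
	var out []interface{}
	for _, b := range raw {
		var rec AnalyticsRecord
		if err := json.NewDecoder(bytes.NewReader(b)).Decode(&rec); err != nil {
			// Decoding an empty entry yields io.EOF, printed as "EOF".
			fmt.Printf("Couldn't unmarshal analytics data:%v\n", err)
			// Bug pattern: the entry is not dropped; a value that is not
			// an AnalyticsRecord is forwarded anyway.
			out = append(out, nil)
			continue
		}
		out = append(out, rec)
	}
	return out
}

func main() {
	raw := [][]byte{
		[]byte(`{"Path":"/ok"}`), // valid record
		nil,                      // empty entry, as left behind under load
	}
	// Downstream "pump" loop: the safe assertion fails for the bad entry.
	for _, v := range decodeBatch(raw) {
		if decoded, ok := v.(AnalyticsRecord); ok {
			fmt.Println("pump got:", decoded.Path)
		} else {
			fmt.Println("Error decoding analytic record")
		}
	}
}
```

Under this reading, every empty entry produces either an EOF log at decode time or a type assertion log in each configured pump, which would explain why both kinds of noise appear together.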
This issue can be reproduced following the same steps described in PR #731, as the related issue can lead to a state where the number of records builds up faster than they are purged out.