TT-12074 Empty interfaces being passed to pump backends #768

@monrax

Timeout errors can occur when retrieving data from Redis, especially when attempting to retrieve a large number of records:

time="Jan 16 10:50:11" level=error msg="Multi command failed: read tcp [::1]:56727->[::1]:6379: i/o timeout" prefix=redis

When resources become insufficient for larger loads, records can be created faster than they are purged. In that state, timeout errors like the one above are to be expected.

However, once this state is reached, additional unexpected (and very noisy) error logs immediately follow the one above:

time="Jan 16 10:50:11" level=error msg="Couldn't unmarshal analytics data:EOF" analytic_key=tyk-system-analytics prefix=main
time="Jan 16 10:50:11" level=error msg="Couldn't unmarshal analytics data:EOF" analytic_key=tyk-system-analytics prefix=main
time="Jan 16 10:50:11" level=error msg="Couldn't unmarshal analytics data:EOF" analytic_key=tyk-system-analytics prefix=main
(...)

Depending on which pumps are configured, this can result in (also quite noisy) error logs such as:

time="Jan 16 10:50:33" level=error msg="Error decoding analytic record" prefix=resurface-pump
time="Jan 16 10:50:33" level=error msg="Error decoding analytic record" prefix=resurface-pump
(...)

In this case, the resurfaceio backend performs the following type assertion on line 217:

decoded, ok := v.(analytics.AnalyticsRecord)
if !ok {
	rp.log.Error("Error decoding analytic record")
	continue
}

This assertion fails because the interface v does not hold an analytics.AnalyticsRecord value. For this pump the only consequence is the noisy logs mentioned above, but pumps that do not use the safe (comma-ok) form of the type assertion (decoded := v.(analytics.AnalyticsRecord) instead of decoded, ok := v.(analytics.AnalyticsRecord)) could trigger an unhandled runtime panic.
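To illustrate the difference between the two assertion forms, here is a minimal, self-contained sketch. The AnalyticsRecord type below is a hypothetical stand-in for analytics.AnalyticsRecord, and decodeSafe / decodeUnsafe are illustrative helpers, not functions from the pump codebase. The recover in decodeUnsafe is there only so the example can report the panic instead of crashing.

```go
package main

import "fmt"

// AnalyticsRecord is a hypothetical stand-in for analytics.AnalyticsRecord.
type AnalyticsRecord struct {
	APIID string
}

// decodeSafe uses the comma-ok form: a failed assertion sets ok to false
// and never panics, even when v is a nil (empty) interface.
func decodeSafe(v interface{}) (AnalyticsRecord, bool) {
	rec, ok := v.(AnalyticsRecord)
	return rec, ok
}

// decodeUnsafe uses the single-value form, which panics when v does not
// hold an AnalyticsRecord. The deferred recover turns that panic into an
// error purely for demonstration purposes.
func decodeUnsafe(v interface{}) (rec AnalyticsRecord, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	rec = v.(AnalyticsRecord)
	return rec, nil
}

func main() {
	var empty interface{} // models an empty interface reaching a pump

	_, ok := decodeSafe(empty)
	fmt.Println("safe assertion ok:", ok)

	_, err := decodeUnsafe(empty)
	fmt.Println("unsafe assertion error:", err)
}
```

Without the recover, the single-value assertion on an empty interface would terminate the process, which is the risk described above for pumps that skip the ok check.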


By tracing back the origin of these logs, we can see how:

I believe that even though many empty records cause EOF errors at read time, many others do not, and they end up getting passed to the writePumps method as new interface-wrapped decoded values, which causes the type assertion errors.
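One possible mitigation, sketched here under assumptions, would be to drop values that do not hold a record before they are handed to the pumps. This is not necessarily how the project would fix the issue. filterDecoded and the local AnalyticsRecord type are hypothetical names introduced for illustration, and the sketch does not reproduce the real writePumps signature.

```go
package main

import "fmt"

// AnalyticsRecord is a hypothetical stand-in for analytics.AnalyticsRecord.
type AnalyticsRecord struct {
	APIID string
}

// filterDecoded keeps only the values that actually hold an AnalyticsRecord
// and counts everything else (nil interfaces or unexpected types) as dropped.
func filterDecoded(values []interface{}) (kept []interface{}, dropped int) {
	kept = make([]interface{}, 0, len(values))
	for _, v := range values {
		if _, ok := v.(AnalyticsRecord); !ok {
			dropped++
			continue
		}
		kept = append(kept, v)
	}
	return kept, dropped
}

func main() {
	// Two valid records interleaved with two empty interfaces.
	values := []interface{}{
		AnalyticsRecord{APIID: "a"},
		nil,
		AnalyticsRecord{APIID: "b"},
		nil,
	}

	kept, dropped := filterDecoded(values)
	fmt.Printf("kept=%d dropped=%d\n", len(kept), dropped) // prints "kept=2 dropped=2"
}
```

Filtering once, upstream of the pumps, would let a single summary log line replace the per-record errors emitted by each backend, and would protect pumps that use the single-value assertion.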

This issue can be reproduced following the same steps described in PR #731, as the related issue can lead to a state where the number of records builds up faster than they are purged out.
