bug: waku_store_errors_created metrics puts whole errors into labels #3282
This was caught by examining Cortex errors, which were triggered by label lengths exceeding the maximum of 2048 characters.
What the hell is this? You guys are even putting whole queries into metric labels.
PLEASE can we stick to discrete values for labels? This is ridiculous. I'm setting the label length limit to 64 characters, and all of this garbage is getting dropped by Cortex from now on.
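For reference, Cortex exposes a per-tenant validation limit for label value length; a hedged sketch of lowering it to the 64 characters mentioned above (the exact key name is an assumption and may differ across Cortex versions):

```yaml
# Hypothetical Cortex limits fragment; check your Cortex version's
# limits configuration documentation for the exact key names.
limits:
  # Cortex's default maximum label value length is 2048 characters;
  # this drops anything longer once lowered to 64.
  max_label_value_length: 64
```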
@jakubgs - we used these long labels aiming to be able to analyse long-lasting queries. We will limit the label to 128 characters then.
Shouldn't we use Kibana to find long-lasting queries by having a proper log entry, "this query is long", so it's easy to find and debug?
It makes sense indeed and is a valid option for deep debugging. However, with this approach we are looking at the overall query-time statistics, aiming to get a global picture of how a certain query family behaves. We cannot rely on just one single measurement but on hundreds of measurements instead. Kibana is super helpful for knowing the precise query that was slow, but first we need a global/coarse summary of query performance.
This is correct. Yes, logs is the correct place for... logs.
@Ivansete-status wrong. This isn't about length, it's about CARDINALITY. You cannot just put values into labels that can have massive numbers of permutations. It will make querying our metrics storage horribly slow:
https://prometheus.io/docs/practices/naming/#labels
If you still have any labels that do not have discrete values but are just random garbage that you get from errors or queries, then we need to fix that. If you want to measure counts of errors, you need an enum of discrete values.
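A minimal Python sketch of that enum approach (nwaku itself is Nim, so this is not its code; `collections.Counter` stands in for a Prometheus counter with a single `kind` label, and the error categories are made up for illustration):

```python
from collections import Counter

# Closed set of allowed label values, so metric cardinality is bounded.
ERROR_KINDS = ("busy", "timeout", "decode", "other")

# Stand-in for a Prometheus counter with a single "kind" label.
store_errors = Counter()

def classify(error_msg: str) -> str:
    """Map a free-form error message onto one discrete kind."""
    msg = error_msg.lower()
    if "another command is already in progress" in msg:
        return "busy"
    if "timeout" in msg:
        return "timeout"
    if "decode" in msg:
        return "decode"
    return "other"

def record_error(error_msg: str) -> None:
    # The raw message belongs in logs; only the bucketed kind hits metrics.
    store_errors[classify(error_msg)] += 1
```

However long or repetitive the raw error string is, only one of four label values ever reaches the metrics backend.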
The fix in this PR is incorrect.
It seems to me the best way forward is to somehow classify queries into a low-cardinality type and report those via Prometheus, in a way that can be cross-checked with Kibana logs.
Yes, if you have an unbounded list of possible values you need to define discrete buckets into which those values will go.
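One way such buckets could be sketched in Python (the field names are hypothetical; a real implementation would live in nwaku's Nim store code): derive a query "family" from which filters are set, never from their values, so the label set stays small and discrete.

```python
def query_family(query: dict) -> str:
    """Reduce a store query to a low-cardinality family label.

    Records only WHICH filters are present (hypothetical field names),
    never their values, so the number of possible labels is fixed.
    """
    parts = []
    if query.get("content_topics"):
        parts.append("topics")
    if query.get("start_time") is not None or query.get("end_time") is not None:
        parts.append("time")
    if query.get("cursor"):
        parts.append("cursor")
    return "+".join(parts) or "all"
```

A duration metric labelled with this family can then be cross-checked against Kibana, where the full query text lives in the log entry.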
Another example of this madness is
Please fix that one too.
Problem
The `waku_store_errors_created` metric is completely broken: it puts whole errors, somehow concatenated together, into a metric label.
And the metric label just endlessly grows, with `another command is already in progress` repeated over and over.
Impact
This fills our metrics storage with garbage data and makes querying slower for all fleets and all metrics.
Expected behavior
Do not put non-discrete values into labels. Labels need to be limited in cardinality, since Prometheus and Cortex storage is not designed to handle infinite cardinality. If you want to report counts of error types, you need to define a discrete `enum` of possible values and provide only that.
nwaku version/commit hash
Version: v0.34.0
Commit: 09f05fefe2614d640277f96ae96ab4f2f12c8031