
bug: waku_store_errors_created metric puts whole errors into labels #3282

Open
jakubgs opened this issue Feb 6, 2025 · 11 comments · Fixed by #3283 · May be fixed by #3298
Assignees: Ivansete-status
Labels: bug (Something isn't working), effort/hours (Estimated to be completed in a few hours)

Comments

@jakubgs (Contributor) commented Feb 6, 2025

Problem

The waku_store_errors_created metric is completely broken:

[Image: screenshot of the broken waku_store_errors_created metric]

It puts entire error messages, which somehow get concatenated, into a metric label:

waku_store_errors_created{type="BAD_RESPONSE: archive error: DIRVER_ERROR: failed to run query: error in asyncpool query: error in dbConnQuery calling sendQuery: failed pqsendQuery: another command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\n"} 1738746031.0

And the metric label just grows endlessly, with "another command is already in progress" repeated more and more times.

Impact

This fills our metrics storage with garbage data and makes querying slower for all fleets and all metrics.

Expected behavior

Do not put non-discrete values into labels. Labels need to be limited in cardinality, since Prometheus and Cortex storage is not designed to handle unbounded cardinality. If you want to report counts of error types, define a discrete enum of possible values and only ever report those values.
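For illustration, a minimal sketch (not the actual nwaku code) of what such a discrete classification could look like in Nim. The enum, proc, and counter names here are hypothetical, and the metrics call is shown only as a comment because the exact nim-metrics invocation inside waku_store is not reproduced:

import std/strutils

type StoreErrorKind = enum
  sekBadResponse = "BAD_RESPONSE"
  sekDriverError = "DRIVER_ERROR"
  sekTimeout     = "TIMEOUT"
  sekOther       = "OTHER"

proc classifyStoreError(msg: string): StoreErrorKind =
  ## Collapse an arbitrary error string into one of a few fixed label values,
  ## so the label set stays bounded no matter what the database returns.
  # "DIRVER_ERROR" matches the spelling seen in the reported label values.
  if msg.contains("DIRVER_ERROR") or msg.contains("DRIVER_ERROR"):
    sekDriverError
  elif msg.contains("BAD_RESPONSE"):
    sekBadResponse
  elif msg.toLowerAscii().contains("timeout"):
    sekTimeout
  else:
    sekOther

# Hypothetical usage with a counter that has a single "type" label:
# waku_store_errors.inc(labelValues = [$classifyStoreError(errMsg)])

With this shape the "type" label can only ever take four values, no matter how long or repetitive the underlying error message is.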

nwaku version/commit hash

Version: v0.34.0
Commit: 09f05fefe2614d640277f96ae96ab4f2f12c8031

jakubgs added the bug (Something isn't working) label Feb 6, 2025
@jakubgs (Contributor, Author) commented Feb 6, 2025

This was caught by examining Cortex errors that were triggered by label values exceeding the maximum length of 2048 characters:

ts=2025-02-06T15:45:22.605616198Z caller=push.go:53 level=warn org_id=fake msg="push refused" err="rpc error: code = Code(400) desc = label value too long for metric (actual: 2898, limit: 2048) metric: \"failed_store_queries_created{container=\\\"sonda\\\", datacenter=\\\"ac-cn-hongkong-c\\\", error=\\\"300 BAD_RESPONSE: BAD_RESPONSE: archive error: DIRVER_ERROR: failed to run query: error in asyncpool query: error \" label name: \"error\" label value: \"300 BAD_RESPONSE: BAD_RESPONSE: archive error: DIRVER_ERROR: failed to run query: error in asyncpool query: error in dbConnQuery calling sendQuery: failed pqsendQuery: another command is already in pr\""

Ivansete-status added the effort/hours (Estimated to be completed in a few hours) label Feb 6, 2025
Ivansete-status moved this to To Do in Waku Feb 6, 2025
Ivansete-status moved this from To Do to In Progress in Waku Feb 6, 2025
Ivansete-status self-assigned this Feb 6, 2025
@jakubgs (Contributor, Author) commented Feb 6, 2025

What the hell is this? You guys are even putting whole queries into metric labels:

ts=2025-02-06T22:22:29.714052012Z
caller=push.go:53
level=warn
org_id=fake
msg="push refused"
err="rpc error: code = Code(400) desc = label value too long for metric (actual: 151, limit: 128) metric:
\"query_time_secs_created{container=\\\"nim-waku-store\\\", datacenter=\\\"ac-cn-hongkong-c\\\", fleet=\\\"status.prod\\\", group=\\\",env:status,stage:prod,nim,waku,metrics,\\\", instance=\\\"store-02.ac-cn-hongkong-c.status.pro\" label name: \"query\" label value: \"query_tag_SELECTmessageHashFROMmessagesWHEREcontentTopicIN(?,?)ANDpubsubTopic=?ANDtimestamp>=?ANDtimestamp<=?ORDERBYtimestampDESC,messageHashDESCLIMIT?\""

PLEASE can we stick to discrete values for labels? This is ridiculous.

I'm setting the label value length limit to 64 characters, and all of this garbage is getting dropped by Cortex from now on.
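For reference, a rough sketch of what that limit looks like in the Cortex configuration; the exact key placement varies between Cortex versions, and max_label_value_length otherwise defaults to 2048, which matches the rejections quoted above:

# Per-tenant validation limits; the top-level key may be "limits" or
# "limits_config" depending on the Cortex version.
limits:
  # Refuse samples whose label values exceed 64 characters instead of storing them.
  max_label_value_length: 64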

@Ivansete-status (Collaborator)

@jakubgs - we used these long labels aiming to be able to analyse long-lasting queries. We will limit the label to 128 characters then.

@fryorcraken (Collaborator)

@jakubgs - we used these long labels aiming to be able to analyse long-lasting queries. We will limit the label to 128 characters then.

Shouldn't we use Kibana to find long-lasting queries by having a proper log entry "this query is long" so it's easy to find and debug?

@Ivansete-status (Collaborator)

Shouldn't we use Kibana to find long-lasting queries by having a proper log entry "this query is long" so it's easy to find and debug?

It makes sense indeed and is a valid option for deep debugging. However, with this approach we are looking at the overall query time statistics, aiming to get a global picture of how a certain query family behaves. We cannot rely on just a single measurement but on hundreds of measurements instead.

Kibana is super helpful for knowing the precise query that was slow, but first we need a global/coarse summary of query performance.

github-project-automation bot moved this from In Progress to Done in Waku Feb 7, 2025
@jakubgs (Contributor, Author) commented Feb 8, 2025

Shouldn't we use Kibana to find long-lasting queries by having a proper log entry "this query is long" so it's easy to find and debug?

This is correct. Yes, logs are the correct place for... logs.

we used these long labels aiming to be able to analyse long-lasting queries. We will limit the label to 128 characters then.

@Ivansete-status wrong. This isn't about length, it's about CARDINALITY. You cannot just put values into labels that can have a massive number of permutations. It will make querying our metrics storage horribly slow:

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

https://prometheus.io/docs/practices/naming/#labels

If you still have any labels that do not have discrete values, but are instead random garbage coming from errors or queries, then we need to fix that. If you want to measure counts of errors, you need an enum of possible errors, not just whatever string you might get back from a library as an error.

jakubgs reopened this Feb 8, 2025
@jakubgs (Contributor, Author) commented Feb 8, 2025

The fix in this PR is incorrect.

@fryorcraken (Collaborator)

It makes sense indeed and is a valid option for deep debugging. However, with this approach we are looking at the overall query time statistics, aiming to get a global picture of how a certain query family behaves. We cannot rely on just a single measurement but on hundreds of measurements instead.

Kibana is super helpful for knowing the precise query that was slow, but first we need a global/coarse summary of query performance.

It seems to me the best way forward is to classify each query into a low-cardinality type and report those via Prometheus, in a way that can be cross-checked with Kibana logs.
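One hedged sketch of that idea, with purely illustrative names (this is not nwaku's actual code, and the commented inc/observe calls assume the nim-metrics counter/histogram API):

# Discrete query families: the only values that ever appear in metric labels.
type QueryFamily = enum
  qfSelectMessages = "select_messages"
  qfSelectByHash   = "select_by_hash"
  qfInsertRow      = "insert_row"
  qfPartitionMgmt  = "partition_management"
  qfHealthCheck    = "health_check"

# At each call site the family is known statically, so the label set can never grow:
#
#   query_count.inc(labelValues = [$qfSelectMessages])
#   query_time_secs.observe(elapsed, labelValues = [$qfSelectMessages])
#
# The full SQL text goes to the log (Kibana) together with the same family
# string, so a slow family seen in the metrics can be traced back to the
# exact statements in Kibana.

That keeps per-family counts and timing statistics in Prometheus, while the unbounded detail (the literal query text) lives only in logs.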

@jakubgs (Contributor, Author) commented Feb 11, 2025

Yes, if you have an unbounded list of possible values, you need to define discrete buckets into which those values will go.

@jakubgs (Contributor, Author) commented Feb 11, 2025

We are getting a steadily growing stream of discarded samples:

[Image: graph of discarded samples steadily increasing]

I would appreciate a proper fix.

Ivansete-status moved this from Done to Priority in Waku Feb 12, 2025
@jakubgs (Contributor, Author) commented Feb 13, 2025

Another example of this madness is the query_count metric:

# HELP query_count number of times a query is being performed
# TYPE query_count counter
query_count_total{query="query_tag_SELECTchild.relnameASpartition_nameFROMpg_inheritsJOINpg_classparentONpg_inherits.inhparent=parent.oidJOINpg_classchildONpg_inherits.inhrelid=child.oidJOINpg_namespacenmsp_parentONnmsp_parent.oid=pare"} 388.0
query_count_created{query="query_tag_SELECTchild.relnameASpartition_nameFROMpg_inheritsJOINpg_classparentONpg_inherits.inhparent=parent.oidJOINpg_classchildONpg_inherits.inhrelid=child.oidJOINpg_namespacenmsp_parentONnmsp_parent.oid=pare"} 1739207995.0
query_count_total{query="query_tag_SELECTmessageHash,pubsubTopic,contentTopic,payload,version,timestamp,metaFROMmessagesWHEREcontentTopicIN(?)ANDpubsubTopic=?ANDtimestamp>=?ANDtimestamp<=?ORDERBYtimestampDESC,messageHashDESCLIMIT?"} 5.0
query_count_created{query="query_tag_SELECTmessageHash,pubsubTopic,contentTopic,payload,version,timestamp,metaFROMmessagesWHEREcontentTopicIN(?)ANDpubsubTopic=?ANDtimestamp>=?ANDtimestamp<=?ORDERBYtimestampDESC,messageHashDESCLIMIT?"} 1739368077.0
query_count_total{query="query_tag_SELECTmessageHash,pubsubTopic,contentTopic,payload,version,timestamp,metaFROMmessagesWHERE(timestamp,messageHash)>(?,?)ANDcontentTopicIN(?)ANDpubsubTopic=?ANDtimestamp>=?ANDtimestamp<=?ORDERBYtimestam"} 15.0
query_count_created{query="query_tag_SELECTmessageHash,pubsubTopic,contentTopic,payload,version,timestamp,metaFROMmessagesWHERE(timestamp,messageHash)>(?,?)ANDcontentTopicIN(?)ANDpubsubTopic=?ANDtimestamp>=?ANDtimestamp<=?ORDERBYtimestam"} 1739374326.0
query_count_total{query="query_tag_WITHmin_timestampAS(SELECTMIN(timestamp)ASmin_tsFROMmessages_lookupWHEREmessagehashIN('abdfbffbffbbccddadabaee'))SELECTm.messageHash,pubsubTopic,contentTopic,payload,version,m.timestamp,metaFROMmessag"} 5.0
query_count_created{query="query_tag_WITHmin_timestampAS(SELECTMIN(timestamp)ASmin_tsFROMmessages_lookupWHEREmessagehashIN('abdfbffbffbbccddadabaee'))SELECTm.messageHash,pubsubTopic,contentTopic,payload,version,m.timestamp,metaFROMmessag"} 1739352086.0
query_count_total{query="query_tag_SELECTversion();"} 15495.0
query_count_created{query="query_tag_SELECTversion();"} 1739207995.0
query_count_total{query="query_tag_SELECTEXISTS(SELECTFROMpg_tablesWHEREtablename='version');"} 1.0
query_count_created{query="query_tag_SELECTEXISTS(SELECTFROMpg_tablesWHEREtablename='version');"} 1739207995.0
query_count_total{query="query_tag_SELECTpg_database_size(current_database())"} 188.0
query_count_created{query="query_tag_SELECTpg_database_size(current_database())"} 1739208002.0
query_count_total{query="InsertRow"} 45580.0
query_count_created{query="InsertRow"} 1739211246.0
query_count_total{query="query_tag_ALTERTABLEmessagesDETACHPARTITIONmessages__CONCURRENTLY;"} 58.0
query_count_created{query="query_tag_ALTERTABLEmessagesDETACHPARTITIONmessages__CONCURRENTLY;"} 1739238605.0

Please fix that one too.

Ivansete-status linked a pull request Feb 13, 2025 that will close this issue