
bug: waku_store_errors_created metric puts whole errors into labels #3282

Open
jakubgs opened this issue Feb 6, 2025 · 11 comments · Fixed by #3283 · May be fixed by #3298
Assignees: Ivansete-status
Labels: bug (Something isn't working), effort/hours (Estimated to be completed in a few hours)

Comments

@jakubgs (Contributor) commented Feb 6, 2025

Problem

The waku_store_errors_created metric is completely broken:

[Image: screenshot of the broken waku_store_errors_created metric]

It puts entire error messages, which somehow get concatenated, into a metric label:

waku_store_errors_created{type="BAD_RESPONSE: archive error: DIRVER_ERROR: failed to run query: error in asyncpool query: error in dbConnQuery calling sendQuery: failed pqsendQuery: another command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\nanother command is already in progress\n"} 1738746031.0

And the metric label just grows endlessly, with "another command is already in progress" repeated more and more times.

Impact

This fills our metrics storage with garbage data and makes querying slower for all fleets and all metrics.

Expected behavior

Do not put non-discrete values into labels. Labels need to be limited in cardinality, since Prometheus and Cortex storage is not designed to handle unbounded cardinality. If you want to report counts of error types, define a discrete enum of possible values and only ever report those values.
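For illustration, a minimal sketch (not the actual nwaku code) of what such a discrete classification could look like in Nim. The enum, proc, and counter names here are hypothetical, and the metrics call is shown only as a comment because the exact nim-metrics invocation inside waku_store is not reproduced:

import std/strutils

type StoreErrorKind = enum
  sekBadResponse = "BAD_RESPONSE"
  sekDriverError = "DRIVER_ERROR"
  sekTimeout     = "TIMEOUT"
  sekOther       = "OTHER"

proc classifyStoreError(msg: string): StoreErrorKind =
  ## Collapse an arbitrary error string into one of a few fixed label values,
  ## so the label set stays bounded no matter what the database returns.
  # "DIRVER_ERROR" matches the spelling seen in the reported label values.
  if msg.contains("DIRVER_ERROR") or msg.contains("DRIVER_ERROR"):
    sekDriverError
  elif msg.contains("BAD_RESPONSE"):
    sekBadResponse
  elif msg.toLowerAscii().contains("timeout"):
    sekTimeout
  else:
    sekOther

# Hypothetical usage with a counter that has a single "type" label:
# waku_store_errors.inc(labelValues = [$classifyStoreError(errMsg)])

With this shape the "type" label can only ever take four values, no matter how long or repetitive the underlying error message is.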

nwaku version/commit hash

Version: v0.34.0
Commit: 09f05fefe2614d640277f96ae96ab4f2f12c8031

jakubgs added the bug (Something isn't working) label Feb 6, 2025
@jakubgs (Contributor, Author) commented Feb 6, 2025

This was caught by examining Cortex errors that were triggered by label values exceeding the maximum length of 2048 characters:

ts=2025-02-06T15:45:22.605616198Z caller=push.go:53 level=warn org_id=fake msg="push refused" err="rpc error: code = Code(400) desc = label value too long for metric (actual: 2898, limit: 2048) metric: \"failed_store_queries_created{container=\\\"sonda\\\", datacenter=\\\"ac-cn-hongkong-c\\\", error=\\\"300 BAD_RESPONSE: BAD_RESPONSE: archive error: DIRVER_ERROR: failed to run query: error in asyncpool query: error \" label name: \"error\" label value: \"300 BAD_RESPONSE: BAD_RESPONSE: archive error: DIRVER_ERROR: failed to run query: error in asyncpool query: error in dbConnQuery calling sendQuery: failed pqsendQuery: another command is already in pr\""

Ivansete-status added the effort/hours (Estimated to be completed in a few hours) label Feb 6, 2025
Ivansete-status moved this to To Do in Waku Feb 6, 2025
Ivansete-status moved this from To Do to In Progress in Waku Feb 6, 2025
Ivansete-status self-assigned this Feb 6, 2025
@jakubgs (Contributor, Author) commented Feb 6, 2025

What the hell is this? You guys are even putting whole queries into metric labels:

ts=2025-02-06T22:22:29.714052012Z
caller=push.go:53
level=warn
org_id=fake
msg="push refused"
err="rpc error: code = Code(400) desc = label value too long for metric (actual: 151, limit: 128) metric:
\"query_time_secs_created{container=\\\"nim-waku-store\\\", datacenter=\\\"ac-cn-hongkong-c\\\", fleet=\\\"status.prod\\\", group=\\\",env:status,stage:prod,nim,waku,metrics,\\\", instance=\\\"store-02.ac-cn-hongkong-c.status.pro\" label name: \"query\" label value: \"query_tag_SELECTmessageHashFROMmessagesWHEREcontentTopicIN(?,?)ANDpubsubTopic=?ANDtimestamp>=?ANDtimestamp<=?ORDERBYtimestampDESC,messageHashDESCLIMIT?\""

PLEASE can we stick to discrete values for labels? This is ridiculous.

I'm setting the label value length limit to 64 characters, and all of this garbage is getting dropped by Cortex from now on.
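For reference, a rough sketch of what that limit looks like in the Cortex configuration; the exact key placement varies between Cortex versions, and max_label_value_length otherwise defaults to 2048, which matches the rejections quoted above:

# Per-tenant validation limits; the top-level key may be "limits" or
# "limits_config" depending on the Cortex version.
limits:
  # Refuse samples whose label values exceed 64 characters instead of storing them.
  max_label_value_length: 64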

@Ivansete-status (Collaborator)

@jakubgs - we used these long labels aiming to be able to analyse long-lasting queries. We will limit the label to 128 characters then.

@fryorcraken (Collaborator)

@jakubgs - we used these long labels aiming to be able to analyse long-lasting queries. We will limit the label to 128 characters then.

Shouldn't we use Kibana to find long-lasting queries by having a proper log entry "this query is long" so it's easy to find and debug?

@Ivansete-status (Collaborator)

Shouldn't we use Kibana to find long-lasting queries by having a proper log entry "this query is long" so it's easy to find and debug?

It makes sense indeed and is a valid option for deep debugging. However, with this approach we are looking at the overall query time statistics, aiming to get a global picture of how a certain query family behaves. We cannot rely on just a single measurement but on hundreds of measurements instead.

Kibana is super helpful for knowing the precise query that was slow, but first we need a global/coarse summary of query performance.

github-project-automation bot moved this from In Progress to Done in Waku Feb 7, 2025
@jakubgs (Contributor, Author) commented Feb 8, 2025

Shouldn't we use Kibana to find long-lasting queries by having a proper log entry "this query is long" so it's easy to find and debug?

This is correct. Yes, logs are the correct place for... logs.

we used these long labels aiming to be able to analyse long-lasting queries. We will limit the label to 128 characters then.

@Ivansete-status wrong. This isn't about length, it's about CARDINALITY. You cannot just put values into labels that can have a massive number of permutations. It will make querying our metrics storage horribly slow:

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

https://prometheus.io/docs/practices/naming/#labels

If you still have any labels that do not have discrete values, but are instead random garbage coming from errors or queries, then we need to fix that. If you want to measure counts of errors, you need an enum of possible errors, not just whatever string you might get back from a library as an error.

jakubgs reopened this Feb 8, 2025
@jakubgs (Contributor, Author) commented Feb 8, 2025

The fix in this PR is incorrect.

@fryorcraken (Collaborator)

It makes sense indeed and is a valid option for deep debugging. However, with this approach we are looking at the overall query time statistics, aiming to get a global picture of how a certain query family behaves. We cannot rely on just a single measurement but on hundreds of measurements instead.

Kibana is super helpful for knowing the precise query that was slow, but first we need a global/coarse summary of query performance.

It seems to me the best way forward is to classify each query into a low-cardinality type and report those via Prometheus, in a way that can be cross-checked with Kibana logs.
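One hedged sketch of that idea, with purely illustrative names (this is not nwaku's actual code, and the commented inc/observe calls assume the nim-metrics counter/histogram API):

# Discrete query families: the only values that ever appear in metric labels.
type QueryFamily = enum
  qfSelectMessages = "select_messages"
  qfSelectByHash   = "select_by_hash"
  qfInsertRow      = "insert_row"
  qfPartitionMgmt  = "partition_management"
  qfHealthCheck    = "health_check"

# At each call site the family is known statically, so the label set can never grow:
#
#   query_count.inc(labelValues = [$qfSelectMessages])
#   query_time_secs.observe(elapsed, labelValues = [$qfSelectMessages])
#
# The full SQL text goes to the log (Kibana) together with the same family
# string, so a slow family seen in the metrics can be traced back to the
# exact statements in Kibana.

That keeps per-family counts and timing statistics in Prometheus, while the unbounded detail (the literal query text) lives only in logs.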

@jakubgs (Contributor, Author) commented Feb 11, 2025

Yes, if you have an unbounded list of possible values, you need to define discrete buckets into which those values will go.

@jakubgs (Contributor, Author) commented Feb 11, 2025

We are getting a steadily growing stream of discarded samples:

[Image: graph of discarded samples steadily increasing]

I would appreciate a proper fix.

Ivansete-status moved this from Done to Priority in Waku Feb 12, 2025
@jakubgs (Contributor, Author) commented Feb 13, 2025

Another example of this madness is the query_count metric:

# HELP query_count number of times a query is being performed
# TYPE query_count counter
query_count_total{query="query_tag_SELECTchild.relnameASpartition_nameFROMpg_inheritsJOINpg_classparentONpg_inherits.inhparent=parent.oidJOINpg_classchildONpg_inherits.inhrelid=child.oidJOINpg_namespacenmsp_parentONnmsp_parent.oid=pare"} 388.0
query_count_created{query="query_tag_SELECTchild.relnameASpartition_nameFROMpg_inheritsJOINpg_classparentONpg_inherits.inhparent=parent.oidJOINpg_classchildONpg_inherits.inhrelid=child.oidJOINpg_namespacenmsp_parentONnmsp_parent.oid=pare"} 1739207995.0
query_count_total{query="query_tag_SELECTmessageHash,pubsubTopic,contentTopic,payload,version,timestamp,metaFROMmessagesWHEREcontentTopicIN(?)ANDpubsubTopic=?ANDtimestamp>=?ANDtimestamp<=?ORDERBYtimestampDESC,messageHashDESCLIMIT?"} 5.0
query_count_created{query="query_tag_SELECTmessageHash,pubsubTopic,contentTopic,payload,version,timestamp,metaFROMmessagesWHEREcontentTopicIN(?)ANDpubsubTopic=?ANDtimestamp>=?ANDtimestamp<=?ORDERBYtimestampDESC,messageHashDESCLIMIT?"} 1739368077.0
query_count_total{query="query_tag_SELECTmessageHash,pubsubTopic,contentTopic,payload,version,timestamp,metaFROMmessagesWHERE(timestamp,messageHash)>(?,?)ANDcontentTopicIN(?)ANDpubsubTopic=?ANDtimestamp>=?ANDtimestamp<=?ORDERBYtimestam"} 15.0
query_count_created{query="query_tag_SELECTmessageHash,pubsubTopic,contentTopic,payload,version,timestamp,metaFROMmessagesWHERE(timestamp,messageHash)>(?,?)ANDcontentTopicIN(?)ANDpubsubTopic=?ANDtimestamp>=?ANDtimestamp<=?ORDERBYtimestam"} 1739374326.0
query_count_total{query="query_tag_WITHmin_timestampAS(SELECTMIN(timestamp)ASmin_tsFROMmessages_lookupWHEREmessagehashIN('abdfbffbffbbccddadabaee'))SELECTm.messageHash,pubsubTopic,contentTopic,payload,version,m.timestamp,metaFROMmessag"} 5.0
query_count_created{query="query_tag_WITHmin_timestampAS(SELECTMIN(timestamp)ASmin_tsFROMmessages_lookupWHEREmessagehashIN('abdfbffbffbbccddadabaee'))SELECTm.messageHash,pubsubTopic,contentTopic,payload,version,m.timestamp,metaFROMmessag"} 1739352086.0
query_count_total{query="query_tag_SELECTversion();"} 15495.0
query_count_created{query="query_tag_SELECTversion();"} 1739207995.0
query_count_total{query="query_tag_SELECTEXISTS(SELECTFROMpg_tablesWHEREtablename='version');"} 1.0
query_count_created{query="query_tag_SELECTEXISTS(SELECTFROMpg_tablesWHEREtablename='version');"} 1739207995.0
query_count_total{query="query_tag_SELECTpg_database_size(current_database())"} 188.0
query_count_created{query="query_tag_SELECTpg_database_size(current_database())"} 1739208002.0
query_count_total{query="InsertRow"} 45580.0
query_count_created{query="InsertRow"} 1739211246.0
query_count_total{query="query_tag_ALTERTABLEmessagesDETACHPARTITIONmessages__CONCURRENTLY;"} 58.0
query_count_created{query="query_tag_ALTERTABLEmessagesDETACHPARTITIONmessages__CONCURRENTLY;"} 1739238605.0

Please fix that one too.

Ivansete-status linked a pull request Feb 13, 2025 that will close this issue