[Core] log more information about bad metric tag keys and values in metrics agent #52261
Signed-off-by: Josh Karpel <[email protected]>
python/ray/_private/metrics_agent.py
logger.error(
    f"Failed to record metric {gauge.name} with value {value} with tags {tags!r} and global tags {global_tags!r} due to: {e!r}"
)
raise e
Note that right now, if any metric fails to record, none of the metrics after it get recorded. Maybe that behavior is wrong? I've kept it as-is here, but if we remove this raise e, it'll continue on after the bad metric.
Yeah, I think we should skip the bad one and continue with the good ones. Could you update the PR to remove raise e?
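The skip-and-continue behavior requested here can be sketched roughly as follows. This is only an illustration, not the actual metrics_agent.py code: `record_all` and `record_one` are hypothetical names, with `record_one` standing in for whatever records a single metric.

```python
import logging

logger = logging.getLogger(__name__)


def record_all(records, record_one):
    """Record each item, logging and skipping any that fail.

    `record_one` is a hypothetical stand-in for the per-metric recording
    step; the real agent code is not reproduced here.
    """
    failed = []
    for record in records:
        try:
            record_one(record)
        except Exception as e:
            # Log the offending record, then move on to the next one
            # instead of re-raising and dropping everything after it.
            logger.error(f"Failed to record {record!r} due to: {e!r}")
            failed.append(record)
    return failed
```

Returning the failed records is a design choice of this sketch; it makes the behavior easy to assert on in a test without inspecting log output.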
Done, and added a test!
@zcin could you take a look at this one when you get a chance? 🙏🏻
Change lgtm. Will loop in someone from core team to look it over.
Signed-off-by: Josh Karpel <[email protected]>
assert samples[0].value == 1
assert samples[0].labels == {"tag": "a"}
assert samples[1].value == 1
assert samples[1].labels == {"tag": "c"}
Will this be flaky? Do we guarantee that "a" always comes before "c"?
Good question; looking at ray/python/ray/tests/test_metrics_agent_2.py, lines 113 to 194 in abca69f:
@pytest.mark.skipif(sys.platform == "win32", reason="Flaky on Windows.")
def test_metrics_agent_record_and_export(get_agent):
    namespace = "test"
    agent, agent_port = get_agent
    # Record a new gauge.
    metric_name = "test"
    test_gauge = Gauge(metric_name, "desc", "unit", ["tag"])
    record_a = Record(
        gauge=test_gauge,
        value=3,
        tags={"tag": "a"},
    )
    agent.record_and_export([record_a])
    name, samples = get_metric(get_prom_metric_name(namespace, metric_name), agent_port)
    assert name == get_prom_metric_name(namespace, metric_name)
    assert len(samples) == 1
    assert samples[0].value == 3
    assert samples[0].labels == {"tag": "a"}
    # Record the same gauge.
    record_b = Record(
        gauge=test_gauge,
        value=4,
        tags={"tag": "a"},
    )
    record_c = Record(
        gauge=test_gauge,
        value=4,
        tags={"tag": "a"},
    )
    agent.record_and_export([record_b, record_c])
    name, samples = get_metric(get_prom_metric_name(namespace, metric_name), agent_port)
    assert name == get_prom_metric_name(namespace, metric_name)
    assert len(samples) == 1
    assert samples[0].value == 4
    assert samples[0].labels == {"tag": "a"}
    # Record the same gauge with different ag.
    record_d = Record(
        gauge=test_gauge,
        value=6,
        tags={"tag": "aa"},
    )
    agent.record_and_export(
        [
            record_d,
        ]
    )
    name, samples = get_metric(get_prom_metric_name(namespace, metric_name), agent_port)
    assert name == get_prom_metric_name(namespace, metric_name)
    assert len(samples) == 2
    assert samples[0].value == 4
    assert samples[0].labels == {"tag": "a"}
    assert samples[1].value == 6
    assert samples[1].labels == {"tag": "aa"}
    # Record more than 1 gauge.
    metric_name_2 = "test2"
    test_gauge_2 = Gauge(metric_name_2, "desc", "unit", ["tag"])
    record_e = Record(
        gauge=test_gauge_2,
        value=1,
        tags={"tag": "b"},
    )
    agent.record_and_export([record_e])
    name, samples = get_metric(
        get_prom_metric_name(namespace, metric_name_2), agent_port
    )
    assert name == get_prom_metric_name(namespace, metric_name_2)
    assert samples[0].value == 1
    assert samples[0].labels == {"tag": "b"}
    # Make sure the previous record is still there.
    name, samples = get_metric(get_prom_metric_name(namespace, metric_name), agent_port)
    assert name == get_prom_metric_name(namespace, metric_name)
    assert len(samples) == 2
    assert samples[0].value == 4
    assert samples[0].labels == {"tag": "a"}
    assert samples[1].value == 6
    assert samples[1].labels == {"tag": "aa"}
I think it's better to compare sets instead of lists.
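One way to make the assertions order-independent, sketched under assumptions: `Sample` below is a hypothetical stand-in for the parsed sample objects in the test (only `.value` and `.labels` are used), and `sample_set` is an illustrative helper, not something in the Ray test suite.

```python
from collections import namedtuple

# Hypothetical stand-in for the parsed sample objects used in the test.
Sample = namedtuple("Sample", ["value", "labels"])


def sample_set(samples):
    """Convert samples to a set of hashable (value, labels) pairs."""
    # dicts are unhashable, so freeze the label items first.
    return {(s.value, frozenset(s.labels.items())) for s in samples}


samples = [Sample(1, {"tag": "c"}), Sample(1, {"tag": "a"})]
expected = {
    (1, frozenset({"tag": "a"}.items())),
    (1, frozenset({"tag": "c"}.items())),
}
# Passes regardless of the order in which the samples were returned.
assert sample_set(samples) == expected
```

Because a set collapses duplicates, it is probably worth keeping a separate `len(samples)` assertion alongside the set comparison.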
As a follow-up of this PR, could we follow #50443 and do early validation in
Thanks!
…etrics agent (ray-project#52261) Signed-off-by: Josh Karpel <[email protected]>
…etrics agent (ray-project#52261) Signed-off-by: Josh Karpel <[email protected]> Signed-off-by: zhaoch23 <[email protected]>
Why are these changes needed?
We noticed this error being thrown in our Ray Serve cluster, primarily during replica startup, when it seems like replica names get really long (>255 characters) if the Serve app name is very long:
But the error coming from the opencensus library doesn't say what tag value is bad! So this PR wraps that error in some custom logging to make sure the user knows what's going wrong.
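To illustrate the kind of diagnostic this is after, here is a rough sketch of a check that reports which tag is the problem. It is not the PR's code and does not reproduce opencensus's actual validation rules: `find_bad_tags` is a hypothetical helper, and the 255-character limit is an assumption taken from the observation above.

```python
# Assumed limit, based on the ">255 characters" observation in the description.
MAX_TAG_VALUE_LENGTH = 255


def find_bad_tags(tags, max_length=MAX_TAG_VALUE_LENGTH):
    """Return {key: reason} for tags whose value is not a str or is too long."""
    bad = {}
    for key, value in tags.items():
        if not isinstance(value, str):
            bad[key] = f"value has type {type(value).__name__}, expected str"
        elif len(value) > max_length:
            bad[key] = f"value has {len(value)} characters, limit is {max_length}"
    return bad


# Example: a very long replica name is flagged by key, the short tag is not.
problems = find_bad_tags({"replica": "r" * 300, "app": "my_app"})
assert list(problems) == ["replica"]
```

A check like this could feed the log message directly, so the user sees the offending key and why it was rejected rather than a bare library error.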