[Core] log more information about bad metric tag keys and values in metrics agent #52261

JoshKarpel · 2025-04-11T19:44:09Z

We noticed this error being thrown in our Ray Serve cluster, primarily during replica startup when it seems like replica names get really long (>255 characters) if the Serve app name is very long:

2025-04-11 14:39:39,891	ERROR reporter_agent.py:1268 -- Error publishing node physical stats.
Traceback (most recent call last):
  File ".../site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 1259, in _run_loop
    json_payload = await loop.run_in_executor(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 1287, in _compose_stats_payload
    self._metrics_agent.record_and_export(
  File ".../site-packages/ray/_private/metrics_agent.py", line 544, in record_and_export
    self._record_gauge(gauge, value, {**tags, **global_tags})
  File ".../site-packages/ray/_private/metrics_agent.py", line 554, in _record_gauge
    tag_value = tag_value_module.TagValue(tag_val)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/opencensus/tags/tag_value.py", line 34, in __new__
    raise ValueError(_TAG_VALUE_ERROR)
ValueError: tag value must not be longer than 255 characters and of ascii values between 32 - 126

But the error coming from the opencensus library doesn't say what tag value is bad! So this PR wraps that error in some custom logging to make sure the user knows what's going wrong.

Why are these changes needed?

Related issue number

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Josh Karpel <[email protected]>

JoshKarpel · 2025-04-11T19:51:04Z

python/ray/_private/metrics_agent.py

+                    logger.error(
+                        f"Failed to record metric {gauge.name} with value {value} with tags {tags!r} and global tags {global_tags!r} due to: {e!r}"
+                    )
+                    raise e


Note that right now, if any metric fails to record, all metrics after that one don't get recorded - maybe that behavior is wrong? I've kept it as-is here, but if we remove this raise e it'll continue on after the bad metric.

Yea, I think we should skip the bad one and continue with the good ones. Could you update the PR to remove raise e?

Done, and added a test!

JoshKarpel · 2025-04-28T20:51:10Z

@zcin could you take a look at this one when you get a chance? 🙏🏻

zcin

Change lgtm. Will loop in someone from core team to look it over.

Signed-off-by: Josh Karpel <[email protected]>

jjyao · 2025-05-05T18:30:02Z

python/ray/tests/test_metrics_agent_2.py

+    assert samples[0].value == 1
+    assert samples[0].labels == {"tag": "a"}
+    assert samples[1].value == 1
+    assert samples[1].labels == {"tag": "c"}


Will this be flaky, do we guarantee the order that "a" always come before "c"?

Good question; looking at

ray/python/ray/tests/test_metrics_agent_2.py

Lines 113 to 194 in abca69f

@pytest.mark.skipif(sys.platform == "win32", reason="Flaky on Windows.")

def test_metrics_agent_record_and_export(get_agent):

namespace = "test"

agent, agent_port = get_agent

# Record a new gauge.

metric_name = "test"

test_gauge = Gauge(metric_name, "desc", "unit", ["tag"])

record_a = Record(

gauge=test_gauge,

value=3,

tags={"tag": "a"},

)

agent.record_and_export([record_a])

name, samples = get_metric(get_prom_metric_name(namespace, metric_name), agent_port)

assert name == get_prom_metric_name(namespace, metric_name)

assert len(samples) == 1

assert samples[0].value == 3

assert samples[0].labels == {"tag": "a"}

# Record the same gauge.

record_b = Record(

gauge=test_gauge,

value=4,

tags={"tag": "a"},

)

record_c = Record(

gauge=test_gauge,

value=4,

tags={"tag": "a"},

)

agent.record_and_export([record_b, record_c])

name, samples = get_metric(get_prom_metric_name(namespace, metric_name), agent_port)

assert name == get_prom_metric_name(namespace, metric_name)

assert len(samples) == 1

assert samples[0].value == 4

assert samples[0].labels == {"tag": "a"}

# Record the same gauge with different ag.

record_d = Record(

gauge=test_gauge,

value=6,

tags={"tag": "aa"},

)

agent.record_and_export(

[

record_d,

]

)

name, samples = get_metric(get_prom_metric_name(namespace, metric_name), agent_port)

assert name == get_prom_metric_name(namespace, metric_name)

assert len(samples) == 2

assert samples[0].value == 4

assert samples[0].labels == {"tag": "a"}

assert samples[1].value == 6

assert samples[1].labels == {"tag": "aa"}

# Record more than 1 gauge.

metric_name_2 = "test2"

test_gauge_2 = Gauge(metric_name_2, "desc", "unit", ["tag"])

record_e = Record(

gauge=test_gauge_2,

value=1,

tags={"tag": "b"},

)

agent.record_and_export([record_e])

name, samples = get_metric(

get_prom_metric_name(namespace, metric_name_2), agent_port

)

assert name == get_prom_metric_name(namespace, metric_name_2)

assert samples[0].value == 1

assert samples[0].labels == {"tag": "b"}

# Make sure the previous record is still there.

name, samples = get_metric(get_prom_metric_name(namespace, metric_name), agent_port)

assert name == get_prom_metric_name(namespace, metric_name)

assert len(samples) == 2

assert samples[0].value == 4

assert samples[0].labels == {"tag": "a"}

assert samples[1].value == 6

assert samples[1].labels == {"tag": "aa"}

, they do not protect against that, which makes me think that either that test is also flaky, or the order does end up being guaranteed. Do you know if that test has been flaky?

I think it's better to check set instead of list.

jjyao · 2025-05-05T18:31:30Z

As a follow-up of this PR, could we follow #50443 and do early validation in metrics.py.

Signed-off-by: Josh Karpel <[email protected]>

JoshKarpel · 2025-05-05T23:41:14Z

Thanks!

…etrics agent (ray-project#52261) Signed-off-by: Josh Karpel <[email protected]>

…etrics agent (#52261) Signed-off-by: Josh Karpel <[email protected]>

…etrics agent (ray-project#52261) Signed-off-by: Josh Karpel <[email protected]> Signed-off-by: zhaoch23 <[email protected]>

protect against bad metric tag keys and values

e109be9

Signed-off-by: Josh Karpel <[email protected]>

JoshKarpel changed the title ~~[Core] protect against bad metric tag keys and values~~ [Core] log more information about bad metric tag keys and values Apr 11, 2025

JoshKarpel commented Apr 11, 2025

View reviewed changes

JoshKarpel changed the title ~~[Core] log more information about bad metric tag keys and values~~ [Core] log more information about bad metric tag keys and values in metrics agent Apr 11, 2025

hainesmichaelc added the community-contribution Contributed by the community label Apr 11, 2025

Merge branch 'master' into log-failed-gauge-recording

41dd778

JoshKarpel marked this pull request as ready for review April 29, 2025 18:22

masoudcharkhabi added the core Issues that should be addressed in Ray Core label May 1, 2025

Merge branch 'master' into log-failed-gauge-recording

bd7d493

zcin reviewed May 5, 2025

View reviewed changes

jjyao added the go add ONLY when ready to merge, run all tests label May 5, 2025

jjyao assigned jjyao and MengjinYan May 5, 2025

skip bad records, with test

abca69f

Signed-off-by: Josh Karpel <[email protected]>

jjyao reviewed May 5, 2025

View reviewed changes

JoshKarpel added 3 commits May 5, 2025 14:36

Merge branch 'master' into log-failed-gauge-recording

9eece98

use a set for the assert

9c0a046

Signed-off-by: Josh Karpel <[email protected]>

Merge branch 'master' into log-failed-gauge-recording

12a9675

JoshKarpel requested a review from jjyao May 5, 2025 23:25

jjyao approved these changes May 5, 2025

View reviewed changes

jjyao merged commit 955f536 into ray-project:master May 5, 2025
5 checks passed

JoshKarpel deleted the log-failed-gauge-recording branch May 5, 2025 23:41

vickytsang pushed a commit to ROCm/ray that referenced this pull request May 6, 2025

[Core] log more information about bad metric tag keys and values in m…

f5ab27f

…etrics agent (ray-project#52261) Signed-off-by: Josh Karpel <[email protected]>

GokuMohandas pushed a commit that referenced this pull request May 8, 2025

[Core] log more information about bad metric tag keys and values in m…

8a4b35c

…etrics agent (#52261) Signed-off-by: Josh Karpel <[email protected]>

hainesmichaelc added the community-backlog label May 22, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Core] log more information about bad metric tag keys and values in metrics agent #52261

[Core] log more information about bad metric tag keys and values in metrics agent #52261

Uh oh!

JoshKarpel commented Apr 11, 2025 •

edited

Loading

Uh oh!

JoshKarpel Apr 11, 2025

Uh oh!

jjyao May 5, 2025

Uh oh!

JoshKarpel May 5, 2025

Uh oh!

JoshKarpel commented Apr 28, 2025

Uh oh!

zcin left a comment

Uh oh!

jjyao May 5, 2025

Uh oh!

JoshKarpel May 5, 2025

Uh oh!

jjyao May 5, 2025

Uh oh!

jjyao commented May 5, 2025

Uh oh!

Uh oh!

JoshKarpel commented May 5, 2025

Uh oh!

Uh oh!

	@pytest.mark.skipif(sys.platform == "win32", reason="Flaky on Windows.")
	def test_metrics_agent_record_and_export(get_agent):
	namespace = "test"
	agent, agent_port = get_agent

	# Record a new gauge.
	metric_name = "test"
	test_gauge = Gauge(metric_name, "desc", "unit", ["tag"])
	record_a = Record(
	gauge=test_gauge,
	value=3,
	tags={"tag": "a"},
	)
	agent.record_and_export([record_a])
	name, samples = get_metric(get_prom_metric_name(namespace, metric_name), agent_port)
	assert name == get_prom_metric_name(namespace, metric_name)
	assert len(samples) == 1
	assert samples[0].value == 3
	assert samples[0].labels == {"tag": "a"}

	# Record the same gauge.
	record_b = Record(
	gauge=test_gauge,
	value=4,
	tags={"tag": "a"},
	)
	record_c = Record(
	gauge=test_gauge,
	value=4,
	tags={"tag": "a"},
	)
	agent.record_and_export([record_b, record_c])
	name, samples = get_metric(get_prom_metric_name(namespace, metric_name), agent_port)
	assert name == get_prom_metric_name(namespace, metric_name)
	assert len(samples) == 1
	assert samples[0].value == 4
	assert samples[0].labels == {"tag": "a"}

	# Record the same gauge with different ag.
	record_d = Record(
	gauge=test_gauge,
	value=6,
	tags={"tag": "aa"},
	)
	agent.record_and_export(
	[
	record_d,
	]
	)
	name, samples = get_metric(get_prom_metric_name(namespace, metric_name), agent_port)
	assert name == get_prom_metric_name(namespace, metric_name)
	assert len(samples) == 2
	assert samples[0].value == 4
	assert samples[0].labels == {"tag": "a"}
	assert samples[1].value == 6
	assert samples[1].labels == {"tag": "aa"}

	# Record more than 1 gauge.
	metric_name_2 = "test2"
	test_gauge_2 = Gauge(metric_name_2, "desc", "unit", ["tag"])
	record_e = Record(
	gauge=test_gauge_2,
	value=1,
	tags={"tag": "b"},
	)
	agent.record_and_export([record_e])
	name, samples = get_metric(
	get_prom_metric_name(namespace, metric_name_2), agent_port
	)

	assert name == get_prom_metric_name(namespace, metric_name_2)
	assert samples[0].value == 1
	assert samples[0].labels == {"tag": "b"}

	# Make sure the previous record is still there.
	name, samples = get_metric(get_prom_metric_name(namespace, metric_name), agent_port)
	assert name == get_prom_metric_name(namespace, metric_name)
	assert len(samples) == 2
	assert samples[0].value == 4
	assert samples[0].labels == {"tag": "a"}
	assert samples[1].value == 6
	assert samples[1].labels == {"tag": "aa"}

[Core] log more information about bad metric tag keys and values in metrics agent #52261

[Core] log more information about bad metric tag keys and values in metrics agent #52261

Uh oh!

Conversation

JoshKarpel commented Apr 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why are these changes needed?

Related issue number

Checks

Uh oh!

JoshKarpel Apr 11, 2025

Choose a reason for hiding this comment

Uh oh!

jjyao May 5, 2025

Choose a reason for hiding this comment

Uh oh!

JoshKarpel May 5, 2025

Choose a reason for hiding this comment

Uh oh!

JoshKarpel commented Apr 28, 2025

Uh oh!

zcin left a comment

Choose a reason for hiding this comment

Uh oh!

jjyao May 5, 2025

Choose a reason for hiding this comment

Uh oh!

JoshKarpel May 5, 2025

Choose a reason for hiding this comment

Uh oh!

jjyao May 5, 2025

Choose a reason for hiding this comment

Uh oh!

jjyao commented May 5, 2025

Uh oh!

Uh oh!

JoshKarpel commented May 5, 2025

Uh oh!

Uh oh!

JoshKarpel commented Apr 11, 2025 •

edited

Loading