Skip to content

[Core] log more information about bad metric tag keys and values in metrics agent #52261

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
May 5, 2025

Conversation

JoshKarpel
Copy link
Contributor

@JoshKarpel JoshKarpel commented Apr 11, 2025

We noticed this error being thrown in our Ray Serve cluster, primarily during replica startup when it seems like replica names get really long (>255 characters) if the Serve app name is very long:

2025-04-11 14:39:39,891	ERROR reporter_agent.py:1268 -- Error publishing node physical stats.
Traceback (most recent call last):
  File ".../site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 1259, in _run_loop
    json_payload = await loop.run_in_executor(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 1287, in _compose_stats_payload
    self._metrics_agent.record_and_export(
  File ".../site-packages/ray/_private/metrics_agent.py", line 544, in record_and_export
    self._record_gauge(gauge, value, {**tags, **global_tags})
  File ".../site-packages/ray/_private/metrics_agent.py", line 554, in _record_gauge
    tag_value = tag_value_module.TagValue(tag_val)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/opencensus/tags/tag_value.py", line 34, in __new__
    raise ValueError(_TAG_VALUE_ERROR)
ValueError: tag value must not be longer than 255 characters and of ascii values between 32 - 126 

But the error coming from the opencensus library doesn't say what tag value is bad! So this PR wraps that error in some custom logging to make sure the user knows what's going wrong.

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@JoshKarpel JoshKarpel changed the title [Core] protect against bad metric tag keys and values [Core] log more information about bad metric tag keys and values Apr 11, 2025
logger.error(
f"Failed to record metric {gauge.name} with value {value} with tags {tags!r} and global tags {global_tags!r} due to: {e!r}"
)
raise e
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that right now, if any metric fails to record, all metrics after that one don't get recorded - maybe that behavior is wrong? I've kept it as-is here, but if we remove this raise e it'll continue on after the bad metric.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I think we should skip the bad one and continue with the good ones. Could you update the PR to remove raise e?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, and added a test!

@JoshKarpel JoshKarpel changed the title [Core] log more information about bad metric tag keys and values [Core] log more information about bad metric tag keys and values in metrics agent Apr 11, 2025
@hainesmichaelc hainesmichaelc added the community-contribution Contributed by the community label Apr 11, 2025
@JoshKarpel
Copy link
Contributor Author

@zcin could you take a look at this one when you get a chance? 🙏🏻

@JoshKarpel JoshKarpel marked this pull request as ready for review April 29, 2025 18:22
@masoudcharkhabi masoudcharkhabi added the core Issues that should be addressed in Ray Core label May 1, 2025
Copy link
Contributor

@zcin zcin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change lgtm. Will loop in someone from core team to look it over.

@jjyao jjyao added the go add ONLY when ready to merge, run all tests label May 5, 2025
Signed-off-by: Josh Karpel <[email protected]>
Comment on lines 229 to 232
assert samples[0].value == 1
assert samples[0].labels == {"tag": "a"}
assert samples[1].value == 1
assert samples[1].labels == {"tag": "c"}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this be flaky, do we guarantee the order that "a" always come before "c"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question; looking at

@pytest.mark.skipif(sys.platform == "win32", reason="Flaky on Windows.")
def test_metrics_agent_record_and_export(get_agent):
namespace = "test"
agent, agent_port = get_agent
# Record a new gauge.
metric_name = "test"
test_gauge = Gauge(metric_name, "desc", "unit", ["tag"])
record_a = Record(
gauge=test_gauge,
value=3,
tags={"tag": "a"},
)
agent.record_and_export([record_a])
name, samples = get_metric(get_prom_metric_name(namespace, metric_name), agent_port)
assert name == get_prom_metric_name(namespace, metric_name)
assert len(samples) == 1
assert samples[0].value == 3
assert samples[0].labels == {"tag": "a"}
# Record the same gauge.
record_b = Record(
gauge=test_gauge,
value=4,
tags={"tag": "a"},
)
record_c = Record(
gauge=test_gauge,
value=4,
tags={"tag": "a"},
)
agent.record_and_export([record_b, record_c])
name, samples = get_metric(get_prom_metric_name(namespace, metric_name), agent_port)
assert name == get_prom_metric_name(namespace, metric_name)
assert len(samples) == 1
assert samples[0].value == 4
assert samples[0].labels == {"tag": "a"}
# Record the same gauge with different ag.
record_d = Record(
gauge=test_gauge,
value=6,
tags={"tag": "aa"},
)
agent.record_and_export(
[
record_d,
]
)
name, samples = get_metric(get_prom_metric_name(namespace, metric_name), agent_port)
assert name == get_prom_metric_name(namespace, metric_name)
assert len(samples) == 2
assert samples[0].value == 4
assert samples[0].labels == {"tag": "a"}
assert samples[1].value == 6
assert samples[1].labels == {"tag": "aa"}
# Record more than 1 gauge.
metric_name_2 = "test2"
test_gauge_2 = Gauge(metric_name_2, "desc", "unit", ["tag"])
record_e = Record(
gauge=test_gauge_2,
value=1,
tags={"tag": "b"},
)
agent.record_and_export([record_e])
name, samples = get_metric(
get_prom_metric_name(namespace, metric_name_2), agent_port
)
assert name == get_prom_metric_name(namespace, metric_name_2)
assert samples[0].value == 1
assert samples[0].labels == {"tag": "b"}
# Make sure the previous record is still there.
name, samples = get_metric(get_prom_metric_name(namespace, metric_name), agent_port)
assert name == get_prom_metric_name(namespace, metric_name)
assert len(samples) == 2
assert samples[0].value == 4
assert samples[0].labels == {"tag": "a"}
assert samples[1].value == 6
assert samples[1].labels == {"tag": "aa"}
, they do not protect against that, which makes me think that either that test is also flaky, or the order does end up being guaranteed. Do you know if that test has been flaky?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to check set instead of list.

@jjyao
Copy link
Collaborator

jjyao commented May 5, 2025

As a follow-up of this PR, could we follow #50443 and do early validation in metrics.py.

@JoshKarpel JoshKarpel requested a review from jjyao May 5, 2025 23:25
@jjyao jjyao merged commit 955f536 into ray-project:master May 5, 2025
5 checks passed
@JoshKarpel JoshKarpel deleted the log-failed-gauge-recording branch May 5, 2025 23:41
@JoshKarpel
Copy link
Contributor Author

Thanks!

vickytsang pushed a commit to ROCm/ray that referenced this pull request May 6, 2025
GokuMohandas pushed a commit that referenced this pull request May 8, 2025
zhaoch23 pushed a commit to Bye-legumes/ray that referenced this pull request May 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-backlog community-contribution Contributed by the community core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants