
Conversation

@hustxiayang
Contributor

Description

It does not make sense to include firstTokenLatency and outputTokenLatency metrics for embedding models. Embedding models are not generative, so their metrics need to be separated from those of generative models.

Signed-off-by: yxia216 <yxia216@bloomberg.net>
@hustxiayang hustxiayang requested a review from a team as a code owner September 16, 2025 22:21
@hustxiayang hustxiayang changed the title from "init" to "refactor: separate embedding metrics to remove metrics which does not make sense" on Sep 16, 2025
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.37%. Comparing base (0b97426) to head (3e1e032).

❌ Your project status has failed because the head coverage (79.37%) is below the target coverage (86.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1206      +/-   ##
==========================================
- Coverage   79.38%   79.37%   -0.02%     
==========================================
  Files          88       88              
  Lines       10067    10080      +13     
==========================================
+ Hits         7992     8001       +9     
- Misses       1712     1717       +5     
+ Partials      363      362       -1     



@hustxiayang hustxiayang changed the title from "refactor: separate embedding metrics to remove metrics which does not make sense" to "refactor: separate embedding metrics to remove metrics which does not make sense for embeddings" on Sep 16, 2025
@codefromthecrypt
Contributor

I think this PR is no longer necessary: even if the metrics are there, they are only used in chatCompletion, not embeddings. Double-check me if I'm wrong. Regardless, we can test this better soon.

@codefromthecrypt
Contributor

After #1204 we have an easy way to ad-hoc test this (though without it we can still hit the /metrics endpoint and look). If you run the CLI (standalone mode), you can do OTEL_METRICS_EXPORTER=console OTEL_METRIC_EXPORT_INTERVAL=100 aigw ... and then make whatever requests you want; the metrics will show up on the console after 100ms.
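
For anyone curious what those env vars do under the hood, here is a minimal Go sketch (using the OTel Go SDK's stdout exporter and a periodic reader; this is an illustration, not the actual aigw startup code) of roughly what OTEL_METRICS_EXPORTER=console with OTEL_METRIC_EXPORT_INTERVAL=100 configures:

```go
package main

import (
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/stdout/stdoutmetric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// Console exporter: prints metric data to stdout, like OTEL_METRICS_EXPORTER=console.
	exp, err := stdoutmetric.New()
	if err != nil {
		panic(err)
	}

	// Periodic reader flushing every 100ms, like OTEL_METRIC_EXPORT_INTERVAL=100.
	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(
			sdkmetric.NewPeriodicReader(exp, sdkmetric.WithInterval(100*time.Millisecond)),
		),
	)
	otel.SetMeterProvider(provider)
}
```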

@hustxiayang
Contributor Author

@codecov-commenter When we deployed embedding models, we got metrics like "prompt_tokens":7,"total_tokens":7,"completion_tokens":0,"prompt_tokens_details":null, and our users complained that the metrics are not correct: it does not make any sense to see completion_tokens:0, since embedding models are not even generative. That's the motivation for this PR.
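
For reference, a minimal Go sketch of the OpenAI-style usage block quoted above (field names come from that JSON; the struct itself is illustrative, not the gateway's actual type):

```go
// Usage mirrors the OpenAI-compatible usage block quoted above.
// For embeddings, CompletionTokens is always 0 because nothing is generated,
// which is the value users found confusing.
type Usage struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
	TotalTokens      int `json:"total_tokens"`
}
```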

@codefromthecrypt
Contributor

@hustxiayang so I think the main way to get on the same page is to discuss things in terms of what code is in main right now, not what 0.3 did.

As far as I can tell, there is no "first token" latency metric for embeddings in main any more, which is part of what this PR targets. If tests were added to this PR, you would be able to verify whether what I'm saying is true or false.

Also, "total" is removed now, so I think the discussion above is out of date on that point as well.

Separately, there seems to be a concern about whether tokens should be a metric for the embeddings endpoint. Perhaps the naming should change, but token count is a billing and rate-limiting metric regardless of whether we are talking about LLMs or embeddings. For example, if we remove tokens completely from the embeddings endpoint, we lose the ability to observe or control the thing folks are actually billed on: https://docs.langdock.com/api-endpoints/embedding/openai-embedding

I would suggest first rebasing this on main, then using tests to prove the before-and-after behavior of what this PR aims to do. Without that, we get stuck in interpretations which may no longer be valid.

Make sense?

@codefromthecrypt
Contributor

Meanwhile, I will refactor the tests so that embeddings metrics are tested. It is possible we aren't setting the operation name correctly per https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/#metric-gen_aiclienttokenusage

So for chat completions gen_ai.operation.name=chat, and for embeddings gen_ai.operation.name=embeddings.

If this isn't set correctly, the metrics would be conflated and nonsensical, so if that's the case it is very much a bug!
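
To make that concrete, here is a hedged Go sketch of recording gen_ai.client.token.usage with the gen_ai.operation.name attribute so chat and embeddings measurements stay separate. The metric and attribute names follow the semconv page linked above; the meter name and helper function are made up for illustration:

```go
package example

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// recordInputTokens is an illustrative helper, not gateway code.
func recordInputTokens(ctx context.Context, operation string, inputTokens int64) error {
	meter := otel.Meter("aigw-example") // hypothetical meter name

	// gen_ai.client.token.usage is defined by semconv as a histogram with unit "{token}".
	tokenUsage, err := meter.Int64Histogram(
		"gen_ai.client.token.usage",
		metric.WithUnit("{token}"),
		metric.WithDescription("Number of tokens used by the client operation"),
	)
	if err != nil {
		return err
	}

	// Tagging with gen_ai.operation.name ("chat" vs "embeddings") keeps the
	// series separate, so embeddings never report output tokens they don't have.
	tokenUsage.Record(ctx, inputTokens,
		metric.WithAttributes(
			attribute.String("gen_ai.operation.name", operation),
			attribute.String("gen_ai.token.type", "input"),
		),
	)
	return nil
}
```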

@codefromthecrypt
Contributor

#1219 verifies the old code indeed split metrics on op name.

@mathetake
Member

Let's reopen if this needs to be revived from the conflicts!

@mathetake mathetake closed this Oct 10, 2025