fix: map udf metrics #2588

tmenjo · 2025-05-02T06:20:30Z

fix: latency metrics for UDF processing time in map UDF

This PR has map UDF...

Visit forwarder_udf_processing_time (UDFProcessingTime) in non-stream map mode.
Visit forwarder_concurrent_udf_processing_time (ConcurrentUDFProcessingTime) instead of forwarder_udf_processing_time in stream map mode. I think this is the better fit because stream map processes messages concurrently.

doc: add forwarder_concurrent_udf_processing_time

This commit and the following ones in this PR update the Metrics document. This one adds forwarder_concurrent_udf_processing_time to the document.

doc: relationship between UDF and write processing time in Map UDF

This clarifies the relationship between forwarder_udf_processing_time/forwarder_concurrent_udf_processing_time and forwarder_write_processing_time. It should help pipeline developers analyze these metrics.

doc: upstream or downstream partition for each LET metric

This adds a new column, "Which partition", to the latency, traffic, and error metrics tables in the document to clarify what partition_name=<partition-name> refers to. It should help Numaflow users interpret each metric correctly.

Signed-off-by: Takashi Menjo <[email protected]>

codecov · 2025-05-02T06:27:48Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.86%. Comparing base (24484b8) to head (d292e08).
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2588      +/-   ##
==========================================
+ Coverage   70.44%   70.86%   +0.41%     
==========================================
  Files         395      395              
  Lines       62180    62603     +423     
==========================================
+ Hits        43803    44362     +559     
+ Misses      17253    17128     -125     
+ Partials     1124     1113      -11


whynowy · 2025-05-03T07:32:22Z

docs/operations/metrics/metrics.md

-| `forwarder_drop_bytes_total`               | Counter     | `pipeline=<pipeline-name>` <br> `vertex=<vertex-name>` <br> `vertex_type=<vertex-type>` <br> `replica=<replica-index>` <br> `partition_name=<partition-name>` | Provides the total number of bytes dropped by a given Vertex due to a full Inter-Step Buffer Partition          |
-| `forwarder_udf_read_total`                 | Counter     | `pipeline=<pipeline-name>` <br> `vertex=<vertex-name>` <br> `vertex_type=<vertex-type>` <br> `replica=<replica-index>` <br> `partition_name=<partition-name>` | Provides the total number of messages read by UDF                                                               |
-| `forwarder_udf_write_total`                | Counter     | `pipeline=<pipeline-name>` <br> `vertex=<vertex-name>` <br> `vertex_type=<vertex-type>` <br> `replica=<replica-index>` <br> `partition_name=<partition-name>` | Provides the total number of messages written by UDF                                                            |
+| Metric name                                | Metric type | Labels                                                                                                                                                        | Which partition | Description                                                                                                     |


What does "Which partition" mean? Is it an explanation of partition_name?

tmenjo · 2025-05-07

Here "which" means upstream or downstream, not a partition name. Sorry for the confusion.

I believe this column helps operators understand each metric more accurately. I will choose a more appropriate name for it.
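
To illustrate the upstream/downstream distinction discussed here, the following sketch shows how a `partition_name` label value could come from either side of a vertex. Everything in it is hypothetical (`partitionSide`, `labelsFor`, and the partition names `in-0` and `out-0`); it does not claim which side any particular Numaflow metric actually uses.

```go
package main

import "fmt"

// partitionSide says which side of the vertex a metric refers to.
// The type and its constants are hypothetical names for this sketch.
type partitionSide int

const (
	upstream   partitionSide = iota // the partition the vertex reads from
	downstream                      // the partition the vertex writes to
)

// labelsFor builds the partition_name label from the partition on the
// requested side. The other labels (pipeline, vertex, ...) are omitted.
func labelsFor(side partitionSide, fromPartition, toPartition string) map[string]string {
	name := fromPartition
	if side == downstream {
		name = toPartition
	}
	return map[string]string{"partition_name": name}
}

func main() {
	fmt.Println(labelsFor(upstream, "in-0", "out-0")["partition_name"])   // in-0
	fmt.Println(labelsFor(downstream, "in-0", "out-0")["partition_name"]) // out-0
}
```

The same label key carries a different meaning depending on the side, which is the ambiguity the proposed column is meant to remove.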

whynowy · 2025-05-03T07:51:45Z

pkg/udf/forward/forward.go

@@ -476,7 +478,7 @@ func (isdf *InterStepDataForward) streamMessage(ctx context.Context, dataMessage
 		return nil, fmt.Errorf("failed to applyUDF, error: %w", err)
 	}

-	metrics.UDFProcessingTime.With(metricLabels).Observe(float64(time.Since(start).Microseconds()))
+	metrics.ConcurrentUDFProcessingTime.With(metricLabels).Observe(float64(time.Since(start).Microseconds()))


Neither of these metrics makes sense in stream mode, since we don't have a way to exclude the buffer writing time.

tmenjo · 2025-05-07

> we don't have a way to exclude the buffer writing time.

Yes, I know that. My idea is to redefine ConcurrentUDFProcessingTime so that it includes both the UDF processing time and the time spent writing to buffers. I described this in the metrics document, in the third commit (doc: relationship between...).
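
The point about buffer writing time can be made concrete with a small deterministic sketch. It assumes, as the comments above suggest, that in stream mode each UDF result is written to the buffer inside the same timed region. The `fakeClock` type, `streamAndWrite`, and the durations are all hypothetical, and the loop is a serialized simplification of what is really concurrent work.

```go
package main

import "fmt"

// fakeClock advances only when told to, which keeps the sketch deterministic.
type fakeClock struct{ nowMicros int64 }

func (c *fakeClock) advance(micros int64) { c.nowMicros += micros }

// streamAndWrite simulates a stream map in which every UDF result is written
// to the buffer as soon as it arrives. It returns the time elapsed between
// the start of the call and the end of the loop, in microseconds.
func streamAndWrite(c *fakeClock, results int, udfMicros, writeMicros int64) int64 {
	start := c.nowMicros
	for i := 0; i < results; i++ {
		c.advance(udfMicros)   // the UDF produces one result
		c.advance(writeMicros) // the result is written inside the timed region
	}
	return c.nowMicros - start
}

func main() {
	elapsed := streamAndWrite(&fakeClock{}, 3, 100, 40)
	udfOnly := int64(3 * 100)
	fmt.Println(elapsed, udfOnly) // 420 300
}
```

Because the writes happen between UDF results, a single start/stop measurement around the loop cannot separate the two costs, which is why the proposal is to define the metric as covering both.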

whynowy · 2025-05-03T07:53:24Z

pkg/udf/forward/forward.go

 		udfResults, err = isdf.applyUDF(ctx, dataMessages)
 		if err != nil {
 			isdf.opts.logger.Errorw("failed to applyUDF", zap.Error(err))
 			// As there's no partial failure, non-ack all the readOffsets
 			isdf.fromBufferPartition.NoAck(ctx, readOffsets)
 			return err
 		}
+		metrics.UDFProcessingTime.With(metricLabels).Observe(float64(time.Since(udfStart).Microseconds()))


Should this be ConcurrentUDFProcessingTime?

tmenjo · 2025-05-07

It would be in batch mode, but what about unary mode? As I understand it, calling applyUDF() blocks until all the UDF results are received.

Would it be better to observe either the concurrent or the non-concurrent metric, depending on the mode?

tmenjo · 2025-05-30T01:22:43Z

Hello committers, could you review this pull request again? Otherwise, I'd like to know whether it is being put off like #2624.

tmenjo added 4 commits May 2, 2025 14:25

fix: latency metrics for UDF processing time in map UDF

e8aefef

Signed-off-by: Takashi Menjo <[email protected]>

doc: add forwarder_concurrent_udf_processing_time

e06e4e8

Signed-off-by: Takashi Menjo <[email protected]>

doc: relationship between UDF and write processing time in Map UDF

2f93578

Signed-off-by: Takashi Menjo <[email protected]>

doc: upstream or downstream partition for each LET metric

d292e08

Signed-off-by: Takashi Menjo <[email protected]>

tmenjo marked this pull request as ready for review May 2, 2025 06:40

tmenjo requested review from whynowy and vigith as code owners May 2, 2025 06:40

whynowy reviewed May 3, 2025

tmenjo May 7, 2025