Add Op dot #430
Conversation
src/flag_gems/ops/dot.py
Outdated

    with torch_device_fn.device(x.device):
        dot_kernel_1[grid_1](x, y, mid, N, block_size)
        dot_kernel_2[grid_2](mid, out, mid_size, block_mid)
I think it's better to take tensor stride into consideration, but it's a good implementation!
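To illustrate the reviewer's point, here is a minimal NumPy sketch (not the actual kernel code) of what goes wrong when a kernel reads `numel` consecutive elements from the underlying buffer and ignores the input's stride:

```python
import numpy as np

# A strided (non-contiguous) view over a buffer: a stride-oblivious kernel
# that reads consecutive elements from the base storage gets the wrong data.
base = np.arange(8, dtype=np.float32)   # buffer: [0, 1, 2, 3, 4, 5, 6, 7]
x = base[::2]                           # view [0, 2, 4, 6] with stride 2
y = np.ones(4, dtype=np.float32)

correct = np.dot(x, y)        # stride-aware read: 0 + 2 + 4 + 6 = 12
naive = np.dot(base[:4], y)   # stride-oblivious read: 0 + 1 + 2 + 3 = 6
```

Passing strides to the kernel (or calling `.contiguous()` on the host side first) avoids this mismatch.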
src/flag_gems/ops/dot.py
Outdated

    dot_kernel_1[grid_1](x, y, mid, N, block_size)
    dot_kernel_2[grid_2](mid, out, mid_size, block_mid)
Can we resort to a single persistent kernel when the input numel is small enough?
I probably didn't make myself clear. What I suggested is adding a one-pass branch to handle small inputs. We don't have to use atomic_add in either branch; the two-pass branch still exists.
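The suggested dispatch can be sketched on the host side as follows. This is a NumPy simulation of the two branches, not the actual Triton kernels; the cutoff and block size are hypothetical values that would need tuning:

```python
import numpy as np

ONE_PASS_MAX_NUMEL = 4096   # hypothetical cutoff; the real value would be tuned
BLOCK_SIZE = 1024           # hypothetical block size

def dot_with_dispatch(x: np.ndarray, y: np.ndarray) -> np.float32:
    """Sketch of the suggested dispatch: a single one-pass reduction for
    small inputs, the existing two-pass scheme otherwise. Neither branch
    needs atomic_add."""
    n = x.size
    prod = x.astype(np.float32) * y.astype(np.float32)  # fp32 intermediates
    if n <= ONE_PASS_MAX_NUMEL:
        # one-pass branch: a single persistent kernel reduces everything
        return prod.sum(dtype=np.float32)
    # two-pass branch: kernel 1 writes one partial sum per block into `mid`,
    # kernel 2 reduces `mid` into the final scalar
    mid_size = (n + BLOCK_SIZE - 1) // BLOCK_SIZE
    mid = np.empty(mid_size, dtype=np.float32)
    for blk in range(mid_size):
        mid[blk] = prod[blk * BLOCK_SIZE:(blk + 1) * BLOCK_SIZE].sum(dtype=np.float32)
    return mid.sum(dtype=np.float32)

x = np.ones(10000, dtype=np.float16)
y = np.ones(10000, dtype=np.float16)
large_result = dot_with_dispatch(x, y)          # numel > cutoff: two-pass branch
small_result = dot_with_dispatch(x[:8], y[:8])  # numel <= cutoff: one-pass branch
```

The one-pass branch saves a kernel launch and the `mid` round trip for small inputs, while the existing two-pass path handles large ones.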
@wlxjhyf, thanks for contributing to FlagGems. Please resolve the conversations and complete this PR at your earliest convenience.
I'm sorry, I just saw it. I'll do it right now.
Don't be sorry. We are very grateful for your volunteering!
StrongSpoon
left a comment
LGTM! Please resolve the conflicts and this pull request will be merged soon.
src/flag_gems/ops/dot.py
Outdated

    dot_kernel_2[grid_2](mid, out, mid_size, block_mid)
    ...
    else:
        block_size = triton.next_power_of_2(math.ceil(N))
math.ceil is useless here: N is already an integer, so passing it to triton.next_power_of_2 directly is enough.
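A quick demonstration of why the `math.ceil` wrapper is a no-op. The `next_power_of_2` below is a pure-Python stand-in used for illustration, not the Triton function itself:

```python
import math

def next_power_of_2(n: int) -> int:
    """Pure-Python stand-in for triton.next_power_of_2, for illustration."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()

# N comes from an element count, so it is already an int:
# math.ceil(N) == N, and the wrapped call equals the direct call.
N = 1000
assert math.ceil(N) == N
assert next_power_of_2(math.ceil(N)) == next_power_of_2(N) == 1024
```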
tests/test_reduction_ops.py
Outdated

    inp1 = torch.randn(shape, dtype=dtype, device=flag_gems.device)
    inp2 = torch.randn(shape, dtype=dtype, device=flag_gems.device)
    ref_inp1 = to_reference(inp1, False)
    ref_inp2 = to_reference(inp2, False)
It's recommended to set the parameter upcast to True, which makes the reference computation run at higher precision.
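A sketch of what an upcasting reference helper typically does, simulated with NumPy. The `to_reference` below is a hypothetical stand-in with assumed behavior, not the actual FlagGems test helper:

```python
import numpy as np

def to_reference(inp: np.ndarray, upcast: bool = False) -> np.ndarray:
    """Hypothetical stand-in for the test helper (assumed behavior):
    with upcast=True the reference input is promoted to float64, so the
    reference dot runs at high precision and rounding in the reference
    cannot mask errors in the kernel under test."""
    return inp.astype(np.float64) if upcast else inp.copy()

rng = np.random.default_rng(0)
inp1 = rng.standard_normal(1024).astype(np.float16)
inp2 = rng.standard_normal(1024).astype(np.float16)

ref_inp1 = to_reference(inp1, True)   # float64 reference input
ref_inp2 = to_reference(inp2, True)
ref_out = np.dot(ref_inp1, ref_inp2)  # high-precision reference result
```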
PR Category
Operator
Type of Change
Add new operator
Description
Implement the dot operator, supporting Float32, Float16, and BFloat16.
The implementation splits the dot operator into two steps: the first performs elementwise multiplication, and the second performs the summation.
At present, to meet accuracy requirements, the intermediate results of the first step are stored as float32.
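The two-step scheme described above can be sketched as a NumPy simulation of the kernels (not the actual Triton code; the block size of 1024 is an assumption):

```python
import numpy as np

def dot_two_pass(x: np.ndarray, y: np.ndarray, block_size: int = 1024) -> np.float32:
    """NumPy sketch of the described two-step scheme: step 1 multiplies
    elementwise and reduces each block into a float32 partial sum stored
    in `mid`; step 2 sums `mid` into the final scalar."""
    n = x.size
    mid_size = (n + block_size - 1) // block_size
    mid = np.zeros(mid_size, dtype=np.float32)   # fp32 intermediates
    for blk in range(mid_size):
        xb = x[blk * block_size:(blk + 1) * block_size].astype(np.float32)
        yb = y[blk * block_size:(blk + 1) * block_size].astype(np.float32)
        mid[blk] = (xb * yb).sum(dtype=np.float32)
    return mid.sum(dtype=np.float32)

# fp16 inputs with fp32 intermediates: 20000 * 0.5 * 0.5 is recovered exactly,
# whereas accumulating in fp16 would lose precision long before 5000.
a = np.full(20000, 0.5, dtype=np.float16)
b = np.full(20000, 0.5, dtype=np.float16)
out = dot_two_pass(a, b)
```

Keeping `mid` in float32 is what lets the fp16/bf16 paths meet the accuracy requirements mentioned above.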
Issue
#394
Progress
Performance
Correctness

Performance