Add index_fill operator implementation #1013
Conversation
- Implement torch.index_fill using code generation pattern
- Use 2D grid parallelization for better performance
- Add autotune with 9 configurations for optimal block sizes
- Support 1D/2D/3D tensors with all float dtypes (float16, float32, bfloat16)
- Add comprehensive tests: 69 accuracy tests and performance benchmarks
- Performance vs PyTorch: 3-9% faster for dim=0, 1-15% faster for dim=1 in most cases
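For reviewers, the semantics the Triton kernel must reproduce can be sketched in plain Python. This is a minimal 2-D model of torch.index_fill behavior only; the function name and list-of-lists representation are illustrative, not FlagGems code:

```python
def index_fill_2d(mat, dim, indices, value):
    """Pure-Python model of torch.index_fill on a 2-D list of lists:
    fill whole rows (dim=0) or whole columns (dim=1) at the given indices."""
    out = [row[:] for row in mat]          # work on a copy, like clone()
    size = len(out) if dim == 0 else len(out[0])
    for i in indices:
        i = i + size if i < 0 else i       # negative indices wrap, as in PyTorch
        if dim == 0:
            out[i] = [value] * len(out[i])
        else:
            for row in out:
                row[i] = value
    return out

mat = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(index_fill_2d(mat, 0, [0, 2], 9))   # rows 0 and 2 filled with 9
print(index_fill_2d(mat, 1, [-1], 9))     # last column filled with 9
```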
Code Review
This pull request introduces a Triton-based implementation for the torch.index_fill operator, following a code generation pattern for kernel optimization. The implementation is well-structured, includes comprehensive accuracy and performance tests, and demonstrates good performance gains over the native PyTorch operator. The code is clean and the logic is sound. I have a few minor suggestions to improve code clarity and efficiency in the operator implementation.
```python
output = input.clone()
if not output.is_contiguous():
    output = output.contiguous()
```
This approach to creating a contiguous clone can be inefficient. If the input tensor is not contiguous, input.clone() creates a non-contiguous copy, and output.contiguous() then creates a second, contiguous copy. This results in two copy operations. You can achieve the same result more efficiently in a single operation by specifying the memory_format.
Suggested change:

```python
output = input.clone(memory_format=torch.contiguous_format)
```
- Remove unused rank parameter from generate functions
- Remove redundant n_indices parameter (M already holds index.numel())
- Optimize clone+contiguous to single operation with memory_format
@gemini-code-assist please review the latest code
Code Review
This pull request introduces a new index_fill operator implemented with Triton, which is a great addition. The implementation leverages code generation for rank-specific kernels and autotuning, which are excellent strategies for performance. The accompanying benchmarks and accuracy tests are also comprehensive. I've identified two correctness issues that need to be addressed. Firstly, the implementation doesn't handle negative indices, which is a deviation from standard PyTorch behavior. Secondly, it lacks validation for the dimensionality of the index tensor. I've provided specific suggestions to fix these issues. Once these are resolved, this will be a very solid contribution.
```python
    raise ValueError("input must be on CUDA device")
if not index.is_cuda:
    raise ValueError("index must be on CUDA device")
```
torch.index_fill expects the index tensor to be 1-dimensional. The current implementation does not validate this, which could lead to undefined behavior if a multi-dimensional index tensor is provided. To ensure conformance with the PyTorch API and improve robustness, you should add a check to verify that index.ndim is 1.
Suggested change:

```python
if index.ndim != 1:
    raise RuntimeError("index_fill(): Expected a 1-D tensor for index")
```
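The companion review point about negative indices can be illustrated with a small host-side helper. This is a hedged sketch of the wrapping behavior the kernel needs, not the actual FlagGems implementation:

```python
def normalize_indices(indices, dim_size):
    """Wrap negative indices into [0, dim_size), matching torch.index_fill's
    index semantics; raise on out-of-range values."""
    out = []
    for i in indices:
        j = i + dim_size if i < 0 else i
        if not 0 <= j < dim_size:
            raise IndexError(f"index {i} is out of bounds for dimension of size {dim_size}")
        out.append(j)
    return out

print(normalize_indices([0, -1, 2], 4))  # [0, 3, 2]
```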
@gemini-code-assist I have fixed the problems, please check again.
Code Review
This pull request introduces a new index_fill operator implemented with Triton, following a code generation pattern for optimization. The implementation is well-structured, handles various edge cases like negative dimensions, and includes comprehensive unit tests and performance benchmarks. The benchmarks show promising speedups over the native PyTorch implementation. My review includes a few suggestions to improve code clarity, correctness, and efficiency, such as removing unused parameters, correcting a type hint, and optimizing tensor cloning.
```python
output = input.clone()
if not output.is_contiguous():
    output = output.contiguous()
```
The current approach to ensure the output tensor is contiguous might perform two copies if the input is not contiguous (one for clone and another for contiguous). You can achieve this more efficiently with a single copy in all cases by using input.clone(memory_format=torch.contiguous_format).
```python
output = input.clone(memory_format=torch.contiguous_format)
```

force-pushed from 45059fb to 8dc183f
@gemini-code-assist check please
Code Review
This pull request introduces the torch.index_fill operator implementation using Triton with code generation. The implementation includes rank-specific kernel optimization, 2D grid parallelization, autotuning, and support for various tensor dimensions and data types. Performance tests show speed improvements compared to PyTorch. The code changes include adding a new file src/flag_gems/ops/index_fill.py with the implementation, modifying src/flag_gems/__init__.py and src/flag_gems/ops/__init__.py to include the new operator, adding performance tests in benchmark/test_select_and_slice_perf.py, and adding accuracy tests in tests/test_reduction_ops.py. The review focuses on correctness and maintainability.
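The 2D grid parallelization mentioned in this summary can be pictured as one grid axis tiling the M filled indices and the other tiling the N elements per slice, mirroring how a Triton launch grid of (pid_m, pid_n) programs would be sized. The helper name and block sizes below are illustrative:

```python
import math

def launch_grid(M, N, BLOCK_M, BLOCK_N):
    """Number of (pid_m, pid_n) program instances for a 2-D tiling:
    ceil-divide each problem dimension by its block size."""
    return (math.ceil(M / BLOCK_M), math.ceil(N / BLOCK_N))

# e.g. 10 indices to fill, 1000 elements per slice, 4x256 tiles:
print(launch_grid(10, 1000, 4, 256))  # (3, 4)
```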
```python
        value = value.item()
    else:
        raise ValueError("Value tensor must be a scalar (0-dim tensor or number)")
```
It's good practice to check if value is a scalar before attempting to convert it to an item. This can prevent unexpected errors if value is not a tensor or doesn't have a numel() method.
```python
if isinstance(value, torch.Tensor) and value.numel() == 1:
    value = value.item()
```

- Add support for negative indices in kernel (wrap to positive)
- Add validation for 1-D index tensor requirement
- Fix type hint: Mapping[str, Callable] -> Mapping[int, Callable]

Addresses additional code review feedback.
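The scalar check discussed in this thread can be modeled without torch. In this hedged sketch, FakeScalar is a purely illustrative stand-in for a 0-dim tensor exposing numel()/item():

```python
import numbers

class FakeScalar:
    """Illustrative stand-in for a 0-dim tensor with numel()/item()."""
    def __init__(self, v):
        self._v = v
    def numel(self):
        return 1
    def item(self):
        return self._v

def normalize_fill_value(value):
    """Accept a plain number or a one-element tensor-like; reject the rest."""
    if isinstance(value, numbers.Number):
        return value
    if hasattr(value, "numel") and value.numel() == 1:
        return value.item()
    raise ValueError("Value tensor must be a scalar (0-dim tensor or number)")

print(normalize_fill_value(3.5))              # 3.5
print(normalize_fill_value(FakeScalar(2.0)))  # 2.0
```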
force-pushed from b313bdb to ae87ed6
@gemini-code-assist check please.
Hello @factnn, I understand you'd like me to review the latest code. I can't automatically trigger a new review myself. If you'd like me to perform a fresh code review of the current pull request, please use the command /gemini review.
Regarding CI Failures

Summary

Evidence

Request

Could you please:

Thank you!
Address code review feedback:

1. Remove unnecessary code generation
   - Different rank tensors use the same kernel code
   - Simplified to direct @triton.jit kernel without dynamic code generation
   - Removed IndexFillFunction class and file-based overload system
2. Move autotune configs to YAML
   - Configs now loaded from runtime/backend/_nvidia/tune_configs.yaml
   - Follows FlagGems architectural standards
   - Maintains autotune key=['M', 'N', 'dim_stride'] for optimal performance
3. Keep contiguous memory layout
   - Continue using clone(memory_format=torch.contiguous_format)
   - Follows index_select pattern for optimal kernel performance
   - Non-contiguous support can be added in future if needed
4. Fix CPU testing compatibility
   - Add ref_index = to_reference(index, True) in all tests
   - Ensures ref tensors and indices are on same device
   - Fixes 'cuda:0 and cpu!' device mismatch errors

Changes:
- Simplified index_fill.py from 282 to 171 lines (-39%)
- Added tune_configs.yaml entry with 9 autotune configurations
- Updated 3 test functions in test_reduction_ops.py
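The "9 autotune configurations" moved into tune_configs.yaml can be pictured as a 3x3 sweep of tile sizes, keyed on ('M', 'N', 'dim_stride') as the commit message states. The block-size values below are illustrative, not the actual YAML entries:

```python
# Illustrative 3x3 sweep of tile sizes -> 9 autotune candidates.
BLOCK_M_CANDIDATES = [16, 32, 64]
BLOCK_N_CANDIDATES = [256, 512, 1024]

configs = [
    {"BLOCK_M": bm, "BLOCK_N": bn}
    for bm in BLOCK_M_CANDIDATES
    for bn in BLOCK_N_CANDIDATES
]
print(len(configs))  # 9
```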
Signed-off-by: taooo <[email protected]>
/gemini review
please fix code format-check-ci
Code Review
This pull request introduces a new index_fill operator, complete with a Triton kernel implementation, comprehensive accuracy tests, and performance benchmarks. The implementation is solid and the test coverage is good. I've identified a couple of areas for improvement: one is a performance optimization in the Python wrapper to reduce unnecessary tensor copies for non-contiguous inputs, and the other is a minor correction in the test suite to ensure the accuracy assertion is configured correctly for 1D tensors. Overall, this is a great addition.
```python
output = input.clone()
if not output.is_contiguous():
    output = output.contiguous()
```
The current method of ensuring a contiguous output tensor for non-contiguous inputs can be inefficient. When the input tensor is not contiguous, it results in two copy operations: one from input.clone() and another from output.contiguous(). This can be optimized to a single copy, which will improve performance, especially for large non-contiguous tensors.
Suggested change:

```python
output = input.contiguous()
if output is input:
    output = input.clone()
```
```python
with flag_gems.use_gems():
    res_out = torch.index_fill(input, dim, index, value)

gems_assert_close(res_out, ref_out, dtype, reduce_dim=1)
```
The reduce_dim parameter for gems_assert_close appears to be set incorrectly. For a 1D tensor of shape (N,), the reduction dimension size is N. Using reduce_dim=1 is likely a typo and could lead to misleading accuracy results or mask potential issues. It should be set to N to be consistent with the 2D and 3D tests.
Suggested change:

```python
gems_assert_close(res_out, ref_out, dtype, reduce_dim=N)
```
The to_reference(index, True) was incorrectly upcasting index tensors from int64 to float64, causing torch.index_fill() to fail with 'Expected dtype int64 for index' error. Changed to to_reference(index) without upcast flag to keep index as int64, following the same pattern as other index-based operators (index_add, index_select, etc.).
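The dtype rule behind this fix can be stated as a tiny function. This is a hedged sketch using dtype names as strings (the real to_reference operates on torch tensors and devices):

```python
def reference_dtype(dtype, upcast=False):
    """Float dtypes may be upcast to float64 for a higher-precision
    reference; integer index dtypes must stay int64 for indexing ops."""
    if upcast and dtype in ("float16", "bfloat16", "float32"):
        return "float64"
    return dtype

print(reference_dtype("float16", upcast=True))  # float64
print(reference_dtype("int64", upcast=True))    # int64 -> index stays usable
```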
PR Category

Operator

Type of Change

New Feature

Description

Add torch.index_fill operator implementation using Triton with code generation pattern.

Development Tool:

Key Features:

Performance:

Issue

N/A (New operator implementation)

Progress

Performance

Test Commands: