IterDomain resize for pad, cat, slice #9

jacobhinkle · 2023-03-15T14:56:48Z

This is my attempt at a cherry-pick of @naoyam's commit csarofeen/pytorch@1e30fee which merged the original PR: csarofeen/pytorch#2480.

Related to Issue #8 but does not include python interface yet.

Cherry-pick of csarofeen/pytorch@1e30fee Original PR: csarofeen/pytorch#2480

Fixes some -Werror=sign-compare errors...

jacobhinkle · 2023-03-15T15:07:46Z

Oops. Closing in favor of #3

``` Traceback (most recent call last): File "/opt/pytorch/nvfuser/nvfuser/__init__.py", line 122, in execute result = self._execute( RuntimeError: isSame(values_[it.first], it.second) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/evaluator_common.cpp":314, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Precomputed values failed to validate. Something unexpected changed between the compilation and execution. nan != nan Exception raised from validate at /opt/pytorch/nvfuser/csrc/evaluator_common.cpp:314 (most recent call first): frame #0: nvfuser::nvfCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x8d (0x7fdc9919fe3b in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so) frame #1: nvfuser::nvfErrorFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x53 (0x7fdc992ded63 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so) frame #2: nvfuser::PrecomputedValues::validate() + 0x172 (0x7fdc993190f2 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so) frame #3: nvfuser::PrecomputedValues::evaluate() + 0x66 (0x7fdc9931fde6 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so) frame #4: nvfuser::FusionExecutor::inferOutputSizes(nvfuser::Fusion*, nvfuser::KernelArgumentHolder const&) + 0x8d (0x7fdc992ea12d in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so) frame #5: nvfuser::FusionKernelRuntime::compileFusionParallel(nvfuser::KernelArgumentHolder) + 0x46d (0x7fdc9943a6ad in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so) frame #6: nvfuser::FusionExecutorCache::runFusionWithInputs(c10::ArrayRef<c10::IValue> const&, std::optional<nvfuser::PrimDataType>, std::optional<signed char>) + 0xa8d (0x7fdc99443c9d in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so) frame #7: nvfuser::python_frontend::FusionDefinition::execute(c10::ArrayRef<c10::IValue> const&, bool, bool, std::optional<signed char>) const + 0x331 (0x7fdc997450e1 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so) frame #8: <unknown function> + 0xeec2e (0x7fdbe8274c2e in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #9: <unknown function> + 0x16e137 (0x7fdbe82f4137 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) <omitting python frames> frame #38: <unknown function> + 0x29d90 (0x7fdd26ea0d90 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #39: __libc_start_main + 0x80 (0x7fdd26ea0e40 in /usr/lib/x86_64-linux-gnu/libc.so.6) ```

* fix bug on mem type * fix tests by defining flags before skipping * revert change in parallelizeAllLike * linter * reduce size of tensors in PipelineTest.Pipeline for speedup * remove prints in PipelineTest.matmul_summa * remove barrier at environment tear-down * linter * fix ShardingTest by removing the symbolic case * linter and size_t to int

This introduces a thread-local global memory allocator for each device and uses it whenever there is an intermediate tensor needed which requires zero-initialization. To enable use `NVFUSER_ENABLE=reuse_zeroed_memory`. You can monitor the allocator using `NVFUSER_DUMP=global_zeroed_memory`. Before we enable this feature by default, we need to have high confidence that every kernel using zero-initialized memory will always clean up their semaphores. This is currently only the case for serial grid reductions, as far as I know. This enables the basic functionality of #1829. However, it does not modify existing algorithms to clean up their memory. See `NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling`, which succeeds when using serial grid reduction, but fails (in debug mode) when using `gridReduce` (note that this test is updated to behave differently in this PR): ``` # NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling Running main() from /opt/pytorch/nvfuser/third_party/googletest/googletest/src/gtest_main.cc Note: Google Test filter = SerialGridReductionTest.Scheduling [==========] Running 1 test from 1 test suite. [----------] Global test environment set-up. [----------] 1 test from SerialGridReductionTest [ RUN ] SerialGridReductionTest.Scheduling [global zeroed memory] Resizing arena to 512 bytes [global zeroed memory] Allocating byte range: 0 to 512 bytes [global zeroed memory] Resetting allocated bytes to 0 [global zeroed memory] Allocating byte range: 0 to 512 bytes [global zeroed memory] Resetting allocated bytes to 0 [global zeroed memory] Resizing arena to 16384 bytes [global zeroed memory] Allocating byte range: 0 to 16384 bytes [global zeroed memory] Resetting allocated bytes to 0 [global zeroed memory] Allocating byte range: 0 to 16384 bytes unknown file: Failure C++ exception with description "nnz.equal(0) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/global_allocator.cpp":88, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Global memory arena was not properly zeroed. Found 2048 bytes that are not zero Exception raised from checkZeroed at /opt/pytorch/nvfuser/csrc/global_allocator.cpp:88 (most recent call first): frame #0: <unknown function> + 0x2fde9e (0x556cdb95de9e in build/nvfuser_tests) frame #1: <unknown function> + 0x2fe0df (0x556cdb95e0df in build/nvfuser_tests) frame #2: <unknown function> + 0x3f3720 (0x556cdba53720 in build/nvfuser_tests) frame #3: <unknown function> + 0x3f33df (0x556cdba533df in build/nvfuser_tests) frame #4: <unknown function> + 0x3f38ed (0x556cdba538ed in build/nvfuser_tests) frame #5: <unknown function> + 0x315e67 (0x556cdb975e67 in build/nvfuser_tests) frame #6: <unknown function> + 0x7c5780 (0x556cdbe25780 in build/nvfuser_tests) frame #7: <unknown function> + 0x7c5877 (0x556cdbe25877 in build/nvfuser_tests) frame #8: <unknown function> + 0x138f8cc (0x556cdc9ef8cc in build/nvfuser_tests) frame #9: <unknown function> + 0x1457f0b (0x556cdcab7f0b in build/nvfuser_tests) frame #10: <unknown function> + 0x14519fd (0x556cdcab19fd in build/nvfuser_tests) frame #11: <unknown function> + 0x142de24 (0x556cdca8de24 in build/nvfuser_tests) frame #12: <unknown function> + 0x142e93f (0x556cdca8e93f in build/nvfuser_tests) frame #13: <unknown function> + 0x142f345 (0x556cdca8f345 in build/nvfuser_tests) frame #14: <unknown function> + 0x143f86c (0x556cdca9f86c in build/nvfuser_tests) frame #15: <unknown function> + 0x1458e98 (0x556cdcab8e98 in build/nvfuser_tests) frame #16: <unknown function> + 0x1452ac7 (0x556cdcab2ac7 in build/nvfuser_tests) frame #17: <unknown function> + 0x143de6d (0x556cdca9de6d in build/nvfuser_tests) frame #18: <unknown function> + 0x1407ca0 (0x556cdca67ca0 in build/nvfuser_tests) frame #19: <unknown function> + 0x1407c19 (0x556cdca67c19 in build/nvfuser_tests) frame #20: <unknown function> + 0x29d90 (0x7f616c7d4d90 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #21: __libc_start_main + 0x80 (0x7f616c7d4e40 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #22: <unknown function> + 0x11e9d5 (0x556cdb77e9d5 in build/nvfuser_tests) " thrown in the test body. To reproduce: NVFUSER_TEST_RANDOM_SEED=1711120799 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='SerialGridReductionTest.Scheduling' [ FAILED ] SerialGridReductionTest.Scheduling (5669 ms) [----------] 1 test from SerialGridReductionTest (5669 ms total) ``` This test runs with serial grid reduction, then with `gridReduce`. Each time it runs two grid reductions. Both serial grid reductions succeed because the semaphore buffer is properly zeroed. The `gridReduce` succeeds the first time since the memory pool calls `at::zeros` again to request a larger buffer size (`gridReduce` requires more semaphores since there is one per thread segment vs one for each each block segment). However, the second call to `gridReduce` fails because it has not cleaned up its semaphores. Hacking that function to force `PERSISTENT=1` would clean up the semaphores resulting in success in this case. I'm leaving those kind of modifications for a follow-up.

Example error message: ```CUDA [ RUN ] TMemTest.AddKernelSameRegion unknown file: Failure C++ exception with description " INTERNAL ASSERT FAILED at "/home/gaoxiang/Fuser/csrc/runtime/compiled_kernel.cpp":169, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. // Codegen generated utilities namespace tmem { __device__ __inline__ void alloc(uint32_t in0, uint32_t in1) { asm volatile("tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], %1;\n"::"r"(in0), "r"(in1)); } __device__ __inline__ void relinquishAllocPermit() { asm volatile("tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned;\n"); } __device__ __inline__ void store(uint32_t in0, Array<float, 1, 1> in1) { asm volatile( "tcgen05.st.sync.aligned.32x32b.x1.b32 [%0], {%1};\n" : :"r"(in0), "f"(in1[0]) ); } __device__ __inline__ void waitStore() { asm volatile("tcgen05.wait::st.sync.aligned;\n"); } __device__ __inline__ void load(Array<float, 1, 1>& out0, uint32_t in0) { asm( "tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];\n" :"=f"(out0[0]) :"r"(in0) ); } __device__ __inline__ void waitLoad() { asm volatile("tcgen05.wait::ld.sync.aligned;\n"); } } // namespace tmem __global__ void nvfuser_none_f0_c0_r0_g0(Tensor<float, 1, 1> T0, Tensor<float, 1, 1> T4, Tensor<float, 1, 1> T9) { alignas(16) extern __shared__ char array[]; const unsigned smem_offset = 0; nvfuser_index_t i0; i0 = ((nvfuser_index_t)threadIdx.x) + (32 * ((nvfuser_index_t)blockIdx.x)); bool b1; b1 = i0 < T0.logical_size[0LL]; uint32_t* T10 = reinterpret_cast<uint32_t*>(array + smem_offset + 0); tmem::alloc((uint32_t)(toSmem(T10)), (uint32_t)(32)); tmem::relinquishAllocPermit(); __syncthreads(); Array<float, 1, 1> T1; T1[0] = 0; if (b1) { T1[0] = T0[((T0.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((32 * T0.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))]; } TMemTensor T2(T10[0], 0, (uint16_t)(0)); tmem::store((uint32_t)(T2 + Array<uint16_t, 2, 1>{0, 0}), (*reinterpret_cast<Array<float, 1, 1>*>(&T1[0]))); tmem::waitStore(); Array<float, 1, 1> T3; tmem::load((*reinterpret_cast<Array<float, 1, 1>*>(&T3[0])), (uint32_t)(T2 + Array<uint16_t, 2, 1>{0, 0})); tmem::waitLoad(); asm volatile("tcgen05.dealloc.cta_group::1.sync.aligned.b32 %0, %1;\n"::"r"(T10[0]), "r"((uint32_t)(32))); Array<float, 1, 1> T5; T5[0] = 0; if (b1) { T5[0] = T4[((T4.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((32 * T4.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))]; } TMemTensor T6(T10[0], 0, (uint16_t)(1)); tmem::store((uint32_t)(T6 + Array<uint16_t, 2, 1>{0, 0}), (*reinterpret_cast<Array<float, 1, 1>*>(&T5[0]))); tmem::waitStore(); Array<float, 1, 1> T7; tmem::load((*reinterpret_cast<Array<float, 1, 1>*>(&T7[0])), (uint32_t)(T6 + Array<uint16_t, 2, 1>{0, 0})); tmem::waitLoad(); Array<float, 1, 1> T8; T8[0] = T3[0] + T7[0]; if (b1) { T9[i0] = T8[0]; } } } CUDA NVRTC compile error: ptxas application ptx input, line 48; error : Instruction 'tcgen05.alloc' not supported on .target 'sm_89' ptxas application ptx input, line 48; error : Feature '.cta_group::1' not supported on .target 'sm_89' ptxas application ptx input, line 52; error : Instruction 'tcgen05.relinquish_alloc_permit' not supported on .target 'sm_89' ptxas application ptx input, line 52; error : Feature '.cta_group::1' not supported on .target 'sm_89' ptxas application ptx input, line 69; error : Feature '.32x32b' not supported on .target 'sm_89' ptxas application ptx input, line 69; error : Instruction 'tcgen05.st' not supported on .target 'sm_89' ptxas application ptx input, line 73; error : Instruction 'tcgen05.wait' not supported on .target 'sm_89' ptxas application ptx input, line 77; error : Feature '.32x32b' not supported on .target 'sm_89' ptxas application ptx input, line 77; error : Instruction 'tcgen05.ld' not supported on .target 'sm_89' ptxas application ptx input, line 81; error : Instruction 'tcgen05.wait' not supported on .target 'sm_89' ptxas application ptx input, line 86; error : Instruction 'tcgen05.dealloc' not supported on .target 'sm_89' ptxas application ptx input, line 86; error : Feature '.cta_group::1' not supported on .target 'sm_89' ptxas application ptx input, line 101; error : Feature '.32x32b' not supported on .target 'sm_89' ptxas application ptx input, line 101; error : Instruction 'tcgen05.st' not supported on .target 'sm_89' ptxas application ptx input, line 105; error : Instruction 'tcgen05.wait' not supported on .target 'sm_89' ptxas application ptx input, line 109; error : Feature '.32x32b' not supported on .target 'sm_89' ptxas application ptx input, line 109; error : Instruction 'tcgen05.ld' not supported on .target 'sm_89' ptxas application ptx input, line 113; error : Instruction 'tcgen05.wait' not supported on .target 'sm_89' ptxas fatal : Ptx assembly aborted due to errors Exception raised from invoke at /home/gaoxiang/Fuser/csrc/runtime/compiled_kernel.cpp:169 (most recent call first): frame #0: <unknown function> + 0x1f3e89 (0x5f8f19a46e89 in ./bin/test_nvfuser) frame #1: <unknown function> + 0x5fc9ac (0x5f8f19e4f9ac in ./bin/test_nvfuser) frame #2: <unknown function> + 0x920965 (0x5f8f1a173965 in ./bin/test_nvfuser) frame #3: <unknown function> + 0x923318 (0x5f8f1a176318 in ./bin/test_nvfuser) frame #4: <unknown function> + 0x935e30 (0x5f8f1a188e30 in ./bin/test_nvfuser) frame #5: <unknown function> + 0x100f4f9 (0x5f8f1a8624f9 in ./bin/test_nvfuser) frame #6: <unknown function> + 0x1267437 (0x5f8f1aaba437 in ./bin/test_nvfuser) frame #7: <unknown function> + 0x1250676 (0x5f8f1aaa3676 in ./bin/test_nvfuser) frame #8: <unknown function> + 0x12508b5 (0x5f8f1aaa38b5 in ./bin/test_nvfuser) frame #9: <unknown function> + 0x125115b (0x5f8f1aaa415b in ./bin/test_nvfuser) frame #10: <unknown function> + 0x125ee25 (0x5f8f1aab1e25 in ./bin/test_nvfuser) frame #11: <unknown function> + 0x1267ac7 (0x5f8f1aabaac7 in ./bin/test_nvfuser) frame #12: <unknown function> + 0x125099f (0x5f8f1aaa399f in ./bin/test_nvfuser) frame #13: <unknown function> + 0x3cafcb (0x5f8f19c1dfcb in ./bin/test_nvfuser) frame #14: <unknown function> + 0x27488 (0x7a5456a35488 in /usr/lib/libc.so.6) frame #15: __libc_start_main + 0x8c (0x7a5456a3554c in /usr/lib/libc.so.6) frame #16: <unknown function> + 0x3cb535 (0x5f8f19c1e535 in ./bin/test_nvfuser) " thrown in the test body. To reproduce: NVFUSER_TEST_RANDOM_SEED=1740626485 NVFUSER_TEST_ATEN_RANDOM_SEED=0 test_nvfuser --gtest_filter='TMemTest.AddKernelSameRegion' [ FAILED ] TMemTest.AddKernelSameRegion (67 ms) ```

naoyam and others added 4 commits March 15, 2023 07:44

IterDomain resize for pad, cat, slice

1277724

Cherry-pick of csarofeen/pytorch@1e30fee Original PR: csarofeen/pytorch#2480

Address mismatching signedness in comparisons

36c003b

Fixes some -Werror=sign-compare errors...

Merge branch 'main' into cherrypick_1e30fee

be81798

Add mistakenly-removed LICENSE header to kernel_0.cu

42ef254

jacobhinkle requested a review from jjsjann123 March 15, 2023 14:56

jacobhinkle closed this Mar 15, 2023

jacobhinkle deleted the cherrypick_1e30fee branch March 15, 2023 15:07

liqiangxl mentioned this pull request May 8, 2023

MatmulSASSTest on H100 #306

Closed

DariusMaham mentioned this pull request Apr 19, 2024

Internal assert failed #2107

Closed

wujingyue mentioned this pull request May 8, 2024

Write a sharded transformer block in nvFuser API. #2199

Closed

This was referenced Jun 6, 2024

Squeezed IterDomain ?S536{1} must concretize to IterType::Broadcast but found ?S536{1}. #2359

Closed

Merging IterDomains requires that their iteration types match. #2317

Closed

wujingyue mentioned this pull request Oct 18, 2024

OpInfo has problems testing define_tensor. #3225

Closed

t-vi mentioned this pull request Oct 9, 2025

INTERNAL_ASSERT running Qwen-3-4B from litgpt through thunder #5358

Open

tbqh mentioned this pull request Oct 27, 2025

Add fmin_fmax_promotion presegmentation pass #5337

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

IterDomain resize for pad, cat, slice #9

IterDomain resize for pad, cat, slice #9

Uh oh!

jacobhinkle commented Mar 15, 2023

Uh oh!

jacobhinkle commented Mar 15, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

IterDomain resize for pad, cat, slice #9

IterDomain resize for pad, cat, slice #9

Uh oh!

Conversation

jacobhinkle commented Mar 15, 2023

Uh oh!

jacobhinkle commented Mar 15, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants