
Conversation

@jjsjann123
Collaborator

Cherry-picking from: csarofeen/pytorch#2576

Author: Naoya Maruyama [email protected]
Date: Mon Mar 13 17:50:01 2023 -0700

persistent_use_of_buffer is accumulated over all the resolution points. (#2576)

Recomputation for each persistent use should be done after the
accumulation is done.

Currently, recomputation and replaceVal can be done redundantly. For
example, on A100, that happens with NvFuserScheduler_BatchNorm_fp32/64/32/256.
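
In other words, the fix reorders the work: accumulate the persistent uses over all resolution points first, then recompute each use exactly once. A rough sketch of that ordering in C++ (illustrative only; `Val`, `ResolutionPoint`, and `recomputePersistentUse` are hypothetical stand-ins, not nvFuser's actual scheduler API):

```cpp
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins for the scheduler's real types.
struct Val {};
struct ResolutionPoint {
  std::vector<Val*> persistent_uses;
};

void recomputePersistentUse(Val* /*use*/) {
  // In the real scheduler this recomputes the persistent use and calls
  // replaceVal on its consumers.
}

void projectPersistentBuffers(
    const std::vector<ResolutionPoint*>& resolution_points) {
  // Before the fix, recomputation happened inside this loop, so a use shared
  // by several resolution points could be recomputed (and replaceVal'd)
  // redundantly.
  std::unordered_set<Val*> persistent_use_of_buffer;
  for (ResolutionPoint* point : resolution_points) {
    persistent_use_of_buffer.insert(
        point->persistent_uses.begin(), point->persistent_uses.end());
  }
  // After the fix: recompute each accumulated use exactly once, only after
  // accumulation over *all* resolution points is done.
  for (Val* use : persistent_use_of_buffer) {
    recomputePersistentUse(use);
  }
}
```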

samnordmann and others added 3 commits March 14, 2023 02:07
* fix tests for multicluster fusion
* …s. (#2576)
@jjsjann123 jjsjann123 requested a review from naoyam March 14, 2023 09:25
@jjsjann123
Collaborator Author

I think this one is good. I'm going to merge it with the offending commit (#3) reverted.

@jjsjann123 jjsjann123 changed the base branch from cherry_pick_2480 to main March 14, 2023 10:22
@jjsjann123 jjsjann123 merged commit cd899f6 into main Mar 14, 2023
@jjsjann123 jjsjann123 deleted the cherry_pick_2576 branch March 14, 2023 10:26
jacobhinkle pushed a commit to jacobhinkle/Fuser that referenced this pull request Mar 15, 2023
…s. (NVIDIA#4)

Co-authored-by: Naoya Maruyama <[email protected]>
@liqiangxl liqiangxl mentioned this pull request May 8, 2023
wujingyue added a commit that referenced this pull request Oct 11, 2023
```
Traceback (most recent call last):
  File "/opt/pytorch/nvfuser/nvfuser/__init__.py", line 122, in execute
    result = self._execute(
RuntimeError: isSame(values_[it.first], it.second) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/evaluator_common.cpp":314, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Precomputed values failed to validate.
Something unexpected changed between the compilation and execution.
nan != nan
Exception raised from validate at /opt/pytorch/nvfuser/csrc/evaluator_common.cpp:314 (most recent call first):
frame #0: nvfuser::nvfCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x8d (0x7fdc9919fe3b in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #1: nvfuser::nvfErrorFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x53 (0x7fdc992ded63 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #2: nvfuser::PrecomputedValues::validate() + 0x172 (0x7fdc993190f2 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #3: nvfuser::PrecomputedValues::evaluate() + 0x66 (0x7fdc9931fde6 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #4: nvfuser::FusionExecutor::inferOutputSizes(nvfuser::Fusion*, nvfuser::KernelArgumentHolder const&) + 0x8d (0x7fdc992ea12d in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #5: nvfuser::FusionKernelRuntime::compileFusionParallel(nvfuser::KernelArgumentHolder) + 0x46d (0x7fdc9943a6ad in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #6: nvfuser::FusionExecutorCache::runFusionWithInputs(c10::ArrayRef<c10::IValue> const&, std::optional<nvfuser::PrimDataType>, std::optional<signed char>) + 0xa8d (0x7fdc99443c9d in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #7: nvfuser::python_frontend::FusionDefinition::execute(c10::ArrayRef<c10::IValue> const&, bool, bool, std::optional<signed char>) const + 0x331 (0x7fdc997450e1 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #8: <unknown function> + 0xeec2e (0x7fdbe8274c2e in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
frame #9: <unknown function> + 0x16e137 (0x7fdbe82f4137 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
<omitting python frames>
frame #38: <unknown function> + 0x29d90 (0x7fdd26ea0d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #39: __libc_start_main + 0x80 (0x7fdd26ea0e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
```
cowanmeg pushed a commit to cowanmeg/Fuser that referenced this pull request Jan 5, 2024
…_test_profiling

Rebase reduction, add test profiling, and use FusionExecutorCache for allocation
jacobhinkle added a commit that referenced this pull request Mar 22, 2024
This introduces a thread-local global memory allocator for each device and
uses it whenever an intermediate tensor that requires zero-initialization is
needed.

To enable it, use `NVFUSER_ENABLE=reuse_zeroed_memory`. You can monitor the
allocator using `NVFUSER_DUMP=global_zeroed_memory`.
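
Conceptually, the allocator is an arena that hands out slices of a single zero-filled buffer and can only be rewound once the buffer is verified to be all zeros again. A minimal sketch of that idea (not the actual `csrc/global_allocator.cpp` code; the class and its growth policy are illustrative):

```cpp
#include <ATen/ATen.h>

class ZeroedArena {
 public:
  // Returns a zero-filled byte buffer, growing the arena if needed.
  at::Tensor allocate(int64_t bytes, c10::Device device) {
    if (!arena_.defined() || allocated_ + bytes > arena_.numel()) {
      // at::zeros hands back memory that is already zero-initialized.
      arena_ = at::zeros(
          {allocated_ + bytes},
          at::TensorOptions().dtype(at::kByte).device(device));
    }
    at::Tensor view = arena_.narrow(0, allocated_, bytes);
    allocated_ += bytes;
    return view;
  }

  // Reuse is only sound if every kernel returned its semaphores to zero;
  // this mirrors the debug-mode check whose failure is shown in the log
  // below.
  void reset() {
    TORCH_INTERNAL_ASSERT(
        !arena_.defined() || arena_.nonzero().numel() == 0,
        "Global memory arena was not properly zeroed");
    allocated_ = 0;
  }

 private:
  at::Tensor arena_;
  int64_t allocated_ = 0;
};

// One arena per thread (and, in the real allocator, one per device).
thread_local ZeroedArena zeroed_arena;
```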

Before we enable this feature by default, we need to have high confidence
that every kernel using zero-initialized memory will always clean up its
semaphores. As far as I know, this is currently only the case for serial
grid reductions.

This enables the basic functionality of #1829. However, it does not modify
existing algorithms to clean up their memory. See
`NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory
build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling`,
which succeeds when using serial grid reduction but fails (in debug mode)
when using `gridReduce` (note that this PR updates the test to behave
differently):
```
# NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling                                                       
Running main() from /opt/pytorch/nvfuser/third_party/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = SerialGridReductionTest.Scheduling
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from SerialGridReductionTest
[ RUN      ] SerialGridReductionTest.Scheduling
[global zeroed memory] Resizing arena to 512 bytes
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Resizing arena to 16384 bytes
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
unknown file: Failure
C++ exception with description "nnz.equal(0) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/global_allocator.cpp":88, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Global memory arena was not properly zeroed. Found 2048 bytes that are not zero
Exception raised from checkZeroed at /opt/pytorch/nvfuser/csrc/global_allocator.cpp:88 (most recent call first):
frame #0: <unknown function> + 0x2fde9e (0x556cdb95de9e in build/nvfuser_tests)
frame #1: <unknown function> + 0x2fe0df (0x556cdb95e0df in build/nvfuser_tests)
frame #2: <unknown function> + 0x3f3720 (0x556cdba53720 in build/nvfuser_tests)
frame #3: <unknown function> + 0x3f33df (0x556cdba533df in build/nvfuser_tests)
frame #4: <unknown function> + 0x3f38ed (0x556cdba538ed in build/nvfuser_tests)
frame #5: <unknown function> + 0x315e67 (0x556cdb975e67 in build/nvfuser_tests)
frame #6: <unknown function> + 0x7c5780 (0x556cdbe25780 in build/nvfuser_tests)
frame #7: <unknown function> + 0x7c5877 (0x556cdbe25877 in build/nvfuser_tests)
frame #8: <unknown function> + 0x138f8cc (0x556cdc9ef8cc in build/nvfuser_tests)
frame #9: <unknown function> + 0x1457f0b (0x556cdcab7f0b in build/nvfuser_tests)
frame #10: <unknown function> + 0x14519fd (0x556cdcab19fd in build/nvfuser_tests)
frame #11: <unknown function> + 0x142de24 (0x556cdca8de24 in build/nvfuser_tests)
frame #12: <unknown function> + 0x142e93f (0x556cdca8e93f in build/nvfuser_tests)
frame #13: <unknown function> + 0x142f345 (0x556cdca8f345 in build/nvfuser_tests)
frame #14: <unknown function> + 0x143f86c (0x556cdca9f86c in build/nvfuser_tests)
frame #15: <unknown function> + 0x1458e98 (0x556cdcab8e98 in build/nvfuser_tests)
frame #16: <unknown function> + 0x1452ac7 (0x556cdcab2ac7 in build/nvfuser_tests)
frame #17: <unknown function> + 0x143de6d (0x556cdca9de6d in build/nvfuser_tests)
frame #18: <unknown function> + 0x1407ca0 (0x556cdca67ca0 in build/nvfuser_tests)
frame #19: <unknown function> + 0x1407c19 (0x556cdca67c19 in build/nvfuser_tests)
frame #20: <unknown function> + 0x29d90 (0x7f616c7d4d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f616c7d4e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #22: <unknown function> + 0x11e9d5 (0x556cdb77e9d5 in build/nvfuser_tests)
" thrown in the test body.

To reproduce: NVFUSER_TEST_RANDOM_SEED=1711120799 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='SerialGridReductionTest.Scheduling'
[  FAILED  ] SerialGridReductionTest.Scheduling (5669 ms)
[----------] 1 test from SerialGridReductionTest (5669 ms total)
```
This test runs with serial grid reduction, then with `gridReduce`. Each
time it runs two grid reductions. Both serial grid reductions succeed
because the semaphore buffer is properly zeroed. The `gridReduce`
succeeds the first time since the memory pool calls `at::zeros` again to
request a larger buffer size (`gridReduce` requires more semaphores
since there is one per thread segment vs. one per block segment).
However, the second call to `gridReduce` fails because it has not cleaned
up its semaphores. Hacking that function to force `PERSISTENT=1` would
clean up the semaphores, resulting in success in this case; the cleanup
pattern is sketched below. I'm leaving those kinds of modifications for a
follow-up.
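
For reference, the cleanup property at issue looks roughly like this on the device side (a hedged sketch, not nvFuser's actual `gridReduce`):

```cuda
// The last block to arrive returns the semaphore to zero so the
// zeroed-memory arena can safely hand the same buffer to a later kernel.
// Omitting the reset is exactly the failure mode the assertion in the log
// above catches.
__device__ void arriveAndCleanUp(unsigned long long* semaphore) {
  __syncthreads();
  if (threadIdx.x == 0) {
    // Each block bumps the counter once; the block that observes the final
    // count is the last to arrive.
    unsigned long long arrived = atomicAdd(semaphore, 1ULL) + 1ULL;
    if (arrived == gridDim.x) {
      *semaphore = 0ULL;  // cleanup: leave the buffer zeroed, not counted-up
    }
  }
  __syncthreads();
}
```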
naoyam added a commit that referenced this pull request Nov 6, 2024
A type error was detected with #3263 while I was testing it with a Debug
build.

```
pytest -v tests/python/test_python_frontend.py -k test_pad_dynamic
```

The test produces the following fusion:

```
Inputs:
  T0_g_float[ bS0{1}, iS1{i1}, iS2{i2} ]
Outputs:
  T1_g_float[ bS11{1}, iS12{( ( i1 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}, iS13{( ( i2 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )} ]

%kernel_math {
f7 = (float)(7);
f9 = float(2.5) * f7;
i11 = (int64_t)(f9);
i14 = (nvfuser_index_t)(i11);
i16 = (nvfuser_index_t)(i11);
i18 = (nvfuser_index_t)(i11);
i20 = (nvfuser_index_t)(i11);
T2_l_float[ bS3{1}, iS5{( ( i1 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}rf, iS7{( ( i2 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}rf ]
   = pad( T0_g_float[ bS0{1}, iS1{i1}, iS2{i2} ], {0, 0, i14, i16, i18, i20} )
T1_g_float[ bS11{1}, iS12{( ( i1 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}, iS13{( ( i2 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )} ]
   = Set( T2_l_float[ bS3{1}, iS5{( ( i1 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}rf, iS7{( ( i2 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}rf ], cache_op=Streaming )
} // %kernel_math
```

Stack trace:
```
#0  __cxxabiv1::__cxa_throw (obj=0xabc6ea0, tinfo=0x7ffeba81c248 <typeinfo for nvfuser::nvfError>, dest=0x7ffeba059370 <nvfuser::nvfError::~nvfError()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:80
#1  0x00007ffeba058665 in nvfuser::nvfCheckFail (func=0x7ffeb9927c33 "as", file=0x7ffeb99d9a1b "/raid/nmaruyama/debug1/csrc/utils.h", line=119,
    msg=0x7ffeb99c36e6 " INTERNAL ASSERT FAILED at \"/raid/nmaruyama/debug1/csrc/utils.h\":119, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. ") at /raid/nmaruyama/debug1/csrc/exceptions.cpp:283
#2  0x00007ffeb9c1be4b in nvfuser::nvfErrorFail (func=0x7ffeb9927c33 "as", file=0x7ffeb99d9a1b "/raid/nmaruyama/debug1/csrc/utils.h", line=119,
    condMsg=0x7ffeb99c36e6 " INTERNAL ASSERT FAILED at \"/raid/nmaruyama/debug1/csrc/utils.h\":119, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. ") at /raid/nmaruyama/debug1/csrc/exceptions.h:229
#3  0x00007ffeb9c1bbe4 in nvfuser::PolymorphicBase::as<nvfuser::TensorView> (this=0xac07490) at /raid/nmaruyama/debug1/csrc/utils.h:119
#4  0x00007ffeba54c67f in nvfuser::(anonymous namespace)::isLoadGlobalToLocal (expr=0xabd7c50) at /raid/nmaruyama/debug1/csrc/scheduler/cache_policy_refiner.cpp:61
#5  0x00007ffeba54c599 in nvfuser::refineCachePolicy (fusion=0xabef940) at /raid/nmaruyama/debug1/csrc/scheduler/cache_policy_refiner.cpp:153
```
zasdfgbnm added a commit that referenced this pull request Feb 27, 2025
Example error message:

```CUDA
[ RUN      ] TMemTest.AddKernelSameRegion
unknown file: Failure
C++ exception with description " INTERNAL ASSERT FAILED at "/home/gaoxiang/Fuser/csrc/runtime/compiled_kernel.cpp":169, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. 
// Codegen generated utilities

namespace tmem {
__device__ __inline__ void alloc(uint32_t in0, uint32_t in1) {
  asm volatile("tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], %1;\n"::"r"(in0), "r"(in1));
}
__device__ __inline__ void relinquishAllocPermit() {
  asm volatile("tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned;\n");
}
__device__ __inline__ void store(uint32_t in0, Array<float, 1, 1> in1) {
  asm volatile(
    "tcgen05.st.sync.aligned.32x32b.x1.b32 [%0], {%1};\n"
    :
    :"r"(in0),
     "f"(in1[0])
  );
}
__device__ __inline__ void waitStore() {
  asm volatile("tcgen05.wait::st.sync.aligned;\n");
}
__device__ __inline__ void load(Array<float, 1, 1>& out0, uint32_t in0) {
  asm(
    "tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];\n"
    :"=f"(out0[0])
    :"r"(in0)
  );
}
__device__ __inline__ void waitLoad() {
  asm volatile("tcgen05.wait::ld.sync.aligned;\n");
}
} // namespace tmem
__global__ void nvfuser_none_f0_c0_r0_g0(Tensor<float, 1, 1> T0, Tensor<float, 1, 1> T4, Tensor<float, 1, 1> T9) {
  alignas(16) extern __shared__ char array[];
  const unsigned smem_offset = 0;
  nvfuser_index_t i0;
  i0 = ((nvfuser_index_t)threadIdx.x) + (32 * ((nvfuser_index_t)blockIdx.x));
  bool b1;
  b1 = i0 < T0.logical_size[0LL];
  uint32_t* T10 = reinterpret_cast<uint32_t*>(array + smem_offset + 0);
  tmem::alloc((uint32_t)(toSmem(T10)), (uint32_t)(32));
  tmem::relinquishAllocPermit();
  __syncthreads();
  Array<float, 1, 1> T1;
  T1[0] = 0;
  if (b1) {
    T1[0]
       = T0[((T0.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((32 * T0.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))];
  }
  TMemTensor T2(T10[0], 0, (uint16_t)(0));
  tmem::store((uint32_t)(T2 + Array<uint16_t, 2, 1>{0, 0}), (*reinterpret_cast<Array<float, 1, 1>*>(&T1[0])));
  tmem::waitStore();
  Array<float, 1, 1> T3;
  tmem::load((*reinterpret_cast<Array<float, 1, 1>*>(&T3[0])), (uint32_t)(T2 + Array<uint16_t, 2, 1>{0, 0}));
  tmem::waitLoad();
  asm volatile("tcgen05.dealloc.cta_group::1.sync.aligned.b32 %0, %1;\n"::"r"(T10[0]), "r"((uint32_t)(32)));
  Array<float, 1, 1> T5;
  T5[0] = 0;
  if (b1) {
    T5[0]
       = T4[((T4.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((32 * T4.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))];
  }
  TMemTensor T6(T10[0], 0, (uint16_t)(1));
  tmem::store((uint32_t)(T6 + Array<uint16_t, 2, 1>{0, 0}), (*reinterpret_cast<Array<float, 1, 1>*>(&T5[0])));
  tmem::waitStore();
  Array<float, 1, 1> T7;
  tmem::load((*reinterpret_cast<Array<float, 1, 1>*>(&T7[0])), (uint32_t)(T6 + Array<uint16_t, 2, 1>{0, 0}));
  tmem::waitLoad();
  Array<float, 1, 1> T8;
  T8[0]
    = T3[0]
    + T7[0];
  if (b1) {
    T9[i0]
       = T8[0];
  }
}

CUDA NVRTC compile error: ptxas application ptx input, line 48; error   : Instruction 'tcgen05.alloc' not supported on .target 'sm_89'
ptxas application ptx input, line 48; error   : Feature '.cta_group::1' not supported on .target 'sm_89'
ptxas application ptx input, line 52; error   : Instruction 'tcgen05.relinquish_alloc_permit' not supported on .target 'sm_89'
ptxas application ptx input, line 52; error   : Feature '.cta_group::1' not supported on .target 'sm_89'
ptxas application ptx input, line 69; error   : Feature '.32x32b' not supported on .target 'sm_89'
ptxas application ptx input, line 69; error   : Instruction 'tcgen05.st' not supported on .target 'sm_89'
ptxas application ptx input, line 73; error   : Instruction 'tcgen05.wait' not supported on .target 'sm_89'
ptxas application ptx input, line 77; error   : Feature '.32x32b' not supported on .target 'sm_89'
ptxas application ptx input, line 77; error   : Instruction 'tcgen05.ld' not supported on .target 'sm_89'
ptxas application ptx input, line 81; error   : Instruction 'tcgen05.wait' not supported on .target 'sm_89'
ptxas application ptx input, line 86; error   : Instruction 'tcgen05.dealloc' not supported on .target 'sm_89'
ptxas application ptx input, line 86; error   : Feature '.cta_group::1' not supported on .target 'sm_89'
ptxas application ptx input, line 101; error   : Feature '.32x32b' not supported on .target 'sm_89'
ptxas application ptx input, line 101; error   : Instruction 'tcgen05.st' not supported on .target 'sm_89'
ptxas application ptx input, line 105; error   : Instruction 'tcgen05.wait' not supported on .target 'sm_89'
ptxas application ptx input, line 109; error   : Feature '.32x32b' not supported on .target 'sm_89'
ptxas application ptx input, line 109; error   : Instruction 'tcgen05.ld' not supported on .target 'sm_89'
ptxas application ptx input, line 113; error   : Instruction 'tcgen05.wait' not supported on .target 'sm_89'
ptxas fatal   : Ptx assembly aborted due to errors

Exception raised from invoke at /home/gaoxiang/Fuser/csrc/runtime/compiled_kernel.cpp:169 (most recent call first):
frame #0: <unknown function> + 0x1f3e89 (0x5f8f19a46e89 in ./bin/test_nvfuser)
frame #1: <unknown function> + 0x5fc9ac (0x5f8f19e4f9ac in ./bin/test_nvfuser)
frame #2: <unknown function> + 0x920965 (0x5f8f1a173965 in ./bin/test_nvfuser)
frame #3: <unknown function> + 0x923318 (0x5f8f1a176318 in ./bin/test_nvfuser)
frame #4: <unknown function> + 0x935e30 (0x5f8f1a188e30 in ./bin/test_nvfuser)
frame #5: <unknown function> + 0x100f4f9 (0x5f8f1a8624f9 in ./bin/test_nvfuser)
frame #6: <unknown function> + 0x1267437 (0x5f8f1aaba437 in ./bin/test_nvfuser)
frame #7: <unknown function> + 0x1250676 (0x5f8f1aaa3676 in ./bin/test_nvfuser)
frame #8: <unknown function> + 0x12508b5 (0x5f8f1aaa38b5 in ./bin/test_nvfuser)
frame #9: <unknown function> + 0x125115b (0x5f8f1aaa415b in ./bin/test_nvfuser)
frame #10: <unknown function> + 0x125ee25 (0x5f8f1aab1e25 in ./bin/test_nvfuser)
frame #11: <unknown function> + 0x1267ac7 (0x5f8f1aabaac7 in ./bin/test_nvfuser)
frame #12: <unknown function> + 0x125099f (0x5f8f1aaa399f in ./bin/test_nvfuser)
frame #13: <unknown function> + 0x3cafcb (0x5f8f19c1dfcb in ./bin/test_nvfuser)
frame #14: <unknown function> + 0x27488 (0x7a5456a35488 in /usr/lib/libc.so.6)
frame #15: __libc_start_main + 0x8c (0x7a5456a3554c in /usr/lib/libc.so.6)
frame #16: <unknown function> + 0x3cb535 (0x5f8f19c1e535 in ./bin/test_nvfuser)
" thrown in the test body.

To reproduce: NVFUSER_TEST_RANDOM_SEED=1740626485 NVFUSER_TEST_ATEN_RANDOM_SEED=0 test_nvfuser --gtest_filter='TMemTest.AddKernelSameRegion'
[  FAILED  ] TMemTest.AddKernelSameRegion (67 ms)
```
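
The failure itself is expected on Ada: `tcgen05` (tensor memory) instructions only exist on Blackwell-class GPUs (compute capability 10.0+), so ptxas rejects them for `sm_89`. A guard along these lines (illustrative only, not nvFuser's actual check) would surface a clear error before NVRTC/ptxas ever sees the kernel:

```cpp
#include <cuda_runtime.h>
#include <stdexcept>
#include <string>

void checkTMemSupported(int device_id) {
  cudaDeviceProp prop{};
  if (cudaGetDeviceProperties(&prop, device_id) != cudaSuccess) {
    throw std::runtime_error("cudaGetDeviceProperties failed");
  }
  // tcgen05/TMem requires compute capability 10.0 or newer.
  if (prop.major < 10) {
    throw std::runtime_error(
        "Tensor memory (tcgen05) requires compute capability 10.0+, but "
        "device " + std::to_string(device_id) + " is sm_" +
        std::to_string(prop.major) + std::to_string(prop.minor));
  }
}
```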