🐛 Bug
Getting the following error when running any of these three methods in test/test_mp_sync_batch_norm.py (a minimal sketch of the failing pattern follows the list below). It's strange to see this issue because the expected shape and the calculated shape reported in the error are identical, yet the backward pass still fails.
sync_bn1d_multi_channel_test(index)
sync_bn2d_test(index)
sync_bn3d_test(index)
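For reference, here is a minimal, hedged sketch of the pattern these tests exercise, assuming a single XLA device and the input shape [16, 32, 16, 32, 32] that appears in the traceback. It uses plain torch.nn.BatchNorm3d as a stand-in for the SyncBatchNorm implementation under test, so the helper names and shapes are illustrative only, not the actual test code:

```python
# Hypothetical repro sketch -- not the actual test file. Assumes one XLA device
# and substitutes torch.nn.BatchNorm3d for the SyncBatchNorm implementation
# exercised by test_mp_sync_batch_norm.py.
import torch
import torch_xla.core.xla_model as xm

def run_step(model, inputs):
    # Mirrors the run_step seen in the traceback: forward, sum, backward.
    out = model(inputs)
    loss = out.sum()  # SumBackward0 is the node named in the RuntimeError
    loss.backward()
    return out

device = xm.xla_device()
bn = torch.nn.BatchNorm3d(32).to(device)
t = torch.randn(16, 32, 16, 32, 32, device=device, requires_grad=True)
run_step(bn, t)
xm.mark_step()  # force XLA to execute the traced graph
```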
ERROR:
$ python test/test_mp_sync_batch_norm.py
INFO:torch_xla:Letting libtpu.so load fail during _XLAC import. libtpu.so will be loaded from `libtpu` Python package when the ComputationClient is created.
2022-08-09 03:46:00.339314: I 1911080 tensorflow/core/tpu/tpu_initializer_helper.cc:253] Libtpu path is: /dev/null
2022-08-09 03:46:00.339558: I 1911080 tensorflow/compiler/xla/xla_client/xrt_local_service.cc:55] libtpu status: OK
2022-08-09 03:46:00.339598: I 1911080 tensorflow/compiler/xla/xla_client/xrt_local_service.cc:41] Peer localservice 1 {localhost:51011}
2022-08-09 03:46:00.339700: I 1911080 tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-09 03:46:00.351539: W 1911080 tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-08-09 03:46:00.351582: W 1911080 tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-08-09 03:46:00.351614: I 1911080 tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (5fc32f4fda5a): /proc/driver/nvidia/version does not exist
2022-08-09 03:46:00.395516: I 1911080 tensorflow/compiler/xla/service/service.cc:174] XLA service 0x555ade94c210 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-08-09 03:46:00.395576: I 1911080 tensorflow/compiler/xla/service/service.cc:182] StreamExecutor device (0): Host, Default Version
2022-08-09 03:46:00.440707: I 1911080 tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localservice -> {0 -> localhost:51011}
2022-08-09 03:46:00.441636: I 1911080 tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:438] Started server with target: grpc://localhost:51011
2022-08-09 03:46:00.498331: I 1911833 tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2022-08-09 03:46:00.515291: I 1911454 tensorflow/compiler/jit/xla_device.cc:429] XLA_GPU and XLA_CPU devices are deprecated and will be removed in subsequent releases. Instead, use either @tf.function(jit_compile=True) for must-compile semantics, or run with TF_XLA_FLAGS=--tf_xla_auto_jit=2 for auto-clustering best-effort compilation.
Traceback (most recent call last):
  File "test/test_mp_sync_batch_norm.py", line 146, in <module>
    xmp.spawn(_mp_fn, args=())
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 383, in spawn
    return _run_direct(fn, args, nprocs, join, daemon, start_method)
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 344, in _run_direct
    fn(0, *args)
  File "test/test_mp_sync_batch_norm.py", line 142, in _mp_fn
    sync_bn3d_test(index)
  File "test/test_mp_sync_batch_norm.py", line 124, in sync_bn3d_test
    result = run_step(sbn_xla, t_xla)
  File "test/test_mp_sync_batch_norm.py", line 19, in run_step
    loss.backward()
  File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 485, in backward
    self, gradient, retain_graph, create_graph, inputs=inputs
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 193, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: Function SumBackward0 returned an invalid gradient at index 0 - got [16, 32, 16, 32, 32] but expected shape compatible with [16, 32, 16, 32, 32]