Description
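`examples/eager/train_decoder_only_eager_spmd_data_parallel.py` segfaults in the TPU CI run (SIGSEGV in `PjRtShardedData::HasValue()`): https://github.com/pytorch/xla/actions/runs/14676471430/job/41194044338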
+ python3 /home/runner/_work/xla/xla/pytorch/xla/test/../examples/eager/train_decoder_only_eager_spmd_data_parallel.py
/home/runner/.local/lib/python3.10/site-packages/torch/_tensor.py:1128: UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior.
Consider using tensor.detach() first. (Triggered internally at /__w/xla/xla/pytorch/aten/src/ATen/native/Scalar.cpp:22.)
return self.item().__format__(format_spec)
https://symbolize.stripped_domain/r/?trace=7c292e460d03,7c2c0985cdcf,7c2c002898aa&map=
*** SIGSEGV (@0x50), see go/stacktraces#s15 received by PID 244127 (TID 245720) on cpu 155; stack trace: ***
PC: @ 0x7c292e460d03 (unknown) torch_xla::runtime::PjRtComputationClient::PjRtShardedData::HasValue()
@ 0x7c249ef12b15 1888 (unknown)
@ 0x7c2c0985cdd0 1472 (unknown)
@ 0x7c2c002898ab 1778499464 torch::lazy::LazyGraphExecutor::RunPostOrder()
@ 0x7c0cf8fa6cb0 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7c292e460d03,7c249ef12b14,7c2c0985cdcf,7c2c002898aa,7c0cf8fa6caf&map=
E0426 03:45:32.828669 245720 coredump_hook.cc:301] RAW: Remote crash data gathering hook invoked.
E0426 03:45:32.828677 245720 client.cc:270] RAW: Coroner client retries enabled, will retry for up to 30 sec.
E0426 03:45:32.828680 245720 coredump_hook.cc:396] RAW: Sending fingerprint to remote end.
E0426 03:45:32.828703 245720 coredump_hook.cc:405] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0426 03:45:32.828706 245720 coredump_hook.cc:457] RAW: Dumping core locally.
E0426 03:45:45.252329 245720 process_state.cc:808] RAW: Raising signal 11 with default behavior
test/tpu/run_tests.sh: line 83: 244127 Segmentation fault python3 "$TEST_CDIR/../examples/eager/train_decoder_only_eager_spmd_data_parallel.py"
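The native trace above only symbolizes two frames (`PjRtShardedData::HasValue()` and `LazyGraphExecutor::RunPostOrder()`), so it does not show which Python line triggered the crash. One possible way to narrow it down is Python's standard `faulthandler` module, which dumps the Python traceback of every thread when the process receives a fatal signal. This is a minimal sketch, not something the example script currently does: `enable_crash_tracebacks` is a hypothetical helper name, and whether `faulthandler` coexists cleanly with the runtime's own crash hook in this environment is an assumption that would need checking.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Ask the interpreter to dump Python tracebacks on fatal signals.

    Hypothetical helper for debugging; `stream` must be a file object
    with a real file descriptor (defaults to sys.stderr).
    """
    if stream is None:
        stream = sys.stderr
    # all_threads=True also prints threads other than the one that crashed,
    # which matters here because the SIGSEGV was raised on a non-main thread.
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


# Intended use: call once at the very top of the training script,
# before any torch_xla work starts:
#
#     enable_crash_tracebacks()
```

The same effect should be available without editing the script by running it as `python3 -X faulthandler <script>` or by setting `PYTHONFAULTHANDLER=1` in the CI step.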