
TPU test flake: SIGSEGV in train_decoder_only_eager_spmd_data_parallel.py #9046

Open
@tengyifei

Fails in CI: https://github.com/pytorch/xla/actions/runs/14676471430/job/41194044338

+ python3 /home/runner/_work/xla/xla/pytorch/xla/test/../examples/eager/train_decoder_only_eager_spmd_data_parallel.py
/home/runner/.local/lib/python3.10/site-packages/torch/_tensor.py:1128: UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior.
Consider using tensor.detach() first. (Triggered internally at /__w/xla/xla/pytorch/aten/src/ATen/native/Scalar.cpp:22.)
  return self.item().__format__(format_spec)
https://symbolize.stripped_domain/r/?trace=7c292e460d03,7c2c0985cdcf,7c2c002898aa&map= 
*** SIGSEGV (@0x50), see go/stacktraces#s15 received by PID 244127 (TID 245720) on cpu 155; stack trace: ***
PC: @     0x7c292e460d03  (unknown)  torch_xla::runtime::PjRtComputationClient::PjRtShardedData::HasValue()
    @     0x7c249ef12b15       1888  (unknown)
    @     0x7c2c0985cdd0       1472  (unknown)
    @     0x7c2c002898ab  1778499464  torch::lazy::LazyGraphExecutor::RunPostOrder()
    @     0x7c0cf8fa6cb0  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7c292e460d03,7c249ef12b14,7c2c0985cdcf,7c2c002898aa,7c0cf8fa6caf&map= 
E0426 03:45:32.828669  245720 coredump_hook.cc:301] RAW: Remote crash data gathering hook invoked.
E0426 03:45:32.828677  245720 client.cc:270] RAW: Coroner client retries enabled, will retry for up to 30 sec.
E0426 03:45:32.828680  245720 coredump_hook.cc:396] RAW: Sending fingerprint to remote end.
E0426 03:45:32.828703  245720 coredump_hook.cc:405] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0426 03:45:32.828706  245720 coredump_hook.cc:457] RAW: Dumping core locally.
E0426 03:45:45.252329  245720 process_state.cc:808] RAW: Raising signal 11 with default behavior
test/tpu/run_tests.sh: line 83: 244127 Segmentation fault      python3 "$TEST_CDIR/../examples/eager/train_decoder_only_eager_spmd_data_parallel.py"
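Since the crash is intermittent, one way to estimate how often it happens is to rerun the script in a loop and count the runs that die from SIGSEGV. The sketch below is only an illustration and is not part of the CI setup: `python_cmd` and `count_segfaults` are hypothetical helper names, and the check relies on the POSIX convention that `subprocess` reports a signal-killed child with a negative return code. Passing `-X faulthandler` asks CPython to dump the Python-level traceback when the fault occurs, which may help narrow down the call that reaches `PjRtShardedData::HasValue()`.

```python
import signal
import subprocess
import sys


def python_cmd(script):
    """Build a command that runs `script` with CPython's faulthandler enabled."""
    # -X faulthandler dumps the Python traceback to stderr on fatal signals.
    return [sys.executable, "-X", "faulthandler", script]


def count_segfaults(cmd, runs=20):
    """Run `cmd` `runs` times and return how many runs were killed by SIGSEGV."""
    segfaults = 0
    for _ in range(runs):
        result = subprocess.run(cmd)
        # On POSIX, a return code of -N means the child was killed by signal N.
        if result.returncode == -signal.SIGSEGV:
            segfaults += 1
    return segfaults


# Example usage, run from the directory that contains the script
# (the path and run count are up to the caller):
#   cmd = python_cmd("train_decoder_only_eager_spmd_data_parallel.py")
#   print(count_segfaults(cmd, runs=20), "of 20 runs ended in SIGSEGV")
```

The child's output is deliberately not captured, so the faulthandler traceback and the native stack trace stay visible for each failing run.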

Labels: CI (CI related change), flaky (Issues due to flaky tests.), xla:tpu (TPU specific issues and PRs)
