Investigate non-deterministic CUDA illegal memory access #355

@misiugodfrey

There are reports that we occasionally hit illegal memory access errors in CUDA Presto when running benchmarks with many (1000+) iterations.

To investigate, we're setting up a set of sweeps that run the TPC-H benchmarks under multiple configurations, to see whether we can replicate the issue and whether some configurations are more likely to trigger it.

So far we have been able to replicate the failure with the following dimensions (re-runs are still needed to verify whether this configuration is more likely to cause the issue).

  ┌─────┬───────────────────────────────────────────────────────────┐
  │ dim │                           value                           │
  ├─────┼───────────────────────────────────────────────────────────┤
  │ BM  │ Benchmark=tpch                                            │
  ├─────┼───────────────────────────────────────────────────────────┤
  │ SF  │ Scale Factor=1000                                         │
  ├─────┼───────────────────────────────────────────────────────────┤
  │ N   │ 4 nodes, 4 workers/node (16 total)                        │
  ├─────┼───────────────────────────────────────────────────────────┤
  │ DC  │ task.max-drivers-per-task=4                               │
  ├─────┼───────────────────────────────────────────────────────────┤
  │ KVK │ KVIKIO_NTHREADS=12                                        │
  ├─────┼───────────────────────────────────────────────────────────┤
  │ BS  │ cudf.batch_size_min_threshold=100M                        │
  ├─────┼───────────────────────────────────────────────────────────┤
  │ UXC │ exchange.max-buffer-size=64MB + sink.max-buffer-size=64MB │
  ├─────┼───────────────────────────────────────────────────────────┤
  │ SP  │ LIBCUDF_KERNEL_STREAM_POOL_SIZE=8                         │
  └─────┴───────────────────────────────────────────────────────────┘
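
A sweep over a few of these dimensions could be enumerated roughly as in the sketch below. This is only an illustration, not the actual sweep tooling: the first value in each list is taken from the table above, while the second value in each list, the `SWEEP_DIMENSIONS` name, and the `enumerate_configs` helper are hypothetical placeholders.

```python
from itertools import product

# First value in each list is the configuration reported in the table above.
# The second value in each list is a hypothetical placeholder, not a setting
# that has actually been tested.
SWEEP_DIMENSIONS = {
    "task.max-drivers-per-task": ["4", "2"],
    "KVIKIO_NTHREADS": ["12", "8"],
    "LIBCUDF_KERNEL_STREAM_POOL_SIZE": ["8", "4"],
}


def enumerate_configs(dimensions):
    """Return one dict per point in the Cartesian product of the dimensions."""
    names = list(dimensions)
    value_lists = [dimensions[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


configs = enumerate_configs(SWEEP_DIMENSIONS)
for index, config in enumerate(configs):
    print(index, config)
```

Each resulting dict is one configuration to run for many iterations. Recording which configurations fail, and how often, is what would show whether a particular dimension correlates with the illegal memory access.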
