We have had reports of occasional illegal memory access errors in CUDA Presto when running benchmarks for many (1000+) iterations.
To investigate, we're setting up sweeps that run TPC-H benchmarks across multiple configurations, to see whether we can reproduce the issue and whether some configurations trigger it more often.
So far we have reproduced the failure with the following dimensions (re-runs are still needed to verify whether this configuration is more likely to trigger it):
┌─────┬───────────────────────────────────────────────────────────┐
│ dim │ value                                                     │
├─────┼───────────────────────────────────────────────────────────┤
│ BM  │ Benchmark=tpch                                            │
├─────┼───────────────────────────────────────────────────────────┤
│ SF  │ Scale Factor=1000                                         │
├─────┼───────────────────────────────────────────────────────────┤
│ N   │ 4 nodes, 4 workers/node (16 total)                        │
├─────┼───────────────────────────────────────────────────────────┤
│ DC  │ task.max-drivers-per-task=4                               │
├─────┼───────────────────────────────────────────────────────────┤
│ KVK │ KVIKIO_NTHREADS=12                                        │
├─────┼───────────────────────────────────────────────────────────┤
│ BS  │ cudf.batch_size_min_threshold=100M                        │
├─────┼───────────────────────────────────────────────────────────┤
│ UXC │ exchange.max-buffer-size=64MB + sink.max-buffer-size=64MB │
├─────┼───────────────────────────────────────────────────────────┤
│ SP  │ LIBCUDF_KERNEL_STREAM_POOL_SIZE=8                         │
└─────┴───────────────────────────────────────────────────────────┘
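As a rough sketch of how the sweep could be enumerated: start from the reproducing configuration in the table and vary a few dimensions with a Cartesian product. The alternative values in SWEEP and the helper name sweep_configs are hypothetical, chosen only for illustration; the actual sweep ranges have not been decided here.

```python
from itertools import product

# Baseline: the configuration from the table that reproduced the failure.
BASELINE = {
    "DC": 4,        # task.max-drivers-per-task
    "KVK": 12,      # KVIKIO_NTHREADS
    "BS": "100M",   # cudf.batch_size_min_threshold
    "UXC": "64MB",  # exchange.max-buffer-size and sink.max-buffer-size
    "SP": 8,        # LIBCUDF_KERNEL_STREAM_POOL_SIZE
}

# Hypothetical alternative values (assumptions, not the real sweep ranges).
SWEEP = {
    "DC": [2, 4],
    "KVK": [8, 12],
    "SP": [4, 8],
}


def sweep_configs(baseline, sweep):
    """Yield one config dict per combination of the swept dimensions.

    Dimensions not listed in `sweep` keep their baseline value.
    """
    keys = list(sweep)
    for values in product(*(sweep[k] for k in keys)):
        cfg = dict(baseline)
        cfg.update(zip(keys, values))
        yield cfg


configs = list(sweep_configs(BASELINE, SWEEP))
for cfg in configs:
    print(cfg)
```

Keeping the reproducing configuration as one point in the grid means each re-run of the sweep also re-tests the known-bad case, which helps separate "this config is more failure-prone" from "the failure is simply rare everywhere".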