@jdye64
Last active October 29, 2021 01:04
Dask-SQL SegFault notes and observations

Notes

  • Occurs regardless of the LocalCUDACluster transport specified (e.g. UCX, TCP)
  • Only occurs when ucx-py is installed in the Anaconda environment AND LocalCUDACluster is used instead of a standard distributed.Client
  • In any environment without UCX, the issue cannot be reproduced
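
To quickly confirm which case an environment falls into, a small helper like the following can be used. This is an illustrative sketch (the `ucx_available` name is mine, not part of the original reproducers); it only checks whether the `ucp` module that ucx-py provides is importable.

```python
import importlib.util

def ucx_available() -> bool:
    """Return True when ucx-py (the ``ucp`` module) is importable."""
    return importlib.util.find_spec("ucp") is not None

# Per the notes above, the segfault has only been observed when this
# prints True and a LocalCUDACluster is used.
print("ucx-py present:", ucx_available())
```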

Results

I have provided 2 test cases, one with UCX and one without. The tests are kept as close to identical as possible (some imports had to be removed) to demonstrate the failure.

dask-sql-no-ucx

Conda

conda create --name dask-sql-no-ucx
conda activate dask-sql-no-ucx
conda install -c rapidsai-nightly -c nvidia -c numba -c conda-forge cudf dask-cudf dask-cuda python=3.7 cudatoolkit=11.2 openjdk maven
python ./setup.py install  # assuming you are in the dask-sql repo directory

Minimal Reproducer

from dask.distributed import Client
import dask_cudf as dd

import cudf
from dask_sql import Context

if __name__ == "__main__":
    client = Client()

    c = Context()
    df = cudf.DataFrame({'id': [0, 1]})
    c.create_table('test', df)
    print(c.sql("select count(*) from test").compute())

Output

   COUNT(*)
0         2

dask-sql

Conda

conda create --name dask-sql 
conda activate dask-sql
conda install -c rapidsai-nightly -c nvidia -c numba -c conda-forge cudf dask-cudf dask-cuda python=3.7 cudatoolkit=11.2 openjdk maven ucx-py ucx-proc=*=gpu
python ./setup.py install  # assuming you are in the dask-sql repo directory

Minimal Reproducer

from dask.distributed import Client
import dask_cudf as dd

import cudf
from dask_sql import Context
from dask_cuda import LocalCUDACluster


if __name__ == "__main__":
    cluster = LocalCUDACluster(protocol="ucx", enable_tcp_over_ucx=True, enable_nvlink=True, jit_unspill=False, rmm_pool_size="29GB")
    client = Client(cluster)

    c = Context()
    df = cudf.DataFrame({'id': [0, 1]})
    c.create_table('test', df)
    print(c.sql("select count(*) from test").compute())

Output

distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
[rl-dgx-r13-u24-rapids-dgx118:38784:0:38784] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f7e2d26b008)
==== backtrace (tid:  38784) ====
 0  /home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x115) [0x7f7d7e2cf4e5]
 1  /home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2a881) [0x7f7d7e2cf881]
 2  /home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2aa52) [0x7f7d7e2cfa52]
 3  [0x7f7acfb4144d]
=================================
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f7acfb4144d (sent by kill), pid=38784, tid=38784
#
# JRE version: OpenJDK Runtime Environment (11.0.9.1) (build 11.0.9.1-internal+0-adhoc..src)
# Java VM: OpenJDK 64-Bit Server VM (11.0.9.1-internal+0-adhoc..src, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# J 1422 c1 java.util.WeakHashMap.put(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; java.base@11.0.9.1-internal (162 bytes) @ 0x00007f7acfb4144d [0x00007f7acfb407a0+0x0000000000000cad]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P" (or dumping to /home/u00u9018xfl6yNnCjC357/development/dask-sql/core.38784)
#
# An error report file with more information is saved as:
# /home/u00u9018xfl6yNnCjC357/development/dask-sql/hs_err_pid38784.log
Compiled method (c1)    9329 1421       3       java.util.Collections$SetFromMap::add (22 bytes)
 total in heap  [0x00007f7acfb3eb90,0x00007f7acfb3f050] = 1216
 relocation     [0x00007f7acfb3ed08,0x00007f7acfb3ed48] = 64
 main code      [0x00007f7acfb3ed60,0x00007f7acfb3ef80] = 544
 stub code      [0x00007f7acfb3ef80,0x00007f7acfb3efc8] = 72
 metadata       [0x00007f7acfb3efc8,0x00007f7acfb3efd0] = 8
 scopes data    [0x00007f7acfb3efd0,0x00007f7acfb3efe8] = 24
 scopes pcs     [0x00007f7acfb3efe8,0x00007f7acfb3f038] = 80
 dependencies   [0x00007f7acfb3f038,0x00007f7acfb3f040] = 8
 nul chk table  [0x00007f7acfb3f040,0x00007f7acfb3f050] = 16
Compiled method (c1)    9330 1534       3       java.util.zip.ZipFile::getZipEntry (301 bytes)
 total in heap  [0x00007f7acfb86010,0x00007f7acfb88b78] = 11112
 relocation     [0x00007f7acfb86188,0x00007f7acfb86388] = 512
 main code      [0x00007f7acfb863a0,0x00007f7acfb88020] = 7296
 stub code      [0x00007f7acfb88020,0x00007f7acfb880e8] = 200
 metadata       [0x00007f7acfb880e8,0x00007f7acfb88158] = 112
 scopes data    [0x00007f7acfb88158,0x00007f7acfb88660] = 1288
 scopes pcs     [0x00007f7acfb88660,0x00007f7acfb88b00] = 1184
 dependencies   [0x00007f7acfb88b00,0x00007f7acfb88b08] = 8
 nul chk table  [0x00007f7acfb88b08,0x00007f7acfb88b78] = 112
Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
distributed.worker - WARNING - Heartbeat to scheduler failed
Traceback (most recent call last):
  File "/home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/distributed/comm/ucx.py", line 295, in read
    await self.ep.recv(msg)
  File "/home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/ucp/core.py", line 725, in recv
    ret = await comm.tag_recv(self._ep, buffer, nbytes, tag, name=log)
ucp.exceptions.UCXCanceled: <[Recv #006] ep: 0x7f9b5c2500f0, tag: 0x8a71230ec78dfc21, nbytes: 16, type: <class 'numpy.ndarray'>>: 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/distributed/worker.py", line 1197, in heartbeat
    for key in self.active_keys
  File "/home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/distributed/utils_comm.py", line 390, in retry_operation
    operation=operation,
  File "/home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/distributed/core.py", line 863, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
  File "/home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/distributed/core.py", line 640, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/distributed/comm/ucx.py", line 313, in read
    raise CommClosedError("Connection closed by writer")
distributed.comm.core.CommClosedError: Connection closed by writer
distributed.worker - WARNING - Heartbeat to scheduler failed
Traceback (most recent call last):
  File "/home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/distributed/comm/ucx.py", line 295, in read
    await self.ep.recv(msg)
  File "/home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/ucp/core.py", line 725, in recv
    ret = await comm.tag_recv(self._ep, buffer, nbytes, tag, name=log)
ucp.exceptions.UCXCanceled: <[Recv #006] ep: 0x7f956041e0f0, tag: 0x5b546f00aaab3a53, nbytes: 16, type: <class 'numpy.ndarray'>>:
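
Per the first note above, the crash is tied to ucx-py being installed rather than to the protocol actually selected. A hypothetical TCP-only variant of the reproducer, sketched below (wrapped in a function and guarded so it is a no-op where dask-cuda is not installed), would be expected to crash the same way in the `dask-sql` environment:

```python
import importlib.util

def tcp_reproducer():
    # Same query as the reproducers above, but with protocol="tcp";
    # per the notes, the segfault still occurs as long as ucx-py is
    # present in the environment.
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster
    import cudf
    from dask_sql import Context

    cluster = LocalCUDACluster(protocol="tcp")
    client = Client(cluster)

    c = Context()
    c.create_table("test", cudf.DataFrame({"id": [0, 1]}))
    print(c.sql("select count(*) from test").compute())

if importlib.util.find_spec("dask_cuda") is not None:
    tcp_reproducer()
else:
    print("dask_cuda not installed; skipping reproducer")
```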