@jdye64 · Last active October 29, 2021
# Dask-SQL SegFault notes and observations

## Notes
- Occurs regardless of the `LocalCUDACluster` transport specified, e.g. UCX, TCP, etc.
- Only occurs when `ucx-py` is installed in the Anaconda environment AND `LocalCUDACluster` is used instead of a standard `distributed.Client`
- In any environment without UCX, the issue cannot be reproduced

## Results
I have provided two test cases, one with UCX and one without. The tests are as close as possible (some imports had to be removed) to demonstrate the failures.

- [Works](#dask-sql-no-ucx)
- [Fails](#dask-sql)
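Since the notes above hinge entirely on whether `ucx-py` is present, a quick, generic sanity check (not part of the original reproducers) is to test whether its `ucp` module is importable before deciding which test case applies:

```python
import importlib.util

# The segfault only reproduces when ucx-py (imported as `ucp`) is
# installed, so detect it without actually importing the module.
def ucx_installed() -> bool:
    return importlib.util.find_spec("ucp") is not None

print("ucx-py installed:", ucx_installed())
```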

## dask-sql-no-ucx

### Conda
```
conda create --name dask-sql-no-ucx
conda activate dask-sql-no-ucx
conda install -c rapidsai-nightly -c nvidia -c numba -c conda-forge cudf dask-cudf dask-cuda python=3.7 cudatoolkit=11.2 openjdk maven
python ./setup.py install  # assuming you are in the dask-sql repo directory
```

### Minimal Reproducer
```python
from dask.distributed import Client
import dask_cudf as dd

import cudf
from dask_sql import Context

if __name__ == "__main__":
    client = Client()

    c = Context()
    df = cudf.DataFrame({'id': [0, 1]})
    c.create_table('test', df)
    print(c.sql("select count(*) from test").compute())
```

### Output
```
COUNT(*)
0 2
```


## dask-sql

### Conda
```
conda create --name dask-sql
conda activate dask-sql
conda install -c rapidsai-nightly -c nvidia -c numba -c conda-forge cudf dask-cudf dask-cuda python=3.7 cudatoolkit=11.2 openjdk maven ucx-py ucx-proc=*=gpu
python ./setup.py install  # assuming you are in the dask-sql repo directory
```

### Minimal Reproducer
```python
from dask.distributed import Client
import dask_cudf as dd

import cudf
from dask_sql import Context
from dask_cuda import LocalCUDACluster


if __name__ == "__main__":
    cluster = LocalCUDACluster(protocol="ucx", enable_tcp_over_ucx=True, enable_nvlink=True, jit_unspill=False, rmm_pool_size="29GB")
    client = Client(cluster)

    c = Context()
    df = cudf.DataFrame({'id': [0, 1]})
    c.create_table('test', df)
    print(c.sql("select count(*) from test").compute())
```

### Output
Note that the JVM process is not causing the actual SIGSEGV; rather, it is producing a core dump as the result of one. The parent process appears to be the guilty party, but the JVM output shows up once the parent process is killed by the kernel, since the JVM prints output like this when either its own process or its parent process receives a SIGSEGV.

```
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
[rl-dgx-r13-u24-rapids-dgx118:38784:0:38784] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f7e2d26b008)
==== backtrace (tid: 38784) ====
 0 /home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x115) [0x7f7d7e2cf4e5]
 1 /home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2a881) [0x7f7d7e2cf881]
 2 /home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2aa52) [0x7f7d7e2cfa52]
 3 [0x7f7acfb4144d]
=================================
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f7acfb4144d (sent by kill), pid=38784, tid=38784
#
# JRE version: OpenJDK Runtime Environment (11.0.9.1) (build 11.0.9.1-internal+0-adhoc..src)
# Java VM: OpenJDK 64-Bit Server VM (11.0.9.1-internal+0-adhoc..src, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# J 1422 c1 java.util.WeakHashMap.put(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; java.base@11.0.9.1-internal (162 bytes) @ 0x00007f7acfb4144d [0x00007f7acfb407a0+0x0000000000000cad]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P" (or dumping to /home/u00u9018xfl6yNnCjC357/development/dask-sql/core.38784)
#
# An error report file with more information is saved as:
# /home/u00u9018xfl6yNnCjC357/development/dask-sql/hs_err_pid38784.log
Compiled method (c1) 9329 1421 3 java.util.Collections$SetFromMap::add (22 bytes)
 total in heap  [0x00007f7acfb3eb90,0x00007f7acfb3f050] = 1216
 relocation     [0x00007f7acfb3ed08,0x00007f7acfb3ed48] = 64
 main code      [0x00007f7acfb3ed60,0x00007f7acfb3ef80] = 544
 stub code      [0x00007f7acfb3ef80,0x00007f7acfb3efc8] = 72
 metadata       [0x00007f7acfb3efc8,0x00007f7acfb3efd0] = 8
 scopes data    [0x00007f7acfb3efd0,0x00007f7acfb3efe8] = 24
 scopes pcs     [0x00007f7acfb3efe8,0x00007f7acfb3f038] = 80
 dependencies   [0x00007f7acfb3f038,0x00007f7acfb3f040] = 8
 nul chk table  [0x00007f7acfb3f040,0x00007f7acfb3f050] = 16
Compiled method (c1) 9330 1534 3 java.util.zip.ZipFile::getZipEntry (301 bytes)
 total in heap  [0x00007f7acfb86010,0x00007f7acfb88b78] = 11112
 relocation     [0x00007f7acfb86188,0x00007f7acfb86388] = 512
 main code      [0x00007f7acfb863a0,0x00007f7acfb88020] = 7296
 stub code      [0x00007f7acfb88020,0x00007f7acfb880e8] = 200
 metadata       [0x00007f7acfb880e8,0x00007f7acfb88158] = 112
 scopes data    [0x00007f7acfb88158,0x00007f7acfb88660] = 1288
 scopes pcs     [0x00007f7acfb88660,0x00007f7acfb88b00] = 1184
 dependencies   [0x00007f7acfb88b00,0x00007f7acfb88b08] = 8
 nul chk table  [0x00007f7acfb88b08,0x00007f7acfb88b78] = 112
Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
#
# If you would like to submit a bug report, please visit:
# https://bugreport.java.com/bugreport/crash.jsp
#
distributed.worker - WARNING - Heartbeat to scheduler failed
Traceback (most recent call last):
  File "/home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/distributed/comm/ucx.py", line 295, in read
    await self.ep.recv(msg)
  File "/home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/ucp/core.py", line 725, in recv
    ret = await comm.tag_recv(self._ep, buffer, nbytes, tag, name=log)
ucp.exceptions.UCXCanceled: <[Recv #006] ep: 0x7f9b5c2500f0, tag: 0x8a71230ec78dfc21, nbytes: 16, type: <class 'numpy.ndarray'>>:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/distributed/worker.py", line 1197, in heartbeat
    for key in self.active_keys
  File "/home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/distributed/utils_comm.py", line 390, in retry_operation
    operation=operation,
  File "/home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/distributed/core.py", line 863, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
  File "/home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/distributed/core.py", line 640, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/distributed/comm/ucx.py", line 313, in read
    raise CommClosedError("Connection closed by writer")
distributed.comm.core.CommClosedError: Connection closed by writer
distributed.worker - WARNING - Heartbeat to scheduler failed
Traceback (most recent call last):
  File "/home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/distributed/comm/ucx.py", line 295, in read
    await self.ep.recv(msg)
  File "/home/u00u9018xfl6yNnCjC357/miniconda3/envs/dask-sql/lib/python3.7/site-packages/ucp/core.py", line 725, in recv
    ret = await comm.tag_recv(self._ep, buffer, nbytes, tag, name=log)
ucp.exceptions.UCXCanceled: <[Recv #006] ep: 0x7f956041e0f0, tag: 0x5b546f00aaab3a53, nbytes: 16, type: <class 'numpy.ndarray'>>:
```
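Since the parent (Python) process appears to be the one receiving the SIGSEGV while the JVM merely reports it, enabling the standard-library `faulthandler` module at the top of the reproducer can help capture a Python-side traceback at crash time. This is a generic debugging sketch, not something used in the investigation above:

```python
import faulthandler
import sys

# Dump the Python tracebacks of all threads to stderr if this process
# receives SIGSEGV, SIGFPE, SIGABRT, or SIGBUS. Enable it before
# creating the LocalCUDACluster so the handler is installed early.
faulthandler.enable(file=sys.stderr, all_threads=True)

# ... rest of the reproducer (cluster/client/Context setup) goes here ...
```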