# Connect via SSH to a Slurm compute job that runs as an Enroot container

Being able to SSH directly into a compute job lets you use all your remote development tools, such as your IDE's debugger (VSCode, PyCharm, ...), for GPU jobs as well.

- Slurm: Scheduling system that many HPC clusters use
- Enroot: Container system like Docker for NVIDIA GPUs

General problem:

> Containerized compute jobs are not directly accessible via SSH from your local machine (your notebook or PC). Also, many HPC clusters do not provide internet access on their compute nodes (for security reasons).

Proposed solution:

> Run your own SSH server within your compute job and make an SSH tunnel from your local machine through the login node, through the compute node, and finally into the compute job.

```
LOCAL MACHINE -> LOGIN NODE -> COMPUTE NODE -> CONTAINERIZED COMPUTE JOB
```

## Custom image with SSHD installed

This Docker image installs an OpenSSH server into NVIDIA's PyTorch image (depending on your setup, you may change the base image or install additional software):

```Dockerfile
# ./Dockerfile
FROM nvcr.io/nvidia/pytorch:22.08-py3

USER root

RUN apt-get update
RUN apt-get install openssh-server sudo -y

# change port and allow root login
RUN echo "Port " >> /etc/ssh/sshd_config
RUN echo "LogLevel DEBUG3" >> /etc/ssh/sshd_config
RUN mkdir -p /run/sshd
RUN ssh-keygen -A
RUN service ssh start

# init conda env
RUN conda init

EXPOSE
CMD ["/usr/sbin/sshd","-D", "-e"]
```

In order to use the Docker image with Slurm, you need to push it to Docker Hub and then import it with Enroot:

```bash
# build the image
docker build -t /:latest .
# push the image
docker push /:latest

# import with enroot
srun enroot import -o .sqsh docker:///:latest
```

## Adjust your own SSH config (`~/.ssh/config`)

```
# add to ~/.ssh/config
# replace with your username
# replace with your job name

Host devcontainer.dfki
    User
    Port
    HostName localhost
    ProxyJump devnode.dfki
    CheckHostIP no
    StrictHostKeyChecking=no
    UserKnownHostsFile=/dev/null

Host devnode.dfki
    User
    CheckHostIP no
    ProxyCommand ssh slurm.dfki "nc \$(squeue --me --name= --states=R -h -O NodeList) 22"
    StrictHostKeyChecking=no
    UserKnownHostsFile=/dev/null

Host slurm.dfki
    User
    HostName
```

## Start compute job

You must set `--no-container-remap-root`.

```bash
srun -K \
  --container-mounts=/home/$USER:/home/$USER \
  --container-workdir=$(pwd) \
  --container-image=.sqsh \
  --ntasks=1 --nodes=1 -p \
  --gpus=1 \
  --job-name --no-container-remap-root \
  --time 12:00:00 /usr/sbin/sshd -D -e
```

## Connect to compute job

```
ssh devcontainer.dfki
```

That's it!

## Issues

- The SSHD port is hard-coded. This will cause problems as soon as multiple people start using this setup, so make sure to change the port to something unique.
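
One possible way to get a port that is unlikely to collide with other users is to derive it from your numeric user ID. The snippet below is only a sketch and not part of the setup above: the variable name `SSHD_PORT` and the 20000-29999 range are assumptions chosen for illustration, and two users whose UIDs differ by a multiple of 10000 would still end up with the same port.

```bash
# Sketch: derive a per-user SSHD port from the numeric UID.
# SSHD_PORT and the 20000-29999 range are illustrative assumptions.
uid="$(id -u)"
SSHD_PORT=$(( 20000 + uid % 10000 ))

# Print the port so it can be copied into ~/.ssh/config on the local machine.
echo "$SSHD_PORT"
```

If you run this on the login node, the printed number could be passed to the server with `sshd`'s `-p` option (for example `/usr/sbin/sshd -D -e -p "$SSHD_PORT"`), which takes precedence over the `Port` setting in `sshd_config`. The same number would then go into the `Port` entry of `devcontainer.dfki` in your SSH config.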