# NVIDIA Enterprise GPU Setup Guide

This guide covers the setup process for NVIDIA enterprise GPUs (A100, H100, H200) on Linux systems, focusing on Ubuntu LTS distributions (20.04, 22.04, and 24.04). It details the installation of the latest drivers and CUDA toolkits, along with optional components for containerized or specialized workloads.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Driver Installation](#driver-installation)
- [CUDA Toolkit Installation](#cuda-toolkit-installation)
- [Optional Components](#optional-components)
- [Verification](#verification)
- [Troubleshooting](#troubleshooting)
- [Latest Drivers](#latest-drivers)
- [Additional Resources](#additional-resources)

## Prerequisites

### Hardware Requirements

- **Supported NVIDIA GPU:** A100, H100, H200, etc.
- **PCIe Slot:** PCIe Gen4 x16 slot or newer (recommended)
- **Power & Cooling:** Adequate power supply and cooling solutions
- **Motherboard:** Server-grade with proper PCIe bifurcation support

### System Requirements

- **Operating System:** Ubuntu LTS (20.04, 22.04, or 24.04)
- **Kernel & Build Tools:** Linux kernel headers for the running kernel and basic build tools

Install basic development tools and headers:

```bash
sudo apt-get update
sudo apt-get install -y \
    build-essential \
    "linux-headers-$(uname -r)" \
    software-properties-common \
    gnupg
```

---

## Driver Installation

Before installing the latest NVIDIA drivers, remove any old versions and set up NVIDIA's repository with the keyring package rather than the deprecated `apt-key` workflow.

### 1. Remove Existing NVIDIA Drivers

```bash
sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get autoremove
```

### 2. Remove Any Outdated NVIDIA Signing Keys (if present)

```bash
sudo apt-key del 7fa2af80
```

### 3. Add the NVIDIA Repository and GPG Key

_Note:_ With `apt-key` deprecated, we now use the NVIDIA CUDA keyring package.

Download and install the CUDA keyring package:

```bash
# Repository name derived from the release, e.g. ubuntu2204 for Ubuntu 22.04
distro="ubuntu$(. /etc/os-release; echo "$VERSION_ID" | sed 's/\.//')"
wget "https://developer.download.nvidia.com/compute/cuda/repos/${distro}/x86_64/cuda-keyring_1.1-1_all.deb"
sudo dpkg -i cuda-keyring_1.1-1_all.deb
```

### 4. Add the Repository Pin File

This file sets repository priorities:

```bash
distro="ubuntu$(. /etc/os-release; echo "$VERSION_ID" | sed 's/\.//')"
wget "https://developer.download.nvidia.com/compute/cuda/repos/${distro}/x86_64/cuda-${distro}.pin"
sudo mv "cuda-${distro}.pin" /etc/apt/preferences.d/cuda-repository-pin-600
```

### 5. Update Package Lists

```bash
sudo apt-get update
```

### 6. Install the Latest NVIDIA Drivers

For modern enterprise GPUs, Ubuntu’s driver query typically recommends the “open” driver series. For example, for an A100 the recommended driver might be:

```bash
sudo apt-get install -y nvidia-driver-570-open
```

If you wish to let Ubuntu automatically choose the best driver for your hardware, run the following (`autoinstall` is the older, deprecated spelling of this subcommand):

```bash
sudo ubuntu-drivers install
```

> **Note:** For H200 or other new-generation GPUs, verify the recommended driver version using `ubuntu-drivers devices` and choose the driver (open or server) that best suits your workload.

### 7. Reboot the System

```bash
sudo reboot
```

---

## CUDA Toolkit Installation

Once the driver is updated and working, install the CUDA toolkit to support development and GPU-accelerated applications.

### 1. Install the CUDA Toolkit and Development Tools

The following command installs the CUDA toolkit from NVIDIA's repository. The unversioned `cuda-toolkit` package tracks the newest release; if you require a specific version, install a versioned package (for example `cuda-toolkit-12-8`) or download the installer from NVIDIA's website. Do not also install `nvidia-cuda-toolkit`: that is Ubuntu's separately packaged, usually older CUDA build, and mixing it with NVIDIA's packages can leave two different `nvcc` versions on the system.

```bash
sudo apt-get install -y cuda-toolkit
```

### 2. Update Environment Variables

Add the CUDA paths to your shell configuration (e.g., add these lines to your `~/.bashrc`):

```bash
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```

Apply the changes:

```bash
source ~/.bashrc
```

---

## Optional Components

### NVIDIA Container Toolkit (for Docker)

Set up the container toolkit to run GPU-accelerated containers. The commands below use NVIDIA's distribution-independent `stable/deb` repository list and register the NVIDIA runtime with Docker before restarting it:

```bash
# Create the keyring directory if it doesn't exist
sudo mkdir -p /etc/apt/keyrings

# Download the NVIDIA GPG key and store it as a dedicated keyring
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
    sudo gpg --dearmor -o /etc/apt/keyrings/nvidia-container-toolkit.gpg

# Add the repository, pinned to that keyring
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/etc/apt/keyrings/nvidia-container-toolkit.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Update package lists and install the toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker, then restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

### NVIDIA GPUDirect Storage (GDS)

Install GPUDirect Storage for high-performance data transfer:

```bash
sudo apt-get install -y nvidia-gds
```

### NVIDIA Fabric Manager (for NVLink/NVSwitch configurations)

Fabric Manager is required on NVSwitch-based systems (such as HGX boards); GPUs connected only by NVLink bridges do not need it. The package is versioned per driver branch and must match the installed driver, so the example below assumes the 570 branch. Confirm the exact package name in your repository with `apt-cache search fabricmanager`.

```bash
sudo apt-get install -y nvidia-fabricmanager-570
sudo systemctl enable --now nvidia-fabricmanager
```

---

## Verification

After installation, verify that both the driver and CUDA toolkit are functioning correctly.

### 1. Check the NVIDIA Driver Installation

```bash
nvidia-smi
```

The output should list your GPU(s), the driver version, and the CUDA version.
For example:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 570.xxx.xx    Driver Version: 570.xxx.xx    CUDA Version: 12.x   |
|-------------------------------+----------------------+----------------------+
...
```

The CUDA version shown here is the highest version the driver supports, not the version of the installed toolkit.

### 2. Verify the CUDA Toolkit Installation

```bash
nvcc --version
```

### 3. Test CUDA Functionality

Create and run a basic CUDA program:

```bash
# Create a simple CUDA test program
cat > cuda_test.cu << 'EOF'
#include <stdio.h>

__global__ void kernel() { }

int main() {
    kernel<<<1,1>>>();
    // Check the launch first, then wait for the kernel to finish
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA test failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA test successful!\n");
    return 0;
}
EOF

# Compile and run the test program
nvcc cuda_test.cu -o cuda_test
./cuda_test
```

---

## Troubleshooting

### Common Issues

1. **nvidia-smi Not Found or Not Displaying Correctly**
   - Verify driver installation with:
     ```bash
     dpkg -l | grep nvidia-driver
     ```
   - Check if the NVIDIA kernel module is loaded:
     ```bash
     lsmod | grep nvidia
     ```
   - Inspect the kernel log for errors (driver messages are tagged `nvidia` or `NVRM`):
     ```bash
     sudo dmesg | grep -iE 'nvidia|nvrm'
     ```

2. **CUDA Toolkit Not Found**
   - Ensure that your `PATH` and `LD_LIBRARY_PATH` environment variables are correctly set.
   - Check that the CUDA installation directory exists:
     ```bash
     ls /usr/local/cuda
     ```
   - Build and run the CUDA samples for further validation. They are distributed separately from the toolkit, in NVIDIA's `cuda-samples` repository on GitHub.

3. **Performance Issues**
   - Confirm that the current PCIe link generation and width match the maximum:
     ```bash
     nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv
     ```
   - Monitor power and thermal data:
     ```bash
     nvidia-smi -q -d POWER,TEMPERATURE
     ```
   - Verify compute mode with:
     ```bash
     nvidia-smi -q | grep "Compute Mode"
     ```

### Additional Tips

- **System Compatibility:** Always verify that your hardware and OS are compatible with the chosen driver and CUDA versions.
- **Official Documentation:** Refer to the latest NVIDIA documentation for enterprise GPUs for any changes.
- **Monitoring:** For production environments, consider using NVIDIA Data Center GPU Manager (DCGM) to monitor health and performance.

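
The `PATH` and `LD_LIBRARY_PATH` check listed under "CUDA Toolkit Not Found" can be scripted. The following is a minimal sketch rather than part of any NVIDIA tooling: `path_contains` is a hypothetical helper name, and `/usr/local/cuda` is assumed to be the install prefix used earlier in this guide. The helper only tests for an exact entry in a colon-separated list; it does not check that the directory exists.

```bash
# Hypothetical helper: succeed if directory $1 is an exact entry in the
# colon-separated list $2 (for example the value of PATH).
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage: report whether the CUDA directories assumed by this guide are set.
cuda_home="/usr/local/cuda"   # assumption: default install prefix

if path_contains "$cuda_home/bin" "${PATH:-}"; then
    echo "PATH: ok"
else
    echo "PATH: missing $cuda_home/bin"
fi

if path_contains "$cuda_home/lib64" "${LD_LIBRARY_PATH:-}"; then
    echo "LD_LIBRARY_PATH: ok"
else
    echo "LD_LIBRARY_PATH: missing $cuda_home/lib64"
fi
```
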

---

## Latest Drivers

If the required NVIDIA driver version is not available from your default package manager, follow these steps to add NVIDIA’s official repository for the latest drivers:

### 1. Remove Previous NVIDIA Repository Files

```bash
sudo rm -f /etc/apt/sources.list.d/cuda*
sudo rm -f /etc/apt/sources.list.d/nvidia-ml*
```

### 2. Add NVIDIA’s Repository

Determine the repository name for your Ubuntu release (for example, `ubuntu2204` for 22.04). NVIDIA’s repository paths use this form rather than the release codename:

```bash
distribution="ubuntu$(. /etc/os-release; echo "$VERSION_ID" | sed 's/\.//')"
```

Download the repository pin file and install the CUDA keyring package. The keyring package provides both the signing key and the repository source entry, so neither `apt-key add` nor `add-apt-repository` is needed:

```bash
wget "https://developer.download.nvidia.com/compute/cuda/repos/${distribution}/x86_64/cuda-${distribution}.pin"
sudo mv "cuda-${distribution}.pin" /etc/apt/preferences.d/cuda-repository-pin-600
wget "https://developer.download.nvidia.com/compute/cuda/repos/${distribution}/x86_64/cuda-keyring_1.1-1_all.deb"
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
```

### 3. Install the Latest NVIDIA Driver

Choose the driver version that best suits your GPU and workload. For example, to install the open driver for many enterprise GPUs:

```bash
sudo apt-get install -y nvidia-driver-570-open
```

If this version is not available, install the `cuda-drivers` meta-package, which pulls in the latest driver published in NVIDIA’s repository:

```bash
sudo apt-get install -y cuda-drivers
```

### 4. Install CUDA After Updating the Driver

After updating your driver, install the CUDA toolkit. Use `cuda-toolkit` rather than the `cuda` meta-package, which also pulls in a driver and can replace the one you just selected:

```bash
sudo apt-get install -y cuda-toolkit
```

### 5. Finalize Installation

Reboot your system to load the new driver:

```bash
sudo reboot
```

After rebooting, confirm the installation:

```bash
nvidia-smi
nvcc --version
```

---

## Additional Resources

- [NVIDIA Data Center Documentation](https://docs.nvidia.com/datacenter/)
- [CUDA Documentation](https://docs.nvidia.com/cuda/)
- [NVIDIA Container Toolkit Documentation](https://docs.nvidia.com/datacenter/cloud-native/)
- [NVIDIA Enterprise Support](https://docs.nvidia.com/enterprise/)