# NVIDIA Enterprise GPU Setup Guide

This guide covers setting up NVIDIA enterprise GPUs (A100, H100, H200) on Linux, focusing on Ubuntu LTS releases (20.04, 22.04, and 24.04). It walks through installing current drivers and the CUDA toolkit, along with optional components for containerized or specialized workloads.

## Table of Contents
- [Prerequisites](#prerequisites)
- [Driver Installation](#driver-installation)
- [CUDA Toolkit Installation](#cuda-toolkit-installation)
- [Optional Components](#optional-components)
- [Verification](#verification)
- [Troubleshooting](#troubleshooting)
- [Latest Drivers](#latest-drivers)
- [Additional Resources](#additional-resources)

## Prerequisites

### Hardware Requirements
- **Supported NVIDIA GPU:** A100, H100, H200, etc.
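- **PCIe Slot:** PCIe x16 slot; Gen4 for A100, Gen5 for H100 and H200 PCIe cards (recommended)
- **Power & Cooling:** Power supply and airflow sized for the GPU's rated board power, which runs to several hundred watts per GPU for these models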
- **Motherboard:** Server-grade with proper PCIe bifurcation support

### System Requirements
- **Operating System:** Ubuntu LTS (20.04, 22.04, or 24.04)
- **Kernel & Build Tools:** Ensure Linux kernel headers and basic build tools are installed

Install basic development tools and headers:
```bash
sudo apt-get update
sudo apt-get install -y \
    build-essential \
    linux-headers-$(uname -r) \
    software-properties-common \
    gnupg
```
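Before going further, it can help to confirm that the machine matches the requirements above. The following is a minimal sketch, not an official NVIDIA check: `is_supported_release` and `has_nvidia_device` are hypothetical helper names, and the GPU test simply assumes that `lspci` output contains the string "NVIDIA" for an installed card.

```bash
# Pre-flight helpers (hypothetical names, not part of any NVIDIA tool).

# Return 0 if the given VERSION_ID is one of the Ubuntu LTS releases this guide covers.
is_supported_release() {
    case "$1" in
        20.04|22.04|24.04) return 0 ;;
        *) return 1 ;;
    esac
}

# Return 0 if lspci-style text on stdin mentions at least one NVIDIA device.
has_nvidia_device() {
    grep -qi 'nvidia'
}
```

A possible invocation is `is_supported_release "$(. /etc/os-release; echo "$VERSION_ID")" && lspci | has_nvidia_device && echo "pre-flight OK"`.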

---

## Driver Installation

Before installing new NVIDIA drivers, remove any existing driver packages and switch to NVIDIA's current keyring-based repository setup.

### 1. Remove Existing NVIDIA Drivers
```bash
sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get autoremove
```

### 2. Remove Any Outdated NVIDIA Signing Keys (if present)
```bash
sudo apt-key del 7fa2af80
```

### 3. Add the NVIDIA Repository and GPG Key

_Note:_ With `apt-key` deprecated, we now use the NVIDIA CUDA keyring package.

Download and install the CUDA keyring package:
```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu$(. /etc/os-release; echo $VERSION_ID | sed 's/\.//')/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
```
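The inline `$(...)` expression in the URL above is easy to mistype, so it can be worth printing the resolved URL before downloading. The sketch below wraps the same logic in two hypothetical helpers, `cuda_repo_id` and `cuda_keyring_url`; the optional file argument exists only so the logic can be exercised against a sample `os-release` file.

```bash
# Derive NVIDIA's repository directory name (e.g. ubuntu2204) from an os-release file.
# Runs in a subshell so the sourced variables do not leak into the caller.
cuda_repo_id() (
    . "${1:-/etc/os-release}"
    printf '%s%s\n' "$ID" "$(printf '%s' "$VERSION_ID" | tr -d '.')"
)

# Build the keyring URL for that release (keyring version as used in this guide).
cuda_keyring_url() {
    printf 'https://developer.download.nvidia.com/compute/cuda/repos/%s/x86_64/cuda-keyring_1.1-1_all.deb\n' \
        "$(cuda_repo_id "$@")"
}
```

Running `cuda_keyring_url` with no arguments prints the URL for the current system, which can then be passed to `wget`.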

### 4. Add the Repository Pin File

The pin file gives packages from the NVIDIA repository priority over the distribution's own copies:
```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu$(. /etc/os-release; echo $VERSION_ID | sed 's/\.//')/x86_64/cuda-ubuntu$(. /etc/os-release; echo $VERSION_ID | sed 's/\.//').pin
sudo mv cuda-ubuntu*.pin /etc/apt/preferences.d/cuda-repository-pin-600
```

### 5. Update Package Lists
```bash
sudo apt-get update
```

### 6. Install the Latest NVIDIA Drivers

For modern enterprise GPUs, Ubuntu’s driver query typically recommends the “open” driver series. For example, for an A100 the recommended driver might be:
```bash
sudo apt-get install -y nvidia-driver-570-open
```

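To let Ubuntu choose and install the recommended driver for your hardware automatically, run the following instead. The older `autoinstall` subcommand is deprecated in favor of `install`; on headless compute servers, adding `--gpgpu` selects the server driver series:
```bash
sudo ubuntu-drivers install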
```
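To see which package Ubuntu would pick before installing anything, the output of `ubuntu-drivers devices` can be filtered. This is a rough sketch: `recommended_driver` is a hypothetical helper, and it assumes the tool prints lines of the form `driver   : <package> - distro non-free recommended`.

```bash
# Print the package that `ubuntu-drivers devices` marks as recommended.
# Reads that command's output on stdin; prints nothing if no line is marked.
recommended_driver() {
    awk '$1 == "driver" && /recommended/ { print $3; exit }'
}
```

Typical use would be `ubuntu-drivers devices | recommended_driver`.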

> **Note:** For H200 or other new-generation GPUs, verify the recommended driver version using `ubuntu-drivers devices` and choose the driver (open or server) that best suits your workload.

### 7. Reboot the System
```bash
sudo reboot
```

---

## CUDA Toolkit Installation

Once the driver is updated and working, install the CUDA toolkit to support development and GPU-accelerated applications.

### 1. Install the CUDA Toolkit and Development Tools

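The following command installs the CUDA toolkit from the NVIDIA repository. The unversioned `cuda-toolkit` package tracks the newest release; if you require a specific version, install a versioned package such as `cuda-toolkit-12-8` or download the installer from NVIDIA's website. Do not also install Ubuntu's own `nvidia-cuda-toolkit` package: it ships a separate, typically older CUDA build and can leave two conflicting `nvcc` versions on the system.
```bash
sudo apt-get install -y cuda-toolkit
```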

### 2. Update Environment Variables

Add the CUDA paths to your shell configuration (e.g., add these lines to your `~/.bashrc`):
```bash
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```

Apply the changes:
```bash
source ~/.bashrc
```
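Re-running the setup should not stack duplicate `export` lines in `~/.bashrc`. One way to keep the edit idempotent is sketched below; `append_once` is a hypothetical helper name, and it only matches lines that are already present verbatim.

```bash
# Append a line to a file only if that exact line is not already present.
# Runs in a subshell so its variables stay local.
append_once() (
    line=$1
    file=$2
    touch "$file"
    grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
)

# Example (single quotes keep the ${...} expansions literal in the file):
# append_once 'export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}' ~/.bashrc
# append_once 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}' ~/.bashrc
```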

---

## Optional Components

### NVIDIA Container Toolkit (for Docker)
Set up the container toolkit to run GPU-accelerated containers:
```bash
# Create the keyring directory if it doesn't exist
sudo mkdir -p /etc/apt/keyrings

# Download the NVIDIA GPG key and store it as a dearmored keyring
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
    sudo gpg --dearmor -o /etc/apt/keyrings/nvidia-container-toolkit.gpg

# Add the repository, pinned to that keyring. NVIDIA publishes a single
# generic "stable/deb" list rather than one list per Ubuntu release.
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/etc/apt/keyrings/nvidia-container-toolkit.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Update package lists
sudo apt-get update

# Install NVIDIA container toolkit
sudo apt-get install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
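If containers still cannot see the GPU, one thing to check is whether Docker knows about the NVIDIA runtime. The sketch below assumes the runtime was registered with `nvidia-ctk runtime configure --runtime=docker`, which writes a `nvidia-container-runtime` entry into `/etc/docker/daemon.json`; `has_nvidia_runtime` is a hypothetical helper name.

```bash
# Return 0 if the Docker daemon config references the NVIDIA container runtime.
# The optional argument overrides the config path (useful for testing).
has_nvidia_runtime() {
    grep -q 'nvidia-container-runtime' "${1:-/etc/docker/daemon.json}" 2>/dev/null
}
```

For an end-to-end check, `has_nvidia_runtime && sudo docker run --rm --gpus all ubuntu nvidia-smi` should print the same GPU table inside the container as on the host.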

### NVIDIA GPUDirect Storage (GDS)
Install GPUDirect Storage for high-performance data transfer:
```bash
sudo apt-get install -y nvidia-gds
```

### NVIDIA Fabric Manager (for NVLink/NVSwitch configurations)
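If your GPUs are connected through NVSwitch (as on HGX baseboards), install and enable Fabric Manager. Its version must match the installed driver, so the package name carries the driver branch; the example assumes the 570 branch used earlier in this guide:
```bash
sudo apt-get install -y nvidia-fabricmanager-570
sudo systemctl enable --now nvidia-fabricmanager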
```
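Fabric Manager generally refuses to start when its version differs from the driver's, so a quick comparison can save a debugging session. This is a sketch under assumptions: `fm_matches_driver` is a hypothetical helper, and it assumes the Debian package version is the driver version followed by a `-revision` suffix.

```bash
# Return 0 when the Fabric Manager package version matches the driver version.
# Strips the Debian revision suffix (everything from the first "-") before comparing.
fm_matches_driver() (
    driver_version=$1
    package_version=$2
    [ "${package_version%%-*}" = "$driver_version" ]
)

# Example wiring (FM_PKG is whichever Fabric Manager package name you installed):
#   drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
#   pkg=$(dpkg-query -W -f='${Version}' "$FM_PKG")
#   fm_matches_driver "$drv" "$pkg" || echo "Fabric Manager / driver mismatch" >&2
```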

---

## Verification

After installation, verify that both the driver and CUDA toolkit are functioning correctly.

### 1. Check the NVIDIA Driver Installation
```bash
nvidia-smi
```
The output should list your GPU(s), the driver version, and the highest CUDA version that driver supports (not necessarily the toolkit version you installed). For example:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 570.xxx.xx          Driver Version: 570.xxx.xx   CUDA Version: 12.x |
|-------------------------------+----------------------+----------------------+
...
```
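On multi-GPU servers it is worth confirming that every card shows up, not just that `nvidia-smi` runs. The following sketch counts rows from `nvidia-smi`'s CSV query output; `expect_gpu_count` is a hypothetical helper, and the expected count is something you supply for your chassis.

```bash
# Succeed only if the number of non-empty rows on stdin equals the expected GPU count.
expect_gpu_count() (
    expected=$1
    actual=$(grep -c . || true)   # grep -c prints 0 but exits 1 on empty input
    if [ "$actual" -ne "$expected" ]; then
        echo "expected $expected GPU(s), found $actual" >&2
        exit 1
    fi
    echo "all $actual GPU(s) visible"
)
```

For an eight-GPU node, for instance: `nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader | expect_gpu_count 8`.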

### 2. Verify the CUDA Toolkit Installation
```bash
nvcc --version
```

### 3. Test CUDA Functionality
Create and run a basic CUDA program:
```bash
# Create a simple CUDA test program
cat > cuda_test.cu << 'EOF'
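#include <stdio.h>
__global__ void kernel() { }
int main() {
    kernel<<<1,1>>>();
    // Check both the launch itself and any error raised while the kernel ran;
    // without this the program would report success even with no usable GPU.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA test failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA test successful!\n");
    return 0;
}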
EOF

# Compile and run the test program
nvcc cuda_test.cu -o cuda_test
./cuda_test
```
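By default `nvcc` targets a generic architecture; passing `-arch` builds native code for the installed GPU instead. The sketch below maps the GPUs covered in this guide to their compute capability (8.0 for A100, 9.0 for H100 and H200). `arch_for_gpu` is a hypothetical helper, and the name matching is a simple substring test.

```bash
# Map a GPU product name (as printed by nvidia-smi) to an nvcc -arch value.
arch_for_gpu() {
    case "$1" in
        *A100*)        echo sm_80 ;;
        *H100*|*H200*) echo sm_90 ;;
        *)             echo "unknown GPU: $1" >&2; return 1 ;;
    esac
}

# Example:
#   gpu=$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n1)
#   nvcc -arch="$(arch_for_gpu "$gpu")" cuda_test.cu -o cuda_test
```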

---

## Troubleshooting

### Common Issues

1. **nvidia-smi Not Found or Not Displaying Correctly**
   - Verify driver installation with:  
     ```bash
     dpkg -l | grep nvidia-driver
     ```
   - Check if the NVIDIA kernel module is loaded:  
     ```bash
     lsmod | grep nvidia
     ```
   - Inspect system logs for errors:  
     ```bash
     sudo dmesg | grep -i -E 'nvidia|nvrm'
     ```

2. **CUDA Toolkit Not Found**
   - Ensure that your `PATH` and `LD_LIBRARY_PATH` environment variables are correctly set.
   - Check that the CUDA installation directory exists:  
     ```bash
     ls /usr/local/cuda
     ```
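   - Build and run the CUDA samples for further validation. They no longer ship with the toolkit and are distributed separately in NVIDIA's `cuda-samples` GitHub repository.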

3. **Performance Issues**
   - Confirm PCIe link status with:  
     ```bash
     nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv
     ```
   - Monitor power and thermal data:  
     ```bash
     nvidia-smi -q -d POWER,TEMPERATURE
     ```
   - Verify compute mode with:  
     ```bash
     nvidia-smi -q | grep "Compute Mode"
     ```
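A PCIe link that trains below its maximum generation or width is a common cause of poor transfer performance. As a rough sketch, the helper below reads CSV rows of `index, gen_current, gen_max, width_current, width_max` and flags any GPU running below its maximum. `check_pcie_links` is a hypothetical name, and because an idle GPU may report a lower current generation for power saving, the check is most meaningful under load.

```bash
# Flag GPUs whose PCIe link runs below its maximum generation or width.
# Reads "index, gen_cur, gen_max, width_cur, width_max" rows on stdin;
# exits 1 if any row is degraded.
check_pcie_links() {
    awk -F', *' '
        {
            status = ($2 < $3 || $4 < $5) ? "DEGRADED" : "ok"
            printf "GPU %s: Gen%s x%s (max Gen%s x%s) %s\n", $1, $2, $4, $3, $5, status
            if (status == "DEGRADED") bad = 1
        }
        END { exit (bad ? 1 : 0) }'
}
```

It can be fed directly from `nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv,noheader | check_pcie_links`.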

### Additional Tips

- **System Compatibility:** Always verify that your hardware and OS are compatible with the chosen driver and CUDA versions.
- **Official Documentation:** Refer to the latest NVIDIA documentation for enterprise GPUs for any changes.
- **Monitoring:** For production environments, consider using NVIDIA Data Center GPU Manager (DCGM) to monitor health and performance.

---

## Latest Drivers

If the required NVIDIA driver version is not available from your default package manager, follow these steps to add NVIDIA’s official repository for the latest drivers:

### 1. Remove Previous NVIDIA Repository Files
```bash
sudo rm -f /etc/apt/sources.list.d/cuda*
sudo rm -f /etc/apt/sources.list.d/nvidia-ml*
```

### 2. Add NVIDIA’s Repository
Determine NVIDIA's repository name for your release. NVIDIA names its repositories by version number (for example, `ubuntu2204`), not by Ubuntu codename:
```bash
distribution=ubuntu$(. /etc/os-release; echo $VERSION_ID | sed 's/\.//')
```

Download the repository pin file, then install the CUDA keyring package, which adds both the signing key and the repository entry and replaces the deprecated `apt-key` workflow:
```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/${distribution}/x86_64/cuda-${distribution}.pin
sudo mv cuda-${distribution}.pin /etc/apt/preferences.d/cuda-repository-pin-600

wget https://developer.download.nvidia.com/compute/cuda/repos/${distribution}/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb

sudo apt-get update
```

### 3. Install the Latest NVIDIA Driver
Choose the driver version that best suits your GPU and workload. For example, to install the open kernel module driver from the 570 branch:
```bash
sudo apt-get install -y nvidia-driver-570-open
```
If that version is not available, install NVIDIA's `cuda-drivers` metapackage, which tracks the newest driver in the repository:
```bash
sudo apt-get install -y cuda-drivers
```

### 4. Install CUDA After Updating the Driver
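After updating your driver, install the CUDA toolkit. Use `cuda-toolkit` rather than the `cuda` metapackage, which also pulls in a driver and may replace the one you just installed:
```bash
sudo apt-get install -y cuda-toolkit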
```

### 5. Finalize Installation
Reboot your system to load the new driver and CUDA libraries:
```bash
sudo reboot
```

After rebooting, confirm the installation:
```bash
nvidia-smi
nvcc --version
```

---

## Additional Resources

- [NVIDIA Data Center Documentation](https://docs.nvidia.com/datacenter/)
- [CUDA Documentation](https://docs.nvidia.com/cuda/)
- [NVIDIA Container Toolkit Documentation](https://docs.nvidia.com/datacenter/cloud-native/)
- [NVIDIA Enterprise Support](https://docs.nvidia.com/enterprise/)