@CaglayanDokme
Created April 3, 2026 14:27
High-Speed Data Offloading: Accelerating RAM-to-NVMe Transfers on Zynq MPSoC

In high-performance embedded systems—specifically those involving FPGAs—a common bottleneck is moving massive amounts of data from a reserved DDR memory region to a non-volatile storage device like an NVMe SSD. While the hardware interconnects (PCIe Gen2/Gen3) are theoretically capable of gigabyte-per-second speeds, standard Linux memory access methods often throttle this to a fraction of the hardware’s potential.

This guide explores how to bypass these bottlenecks using the u-dma-buf driver, achieving transfer speeds of up to 12 Gbps with minimal CPU overhead.

1. The Legacy Bottleneck: Why /dev/mem is Slow

The “classic” way to access a reserved RAM region from user space is through /dev/mem. While simple, it is architecturally designed for safety rather than speed.

Non-Cacheable & Strongly Ordered

By default, the Linux kernel maps physical addresses accessed via /dev/mem as Uncached or Strongly Ordered. This is a protective measure: because /dev/mem can access hardware registers (MMIO), the kernel ensures the CPU never caches these values. If it did, the CPU might read a “stale” status flag from its cache instead of the actual hardware.

On ARM64 architectures (like the Zynq UltraScale+), this forces every single load and store to wait for a round trip to the physical DDR controller. The CPU cannot “burst” data or make use of its internal L1/L2 caches.

  • Performance Cap: In practice, this limits data rates regardless of how fast your DDR or NVMe drive is. On my machine, it topped out at around 1 Gbps (see the sketch below).
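
For reference, the sketch below shows this classic approach on a hypothetical reserved region (the physical address and size are placeholders). Every access through the resulting mapping is an uncached, strongly ordered transaction, which is exactly what caps the throughput.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>
#include <vector>

int main()
{
    const off_t  phys_addr = 0x800000000; // Placeholder: reserved region in high DDR
    const size_t size      = 0x40000000;  // Placeholder: 1 GiB

    int fd = open("/dev/mem", O_RDONLY | O_SYNC);
    if (fd < 0)
        return -1;

    // The kernel maps this region as Device/Uncached memory, so every
    // access below bypasses the CPU caches and waits on the DDR controller.
    void* map = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, phys_addr);
    if (map == MAP_FAILED)
        return -1;

    // Even a plain memcpy() out of this mapping is throttled by the
    // uncached attributes, regardless of DDR or NVMe bandwidth.
    std::vector<uint8_t> local(size);
    memcpy(local.data(), map, size);

    munmap(map, size);
    close(fd);

    return 0;
}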

2. The Solution: u-dma-buf and the “Quirk”

To unlock the full 64-bit bus width, we need a way to tell the kernel that a specific region of RAM is “Normal Memory” (Cacheable) rather than “Device Memory” (Uncached). The u-dma-buf driver provides this bridge.

Understanding the “Quirk”

In the context of this driver, “Quirk” is a driver-specific term for a mechanism that bypasses standard ARM64 memory-mapping restrictions.

On ARM64, the kernel’s DMA API often forces user-space mappings to be uncached to avoid “cache aliasing” (the same memory being mapped with different attributes). The Quirk mode allows the driver to:

  1. Bypass the standard dma_mmap_coherent restrictions.
  2. Manually insert pages into the process’s Virtual Memory Area (VMA) with Write-Back Cached attributes.
  3. Provide struct page backing (via quirk-mmap-page), which is essential for Direct I/O.

3. BSP Configuration (PetaLinux / Yocto)

To enable high-speed direct transfers, the Device Tree must be configured to allow the kernel to perform cache maintenance on the reserved region.

Device Tree Snippet

reserved-memory {
    reserved_buffer: buffer@800000000 {
        compatible = "shared-dma-pool";
        reusable;  /* Allows the kernel to create a linear map for cache cleaning */
        reg = <0x8 0x0 0x0 0x40000000>; /* 1GB at High DDR */
    };
};

udmabuf@0 {
    compatible = "ikwzm,u-dma-buf";
    device-name = "udmabuf0";
    size = <0x0 0x40000000>;
    memory-region = <&reserved_buffer>;
    quirk-mmap-page; /* Mode 4: Enables struct page backing for O_DIRECT */
};

Note: Using reusable instead of no-map is critical. If no-map is used, the kernel cannot access the memory to flush the cache before a DMA transfer, leading to system crashes.
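
Once booted, the driver exposes the buffer as a character device (/dev/udmabuf0) plus a set of sysfs attributes. The sketch below is a quick sanity check, assuming the driver's default sysfs layout (verify the attribute paths on your BSP):

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    char attr[64] = {};

    // Report where the reserved buffer actually landed
    FILE* f = fopen("/sys/class/u-dma-buf/udmabuf0/phys_addr", "r");
    if ((f != nullptr) && (fgets(attr, sizeof(attr), f) != nullptr))
        printf("phys_addr: %s", attr);
    if (f != nullptr)
        fclose(f);

    // ... and how large it is
    f = fopen("/sys/class/u-dma-buf/udmabuf0/size", "r");
    if ((f != nullptr) && (fgets(attr, sizeof(attr), f) != nullptr))
        printf("size: %s", attr);
    if (f != nullptr)
        fclose(f);

    // The character device that the application snippets below will mmap()
    int fd = open("/dev/udmabuf0", O_RDWR);
    if (fd < 0)
    {
        perror("open /dev/udmabuf0");
        return -1;
    }

    close(fd);
    return 0;
}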

4. Application Strategies & Performance

We evaluated two primary methodologies: Bounced Writing and Direct Writing.

Strategy A: Bounced Writing

In this mode, data is copied from the DMA buffer to a standard user-space buffer before being written to the SSD.

Performance:

  • Single-Thread: ~5 Gbps @ 40% CPU usage.
  • 4-Thread Parallel: ~10 Gbps @ 95% CPU usage.

Core Implementation Snippet:

#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <unistd.h>

// Copies 'total_size' bytes from the mapped DMA buffer ('src')
// to the file referenced by 'dst_fd' in 'chunk_size' pieces.
int bounced_write(const uint8_t* src, int dst_fd, size_t total_size, size_t chunk_size)
{
    void* bounce_buf = nullptr;

    if (posix_memalign(&bounce_buf, 4096, chunk_size) != 0)
        return -1; // Allocation failure

    size_t remaining = total_size;

    // Transfer loop: Copy -> Write
    while (remaining > 0)
    {
        const size_t n = std::min(remaining, chunk_size);

        // A. Copy from "Special" DMA RAM to "Normal" user RAM
        // This is CPU-intensive but leverages the L1/L2 caches
        memcpy(bounce_buf, src, n);

        // B. Write to NVMe
        // The kernel can easily "pin" this standard RAM for the NVMe controller
        size_t written = 0;
        while (written < n)
        {
            const ssize_t ret = write(dst_fd, static_cast<uint8_t*>(bounce_buf) + written, n - written);
            if (ret < 0)
            {
                free(bounce_buf);
                return -1; // Write error
            }

            written += static_cast<size_t>(ret);
        }

        src       += n;
        remaining -= n;
    }

    free(bounce_buf);

    return 0;
}

The Chunk Size Analysis:

The throughput in bounced mode depends heavily on the intermediate chunk size. Benchmarking shows that performance peaks between 2 MB and 8 MB (a minimal measurement sketch follows the list below).

  • Under 1MB: Syscall overhead dominates.
  • Over 16MB: Performance degrades as the bounce buffer exceeds what the CPU’s caches can handle efficiently, causing cache misses and memory-latency stalls.
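
The sweep itself is easy to reproduce. Below is a minimal sketch, assuming the bounced_write() helper shown above and a source buffer and destination file prepared the same way:

#include <cstdint>
#include <cstdio>
#include <ctime>
#include <unistd.h>

// Times one bounced transfer per chunk size and reports the achieved rate.
// Assumes 'src', 'dst_fd' and 'total_size' are prepared as in the snippet above.
void benchmark_chunk_sizes(const uint8_t* src, int dst_fd, size_t total_size)
{
    // Sweep from 256 KiB up to 64 MiB
    for (size_t chunk_size = (256 * 1024); chunk_size <= (64 * 1024 * 1024); chunk_size *= 2)
    {
        lseek(dst_fd, 0, SEEK_SET); // Overwrite the same range on every iteration

        timespec start{}, stop{};
        clock_gettime(CLOCK_MONOTONIC, &start);

        if (bounced_write(src, dst_fd, total_size, chunk_size) != 0)
            continue;

        clock_gettime(CLOCK_MONOTONIC, &stop);

        const double seconds = (stop.tv_sec - start.tv_sec) + ((stop.tv_nsec - start.tv_nsec) / 1e9);
        const double gbps    = (total_size * 8.0) / (seconds * 1e9);

        printf("chunk: %8zu KiB -> %.2f Gbps\n", chunk_size / 1024, gbps);
    }
}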

Strategy B: Direct Writing (The “Zero-Copy” Path)

By utilizing quirk-mmap-page and opening the NVMe file with O_DIRECT, the NVMe controller can DMA data directly from the reserved DDR region.

Performance:

  • Single-Thread: ~9 Gbps @ 30% CPU usage.
  • Parallel: ~12 Gbps @ 50% CPU usage.

Core Implementation Snippet:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Open the NVMe-backed file with O_DIRECT for zero-copy writes.
// O_DIRECT requires the buffer address, file offset and transfer size
// to be aligned to the filesystem block size (typically 4096 bytes).
int fd = open("/mnt/nvme/data.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);

// Map the udmabuf (cached via Quirk Mode); quirk-mmap-page provides the
// struct page backing the kernel needs to pin these pages for the SSD.
void* map = mmap(NULL, size, PROT_READ, MAP_SHARED, udmabuf_fd, 0);

// Zero-copy transfer: the NVMe DMA engine reads directly from PS DDR
ssize_t ret = write(fd, map, size);
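
The parallel figure above comes from splitting the mapped region across a few worker threads, each issuing its own pwrite() at a distinct file offset so no file position is shared. A minimal sketch, assuming 'size' is a multiple of the thread count times the block size (function and parameter names are illustrative):

#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <thread>
#include <vector>

// Splits the mapped udmabuf region into equal slices and writes each slice
// at its matching file offset, so the slices land contiguously on disk.
void parallel_direct_write(int fd, const uint8_t* map, size_t size, unsigned threads = 4)
{
    std::vector<std::thread> workers;
    const size_t slice = size / threads;

    for (unsigned i = 0; i < threads; ++i)
    {
        workers.emplace_back([=]()
        {
            const off_t offset = static_cast<off_t>(i) * static_cast<off_t>(slice);

            // pwrite() carries its own offset, so threads don't contend on a shared
            // file position; production code should also check for short writes.
            pwrite(fd, map + offset, slice, offset);
        });
    }

    for (auto& worker : workers)
        worker.join();
}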

5. Final Comparison

| Metric         | /dev/mem           | Bounced (Cached)        | Direct (Quirk-Page)         |
|----------------|--------------------|-------------------------|-----------------------------|
| Max Throughput | ~1 Gbps            | ~10 Gbps                | ~12 Gbps                    |
| CPU Efficiency | Low (Busy Waiting) | Low (Busy Copying)      | High (Zero-Copy)            |
| Memory Sync    | Not Required       | Automatic (via Kernel)  | Semi-automatic (via Driver) |
| Complexity     | Low                | High (Threading/Pipes)  | Medium                      |

Conclusion

For high-speed FPGA data offloading on Zynq MPSoC, the “Quirk-Mmap-Page” methodology is the clear winner. It doubles the single-threaded performance compared to the bounce-buffer method and hits the hardware limit of the NVMe interface while keeping half of the CPU resources free for other applications. By moving from /dev/mem to a properly configured u-dma-buf node, you effectively transform a 1 Gbps bottleneck into a 12 Gbps data pipeline.
