In high-performance embedded systems—specifically those involving FPGAs—a common bottleneck is moving massive amounts of data from a reserved DDR memory region to a non-volatile storage device like an NVMe SSD. While the hardware interconnects (PCIe Gen2/Gen3) are theoretically capable of gigabyte-per-second speeds, standard Linux memory access methods often throttle this to a fraction of the hardware’s potential.
This guide explores how to bypass these bottlenecks using the u-dma-buf driver, achieving transfer speeds of up to 12 Gbps with minimal CPU overhead.
The “classic” way to access a reserved RAM region from user space is through /dev/mem. While simple, it is architecturally designed for safety rather than speed.
By default, the Linux kernel maps physical addresses accessed via /dev/mem as Uncached or Strongly Ordered. This is a protective measure: because /dev/mem can access hardware registers (MMIO), the kernel ensures the CPU never caches these values. If it did, the CPU might read a “stale” status flag from its cache instead of the actual hardware.
On ARM64 architectures (like the Zynq UltraScale+), this forces every load and store to wait for a round-trip to the physical DDR controller. The CPU cannot “burst” data or use its internal L1/L2 caches.
- Performance Cap: In practice this caps data rates regardless of how fast your DDR or NVMe drive is. On my machine it was around 1 Gbps.
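For concreteness, here is a minimal sketch of the classic approach, assuming the 1 GiB reserved region at 0x8_0000_0000 that the device tree later in this guide reserves. Every access through this mapping goes straight to the DDR controller:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

int main()
{
    // Values for illustration: match them to your reserved-memory node
    const off_t  phys_addr = 0x800000000L; // High DDR on ZynqMP
    const size_t map_size  = 0x40000000;   // 1 GiB

    int fd = open("/dev/mem", O_RDONLY | O_SYNC);
    if (fd < 0)
    {
        perror("open /dev/mem");
        return -1;
    }
    // The kernel maps this region as Device/Uncached memory: every read
    // below bypasses the CPU caches and stalls on the DDR round-trip.
    void* map = mmap(nullptr, map_size, PROT_READ, MAP_SHARED, fd, phys_addr);
    if (map == MAP_FAILED)
    {
        perror("mmap");
        close(fd);
        return -1;
    }
    volatile uint32_t word = *static_cast<volatile uint32_t*>(map);
    (void)word; // Each such access is a full uncached DDR transaction
    munmap(map, map_size);
    close(fd);
    return 0;
}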
To unlock the full 64-bit bus width, we need a way to tell the kernel that a specific region of RAM is “Normal Memory” (Cacheable) rather than “Device Memory” (Uncached). The u-dma-buf driver provides this bridge.
In the context of this driver, “Quirk” is a driver-specific term referring to a mechanism that bypasses the standard ARM64 memory-mapping restrictions.
On ARM64, the kernel’s DMA API often forces user-space mappings to be uncached to avoid “cache aliasing” (the same memory being mapped with different attributes). The Quirk mode allows the driver to:
- Bypass the standard dma_mmap_coherent restrictions.
- Manually insert pages into the process’s Virtual Memory Area (VMA) with Write-Back Cached attributes.
- Provide struct page backing (via quirk-mmap-page), which is essential for Direct I/O.
To enable high-speed direct transfers, the Device Tree must be configured to allow the kernel to perform cache maintenance on the reserved region.
reserved-memory {
    #address-cells = <2>;
    #size-cells = <2>;
    ranges;

    reserved_buffer: buffer@800000000 {
        compatible = "shared-dma-pool";
        reusable; /* Allows the kernel to create a linear map for cache cleaning */
        reg = <0x8 0x0 0x0 0x40000000>; /* 1 GiB at High DDR (0x8_0000_0000) */
    };
};

udmabuf@0 {
    compatible = "ikwzm,u-dma-buf";
    device-name = "udmabuf0";
    size = <0x0 0x40000000>;
    memory-region = <&reserved_buffer>;
    quirk-mmap-page; /* Mode 4: Enables struct page backing for O_DIRECT */
};
Note: Using reusable instead of no-map is critical. If no-map is used, the kernel cannot access the memory to flush the cache before a DMA transfer, leading to system crashes.
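Once the node is in place, the driver creates /dev/udmabuf0 along with sysfs entries under /sys/class/u-dma-buf/udmabuf0/ (e.g. phys_addr, size). A minimal sketch of mapping the buffer from user space:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    // Must match the device tree "size" property; it can also be read
    // back from /sys/class/u-dma-buf/udmabuf0/size.
    const size_t size = 0x40000000; // 1 GiB

    int udmabuf_fd = open("/dev/udmabuf0", O_RDWR);
    if (udmabuf_fd < 0)
    {
        perror("open /dev/udmabuf0");
        return -1;
    }
    // With quirk-mmap-page active, this mapping is Write-Back Cached and
    // has struct page backing, so it also works as an O_DIRECT source.
    void* map = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                     udmabuf_fd, 0);
    if (map == MAP_FAILED)
    {
        perror("mmap");
        close(udmabuf_fd);
        return -1;
    }
    // ... transfer code goes here; the snippets below assume this
    // udmabuf_fd and a "src" pointer derived from "map" ...
    munmap(map, size);
    close(udmabuf_fd);
    return 0;
}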
We evaluated two primary methodologies: Bounced Writing and Direct Writing.
In bounced mode, data is first copied from the DMA buffer into a standard user-space buffer and then written to the SSD.
Performance:
- Single-Thread: ~5 Gbps @ 40% CPU usage.
- 4-Thread Parallel: ~10 Gbps @ 95% CPU usage.
Core Implementation Snippet:
// Assumes "src" (const uint8_t*) points into the mmap'd udmabuf region,
// "dst_fd" is an open NVMe file descriptor, and "total_size"/"chunk_size"
// hold the transfer and chunk sizes. Requires <algorithm>, <cstring>,
// <cstdlib> and <unistd.h>.
void* bounce_buf = nullptr;
if (posix_memalign(&bounce_buf, 4096, chunk_size) != 0)
{
    return -1; // Handle allocation failure
}
size_t remaining = total_size;
// Transfer loop: Copy -> Write
while (remaining > 0)
{
    size_t n = std::min(remaining, chunk_size);
    // A. Copy from "Special" DMA RAM to "Normal" user RAM.
    // This is CPU-intensive but leverages the L1/L2 caches.
    memcpy(bounce_buf, src, n);
    // B. Write to NVMe.
    // The kernel can easily "pin" this standard RAM for the NVMe controller.
    size_t written = 0;
    while (written < n)
    {
        ssize_t ret = write(dst_fd, static_cast<uint8_t*>(bounce_buf) + written, n - written);
        if (ret < 0)
        {
            free(bounce_buf);
            return -1; // Handle write error
        }
        written += static_cast<size_t>(ret);
    }
    src += n;
    remaining -= n;
}
free(bounce_buf);
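The ~10 Gbps parallel figure comes from running several of these copy/write loops over disjoint slices of the buffer. One way to structure that is sketched below, assuming a hypothetical run_bounced_range() helper that bounce-copies one slice and pwrite()s it at the matching file offset:

#include <thread>
#include <vector>
#include <cstdint>
#include <cstddef>

// Hypothetical helper: bounce-copies [offset, offset + len) of the buffer
// and pwrite()s it to dst_fd at the same file offset.
int run_bounced_range(const uint8_t* src, int dst_fd, size_t offset, size_t len);

void parallel_bounced_write(const uint8_t* src, int dst_fd, size_t total_size)
{
    const unsigned n_threads = 4;
    const size_t slice = total_size / n_threads;
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n_threads; ++i)
    {
        size_t off = i * slice;
        // The last thread picks up any remainder
        size_t len = (i == n_threads - 1) ? (total_size - off) : slice;
        workers.emplace_back(run_bounced_range, src, dst_fd, off, len);
    }
    for (auto& t : workers)
        t.join();
}

Using pwrite() at distinct offsets avoids the shared-file-offset races that plain write() would cause across threads.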
The Chunk Size Analysis:
The throughput in bounced mode depends heavily on the intermediate chunk size. Benchmarking shows that performance peaks between 2 MB and 8 MB (a sweep harness is sketched after this list).
- Under 1 MB: Syscall overhead dominates.
- Over 16 MB: Performance degrades once the chunk no longer fits in the CPU’s last-level cache, causing cache misses and memory-latency stalls.
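To find the sweet spot on your own hardware, a rough sweep harness is sketched below; run_bounced_transfer() is a hypothetical wrapper around the copy/write loop above that returns the number of bytes moved:

#include <chrono>
#include <cstdint>
#include <cstddef>
#include <cstdio>

// Hypothetical wrapper around the bounce loop above; returns bytes moved.
size_t run_bounced_transfer(const uint8_t* src, int dst_fd,
                            size_t total_size, size_t chunk_size);

void sweep_chunk_sizes(const uint8_t* src, int dst_fd, size_t total_size)
{
    // Sweep chunk sizes from 256 KiB to 64 MiB in powers of two
    for (size_t chunk = 256 * 1024; chunk <= 64UL * 1024 * 1024; chunk *= 2)
    {
        auto t0 = std::chrono::steady_clock::now();
        size_t moved = run_bounced_transfer(src, dst_fd, total_size, chunk);
        auto t1 = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(t1 - t0).count();
        printf("chunk %8zu KiB: %6.2f Gbps\n",
               chunk / 1024, (moved * 8.0) / (secs * 1e9));
    }
}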
In direct mode, quirk-mmap-page combined with opening the NVMe file with O_DIRECT lets the NVMe controller DMA data directly from the reserved DDR region, with no intermediate copy.
Performance:
- Single-Thread: ~9 Gbps @ 30% CPU usage.
- Parallel: ~12 Gbps @ 50% CPU usage.
Core Implementation Snippet:
// Open the NVMe file with O_DIRECT for zero-copy I/O (O_DIRECT requires
// a block-aligned buffer, offset, and length)
int fd = open("/mnt/nvme/data.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
if (fd < 0)
    return -1; // Handle open failure
// Map the udmabuf (cached via Quirk Mode)
void* map = mmap(NULL, size, PROT_READ, MAP_SHARED, udmabuf_fd, 0);
if (map == MAP_FAILED)
    return -1; // Handle mapping failure
// Zero-copy transfer: the NVMe DMA engine reads directly from PS DDR
write(fd, map, size);

| Metric | /dev/mem | Bounced (Cached) | Direct (Quirk-Page) |
|---|---|---|---|
| Max Throughput | ~1 Gbps | ~10 Gbps | ~12 Gbps |
| CPU Efficiency | Low (Busy Waiting) | Low (Busy Copying) | High (Zero-Copy) |
| Memory Sync | Not Required | Automatic (via Kernel) | Semi-automatic (via Driver) |
| Complexity | Low | High (Threading/Pipes) | Medium |
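A note on the “Semi-automatic (via Driver)” row: u-dma-buf exposes manual cache-maintenance hooks through sysfs (sync_offset, sync_size, sync_direction, sync_for_cpu and sync_for_device). The sketch below assumes those attribute names for udmabuf0 and invalidates the first 16 MiB before the CPU reads data the FPGA has just written; check the attribute set shipped with your driver version.

#include <cstdio>

// Write a decimal value to one of udmabuf0's sysfs attributes.
static int write_sysfs(const char* attr, unsigned long value)
{
    char path[128];
    snprintf(path, sizeof(path), "/sys/class/u-dma-buf/udmabuf0/%s", attr);
    FILE* f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%lu", value);
    return fclose(f);
}

// Invalidate the CPU caches for the first 16 MiB of the buffer before
// reading data the FPGA has just written into it.
int sync_first_16mib_for_cpu()
{
    if (write_sysfs("sync_offset", 0) < 0)          return -1;
    if (write_sysfs("sync_size", 16UL << 20) < 0)   return -1;
    if (write_sysfs("sync_direction", 2) < 0)       return -1; // DMA_FROM_DEVICE
    return write_sysfs("sync_for_cpu", 1);          // Triggers the sync
}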
For high-speed FPGA data offloading on Zynq MPSoC, the “Quirk-Mmap-Page” methodology is the clear winner. It nearly doubles single-threaded performance compared to the bounce-buffer method (~9 Gbps vs ~5 Gbps) and saturates the NVMe interface while keeping half of the CPU free for other applications. By moving from /dev/mem to a properly configured u-dma-buf node, you effectively transform a 1 Gbps bottleneck into a 12 Gbps data pipeline.