From: Ankit Agrawal <ankita@nvidia.com>
This series enables hugepfnmap support in QEMU for VFIO device memory
regions that have non-power-of-2 sizes. This specifically addresses the
needs of Grace-based systems (GB200) where device memory is exposed
as a BAR.
## Problem
On Grace-based systems, device memory regions can have sizes like
0x2F00F00000 (not power-of-2). The current QEMU VFIO mapping code
aligns each sparse mmap area independently using the trailing zeros
of its size (ctz64), which results in suboptimal alignment for the
overall VMA.
This prevents the kernel from using hugepfnmap, the mechanism that
enables huge page (PMD/PUD-level) mappings for device memory. Without
sufficient alignment of the mapping, the kernel falls back to
PTE-level mappings, which significantly hurts performance through
increased TLB pressure and page table overhead for large memory
regions.
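To make the arithmetic concrete, here is a minimal standalone sketch of
a trailing-zeros alignment applied to the example size above. The
helper names are hypothetical and this is not the QEMU code itself;
ctz64_sketch() only mimics the behaviour of QEMU's ctz64().

```c
#include <assert.h>
#include <stdint.h>

/* Count trailing zero bits; returns 64 for 0, as QEMU's ctz64() does. */
static inline int ctz64_sketch(uint64_t v)
{
    int n = 0;

    if (v == 0) {
        return 64;
    }
    while ((v & 1) == 0) {
        v >>= 1;
        n++;
    }
    return n;
}

/* Hypothetical helper: alignment derived from the lowest set bit of size. */
static inline uint64_t align_from_trailing_zeros(uint64_t size)
{
    assert(size != 0);
    return UINT64_C(1) << ctz64_sketch(size);
}
```

For 0x2F00F00000 the lowest set bit is bit 20, so this yields an
alignment of 0x100000 (1 MiB) for a region of roughly 188 GiB. That is
below even the 2 MiB PMD size used with 4 KiB base pages.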
## Solution
Patch 1: Sort sparse mmap regions by offset during setup and validate
that they don't overlap. This ensures predictable mapping
order and enables gap detection.
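As an illustration of the sort-and-validate step, the following is a
minimal sketch under stated assumptions, not the code from patch 1.
The struct and function names are hypothetical, and the bounds check
against the region size is an extra assumption that the description
above does not mention.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one sparse mmap area of a region. */
struct sparse_area {
    uint64_t offset;
    uint64_t size;
};

/* qsort() comparator: ascending by offset, without subtraction overflow. */
static int cmp_offset(const void *a, const void *b)
{
    const struct sparse_area *x = a, *y = b;

    return (x->offset > y->offset) - (x->offset < y->offset);
}

/*
 * Sort areas by offset, then reject any area that extends past the
 * region (assumed check) or overlaps its predecessor.
 */
static bool sort_and_validate(struct sparse_area *areas, size_t n,
                              uint64_t region_size)
{
    if (n == 0) {
        return true;
    }
    qsort(areas, n, sizeof(*areas), cmp_offset);

    for (size_t i = 0; i < n; i++) {
        if (areas[i].size > region_size ||
            areas[i].offset > region_size - areas[i].size) {
            return false;   /* out of bounds; also rules out overflow */
        }
        if (i > 0 &&
            areas[i - 1].offset + areas[i - 1].size > areas[i].offset) {
            return false;   /* overlaps the previous area */
        }
    }
    return true;
}
```

Once the array is sorted, any gap is simply the space between the end
of one area and the offset of the next.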
Patch 2: Change the alignment strategy from per-sparse-region to
whole-region alignment using pow2ceil(region->size). Create
a single aligned base mapping for the entire region, then
overlay sparse areas with MAP_FIXED. Gaps between sparse
regions are explicitly unmapped.
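The mapping flow described for patch 2 can be sketched roughly as
follows. This is an illustrative approximation and not the patch
itself: all names are hypothetical, areas are assumed to be sorted,
non-overlapping and page-aligned, and passing fd < 0 substitutes
anonymous memory for the VFIO device fd so the sketch can run without
hardware.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

struct area {                   /* hypothetical sparse mmap area */
    uint64_t offset;
    uint64_t size;
};

/* Smallest power of two >= v (v must be nonzero and <= 2^63). */
static uint64_t pow2ceil_sketch(uint64_t v)
{
    uint64_t p = 1;

    assert(v != 0 && v <= (UINT64_C(1) << 63));
    while (p < v) {
        p <<= 1;
    }
    return p;
}

/* Reserve size bytes of PROT_NONE address space starting at a multiple
 * of align: over-allocate by align, then trim the head and the tail. */
static uint8_t *reserve_aligned(size_t size, size_t align)
{
    uint8_t *raw = mmap(NULL, size + align, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    uintptr_t start;
    size_t head;

    if (raw == MAP_FAILED) {
        return NULL;
    }
    start = ((uintptr_t)raw + align - 1) & ~((uintptr_t)align - 1);
    head = start - (uintptr_t)raw;
    if (head) {
        munmap(raw, head);
    }
    munmap((uint8_t *)start + size, align - head);
    return (uint8_t *)start;
}

/* Map one region: aligned base, MAP_FIXED overlays, unmapped gaps. */
static uint8_t *map_region_sketch(uint64_t region_size,
                                  const struct area *areas, size_t n,
                                  int fd, off_t fd_base)
{
    size_t align = (size_t)pow2ceil_sketch(region_size);
    uint8_t *base = reserve_aligned((size_t)region_size, align);
    uint64_t cursor = 0;

    if (!base) {
        return NULL;
    }
    for (size_t i = 0; i < n; i++) {
        int flags = MAP_FIXED |
                    (fd >= 0 ? MAP_SHARED : MAP_PRIVATE | MAP_ANONYMOUS);
        off_t off = fd >= 0 ? fd_base + (off_t)areas[i].offset : 0;

        if (areas[i].offset > cursor) {         /* gap before this area */
            munmap(base + cursor, areas[i].offset - cursor);
        }
        if (mmap(base + areas[i].offset, areas[i].size,
                 PROT_READ | PROT_WRITE, flags, fd, off) == MAP_FAILED) {
            munmap(base, region_size);          /* drop everything */
            return NULL;
        }
        cursor = areas[i].offset + areas[i].size;
    }
    if (cursor < region_size) {                 /* trailing gap */
        munmap(base + cursor, region_size - cursor);
    }
    return base;
}
```

With the example size 0x2F00F00000, pow2ceil gives 0x4000000000
(256 GiB), so the base of the whole region lands on a boundary that is
a multiple of every smaller huge page size. The cost in this sketch is
a temporary PROT_NONE reservation of size + align bytes of virtual
address space, which is trimmed immediately.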
Ankit Agrawal (2):
  hw/vfio: sort and validate sparse mmap regions by offset
  hw/vfio: align mmap to power-of-2 of region size for hugepfnmap
hw/vfio/region.c | 126 ++++++++++++++++++++++++++++++++++++-----------
1 file changed, 98 insertions(+), 28 deletions(-)
--
2.34.1