This series is based on v6.18. It allows mmap(!MAP_FIXED) to work with
huge pfnmaps on a best-effort basis, and enables it for vfio-pci as the
first user.

v1: https://lore.kernel.org/r/20250613134111.469884-1-peterx@redhat.com

A changelog is omitted because all the patches were rewritten based on
the new interface this v2 introduces.

In this version, a new file operation, get_mapping_order(), is introduced
(based on the discussion with Jason on v1) to minimize the code drivers
need in order to implement this. It also avoids exporting any mm
functions. Refer to the v1 discussion for more information.

Currently, the get_mapping_order() API is defined as:

  int (*get_mapping_order)(struct file *file, unsigned long pgoff, size_t len);

The first argument is the file pointer; the second and third are the
pgoff and len specified by an mmap() request. A driver can use this
interface to opt in to providing mapping order hints to core mm for VA
allocations over the specified range of the file. I kept the interface
simple for now, so core mm will always do the alignment against pgoff,
assuming that always works. The driver can only report the order derived
from pgoff+len, which is then used for the alignment.

Before this series, a userspace application in most cases needed to be
modified to benefit from huge mappings, providing a huge-size-aligned VA
via MAP_FIXED. After this series, applications benefit from huge pfnmaps
automatically after a kernel upgrade, with no userspace modifications.

It is still best-effort, because the auto-alignment requires a larger VA
range to be allocated by the per-arch allocator; if a huge-mapping-aligned
VA cannot be allocated, it falls back to small mappings as before. That
is the theory, though: in practice I don't yet know when it would fail,
especially on a 64-bit system.

So far, only vfio-pci is supported, but the logic should be applicable to
all drivers that support, or will support, huge pfnmaps.
I've also copied more people in this version from the hardware side.

For testing:

- checkpatch.pl
- cross build harness
- the unit test I got from Alex [1], checking mmap() alignments on a QEMU
  instance with a 128MB BAR

The alignments all look sane with mmap(!MAP_FIXED), and huge mappings are
properly installed. I didn't observe anything wrong.

I currently lack larger BARs to test PUD sizes. Please kindly report if
you can run this with 1G+ BARs and hit issues.

Alex Mastro: thanks for the testing offered on v1, but since this series
was rewritten, a re-test will be needed; hence I didn't collect the T-b.

Comments welcome, thanks.

[1] https://github.com/awilliam/tests/blob/vfio-pci-device-map-alignment/vfio-pci-device-map-alignment.c

Peter Xu (4):
  mm/thp: Allow thp_get_unmapped_area_vmflags() to take alignment
  mm: Add file_operations.get_mapping_order()
  vfio: Introduce vfio_device_ops.get_mapping_order hook
  vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings

 Documentation/filesystems/vfs.rst |  4 +++
 drivers/vfio/pci/vfio_pci.c       |  1 +
 drivers/vfio/pci/vfio_pci_core.c  | 49 ++++++++++++++++++++++++++
 drivers/vfio/vfio_main.c          | 14 ++++++++
 include/linux/fs.h                |  1 +
 include/linux/huge_mm.h           |  5 +--
 include/linux/vfio.h              |  5 +++
 include/linux/vfio_pci_core.h     |  2 ++
 mm/huge_memory.c                  |  7 ++--
 mm/mmap.c                         | 58 +++++++++++++++++++++++++++----
 10 files changed, 135 insertions(+), 11 deletions(-)

-- 
2.50.1
On Thu, Dec 04, 2025 at 10:09:59AM -0500, Peter Xu wrote:
> Alex Mastro: thanks for the testing offered in v1, but since this series
> was rewritten, a re-test will be needed. I hence didn't collect the T-b.
Thanks Peter, LGTM.
Tested-by: Alex Mastro <amastro@fb.com>
$ cc -Og -Wall -Wextra test_vfio_map_dma.c -o test_vfio_map_dma
$ ./test_vfio_map_dma 0000:05:00.0 4 0x600000 0x800000000 0x100000000
opening 0000:05:00.0 via /dev/vfio/39
BAR 4: size=0x2000000000, offset=0x40000000000, flags=0x7
mmap'd BAR 4: offset=0x600000, size=0x800000000 -> vaddr=0x7fdac0600000
VFIO_IOMMU_MAP_DMA: vaddr=0x7fdac0600000, iova=0x100000000, size=0x800000000
$ sudo bpftrace -q -e 'fexit:vfio_pci_mmap_huge_fault { printf("order=%d, ret=0x%x\n", args.order, retval); }' 2>&1 > ~/dump
$ cat ~/dump | sort | uniq -c | sort -nr
512 order=9, ret=0x100
31 order=18, ret=0x100
2
1 order=18, ret=0x800
test_vfio_map_dma.c
---
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <linux/limits.h>
#include <linux/types.h>
#include <linux/vfio.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#define ensure(cond)                                                     \
	do {                                                             \
		if (!(cond)) {                                           \
			fprintf(stderr,                                  \
				"%s:%d Condition failed: '%s' (errno=%d: %s)\n", \
				__FILE__, __LINE__, #cond, errno,        \
				strerror(errno));                        \
			exit(EXIT_FAILURE);                              \
		}                                                        \
	} while (0)
static uint32_t group_for_bdf(const char *bdf)
{
	char path[PATH_MAX];
	char link[PATH_MAX];
	int ret;

	snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/iommu_group",
		 bdf);
	ret = readlink(path, link, sizeof(link) - 1);
	ensure(ret > 0);
	link[ret] = '\0'; /* readlink() does not NUL-terminate */

	const char *filename = basename(link);
	ensure(filename);
	return strtoul(filename, NULL, 0);
}
int main(int argc, char **argv)
{
	int ret;

	if (argc != 6) {
		printf("usage: %s <vfio_bdf> <bar_idx> <bar_offset> <size> <iova>\n",
		       argv[0]);
		printf("example: %s 0000:05:00.0 2 0x20000 0x1000 0x100000\n",
		       argv[0]);
		return 1;
	}

	const char *bdf = argv[1];
	uint32_t bar_idx = strtoul(argv[2], NULL, 0);
	uint64_t bar_offs = strtoull(argv[3], NULL, 0);
	uint64_t size = strtoull(argv[4], NULL, 0);
	uint64_t iova = strtoull(argv[5], NULL, 0);
	uint32_t group_num = group_for_bdf(bdf);
	char group_path[PATH_MAX];

	snprintf(group_path, sizeof(group_path), "/dev/vfio/%u", group_num);

	int container_fd = open("/dev/vfio/vfio", O_RDWR);
	ensure(container_fd >= 0);

	printf("opening %s via %s\n", bdf, group_path);
	int group_fd = open(group_path, O_RDWR);
	ensure(group_fd >= 0);

	ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container_fd);
	ensure(!ret);
	ret = ioctl(container_fd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);
	ensure(!ret);

	int device_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, bdf);
	ensure(device_fd >= 0);

	/* Get region info for the BAR */
	struct vfio_region_info region_info = {
		.argsz = sizeof(region_info),
		.index = bar_idx,
	};
	ret = ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &region_info);
	ensure(!ret);
	printf("BAR %u: size=0x%llx, offset=0x%llx, flags=0x%x\n", bar_idx,
	       region_info.size, region_info.offset, region_info.flags);
	ensure(region_info.flags & VFIO_REGION_INFO_FLAG_MMAP);
	ensure(bar_offs + size <= region_info.size);

	/* mmap the BAR at the specified offset */
	void *bar_mmap = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
			      device_fd, region_info.offset + bar_offs);
	ensure(bar_mmap != MAP_FAILED);
	ret = madvise(bar_mmap, size, MADV_HUGEPAGE);
	ensure(!ret);
	printf("mmap'd BAR %u: offset=0x%lx, size=0x%lx -> vaddr=%p\n", bar_idx,
	       bar_offs, size, bar_mmap);

	/* Map the mmap'd address into the IOMMU using VFIO_IOMMU_MAP_DMA */
	struct vfio_iommu_type1_dma_map dma_map = {
		.argsz = sizeof(dma_map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uint64_t)bar_mmap,
		.iova = iova,
		.size = size,
	};
	printf("VFIO_IOMMU_MAP_DMA: vaddr=%p, iova=0x%llx, size=0x%lx\n",
	       bar_mmap, (unsigned long long)dma_map.iova, size);
	ret = ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
	ensure(!ret);

	/* Cleanup */
	struct vfio_iommu_type1_dma_unmap dma_unmap = {
		.argsz = sizeof(dma_unmap),
		.iova = dma_map.iova,
		.size = size,
	};
	ret = ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &dma_unmap);
	ensure(!ret);
	ret = munmap(bar_mmap, size);
	ensure(!ret);
	close(device_fd);
	close(group_fd);
	close(container_fd);
	return 0;
}
On 12/4/25 16:09, Peter Xu wrote:
> This series is based on v6.18. It allows mmap(!MAP_FIXED) to work with
> huge pfnmaps with best effort. Meanwhile, it enables it for vfio-pci as
> the first user.

[...]

> I currently lack larger bars to test PUD sizes. Please kindly report if
> one can run this with 1G+ bars and hit issues.

LGTM, with a 32G BAR :

Using device 0000:02:00.0 in IOMMU group 27
Device 0000:02:00.0 supports 9 regions, 5 irqs
[BAR0]: size 0x1000000, order 24, offset 0x0, flags 0xf
Testing BAR0, require at least 21 bit alignment
[PASS] Minimum alignment 21
Testing random offset
[PASS] Random offset
Testing random size
[PASS] Random size
[BAR1]: size 0x800000000, order 35, offset 0x10000000000, flags 0x7
Testing BAR1, require at least 30 bit alignment
[PASS] Minimum alignment 31
Testing random offset
[PASS] Random offset
Testing random size
[PASS] Random size
[BAR3]: size 0x2000000, order 25, offset 0x30000000000, flags 0x7
Testing BAR3, require at least 21 bit alignment
[PASS] Minimum alignment 21
Testing random offset
[PASS] Random offset
Testing random size
[PASS] Random size

C.

[...]