[PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings

Posted by Peter Xu 3 months, 4 weeks ago (patch 5 of a 5-patch series)
This patch enables best-effort mmap() for vfio-pci bars even without
MAP_FIXED, so as to utilize huge pfnmaps as much as possible.  It also
avoids the need for userspace changes (switching to MAP_FIXED with
pre-aligned VA addresses) to start enabling huge pfnmaps on VFIO bars.

Here the trick is making sure the MMIO PFNs are aligned with the VAs
allocated from mmap() when !MAP_FIXED, so that whatever is returned from
mmap(!MAP_FIXED) on vfio-pci MMIO regions will automatically be suitable
for huge pfnmaps as much as possible.

To achieve that, a custom vfio_device's get_unmapped_area() for vfio-pci
devices is needed.

Note that MMIO physical addresses should normally be guaranteed to be
always bar-size aligned, hence the bar offset can logically be directly
used to do the calculation.  However to make it strict and clear (rather
than relying on spec details), we still try to fetch the bar's physical
addresses from pci_dev.resource[].

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 drivers/vfio/pci/vfio_pci.c      |  3 ++
 drivers/vfio/pci/vfio_pci_core.c | 65 ++++++++++++++++++++++++++++++++
 include/linux/vfio_pci_core.h    |  6 +++
 3 files changed, 74 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 5ba39f7623bb..d9ae6cdbea28 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -144,6 +144,9 @@ static const struct vfio_device_ops vfio_pci_ops = {
 	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
 	.pasid_attach_ioas	= vfio_iommufd_physical_pasid_attach_ioas,
 	.pasid_detach_ioas	= vfio_iommufd_physical_pasid_detach_ioas,
+#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
+	.get_unmapped_area	= vfio_pci_core_get_unmapped_area,
+#endif
 };
 
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 6328c3a05bcd..835bc168f8b7 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1641,6 +1641,71 @@ static unsigned long vma_to_pfn(struct vm_area_struct *vma)
 	return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
 }
 
+#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
+/*
+ * Hint function to provide mmap() virtual address candidate so as to be
+ * able to map huge pfnmaps as much as possible.  It is done by aligning
+ * the VA to the PFN to be mapped in the specific bar.
+ *
+ * Note that this function only does the minimum checks on mmap() parameters
+ * needed to make the PFN calculation valid.  The majority of mmap() sanity
+ * checks will be done later in mmap().
+ */
+unsigned long vfio_pci_core_get_unmapped_area(struct vfio_device *device,
+					      struct file *file,
+					      unsigned long addr,
+					      unsigned long len,
+					      unsigned long pgoff,
+					      unsigned long flags)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+	struct pci_dev *pdev = vdev->pdev;
+	unsigned long ret, phys_len, req_start, phys_addr;
+	unsigned int index;
+
+	index = pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+
+	/* Currently, only bars 0-5 support huge pfnmap */
+	if (index >= VFIO_PCI_ROM_REGION_INDEX)
+		goto fallback;
+
+	/* Bar offset */
+	req_start = (pgoff << PAGE_SHIFT) & ((1UL << VFIO_PCI_OFFSET_SHIFT) - 1);
+	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
+
+	/*
+	 * Make sure we at least can get a valid physical address to do the
+	 * math.  If this happens, it will probably fail mmap() later..
+	 */
+	if (req_start >= phys_len)
+		goto fallback;
+
+	phys_len = MIN(phys_len, len);
+	/* Calculate the start of physical address to be mapped */
+	phys_addr = pci_resource_start(pdev, index) + req_start;
+
+	/* Choose the alignment */
+	if (IS_ENABLED(CONFIG_ARCH_SUPPORTS_PUD_PFNMAP) && phys_len >= PUD_SIZE) {
+		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
+						   flags, PUD_SIZE, 0);
+		if (ret)
+			return ret;
+	}
+
+	if (phys_len >= PMD_SIZE) {
+		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
+						   flags, PMD_SIZE, 0);
+		if (ret)
+			return ret;
+	}
+
+fallback:
+	return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, flags);
+}
+EXPORT_SYMBOL_GPL(vfio_pci_core_get_unmapped_area);
+#endif
+
 static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
 					   unsigned int order)
 {
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index fbb472dd99b3..e59699e01901 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -119,6 +119,12 @@ ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
 		size_t count, loff_t *ppos);
 ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
 		size_t count, loff_t *ppos);
+unsigned long vfio_pci_core_get_unmapped_area(struct vfio_device *device,
+					      struct file *file,
+					      unsigned long addr,
+					      unsigned long len,
+					      unsigned long pgoff,
+					      unsigned long flags);
 int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma);
 void vfio_pci_core_request(struct vfio_device *core_vdev, unsigned int count);
 int vfio_pci_core_match(struct vfio_device *core_vdev, char *buf);
-- 
2.49.0
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by David Hildenbrand 3 months, 4 weeks ago
On 13.06.25 15:41, Peter Xu wrote:
> This patch enables best-effort mmap() for vfio-pci bars even without
> MAP_FIXED, so as to utilize huge pfnmaps as much as possible.  It also
> avoids the need for userspace changes (switching to MAP_FIXED with
> pre-aligned VA addresses) to start enabling huge pfnmaps on VFIO bars.
> 
> Here the trick is making sure the MMIO PFNs are aligned with the VAs
> allocated from mmap() when !MAP_FIXED, so that whatever is returned from
> mmap(!MAP_FIXED) on vfio-pci MMIO regions will automatically be suitable
> for huge pfnmaps as much as possible.
> 
> To achieve that, a custom vfio_device's get_unmapped_area() for vfio-pci
> devices is needed.
> 
> Note that MMIO physical addresses should normally be guaranteed to be
> always bar-size aligned, hence the bar offset can logically be directly
> used to do the calculation.  However to make it strict and clear (rather
> than relying on spec details), we still try to fetch the bar's physical
> addresses from pci_dev.resource[].
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

There is likely a

Co-developed-by: Alex Williamson <alex.williamson@redhat.com>

missing?

> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   drivers/vfio/pci/vfio_pci.c      |  3 ++
>   drivers/vfio/pci/vfio_pci_core.c | 65 ++++++++++++++++++++++++++++++++
>   include/linux/vfio_pci_core.h    |  6 +++
>   3 files changed, 74 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 5ba39f7623bb..d9ae6cdbea28 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -144,6 +144,9 @@ static const struct vfio_device_ops vfio_pci_ops = {
>   	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
>   	.pasid_attach_ioas	= vfio_iommufd_physical_pasid_attach_ioas,
>   	.pasid_detach_ioas	= vfio_iommufd_physical_pasid_detach_ioas,
> +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> +	.get_unmapped_area	= vfio_pci_core_get_unmapped_area,
> +#endif
>   };
>   
>   static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 6328c3a05bcd..835bc168f8b7 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1641,6 +1641,71 @@ static unsigned long vma_to_pfn(struct vm_area_struct *vma)
>   	return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
>   }
>   
> +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> +/*
> + * Hint function to provide mmap() virtual address candidate so as to be
> + * able to map huge pfnmaps as much as possible.  It is done by aligning
> + * the VA to the PFN to be mapped in the specific bar.
> + *
> + * Note that this function only does the minimum checks on mmap() parameters
> + * needed to make the PFN calculation valid.  The majority of mmap() sanity
> + * checks will be done later in mmap().
> + */
> +unsigned long vfio_pci_core_get_unmapped_area(struct vfio_device *device,
> +					      struct file *file,
> +					      unsigned long addr,
> +					      unsigned long len,
> +					      unsigned long pgoff,
> +					      unsigned long flags)

A very suboptimal way to indent this many parameters; just use two tabs 
at the beginning.

> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	struct pci_dev *pdev = vdev->pdev;
> +	unsigned long ret, phys_len, req_start, phys_addr;
> +	unsigned int index;
> +
> +	index = pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);

Could do

unsigned int index = pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);

at the very top.

> +
> +	/* Currently, only bars 0-5 support huge pfnmap */
> +	if (index >= VFIO_PCI_ROM_REGION_INDEX)
> +		goto fallback;
> +
> +	/* Bar offset */
> +	req_start = (pgoff << PAGE_SHIFT) & ((1UL << VFIO_PCI_OFFSET_SHIFT) - 1);
> +	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
> +
> +	/*
> +	 * Make sure we at least can get a valid physical address to do the
> +	 * math.  If this happens, it will probably fail mmap() later..
> +	 */
> +	if (req_start >= phys_len)
> +		goto fallback;
> +
> +	phys_len = MIN(phys_len, len);
> +	/* Calculate the start of physical address to be mapped */
> +	phys_addr = pci_resource_start(pdev, index) + req_start;
> +
> +	/* Choose the alignment */
> +	if (IS_ENABLED(CONFIG_ARCH_SUPPORTS_PUD_PFNMAP) && phys_len >= PUD_SIZE) {
> +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> +						   flags, PUD_SIZE, 0);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (phys_len >= PMD_SIZE) {
> +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> +						   flags, PMD_SIZE, 0);
> +		if (ret)
> +			return ret;

Similar to Jason, I wonder if that logic should reside in the core, and 
we only indicate the maximum page table level we support.

> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index fbb472dd99b3..e59699e01901 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -119,6 +119,12 @@ ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
>   		size_t count, loff_t *ppos);
>   ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
>   		size_t count, loff_t *ppos);
> +unsigned long vfio_pci_core_get_unmapped_area(struct vfio_device *device,
> +					      struct file *file,
> +					      unsigned long addr,
> +					      unsigned long len,
> +					      unsigned long pgoff,
> +					      unsigned long flags);

Ditto.

-- 
Cheers,

David / dhildenb
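For reference, the convention in the kernel's submitting-patches documentation is that Co-developed-by is immediately followed by the co-author's Signed-off-by, with the submitter's own sign-off last, e.g.:

```
Co-developed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
```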
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Peter Xu 3 months, 4 weeks ago
On Fri, Jun 13, 2025 at 08:09:41PM +0200, David Hildenbrand wrote:
> On 13.06.25 15:41, Peter Xu wrote:
> > This patch enables best-effort mmap() for vfio-pci bars even without
> > MAP_FIXED, so as to utilize huge pfnmaps as much as possible.  It also
> > avoids the need for userspace changes (switching to MAP_FIXED with
> > pre-aligned VA addresses) to start enabling huge pfnmaps on VFIO bars.
> > 
> > Here the trick is making sure the MMIO PFNs are aligned with the VAs
> > allocated from mmap() when !MAP_FIXED, so that whatever is returned from
> > mmap(!MAP_FIXED) on vfio-pci MMIO regions will automatically be suitable
> > for huge pfnmaps as much as possible.
> > 
> > To achieve that, a custom vfio_device's get_unmapped_area() for vfio-pci
> > devices is needed.
> > 
> > Note that MMIO physical addresses should normally be guaranteed to be
> > always bar-size aligned, hence the bar offset can logically be directly
> > used to do the calculation.  However to make it strict and clear (rather
> > than relying on spec details), we still try to fetch the bar's physical
> > addresses from pci_dev.resource[].
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> 
> There is likely a
> 
> Co-developed-by: Alex Williamson <alex.williamson@redhat.com>
> 
> missing?

Would it mean the same if we use the two SoBs like this patch does?
I sincerely don't know the difference..  I hope it's fine to show that this
patch was developed together.  Please let me know otherwise.

> 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   drivers/vfio/pci/vfio_pci.c      |  3 ++
> >   drivers/vfio/pci/vfio_pci_core.c | 65 ++++++++++++++++++++++++++++++++
> >   include/linux/vfio_pci_core.h    |  6 +++
> >   3 files changed, 74 insertions(+)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 5ba39f7623bb..d9ae6cdbea28 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -144,6 +144,9 @@ static const struct vfio_device_ops vfio_pci_ops = {
> >   	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
> >   	.pasid_attach_ioas	= vfio_iommufd_physical_pasid_attach_ioas,
> >   	.pasid_detach_ioas	= vfio_iommufd_physical_pasid_detach_ioas,
> > +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> > +	.get_unmapped_area	= vfio_pci_core_get_unmapped_area,
> > +#endif
> >   };
> >   static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index 6328c3a05bcd..835bc168f8b7 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -1641,6 +1641,71 @@ static unsigned long vma_to_pfn(struct vm_area_struct *vma)
> >   	return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
> >   }
> > +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> > +/*
> > + * Hint function to provide mmap() virtual address candidate so as to be
> > + * able to map huge pfnmaps as much as possible.  It is done by aligning
> > + * the VA to the PFN to be mapped in the specific bar.
> > + *
> > + * Note that this function only does the minimum checks on mmap() parameters
> > + * needed to make the PFN calculation valid.  The majority of mmap() sanity
> > + * checks will be done later in mmap().
> > + */
> > +unsigned long vfio_pci_core_get_unmapped_area(struct vfio_device *device,
> > +					      struct file *file,
> > +					      unsigned long addr,
> > +					      unsigned long len,
> > +					      unsigned long pgoff,
> > +					      unsigned long flags)
> 
> A very suboptimal way to indent this many parameters; just use two tabs at
> the beginning.

This is the default indentation from Emacs c-mode.

Since this is a VFIO file, I checked the file and looks like there's not
yet a strict rule of indentation across the whole file.  I can switch to
two-tabs for sure if nobody else disagrees.

> 
> > +{
> > +	struct vfio_pci_core_device *vdev =
> > +		container_of(device, struct vfio_pci_core_device, vdev);
> > +	struct pci_dev *pdev = vdev->pdev;
> > +	unsigned long ret, phys_len, req_start, phys_addr;
> > +	unsigned int index;
> > +
> > +	index = pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> 
> Could do
> 
> unsigned int index = pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> 
> at the very top.

Sure.

> 
> > +
> > +	/* Currently, only bars 0-5 support huge pfnmap */
> > +	if (index >= VFIO_PCI_ROM_REGION_INDEX)
> > +		goto fallback;
> > +
> > +	/* Bar offset */
> > +	req_start = (pgoff << PAGE_SHIFT) & ((1UL << VFIO_PCI_OFFSET_SHIFT) - 1);
> > +	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
> > +
> > +	/*
> > +	 * Make sure we at least can get a valid physical address to do the
> > +	 * math.  If this happens, it will probably fail mmap() later..
> > +	 */
> > +	if (req_start >= phys_len)
> > +		goto fallback;
> > +
> > +	phys_len = MIN(phys_len, len);
> > +	/* Calculate the start of physical address to be mapped */
> > +	phys_addr = pci_resource_start(pdev, index) + req_start;
> > +
> > +	/* Choose the alignment */
> > +	if (IS_ENABLED(CONFIG_ARCH_SUPPORTS_PUD_PFNMAP) && phys_len >= PUD_SIZE) {
> > +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > +						   flags, PUD_SIZE, 0);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	if (phys_len >= PMD_SIZE) {
> > +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > +						   flags, PMD_SIZE, 0);
> > +		if (ret)
> > +			return ret;
> 
> Similar to Jason, I wonder if that logic should reside in the core, and we
> only indicate the maximum page table level we support.

I replied.  We can continue the discussion there.

Thanks,

-- 
Peter Xu
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Jason Gunthorpe 3 months, 4 weeks ago
On Fri, Jun 13, 2025 at 09:41:11AM -0400, Peter Xu wrote:

> +	/* Choose the alignment */
> +	if (IS_ENABLED(CONFIG_ARCH_SUPPORTS_PUD_PFNMAP) && phys_len >= PUD_SIZE) {
> +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> +						   flags, PUD_SIZE, 0);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (phys_len >= PMD_SIZE) {
> +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> +						   flags, PMD_SIZE, 0);
> +		if (ret)
> +			return ret;
> +	}

Hurm, we have contiguous pages now, so PMD_SIZE is not so great, eg on
4k ARM we can have a 16*2M=32MB contiguity, and 16k ARM uses
contiguity to get a 32*32M=1GB option.

Forcing to only align to the PMD or PUD seems suboptimal..

> +fallback:
> +	return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, flags);

Why not put this into mm_get_unmapped_area_vmflags() and get rid of
thp_get_unmapped_area_vmflags() too?

Is there any reason the caller should have to do a retry?

Jason
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Peter Xu 3 months, 4 weeks ago
On Fri, Jun 13, 2025 at 11:29:03AM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 13, 2025 at 09:41:11AM -0400, Peter Xu wrote:
> 
> > +	/* Choose the alignment */
> > +	if (IS_ENABLED(CONFIG_ARCH_SUPPORTS_PUD_PFNMAP) && phys_len >= PUD_SIZE) {
> > +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > +						   flags, PUD_SIZE, 0);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	if (phys_len >= PMD_SIZE) {
> > +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > +						   flags, PMD_SIZE, 0);
> > +		if (ret)
> > +			return ret;
> > +	}
> 
> Hurm, we have contiguous pages now, so PMD_SIZE is not so great, eg on
> 4k ARM we can have a 16*2M=32MB contiguity, and 16k ARM uses
> contiguity to get a 32*32M=1GB option.
> 
> Forcing to only align to the PMD or PUD seems suboptimal..

Right, however the cont-pte / cont-pmd are still not supported in huge
pfnmaps in general?  It'll definitely be nice if someone could look at that
from ARM perspective, then provide support of both in one shot.

> 
> > +fallback:
> > +	return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, flags);
> 
> Why not put this into mm_get_unmapped_area_vmflags() and get rid of
> thp_get_unmapped_area_vmflags() too?
> 
> Is there any reason the caller should have to do a retry?

We would still need thp_get_unmapped_area_vmflags() because that encodes
PMD_SIZE for THPs; we need the flexibility of providing any size alignment
as a generic helper.

But I get your point.  For example, mm_get_unmapped_area_aligned() could
still fall back to mm_get_unmapped_area_vmflags() automatically.

That was ok, however that loses some flexibility when the caller wants to
try with different alignments, exactly like above: currently, it was trying
to do a first attempt of PUD mapping then fallback to PMD if that fails.

Indeed I don't know whether such fallback would help in our unit tests.  But
logically speaking we'll need to look into every arch's VA allocator to
know when it might fail with bigger allocations, and if PUD fails it's
still sensible to retry with PMD if available.  From that POV, we
don't want to immediately fall back to 4K if 1G fails.

Thanks,

-- 
Peter Xu
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Jason Gunthorpe 3 months, 4 weeks ago
On Fri, Jun 13, 2025 at 11:26:40AM -0400, Peter Xu wrote:
> On Fri, Jun 13, 2025 at 11:29:03AM -0300, Jason Gunthorpe wrote:
> > On Fri, Jun 13, 2025 at 09:41:11AM -0400, Peter Xu wrote:
> > 
> > > +	/* Choose the alignment */
> > > +	if (IS_ENABLED(CONFIG_ARCH_SUPPORTS_PUD_PFNMAP) && phys_len >= PUD_SIZE) {
> > > +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > > +						   flags, PUD_SIZE, 0);
> > > +		if (ret)
> > > +			return ret;
> > > +	}
> > > +
> > > +	if (phys_len >= PMD_SIZE) {
> > > +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > > +						   flags, PMD_SIZE, 0);
> > > +		if (ret)
> > > +			return ret;
> > > +	}
> > 
> > Hurm, we have contiguous pages now, so PMD_SIZE is not so great, eg on
> > 4k ARM we can have a 16*2M=32MB contiguity, and 16k ARM uses
> > contiguity to get a 32*32M=1GB option.
> > 
> > Forcing to only align to the PMD or PUD seems suboptimal..
> 
> Right, however the cont-pte / cont-pmd are still not supported in huge
> pfnmaps in general?  It'll definitely be nice if someone could look at that
> from ARM perspective, then provide support of both in one shot.

Maybe leave behind a comment about this. I've been poking around to see if
someone would do the ARM PFNMAP support but can't report any commitment.

> > > +fallback:
> > > +	return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, flags);
> > 
> > Why not put this into mm_get_unmapped_area_vmflags() and get rid of
> > thp_get_unmapped_area_vmflags() too?
> > 
> > Is there any reason the caller should have to do a retry?
> 
> We would still need thp_get_unmapped_area_vmflags() because that encodes
> PMD_SIZE for THPs; we need the flexibility of providing any size alignment
> as a generic helper.

There is only one caller for thp_get_unmapped_area_vmflags(), just
open code PMD_SIZE there and thin this whole thing out. It reads
better like that anyhow:

	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && !file
		   && !addr /* no hint */
		   && IS_ALIGNED(len, PMD_SIZE)) {
		/* Ensures that larger anonymous mappings are THP aligned. */
		addr = mm_get_unmapped_area_aligned(file, 0, len, pgoff,
						    flags, vm_flags, PMD_SIZE);

> That was ok, however that loses some flexibility when the caller wants to
> try with different alignments, exactly like above: currently, it was trying
> to do a first attempt of PUD mapping then fallback to PMD if that fails.

Oh, that's a good point, I didn't notice that subtle bit.

But then maybe that is showing the API is just wrong and the core code
should be trying to find the best alignment not the caller. Like we
can have those PUD/PMD size ifdefs inside the mm instead of in VFIO?

VFIO would just pass the BAR size, implying the best alignment, and
the core implementation will try to get the largest VMA alignment that
snaps to an arch supported page contiguity, testing each of the arches
page size possibilities in turn.

That sounds like a much better API than pushing this into drivers??

Jason
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Peter Xu 3 months, 4 weeks ago
On Fri, Jun 13, 2025 at 01:09:56PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 13, 2025 at 11:26:40AM -0400, Peter Xu wrote:
> > On Fri, Jun 13, 2025 at 11:29:03AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Jun 13, 2025 at 09:41:11AM -0400, Peter Xu wrote:
> > > 
> > > > +	/* Choose the alignment */
> > > > +	if (IS_ENABLED(CONFIG_ARCH_SUPPORTS_PUD_PFNMAP) && phys_len >= PUD_SIZE) {
> > > > +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > > > +						   flags, PUD_SIZE, 0);
> > > > +		if (ret)
> > > > +			return ret;
> > > > +	}
> > > > +
> > > > +	if (phys_len >= PMD_SIZE) {
> > > > +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > > > +						   flags, PMD_SIZE, 0);
> > > > +		if (ret)
> > > > +			return ret;
> > > > +	}
> > > 
> > > Hurm, we have contiguous pages now, so PMD_SIZE is not so great, eg on
> > > 4k ARM we can have a 16*2M=32MB contiguity, and 16k ARM uses
> > > contiguity to get a 32*32M=1GB option.
> > > 
> > > Forcing to only align to the PMD or PUD seems suboptimal..
> > 
> > Right, however the cont-pte / cont-pmd are still not supported in huge
> > pfnmaps in general?  It'll definitely be nice if someone could look at that
> > from ARM perspective, then provide support of both in one shot.
> 
> Maybe leave behind a comment about this. I've been poking around to see if
> someone would do the ARM PFNMAP support but can't report any commitment.

I didn't know the best place to put a note for the whole pfnmap
effort, but I added one to the commit message of this patch:

        Note 2: Currently continuous pgtable entries (for example, cont-pte) are not
        yet supported for huge pfnmaps in general.  They are also not considered in
        this patch so far.  Separate work will be needed to enable continuous
        pgtable entries on archs that support it.

> 
> > > > +fallback:
> > > > +	return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, flags);
> > > 
> > > Why not put this into mm_get_unmapped_area_vmflags() and get rid of
> > > thp_get_unmapped_area_vmflags() too?
> > > 
> > > Is there any reason the caller should have to do a retry?
> > 
> > We would still need thp_get_unmapped_area_vmflags() because that encodes
> > PMD_SIZE for THPs; we need the flexibility of providing any size alignment
> > as a generic helper.
> 
> There is only one caller for thp_get_unmapped_area_vmflags(), just
> open code PMD_SIZE there and thin this whole thing out. It reads
> better like that anyhow:
> 
> 	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && !file
> 		   && !addr /* no hint */
> 		   && IS_ALIGNED(len, PMD_SIZE)) {
> 		/* Ensures that larger anonymous mappings are THP aligned. */
> 		addr = mm_get_unmapped_area_aligned(file, 0, len, pgoff,
> 						    flags, vm_flags, PMD_SIZE);
> 
> > That was ok, however that loses some flexibility when the caller wants to
> > try with different alignments, exactly like above: currently, it was trying
> > to do a first attempt of PUD mapping then fallback to PMD if that fails.
> 
> Oh, that's a good point, I didn't notice that subtle bit.
> 
> But then maybe that is showing the API is just wrong and the core code
> should be trying to find the best alignment not the caller. Like we
> can have those PUD/PMD size ifdefs inside the mm instead of in VFIO?
> 
> VFIO would just pass the BAR size, implying the best alignment, and
> the core implementation will try to get the largest VMA alignment that
> snaps to an arch supported page contiguity, testing each of the arches
> page size possibilities in turn.
> 
> That sounds like a much better API than pushing this into drivers??

Yes it would be nice if the core mm can evolve to make supporting such
easier.  Though the question is how to pass information over to core mm.

For example, currently a vfio device file represents the whole device, and
it's also VFIO that defines what the MMIO region offsets mean.  So core mm
has no simple idea which BAR VFIO is mapping if it only receives a mmap()
request.  So even if we assume the core mm provides some vma flag showing
that, it won't be per-vma, but would need to be decided case by case for
each mmap() request, based at least on the pgoff and len being mapped.

And it's definitely the case that a single device's BARs have different
sizes, hence different mmap() calls ask for different alignments even on
the same device fd.

It's similar to many other use cases of get_unmapped_area() users.  For
example, see v4l2_m2m_get_unmapped_area() which has similar treatment on at
least knowing which part of the file was being mapped:

	if (offset < DST_QUEUE_OFF_BASE) {
		vq = v4l2_m2m_get_src_vq(fh->m2m_ctx);
	} else {
		vq = v4l2_m2m_get_dst_vq(fh->m2m_ctx);
		pgoff -= (DST_QUEUE_OFF_BASE >> PAGE_SHIFT);
	}

Such flexibility might still be needed for now until we know how to provide
the abstraction.

Meanwhile, there can be other constraints for existing get_unmapped_area()
users, where a decision might depend on any parameter passed into it
besides the pgoff.. so even if we provide the whole pgoff info, it might
not be enough.

-- 
Peter Xu
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Jason Gunthorpe 3 months, 4 weeks ago
On Fri, Jun 13, 2025 at 03:15:19PM -0400, Peter Xu wrote:
> > > > > +	if (phys_len >= PMD_SIZE) {
> > > > > +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > > > > +						   flags, PMD_SIZE, 0);
> > > > > +		if (ret)
> > > > > +			return ret;
> > > > > +	}
> > > > 
> > > > Hurm, we have contiguous pages now, so PMD_SIZE is not so great, eg on
> > > > 4k ARM we can have a 16*2M=32MB contiguity, and 16k ARM uses
> > > > contiguity to get a 32*32M=1GB option.
> > > > 
> > > > Forcing to only align to the PMD or PUD seems suboptimal..
> > > 
> > > Right, however the cont-pte / cont-pmd are still not supported in huge
> > > pfnmaps in general?  It'll definitely be nice if someone could look at that
> > > from ARM perspective, then provide support of both in one shot.
> > 
> > Maybe leave behind a comment about this. I've been poking around to see if
> > someone would do the ARM PFNMAP support but can't report any commitment.
> 
> I didn't know the best place to put a note for the whole pfnmap
> effort, but I added one to the commit message of this patch:
> 
>         Note 2: Currently continuous pgtable entries (for example, cont-pte) are not
>         yet supported for huge pfnmaps in general.  They are also not considered in
>         this patch so far.  Separate work will be needed to enable continuous
>         pgtable entries on archs that support it.
> 
> > 
> > > > > +fallback:
> > > > > +	return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, flags);
> > > > 
> > > > Why not put this into mm_get_unmapped_area_vmflags() and get rid of
> > > > thp_get_unmapped_area_vmflags() too?
> > > > 
> > > > Is there any reason the caller should have to do a retry?
> > > 
> > > We would still need thp_get_unmapped_area_vmflags() because that encodes
> > > PMD_SIZE for THPs; we need the flexibility of providing any size alignment
> > > as a generic helper.
> > 
> > There is only one caller for thp_get_unmapped_area_vmflags(), just
> > open code PMD_SIZE there and thin this whole thing out. It reads
> > better like that anyhow:
> > 
> > 	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && !file
> > 		   && !addr /* no hint */
> > 		   && IS_ALIGNED(len, PMD_SIZE)) {
> > 		/* Ensures that larger anonymous mappings are THP aligned. */
> > 		addr = mm_get_unmapped_area_aligned(file, 0, len, pgoff,
> > 						    flags, vm_flags, PMD_SIZE);
> > 
> > > That was ok, however that loses some flexibility when the caller wants to
> > > try with different alignments, exactly like above: currently, it was trying
> > > to do a first attempt of PUD mapping then fallback to PMD if that fails.
> > 
> > Oh, that's a good point, I didn't notice that subtle bit.
> > 
> > But then maybe that is showing the API is just wrong and the core code
> > should be trying to find the best alignment not the caller. Like we
> > can have those PUD/PMD size ifdefs inside the mm instead of in VFIO?
> > 
> > VFIO would just pass the BAR size, implying the best alignment, and
> > the core implementation will try to get the largest VMA alignment that
> > snaps to an arch supported page contiguity, testing each of the arches
> > page size possibilities in turn.
> > 
> > That sounds like a much better API than pushing this into drivers??
> 
> Yes it would be nice if the core mm can evolve to make supporting such
> easier.  Though the question is how to pass information over to core mm.

I was just thinking of something simple: change how your new
mm_get_unmapped_area_aligned() works so that the caller is expected to
pass in the size of the biggest folio/pfn page as align.

The mm_get_unmapped_area_aligned() returns a vm address that
will result in large mappings.

pgoff works the same way: the assumption is that the biggest folio is at
pgoff 0, followed by another biggest folio, so the pgoff logic tries to
make the second folio map fully.

ie what a hugetlb fd or thp memfd would like.
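The off/align arithmetic described here (which the existing __thp_get_unmapped_area already performs) can be sketched in userspace C.  align_va_to_off() is a hypothetical name used only for illustration, not kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Shift an unaligned candidate VA forward so that the mapping's file
 * offset 'off' lines up on an 'align' boundary in VA space, i.e. so
 * that (va - off) becomes a multiple of 'align'.  'align' must be a
 * power of two; unsigned wraparound in the subtraction is intended. */
static uint64_t align_va_to_off(uint64_t candidate, uint64_t off,
				uint64_t align)
{
	/* distance the candidate must advance to share off's alignment */
	uint64_t off_sub = (off - candidate) & (align - 1);

	return candidate + off_sub;
}
```

With a 2MB align, a candidate of 0x7f0000001000 mapping file offset 0x200000 is advanced to 0x7f0000200000, so the second "biggest folio" starting at that offset maps fully with large leaf entries.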

Then you still hook the file operations and still figure out what BAR
and so on to call mm_get_unmapped_area_aligned() with the correct
aligned parameter.

mm_get_unmapped_area_aligned() goes through the supported page sizes
of the arch and selects the best one for the indicated biggest folio.

If we were happy writing this in vfio then it can work just as well in
the core mm side.

> It's similar to many other use cases of get_unmapped_area() users.  For
> example, see v4l2_m2m_get_unmapped_area() which has similar treatment on at
> least knowing which part of the file was being mapped:
> 
> 	if (offset < DST_QUEUE_OFF_BASE) {
> 		vq = v4l2_m2m_get_src_vq(fh->m2m_ctx);
> 	} else {
> 		vq = v4l2_m2m_get_dst_vq(fh->m2m_ctx);
> 		pgoff -= (DST_QUEUE_OFF_BASE >> PAGE_SHIFT);
> 	}

Careful, that's only used for nommu :)

Jason
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Peter Xu 3 months, 3 weeks ago
On Fri, Jun 13, 2025 at 08:16:57PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 13, 2025 at 03:15:19PM -0400, Peter Xu wrote:
> > > > > > +	if (phys_len >= PMD_SIZE) {
> > > > > > +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > > > > > +						   flags, PMD_SIZE, 0);
> > > > > > +		if (ret)
> > > > > > +			return ret;
> > > > > > +	}
> > > > > 
> > > > > Hurm, we have contiguous pages now, so PMD_SIZE is not so great, eg on
> > > > > 4k ARM with we can have a 16*2M=32MB contiguity, and 16k ARM uses
> > > > > contiguity to get a 32*16k=1GB option.
> > > > > 
> > > > > Forcing to only align to the PMD or PUD seems suboptimal..
> > > > 
> > > > Right, however the cont-pte / cont-pmd are still not supported in huge
> > > > pfnmaps in general?  It'll definitely be nice if someone could look at that
> > > > from ARM perspective, then provide support of both in one shot.
> > > 
> > > Maybe leave behind a comment about this. I've been poking around if
> > > someone would do the ARM PFNMAP support but can't report any commitment.
> > 
> > I didn't know what's the best part to take a note for the whole pfnmap
> > effort, but I added a note into the commit message on this patch:
> > 
> >         Note 2: Currently continuous pgtable entries (for example, cont-pte)
> >         are not yet supported for huge pfnmaps in general.  They are also not
> >         yet considered in this patch.  Separate work will be needed to enable
> >         continuous pgtable entries on archs that support them.
> > 
> > > 
> > > > > > +fallback:
> > > > > > +	return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, flags);
> > > > > 
> > > > > Why not put this into mm_get_unmapped_area_vmflags() and get rid of
> > > > > thp_get_unmapped_area_vmflags() too?
> > > > > 
> > > > > Is there any reason the caller should have to do a retry?
> > > > 
> > > > We would still need thp_get_unmapped_area_vmflags() because that encodes
> > > > PMD_SIZE for THPs; we need the flexibility of providing any size alignment
> > > > as a generic helper.
> > > 
> > > There is only one caller for thp_get_unmapped_area_vmflags(), just
> > > open code PMD_SIZE there and thin this whole thing out. It reads
> > > better like that anyhow:
> > > 
> > > 	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && !file
> > > 		   && !addr /* no hint */
> > > 		   && IS_ALIGNED(len, PMD_SIZE)) {
> > > 		/* Ensures that larger anonymous mappings are THP aligned. */
> > > 		addr = mm_get_unmapped_area_aligned(file, 0, len, pgoff,
> > > 						    flags, vm_flags, PMD_SIZE);
> > > 
> > > > That was ok, however that loses some flexibility when the caller wants to
> > > > try with different alignments, exactly like above: currently, it was trying
> > > > to do a first attempt of PUD mapping then fallback to PMD if that fails.
> > > 
> > > Oh, that's a good point, I didn't notice that subtle bit.
> > > 
> > > But then maybe that is showing the API is just wrong and the core code
> > > should be trying to find the best alignment not the caller. Like we
> > > can have those PUD/PMD size ifdefs inside the mm instead of in VFIO?
> > > 
> > > VFIO would just pass the BAR size, implying the best alignment, and
> > > the core implementation will try to get the largest VMA alignment that
> > > snaps to an arch supported page contiguity, testing each of the arches
> > > page size possibilities in turn.
> > > 
> > > That sounds like a much better API than pushing this into drivers??
> > 
> > Yes it would be nice if the core mm can evolve to make supporting such
> > easier.  Though the question is how to pass information over to core mm.
> 
> I was just thinking of something simple: change how your new
> mm_get_unmapped_area_aligned() works so that the caller is expected to
> pass in the size of the biggest folio/pfn page as align.
> 
> The mm_get_unmapped_area_aligned() returns a vm address that
> will result in large mappings.
> 
> pgoff works the same way: the assumption is that the biggest folio is at
> pgoff 0, followed by another biggest folio, so the pgoff logic tries to
> make the second folio map fully.
> 
> ie what a hugetlb fd or thp memfd would like.
> 
> Then you still hook the file operations and still figure out what BAR
> and so on to call mm_get_unmapped_area_aligned() with the correct
> aligned parameter.
> 
> mm_get_unmapped_area_aligned() goes through the supported page sizes
> of the arch and selects the best one for the indicated biggest folio
> 
> If we were happy writing this in vfio then it can work just as well in
> the core mm side.

So far, the new vfio_pci_core_get_unmapped_area() does almost only VFIO's
own stuff, except that it retries with different sizes.

Can I understand it as a suggestion to pass in a bitmask into the core mm
API (e.g. keep the name of mm_get_unmapped_area_aligned()), instead of a
constant "align", so that core mm would try to allocate from the largest
size to smaller until it finds some working VA to use?

> 
> > It's similar to many other use cases of get_unmapped_area() users.  For
> > example, see v4l2_m2m_get_unmapped_area() which has similar treatment on at
> > least knowing which part of the file was being mapped:
> > 
> > 	if (offset < DST_QUEUE_OFF_BASE) {
> > 		vq = v4l2_m2m_get_src_vq(fh->m2m_ctx);
> > 	} else {
> > 		vq = v4l2_m2m_get_dst_vq(fh->m2m_ctx);
> > 		pgoff -= (DST_QUEUE_OFF_BASE >> PAGE_SHIFT);
> > 	}
> 
> Careful, that's only used for nommu :)

My fault, please ignore it.. :)

I'm also surprised it is even available for !MMU.. but I decided not to
dig into that any more today.

Thanks,

-- 
Peter Xu
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Jason Gunthorpe 3 months, 3 weeks ago
On Mon, Jun 16, 2025 at 06:06:23PM -0400, Peter Xu wrote:

> Can I understand it as a suggestion to pass in a bitmask into the core mm
> API (e.g. keep the name of mm_get_unmapped_area_aligned()), instead of a
> constant "align", so that core mm would try to allocate from the largest
> size to smaller until it finds some working VA to use?

I don't think you need a bitmask.

Split the concerns, the caller knows what is inside its FD. It only
needs to provide the highest pgoff-aligned folio/pfn within the FD.

The mm knows what leaf page tables options exist. It should try to
align to the closest leaf page table size that is <= the FD's max
aligned folio.

Higher alignment would be wasteful of address space.

Lower alignment misses an opportunity to create large leaf PTEs.
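If that selection moved into the core mm, it could look like the sketch below.  This is a userspace illustration; pick_leaf_size() and the x86-64-style size list (1G PUD, 2M PMD, 4K PTE) are assumptions, not existing kernel API:

```c
#include <stddef.h>
#include <stdint.h>

/* Walk the arch's leaf page-table sizes from largest to smallest and
 * return the biggest one that is <= the FD's largest pgoff-aligned
 * folio/pfn chunk.  Larger would waste address space on alignment;
 * smaller would miss a chance to install large leaf entries. */
static uint64_t pick_leaf_size(uint64_t max_aligned_chunk)
{
	static const uint64_t leaf_sizes[] = {
		1ULL << 30,	/* PUD */
		2ULL << 20,	/* PMD */
		1ULL << 12,	/* PTE */
	};

	for (size_t i = 0; i < sizeof(leaf_sizes) / sizeof(leaf_sizes[0]); i++)
		if (leaf_sizes[i] <= max_aligned_chunk)
			return leaf_sizes[i];

	return 1ULL << 12;	/* base page as a last resort */
}
```

For a 32MB BAR this picks 2MB (PMD), while a 1GB BAR gets the full 1GB (PUD) alignment; cont-pte/cont-pmd sizes could later slot into the same table.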

Jason
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Peter Xu 3 months, 3 weeks ago
On Mon, Jun 16, 2025 at 08:00:11PM -0300, Jason Gunthorpe wrote:
> On Mon, Jun 16, 2025 at 06:06:23PM -0400, Peter Xu wrote:
> 
> > Can I understand it as a suggestion to pass in a bitmask into the core mm
> > API (e.g. keep the name of mm_get_unmapped_area_aligned()), instead of a
> > constant "align", so that core mm would try to allocate from the largest
> > size to smaller until it finds some working VA to use?
> 
> I don't think you need a bitmask.
> 
> Split the concerns, the caller knows what is inside it's FD. It only
> needs to provide the highest pgoff aligned folio/pfn within the FD.

Ultimately I even dropped this hint.  I found that it's not really
get_unmapped_area()'s job to detect over-sized pgoffs.  It's mmap()'s job.
So I decided to avoid this parameter as of now.

> 
> The mm knows what leaf page tables options exist. It should try to
> align to the closest leaf page table size that is <= the FD's max
> aligned folio.

So again IMHO this is also not per-FD information, but needs to be passed
over from the driver for each call.

Likely the "order" parameter appeared in other discussions to imply a
maximum supported size from the driver side (or, for a folio, but that is
definitely another user after this series can land).

So far I haven't added the "order" parameter, because currently VFIO
definitely supports all max orders the system supports.  Maybe we can add
the order when there's a real need, but maybe it won't happen in the near
future?

In summary, I came up with the changes below; would they look reasonable?

I also added Liam and Lorenzo in this reply.

================8<================

From 7f1b7aada21ab036849edc49635fb0656e0457c4 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Fri, 30 May 2025 12:45:55 -0400
Subject: [PATCH 1/4] mm: Rename __thp_get_unmapped_area to
 mm_get_unmapped_area_aligned

This function is handy for locating an unmapped region that is best aligned
to the specified alignment, taking whatever form of pgoff address space
into consideration.

Rename the function and make it more general for non-THP use in follow-up
patches.  Drop "THP" from the name because it doesn't have much to do with
THP internally.  The suffix "_aligned" implies it is a helper to generate
an aligned virtual address based on the specified alignment (which need
not be PMD_SIZE).

While at it, use the check_add_overflow() helpers to verify the inputs and
make sure no overflow can happen.

Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/huge_memory.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4734de1dc0ae..885b5845dbba 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1088,23 +1088,28 @@ static inline bool is_transparent_hugepage(const struct folio *folio)
 		folio_test_large_rmappable(folio);
 }
 
-static unsigned long __thp_get_unmapped_area(struct file *filp,
+static unsigned long mm_get_unmapped_area_aligned(struct file *filp,
 		unsigned long addr, unsigned long len,
-		loff_t off, unsigned long flags, unsigned long size,
+		loff_t off, unsigned long flags, unsigned long align,
 		vm_flags_t vm_flags)
 {
-	loff_t off_end = off + len;
-	loff_t off_align = round_up(off, size);
+	loff_t off_end;
+	loff_t off_align = round_up(off, align);
 	unsigned long len_pad, ret, off_sub;
 
 	if (!IS_ENABLED(CONFIG_64BIT) || in_compat_syscall())
 		return 0;
 
-	if (off_end <= off_align || (off_end - off_align) < size)
+	/* Can't use the overflow API, do manual check for now */
+	if (off_align < off)
 		return 0;
-
-	len_pad = len + size;
-	if (len_pad < len || (off + len_pad) < off)
+	if (check_add_overflow(off, len, &off_end))
+		return 0;
+	if (off_end <= off_align || (off_end - off_align) < align)
+		return 0;
+	if (check_add_overflow(len, align, &len_pad))
+		return 0;
+	if ((off + len_pad) < off)
 		return 0;
 
 	ret = mm_get_unmapped_area_vmflags(current->mm, filp, addr, len_pad,
@@ -1118,16 +1123,16 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
 		return 0;
 
 	/*
-	 * Do not try to align to THP boundary if allocation at the address
+	 * Do not try to provide alignment if allocation at the address
 	 * hint succeeds.
 	 */
 	if (ret == addr)
 		return addr;
 
-	off_sub = (off - ret) & (size - 1);
+	off_sub = (off - ret) & (align - 1);
 
 	if (test_bit(MMF_TOPDOWN, &current->mm->flags) && !off_sub)
-		return ret + size;
+		return ret + align;
 
 	ret += off_sub;
 	return ret;
@@ -1140,7 +1145,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 	unsigned long ret;
 	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
 
-	ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE, vm_flags);
+	ret = mm_get_unmapped_area_aligned(filp, addr, len, off, flags,
+					   PMD_SIZE, vm_flags);
 	if (ret)
 		return ret;
 
-- 
2.49.0


From 709379a39f4a59a6d3bda7a39ca55f08fdaf9e1a Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Tue, 17 Jun 2025 15:27:07 -0400
Subject: [PATCH 2/4] mm: huge_mapping_get_va_aligned() helper

Add this helper to allocate a VA that is best suited for mapping the huge
mappings the system supports.  It can be used in a file's
get_unmapped_area() function as long as a proper max_pgoff is provided, so
that core mm will know the available range of pgoff to be mapped in the
future.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/huge_mm.h | 10 ++++++++-
 mm/huge_memory.c        | 46 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..59fdafb1034b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -339,7 +339,8 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags,
 		vm_flags_t vm_flags);
-
+unsigned long huge_mapping_get_va_aligned(struct file *filp, unsigned long addr,
+		unsigned long len, unsigned long pgoff, unsigned long flags);
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
@@ -543,6 +544,13 @@ thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
 	return 0;
 }
 
+static inline unsigned long
+huge_mapping_get_va_aligned(struct file *filp, unsigned long addr,
+		unsigned long len, unsigned long pgoff, unsigned long flags)
+{
+	return mm_get_unmapped_area(current->mm, filp, addr, len, pgoff, flags);
+}
+
 static inline bool
 can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 885b5845dbba..bc016b656dc7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1161,6 +1161,52 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 }
 EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
 
+/**
+ * huge_mapping_get_va_aligned: best-effort VA allocation for huge mappings
+ *
+ * @filp: file target of the mmap() request
+ * @addr: hint address from mmap() request
+ * @len: len of the mmap() request
+ * @pgoff: file offset of the mmap() request
+ * @flags: flags of the mmap() request
+ *
+ * This function should normally be used by a driver's specific
+ * get_unmapped_area() handler to provide a huge-mapping friendly virtual
+ * address for a specific mmap() request.  The caller should pass in most
+ * of the parameters from the get_unmapped_area() request.
+ *
+ * Normally it means the caller's mmap() needs to also support any possible
+ * huge mappings the system supports.
+ *
+ * Return: a best-effort virtual address that will satisfy the most huge
+ * mappings for the result VMA to be mapped.
+ */
+unsigned long huge_mapping_get_va_aligned(struct file *filp, unsigned long addr,
+		unsigned long len, unsigned long pgoff, unsigned long flags)
+{
+	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
+	unsigned long ret;
+
+	/* TODO: support continuous ptes/pmds */
+	if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) &&
+	    len >= PUD_SIZE) {
+		ret = mm_get_unmapped_area_aligned(filp, addr, len, off, flags,
+						   PUD_SIZE, 0);
+		if (ret)
+			return ret;
+	}
+
+	if (len >= PMD_SIZE) {
+		ret = mm_get_unmapped_area_aligned(filp, addr, len, off, flags,
+						   PMD_SIZE, 0);
+		if (ret)
+			return ret;
+	}
+
+	return mm_get_unmapped_area(current->mm, filp, addr, len, pgoff, flags);
+}
+EXPORT_SYMBOL_GPL(huge_mapping_get_va_aligned);
+
 static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
 		unsigned long addr)
 {
-- 
2.49.0


From ff90dbba05ea54e5c6690fbedf330c837f8f0ea1 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Wed, 4 Jun 2025 17:54:40 -0400
Subject: [PATCH 3/4] vfio: Introduce vfio_device_ops.get_unmapped_area hook

Add a hook to vfio_device_ops to allow sub-modules to provide virtual
addresses for an mmap() request.

Note that the fallback will be mm_get_unmapped_area(), which should
maintain the old behavior of generic VA allocation (__get_unmapped_area).
It's a bit unfortunate that this is needed, as the current
get_unmapped_area() file op cannot return a value that means "fall back to
the default".  So the fallback is needed both here and wherever a
sub-module opts in with its own hook.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 drivers/vfio/vfio_main.c | 25 +++++++++++++++++++++++++
 include/linux/vfio.h     |  8 ++++++++
 2 files changed, 33 insertions(+)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 1fd261efc582..480cc2398810 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1354,6 +1354,28 @@ static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
 	return device->ops->mmap(device, vma);
 }
 
+#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
+static unsigned long vfio_device_get_unmapped_area(struct file *file,
+						   unsigned long addr,
+						   unsigned long len,
+						   unsigned long pgoff,
+						   unsigned long flags)
+{
+	struct vfio_device_file *df = file->private_data;
+	struct vfio_device *device = df->device;
+	unsigned long ret;
+
+	if (device->ops->get_unmapped_area) {
+		ret = device->ops->get_unmapped_area(device, file, addr,
+						      len, pgoff, flags);
+		if (ret)
+			return ret;
+	}
+
+	return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, flags);
+}
+#endif
+
 const struct file_operations vfio_device_fops = {
 	.owner		= THIS_MODULE,
 	.open		= vfio_device_fops_cdev_open,
@@ -1363,6 +1385,9 @@ const struct file_operations vfio_device_fops = {
 	.unlocked_ioctl	= vfio_device_fops_unl_ioctl,
 	.compat_ioctl	= compat_ptr_ioctl,
 	.mmap		= vfio_device_fops_mmap,
+#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
+	.get_unmapped_area = vfio_device_get_unmapped_area,
+#endif
 };
 
 static struct vfio_device *vfio_device_from_file(struct file *file)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 707b00772ce1..d900541e2716 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -108,6 +108,8 @@ struct vfio_device {
  * @dma_unmap: Called when userspace unmaps IOVA from the container
  *             this device is attached to.
  * @device_feature: Optional, fill in the VFIO_DEVICE_FEATURE ioctl
+ * @get_unmapped_area: Optional, provide virtual address hint for mmap().
+ *                     If zero is returned, fallback to the default allocator.
  */
 struct vfio_device_ops {
 	char	*name;
@@ -135,6 +137,12 @@ struct vfio_device_ops {
 	void	(*dma_unmap)(struct vfio_device *vdev, u64 iova, u64 length);
 	int	(*device_feature)(struct vfio_device *device, u32 flags,
 				  void __user *arg, size_t argsz);
+	unsigned long (*get_unmapped_area)(struct vfio_device *device,
+					   struct file *file,
+					   unsigned long addr,
+					   unsigned long len,
+					   unsigned long pgoff,
+					   unsigned long flags);
 };
 
 #if IS_ENABLED(CONFIG_IOMMUFD)
-- 
2.49.0

From 38539aafac83ae204d3e03f441f7e33841db6b07 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Fri, 30 May 2025 13:21:20 -0400
Subject: [PATCH 4/4] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED
 mappings

This patch enables best-effort mmap() for vfio-pci bars even without
MAP_FIXED, so as to utilize huge pfnmaps as much as possible.  It should
also avoid userspace changes (switching to MAP_FIXED with pre-aligned VA
addresses) to start enabling huge pfnmaps on VFIO bars.

Here the trick is making sure the MMIO PFNs will be aligned with the VAs
allocated from mmap() when !MAP_FIXED, so that whatever returned from
mmap(!MAP_FIXED) of vfio-pci MMIO regions will be automatically suitable
for huge pfnmaps as much as possible.

To achieve that, a custom vfio_device's get_unmapped_area() for vfio-pci
devices is needed.

Note, MMIO physical addresses should normally be guaranteed to be always
bar-size aligned, hence the bar offset can logically be directly used to do
the calculation.  However to make it strict and clear (rather than relying
on spec details), we still try to fetch the bar's physical addresses from
pci_dev.resource[].

[1] https://lore.kernel.org/linux-pci/20250529214414.1508155-1-amastro@fb.com/

Reported-by: Alex Mastro <amastro@fb.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 drivers/vfio/pci/vfio_pci.c      |  1 +
 drivers/vfio/pci/vfio_pci_core.c | 34 ++++++++++++++++++++++++++++++++
 include/linux/vfio_pci_core.h    |  3 +++
 3 files changed, 38 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 5ba39f7623bb..32b570f17d0f 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -144,6 +144,7 @@ static const struct vfio_device_ops vfio_pci_ops = {
 	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
 	.pasid_attach_ioas	= vfio_iommufd_physical_pasid_attach_ioas,
 	.pasid_detach_ioas	= vfio_iommufd_physical_pasid_detach_ioas,
+	.get_unmapped_area	= vfio_pci_core_get_unmapped_area,
 };
 
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 6328c3a05bcd..5392bec4929a 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1641,6 +1641,40 @@ static unsigned long vma_to_pfn(struct vm_area_struct *vma)
 	return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
 }
 
+/*
+ * Hint function to provide mmap() virtual address candidate so as to be
+ * able to map huge pfnmaps as much as possible.  It is done by aligning
+ * the VA to the PFN to be mapped in the specific bar.
+ *
+ * Note that this function does the minimum check on mmap() parameters to
+ * make the PFN calculation valid only. The majority of mmap() sanity check
+ * will be done later in mmap().
+ */
+unsigned long vfio_pci_core_get_unmapped_area(struct vfio_device *device,
+		struct file *file, unsigned long addr, unsigned long len,
+		unsigned long pgoff, unsigned long flags)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+	struct pci_dev *pdev = vdev->pdev;
+	unsigned int index = pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+	unsigned long req_start;
+
+	/* Currently, only bars 0-5 support huge pfnmaps */
+	if (index >= VFIO_PCI_ROM_REGION_INDEX)
+		return 0;
+
+	/* Calculate the start of physical address to be mapped */
+	req_start = (pgoff << PAGE_SHIFT) & ((1UL << VFIO_PCI_OFFSET_SHIFT) - 1);
+	if (check_add_overflow(req_start, pci_resource_start(pdev, index),
+			       &req_start))
+		return 0;
+
+	return huge_mapping_get_va_aligned(file, addr, len, req_start >> PAGE_SHIFT,
+					   flags);
+}
+EXPORT_SYMBOL_GPL(vfio_pci_core_get_unmapped_area);
+
 static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
 					   unsigned int order)
 {
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index fbb472dd99b3..d97c920b4dbf 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -119,6 +119,9 @@ ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
 		size_t count, loff_t *ppos);
 ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
 		size_t count, loff_t *ppos);
+unsigned long vfio_pci_core_get_unmapped_area(struct vfio_device *device,
+		struct file *file, unsigned long addr, unsigned long len,
+		unsigned long pgoff, unsigned long flags);
 int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma);
 void vfio_pci_core_request(struct vfio_device *core_vdev, unsigned int count);
 int vfio_pci_core_match(struct vfio_device *core_vdev, char *buf);
-- 
2.49.0
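As a cross-check of the pgoff arithmetic in the proposed
vfio_pci_core_get_unmapped_area(), the same math can be modeled in
userspace C.  bar_phys_pgoff() is a made-up name; VFIO_PCI_OFFSET_SHIFT
mirrors the kernel's 40-bit region-offset encoding, and the bar base in
the test is an arbitrary example value:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT		12
#define VFIO_PCI_OFFSET_SHIFT	40	/* mirrors vfio-pci's region encoding */

/* Strip the region-index bits from a vfio-pci mmap pgoff to recover the
 * byte offset into the bar, add the bar's physical base, and return the
 * resulting physical pgoff — the value the allocated VA must be aligned
 * against for huge pfnmaps to be possible. */
static uint64_t bar_phys_pgoff(uint64_t pgoff, uint64_t bar_phys_start)
{
	uint64_t req_start = (pgoff << PAGE_SHIFT) &
			     ((1ULL << VFIO_PCI_OFFSET_SHIFT) - 1);

	return (req_start + bar_phys_start) >> PAGE_SHIFT;
}
```

For example, mapping offset 0x1000 of region index 1 (pgoff = (1 << 28) | 1) of a bar at physical 0xE0000000 yields a physical pgoff of 0xE0001, which huge_mapping_get_va_aligned() then uses in place of the raw file pgoff.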


-- 
Peter Xu
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Jason Gunthorpe 3 months, 3 weeks ago
On Tue, Jun 17, 2025 at 04:56:13PM -0400, Peter Xu wrote:
> On Mon, Jun 16, 2025 at 08:00:11PM -0300, Jason Gunthorpe wrote:
> > On Mon, Jun 16, 2025 at 06:06:23PM -0400, Peter Xu wrote:
> > 
> > > Can I understand it as a suggestion to pass in a bitmask into the core mm
> > > API (e.g. keep the name of mm_get_unmapped_area_aligned()), instead of a
> > > constant "align", so that core mm would try to allocate from the largest
> > > size to smaller until it finds some working VA to use?
> > 
> > I don't think you need a bitmask.
> > 
> > Split the concerns, the caller knows what is inside it's FD. It only
> > needs to provide the highest pgoff aligned folio/pfn within the FD.
> 
> Ultimately I even dropped this hint.  I found that it's not really
> get_unmapped_area()'s job to detect over-sized pgoffs.  It's mmap()'s job.
> So I decided to avoid this parameter as of now.

Well, the point of the pgoff is only what you said earlier, to adjust
the starting alignment so the pgoff-aligned high-order folios/pfns
line up properly.

> > The mm knows what leaf page tables options exist. It should try to
> > align to the closest leaf page table size that is <= the FD's max
> > aligned folio.
> 
> So again IMHO this is also not per-FD information, but needs to be passed
> over from the driver for each call.

It is per-FD in the sense that each FD is unique and each range of
pgoff could have a unique maximum.
 
> Likely the "order" parameter appeared in other discussions to imply a
> maximum supported size from the driver side (or, for a folio, but that is
> definitely another user after this series can land).

Yes, it is the only information the driver can actually provide and
comes directly from what it will install in the VMA.

> So far I didn't yet add the "order", because currently VFIO definitely
> supports all max orders the system supports.  Maybe we can add the order
> when there's a real need, but maybe it won't happen in the near
> future?

The purpose of the order is to prevent over alignment and waste of
VMA. Your technique to use the length to limit alignment instead is
good enough for VFIO but not very general.

The VFIO part looks pretty good, I still don't really understand why
you'd have CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP though. The inline
fallback you have for it seems good enough and we don't care if things
are overaligned for ioremap.

Jason
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Peter Xu 3 months, 3 weeks ago
On Tue, Jun 17, 2025 at 08:18:07PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 17, 2025 at 04:56:13PM -0400, Peter Xu wrote:
> > On Mon, Jun 16, 2025 at 08:00:11PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Jun 16, 2025 at 06:06:23PM -0400, Peter Xu wrote:
> > > 
> > > > Can I understand it as a suggestion to pass in a bitmask into the core mm
> > > > API (e.g. keep the name of mm_get_unmapped_area_aligned()), instead of a
> > > > constant "align", so that core mm would try to allocate from the largest
> > > > size to smaller until it finds some working VA to use?
> > > 
> > > I don't think you need a bitmask.
> > > 
> > > Split the concerns, the caller knows what is inside it's FD. It only
> > > needs to provide the highest pgoff aligned folio/pfn within the FD.
> > 
> > Ultimately I even dropped this hint.  I found that it's not really
> > get_unmapped_area()'s job to detect over-sized pgoffs.  It's mmap()'s job.
> > So I decided to avoid this parameter as of now.
> 
> Well, the point of the pgoff is only what you said earlier, to adjust
> the starting alignment so the pgoff aligned high order folios/pfns
> line up properly.

I meant "highest pgoff" that I dropped.

We definitely need the pgoff to make it work.  So here I dropped the
"highest pgoff" passed from the caller, because I decided to leave such a
check to the mmap() hook later.

> 
> > > The mm knows what leaf page tables options exist. It should try to
> > > align to the closest leaf page table size that is <= the FD's max
> > > aligned folio.
> > 
> > So again IMHO this is also not per-FD information, but needs to be passed
> > over from the driver for each call.
> 
> It is per-FD in the sense that each FD is unique and each range of
> pgoff could have a unique maximum.
>  
> > Likely the "order" parameter appeared in other discussions to imply a
> > maximum supported size from the driver side (or, for a folio, but that is
> > definitely another user after this series can land).
> 
> Yes, it is the only information the driver can actually provide and
> comes directly from what it will install in the VMA.
> 
> > So far I didn't yet add the "order", because currently VFIO definitely
> > supports all max orders the system supports.  Maybe we can add the order
> > when there's a real need, but maybe it won't happen in the near
> > future?
> 
> The purpose of the order is to prevent over alignment and waste of
> VMA. Your technique to use the length to limit alignment instead is
> good enough for VFIO but not very general.

Yes, that's also something I didn't like.  I think I'll just go ahead and
add the order parameter, then use it in the previous patch too.

I'll wait for some more time though for others' input before a respin.

Thanks,

> 
> The VFIO part looks pretty good, I still don't really understand why
> you'd have CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP though. The inline
> fallback you have for it seems good enough and we don't care if things
> are overaligned for ioremap.
> 
> Jason
> 

-- 
Peter Xu
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Peter Xu 3 months, 3 weeks ago
On Tue, Jun 17, 2025 at 07:36:08PM -0400, Peter Xu wrote:
> On Tue, Jun 17, 2025 at 08:18:07PM -0300, Jason Gunthorpe wrote:
> > On Tue, Jun 17, 2025 at 04:56:13PM -0400, Peter Xu wrote:
> > > On Mon, Jun 16, 2025 at 08:00:11PM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Jun 16, 2025 at 06:06:23PM -0400, Peter Xu wrote:
> > > > 
> > > > > Can I understand it as a suggestion to pass in a bitmask into the core mm
> > > > > API (e.g. keep the name of mm_get_unmapped_area_aligned()), instead of a
> > > > > constant "align", so that core mm would try to allocate from the largest
> > > > > size to smaller until it finds some working VA to use?
> > > > 
> > > > I don't think you need a bitmask.
> > > > 
> > > > Split the concerns, the caller knows what is inside it's FD. It only
> > > > needs to provide the highest pgoff aligned folio/pfn within the FD.
> > > 
> > > Ultimately I even dropped this hint.  I found that it's not really
> > > get_unmapped_area()'s job to detect over-sized pgoffs.  It's mmap()'s job.
> > > So I decided to avoid this parameter as of now.
> > 
> > Well, the point of the pgoff is only what you said earlier, to adjust
> > the starting alignment so the pgoff aligned high order folios/pfns
> > line up properly.
> 
> I meant "highest pgoff" that I dropped.
> 
> We definitely need the pgoff to make it work.  So here I dropped "highest
> pgoff" passed from the caller because I decided to leave such check to the
> mmap() hook later.
> 
> > 
> > > > The mm knows what leaf page tables options exist. It should try to
> > > > align to the closest leaf page table size that is <= the FD's max
> > > > aligned folio.
> > > 
> > > So again IMHO this is also not per-FD information, but needs to be passed
> > > over from the driver for each call.
> > 
> > It is per-FD in the sense that each FD is unique and each range of
> > pgoff could have a unique maximum.
> >  
> > > Likely the "order" parameter appeared in other discussions to imply a
> > > maximum supported size from the driver side (or, for a folio, but that is
> > > definitely another user after this series can land).
> > 
> > Yes, it is the only information the driver can actually provide and
> > comes directly from what it will install in the VMA.
> > 
> > > So far I didn't yet add the "order", because currently VFIO definitely
> > > supports all max orders the system supports.  Maybe we can add the order
> > > when there's a real need, but maybe it won't happen in the near
> > > future?
> > 
> > The purpose of the order is to prevent over alignment and waste of
> > VMA. Your technique to use the length to limit alignment instead is
> > good enough for VFIO but not very general.
> 
> Yes that's also something I didn't like.  I think I'll just go ahead and
> add the order parameter, then use it in previous patch too.

So I changed my mind, slightly.  I can still have the "order" parameter to
make the API cleaner (even if it'll be pure overhead, because all existing
callers will pass in PUD_SIZE for now), but I think I'll still stick with
the ifdef in patch 4, as I mentioned here:

https://lore.kernel.org/all/aFGMG3763eSv9l8b@x1.local/

The problem is I just noticed yet again that exporting
huge_mapping_get_va_aligned() for all configs doesn't make sense.  At the
very least it would need something like the below to make !MMU compile for
VFIO, and that is definitely the kind of ugliness I also want to avoid..

===8<===
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 59fdafb1034b..f40a8fb64eaa 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -548,7 +548,11 @@ static inline unsigned long
 huge_mapping_get_va_aligned(struct file *filp, unsigned long addr,
                unsigned long len, unsigned long pgoff, unsigned long flags)
 {
+#ifdef CONFIG_MMU
        return mm_get_unmapped_area(current->mm, filp, addr, len, pgoff, flags);
+#else
+       return 0;
+#endif
 }

 static inline bool
===8<===

The issue is that mm_get_unmapped_area() is only exported with CONFIG_MMU,
so huge_mapping_get_va_aligned() needs to special-case that, here for
!THP && !MMU.

Besides the ugliness, there's also the question of how to choose a default
value to return when mm_get_unmapped_area() isn't available.

I gave it a default value (0) as an example, but I don't even think that 0
makes sense.  If it were ever triggerable from any caller on !MMU, it would
mean returning 0 directly to __get_unmapped_area(), and then do_mmap() (of
the !MMU code, reached from ksys_mmap_pgoff() in nommu.c) would take that
addr=0 as the address to mmap.. that sounds wrong.

There's just no way to provide a sane default value for !MMU.

So going one step back: huge_mapping_get_va_aligned() (or whatever name we
prefer) doesn't make sense to be exported always, but only when CONFIG_MMU.
It should follow the same way we treat mm_get_unmapped_area().

This also goes back to the question of why !MMU even supports mmap():

https://www.kernel.org/doc/Documentation/nommu-mmap.txt

So, for the example of the v4l driver (v4l2_m2m_get_unmapped_area, which I
used to quote, and misread: it is only defined for !MMU), it's really a
minimal mmap() support for ucLinux and nothing more.  My gut feeling is
that the noMMU use case more or less abused the get_unmapped_area() hook to
provide physical addresses, so as to make mmap() work even on ucLinux.

That's for sure not proof that we should have huge_mapping_get_va_aligned()
or mm_get_unmapped_area() available for !MMU.  Those are all about VAs,
which don't exist as a concept on !MMU.

Thanks,

-- 
Peter Xu
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Jason Gunthorpe 3 months, 3 weeks ago
On Wed, Jun 18, 2025 at 12:56:01PM -0400, Peter Xu wrote:
> So I changed my mind, slightly.  I can still have the "order" parameter to
> make the API cleaner (even if it'll be a pure overhead.. because all
> existing caller will pass in PUD_SIZE as of now), 

That doesn't seem right; the callers should report the real value, not
artificially cap it.  ARM, for example, does have page sizes greater than
PUD that might be interesting to enable someday for PFN users.

> but I think I'll still
> stick with the ifdef in patch 4, as I mentioned here:

> https://lore.kernel.org/all/aFGMG3763eSv9l8b@x1.local/
> 
> The problem is I just noticed yet again that exporting
> huge_mapping_get_va_aligned() for all configs doesn't make sense.  At least
> it'll need something like this to make !MMU compile for VFIO, while this is
> definitely some ugliness I also want to avoid..

IMHO this ugliness should certainly be contained to the mm code and not
leak into drivers.

> There's just no way to provide a sane default value for !MMU.

So all this mess seems to say that get_unmapped_area() is just the
wrong fop to have here.  It can't be implemented sanely for !MMU and
has these weird conditions, like being unable to fail.

I again suggest to just simplify and add a new fop:

size_t get_best_mapping_order(struct file *filp, pgoff_t pgoff,
                              size_t length);

Which will return the largest pgoff aligned order within pgoff/length
that the FD could try to install. Very simple for the driver
side. vfio pci will just return ilog2(bar_size).

PAGE_SHIFT can be a safe default.

Then put all this maze of conditionals on the mm side, replacing the
call to fops->get_unmapped_area(), and don't export anything new.  The
mm will automatically cap the alignment based on what the architecture
can do.

!MMU would simply entirely ignore this new stuff.
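
For illustration, here is a userspace-buildable sketch of the fop Jason
proposes and the core-mm capping described above.  Everything here is a
hypothetical model, not existing kernel API: the names, the x86-64-style
PUD_SHIFT value, and the capping helper are all assumptions made for the
sketch.

```c
/*
 * Sketch of the proposed get_best_mapping_order() fop and the core-mm
 * capping.  Hypothetical model built to compile in userspace; the
 * names and constants are assumptions, not kernel code.
 */
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PUD_SHIFT  30    /* assumes 1G PUD leaf entries */

/* Driver side: vfio-pci would just report ilog2(bar_size). */
static size_t vfio_pci_best_mapping_order(size_t bar_size)
{
        size_t order = SKETCH_PAGE_SHIFT;

        while (bar_size >= (1UL << (order + 1)))
                order++;
        return order;
}

/* Core-mm side: cap by what the page tables support; PAGE_SHIFT is
 * the safe default when the driver reports something smaller. */
static size_t mm_cap_mapping_order(size_t driver_order)
{
        if (driver_order > SKETCH_PUD_SHIFT)
                return SKETCH_PUD_SHIFT;
        if (driver_order < SKETCH_PAGE_SHIFT)
                return SKETCH_PAGE_SHIFT;
        return driver_order;
}
```

This matches the split of concerns above: the driver reports the real
order it could install, and only the mm side knows which leaf page table
sizes the architecture actually supports.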

> So going one step back: huge_mapping_get_va_aligned() (or whatever name we
> prefer) doesn't make sense to be exported always, but only when CONFIG_MMU.
> It should follow the same way we treat mm_get_unmapped_area().

We just deleted !SMP, I really wonder if it is time for !MMU to go
away too..

Jason
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Peter Xu 3 months, 3 weeks ago
On Wed, Jun 18, 2025 at 02:46:41PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 18, 2025 at 12:56:01PM -0400, Peter Xu wrote:
> > So I changed my mind, slightly.  I can still have the "order" parameter to
> > make the API cleaner (even if it'll be a pure overhead.. because all
> > existing caller will pass in PUD_SIZE as of now), 
> 
> That doesn't seem right, the callers should report the real value not
> artifically cap it.. Like ARM does have page sizes greater than PUD
> that might be interesting to enable someday for PFN users.

It needs to pass in PUD_SIZE to match what vfio-pci currently supports in
its huge_fault().

> 
> > but I think I'll still
> > stick with the ifdef in patch 4, as I mentioned here:
> 
> > https://lore.kernel.org/all/aFGMG3763eSv9l8b@x1.local/
> > 
> > The problem is I just noticed yet again that exporting
> > huge_mapping_get_va_aligned() for all configs doesn't make sense.  At least
> > it'll need something like this to make !MMU compile for VFIO, while this is
> > definitely some ugliness I also want to avoid..
> 
> IMHO this uglyness should certainly be contained to the mm code and not
> leak into drivers.
> 
> > There's just no way to provide a sane default value for !MMU.
> 
> So all this mess seems to say that get_unmapped_area() is just the
> wrong fop to have here. It can't be implemented sanely for !MMU and
> has these weird conditions, like can't fail.
> 
> I again suggest to just simplify and add an new fop 
> 
> size_t get_best_mapping_order(struct file *filp, pgoff_t pgoff,
>                               size_t length);
> 
> Which will return the largest pgoff aligned order within pgoff/length
> that the FD could try to install. Very simple for the driver
> side. vfio pci will just return ilog2(bar_size).
> 
> PAGE_SHIFT can be a safe default.

I agree this is a better way.  We can make PAGE_SHIFT the default, or just
0, because it doesn't sound necessary to me to support anything smaller
than PAGE_SIZE.. maybe an "int" retval would suffice to also cover errors.

So this will introduce a new file operation that will so far only be used
in VFIO, playing a similar role until we start converting the many
get_unmapped_area() implementations over to it.

> 
> Then put all this maze of conditionals in the mm side replacing the
> call to fops->get_unmapped_area() and don't export anything new. The
> mm will automaticall cap the alignment based on what the architecture
> can do and what 
> 
> !MMU would simply entirely ignore this new stuff.

For the long term, we should move all get_unmapped_area() users to the new
API.  For the old !MMU users, we should rename get_unmapped_area() to
something better, like get_mmap_addr().  In those cases it's really not
about looking for an unmapped area, but normally about returning exactly
what was requested.

> 
> > So going one step back: huge_mapping_get_va_aligned() (or whatever name we
> > prefer) doesn't make sense to be exported always, but only when CONFIG_MMU.
> > It should follow the same way we treat mm_get_unmapped_area().
> 
> We just deleted !SMP, I really wonder if it is time for !MMU to go
> away too..

Yes, if that comes earlier, we can completely drop get_unmapped_area()
after all existing MMU users are converted to the new one.

Any early objections / concerns / comments from anyone else, before I go
and introduce it?

-- 
Peter Xu
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Jason Gunthorpe 3 months, 3 weeks ago
On Wed, Jun 18, 2025 at 03:15:50PM -0400, Peter Xu wrote:
> > > So I changed my mind, slightly.  I can still have the "order" parameter to
> > > make the API cleaner (even if it'll be a pure overhead.. because all
> > > existing caller will pass in PUD_SIZE as of now), 
> > 
> > That doesn't seem right, the callers should report the real value not
> > artifically cap it.. Like ARM does have page sizes greater than PUD
> > that might be interesting to enable someday for PFN users.
> 
> It needs to pass in PUD_SIZE to match what vfio-pci currently supports in
> its huge_fault().

Hm, OK that does make sense. I would add a small comment though as it
is not so intuitive and may not apply to something using ioremap..

> So this will introduce a new file operation that will only be used so far
> in VFIO, playing similar role until we start to convert many
> get_unmapped_area() to this one.

Yes, if someone wants to do a project here, you could mark up
memfds/shmem/hugetlbfs/etc to define their internal folio orders,
and hopefully ultimately remove some of that alignment logic from the
arch code.

Jason
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Peter Xu 3 months, 3 weeks ago
On Thu, Jun 19, 2025 at 10:58:52AM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 18, 2025 at 03:15:50PM -0400, Peter Xu wrote:
> > > > So I changed my mind, slightly.  I can still have the "order" parameter to
> > > > make the API cleaner (even if it'll be a pure overhead.. because all
> > > > existing caller will pass in PUD_SIZE as of now), 
> > > 
> > > That doesn't seem right, the callers should report the real value not
> > > artifically cap it.. Like ARM does have page sizes greater than PUD
> > > that might be interesting to enable someday for PFN users.
> > 
> > It needs to pass in PUD_SIZE to match what vfio-pci currently supports in
> > its huge_fault().
> 
> Hm, OK that does make sense. I would add a small comment though as it
> is not so intuitive and may not apply to something using ioremap..

Sure, I'll remember to add a comment if I end up going back to the old
interface.  I hope that won't happen..

> 
> > So this will introduce a new file operation that will only be used so far
> > in VFIO, playing similar role until we start to convert many
> > get_unmapped_area() to this one.
> 
> Yes, if someone wants to do a project here you can markup
> memfds/shmem/hugetlbfs/etc/etc to define their internal folio orders
> and hopefully ultimately remove some of that alignment logic from the
> arch code.

I'm a bit reluctant to touch all of those files just for this, but I can
definitely add a very verbose explanation to the commit log when I
introduce the new API, covering not only its relationship to the old APIs
but also possible future work.

Besides the get_unmapped_area() -> NEW API conversions, which are arch
independent in most cases, it would indeed be great to reduce the per-arch
alignment requirements as much as possible.  At least that should apply to
hugetlbfs, which shouldn't be arch-dependent.  I am not sure about the
rest, though.  For example, I see archs may treat PF_RANDOMIZE differently.
There might be a lot of trivial details to look at.

OTOH, one other thought (which may not require auditing all archs) is that
it does look confusing to have two layers of alignment operations, which is
at least the case for THP right now.  So it might be good to at least punch
the alignment through to vm_unmapped_area_info.align_mask / etc. if
possible, to avoid double padding: after all, unmapped_area() also applies
alignment padding.  It smells like something we overlooked when initially
supporting THP.

Thanks,

-- 
Peter Xu
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Jason Gunthorpe 3 months, 3 weeks ago
On Thu, Jun 19, 2025 at 10:55:02AM -0400, Peter Xu wrote:
> On Thu, Jun 19, 2025 at 10:58:52AM -0300, Jason Gunthorpe wrote:
> > On Wed, Jun 18, 2025 at 03:15:50PM -0400, Peter Xu wrote:
> > > > > So I changed my mind, slightly.  I can still have the "order" parameter to
> > > > > make the API cleaner (even if it'll be a pure overhead.. because all
> > > > > existing caller will pass in PUD_SIZE as of now), 
> > > > 
> > > > That doesn't seem right, the callers should report the real value not
> > > > artifically cap it.. Like ARM does have page sizes greater than PUD
> > > > that might be interesting to enable someday for PFN users.
> > > 
> > > It needs to pass in PUD_SIZE to match what vfio-pci currently supports in
> > > its huge_fault().
> > 
> > Hm, OK that does make sense. I would add a small comment though as it
> > is not so intuitive and may not apply to something using ioremap..
> 
> Sure, I'll remember to add some comment if I'll go back to the old
> interface.  I hope it won't happen..

Even with this new version you have to decide whether to return PUD_SIZE or
bar_size in pci, and your same reasoning that PUD_SIZE makes sense
applies (though I would probably return bar_size and just let the core
code cap it to PUD_SIZE).

> I'm a bit refrained to touch all of the files just for this, but I can
> definitely add very verbose explanation into the commit log when I'll
> introduce the new API, on not only the relationship of that and the old
> APIs, also possible future works.

Yeah, I wouldn't grow this work any more.  It does highlight that there is
a lot of room to improve the arch interface, though.

> OTOH, one other thought (which may not need to monitor all archs) is it
> does look confusing to have two layers of alignment operation, which is at
> least the case of THP right now.  So it might be good to at least punch it
> through to use vm_unmapped_area_info.align_mask / etc. if possible, to
> avoid double-padding: after all, unmapped_area() also did align paddings.
> It smells like something we overlooked when initially support THP.

I would not address that in this series, THP has been abusing this for
a long time, may as well keep it for now.

Either the arch code should return the info struct or the order should
be passed down to arch code. This would give enough information to the
maple tree algorithm to be able to do one operation.

Jason
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Peter Xu 3 months, 2 weeks ago
On Thu, Jun 19, 2025 at 03:40:41PM -0300, Jason Gunthorpe wrote:
> Even with this new version you have to decide to return PUD_SIZE or
> bar_size in pci and your same reasoning that PUD_SIZE make sense
> applies (though I would probably return bar_size and just let the core
> code cap it to PUD_SIZE)

Yes.

Today I went back to look at this; I was trying to introduce this
file_operations hook:

	int (*get_mapping_order)(struct file *, unsigned long, size_t);

It looks almost good, except that it so far has no way to return the
physical address for further alignment calculations.

For THP, the VA alignment is always calculated against the pgoff, not the
physical address.  I think that's OK for THP, because every 2M THP folio
will be naturally 2M-aligned in physical address space, so it fits when
e.g. pgoff=0 in the calculation of thp_get_unmapped_area_vmflags().
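
As a concrete illustration of that pgoff-based calculation, here is a
compilable sketch of the alignment arithmetic, simplified from what
thp_get_unmapped_area_vmflags() effectively does.  The helper name and
the 2M constant are assumptions made for the sketch, not kernel code.

```c
/*
 * Sketch of THP-style VA alignment: given a candidate VA inside an
 * over-sized search window, shift it forward so that the VA and the
 * file offset agree modulo the huge page size.  Simplified model, not
 * the verbatim kernel implementation.
 */
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_HPAGE_SIZE (1UL << 21)   /* assumes 2M huge pages */

static unsigned long align_va_to_pgoff(unsigned long va,
                                       unsigned long pgoff,
                                       unsigned long size)
{
        unsigned long off = (pgoff << SKETCH_PAGE_SHIFT) & (size - 1);

        /* Advance va until (va % size) == (file offset % size). */
        return va + ((off - va) & (size - 1));
}
```

With pgoff=0 this lands the VA on a huge-page boundary, which is exactly
why naturally aligned folios end up mappable with huge PMDs.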

Logically it should also work for vfio-pci, as long as VFIO keeps using the
lower 40 bits of the device fd offset to represent the bar offset; it also
relies on the PCIe spec requiring PCI bars to be mapped aligned to their
bar sizes.
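
The 40-bit layout mentioned above comes from vfio-pci's region offset
encoding.  A minimal sketch of it, where the two core macros mirror the
definitions in include/linux/vfio_pci_core.h:

```c
/*
 * vfio-pci packs the region (BAR) index into the high bits of the
 * device fd's offset space, so every region's pgoff range starts on a
 * 1TB boundary.  The macros mirror the kernel's definitions in
 * include/linux/vfio_pci_core.h.
 */
#include <assert.h>
#include <stdint.h>

#define VFIO_PCI_OFFSET_SHIFT 40
#define VFIO_PCI_INDEX_TO_OFFSET(index) \
        ((uint64_t)(index) << VFIO_PCI_OFFSET_SHIFT)
#define VFIO_PCI_OFFSET_MASK \
        (((uint64_t)1 << VFIO_PCI_OFFSET_SHIFT) - 1)
```

With this layout, any BAR that is naturally aligned in physical address
space is also naturally aligned in pgoff space (up to 1TB), which is what
makes the pgoff-based VA alignment usable for vfio-pci.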

But from an API POV, get_mapping_order() logically should return something
usable for calculating the VA alignment, and pgoff may not always be the
right thing to align the VA against: after all, a pgtable mapping is about
VA -> PA, so the only reasonable and reliable way is to align the VA to the
PA to be mapped, and as an API we shouldn't assume pgoff is always aligned
to the PA address space.

Any thoughts?

-- 
Peter Xu
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Jason Gunthorpe 3 months, 2 weeks ago
On Tue, Jun 24, 2025 at 04:37:26PM -0400, Peter Xu wrote:
> On Thu, Jun 19, 2025 at 03:40:41PM -0300, Jason Gunthorpe wrote:
> > Even with this new version you have to decide to return PUD_SIZE or
> > bar_size in pci and your same reasoning that PUD_SIZE make sense
> > applies (though I would probably return bar_size and just let the core
> > code cap it to PUD_SIZE)
> 
> Yes.
> 
> Today I went back to look at this, I was trying to introduce this for
> file_operations:
> 
> 	int (*get_mapping_order)(struct file *, unsigned long, size_t);
> 
> It looks almost good, except that it so far has no way to return the
> physical address for further calculation on the alignment.
> 
> For THP, VA is always calculated against pgoff not physical address on the
> alignment.  I think it's OK for THP, because every 2M THP folio will be
> naturally 2M aligned on the physical address, so it fits when e.g. pgoff=0
> in the calculation of thp_get_unmapped_area_vmflags().
> 
> Logically it should even also work for vfio-pci, as long as VFIO keeps
> using the lower 40 bits of the device_fd to represent the bar offset,
> meanwhile it'll also require PCIe spec asking the PCI bars to be mapped
> aligned with bar sizes.
> 
> But from an API POV, get_mapping_order() logically should return something
> for further calculation of the alignment to get the VA.  pgoff here may not
> always be the right thing to use to align to the VA: after all, pgtable
> mapping is about VA -> PA, the only reasonable and reliable way is to align
> VA to the PA to be mappped, and as an API we shouldn't assume pgoff is
> always aligned to PA address space.

My feeling, and the reason I used the phrase "pgoff aligned address",
is that the owner of the file should already ensure that for the large
PTEs/folios:
 pgoff % 2**order == 0
 physical % 2**order == 0

So, things like VFIO do need to hand out highly aligned pgoffs to make
this work, which VFIO already does.
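
The invariant can be stated as a tiny predicate; the helper below is a
hypothetical illustration just to make the rule concrete, not kernel code:

```c
/*
 * The rule stated above: for an order-N huge mapping, the file owner
 * must ensure both the pgoff and the backing PFN are aligned to 2^N
 * pages.  Hypothetical helper for illustration.
 */
#include <assert.h>
#include <stdbool.h>

static bool huge_mapping_ok(unsigned long pgoff, unsigned long pfn,
                            unsigned int order)
{
        unsigned long mask = (1UL << order) - 1;

        return (pgoff & mask) == 0 && (pfn & mask) == 0;
}
```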

To me this just keeps things simpler.  I guess if someone comes up with
a case where they really can't get pgoff alignment and really need a
high order mapping, then maybe we can add a new return field of some
kind (pgoff adjustment?), but that is so weird I'd leave it to the
future person to come and justify it.

Jason
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Peter Xu 3 months, 2 weeks ago
On Tue, Jun 24, 2025 at 08:40:32PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 24, 2025 at 04:37:26PM -0400, Peter Xu wrote:
> > On Thu, Jun 19, 2025 at 03:40:41PM -0300, Jason Gunthorpe wrote:
> > > Even with this new version you have to decide to return PUD_SIZE or
> > > bar_size in pci and your same reasoning that PUD_SIZE make sense
> > > applies (though I would probably return bar_size and just let the core
> > > code cap it to PUD_SIZE)
> > 
> > Yes.
> > 
> > Today I went back to look at this, I was trying to introduce this for
> > file_operations:
> > 
> > 	int (*get_mapping_order)(struct file *, unsigned long, size_t);
> > 
> > It looks almost good, except that it so far has no way to return the
> > physical address for further calculation on the alignment.
> > 
> > For THP, VA is always calculated against pgoff not physical address on the
> > alignment.  I think it's OK for THP, because every 2M THP folio will be
> > naturally 2M aligned on the physical address, so it fits when e.g. pgoff=0
> > in the calculation of thp_get_unmapped_area_vmflags().
> > 
> > Logically it should even also work for vfio-pci, as long as VFIO keeps
> > using the lower 40 bits of the device_fd to represent the bar offset,
> > meanwhile it'll also require PCIe spec asking the PCI bars to be mapped
> > aligned with bar sizes.
> > 
> > But from an API POV, get_mapping_order() logically should return something
> > for further calculation of the alignment to get the VA.  pgoff here may not
> > always be the right thing to use to align to the VA: after all, pgtable
> > mapping is about VA -> PA, the only reasonable and reliable way is to align
> > VA to the PA to be mappped, and as an API we shouldn't assume pgoff is
> > always aligned to PA address space.
> 
> My feeling, and the reason I used the phrase "pgoff aligned address",
> is that the owner of the file should already ensure that for the large
> PTEs/folios:
>  pgoff % 2**order == 0
>  physical % 2**order == 0

IMHO there shouldn't really be any hard requirement in mm that pgoff and
the physical address space be aligned.. but I confess I don't have an
example of a driver in the linux tree that didn't do that.

> 
> So, things like VFIO do need to hand out high alignment pgoffs to make
> this work - which it already does.
> 
> To me this just keeps thing simpler. I guess if someone comes up with
> a case where they really can't get a pgoff alignment and really need a
> high order mapping then maybe we can add a new return field of some
> kind (pgoff adjustment?) but that is so weird I'd leave it to the
> future person to come and justfiy it.

Looking further, I also found some special-cased get_unmapped_area()
implementations that may not be trivially converted to the new API even for
CONFIG_MMU, namely:

- io_uring_get_unmapped_area
- arena_get_unmapped_area (from bpf_map->ops->map_get_unmapped_area)

I'll need to take a closer look tomorrow.  If any of them cannot be 100%
safely converted to the new API, I'd think we should not introduce it, but
keep reusing get_unmapped_area() until we know a way out.

-- 
Peter Xu
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Jason Gunthorpe 3 months, 2 weeks ago
On Tue, Jun 24, 2025 at 08:48:45PM -0400, Peter Xu wrote:
> > My feeling, and the reason I used the phrase "pgoff aligned address",
> > is that the owner of the file should already ensure that for the large
> > PTEs/folios:
> >  pgoff % 2**order == 0
> >  physical % 2**order == 0
> 
> IMHO there shouldn't really be any hard requirement in mm that pgoff and
> physical address space need to be aligned.. but I confess I don't have an
> example driver that didn't do that in the linux tree.

Well, maybe, but right now there does seem to be one for
THP/hugetlbfs/etc.  It is a nice simple solution that exposes the
alignment requirements to userspace if it wants to use MAP_FIXED.

> > To me this just keeps thing simpler. I guess if someone comes up with
> > a case where they really can't get a pgoff alignment and really need a
> > high order mapping then maybe we can add a new return field of some
> > kind (pgoff adjustment?) but that is so weird I'd leave it to the
> > future person to come and justfiy it.
> 
> When looking more, I also found some special cased get_unmapped_area() that
> may not be trivially converted into the new API even for CONFIG_MMU, namely:
> 
> - io_uring_get_unmapped_area
> - arena_get_unmapped_area (from bpf_map->ops->map_get_unmapped_area)
> 
> I'll need to have some closer look tomorrow.  If any of them cannot be 100%
> safely converted to the new API, I'd also think we should not introduce the
> new API, but reuse get_unmapped_area() until we know a way out.

Oh yuk. It is trying to avoid the dcache flush on some kernel paths
for virtually tagged cache systems.

Arguably this fixup should not be in io_uring, but conveying the right
information to the core code, and requesting a special flush
avoidance mapping is not so easy.

But again I suspect the pgoff is the right solution.

IIRC this is handled by forcing a few low virtual address bits to
always match across all user mappings (the colour) via the pgoff. This
way the userspace always uses the same cache tag and doesn't become
cache incoherent. ie:

   user_addr % (PAGE_SIZE * N) == pgoff % (PAGE_SIZE * N)

The issue is now the kernel is using the direct map and we can't force
a random jumble of pages to have the right colours to match
userspace. So the kernel has all those dcache flushes sprinkled about
before it touches user mapped memory through the direct map as the
kernel will use a different colour and cache tag.

So.. if io_uring selects a pgoff that automatically gives the right
colour, so the userspace mapping also matches the kernel mapping's
colour, then things should just work.
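
The colour rule can be sketched as a predicate.  SHM_COLOUR below uses
parisc's 4M alias window; the helper itself is a hypothetical
illustration, not kernel code:

```c
/*
 * VIPT-alias "colour" rule: a user mapping and the kernel-side mapping
 * of the same memory must agree in the low VA bits covered by the
 * alias window, or they land on different cache lines.  SHM_COLOUR is
 * parisc's value; the helper is illustrative only.
 */
#include <assert.h>
#include <stdbool.h>

#define SHM_COLOUR 0x00400000UL   /* 4M alias window on parisc */

static bool colours_match(unsigned long user_va, unsigned long kernel_va)
{
        return (user_va & (SHM_COLOUR - 1)) ==
               (kernel_va & (SHM_COLOUR - 1));
}
```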

Frankly I'm shocked that someone invested time in trying to make this
work - the commit log says it was for parisc and only 2 years ago :(

d808459b2e31 ("io_uring: Adjust mapping wrt architecture aliasing requirements")

I thought such physically tagged cache systems were long ago dead and
buried..

Shouldn't this entirely reject MAP_FIXED too?

Jason
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Peter Xu 3 months, 2 weeks ago
On Wed, Jun 25, 2025 at 10:07:11AM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 24, 2025 at 08:48:45PM -0400, Peter Xu wrote:
> > > My feeling, and the reason I used the phrase "pgoff aligned address",
> > > is that the owner of the file should already ensure that for the large
> > > PTEs/folios:
> > >  pgoff % 2**order == 0
> > >  physical % 2**order == 0
> > 
> > IMHO there shouldn't really be any hard requirement in mm that pgoff and
> > physical address space need to be aligned.. but I confess I don't have an
> > example driver that didn't do that in the linux tree.
> 
> Well, maybe, but right now there does seem to be for
> THP/hugetlbfs/etc. It is a nice simple solution that exposes the
> alignment requirements to userspace if it wants to use MAP_FIXED.
> 
> > > To me this just keeps thing simpler. I guess if someone comes up with
> > > a case where they really can't get a pgoff alignment and really need a
> > > high order mapping then maybe we can add a new return field of some
> > > kind (pgoff adjustment?) but that is so weird I'd leave it to the
> > > future person to come and justfiy it.
> > 
> > When looking more, I also found some special cased get_unmapped_area() that
> > may not be trivially converted into the new API even for CONFIG_MMU, namely:
> > 
> > - io_uring_get_unmapped_area
> > - arena_get_unmapped_area (from bpf_map->ops->map_get_unmapped_area)
> > 
> > I'll need to have some closer look tomorrow.  If any of them cannot be 100%
> > safely converted to the new API, I'd also think we should not introduce the
> > new API, but reuse get_unmapped_area() until we know a way out.
> 
> Oh yuk. It is trying to avoid the dcache flush on some kernel paths
> for virtually tagged cache systems.
> 
> Arguably this fixup should not be in io_uring, but conveying the right
> information to the core code, and requesting a special flush
> avoidance mapping is not so easy.

IIUC it still makes sense for this to live in io_uring, because only the
io_uring subsystem knows what to align against.  I don't yet understand how
generic mm could do this; after all, generic mm doesn't know the address
that io_uring is using (from io_region_get_ptr()).

> 
> But again I suspect the pgoff is the right solution.
> 
> IIRC this is handled by forcing a few low virtual address bits to
> always match across all user mappings (the colour) via the pgoff. This
> way the userspace always uses the same cache tag and doesn't become
> cache incoherent. ie:
> 
>    user_addr % PAGE_SIZE*N == pgoff % PAGE_SIZE*N
> 
> The issue is now the kernel is using the direct map and we can't force

After I read the two use cases, I mostly agree.  Just one trivial thing to
mention: it may not be the direct map but vmap() (see io_region_init_ptr()).

> a random jumble of pages to have the right colours to match
> userspace. So the kernel has all those dcache flushes sprinkled about
> before it touches user mapped memory through the direct map as the
> kernel will use a different colour and cache tag.
> 
> So.. if iouring selects a pgoff that automatically gives the right
> colour for the userspace mapping to also match the kernel mapping's
> colour then things should just work.
> 
> Frankly I'm shocked that someone invested time in trying to make this
> work - the commit log says it was for parisc and only 2 years ago :(
> 
> d808459b2e31 ("io_uring: Adjust mapping wrt architecture aliasing requirements")
> 
> I thought such physically tagged cache systems were long ago dead and
> buried..

Yeah.. the internet says parisc stopped shipping in 2005.  Obviously
there are still people running io_uring on parisc systems, more or less.
This change seems to be required to make io_uring work on parisc or any
VIPT-cache system.

> 
> Shouldn't this entirely reject MAP_FIXED too?

It already does, see (io_uring_get_unmapped_area(), of parisc):

	/*
	 * Do not allow to map to user-provided address to avoid breaking the
	 * aliasing rules. Userspace is not able to guess the offset address of
	 * kernel kmalloc()ed memory area.
	 */
	if (addr)
		return -EINVAL;

I don't know who would ever use MAP_FIXED with addr=0, so failing on
addr != 0 should already stop practically all MAP_FIXED uses.

Side topic, but... logically speaking this should really be fine when
!SHM_COLOUR.  This commit breaks MAP_FIXED for everyone on io_uring,
but I guess nobody really uses MAP_FIXED with io_uring fds..

It's also utterly confusing to set addr=ptr for parisc; fundamentally addr
here must be a kernel VA, not a user VA, so (AFAIU) it will always fail
later against the STACK_SIZE checks..  IMHO we should really change this to:

diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index 725dc0bec24c..1225a9393dc5 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -380,12 +380,10 @@ unsigned long io_uring_get_unmapped_area(struct file *filp, unsigned long addr,
         */
        filp = NULL;
        flags |= MAP_SHARED;
-       pgoff = 0;      /* has been translated to ptr above */
 #ifdef SHM_COLOUR
-       addr = (uintptr_t) ptr;
-       pgoff = addr >> PAGE_SHIFT;
+       pgoff = (uintptr_t)ptr >> PAGE_SHIFT;
 #else
-       addr = 0UL;
+       pgoff = 0;      /* has been translated to ptr above */
 #endif
        return mm_get_unmapped_area(current->mm, filp, addr, len, pgoff, flags);
 }

That would avoid the confusing "addr=ptr" setup.  This might be too
off-topic, though.

Then I also looked at the other use case, bpf arena, which doubles the len
when requesting the VA and then rounds the result up to 4G:

arena_get_unmapped_area():
	ret = mm_get_unmapped_area(current->mm, filp, addr, len * 2, 0, flags);
        ...
	return round_up(ret, SZ_4G);

AFAIU, this is buggy.. at least we should check that "round_up(ret, SZ_4G)"
still falls into the [ret, ret+2*len) region... otherwise we can return an
address that might already be used by other VMAs..

But in general that smells like a similar alignment issue, IIUC.  So it
might be applicable to the new API as well.

Going back to the topic of this series - I think the new API would work for
io_uring and parisc too if it can return phys_pgoff; here is what parisc
would need:

#ifdef SHM_COLOUR
        *phys_pgoff = io_region_get_ptr(..) >> PAGE_SHIFT;
#else
        *phys_pgoff = pgoff;
#endif

Here *phys_pgoff (or whatever we rename it to) would be required to fetch
the kernel VA offset (no matter whether it's the direct mapping or vmap()),
to avoid the aliasing issue.

Should I go and introduce the API with *phys_pgoff returned together, then?
I'll still need to scratch my head on how to properly define it, but it
will at least also decouple the vfio use case from the spec dependency.

Thanks,

-- 
Peter Xu
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Jason Gunthorpe 3 months, 2 weeks ago
On Wed, Jun 25, 2025 at 01:12:11PM -0400, Peter Xu wrote:

> After I read the two use cases, I mostly agree.  Just one trivial thing to
> mention, it may not be direct map but vmap() (see io_region_init_ptr()).

If it is vmapped then this is all silly, you should vmap and mmap
using the same cache colouring and, AFAIK, pgoff is how this works for
purely userspace mappings.

Once vmap'd it should determine the cache colour and set the pgoff
properly, then everything should already work no?

> It already does, see (io_uring_get_unmapped_area(), of parisc):
> 
> 	/*
> 	 * Do not allow to map to user-provided address to avoid breaking the
> 	 * aliasing rules. Userspace is not able to guess the offset address of
> 	 * kernel kmalloc()ed memory area.
> 	 */
> 	if (addr)
> 		return -EINVAL;
> 
> I do not know whoever would use MAP_FIXED but with addr=0.  So failing
> addr!=0 should literally stop almost all MAP_FIXED already.

Maybe, but it is also not right to not check MAP_FIXED directly.. And
addr is supposed to be a hint in non-fixed mode, so it is weird to
-EINVAL when you can just ignore the hint??

> Going back to the topic of this series - I think the new API would work for
> io_uring and parisc too if I can return phys_pgoff, here what parisc would
> need is:

The best solution is to fix the selection of the normal pgoff so it has
consistent colouring across user VMAs and kernel vmaps. Either compute a
pgoff that matches the vmap (hopefully easy if it is not uABI) or
teach the kernel vmap how to respect a "pgoff" to set the cache
colouring, just like the user VMAs do (AFAICR).

But I think this is maybe getting too big and I'd just introduce the
new API and not try to convert this hard stuff. The above explanation of
how it could be fixed should be enough??

Jason
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Peter Xu 3 months, 2 weeks ago
On Wed, Jun 25, 2025 at 03:41:54PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 25, 2025 at 01:12:11PM -0400, Peter Xu wrote:
> 
> > After I read the two use cases, I mostly agree.  Just one trivial thing to
> > mention, it may not be direct map but vmap() (see io_region_init_ptr()).
> 
> If it is vmapped then this is all silly, you should vmap and mmap
> using the same cache colouring and, AFAIK, pgoff is how this works for
> purely userspace.
> 
> Once vmap'd it should determine the cache colour and set the pgoff
> properly, then everything should already work no?

I don't yet see how to set the pgoff.  Here pgoff is passed from
userspace, following io_uring's definition (per io_uring_mmap()).

For example, on parisc one could map the completion queue with
pgoff=IORING_OFF_CQ_RING (0x8000000), but then the VA alignment needs to be
adjusted against the vmap() address returned in the completion queue's
io_mapped_region.ptr.

> 
> > It already does, see (io_uring_get_unmapped_area(), of parisc):
> > 
> > 	/*
> > 	 * Do not allow to map to user-provided address to avoid breaking the
> > 	 * aliasing rules. Userspace is not able to guess the offset address of
> > 	 * kernel kmalloc()ed memory area.
> > 	 */
> > 	if (addr)
> > 		return -EINVAL;
> > 
> > I do not know whoever would use MAP_FIXED but with addr=0.  So failing
> > addr!=0 should literally stop almost all MAP_FIXED already.
> 
> Maybe but also it is not right to not check MAP_FIXED directly.. And
> addr is supposed to be a hint for non-fixed mode so it is weird to
> -EINVAL when you can ignore the hint??

I agree on both points here.

> 
> > Going back to the topic of this series - I think the new API would work for
> > io_uring and parisc too if I can return phys_pgoff, here what parisc would
> > need is:
> 
> The best solution is to fix the selection of normal pgoff so it has
> consistent colouring of user VMAs and kernel vmaps. Either compute a
> pgoff that matches the vmap (hopefully easy if it is not uABI) or
> teach the kernel vmap how to respect a "pgoff" to set the cache
> colouring just like the user VMA's do (AFIACR).
> 
> But I think this is getting maybe too big and I'd just introduce the
> new API and not try to convert this hard stuff. The above explanation
> how it could be fixed should be enough??

I never planned to do it myself.  However, if I'm going to sign off on and
propose an API, I want to be crystal clear on the goal of the API, and on
the feasibility of that goal, even if I'm not going to work on it..

We don't want to introduce something and then find it won't work even for
some MMU use cases, and end up maintaining both, or reverting.  I wish we
could have stuck with get_unmapped_area() for now and left the API for
later.

So if we want the new API to be proposed here, and to make VFIO use it
first (while considering it applicable to at least all existing MMU users,
all of which I have checked by now), I'd propose this:

    int (*mmap_va_hint)(struct file *file, unsigned long *pgoff, size_t len);

The changes compared to the previous version:

    (1) merged the pgoff and *phys_pgoff parameters into one unsigned long,
    so the hook can adjust the pgoff that the VA allocator will use.  The
    adjustment will not be visible to the later mmap() when the VMA is
    created.

    (2) renamed it to mmap_va_hint(), because *pgoff can now be updated, so
    it's no longer only about the order, but both the "order" and "pgoff
    adjustment" hints that the core mm will use when calculating the VA.

Does it look ok to you?

-- 
Peter Xu
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Jason Gunthorpe 3 months, 1 week ago
On Wed, Jun 25, 2025 at 03:26:44PM -0400, Peter Xu wrote:
> On Wed, Jun 25, 2025 at 03:41:54PM -0300, Jason Gunthorpe wrote:
> > On Wed, Jun 25, 2025 at 01:12:11PM -0400, Peter Xu wrote:
> > 
> > > After I read the two use cases, I mostly agree.  Just one trivial thing to
> > > mention, it may not be direct map but vmap() (see io_region_init_ptr()).
> > 
> > If it is vmapped then this is all silly, you should vmap and mmap
> > using the same cache colouring and, AFAIK, pgoff is how this works for
> > purely userspace.
> > 
> > Once vmap'd it should determine the cache colour and set the pgoff
> > properly, then everything should already work no?
> 
> I don't yet see how to set the pgoff.  Here pgoff is passed from the
> userspace, which follows io_uring's definition (per io_uring_mmap).

That's too bad

So you have to do it the other way and pass the pgoff to the vmap so
the vmap ends up with the same colouring as a user VMa holding the
same pages..

> So if we want the new API to be proposed here, and make VFIO use it first
> (while consider it to be applicable to all existing MMU users at least,
> which I checked all of them so far now), I'd think this proper:
> 
>     int (*mmap_va_hint)(struct file *file, unsigned long *pgoff, size_t len);
> 
> The changes comparing to previous:
> 
>     (1) merged pgoff and *phys_pgoff parameters into one unsigned long, so
>     the hook can adjust the pgoff for the va allocator to be used.  The
>     adjustment will not be visible to future mmap() when VMA is created.

It seems functional, but the above is better, IMHO.

>     (2) I renamed it to mmap_va_hint(), because *pgoff will be able to be
>     updated, so it's not only about ordering, but "order" and "pgoff
>     adjustment" hints that the core mm will use when calculating the VA.

Where does order come back though? Returns order?

It seems viable

Jason
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Peter Xu 3 months, 1 week ago
On Mon, Jun 30, 2025 at 11:05:37AM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 25, 2025 at 03:26:44PM -0400, Peter Xu wrote:
> > On Wed, Jun 25, 2025 at 03:41:54PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Jun 25, 2025 at 01:12:11PM -0400, Peter Xu wrote:
> > > 
> > > > After I read the two use cases, I mostly agree.  Just one trivial thing to
> > > > mention, it may not be direct map but vmap() (see io_region_init_ptr()).
> > > 
> > > If it is vmapped then this is all silly, you should vmap and mmap
> > > using the same cache colouring and, AFAIK, pgoff is how this works for
> > > purely userspace.
> > > 
> > > Once vmap'd it should determine the cache colour and set the pgoff
> > > properly, then everything should already work no?
> > 
> > I don't yet see how to set the pgoff.  Here pgoff is passed from the
> > userspace, which follows io_uring's definition (per io_uring_mmap).
> 
> That's too bad
> 
> So you have to do it the other way and pass the pgoff to the vmap so
> the vmap ends up with the same colouring as a user VMa holding the
> same pages..

Not sure if I get that point, but.. it'll be hard to achieve, at least.

The vmap() happens when the submission/completion queues are initialized at
io_uring instance creation.  The mmap() happens later, and can also happen
multiple times, so all of the mmap()ed VAs need to share the same colouring
with the vmap()..  In this case it sounds reasonable to me to do the
alignment at mmap() time, against the vmap() results.

> 
> > So if we want the new API to be proposed here, and make VFIO use it first
> > (while consider it to be applicable to all existing MMU users at least,
> > which I checked all of them so far now), I'd think this proper:
> > 
> >     int (*mmap_va_hint)(struct file *file, unsigned long *pgoff, size_t len);
> > 
> > The changes comparing to previous:
> > 
> >     (1) merged pgoff and *phys_pgoff parameters into one unsigned long, so
> >     the hook can adjust the pgoff for the va allocator to be used.  The
> >     adjustment will not be visible to future mmap() when VMA is created.
> 
> It seems functional, but the above is better, IMHO.

Do you mean we could start with no modification allowed on *pgoff?  I'd
prefer having *pgoff modifiable from the start, as it will not only work
for the io_uring / parisc case above from day one (so we won't need to
introduce it on top later, modifying existing users..), but it'll also be
cleaner to use in the current VFIO use case.

> 
> >     (2) I renamed it to mmap_va_hint(), because *pgoff will be able to be
> >     updated, so it's not only about ordering, but "order" and "pgoff
> >     adjustment" hints that the core mm will use when calculating the VA.
> 
> Where does order come back though? Returns order?

Yes.

> 
> It seems viable

After double-checking the API above, I'll go and prepare a new version.

Thanks a lot, Jason.

-- 
Peter Xu
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Jason Gunthorpe 3 months, 1 week ago
On Wed, Jul 02, 2025 at 04:58:46PM -0400, Peter Xu wrote:
> > So you have to do it the other way and pass the pgoff to the vmap so
> > the vmap ends up with the same colouring as a user VMa holding the
> > same pages..
> 
> Not sure if I get that point, but.. it'll be hard to achieve at least.
> 
> The vmap() happens (submit/complete queues initializes) when io_uring
> instance is created.  The mmap() happens later, and it can also happen
> multiple times, so that all of the VAs got mmap()ed need to share the same
> colouring with the vmap()..  In this case it sounds reasonable to me to
> have the alignment done at mmap(), against the vmap() results.

The way this usually works is that the memory is bound to an mmap "cookie"
- the pgoff - which userspace can use as many times as it likes.

Usually you know the thing being allocated will be mmap'd and what
its pgoff will be, because it is 1:1 with the cookie/pgoff.

Didn't try to guess what io_uring has done here, but, IMHO, it would
be weird if the pgoffs are not 1:1 with the vmaps.

Since you said the pgoff is constant and not exchanged between user and
kernel, presumably the vmap just needs to use that constant pgoff for its
colouring.

> > > The changes comparing to previous:
> > > 
> > >     (1) merged pgoff and *phys_pgoff parameters into one unsigned long, so
> > >     the hook can adjust the pgoff for the va allocator to be used.  The
> > >     adjustment will not be visible to future mmap() when VMA is created.
> > 
> > It seems functional, but the above is better, IMHO.
> 
> Do you mean we can start with no modification allowed on *pgoff?  I'd
> prefer having *pgoff modifiable from the start, as it'll not only work for
> io_uring / parisc above since the 1st day (so we don't need to introduce it
> on top, modifying existing users..), but it'll also be cleaner to be used
> in the current VFIO's use case.

I think a modifiable pgoff is really a weird concept... Especially if it
is only modified for the alignment calculation.

But if it is the only way so be it

Jason
Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Posted by Peter Xu 3 months, 2 weeks ago
On Tue, Jun 24, 2025 at 04:37:26PM -0400, Peter Xu wrote:
> On Thu, Jun 19, 2025 at 03:40:41PM -0300, Jason Gunthorpe wrote:
> > Even with this new version you have to decide to return PUD_SIZE or
> > bar_size in pci and your same reasoning that PUD_SIZE make sense
> > applies (though I would probably return bar_size and just let the core
> > code cap it to PUD_SIZE)
> 
> Yes.
> 
> Today I went back to look at this, I was trying to introduce this for
> file_operations:
> 
> 	int (*get_mapping_order)(struct file *, unsigned long, size_t);
> 
> It looks almost good, except that it so far has no way to return the
> physical address for further calculation on the alignment.
> 
> For THP, VA is always calculated against pgoff not physical address on the
> alignment.  I think it's OK for THP, because every 2M THP folio will be
> naturally 2M aligned on the physical address, so it fits when e.g. pgoff=0
> in the calculation of thp_get_unmapped_area_vmflags().
> 
> Logically it should even also work for vfio-pci, as long as VFIO keeps
> using the lower 40 bits of the device_fd to represent the bar offset,
> meanwhile it'll also require PCIe spec asking the PCI bars to be mapped
> aligned with bar sizes.
> 
> But from an API POV, get_mapping_order() logically should return something
> for further calculation of the alignment to get the VA.  pgoff here may not
> always be the right thing to use to align to the VA: after all, pgtable
> mapping is about VA -> PA, the only reasonable and reliable way is to align
> VA to the PA to be mapped, and as an API we shouldn't assume pgoff is
> always aligned to PA address space.
> 
> Any thoughts?

I should have listed current viable next steps..  We have at least these
options:

(a) Ignore this issue and keep the get_mapping_order() interface as above,
    as long as it works for vfio-pci.

    I don't like this option.  I prefer the API (if we're going to
    introduce one) to be applicable no matter how pgoff is mapped to
    PAs.  I don't like the API relying on a specific driver following a
    specific spec (in this case, PCI).

(b) I can make the new API like this instead:

    int (*get_mapping_order)(struct file *, unsigned long, unsigned long *, size_t);

    where the order to map is returned in the retval and *phys_pgoff is
    returned alongside.  But that's not pretty, if not outright ugly.

(c) Go back to what I did in the current v1, addressing the comments and
    keeping get_unmapped_area() until we know a better way.

I'll vote for (c), but I'm open to suggestions.

Thanks,

-- 
Peter Xu