[PATCH 10/13] PCI/P2PDMA: support compound page in p2pmem_alloc_mmap()

Posted by Hou Tao 1 month, 3 weeks ago
From: Hou Tao <houtao1@huawei.com>

P2PDMA memory already supports compound pages, and the helpers for
inserting compound pages into a VMA are also ready, so add compound
page support to p2pmem_alloc_mmap() as well. This greatly reduces the
overhead of mmap() and get_user_pages() when compound pages are enabled
for p2pdma memory.

The use of vm_private_data to save the alignment of the p2pdma memory
needs some explanation. The natural way to get the alignment is through
the pci_dev. That could be achieved either by invoking kernfs_of() and
sysfs_file_kobj(), or by defining a new struct kernfs_vm_ops to pass
the kobject to the ->may_split() and ->pagesize() callbacks. The former
approach depends too much on kernfs implementation details, and the
latter would lead to excessive churn. Therefore, choose the simpler way
of saving the alignment in vm_private_data instead.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 drivers/pci/p2pdma.c | 48 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index e97f5da73458..4a133219ac43 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -128,6 +128,25 @@ static unsigned long p2pmem_get_unmapped_area(struct file *filp, struct kobject
 	return mm_get_unmapped_area(filp, uaddr, len, pgoff, flags);
 }
 
+static int p2pmem_may_split(struct vm_area_struct *vma, unsigned long addr)
+{
+	size_t align = (uintptr_t)vma->vm_private_data;
+
+	if (!IS_ALIGNED(addr, align))
+		return -EINVAL;
+	return 0;
+}
+
+static unsigned long p2pmem_pagesize(struct vm_area_struct *vma)
+{
+	return (uintptr_t)vma->vm_private_data;
+}
+
+static const struct vm_operations_struct p2pmem_vm_ops = {
+	.may_split = p2pmem_may_split,
+	.pagesize = p2pmem_pagesize,
+};
+
 static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
 		const struct bin_attribute *attr, struct vm_area_struct *vma)
 {
@@ -136,6 +155,7 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
 	struct pci_p2pdma *p2pdma;
 	struct percpu_ref *ref;
 	unsigned long vaddr;
+	size_t align;
 	void *kaddr;
 	int ret;
 
@@ -161,6 +181,16 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
 		goto out;
 	}
 
+	align = p2pdma->align;
+	if (vma->vm_start & (align - 1) || vma->vm_end & (align - 1)) {
+		pci_info_ratelimited(pdev,
+				     "%s: unaligned vma (%#lx~%#lx, %#lx)\n",
+				     current->comm, vma->vm_start, vma->vm_end,
+				     align);
+		ret = -EINVAL;
+		goto out;
+	}
+
 	kaddr = (void *)gen_pool_alloc_owner(p2pdma->pool, len, (void **)&ref);
 	if (!kaddr) {
 		ret = -ENOMEM;
@@ -178,7 +208,7 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
 	}
 	rcu_read_unlock();
 
-	for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
+	for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += align) {
 		struct page *page = virt_to_page(kaddr);
 
 		/*
@@ -188,7 +218,12 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
 		 */
 		VM_WARN_ON_ONCE_PAGE(page_ref_count(page), page);
 		set_page_count(page, 1);
-		ret = vm_insert_page(vma, vaddr, page);
+		if (align == PUD_SIZE)
+			ret = vm_insert_folio_pud(vma, vaddr, page_folio(page));
+		else if (align == PMD_SIZE)
+			ret = vm_insert_folio_pmd(vma, vaddr, page_folio(page));
+		else
+			ret = vm_insert_page(vma, vaddr, page);
 		if (ret) {
 			gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
 			percpu_ref_put(ref);
@@ -196,10 +231,15 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
 		}
 		percpu_ref_get(ref);
 		put_page(page);
-		kaddr += PAGE_SIZE;
-		len -= PAGE_SIZE;
+		kaddr += align;
+		len -= align;
 	}
 
+	/* Disable unaligned splitting due to vma merge */
+	vm_flags_set(vma, VM_DONTEXPAND);
+	vma->vm_ops = &p2pmem_vm_ops;
+	vma->vm_private_data = (void *)(uintptr_t)align;
+
 	percpu_ref_put(ref);
 
 	return 0;
-- 
2.29.2
Re: [PATCH 10/13] PCI/P2PDMA: support compound page in p2pmem_alloc_mmap()
Posted by Alistair Popple 1 month ago
On 2025-12-20 at 15:04 +1100, Hou Tao <houtao@huaweicloud.com> wrote...
> From: Hou Tao <houtao1@huawei.com>
> 
> P2PDMA memory has already supported compound page and the helpers which
> support inserting compound page into vma is also ready, therefore, add
> support for compound page in p2pmem_alloc_mmap() as well. It will reduce
> the overhead of mmap() and get_user_pages() a lot when compound page is
> enabled for p2pdma memory.
> 
> The use of vm_private_data to save the alignment of p2pdma memory needs
> explanation. The normal way to get the alignment is through pci_dev. It
> can be achieved by either invoking kernfs_of() and sysfs_file_kobj() or
> defining a new struct kernfs_vm_ops to pass the kobject to the
> may_split() and ->pagesize() callbacks. The former approach depends too
> much on kernfs implementation details, and the latter would lead to
> excessive churn. Therefore, choose the simpler way of saving alignment
> in vm_private_data instead.
> 
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>  drivers/pci/p2pdma.c | 48 ++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 44 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index e97f5da73458..4a133219ac43 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -128,6 +128,25 @@ static unsigned long p2pmem_get_unmapped_area(struct file *filp, struct kobject
>  	return mm_get_unmapped_area(filp, uaddr, len, pgoff, flags);
>  }
>  
> +static int p2pmem_may_split(struct vm_area_struct *vma, unsigned long addr)
> +{
> +	size_t align = (uintptr_t)vma->vm_private_data;
> +
> +	if (!IS_ALIGNED(addr, align))
> +		return -EINVAL;
> +	return 0;
> +}
> +
> +static unsigned long p2pmem_pagesize(struct vm_area_struct *vma)
> +{
> +	return (uintptr_t)vma->vm_private_data;
> +}
> +
> +static const struct vm_operations_struct p2pmem_vm_ops = {
> +	.may_split = p2pmem_may_split,
> +	.pagesize = p2pmem_pagesize,
> +};
> +
>  static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>  		const struct bin_attribute *attr, struct vm_area_struct *vma)
>  {
> @@ -136,6 +155,7 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>  	struct pci_p2pdma *p2pdma;
>  	struct percpu_ref *ref;
>  	unsigned long vaddr;
> +	size_t align;
>  	void *kaddr;
>  	int ret;
>  
> @@ -161,6 +181,16 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>  		goto out;
>  	}
>  
> +	align = p2pdma->align;
> +	if (vma->vm_start & (align - 1) || vma->vm_end & (align - 1)) {
> +		pci_info_ratelimited(pdev,
> +				     "%s: unaligned vma (%#lx~%#lx, %#lx)\n",
> +				     current->comm, vma->vm_start, vma->vm_end,
> +				     align);
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
>  	kaddr = (void *)gen_pool_alloc_owner(p2pdma->pool, len, (void **)&ref);
>  	if (!kaddr) {
>  		ret = -ENOMEM;
> @@ -178,7 +208,7 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>  	}
>  	rcu_read_unlock();
>  
> -	for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
> +	for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += align) {
>  		struct page *page = virt_to_page(kaddr);
>  
>  		/*
> @@ -188,7 +218,12 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>  		 */
>  		VM_WARN_ON_ONCE_PAGE(page_ref_count(page), page);
>  		set_page_count(page, 1);
> -		ret = vm_insert_page(vma, vaddr, page);
> +		if (align == PUD_SIZE)
> +			ret = vm_insert_folio_pud(vma, vaddr, page_folio(page));
> +		else if (align == PMD_SIZE)
> +			ret = vm_insert_folio_pmd(vma, vaddr, page_folio(page));

This doesn't look quite right to me - where do you initialise the folio
metadata? I'd expect a call to prep_compound_page() or some equivalent somewhere
- for example calling something like zone_device_page_init() to set the correct
folio order, etc.
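
Something along these lines before the insert, just to illustrate the
idea (I'm not sure prep_compound_page() is even callable from a driver,
so treat it purely as a sketch):

	if (align > PAGE_SIZE) {
		/*
		 * Turn the allocated range into a single high-order
		 * folio so the mm sees consistent compound metadata
		 * when the PMD/PUD entry is installed.
		 */
		prep_compound_page(page, get_order(align));
	}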

 - Alistair

> +		else
> +			ret = vm_insert_page(vma, vaddr, page);
>  		if (ret) {
>  			gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
>  			percpu_ref_put(ref);
> @@ -196,10 +231,15 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>  		}
>  		percpu_ref_get(ref);
>  		put_page(page);
> -		kaddr += PAGE_SIZE;
> -		len -= PAGE_SIZE;
> +		kaddr += align;
> +		len -= align;
>  	}
>  
> +	/* Disable unaligned splitting due to vma merge */
> +	vm_flags_set(vma, VM_DONTEXPAND);
> +	vma->vm_ops = &p2pmem_vm_ops;
> +	vma->vm_private_data = (void *)(uintptr_t)align;
> +
>  	percpu_ref_put(ref);
>  
>  	return 0;
> -- 
> 2.29.2
>
Re: [PATCH 10/13] PCI/P2PDMA: support compound page in p2pmem_alloc_mmap()
Posted by Logan Gunthorpe 1 month, 2 weeks ago

On 2025-12-19 21:04, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> P2PDMA memory has already supported compound page and the helpers which
> support inserting compound page into vma is also ready, therefore, add
> support for compound page in p2pmem_alloc_mmap() as well. It will reduce
> the overhead of mmap() and get_user_pages() a lot when compound page is
> enabled for p2pdma memory.
> 
> The use of vm_private_data to save the alignment of p2pdma memory needs
> explanation. The normal way to get the alignment is through pci_dev. It
> can be achieved by either invoking kernfs_of() and sysfs_file_kobj() or
> defining a new struct kernfs_vm_ops to pass the kobject to the
> may_split() and ->pagesize() callbacks. The former approach depends too
> much on kernfs implementation details, and the latter would lead to
> excessive churn. Therefore, choose the simpler way of saving alignment
> in vm_private_data instead.
> 
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>  drivers/pci/p2pdma.c | 48 ++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 44 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index e97f5da73458..4a133219ac43 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -128,6 +128,25 @@ static unsigned long p2pmem_get_unmapped_area(struct file *filp, struct kobject
>  	return mm_get_unmapped_area(filp, uaddr, len, pgoff, flags);
>  }
>  
> +static int p2pmem_may_split(struct vm_area_struct *vma, unsigned long addr)
> +{
> +	size_t align = (uintptr_t)vma->vm_private_data;
> +
> +	if (!IS_ALIGNED(addr, align))
> +		return -EINVAL;
> +	return 0;
> +}
> +
> +static unsigned long p2pmem_pagesize(struct vm_area_struct *vma)
> +{
> +	return (uintptr_t)vma->vm_private_data;
> +}
> +
> +static const struct vm_operations_struct p2pmem_vm_ops = {
> +	.may_split = p2pmem_may_split,
> +	.pagesize = p2pmem_pagesize,
> +};
> +
>  static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>  		const struct bin_attribute *attr, struct vm_area_struct *vma)
>  {
> @@ -136,6 +155,7 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>  	struct pci_p2pdma *p2pdma;
>  	struct percpu_ref *ref;
>  	unsigned long vaddr;
> +	size_t align;
>  	void *kaddr;
>  	int ret;
>  
> @@ -161,6 +181,16 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>  		goto out;
>  	}
>  
> +	align = p2pdma->align;
> +	if (vma->vm_start & (align - 1) || vma->vm_end & (align - 1)) {
> +		pci_info_ratelimited(pdev,
> +				     "%s: unaligned vma (%#lx~%#lx, %#lx)\n",
> +				     current->comm, vma->vm_start, vma->vm_end,
> +				     align);
> +		ret = -EINVAL;
> +		goto out;
> +	}

I'm a bit confused by some aspects of these changes. Why does the
alignment become a property of the PCI device? It appears that if the
CPU supports different sized huge pages then the size and alignment
restrictions on P2PDMA memory become greater. So if someone is only
allocating a few KB these changes will break their code and refuse to
allocate single pages.

I would have expected this code to allocate an appropriately aligned
block of the p2p memory based on the requirements of the current
mapping, not based on alignment requirements established when the device
is probed.

Logan
Re: [PATCH 10/13] PCI/P2PDMA: support compound page in p2pmem_alloc_mmap()
Posted by Jason Gunthorpe 1 month ago
On Mon, Dec 22, 2025 at 10:04:17AM -0700, Logan Gunthorpe wrote:
> I would have expected this code to allocate an appropriately aligned
> block of the p2p memory based on the requirements of the current
> mapping, not based on alignment requirements established when the device
> is probed.

Yeah, I think this is not right too.

I think the flow has become confused by trying to set a static
vmemmap_shift when creating the pgmap. That is not how something like
this should work at all.

Instead the basic idea should be that each mmap() system call will
determine what folio order it would like to have, allocate an aligned
range of physical memory from the genpool, and then turn the folios in
that range into a single high order folio.

Finally the high order folio is installed in one shot with the mm
dealing with placing it optimally in the right page table levels.

You could use a heuristic (ie I'm 2M size aligned or 1G size aligned)
or maybe use the MAP_HUGE_2M/MAP_HUGE_1G flags, or something else
perhaps.
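
Roughly something like this for picking the order (sketch only, the
helper name is made up):

	/*
	 * Pick the largest folio order this particular mapping can use,
	 * based only on the size/alignment of the request.
	 */
	static unsigned int p2pmem_folio_order(struct vm_area_struct *vma)
	{
		unsigned long len = vma->vm_end - vma->vm_start;

		if (IS_ALIGNED(vma->vm_start, PUD_SIZE) &&
		    IS_ALIGNED(len, PUD_SIZE))
			return PUD_SHIFT - PAGE_SHIFT;
		if (IS_ALIGNED(vma->vm_start, PMD_SIZE) &&
		    IS_ALIGNED(len, PMD_SIZE))
			return PMD_SHIFT - PAGE_SHIFT;
		return 0;
	}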

Don't follow what DAX did, this doesn't have the limitations DAX had
to work with.

I also don't think drivers should be open coding the
vm_insert_folio_xx() stuff, the mm should have a helper to accept a
folio of any order, the VA and the phys, then install it optimally. So
don't export vm_insert_folio_pmd()/etc please.
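
ie a single entry point with a signature roughly like this (the name is
made up here):

	int vm_insert_folio(struct vm_area_struct *vma, unsigned long addr,
			    struct folio *folio);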

Finally, Peter Xu has been working on the issue of setting the
alignment of VMAs when they will be used to hold large aligned folios,
which would help make this more useful by avoiding the need for MAP_FIXED:

https://lore.kernel.org/kvm/20251204151003.171039-1-peterx@redhat.com/

Assuming the folio size can be determined early enough in the VMA
process, though Lorenzo's recent refactoring here into mmap_prepare
may be helpful.

Jason
Re: [PATCH 10/13] PCI/P2PDMA: support compound page in p2pmem_alloc_mmap()
Posted by Logan Gunthorpe 1 month ago

On 2026-01-07 13:24, Jason Gunthorpe wrote:
> On Mon, Dec 22, 2025 at 10:04:17AM -0700, Logan Gunthorpe wrote:
>> I would have expected this code to allocate an appropriately aligned
>> block of the p2p memory based on the requirements of the current
>> mapping, not based on alignment requirements established when the device
>> is probed.
> 
> Yeah, I think this is not right too.
> 
> I think the flow has become confused by trying to set a static
> vmemmap_shift when creating the pgmap. That is not how something like
> this should work at all.
> 
> Instead the basic idea should be that each mmap systemcall will
> determine what folio order it would like to have, it will allocate an
> aligned range of physical from the genpool, and then it will alter the
> folios in that range into a single high order folio.
> 
> Finally the high order folio is installed in one shot with the mm
> dealing with placing it optimally in the right page table levels.

This all sounds the same as what I was advocating for. genpool does
still need to be modified to support the specified alignment
requirements for the allocation.

If there is more help from the VM layer to insert different orders of
memory, that would be fantastic too.

Logan
Re: [PATCH 10/13] PCI/P2PDMA: support compound page in p2pmem_alloc_mmap()
Posted by Hou Tao 1 month, 2 weeks ago

On 12/23/2025 1:04 AM, Logan Gunthorpe wrote:
>
> On 2025-12-19 21:04, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> P2PDMA memory has already supported compound page and the helpers which
>> support inserting compound page into vma is also ready, therefore, add
>> support for compound page in p2pmem_alloc_mmap() as well. It will reduce
>> the overhead of mmap() and get_user_pages() a lot when compound page is
>> enabled for p2pdma memory.
>>
>> The use of vm_private_data to save the alignment of p2pdma memory needs
>> explanation. The normal way to get the alignment is through pci_dev. It
>> can be achieved by either invoking kernfs_of() and sysfs_file_kobj() or
>> defining a new struct kernfs_vm_ops to pass the kobject to the
>> may_split() and ->pagesize() callbacks. The former approach depends too
>> much on kernfs implementation details, and the latter would lead to
>> excessive churn. Therefore, choose the simpler way of saving alignment
>> in vm_private_data instead.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>>  drivers/pci/p2pdma.c | 48 ++++++++++++++++++++++++++++++++++++++++----
>>  1 file changed, 44 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index e97f5da73458..4a133219ac43 100644
>> --- a/drivers/pci/p2pdma.c
>> +++ b/drivers/pci/p2pdma.c
>> @@ -128,6 +128,25 @@ static unsigned long p2pmem_get_unmapped_area(struct file *filp, struct kobject
>>  	return mm_get_unmapped_area(filp, uaddr, len, pgoff, flags);
>>  }
>>  
>> +static int p2pmem_may_split(struct vm_area_struct *vma, unsigned long addr)
>> +{
>> +	size_t align = (uintptr_t)vma->vm_private_data;
>> +
>> +	if (!IS_ALIGNED(addr, align))
>> +		return -EINVAL;
>> +	return 0;
>> +}
>> +
>> +static unsigned long p2pmem_pagesize(struct vm_area_struct *vma)
>> +{
>> +	return (uintptr_t)vma->vm_private_data;
>> +}
>> +
>> +static const struct vm_operations_struct p2pmem_vm_ops = {
>> +	.may_split = p2pmem_may_split,
>> +	.pagesize = p2pmem_pagesize,
>> +};
>> +
>>  static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>>  		const struct bin_attribute *attr, struct vm_area_struct *vma)
>>  {
>> @@ -136,6 +155,7 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>>  	struct pci_p2pdma *p2pdma;
>>  	struct percpu_ref *ref;
>>  	unsigned long vaddr;
>> +	size_t align;
>>  	void *kaddr;
>>  	int ret;
>>  
>> @@ -161,6 +181,16 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>>  		goto out;
>>  	}
>>  
>> +	align = p2pdma->align;
>> +	if (vma->vm_start & (align - 1) || vma->vm_end & (align - 1)) {
>> +		pci_info_ratelimited(pdev,
>> +				     "%s: unaligned vma (%#lx~%#lx, %#lx)\n",
>> +				     current->comm, vma->vm_start, vma->vm_end,
>> +				     align);
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
> I'm a bit confused by some aspects of these changes. Why does the
> alignment become a property of the PCI device? It appears that if the
> CPU supports different sized huge pages then the size and alignment
> restrictions on P2PDMA memory become greater. So if someone is only
> allocating a few KB these changes will break their code and refuse to
> allocate single pages.
>
> I would have expected this code to allocate an appropriately aligned
> block of the p2p memory based on the requirements of the current
> mapping, not based on alignment requirements established when the device
> is probed.

The behavior mimics device-dax, where creating a device-dax device
requires specifying an alignment property. Supporting a different
alignment for each userspace mapping could work. However, there is no
way for userspace to tell whether or not the alignment is mandatory.
Take the procedure below as an example:

1) the size of the CMB BAR is 4MB
2) application 1 allocates 4KB. Its mapping is 4KB-aligned
3) application 2 allocates 2MB. If the allocation from gen_pool is not
aligned, the mapping can only be 4KB-aligned. If the allocator supports
aligned allocation, the mapping could be 2MB-aligned. However, the mmap
implementation in the kernel doesn't know which way is appropriate. If
the alignment is specified in the p2pdma, the implementation knows that
an aligned 2MB mapping is appropriate (see the userspace sketch below).
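
For completeness, the userspace side of 3) is just something like this
(the BDF is obviously made up), so the only hint the kernel gets is the
length:

	int fd = open("/sys/bus/pci/devices/0000:01:00.0/p2pmem/allocate",
		      O_RDWR);
	void *buf = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);

With a device-level alignment, p2pmem_get_unmapped_area() can hand back
a 2MB-aligned address for such a request without any extra flags from
the application.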

> Logan
Re: [PATCH 10/13] PCI/P2PDMA: support compound page in p2pmem_alloc_mmap()
Posted by Logan Gunthorpe 1 month ago
>> I'm a bit confused by some aspects of these changes. Why does the
>> alignment become a property of the PCI device? It appears that if the
>> CPU supports different sized huge pages then the size and alignment
>> restrictions on P2PDMA memory become greater. So if someone is only
>> allocating a few KB these changes will break their code and refuse to
>> allocate single pages.
>>
>> I would have expected this code to allocate an appropriately aligned
>> block of the p2p memory based on the requirements of the current
>> mapping, not based on alignment requirements established when the device
>> is probed.
> 
> The behavior mimics device-dax in which the creation of device-dax
> device needs to specify the alignment property. Supporting different
> alignments for different userspace mapping could work. However, it is no
> way for the userspace to tell whether or not the the alignment is
> mandatory. Take the below procedure as an example:

Then I don't think the approach device-dax took makes sense for p2pdma.

> 1) the size of CMB bar is 4MB
> 2) application 1 allocates 4KB. Its mapping is 4KB aligned
> 3) application 2 allocates 2MB. If the allocation from gen_pool is not
> aligned, the mapping only supports 4KB-aligned mapping. If the
> allocation support aligned allocation, the mapping could support
> 2MB-aligned mapping. However, the mmap implementation in the kernel
> doesn't know which way is appropriate. If the alignment is specified in
> the p2pdma, the implement could know the aligned 2MB mapping is appropriate.

Specifying a minimum alignment as a property of the p2pdma device makes
no sense to me.

I think the p2pdma code should just make a best effort to get the most
highly aligned buffer it can for the allocation. If it cannot, it should
fall back to plain page-aligned buffers. We might have to make some
minor modifications to genalloc to create an aligned version of the
allocator (similar to gen_pool_dma_alloc_align()).
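
Unless gen_pool_alloc_algo_owner() with gen_pool_first_fit_align already
covers it; something along these lines (untested sketch):

	struct genpool_data_align data = { .align = align };

	/* try to get a naturally aligned block for the requested size */
	kaddr = (void *)gen_pool_alloc_algo_owner(p2pdma->pool, len,
						  gen_pool_first_fit_align,
						  &data, (void **)&ref);
	if (!kaddr) {
		/* fall back to a plain page-aligned allocation */
		align = PAGE_SIZE;
		kaddr = (void *)gen_pool_alloc_owner(p2pdma->pool, len,
						     (void **)&ref);
	}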

Logan