From: Li Zhe <lizhe.67@bytedance.com>
When vfio_pin_pages_remote() is called with a range of addresses that
includes hugetlbfs folios, the function currently performs individual
statistics counting operations for each page. This can lead to significant
performance overheads, especially when dealing with large ranges of pages.
This patch optimizes this process by batching the statistics counting
operations.
The performance test results for completing the 8G VFIO IOMMU DMA mapping,
obtained through trace-cmd, are as follows. In this case, the 8G virtual
address space has been mapped to physical memory using hugetlbfs with
pagesize=2M.
Before this patch:
funcgraph_entry: # 33813.703 us | vfio_pin_map_dma();
After this patch:
funcgraph_entry: # 15635.055 us | vfio_pin_map_dma();
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++++++++++++++++
1 file changed, 49 insertions(+)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0ac56072af9f..bafa7f8c4cc6 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -337,6 +337,30 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
 	return NULL;
 }
 
+/*
+ * Find a random vfio_pfn that belongs to the range
+ * [iova, iova + PAGE_SIZE * npage)
+ */
+static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
+		dma_addr_t iova, unsigned long npage)
+{
+	struct vfio_pfn *vpfn;
+	struct rb_node *node = dma->pfn_list.rb_node;
+	dma_addr_t end_iova = iova + PAGE_SIZE * npage;
+
+	while (node) {
+		vpfn = rb_entry(node, struct vfio_pfn, node);
+
+		if (end_iova <= vpfn->iova)
+			node = node->rb_left;
+		else if (iova > vpfn->iova)
+			node = node->rb_right;
+		else
+			return vpfn;
+	}
+	return NULL;
+}
+
 static void vfio_link_pfn(struct vfio_dma *dma,
 			  struct vfio_pfn *new)
 {
@@ -670,6 +694,31 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 				iova += (PAGE_SIZE * ret);
 				continue;
 			}
+
 		}
+		/* Handle hugetlbfs page */
+		if (likely(!disable_hugepages) &&
+		    folio_test_hugetlb(page_folio(batch->pages[batch->offset]))) {
+			if (pfn != *pfn_base + pinned)
+				goto out;
+
+			if (!rsvd && !vfio_find_vpfn_range(dma, iova, batch->size)) {
+				if (!dma->lock_cap &&
+				    mm->locked_vm + lock_acct + batch->size > limit) {
+					pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
+						__func__, limit << PAGE_SHIFT);
+					ret = -ENOMEM;
+					goto unpin_out;
+				}
+				pinned += batch->size;
+				npage -= batch->size;
+				vaddr += PAGE_SIZE * batch->size;
+				iova += PAGE_SIZE * batch->size;
+				lock_acct += batch->size;
+				batch->offset += batch->size;
+				batch->size = 0;
+				continue;
+			}
+		}
 
 		/*
--
2.20.1
On Tue, 13 May 2025 11:57:30 +0800
lizhe.67@bytedance.com wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
>
> When vfio_pin_pages_remote() is called with a range of addresses that
> includes hugetlbfs folios, the function currently performs individual
> statistics counting operations for each page. This can lead to significant
> performance overheads, especially when dealing with large ranges of pages.
>
> This patch optimizes this process by batching the statistics counting
> operations.
>
> The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> obtained through trace-cmd, are as follows. In this case, the 8G virtual
> address space has been mapped to physical memory using hugetlbfs with
> pagesize=2M.
>
> Before this patch:
> funcgraph_entry: # 33813.703 us | vfio_pin_map_dma();
>
> After this patch:
> funcgraph_entry: # 15635.055 us | vfio_pin_map_dma();
>
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> ---
> drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++++++++++++++++
> 1 file changed, 49 insertions(+)
Hi,
Thanks for looking at improvements in this area...
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 0ac56072af9f..bafa7f8c4cc6 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -337,6 +337,30 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> return NULL;
> }
>
> +/*
> + * Find a random vfio_pfn that belongs to the range
> + * [iova, iova + PAGE_SIZE * npage)
> + */
> +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
> + dma_addr_t iova, unsigned long npage)
> +{
> + struct vfio_pfn *vpfn;
> + struct rb_node *node = dma->pfn_list.rb_node;
> + dma_addr_t end_iova = iova + PAGE_SIZE * npage;
> +
> + while (node) {
> + vpfn = rb_entry(node, struct vfio_pfn, node);
> +
> + if (end_iova <= vpfn->iova)
> + node = node->rb_left;
> + else if (iova > vpfn->iova)
> + node = node->rb_right;
> + else
> + return vpfn;
> + }
> + return NULL;
> +}
This essentially duplicates vfio_find_vpfn(), where the existing
function only finds a single page. The existing function should be
extended for this new use case and callers updated. Also the vfio_pfn
is not "random", it's the first vfio_pfn overlapping the range.
> +
> static void vfio_link_pfn(struct vfio_dma *dma,
> struct vfio_pfn *new)
> {
> @@ -670,6 +694,31 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> iova += (PAGE_SIZE * ret);
> continue;
> }
> +
Spurious new blank line.
> + }
A new blank line here would be appreciated.
> + /* Handle hugetlbfs page */
> + if (likely(!disable_hugepages) &&
Isn't this already accounted for with npage = 1?
> + folio_test_hugetlb(page_folio(batch->pages[batch->offset]))) {
I don't follow how this guarantees the entire batch->size is
contiguous. Isn't it possible that a batch could contain multiple
hugetlb folios? Is the assumption here only true if folio_nr_pages()
(or specifically the pages remaining) is >= batch->capacity? What
happens if we try to map the last half of one 2MB hugetlb page and
first half of the non-physically-contiguous next page? Or what if the
hugetlb size is 64KB and the batch contains multiple folios that are
not physically contiguous?
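
A rough sketch of what bounding the batched step to the current folio could
look like (a hypothetical helper, not part of this patch; it assumes the
existing struct vfio_batch fields pages/offset/size and the generic
folio_nr_pages()/folio_page_idx() helpers):

/*
 * Sketch only: how many of the pages still in the batch are known to be
 * physically contiguous because they sit in the same folio as the page
 * at the current offset?  A batch spanning several folios is then never
 * treated as one contiguous run.
 */
static long vfio_batch_contig_pages(struct vfio_batch *batch)
{
	struct page *page = batch->pages[batch->offset];
	struct folio *folio = page_folio(page);
	long left_in_folio = folio_nr_pages(folio) -
			     folio_page_idx(folio, page);

	return min_t(long, batch->size, left_in_folio);
}
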
> + if (pfn != *pfn_base + pinned)
> + goto out;
> +
> + if (!rsvd && !vfio_find_vpfn_range(dma, iova, batch->size)) {
> + if (!dma->lock_cap &&
> + mm->locked_vm + lock_acct + batch->size > limit) {
> + pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> + __func__, limit << PAGE_SHIFT);
> + ret = -ENOMEM;
> + goto unpin_out;
> + }
> + pinned += batch->size;
> + npage -= batch->size;
> + vaddr += PAGE_SIZE * batch->size;
> + iova += PAGE_SIZE * batch->size;
> + lock_acct += batch->size;
> + batch->offset += batch->size;
> + batch->size = 0;
> + continue;
> + }
There's a lot of duplication with the existing page-iterative loop. I
think they could be consolidated if we extract the number of known
contiguous pages based on the folio into a variable, default 1.
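
Inside the existing loop, the consolidated step might then look roughly like
this (a sketch only, reusing the current variable names pinned/npage/vaddr/
iova/lock_acct and the hypothetical vfio_batch_contig_pages() helper sketched
above; not a tested change):

		/* Default to the single-page step the loop takes today. */
		long contig = 1;

		if (likely(!disable_hugepages))
			contig = vfio_batch_contig_pages(batch);

		/*
		 * Externally pinned pages are accounted per page, so if any
		 * vpfn overlaps this range, fall back to the single-page
		 * step rather than skipping accounting for the whole run.
		 */
		if (contig > 1 && vfio_find_vpfn_range(dma, iova, contig))
			contig = 1;

		if (!rsvd && !vfio_find_vpfn_range(dma, iova, contig)) {
			if (!dma->lock_cap &&
			    mm->locked_vm + lock_acct + contig > limit) {
				ret = -ENOMEM;
				goto unpin_out;
			}
			lock_acct += contig;
		}

		/* Same bookkeeping as the per-page path, scaled by contig. */
		pinned += contig;
		npage -= contig;
		vaddr += PAGE_SIZE * contig;
		iova += PAGE_SIZE * contig;
		batch->offset += contig;
		batch->size -= contig;
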
Also, while this approach is an improvement, it leaves a lot on the
table in scenarios where folio_nr_pages() exceeds batch->capacity. For
example, we're at best incrementing 1GB hugetlb pages in 2MB increments.
We're also wasting a lot of cycles filling page pointers we mostly don't
use. Thanks,
Alex
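
(For a rough sense of the scale involved, assuming 4KB base pages and the
usual single-page, 512-entry batch: a 1GB hugetlb folio spans 262144 base
pages, so even when every batch is consumed in one step the loop still needs
on the order of 1GB / 2MB = 512 batch refills via pin_user_pages_remote()
and 262144 page-pointer stores to map one folio.)
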
On Thu, May 15, 2025 at 03:19:46PM -0600, Alex Williamson wrote:
> On Tue, 13 May 2025 11:57:30 +0800
> lizhe.67@bytedance.com wrote:
>
> > From: Li Zhe <lizhe.67@bytedance.com>
> >
> > When vfio_pin_pages_remote() is called with a range of addresses that
> > includes hugetlbfs folios, the function currently performs individual
> > statistics counting operations for each page. This can lead to significant
> > performance overheads, especially when dealing with large ranges of pages.
> >
> > This patch optimizes this process by batching the statistics counting
> > operations.
> >
> > The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> > obtained through trace-cmd, are as follows. In this case, the 8G virtual
> > address space has been mapped to physical memory using hugetlbfs with
> > pagesize=2M.
> >
> > Before this patch:
> > funcgraph_entry: # 33813.703 us | vfio_pin_map_dma();
> >
> > After this patch:
> > funcgraph_entry: # 15635.055 us | vfio_pin_map_dma();
> >
> > Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> > ---
> > drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++++++++++++++++
> > 1 file changed, 49 insertions(+)
>
> Hi,
>
> Thanks for looking at improvements in this area...

Why not just use iommufd? Doesn't it already do all these optimizations?

Indeed today you can use iommufd with a memfd handle which should return
the huge folios directly from the hugetlbfs and we never iterate with 4K
pages.

Jason
On Fri, 16 May 2025 11:18:16 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Thu, May 15, 2025 at 03:19:46PM -0600, Alex Williamson wrote:
> > On Tue, 13 May 2025 11:57:30 +0800
> > lizhe.67@bytedance.com wrote:
> >
> > > From: Li Zhe <lizhe.67@bytedance.com>
> > >
> > > When vfio_pin_pages_remote() is called with a range of addresses that
> > > includes hugetlbfs folios, the function currently performs individual
> > > statistics counting operations for each page. This can lead to significant
> > > performance overheads, especially when dealing with large ranges of pages.
> > >
> > > This patch optimizes this process by batching the statistics counting
> > > operations.
> > >
> > > The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> > > obtained through trace-cmd, are as follows. In this case, the 8G virtual
> > > address space has been mapped to physical memory using hugetlbfs with
> > > pagesize=2M.
> > >
> > > Before this patch:
> > > funcgraph_entry: # 33813.703 us | vfio_pin_map_dma();
> > >
> > > After this patch:
> > > funcgraph_entry: # 15635.055 us | vfio_pin_map_dma();
> > >
> > > Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> > > ---
> > > drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++++++++++++++++
> > > 1 file changed, 49 insertions(+)
> >
> > Hi,
> >
> > Thanks for looking at improvements in this area...
>
> Why not just use iommufd? Doesn't it already do all these
> optimizations?

We don't have feature parity yet (P2P DMA), we don't have libvirt
support, and many users are on kernels or product stacks where iommufd
isn't available yet.

> Indeed today you can use iommufd with a memfd handle which should
> return the huge folios directly from the hugetlbfs and we never
> iterate with 4K pages.

Good to know we won't need to revisit this, and maybe "good enough"
here, without tackling the batch size or gup page pointers, is enough to
hold users over until iommufd. Thanks,

Alex
On Thu, 15 May 2025 15:19:46 -0600, alex.williamson@redhat.com wrote:
>> +/*
>> + * Find a random vfio_pfn that belongs to the range
>> + * [iova, iova + PAGE_SIZE * npage)
>> + */
>> +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
>> + dma_addr_t iova, unsigned long npage)
>> +{
>> + struct vfio_pfn *vpfn;
>> + struct rb_node *node = dma->pfn_list.rb_node;
>> + dma_addr_t end_iova = iova + PAGE_SIZE * npage;
>> +
>> + while (node) {
>> + vpfn = rb_entry(node, struct vfio_pfn, node);
>> +
>> + if (end_iova <= vpfn->iova)
>> + node = node->rb_left;
>> + else if (iova > vpfn->iova)
>> + node = node->rb_right;
>> + else
>> + return vpfn;
>> + }
>> + return NULL;
>> +}
>
>This essentially duplicates vfio_find_vpfn(), where the existing
>function only finds a single page. The existing function should be
>extended for this new use case and callers updated. Also the vfio_pfn
How about implementing it in this way?

static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
{
	return vfio_find_vpfn_range(dma, iova, 1);
}

With this implementation, existing callers of vfio_find_vpfn() won't need
to be modified.
>is not "random", it's the first vfio_pfn overlapping the range.
Yes, you are right. I will modify the comment to "Find the first vfio_pfn
overlapping the range [iova, iova + PAGE_SIZE * npage) in rb tree" in v2.
>> +
>> static void vfio_link_pfn(struct vfio_dma *dma,
>> struct vfio_pfn *new)
>> {
>> @@ -670,6 +694,31 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>> iova += (PAGE_SIZE * ret);
>> continue;
>> }
>> +
>
>Spurious new blank line.
>
>> + }
>
>A new blank line here would be appreciated.
Thank you. I will address this in v2.
>> + /* Handle hugetlbfs page */
>> + if (likely(!disable_hugepages) &&
>
>Isn't this already accounted for with npage = 1?
Yes. This check is not necessary. I will remove it in v2.
>> + folio_test_hugetlb(page_folio(batch->pages[batch->offset]))) {
>
>I don't follow how this guarantees the entire batch->size is
>contiguous. Isn't it possible that a batch could contain multiple
>hugetlb folios? Is the assumption here only true if folio_nr_pages()
>(or specifically the pages remaining) is >= batch->capacity? What
Yes.
>happens if we try to map the last half of one 2MB hugetlb page and
>first half of the non-physically-contiguous next page? Or what if the
>hugetlb size is 64KB and the batch contains multiple folios that are
>not physically contiguous?
Sorry for my problematic implementation. I will fix it in v2.
>> + if (pfn != *pfn_base + pinned)
>> + goto out;
>> +
>> + if (!rsvd && !vfio_find_vpfn_range(dma, iova, batch->size)) {
>> + if (!dma->lock_cap &&
>> + mm->locked_vm + lock_acct + batch->size > limit) {
>> + pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
>> + __func__, limit << PAGE_SHIFT);
>> + ret = -ENOMEM;
>> + goto unpin_out;
>> + }
>> + pinned += batch->size;
>> + npage -= batch->size;
>> + vaddr += PAGE_SIZE * batch->size;
>> + iova += PAGE_SIZE * batch->size;
>> + lock_acct += batch->size;
>> + batch->offset += batch->size;
>> + batch->size = 0;
>> + continue;
>> + }
>
>There's a lot of duplication with the existing page-iterative loop. I
>think they could be consolidated if we extract the number of known
>contiguous pages based on the folio into a variable, default 1.
Good idea! I will try to implement this optimization based on this idea
in v2.
>Also, while this approach is an improvement, it leaves a lot on the
>table in scenarios where folio_nr_pages() exceeds batch->capacity. For
>example, we're at best incrementing 1GB hugetlb pages in 2MB increments.
>We're also wasting a lot of cycles filling page pointers we mostly don't
>use. Thanks,
Yes, this is the ideal case. However, we cannot know in advance whether the
pages to be pinned are backed by hugetlbfs folios, and increasing
batch->capacity would lead to a significant increase in memory usage.
Additionally, pin_user_pages_remote() currently lacks a variant that can
avoid the overhead of filling in page pointers for a hugetlbfs folio (if I
understand correctly). I suggest we proceed with additional optimizations
after further infrastructure improvements.
On Fri, 16 May 2025 16:16:16 +0800
lizhe.67@bytedance.com wrote:
> On Thu, 15 May 2025 15:19:46 -0600, alex.williamson@redhat.com wrote:
>
> >> +/*
> >> + * Find a random vfio_pfn that belongs to the range
> >> + * [iova, iova + PAGE_SIZE * npage)
> >> + */
> >> +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
> >> + dma_addr_t iova, unsigned long npage)
> >> +{
> >> + struct vfio_pfn *vpfn;
> >> + struct rb_node *node = dma->pfn_list.rb_node;
> >> + dma_addr_t end_iova = iova + PAGE_SIZE * npage;
> >> +
> >> + while (node) {
> >> + vpfn = rb_entry(node, struct vfio_pfn, node);
> >> +
> >> + if (end_iova <= vpfn->iova)
> >> + node = node->rb_left;
> >> + else if (iova > vpfn->iova)
> >> + node = node->rb_right;
> >> + else
> >> + return vpfn;
> >> + }
> >> + return NULL;
> >> +}
> >
> >This essentially duplicates vfio_find_vpfn(), where the existing
> >function only finds a single page. The existing function should be
> >extended for this new use case and callers updated. Also the vfio_pfn
>
> How about implementing it in this way?
>
> static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> {
> return vfio_find_vpfn_range(dma, iova, 1);
> }
>
> With this implementation, existing callers of vfio_find_vpfn() won't need
> to be modified.
Yes, that would minimize churn elsewhere.
> >is not "random", it's the first vfio_pfn overlapping the range.
>
> Yes you are right. I will modify the comments to "Find the first vfio_pfn
> overlapping the range [iova, iova + PAGE_SIZE * npage) in rb tree" in v2.
>
> >> +
> >> static void vfio_link_pfn(struct vfio_dma *dma,
> >> struct vfio_pfn *new)
> >> {
> >> @@ -670,6 +694,31 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> >> iova += (PAGE_SIZE * ret);
> >> continue;
> >> }
> >> +
> >
> >Spurious new blank line.
> >
> >> + }
> >
> >A new blank line here would be appreciated.
>
> Thank you. I will address this in v2.
>
> >> + /* Handle hugetlbfs page */
> >> + if (likely(!disable_hugepages) &&
> >
> >Isn't this already accounted for with npage = 1?
>
> Yes. This check is not necessary. I will remove it in v2.
>
> >> + folio_test_hugetlb(page_folio(batch->pages[batch->offset]))) {
> >
> >I don't follow how this guarantees the entire batch->size is
> >contiguous. Isn't it possible that a batch could contain multiple
> >hugetlb folios? Is the assumption here only true if folio_nr_pages()
> >(or specifically the pages remaining) is >= batch->capacity? What
>
> Yes.
>
> >happens if we try to map the last half of one 2MB hugetlb page and
> >first half of the non-physically-contiguous next page? Or what if the
> >hugetlb size is 64KB and the batch contains multiple folios that are
> >not physically contiguous?
>
> Sorry for my problematic implementation. I will fix it in v2.
>
> >> + if (pfn != *pfn_base + pinned)
> >> + goto out;
> >> +
> >> + if (!rsvd && !vfio_find_vpfn_range(dma, iova, batch->size)) {
> >> + if (!dma->lock_cap &&
> >> + mm->locked_vm + lock_acct + batch->size > limit) {
> >> + pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> >> + __func__, limit << PAGE_SHIFT);
> >> + ret = -ENOMEM;
> >> + goto unpin_out;
> >> + }
> >> + pinned += batch->size;
> >> + npage -= batch->size;
> >> + vaddr += PAGE_SIZE * batch->size;
> >> + iova += PAGE_SIZE * batch->size;
> >> + lock_acct += batch->size;
> >> + batch->offset += batch->size;
> >> + batch->size = 0;
> >> + continue;
> >> + }
> >
> >There's a lot of duplication with the existing page-iterative loop. I
> >think they could be consolidated if we extract the number of known
> >contiguous pages based on the folio into a variable, default 1.
>
> Good idea! I will try to implement this optimization based on this idea
> in v2.
>
> >Also, while this approach is an improvement, it leaves a lot on the
> >table in scenarios where folio_nr_pages() exceeds batch->capacity. For
> >example, we're at best incrementing 1GB hugetlb pages in 2MB increments.
> >We're also wasting a lot of cycles filling page pointers we mostly don't
> >use. Thanks,
>
> Yes, this is the ideal case. However, we cannot know in advance whether
> the pages to be pinned are backed by hugetlbfs folios, and increasing
> batch->capacity would lead to a significant increase in memory usage.
> Additionally, pin_user_pages_remote() currently lacks a variant that can
> avoid the overhead of filling in page pointers for a hugetlbfs folio (if
> I understand correctly). I suggest we proceed with additional
> optimizations after further infrastructure improvements.
That's fair, it's a nice improvement within the existing interfaces
even if we choose to pursue something more aggressive in the future.
Please consider documenting improvements relative to 1GB hugetlb in the
next version also. Thanks,
Alex