From: Li Zhe <lizhe.67@bytedance.com>

This patchset is a consolidation of the two previous patchsets[1][2].

When vfio_pin_pages_remote() is called with a range of addresses that
includes large folios, the function currently performs individual
statistics counting operations for each page. This can lead to significant
performance overhead, especially when dealing with large ranges of pages.

The function vfio_unpin_pages_remote() has a similar issue, where executing
put_pfn() for each pfn incurs considerable overhead.

This patchset optimizes the performance of the relevant functions by
batching the less efficient operations mentioned above.

The first patch optimizes the performance of the function
vfio_pin_pages_remote(), while the remaining patches optimize the
performance of the function vfio_unpin_pages_remote().

The performance test results, based on v6.16-rc4, for completing the 16G
VFIO MAP/UNMAP DMA, obtained through the unit test[3] with slight
modifications[4], are as follows.

Base (v6.16-rc4):
./vfio-pci-mem-dma-map 0000:03:00.0 16
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.047 s (340.2 GB/s)
VFIO UNMAP DMA in 0.135 s (118.6 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.280 s (57.2 GB/s)
VFIO UNMAP DMA in 0.312 s (51.3 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.052 s (310.5 GB/s)
VFIO UNMAP DMA in 0.136 s (117.3 GB/s)

With this patchset:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.027 s (596.4 GB/s)
VFIO UNMAP DMA in 0.045 s (357.6 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.288 s (55.5 GB/s)
VFIO UNMAP DMA in 0.288 s (55.6 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.031 s (508.3 GB/s)
VFIO UNMAP DMA in 0.045 s (352.9 GB/s)

For large folios, we achieve an over 40% performance improvement for VFIO
MAP DMA and an over 66% performance improvement for VFIO UNMAP DMA. For
small folios, the performance test results show little difference compared
with the performance before optimization.

[1]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/
[2]: https://lore.kernel.org/all/20250620032344.13382-1-lizhe.67@bytedance.com/
[3]: https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c
[4]: https://lore.kernel.org/all/20250610031013.98556-1-lizhe.67@bytedance.com/

Li Zhe (4):
  vfio/type1: optimize vfio_pin_pages_remote() for large folios
  vfio/type1: batch vfio_find_vpfn() in function
    vfio_unpin_pages_remote()
  vfio/type1: introduce a new member has_rsvd for struct vfio_dma
  vfio/type1: optimize vfio_unpin_pages_remote() for large folio

 drivers/vfio/vfio_iommu_type1.c | 121 ++++++++++++++++++++++++++------
 1 file changed, 100 insertions(+), 21 deletions(-)

-- 
2.20.1
On 30.06.25 09:25, lizhe.67@bytedance.com wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
>
> This patchset is an consolidation of the two previous patchsets[1][2].
>
> When vfio_pin_pages_remote() is called with a range of addresses that
> includes large folios, the function currently performs individual
> statistics counting operations for each page. This can lead to significant
> performance overheads, especially when dealing with large ranges of pages.
>
> The function vfio_unpin_pages_remote() has a similar issue, where executing
> put_pfn() for each pfn brings considerable consumption.
>
> This patchset optimizes the performance of the relevant functions by
> batching the less efficient operations mentioned before.
>
> The first patch optimizes the performance of the function
> vfio_pin_pages_remote(), while the remaining patches optimize the
> performance of the function vfio_unpin_pages_remote().
>
> The performance test results, based on v6.16-rc4, for completing the 16G
> VFIO MAP/UNMAP DMA, obtained through unit test[3] with slight
> modifications[4], are as follows.
>
> Base(6.16-rc4):
> ./vfio-pci-mem-dma-map 0000:03:00.0 16
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.047 s (340.2 GB/s)
> VFIO UNMAP DMA in 0.135 s (118.6 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.280 s (57.2 GB/s)
> VFIO UNMAP DMA in 0.312 s (51.3 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.052 s (310.5 GB/s)
> VFIO UNMAP DMA in 0.136 s (117.3 GB/s)
>
> With this patchset:
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.027 s (596.4 GB/s)
> VFIO UNMAP DMA in 0.045 s (357.6 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.288 s (55.5 GB/s)
> VFIO UNMAP DMA in 0.288 s (55.6 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.031 s (508.3 GB/s)
> VFIO UNMAP DMA in 0.045 s (352.9 GB/s)
>
> For large folio, we achieve an over 40% performance improvement for VFIO
> MAP DMA and an over 66% performance improvement for VFIO DMA UNMAP. For
> small folios, the performance test results show little difference compared
> with the performance before optimization.

Jason mentioned in reply to the other series that, ideally, vfio
shouldn't be messing with folios at all.

While we now do that on the unpin side, we still do it at the pin side.

Which makes me wonder if we can avoid folios in patch #1 in
contig_pages(), and simply collect pages that correspond to consecutive
PFNs.

What was the reason again, that contig_pages() would not exceed a single
folio?

-- 
Cheers,

David / dhildenb
On Wed, 2 Jul 2025 10:15:29 +0200, david@redhat.com wrote:

> Jason mentioned in reply to the other series that, ideally, vfio
> shouldn't be messing with folios at all.
>
> While we now do that on the unpin side, we still do it at the pin side.

Yes.

> Which makes me wonder if we can avoid folios in patch #1 in
> contig_pages(), and simply collect pages that correspond to consecutive
> PFNs.

In my opinion, comparing whether the pfns of two pages are contiguous
is relatively inefficient. Using folios might be a more efficient
solution.

Given that 'page' is already in use within vfio, it seems that adopting
'folio' wouldn't be particularly troublesome? If you have any better
suggestions, I sincerely hope you would share them with me.

> What was the reason again, that contig_pages() would not exceed a single
> folio?

Regarding this issue, I think Alex and I are on the same page[1]. For a
folio, all of its pages have the same invalid or reserved state. In
the function vfio_pin_pages_remote(), we need to ensure that the state
is the same as the previous pfn (through variable 'rsvd' and function
is_invalid_reserved_pfn()). Therefore, we do not want the return value
of contig_pages() to exceed a single folio.

Thanks,
Zhe

[1]: https://lore.kernel.org/all/20250613081613.0bef3d39.alex.williamson@redhat.com/
On 02.07.25 11:38, lizhe.67@bytedance.com wrote:
> On Wed, 2 Jul 2025 10:15:29 +0200, david@redhat.com wrote:
>
>> Jason mentioned in reply to the other series that, ideally, vfio
>> shouldn't be messing with folios at all.
>>
>> While we now do that on the unpin side, we still do it at the pin side.
>
> Yes.
>
>> Which makes me wonder if we can avoid folios in patch #1 in
>> contig_pages(), and simply collect pages that correspond to consecutive
>> PFNs.
>
> In my opinion, comparing whether the pfns of two pages are contiguous
> is relatively inefficient. Using folios might be a more efficient
> solution.

buffer[i + 1] == nth_page(buffer[i], 1)

Is extremely efficient, except on

#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)

Because it's essentially

buffer[i + 1] == buffer[i] + 1

But with that config it's less efficient:

buffer[i + 1] == pfn_to_page(page_to_pfn(buffer[i]) + 1)

That could be optimized (if we care about the config), assuming that we
don't cross memory sections (e.g., 128 MiB on x86).

See page_ext_iter_next_fast_possible(), which optimized for something
similar.

So based on the first page, one could easily determine how far to batch
using the simple

buffer[i + 1] == buffer[i] + 1

comparison. That would mean that one could exceed a folio, in theory.

> Given that 'page' is already in use within vfio, it seems that adopting
> 'folio' wouldn't be particularly troublesome? If you have any better
> suggestions, I sincerely hope you would share them with me.

One challenge in the future will likely be that not all pages that we can
GUP will belong to folios. We would possibly be able to handle that by
checking if the page actually belongs to a folio.

Not dealing with folios where avoidable would be easier.

>> What was the reason again, that contig_pages() would not exceed a single
>> folio?
>
> Regarding this issue, I think Alex and I are on the same page[1]. For a
> folio, all of its pages have the same invalid or reserved state. In
> the function vfio_pin_pages_remote(), we need to ensure that the state
> is the same as the previous pfn (through variable 'rsvd' and function
> is_invalid_reserved_pfn()). Therefore, we do not want the return value
> of contig_pages() to exceed a single folio.

If we obtained a page from GUP, is_invalid_reserved_pfn() would only
trigger for the shared zeropage. But that one can no longer be returned
from FOLL_LONGTERM.

So if you know the pages came from GUP, I would assume they are never
invalid/reserved?

Again, just a thought on how to apply something similar as done for the
unpin case, avoiding messing with folios.

-- 
Cheers,

David / dhildenb
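[Editor's aside: the flat-memmap point above can be made concrete with a small
userspace model. This is not kernel code: mock_page and page_to_pfn_mock are
stand-ins for struct page and page_to_pfn(). It shows that when the memmap is a
flat array (the CONFIG_SPARSEMEM_VMEMMAP layout), pointer adjacency and PFN
adjacency are the same check, so no folio lookup is needed.]

```c
#include <assert.h>

/* Stand-in for struct page under a flat, vmemmap-style memmap:
 * the page for PFN n lives at &memmap[n], so PFN math is pointer math. */
struct mock_page { unsigned long flags; };

static struct mock_page memmap[1024];

static unsigned long page_to_pfn_mock(const struct mock_page *p)
{
	return (unsigned long)(p - memmap);
}

/* With a flat memmap, "consecutive PFNs" reduces to "consecutive
 * page pointers" -- the cheap comparison David describes. */
static int pages_are_adjacent(const struct mock_page *a,
			      const struct mock_page *b)
{
	return b == a + 1; /* same as page_to_pfn(b) == page_to_pfn(a) + 1 */
}
```

With SPARSEMEM but no vmemmap, the memmap is not one flat array, which is why
the pfn_to_page()/page_to_pfn() round trip becomes the expensive path.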
On Wed, 2 Jul 2025 11:57:08 +0200, david@redhat.com wrote:

> On 02.07.25 11:38, lizhe.67@bytedance.com wrote:
>> On Wed, 2 Jul 2025 10:15:29 +0200, david@redhat.com wrote:
>>
>>> Jason mentioned in reply to the other series that, ideally, vfio
>>> shouldn't be messing with folios at all.
>>>
>>> While we now do that on the unpin side, we still do it at the pin side.
>>
>> Yes.
>>
>>> Which makes me wonder if we can avoid folios in patch #1 in
>>> contig_pages(), and simply collect pages that correspond to consecutive
>>> PFNs.
>>
>> In my opinion, comparing whether the pfns of two pages are contiguous
>> is relatively inefficient. Using folios might be a more efficient
>> solution.
>
> buffer[i + 1] == nth_page(buffer[i], 1)
>
> Is extremely efficient, except on
>
> #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>
> Because it's essentially
>
> buffer[i + 1] == buffer[i] + 1
>
> But with that config it's less efficient
>
> buffer[i + 1] == pfn_to_page(page_to_pfn(buffer[i]) + 1)
>
> That could be optimized (if we care about the config), assuming that we
> don't cross memory sections (e.g., 128 MiB on x86).
>
> See page_ext_iter_next_fast_possible(), that optimized for something
> similar.
>
> So based on the first page, one could easily determine how far to batch
> using the simple
>
> buffer[i + 1] == buffer[i] + 1
>
> comparison.
>
> That would mean that one could exceed a folio, in theory.

Thank you very much for your suggestion. I think we can focus on
optimizing the case where

!(defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP))

I believe that in most scenarios where vfio is used,
CONFIG_SPARSEMEM_VMEMMAP is enabled. Excessive CONFIG
may make the patch appear overly complicated.

>> Given that 'page' is already in use within vfio, it seems that adopting
>> 'folio' wouldn't be particularly troublesome? If you have any better
>> suggestions, I sincerely hope you would share them with me.
>
> One challenge in the future will likely be that not all pages that we can
> GUP will belong to folios. We would possibly be able to handle that by
> checking if the page actually belongs to a folio.
>
> Not dealing with folios where avoidable would be easier.
>
>>> What was the reason again, that contig_pages() would not exceed a single
>>> folio?
>>
>> Regarding this issue, I think Alex and I are on the same page[1]. For a
>> folio, all of its pages have the same invalid or reserved state. In
>> the function vfio_pin_pages_remote(), we need to ensure that the state
>> is the same as the previous pfn (through variable 'rsvd' and function
>> is_invalid_reserved_pfn()). Therefore, we do not want the return value
>> of contig_pages() to exceed a single folio.
>
> If we obtained a page from GUP, is_invalid_reserved_pfn() would only trigger
> for the shared zeropage. but that one can no longer be returned from
> FOLL_LONGTERM.
>
> So if you know the pages came from GUP, I would assume they are never
> invalid_reserved?

Yes, we use function vaddr_get_pfns(), which ultimately invokes GUP
with the FOLL_LONGTERM flag.

> Again, just a thought on how to apply something similar as done for the
> unpin case, avoiding messing with folios.

Taking into account the previous discussion, it seems that we might
simply replace the contig_pages() in patch #1 with the following one.
Also, function contig_pages() could be extracted into mm.h as a
helper function. It seems that Jason would like to utilize it in other
contexts. Moreover, the subject of this patchset should be changed to
"Optimize vfio_pin_pages_remote() and vfio_unpin_pages_remote()". Do
you think this would work?

+static inline unsigned long contig_pages(struct page **pages,
+					 unsigned long size)
+{
+	struct page *first_page = pages[0];
+	unsigned long i;
+
+	for (i = 1; i < size; i++)
+		if (pages[i] != nth_page(first_page, i))
+			break;
+	return i;
+}

I have conducted a preliminary performance test, and the results are
similar to those obtained previously.

Thanks,
Zhe
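[Editor's aside: a userspace model of the helper proposed above. mock_page
stands in for struct page, and since with CONFIG_SPARSEMEM_VMEMMAP
nth_page(first_page, i) reduces to first_page + i, the model uses plain
pointer arithmetic; it is a sketch of the semantics, not the kernel code.]

```c
#include <stddef.h>

struct mock_page { unsigned long flags; };

/* Counts how many leading entries of pages[] form a single physically
 * contiguous run, mirroring the proposed contig_pages() helper. */
static unsigned long contig_pages_mock(struct mock_page **pages,
				       unsigned long nr_pages)
{
	struct mock_page *first_page = pages[0];
	unsigned long i;

	for (i = 1; i < nr_pages; i++)
		if (pages[i] != first_page + i) /* nth_page(first_page, i) */
			break;
	return i;
}

/* Demo: three adjacent mock pages followed by a non-adjacent one,
 * so the contiguous run length is 3. */
static unsigned long demo_run_length(void)
{
	static struct mock_page mm[8];
	struct mock_page *batch[4] = { &mm[0], &mm[1], &mm[2], &mm[5] };

	return contig_pages_mock(batch, 4);
}
```

The caller would then advance by the returned count and repeat, turning a
per-page loop into one operation per contiguous run.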
On 03.07.25 05:54, lizhe.67@bytedance.com wrote:
> On Wed, 2 Jul 2025 11:57:08 +0200, david@redhat.com wrote:
>
>> On 02.07.25 11:38, lizhe.67@bytedance.com wrote:
>>> On Wed, 2 Jul 2025 10:15:29 +0200, david@redhat.com wrote:
>>>
>>>> Jason mentioned in reply to the other series that, ideally, vfio
>>>> shouldn't be messing with folios at all.
>>>>
>>>> While we now do that on the unpin side, we still do it at the pin side.
>>>
>>> Yes.
>>>
>>>> Which makes me wonder if we can avoid folios in patch #1 in
>>>> contig_pages(), and simply collect pages that correspond to consecutive
>>>> PFNs.
>>>
>>> In my opinion, comparing whether the pfns of two pages are contiguous
>>> is relatively inefficient. Using folios might be a more efficient
>>> solution.
>>
>> buffer[i + 1] == nth_page(buffer[i], 1)
>>
>> Is extremely efficient, except on
>>
>> #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>>
>> Because it's essentially
>>
>> buffer[i + 1] == buffer[i] + 1
>>
>> But with that config it's less efficient
>>
>> buffer[i + 1] == pfn_to_page(page_to_pfn(buffer[i]) + 1)
>>
>> That could be optimized (if we care about the config), assuming that we
>> don't cross memory sections (e.g., 128 MiB on x86).
>>
>> See page_ext_iter_next_fast_possible(), that optimized for something
>> similar.
>>
>> So based on the first page, one could easily determine how far to batch
>> using the simple
>>
>> buffer[i + 1] == buffer[i] + 1
>>
>> comparison.
>>
>> That would mean that one could exceed a folio, in theory.
>
> Thank you very much for your suggestion. I think we can focus on
> optimizing the case where
>
> !(defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP))
>
> I believe that in most scenarios where vfio is used,
> CONFIG_SPARSEMEM_VMEMMAP is enabled. Excessive CONFIG
> may make the patch appear overly complicated.
>
>>> Given that 'page' is already in use within vfio, it seems that adopting
>>> 'folio' wouldn't be particularly troublesome? If you have any better
>>> suggestions, I sincerely hope you would share them with me.
>>
>> One challenge in the future will likely be that not all pages that we can
>> GUP will belong to folios. We would possibly be able to handle that by
>> checking if the page actually belongs to a folio.
>>
>> Not dealing with folios where avoidable would be easier.
>>
>>>> What was the reason again, that contig_pages() would not exceed a single
>>>> folio?
>>>
>>> Regarding this issue, I think Alex and I are on the same page[1]. For a
>>> folio, all of its pages have the same invalid or reserved state. In
>>> the function vfio_pin_pages_remote(), we need to ensure that the state
>>> is the same as the previous pfn (through variable 'rsvd' and function
>>> is_invalid_reserved_pfn()). Therefore, we do not want the return value
>>> of contig_pages() to exceed a single folio.
>>
>> If we obtained a page from GUP, is_invalid_reserved_pfn() would only trigger
>> for the shared zeropage. but that one can no longer be returned from
>> FOLL_LONGTERM.
>>
>> So if you know the pages came from GUP, I would assume they are never
>> invalid_reserved?
>
> Yes, we use function vaddr_get_pfns(), which ultimately invokes GUP
> with the FOLL_LONGTERM flag.
>
>> Again, just a thought on how to apply something similar as done for the
>> unpin case, avoiding messing with folios.
>
> Taking into account the previous discussion, it seems that we might
> simply replace the contig_pages() in patch #1 with the following one.
> Also, function contig_pages() could also be extracted into mm.h as a
> helper function. It seems that Jason would like to utilize it in other
> contexts. Moreover, the subject of this patchset should be changed to
> "Optimize vfio_pin_pages_remote() and vfio_unpin_pages_remote()". Do
> you think this would work?
>
> +static inline unsigned long contig_pages(struct page **pages,
> +					 unsigned long size)

size -> nr_pages

> +{
> +	struct page *first_page = pages[0];
> +	unsigned long i;
> +
> +	for (i = 1; i < size; i++)
> +		if (pages[i] != nth_page(first_page, i))
> +			break;
> +	return i;
> +}

LGTM.

I wonder if we can find a better function name, especially when moving
this to some header where it can be reused.

Something that expresses that we will return the next batch that starts
at the first page.

-- 
Cheers,

David / dhildenb
On Thu, 3 Jul 2025 13:06:26 +0200, david@redhat.com wrote:

>> +static inline unsigned long contig_pages(struct page **pages,
>> +					 unsigned long size)
>
> size -> nr_pages
>
>> +{
>> +	struct page *first_page = pages[0];
>> +	unsigned long i;
>> +
>> +	for (i = 1; i < size; i++)
>> +		if (pages[i] != nth_page(first_page, i))
>> +			break;
>> +	return i;
>> +}
>
> LGTM.
>
> I wonder if we can find a better function name, especially when moving
> this to some header where it can be reused.
>
> Something that expresses that we will return the next batch that starts
> at the first page.

Thank you. Given that this function may have more users in the future,
I will place it in include/linux/mm.h instead of the vfio file.

Once I've addressed the comments on the other patches with Jason, I
will resend a new patchset.

Thanks,
Zhe
On Thu, Jul 03, 2025 at 01:06:26PM +0200, David Hildenbrand wrote:
>> +{
>> +	struct page *first_page = pages[0];
>> +	unsigned long i;
>> +
>> +	for (i = 1; i < size; i++)
>> +		if (pages[i] != nth_page(first_page, i))
>> +			break;
>> +	return i;
>> +}
>
> LGTM.
>
> I wonder if we can find a better function name, especially when moving this
> to some header where it can be reused.

It should be a common function:

unsigned long num_pages_contiguous(struct page *list, size_t nelms);

Jason
On Thu, 3 Jul 2025 08:12:16 -0300, jgg@ziepe.ca wrote:

> On Thu, Jul 03, 2025 at 01:06:26PM +0200, David Hildenbrand wrote:
>>> +{
>>> +	struct page *first_page = pages[0];
>>> +	unsigned long i;
>>> +
>>> +	for (i = 1; i < size; i++)
>>> +		if (pages[i] != nth_page(first_page, i))
>>> +			break;
>>> +	return i;
>>> +}
>>
>> LGTM.
>>
>> I wonder if we can find a better function name, especially when moving this
>> to some header where it can be reused.
>
> It should be a common function:
>
> unsigned long num_pages_contiguous(struct page *list, size_t nelms);

I fully agree with you.

Thanks,
Zhe
On Wed, Jul 02, 2025 at 11:57:08AM +0200, David Hildenbrand wrote:
> On 02.07.25 11:38, lizhe.67@bytedance.com wrote:
>> On Wed, 2 Jul 2025 10:15:29 +0200, david@redhat.com wrote:
>>
>>> Jason mentioned in reply to the other series that, ideally, vfio
>>> shouldn't be messing with folios at all.
>>>
>>> While we now do that on the unpin side, we still do it at the pin side.
>>
>> Yes.
>>
>>> Which makes me wonder if we can avoid folios in patch #1 in
>>> contig_pages(), and simply collect pages that correspond to consecutive
>>> PFNs.
>>
>> In my opinion, comparing whether the pfns of two pages are contiguous
>> is relatively inefficient. Using folios might be a more efficient
>> solution.
>
> buffer[i + 1] == nth_page(buffer[i], 1)
>
> Is extremely efficient, except on

sg_alloc_append_table_from_pages() is using the

	next_pfn = (sg_phys(sgt_append->prv) + prv_len) / PAGE_SIZE;
	last_pg = pfn_to_page(next_pfn - 1);

approach to evaluate contiguity.

iommufd is also using something very similar in batch_from_pages():

	if (!batch_add_pfn(batch, page_to_pfn(*pages)))

So we should not be trying to optimize this only in VFIO; I would drop
that from this series.

If it can be optimized, we should try to have some kind of generic
helper for building a physically contiguous range from a struct page
list.

> If we obtained a page from GUP, is_invalid_reserved_pfn() would only trigger
> for the shared zeropage. but that one can no longer be returned from
> FOLL_LONGTERM.

AFAIK the use of "reserved" here means it is non-gupable memory that
was acquired through follow_pfn.

When it is pulled back out of the iommu_domain as a phys_addr_t, the
is_invalid_reserved_pfn() is used to tell if the address came from GUP
or if it came from follow_pfn.

Jason
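[Editor's aside: a userspace sketch of the kind of generic range-building
helper discussed here. The names pages_to_ranges and phys_range are
illustrative, not an existing kernel API, and mock_page models struct page
under the flat-memmap assumption, where adjacency is plain pointer math.]

```c
#include <stddef.h>

struct mock_page { unsigned long flags; };

struct phys_range {		/* one physically contiguous run */
	size_t start_idx;	/* index of the run's first page in the input */
	size_t npages;		/* run length in pages */
};

/* Splits pages[] into physically contiguous runs and records them,
 * returning the number of runs written (at most max_ranges). */
static size_t pages_to_ranges(struct mock_page **pages, size_t nelms,
			      struct phys_range *ranges, size_t max_ranges)
{
	size_t nr = 0, i = 0;

	while (i < nelms && nr < max_ranges) {
		size_t start = i;

		for (i++; i < nelms; i++)
			if (pages[i] != pages[i - 1] + 1)
				break;
		ranges[nr].start_idx = start;
		ranges[nr].npages = i - start;
		nr++;
	}
	return nr;
}

/* Demo: a 2-page run followed by a 3-page run. */
static struct phys_range demo_r[5];

static size_t demo_nr_ranges(void)
{
	static struct mock_page mm[16];
	struct mock_page *list[5] = { &mm[0], &mm[1], &mm[8], &mm[9], &mm[10] };

	return pages_to_ranges(list, 5, demo_r, 5);
}
```

A consumer such as an sg-table or IOMMU-batch builder could then map each
(start, npages) run with a single operation instead of one per page.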
On Wed, 2 Jul 2025 09:47:56 -0300, jgg@nvidia.com wrote:

> On Wed, Jul 02, 2025 at 11:57:08AM +0200, David Hildenbrand wrote:
>> On 02.07.25 11:38, lizhe.67@bytedance.com wrote:
>>> On Wed, 2 Jul 2025 10:15:29 +0200, david@redhat.com wrote:
>>>
>>>> Jason mentioned in reply to the other series that, ideally, vfio
>>>> shouldn't be messing with folios at all.
>>>>
>>>> While we now do that on the unpin side, we still do it at the pin side.
>>>
>>> Yes.
>>>
>>>> Which makes me wonder if we can avoid folios in patch #1 in
>>>> contig_pages(), and simply collect pages that correspond to consecutive
>>>> PFNs.
>>>
>>> In my opinion, comparing whether the pfns of two pages are contiguous
>>> is relatively inefficient. Using folios might be a more efficient
>>> solution.
>>
>> buffer[i + 1] == nth_page(buffer[i], 1)
>>
>> Is extremely efficient, except on
>
> sg_alloc_append_table_from_pages() is using the
>
>	next_pfn = (sg_phys(sgt_append->prv) + prv_len) / PAGE_SIZE;
>	last_pg = pfn_to_page(next_pfn - 1);
>
> Approach to evaluate contiguity.
>
> iommufd is also using very similar in batch_from_pages():
>
>	if (!batch_add_pfn(batch, page_to_pfn(*pages)))

I'm not particularly familiar with this section of the code, so I can't
say for certain. Regarding the two locations mentioned earlier, if it's
possible to determine the contiguity of physical memory by passing in an
array of page pointers, then we could adopt the approach suggested by
David. I've made a preliminary implementation here[1]. Is that helper
function okay with you?

> So we should not be trying to optimize this only in VFIO, I would drop
> that from this series.
>
> If it can be optimized we should try to have some kind of generic
> helper for building a physical contiguous range from a struct page
> list.

Thanks,
Zhe

[1]: https://lore.kernel.org/all/20250703035425.36124-1-lizhe.67@bytedance.com/