From: Li Zhe <lizhe.67@bytedance.com>

This patchset is a consolidation of the two previous patchsets[1][2].

When vfio_pin_pages_remote() is called with a range of addresses that
includes large folios, the function currently performs individual
statistics counting operations for each page. This can lead to significant
performance overhead, especially when dealing with large ranges of pages.

The function vfio_unpin_pages_remote() has a similar issue, where executing
put_pfn() for each pfn incurs considerable overhead.

This patchset optimizes the performance of the relevant functions by
batching the less efficient operations mentioned above.

The first patch optimizes the performance of the function
vfio_pin_pages_remote(), while the remaining patches optimize the
performance of the function vfio_unpin_pages_remote().

The performance test results, based on v6.16-rc4, for completing the 16G
VFIO MAP/UNMAP DMA, obtained through the unit test[3] with slight
modifications[4], are as follows.

Base (v6.16-rc4):
./vfio-pci-mem-dma-map 0000:03:00.0 16
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.047 s (340.2 GB/s)
VFIO UNMAP DMA in 0.135 s (118.6 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.280 s (57.2 GB/s)
VFIO UNMAP DMA in 0.312 s (51.3 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.052 s (310.5 GB/s)
VFIO UNMAP DMA in 0.136 s (117.3 GB/s)

With this patchset:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.027 s (596.4 GB/s)
VFIO UNMAP DMA in 0.045 s (357.6 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.288 s (55.5 GB/s)
VFIO UNMAP DMA in 0.288 s (55.6 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.031 s (508.3 GB/s)
VFIO UNMAP DMA in 0.045 s (352.9 GB/s)

For large folios, we achieve an over 40% performance improvement for VFIO
MAP DMA and an over 66% performance improvement for VFIO UNMAP DMA. For
small folios, the performance test results show little difference compared
with the performance before optimization.

[1]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/
[2]: https://lore.kernel.org/all/20250620032344.13382-1-lizhe.67@bytedance.com/
[3]: https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c
[4]: https://lore.kernel.org/all/20250610031013.98556-1-lizhe.67@bytedance.com/

Li Zhe (4):
  vfio/type1: optimize vfio_pin_pages_remote() for large folios
  vfio/type1: batch vfio_find_vpfn() in function
    vfio_unpin_pages_remote()
  vfio/type1: introduce a new member has_rsvd for struct vfio_dma
  vfio/type1: optimize vfio_unpin_pages_remote() for large folio

 drivers/vfio/vfio_iommu_type1.c | 121 ++++++++++++++++++++++++++------
 1 file changed, 100 insertions(+), 21 deletions(-)

-- 
2.20.1
On 30.06.25 09:25, lizhe.67@bytedance.com wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
>
> This patchset is an consolidation of the two previous patchsets[1][2].
>
> When vfio_pin_pages_remote() is called with a range of addresses that
> includes large folios, the function currently performs individual
> statistics counting operations for each page. This can lead to significant
> performance overheads, especially when dealing with large ranges of pages.
>
> The function vfio_unpin_pages_remote() has a similar issue, where executing
> put_pfn() for each pfn brings considerable consumption.
>
> This patchset optimizes the performance of the relevant functions by
> batching the less efficient operations mentioned before.
>
> The first patch optimizes the performance of the function
> vfio_pin_pages_remote(), while the remaining patches optimize the
> performance of the function vfio_unpin_pages_remote().
>
> The performance test results, based on v6.16-rc4, for completing the 16G
> VFIO MAP/UNMAP DMA, obtained through unit test[3] with slight
> modifications[4], are as follows.
>
> Base(6.16-rc4):
> ./vfio-pci-mem-dma-map 0000:03:00.0 16
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.047 s (340.2 GB/s)
> VFIO UNMAP DMA in 0.135 s (118.6 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.280 s (57.2 GB/s)
> VFIO UNMAP DMA in 0.312 s (51.3 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.052 s (310.5 GB/s)
> VFIO UNMAP DMA in 0.136 s (117.3 GB/s)
>
> With this patchset:
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.027 s (596.4 GB/s)
> VFIO UNMAP DMA in 0.045 s (357.6 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.288 s (55.5 GB/s)
> VFIO UNMAP DMA in 0.288 s (55.6 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.031 s (508.3 GB/s)
> VFIO UNMAP DMA in 0.045 s (352.9 GB/s)
>
> For large folio, we achieve an over 40% performance improvement for VFIO
> MAP DMA and an over 66% performance improvement for VFIO DMA UNMAP. For
> small folios, the performance test results show little difference compared
> with the performance before optimization.

Jason mentioned in reply to the other series that, ideally, vfio
shouldn't be messing with folios at all.

While we now do that on the unpin side, we still do it at the pin side.

Which makes me wonder if we can avoid folios in patch #1 in
contig_pages(), and simply collect pages that correspond to consecutive
PFNs.

What was the reason again, that contig_pages() would not exceed a single
folio?

-- 
Cheers,

David / dhildenb
On Wed, 2 Jul 2025 10:15:29 +0200, david@redhat.com wrote:

> Jason mentioned in reply to the other series that, ideally, vfio
> shouldn't be messing with folios at all.
>
> While we now do that on the unpin side, we still do it at the pin side.

Yes.

> Which makes me wonder if we can avoid folios in patch #1 in
> contig_pages(), and simply collect pages that correspond to consecutive
> PFNs.

In my opinion, comparing whether the pfns of two pages are contiguous
is relatively inefficient. Using folios might be a more efficient
solution.

Given that 'page' is already in use within vfio, it seems that adopting
'folio' wouldn't be particularly troublesome? If you have any better
suggestions, I sincerely hope you would share them with me.

> What was the reason again, that contig_pages() would not exceed a single
> folio?

Regarding this issue, I think Alex and I are on the same page[1]. For a
folio, all of its pages have the same invalid or reserved state. In
the function vfio_pin_pages_remote(), we need to ensure that the state
is the same as the previous pfn (through variable 'rsvd' and function
is_invalid_reserved_pfn()). Therefore, we do not want the return value
of contig_pages() to exceed a single folio.

Thanks,
Zhe

[1]: https://lore.kernel.org/all/20250613081613.0bef3d39.alex.williamson@redhat.com/
On 02.07.25 11:38, lizhe.67@bytedance.com wrote:
> On Wed, 2 Jul 2025 10:15:29 +0200, david@redhat.com wrote:
>
>> Jason mentioned in reply to the other series that, ideally, vfio
>> shouldn't be messing with folios at all.
>>
>> While we now do that on the unpin side, we still do it at the pin side.
>
> Yes.
>
>> Which makes me wonder if we can avoid folios in patch #1 in
>> contig_pages(), and simply collect pages that correspond to consecutive
>> PFNs.
>
> In my opinion, comparing whether the pfns of two pages are contiguous
> is relatively inefficient. Using folios might be a more efficient
> solution.

buffer[i + 1] == nth_page(buffer[i], 1)

Is extremely efficient, except on

#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)

Because it's essentially

buffer[i + 1] == buffer[i] + 1

But with that config it's less efficient:

buffer[i + 1] == pfn_to_page(page_to_pfn(buffer[i]) + 1)

That could be optimized (if we care about the config), assuming that we
don't cross memory sections (e.g., 128 MiB on x86).

See page_ext_iter_next_fast_possible(), which optimized for something
similar.

So based on the first page, one could easily determine how far to batch
using the simple

buffer[i + 1] == buffer[i] + 1

comparison. That would mean that one could exceed a folio, in theory.

> Given that 'page' is already in use within vfio, it seems that adopting
> 'folio' wouldn't be particularly troublesome? If you have any better
> suggestions, I sincerely hope you would share them with me.

One challenge in the future will likely be that not all pages that we can
GUP will belong to folios. We would possibly be able to handle that by
checking if the page actually belongs to a folio.

Not dealing with folios where avoidable would be easier.

>> What was the reason again, that contig_pages() would not exceed a single
>> folio?
>
> Regarding this issue, I think Alex and I are on the same page[1]. For a
> folio, all of its pages have the same invalid or reserved state. In
> the function vfio_pin_pages_remote(), we need to ensure that the state
> is the same as the previous pfn (through variable 'rsvd' and function
> is_invalid_reserved_pfn()). Therefore, we do not want the return value
> of contig_pages() to exceed a single folio.

If we obtained a page from GUP, is_invalid_reserved_pfn() would only
trigger for the shared zeropage. But that one can no longer be returned
from FOLL_LONGTERM.

So if you know the pages came from GUP, I would assume they are never
invalid/reserved?

Again, just a thought on how to apply something similar as done for the
unpin case, avoiding messing with folios.

-- 
Cheers,

David / dhildenb
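[Editor's aside: the flat-memmap point above can be made concrete with a small
userspace model. This is not kernel code: mock_page and page_to_pfn_mock are
stand-ins for struct page and page_to_pfn(). It shows that when the memmap is a
flat array (the CONFIG_SPARSEMEM_VMEMMAP layout), pointer adjacency and PFN
adjacency are the same check, so no folio lookup is needed.]

```c
#include <assert.h>

/* Stand-in for struct page under a flat, vmemmap-style memmap:
 * the page for PFN n lives at &memmap[n], so PFN math is pointer math. */
struct mock_page { unsigned long flags; };

static struct mock_page memmap[1024];

static unsigned long page_to_pfn_mock(const struct mock_page *p)
{
	return (unsigned long)(p - memmap);
}

/* With a flat memmap, "consecutive PFNs" reduces to "consecutive
 * page pointers" -- the cheap comparison David describes. */
static int pages_are_adjacent(const struct mock_page *a,
			      const struct mock_page *b)
{
	return b == a + 1; /* same as page_to_pfn(b) == page_to_pfn(a) + 1 */
}
```

With SPARSEMEM but no vmemmap, the memmap is not one flat array, which is why
the pfn_to_page()/page_to_pfn() round trip becomes the expensive path.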
On Wed, 2 Jul 2025 11:57:08 +0200, david@redhat.com wrote:

> On 02.07.25 11:38, lizhe.67@bytedance.com wrote:
>> On Wed, 2 Jul 2025 10:15:29 +0200, david@redhat.com wrote:
>>
>>> Jason mentioned in reply to the other series that, ideally, vfio
>>> shouldn't be messing with folios at all.
>>>
>>> While we now do that on the unpin side, we still do it at the pin side.
>>
>> Yes.
>>
>>> Which makes me wonder if we can avoid folios in patch #1 in
>>> contig_pages(), and simply collect pages that correspond to consecutive
>>> PFNs.
>>
>> In my opinion, comparing whether the pfns of two pages are contiguous
>> is relatively inefficient. Using folios might be a more efficient
>> solution.
>
> buffer[i + 1] == nth_page(buffer[i], 1)
>
> Is extremely efficient, except on
>
> #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>
> Because it's essentially
>
> buffer[i + 1] == buffer[i] + 1
>
> But with that config it's less efficient
>
> buffer[i + 1] == pfn_to_page(page_to_pfn(buffer[i]) + 1)
>
> That could be optimized (if we care about the config), assuming that we
> don't cross memory sections (e.g., 128 MiB on x86).
>
> See page_ext_iter_next_fast_possible(), that optimized for something
> similar.
>
> So based on the first page, one could easily determine how far to batch
> using the simple
>
> buffer[i + 1] == buffer[i] + 1
>
> comparison.
>
> That would mean that one could exceed a folio, in theory.

Thank you very much for your suggestion. I think we can focus on
optimizing the case where

!(defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP))

I believe that in most scenarios where vfio is used,
CONFIG_SPARSEMEM_VMEMMAP is enabled. Excessive CONFIG
may make the patch appear overly complicated.

>> Given that 'page' is already in use within vfio, it seems that adopting
>> 'folio' wouldn't be particularly troublesome? If you have any better
>> suggestions, I sincerely hope you would share them with me.
>
> One challenge in the future will likely be that not all pages that we can
> GUP will belong to folios. We would possibly be able to handle that by
> checking if the page actually belongs to a folio.
>
> Not dealing with folios where avoidable would be easier.
>
>>> What was the reason again, that contig_pages() would not exceed a single
>>> folio?
>>
>> Regarding this issue, I think Alex and I are on the same page[1]. For a
>> folio, all of its pages have the same invalid or reserved state. In
>> the function vfio_pin_pages_remote(), we need to ensure that the state
>> is the same as the previous pfn (through variable 'rsvd' and function
>> is_invalid_reserved_pfn()). Therefore, we do not want the return value
>> of contig_pages() to exceed a single folio.
>
> If we obtained a page from GUP, is_invalid_reserved_pfn() would only trigger
> for the shared zeropage. but that one can no longer be returned from
> FOLL_LONGTERM.
>
> So if you know the pages came from GUP, I would assume they are never
> invalid_reserved?

Yes, we use function vaddr_get_pfns(), which ultimately invokes GUP
with the FOLL_LONGTERM flag.

> Again, just a thought on how to apply something similar as done for the
> unpin case, avoiding messing with folios.

Taking into account the previous discussion, it seems that we might
simply replace the contig_pages() in patch #1 with the following one.
Also, function contig_pages() could be extracted into mm.h as a
helper function. It seems that Jason would like to utilize it in other
contexts. Moreover, the subject of this patchset should be changed to
"Optimize vfio_pin_pages_remote() and vfio_unpin_pages_remote()". Do
you think this would work?

+static inline unsigned long contig_pages(struct page **pages,
+					 unsigned long size)
+{
+	struct page *first_page = pages[0];
+	unsigned long i;
+
+	for (i = 1; i < size; i++)
+		if (pages[i] != nth_page(first_page, i))
+			break;
+	return i;
+}

I have conducted a preliminary performance test, and the results are
similar to those obtained previously.

Thanks,
Zhe
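[Editor's aside: a userspace model of the helper proposed above. mock_page
stands in for struct page, and since with CONFIG_SPARSEMEM_VMEMMAP
nth_page(first_page, i) reduces to first_page + i, the model uses plain
pointer arithmetic; it is a sketch of the semantics, not the kernel code.]

```c
#include <stddef.h>

struct mock_page { unsigned long flags; };

/* Counts how many leading entries of pages[] form a single physically
 * contiguous run, mirroring the proposed contig_pages() helper. */
static unsigned long contig_pages_mock(struct mock_page **pages,
				       unsigned long nr_pages)
{
	struct mock_page *first_page = pages[0];
	unsigned long i;

	for (i = 1; i < nr_pages; i++)
		if (pages[i] != first_page + i) /* nth_page(first_page, i) */
			break;
	return i;
}

/* Demo: three adjacent mock pages followed by a non-adjacent one,
 * so the contiguous run length is 3. */
static unsigned long demo_run_length(void)
{
	static struct mock_page mm[8];
	struct mock_page *batch[4] = { &mm[0], &mm[1], &mm[2], &mm[5] };

	return contig_pages_mock(batch, 4);
}
```

The caller would then advance by the returned count and repeat, turning a
per-page loop into one operation per contiguous run.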
On 03.07.25 05:54, lizhe.67@bytedance.com wrote:
> On Wed, 2 Jul 2025 11:57:08 +0200, david@redhat.com wrote:
>
>> On 02.07.25 11:38, lizhe.67@bytedance.com wrote:
>>> On Wed, 2 Jul 2025 10:15:29 +0200, david@redhat.com wrote:
>>>
>>>> Jason mentioned in reply to the other series that, ideally, vfio
>>>> shouldn't be messing with folios at all.
>>>>
>>>> While we now do that on the unpin side, we still do it at the pin side.
>>>
>>> Yes.
>>>
>>>> Which makes me wonder if we can avoid folios in patch #1 in
>>>> contig_pages(), and simply collect pages that correspond to consecutive
>>>> PFNs.
>>>
>>> In my opinion, comparing whether the pfns of two pages are contiguous
>>> is relatively inefficient. Using folios might be a more efficient
>>> solution.
>>
>> buffer[i + 1] == nth_page(buffer[i], 1)
>>
>> Is extremely efficient, except on
>>
>> #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>>
>> Because it's essentially
>>
>> buffer[i + 1] == buffer[i] + 1
>>
>> But with that config it's less efficient
>>
>> buffer[i + 1] == pfn_to_page(page_to_pfn(buffer[i]) + 1)
>>
>> That could be optimized (if we care about the config), assuming that we
>> don't cross memory sections (e.g., 128 MiB on x86).
>>
>> See page_ext_iter_next_fast_possible(), that optimized for something
>> similar.
>>
>> So based on the first page, one could easily determine how far to batch
>> using the simple
>>
>> buffer[i + 1] == buffer[i] + 1
>>
>> comparison.
>>
>> That would mean that one could exceed a folio, in theory.
>
> Thank you very much for your suggestion. I think we can focus on
> optimizing the case where
>
> !(defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP))
>
> I believe that in most scenarios where vfio is used,
> CONFIG_SPARSEMEM_VMEMMAP is enabled. Excessive CONFIG
> may make the patch appear overly complicated.
>
>>> Given that 'page' is already in use within vfio, it seems that adopting
>>> 'folio' wouldn't be particularly troublesome? If you have any better
>>> suggestions, I sincerely hope you would share them with me.
>>
>> One challenge in the future will likely be that not all pages that we can
>> GUP will belong to folios. We would possibly be able to handle that by
>> checking if the page actually belongs to a folio.
>>
>> Not dealing with folios where avoidable would be easier.
>>
>>>> What was the reason again, that contig_pages() would not exceed a single
>>>> folio?
>>>
>>> Regarding this issue, I think Alex and I are on the same page[1]. For a
>>> folio, all of its pages have the same invalid or reserved state. In
>>> the function vfio_pin_pages_remote(), we need to ensure that the state
>>> is the same as the previous pfn (through variable 'rsvd' and function
>>> is_invalid_reserved_pfn()). Therefore, we do not want the return value
>>> of contig_pages() to exceed a single folio.
>>
>> If we obtained a page from GUP, is_invalid_reserved_pfn() would only trigger
>> for the shared zeropage. but that one can no longer be returned from
>> FOLL_LONGTERM.
>>
>> So if you know the pages came from GUP, I would assume they are never
>> invalid_reserved?
>
> Yes, we use function vaddr_get_pfns(), which ultimately invokes GUP
> with the FOLL_LONGTERM flag.
>
>> Again, just a thought on how to apply something similar as done for the
>> unpin case, avoiding messing with folios.
>
> Taking into account the previous discussion, it seems that we might
> simply replace the contig_pages() in patch #1 with the following one.
> Also, function contig_pages() could also be extracted into mm.h as a
> helper function. It seems that Jason would like to utilize it in other
> contexts. Moreover, the subject of this patchset should be changed to
> "Optimize vfio_pin_pages_remote() and vfio_unpin_pages_remote()". Do
> you think this would work?
>
> +static inline unsigned long contig_pages(struct page **pages,
> +					 unsigned long size)

size -> nr_pages

> +{
> +	struct page *first_page = pages[0];
> +	unsigned long i;
> +
> +	for (i = 1; i < size; i++)
> +		if (pages[i] != nth_page(first_page, i))
> +			break;
> +	return i;
> +}

LGTM.

I wonder if we can find a better function name, especially when moving
this to some header where it can be reused.

Something that expresses that we will return the next batch that starts
at the first page.

-- 
Cheers,

David / dhildenb
On Thu, 3 Jul 2025 13:06:26 +0200, david@redhat.com wrote:

>> +static inline unsigned long contig_pages(struct page **pages,
>> +					 unsigned long size)
>
> size -> nr_pages
>
>> +{
>> +	struct page *first_page = pages[0];
>> +	unsigned long i;
>> +
>> +	for (i = 1; i < size; i++)
>> +		if (pages[i] != nth_page(first_page, i))
>> +			break;
>> +	return i;
>> +}
>
> LGTM.
>
> I wonder if we can find a better function name, especially when moving
> this to some header where it can be reused.
>
> Something that expresses that we will return the next batch that starts
> at the first page.

Thank you. Given that this function may have more users in the future,
I will place it in include/linux/mm.h instead of the vfio file.

Once I've addressed the comments on the other patches with Jason, I
will resend a new patchset.

Thanks,
Zhe
On Thu, Jul 03, 2025 at 01:06:26PM +0200, David Hildenbrand wrote:
>> +{
>> +	struct page *first_page = pages[0];
>> +	unsigned long i;
>> +
>> +	for (i = 1; i < size; i++)
>> +		if (pages[i] != nth_page(first_page, i))
>> +			break;
>> +	return i;
>> +}
>
> LGTM.
>
> I wonder if we can find a better function name, especially when moving this
> to some header where it can be reused.

It should be a common function:

unsigned long num_pages_contiguous(struct page *list, size_t nelms);

Jason
On Thu, 3 Jul 2025 08:12:16 -0300, jgg@ziepe.ca wrote:

> On Thu, Jul 03, 2025 at 01:06:26PM +0200, David Hildenbrand wrote:
>>> +{
>>> +	struct page *first_page = pages[0];
>>> +	unsigned long i;
>>> +
>>> +	for (i = 1; i < size; i++)
>>> +		if (pages[i] != nth_page(first_page, i))
>>> +			break;
>>> +	return i;
>>> +}
>>
>> LGTM.
>>
>> I wonder if we can find a better function name, especially when moving this
>> to some header where it can be reused.
>
> It should be a common function:
>
> unsigned long num_pages_contiguous(struct page *list, size_t nelms);

I fully agree with you.

Thanks,
Zhe
On Wed, Jul 02, 2025 at 11:57:08AM +0200, David Hildenbrand wrote:
> On 02.07.25 11:38, lizhe.67@bytedance.com wrote:
>> On Wed, 2 Jul 2025 10:15:29 +0200, david@redhat.com wrote:
>>
>>> Jason mentioned in reply to the other series that, ideally, vfio
>>> shouldn't be messing with folios at all.
>>>
>>> While we now do that on the unpin side, we still do it at the pin side.
>>
>> Yes.
>>
>>> Which makes me wonder if we can avoid folios in patch #1 in
>>> contig_pages(), and simply collect pages that correspond to consecutive
>>> PFNs.
>>
>> In my opinion, comparing whether the pfns of two pages are contiguous
>> is relatively inefficient. Using folios might be a more efficient
>> solution.
>
> buffer[i + 1] == nth_page(buffer[i], 1)
>
> Is extremely efficient, except on

sg_alloc_append_table_from_pages() is using the

	next_pfn = (sg_phys(sgt_append->prv) + prv_len) / PAGE_SIZE;
	last_pg = pfn_to_page(next_pfn - 1);

approach to evaluate contiguity.

iommufd is also using something very similar in batch_from_pages():

	if (!batch_add_pfn(batch, page_to_pfn(*pages)))

So we should not be trying to optimize this only in VFIO; I would drop
that from this series.

If it can be optimized, we should try to have some kind of generic
helper for building a physically contiguous range from a struct page
list.

> If we obtained a page from GUP, is_invalid_reserved_pfn() would only trigger
> for the shared zeropage. but that one can no longer be returned from
> FOLL_LONGTERM.

AFAIK the use of "reserved" here means it is non-gupable memory that
was acquired through follow_pfn.

When it is pulled back out of the iommu_domain as a phys_addr_t, the
is_invalid_reserved_pfn() is used to tell if the address came from GUP
or if it came from follow_pfn.

Jason
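[Editor's aside: a userspace sketch of the kind of generic range-building
helper discussed here. The names pages_to_ranges and phys_range are
illustrative, not an existing kernel API, and mock_page models struct page
under the flat-memmap assumption, where adjacency is plain pointer math.]

```c
#include <stddef.h>

struct mock_page { unsigned long flags; };

struct phys_range {		/* one physically contiguous run */
	size_t start_idx;	/* index of the run's first page in the input */
	size_t npages;		/* run length in pages */
};

/* Splits pages[] into physically contiguous runs and records them,
 * returning the number of runs written (at most max_ranges). */
static size_t pages_to_ranges(struct mock_page **pages, size_t nelms,
			      struct phys_range *ranges, size_t max_ranges)
{
	size_t nr = 0, i = 0;

	while (i < nelms && nr < max_ranges) {
		size_t start = i;

		for (i++; i < nelms; i++)
			if (pages[i] != pages[i - 1] + 1)
				break;
		ranges[nr].start_idx = start;
		ranges[nr].npages = i - start;
		nr++;
	}
	return nr;
}

/* Demo: a 2-page run followed by a 3-page run. */
static struct phys_range demo_r[5];

static size_t demo_nr_ranges(void)
{
	static struct mock_page mm[16];
	struct mock_page *list[5] = { &mm[0], &mm[1], &mm[8], &mm[9], &mm[10] };

	return pages_to_ranges(list, 5, demo_r, 5);
}
```

A consumer such as an sg-table or IOMMU-batch builder could then map each
(start, npages) run with a single operation instead of one per page.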
On Wed, 2 Jul 2025 09:47:56 -0300, jgg@nvidia.com wrote:

> On Wed, Jul 02, 2025 at 11:57:08AM +0200, David Hildenbrand wrote:
>> On 02.07.25 11:38, lizhe.67@bytedance.com wrote:
>>> On Wed, 2 Jul 2025 10:15:29 +0200, david@redhat.com wrote:
>>>
>>>> Jason mentioned in reply to the other series that, ideally, vfio
>>>> shouldn't be messing with folios at all.
>>>>
>>>> While we now do that on the unpin side, we still do it at the pin side.
>>>
>>> Yes.
>>>
>>>> Which makes me wonder if we can avoid folios in patch #1 in
>>>> contig_pages(), and simply collect pages that correspond to consecutive
>>>> PFNs.
>>>
>>> In my opinion, comparing whether the pfns of two pages are contiguous
>>> is relatively inefficient. Using folios might be a more efficient
>>> solution.
>>
>> buffer[i + 1] == nth_page(buffer[i], 1)
>>
>> Is extremely efficient, except on
>
> sg_alloc_append_table_from_pages() is using the
>
>	next_pfn = (sg_phys(sgt_append->prv) + prv_len) / PAGE_SIZE;
>	last_pg = pfn_to_page(next_pfn - 1);
>
> Approach to evaluate contiguity.
>
> iommufd is also using very similar in batch_from_pages():
>
>	if (!batch_add_pfn(batch, page_to_pfn(*pages)))

I'm not particularly familiar with this section of the code, so I can't
say for certain. Regarding the two locations mentioned earlier, if it's
possible to determine the contiguity of physical memory by passing in an
array of page pointers, then we could adopt the approach suggested by
David. I've made a preliminary implementation here[1]. Is that helper
function okay with you?

> So we should not be trying to optimize this only in VFIO, I would drop
> that from this series.
>
> If it can be optimized we should try to have some kind of generic
> helper for building a physical contiguous range from a struct page
> list.

Thanks,
Zhe

[1]: https://lore.kernel.org/all/20250703035425.36124-1-lizhe.67@bytedance.com/