From: Li Zhe <lizhe.67@bytedance.com>

When vfio_unpin_pages_remote() is called with a range of addresses that
includes large folios, the function currently performs individual
put_pfn() operations for each page. This can lead to significant
performance overhead, especially when dealing with large ranges of pages.

This patch optimizes this process by batching the put_pfn() operations.

The performance test results for completing the 16G VFIO IOMMU DMA
unmapping, based on v6.15 and obtained through the unit test[1] with
slight modifications[2], are as follows.

Base(v6.15):
./vfio-pci-mem-dma-map 0000:03:00.0 16
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.047 s (338.6 GB/s)
VFIO UNMAP DMA in 0.138 s (116.2 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.280 s (57.2 GB/s)
VFIO UNMAP DMA in 0.312 s (51.3 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.052 s (308.3 GB/s)
VFIO UNMAP DMA in 0.139 s (115.1 GB/s)
Map[3] + This patchset:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.028 s (563.9 GB/s)
VFIO UNMAP DMA in 0.049 s (325.1 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.294 s (54.4 GB/s)
VFIO UNMAP DMA in 0.296 s (54.1 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.033 s (485.1 GB/s)
VFIO UNMAP DMA in 0.049 s (324.4 GB/s)
For large folios, we achieve an approximately 64% reduction in VFIO
UNMAP DMA time (e.g. MADV_HUGEPAGE: 0.138 s -> 0.049 s). For small
folios, the performance test results show no significant change.
[1]: https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c
[2]: https://lore.kernel.org/all/20250610031013.98556-1-lizhe.67@bytedance.com/
[3]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
drivers/vfio/vfio_iommu_type1.c | 35 +++++++++++++++++++++++++++++----
1 file changed, 31 insertions(+), 4 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index e952bf8bdfab..159ba80082a8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -806,11 +806,38 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
 				    bool do_accounting)
 {
 	long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
-	long i;
 
-	for (i = 0; i < npage; i++)
-		if (put_pfn(pfn++, dma->prot))
-			unlocked++;
+	while (npage) {
+		long nr_pages = 1;
+
+		if (!is_invalid_reserved_pfn(pfn)) {
+			struct page *page = pfn_to_page(pfn);
+			struct folio *folio = page_folio(page);
+			long folio_pages_num = folio_nr_pages(folio);
+
+			/*
+			 * For a folio, it represents a physically
+			 * contiguous set of bytes, and all of its pages
+			 * share the same invalid/reserved state.
+			 *
+			 * Here, our PFNs are contiguous. Therefore, if we
+			 * detect that the current PFN belongs to a large
+			 * folio, we can batch the operations for the next
+			 * nr_pages PFNs.
+			 */
+			if (folio_pages_num > 1)
+				nr_pages = min_t(long, npage,
+					folio_pages_num -
+					folio_page_idx(folio, page));
+
+			unpin_user_folio_dirty_locked(folio, nr_pages,
+					dma->prot & IOMMU_WRITE);
+			unlocked += nr_pages;
+		}
+
+		pfn += nr_pages;
+		npage -= nr_pages;
+	}
 
 	if (do_accounting)
 		vfio_lock_acct(dma, locked - unlocked, true);
--
2.20.1

On 17.06.25 06:18, lizhe.67@bytedance.com wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
[...]
> +			/*
> +			 * For a folio, it represents a physically
> +			 * contiguous set of bytes, and all of its pages
> +			 * share the same invalid/reserved state.
> +			 *
> +			 * Here, our PFNs are contiguous. Therefore, if we
> +			 * detect that the current PFN belongs to a large
> +			 * folio, we can batch the operations for the next
> +			 * nr_pages PFNs.
> +			 */
> +			if (folio_pages_num > 1)
> +				nr_pages = min_t(long, npage,
> +					folio_pages_num -
> +					folio_page_idx(folio, page));
> +

(I know I can be a pain :) )

But the long comment indicates that this is confusing.

That is essentially the logic in gup_folio_range_next().

What about factoring that out into a helper like

/*
 * TODO, returned number includes the provided current page.
 */
unsigned long folio_remaining_pages(struct folio *folio,
		struct page *page, unsigned long max_pages)
{
	if (!folio_test_large(folio))
		return 1;
	return min_t(unsigned long, max_pages,
		     folio_nr_pages(folio) - folio_page_idx(folio, page));
}

Then here you would do

	if (!is_invalid_reserved_pfn(pfn)) {
		struct page *page = pfn_to_page(pfn);
		struct folio *folio = page_folio(page);

		/* We can batch-process pages belonging to the same folio. */
		nr_pages = folio_remaining_pages(folio, page, npage);

		unpin_user_folio_dirty_locked(folio, nr_pages,
					      dma->prot & IOMMU_WRITE);
		unlocked += nr_pages;
	}

-- 
Cheers,

David / dhildenb

On Tue, 17 Jun 2025 09:43:56 +0200, david@redhat.com wrote:

> (I know I can be a pain :) )

No, not at all! I really appreciate you taking the time to review my
patch.

> But the long comment indicates that this is confusing.
>
> That is essentially the logic in gup_folio_range_next().
>
> What about factoring that out into a helper like
[...]
> Then here you would do
>
> 	if (!is_invalid_reserved_pfn(pfn)) {
> 		struct page *page = pfn_to_page(pfn);
> 		struct folio *folio = page_folio(page);
>
> 		/* We can batch-process pages belonging to the same folio. */
> 		nr_pages = folio_remaining_pages(folio, page, npage);
>
> 		unpin_user_folio_dirty_locked(folio, nr_pages,
> 					      dma->prot & IOMMU_WRITE);
> 		unlocked += nr_pages;
> 	}

Yes, this indeed makes the code much more comprehensible. Does the
implementation of the patch as follows look viable to you? I have added
some brief comments on top of your work to explain why we can
batch-process pages belonging to the same folio. This was suggested by
Alex[1].

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index e952bf8bdfab..d7653f4c10d5 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -801,16 +801,43 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 	return pinned;
 }
 
+/* Returned number includes the provided current page. */
+static inline unsigned long folio_remaining_pages(struct folio *folio,
+		struct page *page, unsigned long max_pages)
+{
+	if (!folio_test_large(folio))
+		return 1;
+	return min_t(unsigned long, max_pages,
+		     folio_nr_pages(folio) - folio_page_idx(folio, page));
+}
+
 static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
 				    unsigned long pfn, unsigned long npage,
 				    bool do_accounting)
 {
 	long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
-	long i;
 
-	for (i = 0; i < npage; i++)
-		if (put_pfn(pfn++, dma->prot))
-			unlocked++;
+	while (npage) {
+		unsigned long nr_pages = 1;
+
+		if (!is_invalid_reserved_pfn(pfn)) {
+			struct page *page = pfn_to_page(pfn);
+			struct folio *folio = page_folio(page);
+
+			/*
+			 * We can batch-process pages belonging to the same
+			 * folio because all pages within a folio share the
+			 * same invalid/reserved state.
+			 */
+			nr_pages = folio_remaining_pages(folio, page, npage);
+			unpin_user_folio_dirty_locked(folio, nr_pages,
+					dma->prot & IOMMU_WRITE);
+			unlocked += nr_pages;
+		}
+
+		pfn += nr_pages;
+		npage -= nr_pages;
+	}
 
 	if (do_accounting)
 		vfio_lock_acct(dma, locked - unlocked, true);
---

Thanks,
Zhe

[1]: https://lore.kernel.org/all/20250613113818.584bec0a.alex.williamson@redhat.com/

On 17.06.25 11:21, lizhe.67@bytedance.com wrote:
[...]
> +/* Returned number includes the provided current page. */
> +static inline unsigned long folio_remaining_pages(struct folio *folio,
> +		struct page *page, unsigned long max_pages)
> +{
> +	if (!folio_test_large(folio))
> +		return 1;
> +	return min_t(unsigned long, max_pages,
> +		     folio_nr_pages(folio) - folio_page_idx(folio, page));
> +}

Note that I think that should go somewhere into mm.h, and also get used
by GUP. So factoring it out from GUP and then using it here.

-- 
Cheers,

David / dhildenb

On Tue, 17 Jun 2025 09:43:56 +0200, david@redhat.com wrote:

> Note that I think that should go somewhere into mm.h, and also get used
> by GUP. So factoring it out from GUP and then using it here.

I think I need to separate this out into a distinct patch within the
patchset. Is that correct?

Thanks,
Zhe

On 17.06.25 11:47, lizhe.67@bytedance.com wrote:
> I think I need to separate this out into a distinct patch within the
> patchset. Is that correct?

Yes, that's what I would do.

-- 
Cheers,

David / dhildenb

On Tue, 17 Jun 2025 11:49:43 +0200, david@redhat.com wrote:

>> I think I need to separate this out into a distinct patch within the
>> patchset. Is that correct?
>
> Yes, that's what I would do.

What do you think of this implementation?

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 242b05671502..eb91f99ea973 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2165,6 +2165,23 @@ static inline long folio_nr_pages(const struct folio *folio)
 	return folio_large_nr_pages(folio);
 }
 
+/*
+ * folio_remaining_pages - Counts the number of pages from a given
+ * start page to the end of the folio.
+ *
+ * @folio: Pointer to folio
+ * @start_page: The starting page from which to begin counting.
+ *
+ * Returned number includes the provided start page.
+ *
+ * The caller must ensure that @start_page belongs to @folio.
+ */
+static inline unsigned long folio_remaining_pages(struct folio *folio,
+		struct page *start_page)
+{
+	return folio_nr_pages(folio) - folio_page_idx(folio, start_page);
+}
+
 /* Only hugetlbfs can allocate folios larger than MAX_ORDER */
 #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
 #define MAX_FOLIO_NR_PAGES	(1UL << PUD_ORDER)
diff --git a/mm/gup.c b/mm/gup.c
index 15debead5f5b..14ae2e3088b4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -242,7 +242,7 @@ static inline struct folio *gup_folio_range_next(struct page *start,
 
 	if (folio_test_large(folio))
 		nr = min_t(unsigned int, npages - i,
-			   folio_nr_pages(folio) - folio_page_idx(folio, next));
+			   folio_remaining_pages(folio, next));
 
 	*ntails = nr;
 	return folio;
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b2fc5266e3d2..34e85258060c 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -96,7 +96,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
 				return page;
 			}
 
-			skip_pages = folio_nr_pages(folio) - folio_page_idx(folio, page);
+			skip_pages = folio_remaining_pages(folio, page);
 			pfn += skip_pages - 1;
 			continue;
 		}
---

Thanks,
Zhe

On 17.06.25 14:42, lizhe.67@bytedance.com wrote:
> What do you think of this implementation?
[...]
> -			skip_pages = folio_nr_pages(folio) - folio_page_idx(folio, page);
> +			skip_pages = folio_remaining_pages(folio, page);
> 			pfn += skip_pages - 1;
> 			continue;
> 		}
> ---

Guess I would have pulled the "min" in there, but passing something like
ULONG_MAX for the page_isolation case also looks rather ugly.

-- 
Cheers,

David / dhildenb

On Tue, 17 Jun 2025 15:47:09 +0200, david@redhat.com wrote:

> Guess I would have pulled the "min" in there, but passing something like
> ULONG_MAX for the page_isolation case also looks rather ugly.

Yes, the page_isolation case does not require the 'min' logic. Since
there are already places in the current kernel code where
folio_remaining_pages() is used without needing min, we could simply
create a custom wrapper function based on folio_remaining_pages() only
in those specific scenarios where min is necessary.

Following this line of thinking, the wrapper function in vfio would
look something like this.

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -801,16 +801,40 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 	return pinned;
 }
 
+static inline unsigned long vfio_folio_remaining_pages(
+		struct folio *folio, struct page *start_page,
+		unsigned long max_pages)
+{
+	if (!folio_test_large(folio))
+		return 1;
+	return min(max_pages,
+		   folio_remaining_pages(folio, start_page));
+}
+
---

Does this approach seem acceptable to you?

Thanks,
Zhe

On 18.06.25 08:11, lizhe.67@bytedance.com wrote:
> Yes, the page_isolation case does not require the 'min' logic. Since
> there are already places in the current kernel code where
> folio_remaining_pages() is used without needing min, we could simply
> create a custom wrapper function based on folio_remaining_pages() only
> in those specific scenarios where min is necessary.

I would just do something like that, removing gup_folio_range_next()
completely.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 98a606908307b..6c224b1c6c169 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1995,6 +1995,32 @@ static inline long folio_nr_pages(const struct folio *folio)
 	return folio_large_nr_pages(folio);
 }
 
+/**
+ * folio_remaining_pages - The remaining number of pages in the folio.
+ * @folio: The folio.
+ * @page: The folio page to start the counting.
+ * @max_npages: Limit how far to count.
+ *
+ * The returned number will include the provided page.
+ *
+ * The caller must ensure that @page belongs to @folio. When setting
+ * @max_npages to ULONG_MAX, the parameter is effectively ignored.
+ *
+ * Return: The remaining number of pages, limited by @max_npages.
+ */
+static inline unsigned long folio_remaining_pages(struct folio *folio,
+		struct page *page, unsigned long max_npages)
+{
+	unsigned long nr;
+
+	if (!folio_test_large(folio))
+		return 1;
+	nr = folio_large_nr_pages(folio) - folio_page_idx(folio, page);
+	if (__builtin_constant_p(max_npages) && max_npages == ULONG_MAX)
+		return nr;
+	return min_t(unsigned long, nr, max_npages);
+}
+
 /* Only hugetlbfs can allocate folios larger than MAX_ORDER */
 #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
 #define MAX_FOLIO_NR_PAGES	(1UL << PUD_ORDER)
diff --git a/mm/gup.c b/mm/gup.c
index 6888e871a74a9..3385428d028f6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -234,21 +234,6 @@ void folio_add_pin(struct folio *folio)
 	}
 }
 
-static inline struct folio *gup_folio_range_next(struct page *start,
-		unsigned long npages, unsigned long i, unsigned int *ntails)
-{
-	struct page *next = nth_page(start, i);
-	struct folio *folio = page_folio(next);
-	unsigned int nr = 1;
-
-	if (folio_test_large(folio))
-		nr = min_t(unsigned int, npages - i,
-			   folio_nr_pages(folio) - folio_page_idx(folio, next));
-
-	*ntails = nr;
-	return folio;
-}
-
 static inline struct folio *gup_folio_next(struct page **list,
 		unsigned long npages, unsigned long i, unsigned int *ntails)
 {
@@ -356,11 +341,13 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
 				      bool make_dirty)
 {
 	unsigned long i;
-	struct folio *folio;
 	unsigned int nr;
 
 	for (i = 0; i < npages; i += nr) {
-		folio = gup_folio_range_next(page, npages, i, &nr);
+		struct page *next = nth_page(page, i);
+		struct folio *folio = page_folio(next);
+
+		nr = folio_remaining_pages(folio, next, npages - i);
 		if (make_dirty && !folio_test_dirty(folio)) {
 			folio_lock(folio);
 			folio_mark_dirty(folio);
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index ece3bfc56bcd5..6f75defb38d73 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -96,7 +96,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
 				return page;
 			}
 
-			skip_pages = folio_nr_pages(folio) - folio_page_idx(folio, page);
+			skip_pages = folio_remaining_pages(folio, page, ULONG_MAX);
 			pfn += skip_pages - 1;
 			continue;
 		}
--

But up to you, anything with such a helper function makes the code easier
to digest.

-- 
Cheers,

David / dhildenb

On Wed, 18 Jun 2025 10:54:38 +0200, david@redhat.com wrote:

> I would just do something like that, removing gup_folio_range_next()
> completely.
[...]
> But up to you, anything with such a helper function makes the code easier
> to digest.

Thank you. This implementation looks better. Please allow me to integrate
this patch into my patchset.

Thanks,
Zhe

On Wed, 18 Jun 2025 14:11:43 +0800, lizhe.67@bytedance.com wrote:

> +static inline unsigned long vfio_folio_remaining_pages(
> +		struct folio *folio, struct page *start_page,
> +		unsigned long max_pages)
> +{
> +	if (!folio_test_large(folio))
> +		return 1;

The above two lines may no longer be necessary.

> +	return min(max_pages,
> +		   folio_remaining_pages(folio, start_page));
> +}

Thanks,
Zhe
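
Pulling the thread's conclusions together, the unpin loop in
vfio_unpin_pages_remote() would presumably end up looking roughly like the
sketch below once the "min" logic lives in the proposed mm.h helper
folio_remaining_pages(folio, page, max_npages) and the vfio-side wrapper is
dropped. This is only an illustrative sketch assembled from the snippets
above, not the patch that was actually posted; the trailing return and
anything past the accounting call are assumed unchanged from the existing
function.

static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
				    unsigned long pfn, unsigned long npage,
				    bool do_accounting)
{
	long unlocked = 0, locked = vpfn_pages(dma, iova, npage);

	while (npage) {
		unsigned long nr_pages = 1;

		if (!is_invalid_reserved_pfn(pfn)) {
			struct page *page = pfn_to_page(pfn);
			struct folio *folio = page_folio(page);

			/*
			 * All pages of a folio share the same
			 * invalid/reserved state and the PFNs here are
			 * contiguous, so the rest of the folio (clamped
			 * to npage) can be unpinned in one call.
			 */
			nr_pages = folio_remaining_pages(folio, page, npage);
			unpin_user_folio_dirty_locked(folio, nr_pages,
						      dma->prot & IOMMU_WRITE);
			unlocked += nr_pages;
		}

		pfn += nr_pages;
		npage -= nr_pages;
	}

	if (do_accounting)
		vfio_lock_acct(dma, locked - unlocked, true);

	/* remainder of the function assumed unchanged */
	return unlocked;
}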