Use PTE batching to batch process PTEs mapping the same large folio. An
improvement is expected due to batching refcount-mapcount manipulation on
the folios, and for arm64 which supports contig mappings, the number of
TLB flushes is also reduced.
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a55fb1dcd224..f23e943506bc 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -700,12 +700,15 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 						spinlock_t *ptl,
 						struct list_head *compound_pagelist)
 {
+	unsigned long end = address + HPAGE_PMD_SIZE;
 	struct folio *src, *tmp;
-	pte_t *_pte;
 	pte_t pteval;
+	pte_t *_pte;
+	unsigned int nr_ptes;
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
-	     _pte++, address += PAGE_SIZE) {
+	for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
+	     address += nr_ptes * PAGE_SIZE) {
+		nr_ptes = 1;
 		pteval = ptep_get(_pte);
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
 			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
@@ -722,18 +725,26 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 			struct page *src_page = pte_page(pteval);
 
 			src = page_folio(src_page);
-			if (!folio_test_large(src))
+
+			if (folio_test_large(src)) {
+				unsigned int max_nr_ptes = (end - address) >> PAGE_SHIFT;
+
+				nr_ptes = folio_pte_batch(src, _pte, pteval, max_nr_ptes);
+			} else {
 				release_pte_folio(src);
+			}
+
 			/*
 			 * ptl mostly unnecessary, but preempt has to
 			 * be disabled to update the per-cpu stats
 			 * inside folio_remove_rmap_pte().
 			 */
 			spin_lock(ptl);
-			ptep_clear(vma->vm_mm, address, _pte);
-			folio_remove_rmap_pte(src, src_page, vma);
+			clear_ptes(vma->vm_mm, address, _pte, nr_ptes);
+			folio_remove_rmap_ptes(src, src_page, nr_ptes, vma);
 			spin_unlock(ptl);
-			free_folio_and_swap_cache(src);
+			free_swap_cache(src);
+			folio_put_refs(src, nr_ptes);
 		}
 	}
 
--
2.30.2
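
[Editorial aside, not part of the thread] As background for the commit message above: folio_pte_batch() reports how many consecutive PTEs, starting at the current one, map consecutive pages of the same large folio, so the rmap, refcount, and PTE-clear work can be done once per batch instead of once per PTE. The following is a heavily simplified, hedged sketch of that idea only; the helper name count_consecutive_ptes() is hypothetical, and the real folio_pte_batch() additionally takes PTE state bits and other corner cases into account.

	/*
	 * Hypothetical, simplified sketch -- NOT the kernel's folio_pte_batch().
	 * It only checks that successive present PTEs map successive PFNs that
	 * still belong to @folio, capped at @max_nr.
	 */
	static inline unsigned int count_consecutive_ptes(struct folio *folio,
							  pte_t *pte, pte_t pteval,
							  unsigned int max_nr)
	{
		unsigned long next_pfn = pte_pfn(pteval) + 1;
		unsigned long end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
		unsigned int nr = 1;

		while (nr < max_nr && next_pfn < end_pfn) {
			pte_t cur = ptep_get(pte + nr);

			/* Stop at the first hole or at a PTE pointing elsewhere. */
			if (!pte_present(cur) || pte_pfn(cur) != next_pfn)
				break;
			nr++;
			next_pfn++;
		}
		return nr;
	}

With nr_ptes known, the loop in the patch can clear nr_ptes PTEs with a single clear_ptes() call and drop nr_ptes references with a single folio_put_refs(), which is where the batching win comes from.
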
On 24 Jul 2025, at 1:23, Dev Jain wrote:

> Use PTE batching to batch process PTEs mapping the same large folio. An
> improvement is expected due to batching refcount-mapcount manipulation on
> the folios, and for arm64 which supports contig mappings, the number of
> TLB flushes is also reduced.
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/khugepaged.c | 25 ++++++++++++++++++-------
>  1 file changed, 18 insertions(+), 7 deletions(-)
>

Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi
Trying this again as my mail client apparently messed this up:

NIT: Please don't capitalise 'Optimize' here.

I think Andrew fixed this for you actually in the repo though :P

On Thu, Jul 24, 2025 at 10:53:00AM +0530, Dev Jain wrote:
> Use PTE batching to batch process PTEs mapping the same large folio. An
> improvement is expected due to batching refcount-mapcount manipulation on
> the folios, and for arm64 which supports contig mappings, the number of
> TLB flushes is also reduced.
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/khugepaged.c | 25 ++++++++++++++++++-------
>  1 file changed, 18 insertions(+), 7 deletions(-)
>
[snip quoted patch down to the relevant hunk]
>
>  			spin_lock(ptl);
> -			ptep_clear(vma->vm_mm, address, _pte);
> -			folio_remove_rmap_pte(src, src_page, vma);
> +			clear_ptes(vma->vm_mm, address, _pte, nr_ptes);
> +			folio_remove_rmap_ptes(src, src_page, nr_ptes, vma);
>  			spin_unlock(ptl);
> -			free_folio_and_swap_cache(src);
> +			free_swap_cache(src);
> +			folio_put_refs(src, nr_ptes);

Hm one thing here though is the free_folio_and_swap_cache() does:

	free_swap_cache(folio);
	if (!is_huge_zero_folio(folio))
		folio_put(folio);

Whereas here you unconditionally reduce the reference count. Might this
cause issues with the shrinker version of the huge zero folio?

Should this be:

	if (!is_huge_zero_folio(src))
		folio_put_refs(src, nr_ptes);

Or do we otherwise avoid issues with this?

> 		}
> 	}
>
> --
> 2.30.2
>
On Thu, Jul 24, 2025 at 06:55:56PM +0100, Lorenzo Stoakes wrote:
> Trying this again as my mail client apparently messed this up:
>
> NIT: Please don't capitalise 'Optimize' here.
>
> I think Andrew fixed this for you actually in the repo though :P
>
> On Thu, Jul 24, 2025 at 10:53:00AM +0530, Dev Jain wrote:
> > Use PTE batching to batch process PTEs mapping the same large folio. An
> > improvement is expected due to batching refcount-mapcount manipulation on
> > the folios, and for arm64 which supports contig mappings, the number of
> > TLB flushes is also reduced.
> >
> > Acked-by: David Hildenbrand <david@redhat.com>
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Signed-off-by: Dev Jain <dev.jain@arm.com>

With the concern I raised addressed by David, this LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

[snip rest of quoted patch and earlier review comments]
>> +			if (folio_test_large(src)) {
>> +				unsigned int max_nr_ptes = (end - address) >> PAGE_SHIFT;
>> +
>> +				nr_ptes = folio_pte_batch(src, _pte, pteval, max_nr_ptes);
>> +			} else {
>>  				release_pte_folio(src);
>> +			}
>> +
>>  			/*
>>  			 * ptl mostly unnecessary, but preempt has to
>>  			 * be disabled to update the per-cpu stats
>>  			 * inside folio_remove_rmap_pte().
>>  			 */
>>  			spin_lock(ptl);
>> -			ptep_clear(vma->vm_mm, address, _pte);
>> -			folio_remove_rmap_pte(src, src_page, vma);
>> +			clear_ptes(vma->vm_mm, address, _pte, nr_ptes);
>> +			folio_remove_rmap_ptes(src, src_page, nr_ptes, vma);
>>  			spin_unlock(ptl);
>> -			free_folio_and_swap_cache(src);
>> +			free_swap_cache(src);
>> +			folio_put_refs(src, nr_ptes);
>
> Hm one thing here though is the free_folio_and_swap_cache() does:
>
> 	free_swap_cache(folio);
> 	if (!is_huge_zero_folio(folio))
> 		folio_put(folio);
>
> Whereas here you unconditionally reduce the reference count. Might this
> cause issues with the shrinker version of the huge zero folio?
>
> Should this be:
>
> 	if (!is_huge_zero_folio(src))
> 		folio_put_refs(src, nr_ptes);
>
> Or do we otherwise avoid issues with this?

(resending my reply)

The huge zero folio is never PTE-mapped.

--
Cheers,

David / dhildenb
On Thu, Jul 24, 2025 at 07:57:22PM +0200, David Hildenbrand wrote:

[snip quoted hunk and review question]

> (resending my reply)
>
> The huge zero folio is never PTE-mapped.

OK fine, as mentioned off-list I hate this kind of 'implicit' knowledge,
and you pointed out that really we should be using vm_normal_page() or
equivalent in this code.

One to address at some point :)

Anyway with this concern addressed, the patch is fine, will send tag...

> --
> Cheers,
>
> David / dhildenb
>
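
[Editorial aside, not part of the thread] To restate the refcount discussion above in code: the old per-PTE path went through free_folio_and_swap_cache(), which guards its single folio_put() with !is_huge_zero_folio(); the batched path open-codes the put and drops that guard because, as David notes, the huge zero folio is never PTE-mapped in this collapse path. The helper below is purely hypothetical (not in the patch) and only makes the merged accounting explicit.

	/*
	 * Hypothetical helper restating the merged batched behaviour:
	 * drop the swap cache reference handling to free_swap_cache() and
	 * release one folio reference per batched PTE in a single call.
	 * A more defensive variant (suggested in review, not merged) would
	 * wrap the put in "if (!is_huge_zero_folio(src))".
	 */
	static inline void release_src_folio_batched(struct folio *src,
						     unsigned int nr_ptes)
	{
		free_swap_cache(src);
		folio_put_refs(src, nr_ptes);	/* one put per PTE, batched */
	}
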