[v5] support batch checking of references and unmapping for large folios

[PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

Posted by Baolin Wang 1 month, 1 week ago

Similar to folio_referenced_one(), we can apply batched unmapping for file
large folios to optimize the performance of file folios reclamation.

Barry previously implemented batched unmapping for lazyfree anonymous large
folios[1] and did not further optimize anonymous large folios or file-backed
large folios at that stage. As for file-backed large folios, the batched
unmapping support is relatively straightforward, as we only need to clear
the consecutive (present) PTE entries for file-backed large folios.

Performance testing:
Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
75% performance improvement on my Arm64 32-core server (and 50%+ improvement
on my X86 machine) with this patch.

W/o patch:
real    0m1.018s
user    0m0.000s
sys     0m1.018s

W/ patch:
real    0m0.249s
user    0m0.000s
sys     0m0.249s

[1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Barry Song <baohua@kernel.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/rmap.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 985ab0b085ba..e1d16003c514 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
 	end_addr = pmd_addr_end(addr, vma->vm_end);
 	max_nr = (end_addr - addr) >> PAGE_SHIFT;
 
-	/* We only support lazyfree batching for now ... */
-	if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
+	/* We only support lazyfree or file folios batching for now ... */
+	if (folio_test_anon(folio) && folio_test_swapbacked(folio))
 		return 1;
+
 	if (pte_unused(pte))
 		return 1;
 
@@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			 *
 			 * See Documentation/mm/mmu_notifier.rst
 			 */
-			dec_mm_counter(mm, mm_counter_file(folio));
+			add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
 		}
 discard:
 		if (unlikely(folio_test_hugetlb(folio))) {
-- 
2.47.3
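
For readers following the first hunk, here is a minimal userspace sketch of how
the batching condition changes. It is not kernel code: struct folio_model and
both helper names are hypothetical, and the two booleans only stand in for
folio_test_anon() and folio_test_swapbacked().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two folio flags the check looks at. */
struct folio_model {
	bool anon;        /* models folio_test_anon() */
	bool swapbacked;  /* models folio_test_swapbacked() */
};

/* Pre-patch check: only lazyfree folios (anon, not swap-backed) batch. */
static bool batch_allowed_old(struct folio_model f)
{
	return !(!f.anon || f.swapbacked);
}

/* Post-patch check: only swap-backed anonymous folios are refused. */
static bool batch_allowed_new(struct folio_model f)
{
	return !(f.anon && f.swapbacked);
}
```

Under this model, a non-anonymous folio passes the new check regardless of its
swap-backed flag, a lazyfree folio passes both checks, and a swap-backed
anonymous folio is still limited to one PTE at a time.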

Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

Posted by Lorenzo Stoakes 3 weeks ago

FYI Dev found an issue here, see [0].

[0]: https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/

Cheers, Lorenzo

Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

Posted by Harry Yoo 1 month ago

Looks good to me, so:
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

-- 
Cheers,
Harry / Hyeonggon

Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

Posted by Wei Yang 1 month ago

Hi, Baolin

When reading your patch, I came up with one small question.

The current try_to_unmap_one() has the following structure:

    try_to_unmap_one()
        while (page_vma_mapped_walk(&pvmw)) {
            nr_pages = folio_unmap_pte_batch()

            if (nr_pages == folio_nr_pages(folio))
                goto walk_done;
        }

I am wondering what happens if nr_pages > 1 but nr_pages != folio_nr_pages(folio).

If my understanding is correct, page_vma_mapped_walk() would start from
(pvmw->address + PAGE_SIZE) in the next iteration, but we have already cleared up to
(pvmw->address + nr_pages * PAGE_SIZE), right?

I am not sure my understanding is correct. If it is, is there a reason not to
skip the cleared range?

-- 
Wei Yang
Help you, Help me

Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

Posted by Barry Song 1 month ago

On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> Hi, Baolin
>
> When reading your patch, I come up one small question.
>
> Current try_to_unmap_one() has following structure:
>
>     try_to_unmap_one()
>         while (page_vma_mapped_walk(&pvmw)) {
>             nr_pages = folio_unmap_pte_batch()
>
>             if (nr_pages == folio_nr_pages(folio))
>                 goto walk_done;
>         }
>
> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
>
> If my understanding is correct, page_vma_mapped_walk() would start from
> (pvmw->address + PAGE_SIZE) in next iteration, but we have already cleared to
> (pvmw->address + nr_pages * PAGE_SIZE), right?
>
> Not sure my understanding is correct, if so do we have some reason not to
> skip the cleared range?

I don’t quite understand your question. For nr_pages > 1 but not equal
to folio_nr_pages(folio), page_vma_mapped_walk() will skip the nr_pages - 1 PTEs inside.

Take a look:

next_pte:
                do {
                        pvmw->address += PAGE_SIZE;
                        if (pvmw->address >= end)
                                return not_found(pvmw);
                        /* Did we cross page table boundary? */
                        if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
                                if (pvmw->ptl) {
                                        spin_unlock(pvmw->ptl);
                                        pvmw->ptl = NULL;
                                }
                                pte_unmap(pvmw->pte);
                                pvmw->pte = NULL;
                                pvmw->flags |= PVMW_PGTABLE_CROSSED;
                                goto restart;
                        }
                        pvmw->pte++;
                } while (pte_none(ptep_get(pvmw->pte)));
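
The effect of that loop after a batched clear can be sketched in userspace as
follows. This is only a model under stated assumptions: the PTEs are a plain
array, PTE_NONE, clear_batch() and next_non_none() are hypothetical names, and
locking and the page table boundary check are left out.

```c
#include <assert.h>
#include <stddef.h>

#define PTE_NONE 0UL

/* Clear nr entries starting at idx, as a batched unmap would. */
static void clear_batch(unsigned long *ptes, size_t nr_ptes,
			size_t idx, size_t nr)
{
	assert(idx + nr <= nr_ptes);
	for (size_t i = 0; i < nr; i++)
		ptes[idx + i] = PTE_NONE;
}

/*
 * Model of the skip loop: step forward from idx until a non-none entry
 * is found or the range ends. Returns the index of that entry, or
 * nr_ptes if there is none. *reads counts the entries inspected.
 */
static size_t next_non_none(const unsigned long *ptes, size_t nr_ptes,
			    size_t idx, size_t *reads)
{
	do {
		idx++;
		if (idx >= nr_ptes)
			return nr_ptes;
		(*reads)++;
	} while (ptes[idx] == PTE_NONE);

	return idx;
}
```

With eight present entries and a batch of four cleared at index 0, the model
lands on index 4 after inspecting four entries: the three cleared ones that
follow the starting entry, plus the one it stops on.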


>
> --
> Wei Yang
> Help you, Help me

Thanks
Barry

Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

Posted by Wei Yang 1 month ago

On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>I don’t quite understand your question. For nr_pages > 1 but not equal
>to nr_pages, page_vma_mapped_walk will skip the nr_pages - 1 PTEs inside.
>
>take a look:
>
>next_pte:
>                do {
>                        pvmw->address += PAGE_SIZE;
>                        if (pvmw->address >= end)
>                                return not_found(pvmw);
>                        /* Did we cross page table boundary? */
>                        if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
>                                if (pvmw->ptl) {
>                                        spin_unlock(pvmw->ptl);
>                                        pvmw->ptl = NULL;
>                                }
>                                pte_unmap(pvmw->pte);
>                                pvmw->pte = NULL;
>                                pvmw->flags |= PVMW_PGTABLE_CROSSED;
>                                goto restart;
>                        }
>                        pvmw->pte++;
>                } while (pte_none(ptep_get(pvmw->pte)));
>

Yes, we do it in page_vma_mapped_walk() now. Since they are pte_none(), they
will be skipped.

I mean maybe we can skip it in try_to_unmap_one(), for example:

diff --git a/mm/rmap.c b/mm/rmap.c
index 9e5bd4834481..ea1afec7c802 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		 */
 		if (nr_pages == folio_nr_pages(folio))
 			goto walk_done;
+		else {
+			pvmw.address += PAGE_SIZE * (nr_pages - 1);
+			pvmw.pte += nr_pages - 1;
+		}
 		continue;
 walk_abort:
 		ret = false;

Not sure this is reasonable.
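
To make the difference concrete, here is a small userspace comparison of the
two ways of resuming the walk. It is a sketch under assumptions, not kernel
code: struct walk_result, resume_walk() and PTE_NONE are hypothetical, the PTEs
are a plain array, and locking and page table boundaries are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PTE_NONE 0UL

struct walk_result {
	size_t next;   /* index of the next non-none entry, or nr_ptes */
	size_t reads;  /* entries inspected on the way there */
};

/*
 * Model of resuming the walk after a batch of nr_batched entries starting
 * at idx has been cleared. With explicit_skip the caller first advances
 * past the batch, as in the diff above; otherwise the none-skip loop has
 * to step over the cleared entries one at a time.
 */
static struct walk_result resume_walk(const unsigned long *ptes,
				      size_t nr_ptes, size_t idx,
				      size_t nr_batched, bool explicit_skip)
{
	struct walk_result res = { .next = nr_ptes, .reads = 0 };

	assert(nr_batched >= 1 && idx + nr_batched <= nr_ptes);
	if (explicit_skip)
		idx += nr_batched - 1;

	do {
		idx++;
		if (idx >= nr_ptes)
			return res;
		res.reads++;
	} while (ptes[idx] == PTE_NONE);

	res.next = idx;
	return res;
}
```

In this model both variants reach the same next entry. The only difference is
how many entries are inspected on the way: nr_batched without the explicit
advance, one with it.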


-- 
Wei Yang
Help you, Help me

Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

Posted by Dev Jain 3 weeks ago

On 07/01/26 7:16 am, Wei Yang wrote:
> Yes, we do it in page_vma_mapped_walk() now. Since they are pte_none(), they
> will be skipped.
>
> I mean maybe we can skip it in try_to_unmap_one(), for example:
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 9e5bd4834481..ea1afec7c802 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  		 */
>  		if (nr_pages == folio_nr_pages(folio))
>  			goto walk_done;
> +		else {
> +			pvmw.address += PAGE_SIZE * (nr_pages - 1);
> +			pvmw.pte += nr_pages - 1;
> +		}
>  		continue;
>  walk_abort:
>  		ret = false;

I am of the opinion that we should do something like this. In the internal pvmw code,
we keep skipping ptes as long as they are none. With my proposed uffd fix [1], if the old
ptes were uffd-wp armed, pte_install_uffd_wp_if_needed() will convert all ptes from none
to not none, and we will lose the batching effect. I also plan to extend support to
anonymous folios (thereby generalizing to all types of memory), which will set a
batch of ptes as swap entries, and the internal pvmw code won't be able to skip through the
batch.


[1] https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/
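
A rough userspace model of this concern is sketched below. Everything in it is
an assumption for illustration: enum pte_kind, unmap_batch() and
unskippable_entries() are hypothetical, and PTE_KIND_MARKER stands for any
entry that is neither none nor present (a uffd-wp marker or a swap entry). What
the kernel does with such entries after the skip loop is outside this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum pte_kind { PTE_KIND_NONE, PTE_KIND_PRESENT, PTE_KIND_MARKER };

/*
 * Model of a batched unmap: each entry becomes none, or a non-none
 * marker when the mapping was uffd-wp armed.
 */
static void unmap_batch(enum pte_kind *ptes, size_t nr_ptes,
			size_t idx, size_t nr, bool uffd_wp_armed)
{
	assert(idx + nr <= nr_ptes);
	for (size_t i = 0; i < nr; i++)
		ptes[idx + i] = uffd_wp_armed ? PTE_KIND_MARKER : PTE_KIND_NONE;
}

/*
 * Count the entries after idx that a "skip while none" loop cannot step
 * over before it reaches the next present entry or the end of the range.
 */
static size_t unskippable_entries(const enum pte_kind *ptes, size_t nr_ptes,
				  size_t idx)
{
	size_t stops = 0;

	for (size_t i = idx + 1; i < nr_ptes; i++) {
		if (ptes[i] == PTE_KIND_PRESENT)
			break;
		if (ptes[i] != PTE_KIND_NONE)
			stops++;
	}
	return stops;
}
```

For a batch of four at index 0, the none case leaves nothing for the loop to
stop on, while the marker case leaves three entries that each need a closer
look. Advancing explicitly past the batch does not depend on what the cleared
entries were replaced with.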

>
> Not sure this is reasonable.
>
>

Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

Posted by Barry Song 3 weeks ago

> >
> > I mean maybe we can skip it in try_to_unmap_one(), for example:
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 9e5bd4834481..ea1afec7c802 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >                */
> >               if (nr_pages == folio_nr_pages(folio))
> >                       goto walk_done;
> > +             else {
> > +                     pvmw.address += PAGE_SIZE * (nr_pages - 1);
> > +                     pvmw.pte += nr_pages - 1;
> > +             }
> >               continue;
> >  walk_abort:
> >               ret = false;
>
> I am of the opinion that we should do something like this. In the internal pvmw code,
> we keep skipping ptes till the ptes are none. With my proposed uffd-fix [1], if the old
> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert all ptes from none
> to not none, and we will lose the batching effect. I also plan to extend support to
> anonymous folios (therefore generalizing for all types of memory) which will set a

I posted an RFC on anon folios quite some time ago [1].
It’s great to hear that you’re interested in taking this over.

[1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/

> batch of ptes as swap, and the internal pvmw code won't be able to skip through the
> batch.

Interesting, I didn’t catch this issue in the RFC earlier. Back then,
we only supported nr == 1 and nr == folio_nr_pages(folio). When
nr == folio_nr_pages(folio), the page_vma_mapped_walk() loop would exit
entirely. With Lance’s commit ddd05742b45b08, arbitrary nr in [1, nr_pages]
is now supported, which means we have to handle all the complexity. :-)

Thanks
Barry

Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

Posted by Dev Jain 2 weeks, 6 days ago

On 16/01/26 8:44 pm, Barry Song wrote:
>>> I mean maybe we can skip it in try_to_unmap_one(), for example:
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 9e5bd4834481..ea1afec7c802 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>                */
>>>               if (nr_pages == folio_nr_pages(folio))
>>>                       goto walk_done;
>>> +             else {
>>> +                     pvmw.address += PAGE_SIZE * (nr_pages - 1);
>>> +                     pvmw.pte += nr_pages - 1;
>>> +             }
>>>               continue;
>>>  walk_abort:
>>>               ret = false;
>> I am of the opinion that we should do something like this. In the internal pvmw code,
>> we keep skipping ptes till the ptes are none. With my proposed uffd-fix [1], if the old
>> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert all ptes from none
>> to not none, and we will lose the batching effect. I also plan to extend support to
>> anonymous folios (therefore generalizing for all types of memory) which will set a
> I posted an RFC on anon folios quite some time ago [1].
> It’s great to hear that you’re interested in taking this over.
>
> [1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/

Great! Now I have a reference to look at :)

>
>> batch of ptes as swap, and the internal pvmw code won't be able to skip through the
>> batch.
> Interesting — I didn’t catch this issue in the RFC earlier. Back then,
> we only supported nr == 1 and nr == folio_nr_pages(folio). When
> nr == nr_pages, page_vma_mapped_walk() would break entirely. With
> Lance’s commit ddd05742b45b08, arbitrary nr in [1, nr_pages] is now
> supported, which means we have to handle all the complexity. :-)
>
> Thanks
> Barry

Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

Posted by Barry Song 3 weeks ago

On Fri, Jan 16, 2026 at 5:53 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
> On 07/01/26 7:16 am, Wei Yang wrote:
> > On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
> >> On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com> wrote:
> >>> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
> >>>> Similar to folio_referenced_one(), we can apply batched unmapping for file
> >>>> large folios to optimize the performance of file folios reclamation.
> >>>>
> >>>> Barry previously implemented batched unmapping for lazyfree anonymous large
> >>>> folios[1] and did not further optimize anonymous large folios or file-backed
> >>>> large folios at that stage. As for file-backed large folios, the batched
> >>>> unmapping support is relatively straightforward, as we only need to clear
> >>>> the consecutive (present) PTE entries for file-backed large folios.
> >>>>
> >>>> Performance testing:
> >>>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> >>>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> >>>> 75% performance improvement on my Arm64 32-core server (and 50%+ improvement
> >>>> on my X86 machine) with this patch.
> >>>>
> >>>> W/o patch:
> >>>> real    0m1.018s
> >>>> user    0m0.000s
> >>>> sys     0m1.018s
> >>>>
> >>>> W/ patch:
> >>>> real   0m0.249s
> >>>> user   0m0.000s
> >>>> sys    0m0.249s
> >>>>
> >>>> [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
> >>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> >>>> Acked-by: Barry Song <baohua@kernel.org>
> >>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >>>> ---
> >>>> mm/rmap.c | 7 ++++---
> >>>> 1 file changed, 4 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>> index 985ab0b085ba..e1d16003c514 100644
> >>>> --- a/mm/rmap.c
> >>>> +++ b/mm/rmap.c
> >>>> @@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> >>>>       end_addr = pmd_addr_end(addr, vma->vm_end);
> >>>>       max_nr = (end_addr - addr) >> PAGE_SHIFT;
> >>>>
> >>>> -      /* We only support lazyfree batching for now ... */
> >>>> -      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
> >>>> +      /* We only support lazyfree or file folios batching for now ... */
> >>>> +      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
> >>>>               return 1;
> >>>> +
> >>>>       if (pte_unused(pte))
> >>>>               return 1;
> >>>>
> >>>> @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>>>                        *
> >>>>                        * See Documentation/mm/mmu_notifier.rst
> >>>>                        */
> >>>> -                      dec_mm_counter(mm, mm_counter_file(folio));
> >>>> +                      add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
> >>>>               }
> >>>> discard:
> >>>>               if (unlikely(folio_test_hugetlb(folio))) {
> >>>> --
> >>>> 2.47.3
> >>>>
> >>> Hi, Baolin
> >>>
> >>> When reading your patch, I come up one small question.
> >>>
> >>> Current try_to_unmap_one() has following structure:
> >>>
> >>>     try_to_unmap_one()
> >>>         while (page_vma_mapped_walk(&pvmw)) {
> >>>             nr_pages = folio_unmap_pte_batch()
> >>>
> >>>             if (nr_pages == folio_nr_pages(folio))
> >>>                 goto walk_done;
> >>>         }
> >>>
> >>> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
> >>>
> >>> If my understanding is correct, page_vma_mapped_walk() would start from
> >>> (pvmw->address + PAGE_SIZE) in next iteration, but we have already cleared to
> >>> (pvmw->address + nr_pages * PAGE_SIZE), right?
> >>>
> >>> Not sure my understanding is correct, if so do we have some reason not to
> >>> skip the cleared range?
> >> I don’t quite understand your question. For nr_pages > 1 but not equal to
> >> folio_nr_pages(folio), page_vma_mapped_walk() will skip the nr_pages - 1
> >> PTEs inside.
> >>
> >> take a look:
> >>
> >> next_pte:
> >>                do {
> >>                        pvmw->address += PAGE_SIZE;
> >>                        if (pvmw->address >= end)
> >>                                return not_found(pvmw);
> >>                        /* Did we cross page table boundary? */
> >>                        if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
> >>                                if (pvmw->ptl) {
> >>                                        spin_unlock(pvmw->ptl);
> >>                                        pvmw->ptl = NULL;
> >>                                }
> >>                                pte_unmap(pvmw->pte);
> >>                                pvmw->pte = NULL;
> >>                                pvmw->flags |= PVMW_PGTABLE_CROSSED;
> >>                                goto restart;
> >>                        }
> >>                        pvmw->pte++;
> >>                } while (pte_none(ptep_get(pvmw->pte)));
> >>
> > Yes, we do it in page_vma_mapped_walk() now. Since they are pte_none(), they
> > will be skipped.
> >
> > I mean maybe we can skip it in try_to_unmap_one(), for example:
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 9e5bd4834481..ea1afec7c802 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >                */
> >               if (nr_pages == folio_nr_pages(folio))
> >                       goto walk_done;
> > +             else {
> > +                     pvmw.address += PAGE_SIZE * (nr_pages - 1);
> > +                     pvmw.pte += nr_pages - 1;
> > +             }
> >               continue;
> >  walk_abort:
> >               ret = false;
>
> I am of the opinion that we should do something like this. In the internal pvmw code,

I am still not convinced that skipping PTEs in try_to_unmap_one()
is the right place. If we really want to skip certain PTEs early,
should we instead hint page_vma_mapped_walk()? That said, I don't
see much value in doing so, since in most cases nr is either 1 or
folio_nr_pages(folio).

> we keep skipping ptes till the ptes are none. With my proposed uffd-fix [1], if the old
> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert all ptes from none
> to not none, and we will lose the batching effect. I also plan to extend support to
> anonymous folios (therefore generalizing for all types of memory) which will set a
> batch of ptes as swap, and the internal pvmw code won't be able to skip through the
> batch.

Thanks for catching this, Dev. I already filter out some of the more
complex cases, for example:
if (pte_unused(pte))
        return 1;

Since the userfaultfd write-protection case is also a corner case,
could we filter it out as well?

diff --git a/mm/rmap.c b/mm/rmap.c
index c86f1135222b..6bb8ba6f046e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1870,6 +1870,9 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
        if (pte_unused(pte))
                return 1;

+       if (userfaultfd_wp(vma))
+               return 1;
+
        return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
}

Just offering a second option — yours is probably better.

Thanks
Barry
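
For reference, the eligibility rules discussed so far can be modelled in
plain userspace C. This is only a sketch, not kernel code: struct
toy_mapping, enum toy_rule and toy_unmap_batch() are hypothetical names, the
boolean fields stand in for folio_test_anon(), folio_test_swapbacked(),
pte_unused() and userfaultfd_wp(vma), and the contiguous field stands in for
whatever folio_pte_batch() would return. It compares the check before the
patch, the check with the patch, and the check with the extra
userfaultfd_wp() filter proposed above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy description of one folio mapping; not a kernel structure. */
struct toy_mapping {
	bool anon;               /* stands in for folio_test_anon() */
	bool swapbacked;         /* stands in for folio_test_swapbacked() */
	bool pte_unused;         /* stands in for pte_unused() on the first PTE */
	bool vma_uffd_wp;        /* stands in for userfaultfd_wp(vma) */
	unsigned int contiguous; /* stands in for the folio_pte_batch() result */
};

/* Which variant of the eligibility check to model (hypothetical names). */
enum toy_rule {
	RULE_BEFORE_PATCH,     /* lazyfree folios only */
	RULE_WITH_PATCH,       /* lazyfree or file folios */
	RULE_WITH_UFFD_FILTER, /* as above, but never batch in uffd-wp VMAs */
};

/* Returns how many PTEs would be unmapped in one go under the given rule. */
unsigned int toy_unmap_batch(const struct toy_mapping *m, unsigned int max_nr,
			     enum toy_rule rule)
{
	assert(max_nr >= 1 && m->contiguous >= 1);

	if (rule == RULE_BEFORE_PATCH) {
		/* Old check: anything that is not lazyfree is handled per page. */
		if (!m->anon || m->swapbacked)
			return 1;
	} else {
		/* New check: only swap-backed anonymous folios are per page. */
		if (m->anon && m->swapbacked)
			return 1;
	}

	if (m->pte_unused)
		return 1;

	if (rule == RULE_WITH_UFFD_FILTER && m->vma_uffd_wp)
		return 1;

	/* The batch is capped by the caller-supplied maximum. */
	return m->contiguous < max_nr ? m->contiguous : max_nr;
}
```

Under this model, a clean file folio mapped by 16 contiguous PTEs yields 1
before the patch and 16 with it, and drops back to 1 if the uffd-wp filter
is added and the VMA has uffd-wp enabled.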

Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

Posted by Dev Jain 2 weeks, 6 days ago

On 16/01/26 7:58 pm, Barry Song wrote:
> On Fri, Jan 16, 2026 at 5:53 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>> On 07/01/26 7:16 am, Wei Yang wrote:
>>> On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>>>> On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>>>>> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>>>>>> Similar to folio_referenced_one(), we can apply batched unmapping for file
>>>>>> large folios to optimize the performance of file folios reclamation.
>>>>>>
>>>>>> Barry previously implemented batched unmapping for lazyfree anonymous large
>>>>>> folios[1] and did not further optimize anonymous large folios or file-backed
>>>>>> large folios at that stage. As for file-backed large folios, the batched
>>>>>> unmapping support is relatively straightforward, as we only need to clear
>>>>>> the consecutive (present) PTE entries for file-backed large folios.
>>>>>>
>>>>>> Performance testing:
>>>>>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>>>>>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>>>>>> 75% performance improvement on my Arm64 32-core server (and 50%+ improvement
>>>>>> on my X86 machine) with this patch.
>>>>>>
>>>>>> W/o patch:
>>>>>> real    0m1.018s
>>>>>> user    0m0.000s
>>>>>> sys     0m1.018s
>>>>>>
>>>>>> W/ patch:
>>>>>> real   0m0.249s
>>>>>> user   0m0.000s
>>>>>> sys    0m0.249s
>>>>>>
>>>>>> [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>> Acked-by: Barry Song <baohua@kernel.org>
>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>> ---
>>>>>> mm/rmap.c | 7 ++++---
>>>>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 985ab0b085ba..e1d16003c514 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>>>>>>       end_addr = pmd_addr_end(addr, vma->vm_end);
>>>>>>       max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>>>>>
>>>>>> -      /* We only support lazyfree batching for now ... */
>>>>>> -      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>>>>>> +      /* We only support lazyfree or file folios batching for now ... */
>>>>>> +      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>>>>>               return 1;
>>>>>> +
>>>>>>       if (pte_unused(pte))
>>>>>>               return 1;
>>>>>>
>>>>>> @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>>                        *
>>>>>>                        * See Documentation/mm/mmu_notifier.rst
>>>>>>                        */
>>>>>> -                      dec_mm_counter(mm, mm_counter_file(folio));
>>>>>> +                      add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
>>>>>>               }
>>>>>> discard:
>>>>>>               if (unlikely(folio_test_hugetlb(folio))) {
>>>>>> --
>>>>>> 2.47.3
>>>>>>
>>>>> Hi, Baolin
>>>>>
>>>>> When reading your patch, I come up one small question.
>>>>>
>>>>> Current try_to_unmap_one() has following structure:
>>>>>
>>>>>     try_to_unmap_one()
>>>>>         while (page_vma_mapped_walk(&pvmw)) {
>>>>>             nr_pages = folio_unmap_pte_batch()
>>>>>
>>>>>             if (nr_pages == folio_nr_pages(folio))
>>>>>                 goto walk_done;
>>>>>         }
>>>>>
>>>>> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
>>>>>
>>>>> If my understanding is correct, page_vma_mapped_walk() would start from
>>>>> (pvmw->address + PAGE_SIZE) in next iteration, but we have already cleared to
>>>>> (pvmw->address + nr_pages * PAGE_SIZE), right?
>>>>>
>>>>> Not sure my understanding is correct, if so do we have some reason not to
>>>>> skip the cleared range?
>>>> I don’t quite understand your question. For nr_pages > 1 but not equal to
>>>> folio_nr_pages(folio), page_vma_mapped_walk() will skip the nr_pages - 1
>>>> PTEs inside.
>>>>
>>>> take a look:
>>>>
>>>> next_pte:
>>>>                do {
>>>>                        pvmw->address += PAGE_SIZE;
>>>>                        if (pvmw->address >= end)
>>>>                                return not_found(pvmw);
>>>>                        /* Did we cross page table boundary? */
>>>>                        if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
>>>>                                if (pvmw->ptl) {
>>>>                                        spin_unlock(pvmw->ptl);
>>>>                                        pvmw->ptl = NULL;
>>>>                                }
>>>>                                pte_unmap(pvmw->pte);
>>>>                                pvmw->pte = NULL;
>>>>                                pvmw->flags |= PVMW_PGTABLE_CROSSED;
>>>>                                goto restart;
>>>>                        }
>>>>                        pvmw->pte++;
>>>>                } while (pte_none(ptep_get(pvmw->pte)));
>>>>
>>> Yes, we do it in page_vma_mapped_walk() now. Since they are pte_none(), they
>>> will be skipped.
>>>
>>> I mean maybe we can skip it in try_to_unmap_one(), for example:
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 9e5bd4834481..ea1afec7c802 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>                */
>>>               if (nr_pages == folio_nr_pages(folio))
>>>                       goto walk_done;
>>> +             else {
>>> +                     pvmw.address += PAGE_SIZE * (nr_pages - 1);
>>> +                     pvmw.pte += nr_pages - 1;
>>> +             }
>>>               continue;
>>>  walk_abort:
>>>               ret = false;
>> I am of the opinion that we should do something like this. In the internal pvmw code,
> I am still not convinced that skipping PTEs in try_to_unmap_one()
> is the right place. If we really want to skip certain PTEs early,
> should we instead hint page_vma_mapped_walk()? That said, I don't
> see much value in doing so, since in most cases nr is either 1 or
> folio_nr_pages(folio).
>
>> we keep skipping ptes till the ptes are none. With my proposed uffd-fix [1], if the old
>> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert all ptes from none
>> to not none, and we will lose the batching effect. I also plan to extend support to
>> anonymous folios (therefore generalizing for all types of memory) which will set a
>> batch of ptes as swap, and the internal pvmw code won't be able to skip through the
>> batch.
> Thanks for catching this, Dev. I already filter out some of the more
> complex cases, for example:
> if (pte_unused(pte))
>         return 1;
>
> Since the userfaultfd write-protection case is also a corner case,
> could we filter it out as well?
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index c86f1135222b..6bb8ba6f046e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1870,6 +1870,9 @@ static inline unsigned int
> folio_unmap_pte_batch(struct folio *folio,
>         if (pte_unused(pte))
>                 return 1;
>
> +       if (userfaultfd_wp(vma))
> +               return 1;
> +
>         return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
> }
>
> Just offering a second option — yours is probably better.

No. This is not an edge case. This is a case which gets exposed by your work, and
I believe that if you intend to get the file folio batching thingy in, then you
need to fix the uffd stuff too.

>
> Thanks
> Barry
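
The walker behaviour argued over in the quoted exchange can be illustrated
with a small userspace model. It is a sketch under stated assumptions, not
the real page_vma_mapped_walk(): PTEs are reduced to an array of three
states, enum pte_state, struct walk_result, toy_walk_next() and
toy_unmap_then_walk() are hypothetical names, and PTE_MARKER stands in for
any non-none, non-present entry (such as a uffd-wp marker or a swap entry)
left behind after unmapping. The inner loop mirrors the quoted next_pte loop
in that it advances one slot at a time and only keeps going while the slot
is none.

```c
#include <assert.h>
#include <stddef.h>

/* Toy PTE states; PTE_MARKER models any non-none, non-present entry. */
enum pte_state { PTE_NONE, PTE_PRESENT, PTE_MARKER };

struct walk_result {
	size_t next;  /* slot where the walker stops, or n if it ran off the end */
	size_t steps; /* number of slots the walker advanced over */
};

/* Advance from cur one slot at a time, skipping slots that are none. */
struct walk_result toy_walk_next(const enum pte_state *ptes, size_t n,
				 size_t cur)
{
	struct walk_result r = { n, 0 };

	do {
		cur++;
		r.steps++;
		if (cur >= n)
			return r;
	} while (ptes[cur] == PTE_NONE);

	r.next = cur;
	return r;
}

/*
 * Unmap a batch of nr slots starting at start by writing fill into them,
 * then ask the walker for the next stop. If caller_skips is non-zero, the
 * caller first moves its cursor to the last slot of the batch, which models
 * the "pvmw.address += ..., pvmw.pte += ..." idea from the quoted diff.
 */
struct walk_result toy_unmap_then_walk(enum pte_state *ptes, size_t n,
				       size_t start, size_t nr,
				       enum pte_state fill, int caller_skips)
{
	assert(nr >= 1 && start + nr <= n);

	for (size_t i = start; i < start + nr; i++)
		ptes[i] = fill;

	if (caller_skips)
		start += nr - 1;

	return toy_walk_next(ptes, n, start);
}
```

In this model, clearing a batch of four slots to none lets the walker step
over the remaining three by itself and stop at the first slot after the
batch. If the batch is filled with markers instead, the walker stops on the
second slot of the batch unless the caller has already moved the cursor.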

Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

Posted by Baolin Wang 2 weeks, 5 days ago


On 1/18/26 1:46 PM, Dev Jain wrote:
> 
> On 16/01/26 7:58 pm, Barry Song wrote:
>> On Fri, Jan 16, 2026 at 5:53 PM Dev Jain <dev.jain@arm.com> wrote:
>>>
>>> On 07/01/26 7:16 am, Wei Yang wrote:
>>>> On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>>>>> On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>>>>>> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>>>>>>> Similar to folio_referenced_one(), we can apply batched unmapping for file
>>>>>>> large folios to optimize the performance of file folios reclamation.
>>>>>>>
>>>>>>> Barry previously implemented batched unmapping for lazyfree anonymous large
>>>>>>> folios[1] and did not further optimize anonymous large folios or file-backed
>>>>>>> large folios at that stage. As for file-backed large folios, the batched
>>>>>>> unmapping support is relatively straightforward, as we only need to clear
>>>>>>> the consecutive (present) PTE entries for file-backed large folios.
>>>>>>>
>>>>>>> Performance testing:
>>>>>>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>>>>>>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>>>>>>> 75% performance improvement on my Arm64 32-core server (and 50%+ improvement
>>>>>>> on my X86 machine) with this patch.
>>>>>>>
>>>>>>> W/o patch:
>>>>>>> real    0m1.018s
>>>>>>> user    0m0.000s
>>>>>>> sys     0m1.018s
>>>>>>>
>>>>>>> W/ patch:
>>>>>>> real   0m0.249s
>>>>>>> user   0m0.000s
>>>>>>> sys    0m0.249s
>>>>>>>
>>>>>>> [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>>>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>> Acked-by: Barry Song <baohua@kernel.org>
>>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>> ---
>>>>>>> mm/rmap.c | 7 ++++---
>>>>>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>> index 985ab0b085ba..e1d16003c514 100644
>>>>>>> --- a/mm/rmap.c
>>>>>>> +++ b/mm/rmap.c
>>>>>>> @@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>>>>>>>        end_addr = pmd_addr_end(addr, vma->vm_end);
>>>>>>>        max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>>>>>>
>>>>>>> -      /* We only support lazyfree batching for now ... */
>>>>>>> -      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>>>>>>> +      /* We only support lazyfree or file folios batching for now ... */
>>>>>>> +      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>>>>>>                return 1;
>>>>>>> +
>>>>>>>        if (pte_unused(pte))
>>>>>>>                return 1;
>>>>>>>
>>>>>>> @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>>>                         *
>>>>>>>                         * See Documentation/mm/mmu_notifier.rst
>>>>>>>                         */
>>>>>>> -                      dec_mm_counter(mm, mm_counter_file(folio));
>>>>>>> +                      add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
>>>>>>>                }
>>>>>>> discard:
>>>>>>>                if (unlikely(folio_test_hugetlb(folio))) {
>>>>>>> --
>>>>>>> 2.47.3
>>>>>>>
>>>>>> Hi, Baolin
>>>>>>
>>>>>> When reading your patch, I come up one small question.
>>>>>>
>>>>>> Current try_to_unmap_one() has following structure:
>>>>>>
>>>>>>      try_to_unmap_one()
>>>>>>          while (page_vma_mapped_walk(&pvmw)) {
>>>>>>              nr_pages = folio_unmap_pte_batch()
>>>>>>
>>>>>>              if (nr_pages == folio_nr_pages(folio))
>>>>>>                  goto walk_done;
>>>>>>          }
>>>>>>
>>>>>> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
>>>>>>
>>>>>> If my understanding is correct, page_vma_mapped_walk() would start from
>>>>>> (pvmw->address + PAGE_SIZE) in next iteration, but we have already cleared to
>>>>>> (pvmw->address + nr_pages * PAGE_SIZE), right?
>>>>>>
>>>>>> Not sure my understanding is correct, if so do we have some reason not to
>>>>>> skip the cleared range?
>>>>> I don’t quite understand your question. For nr_pages > 1 but not equal to
>>>>> folio_nr_pages(folio), page_vma_mapped_walk() will skip the nr_pages - 1
>>>>> PTEs inside.
>>>>>
>>>>> take a look:
>>>>>
>>>>> next_pte:
>>>>>                 do {
>>>>>                         pvmw->address += PAGE_SIZE;
>>>>>                         if (pvmw->address >= end)
>>>>>                                 return not_found(pvmw);
>>>>>                         /* Did we cross page table boundary? */
>>>>>                         if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
>>>>>                                 if (pvmw->ptl) {
>>>>>                                         spin_unlock(pvmw->ptl);
>>>>>                                         pvmw->ptl = NULL;
>>>>>                                 }
>>>>>                                 pte_unmap(pvmw->pte);
>>>>>                                 pvmw->pte = NULL;
>>>>>                                 pvmw->flags |= PVMW_PGTABLE_CROSSED;
>>>>>                                 goto restart;
>>>>>                         }
>>>>>                         pvmw->pte++;
>>>>>                 } while (pte_none(ptep_get(pvmw->pte)));
>>>>>
>>>> Yes, we do it in page_vma_mapped_walk() now. Since they are pte_none(), they
>>>> will be skipped.
>>>>
>>>> I mean maybe we can skip it in try_to_unmap_one(), for example:
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 9e5bd4834481..ea1afec7c802 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>                 */
>>>>                if (nr_pages == folio_nr_pages(folio))
>>>>                        goto walk_done;
>>>> +             else {
>>>> +                     pvmw.address += PAGE_SIZE * (nr_pages - 1);
>>>> +                     pvmw.pte += nr_pages - 1;
>>>> +             }
>>>>                continue;
>>>>   walk_abort:
>>>>                ret = false;
>>> I am of the opinion that we should do something like this. In the internal pvmw code,
>> I am still not convinced that skipping PTEs in try_to_unmap_one()
>> is the right place. If we really want to skip certain PTEs early,
>> should we instead hint page_vma_mapped_walk()? That said, I don't
>> see much value in doing so, since in most cases nr is either 1 or
>> folio_nr_pages(folio).
>>
>>> we keep skipping ptes till the ptes are none. With my proposed uffd-fix [1], if the old
>>> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert all ptes from none
>>> to not none, and we will lose the batching effect. I also plan to extend support to
>>> anonymous folios (therefore generalizing for all types of memory) which will set a
>>> batch of ptes as swap, and the internal pvmw code won't be able to skip through the
>>> batch.
>> Thanks for catching this, Dev. I already filter out some of the more
>> complex cases, for example:
>> if (pte_unused(pte))
>>          return 1;
>>
>> Since the userfaultfd write-protection case is also a corner case,
>> could we filter it out as well?
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index c86f1135222b..6bb8ba6f046e 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1870,6 +1870,9 @@ static inline unsigned int
>> folio_unmap_pte_batch(struct folio *folio,
>>          if (pte_unused(pte))
>>                  return 1;
>>
>> +       if (userfaultfd_wp(vma))
>> +               return 1;
>> +
>>          return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
>> }
>>
>> Just offering a second option — yours is probably better.
> 
> No. This is not an edge case. This is a case which gets exposed by your work, and
> I believe that if you intend to get the file folio batching thingy in, then you
> need to fix the uffd stuff too.

Barry’s point isn’t that this is an edge case. I think he means that
uffd is not a common performance-sensitive scenario in production. Also,
we typically fall back to per-page handling for uffd cases (see
finish_fault() and alloc_anon_folio()). So I prefer to follow Barry’s
suggestion and filter out the uffd cases until we have a test case to show
a performance improvement.

I also think you can continue iterating on your patch [1] to support batched
unmapping for uffd VMAs, and provide data to evaluate its value.

[1] https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/
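
As a note on reading timing data of this kind: the patch description quoted
above reports 1.018s of sys time without the patch and 0.249s with it, and
calls this a 75% improvement. The sketch below shows one way to derive that
figure, assuming it means the fraction of run time removed; the function
names are hypothetical and the only inputs are the two timings from the
description.

```c
#include <assert.h>

/* Percentage of run time removed: (before - after) / before * 100. */
double time_reduction_pct(double before_s, double after_s)
{
	assert(before_s > 0.0);
	return (before_s - after_s) / before_s * 100.0;
}

/* How many times faster the second run is: before / after. */
double speedup_factor(double before_s, double after_s)
{
	assert(after_s > 0.0);
	return before_s / after_s;
}
```

With the reported timings, time_reduction_pct(1.018, 0.249) comes out at
roughly 75.5 and speedup_factor(1.018, 0.249) at roughly 4.1, so the same
result can be stated either as about 75% less time or as about four times
faster.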

Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

Posted by Dev Jain 2 weeks, 4 days ago

On 19/01/26 11:20 am, Baolin Wang wrote:
>
>
> On 1/18/26 1:46 PM, Dev Jain wrote:
>>
>> On 16/01/26 7:58 pm, Barry Song wrote:
>>> On Fri, Jan 16, 2026 at 5:53 PM Dev Jain <dev.jain@arm.com> wrote:
>>>>
>>>> On 07/01/26 7:16 am, Wei Yang wrote:
>>>>> On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>>>>>> On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com>
>>>>>> wrote:
>>>>>>> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>>>>>>>> Similar to folio_referenced_one(), we can apply batched unmapping
>>>>>>>> for file
>>>>>>>> large folios to optimize the performance of file folios reclamation.
>>>>>>>>
>>>>>>>> Barry previously implemented batched unmapping for lazyfree
>>>>>>>> anonymous large
>>>>>>>> folios[1] and did not further optimize anonymous large folios or
>>>>>>>> file-backed
>>>>>>>> large folios at that stage. As for file-backed large folios, the
>>>>>>>> batched
>>>>>>>> unmapping support is relatively straightforward, as we only need
>>>>>>>> to clear
>>>>>>>> the consecutive (present) PTE entries for file-backed large folios.
>>>>>>>>
>>>>>>>> Performance testing:
>>>>>>>> Allocate 10G clean file-backed folios by mmap() in a memory
>>>>>>>> cgroup, and try to
>>>>>>>> reclaim 8G file-backed folios via the memory.reclaim interface. I
>>>>>>>> can observe
>>>>>>>> 75% performance improvement on my Arm64 32-core server (and 50%+
>>>>>>>> improvement
>>>>>>>> on my X86 machine) with this patch.
>>>>>>>>
>>>>>>>> W/o patch:
>>>>>>>> real    0m1.018s
>>>>>>>> user    0m0.000s
>>>>>>>> sys     0m1.018s
>>>>>>>>
>>>>>>>> W/ patch:
>>>>>>>> real   0m0.249s
>>>>>>>> user   0m0.000s
>>>>>>>> sys    0m0.249s
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>>>>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>> Acked-by: Barry Song <baohua@kernel.org>
>>>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>> ---
>>>>>>>> mm/rmap.c | 7 ++++---
>>>>>>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>>> index 985ab0b085ba..e1d16003c514 100644
>>>>>>>> --- a/mm/rmap.c
>>>>>>>> +++ b/mm/rmap.c
>>>>>>>> @@ -1863,9 +1863,10 @@ static inline unsigned int
>>>>>>>> folio_unmap_pte_batch(struct folio *folio,
>>>>>>>>        end_addr = pmd_addr_end(addr, vma->vm_end);
>>>>>>>>        max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>>>>>>>
>>>>>>>> -      /* We only support lazyfree batching for now ... */
>>>>>>>> -      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>>>>>>>> +      /* We only support lazyfree or file folios batching for now
>>>>>>>> ... */
>>>>>>>> +      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>>>>>>>                return 1;
>>>>>>>> +
>>>>>>>>        if (pte_unused(pte))
>>>>>>>>                return 1;
>>>>>>>>
>>>>>>>> @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio
>>>>>>>> *folio, struct vm_area_struct *vma,
>>>>>>>>                         *
>>>>>>>>                         * See Documentation/mm/mmu_notifier.rst
>>>>>>>>                         */
>>>>>>>> -                      dec_mm_counter(mm, mm_counter_file(folio));
>>>>>>>> +                      add_mm_counter(mm, mm_counter_file(folio),
>>>>>>>> -nr_pages);
>>>>>>>>                }
>>>>>>>> discard:
>>>>>>>>                if (unlikely(folio_test_hugetlb(folio))) {
>>>>>>>> -- 
>>>>>>>> 2.47.3
>>>>>>>>
>>>>>>> Hi, Baolin
>>>>>>>
>>>>>>> When reading your patch, I come up one small question.
>>>>>>>
>>>>>>> Current try_to_unmap_one() has following structure:
>>>>>>>
>>>>>>>      try_to_unmap_one()
>>>>>>>          while (page_vma_mapped_walk(&pvmw)) {
>>>>>>>              nr_pages = folio_unmap_pte_batch()
>>>>>>>
>>>>>>>              if (nr_pages == folio_nr_pages(folio))
>>>>>>>                  goto walk_done;
>>>>>>>          }
>>>>>>>
>>>>>>> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
>>>>>>>
>>>>>>> If my understanding is correct, page_vma_mapped_walk() would start
>>>>>>> from
>>>>>>> (pvmw->address + PAGE_SIZE) in next iteration, but we have already
>>>>>>> cleared to
>>>>>>> (pvmw->address + nr_pages * PAGE_SIZE), right?
>>>>>>>
>>>>>>> Not sure my understanding is correct, if so do we have some reason
>>>>>>> not to
>>>>>>> skip the cleared range?
>>>>>> I don’t quite understand your question. For nr_pages > 1 but not equal to
>>>>>> folio_nr_pages(folio), page_vma_mapped_walk() will skip the nr_pages - 1
>>>>>> PTEs inside.
>>>>>>
>>>>>> take a look:
>>>>>>
>>>>>> next_pte:
>>>>>>                 do {
>>>>>>                         pvmw->address += PAGE_SIZE;
>>>>>>                         if (pvmw->address >= end)
>>>>>>                                 return not_found(pvmw);
>>>>>>                         /* Did we cross page table boundary? */
>>>>>>                         if ((pvmw->address & (PMD_SIZE - PAGE_SIZE))
>>>>>> == 0) {
>>>>>>                                 if (pvmw->ptl) {
>>>>>>                                         spin_unlock(pvmw->ptl);
>>>>>>                                         pvmw->ptl = NULL;
>>>>>>                                 }
>>>>>>                                 pte_unmap(pvmw->pte);
>>>>>>                                 pvmw->pte = NULL;
>>>>>>                                 pvmw->flags |= PVMW_PGTABLE_CROSSED;
>>>>>>                                 goto restart;
>>>>>>                         }
>>>>>>                         pvmw->pte++;
>>>>>>                 } while (pte_none(ptep_get(pvmw->pte)));
>>>>>>
>>>>> Yes, we do it in page_vma_mapped_walk() now. Since they are
>>>>> pte_none(), they
>>>>> will be skipped.
>>>>>
>>>>> I mean maybe we can skip it in try_to_unmap_one(), for example:
>>>>>
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index 9e5bd4834481..ea1afec7c802 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio
>>>>> *folio, struct vm_area_struct *vma,
>>>>>                 */
>>>>>                if (nr_pages == folio_nr_pages(folio))
>>>>>                        goto walk_done;
>>>>> +             else {
>>>>> +                     pvmw.address += PAGE_SIZE * (nr_pages - 1);
>>>>> +                     pvmw.pte += nr_pages - 1;
>>>>> +             }
>>>>>                continue;
>>>>>   walk_abort:
>>>>>                ret = false;
>>>> I am of the opinion that we should do something like this. In the
>>>> internal pvmw code,
>>> I am still not convinced that skipping PTEs in try_to_unmap_one()
>>> is the right place. If we really want to skip certain PTEs early,
>>> should we instead hint page_vma_mapped_walk()? That said, I don't
>>> see much value in doing so, since in most cases nr is either 1 or
>>> folio_nr_pages(folio).
>>>
>>>> we keep skipping ptes till the ptes are none. With my proposed
>>>> uffd-fix [1], if the old
>>>> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert
>>>> all ptes from none
>>>> to not none, and we will lose the batching effect. I also plan to
>>>> extend support to
>>>> anonymous folios (therefore generalizing for all types of memory)
>>>> which will set a
>>>> batch of ptes as swap, and the internal pvmw code won't be able to
>>>> skip through the
>>>> batch.
>>> Thanks for catching this, Dev. I already filter out some of the more
>>> complex cases, for example:
>>> if (pte_unused(pte))
>>>          return 1;
>>>
>>> Since the userfaultfd write-protection case is also a corner case,
>>> could we filter it out as well?
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index c86f1135222b..6bb8ba6f046e 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1870,6 +1870,9 @@ static inline unsigned int
>>> folio_unmap_pte_batch(struct folio *folio,
>>>          if (pte_unused(pte))
>>>                  return 1;
>>>
>>> +       if (userfaultfd_wp(vma))
>>> +               return 1;
>>> +
>>>          return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
>>> }
>>>
>>> Just offering a second option — yours is probably better.
>>
>> No. This is not an edge case. This is a case which gets exposed by your
>> work, and
>> I believe that if you intend to get the file folio batching thingy in,
>> then you
>> need to fix the uffd stuff too.
>
> Barry’s point isn’t that this is an edge case. I think he means that uffd
> is not a common performance-sensitive scenario in production. Also, we
> typically fall back to per-page handling for uffd cases (see
> finish_fault() and alloc_anon_folio()). So I prefer to follow Barry’s
> suggestion and filter out the uffd cases until we have a test case to show
> a performance improvement.

I am of the opinion that you are making the wrong analogy here. The
per-page fault fidelity is *required* for uffd.

When you say you want to support file folio batched unmapping, I think it's
inappropriate to say "let us refuse to batch if the pte mapping the file
folio is smeared with a particular bit and consider it a totally different
case". Instead of getting batched unmapping in for folios of all memory
types, we have already broken this into "lazyfree folio", then "file folio",
the remaining being "anon folio". Now you intend to break "file folio" into
"file folio non uffd" and "file folio uffd".

Just my 2C, I don't opine strongly here.


>
> I also think you can continue iterating your patch[1] to support batched
> unmapping for uffd VMAs, and provide data to evaluate its value.
>
> [1]
> https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/

Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

Posted by Baolin Wang 2 weeks, 4 days ago


On 1/19/26 2:36 PM, Dev Jain wrote:
> 
> On 19/01/26 11:20 am, Baolin Wang wrote:
>>
>>
>> On 1/18/26 1:46 PM, Dev Jain wrote:
>>>
>>> On 16/01/26 7:58 pm, Barry Song wrote:
>>>> On Fri, Jan 16, 2026 at 5:53 PM Dev Jain <dev.jain@arm.com> wrote:
>>>>>
>>>>> On 07/01/26 7:16 am, Wei Yang wrote:
>>>>>> On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>>>>>>> On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com>
>>>>>>> wrote:
>>>>>>>> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>>>>>>>>> Similar to folio_referenced_one(), we can apply batched unmapping
>>>>>>>>> for file
>>>>>>>>> large folios to optimize the performance of file folios reclamation.
>>>>>>>>>
>>>>>>>>> Barry previously implemented batched unmapping for lazyfree
>>>>>>>>> anonymous large
>>>>>>>>> folios[1] and did not further optimize anonymous large folios or
>>>>>>>>> file-backed
>>>>>>>>> large folios at that stage. As for file-backed large folios, the
>>>>>>>>> batched
>>>>>>>>> unmapping support is relatively straightforward, as we only need
>>>>>>>>> to clear
>>>>>>>>> the consecutive (present) PTE entries for file-backed large folios.
>>>>>>>>>
>>>>>>>>> Performance testing:
>>>>>>>>> Allocate 10G clean file-backed folios by mmap() in a memory
>>>>>>>>> cgroup, and try to
>>>>>>>>> reclaim 8G file-backed folios via the memory.reclaim interface. I
>>>>>>>>> can observe
>>>>>>>>> 75% performance improvement on my Arm64 32-core server (and 50%+
>>>>>>>>> improvement
>>>>>>>>> on my X86 machine) with this patch.
>>>>>>>>>
>>>>>>>>> W/o patch:
>>>>>>>>> real    0m1.018s
>>>>>>>>> user    0m0.000s
>>>>>>>>> sys     0m1.018s
>>>>>>>>>
>>>>>>>>> W/ patch:
>>>>>>>>> real   0m0.249s
>>>>>>>>> user   0m0.000s
>>>>>>>>> sys    0m0.249s
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>>>>>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>> Acked-by: Barry Song <baohua@kernel.org>
>>>>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>> ---
>>>>>>>>> mm/rmap.c | 7 ++++---
>>>>>>>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>>>> index 985ab0b085ba..e1d16003c514 100644
>>>>>>>>> --- a/mm/rmap.c
>>>>>>>>> +++ b/mm/rmap.c
>>>>>>>>> @@ -1863,9 +1863,10 @@ static inline unsigned int
>>>>>>>>> folio_unmap_pte_batch(struct folio *folio,
>>>>>>>>>         end_addr = pmd_addr_end(addr, vma->vm_end);
>>>>>>>>>         max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>>>>>>>>
>>>>>>>>> -      /* We only support lazyfree batching for now ... */
>>>>>>>>> -      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>>>>>>>>> +      /* We only support lazyfree or file folios batching for now
>>>>>>>>> ... */
>>>>>>>>> +      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>>>>>>>>                 return 1;
>>>>>>>>> +
>>>>>>>>>         if (pte_unused(pte))
>>>>>>>>>                 return 1;
>>>>>>>>>
>>>>>>>>> @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio
>>>>>>>>> *folio, struct vm_area_struct *vma,
>>>>>>>>>                          *
>>>>>>>>>                          * See Documentation/mm/mmu_notifier.rst
>>>>>>>>>                          */
>>>>>>>>> -                      dec_mm_counter(mm, mm_counter_file(folio));
>>>>>>>>> +                      add_mm_counter(mm, mm_counter_file(folio),
>>>>>>>>> -nr_pages);
>>>>>>>>>                 }
>>>>>>>>> discard:
>>>>>>>>>                 if (unlikely(folio_test_hugetlb(folio))) {
>>>>>>>>> -- 
>>>>>>>>> 2.47.3
>>>>>>>>>
>>>>>>>> Hi, Baolin
>>>>>>>>
>>>>>>>> When reading your patch, I come up one small question.
>>>>>>>>
>>>>>>>> Current try_to_unmap_one() has following structure:
>>>>>>>>
>>>>>>>>       try_to_unmap_one()
>>>>>>>>           while (page_vma_mapped_walk(&pvmw)) {
>>>>>>>>               nr_pages = folio_unmap_pte_batch()
>>>>>>>>
>>>>>>>>               if (nr_pages == folio_nr_pages(folio))
>>>>>>>>                   goto walk_done;
>>>>>>>>           }
>>>>>>>>
>>>>>>>> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
>>>>>>>>
>>>>>>>> If my understanding is correct, page_vma_mapped_walk() would start
>>>>>>>> from
>>>>>>>> (pvmw->address + PAGE_SIZE) in next iteration, but we have already
>>>>>>>> cleared to
>>>>>>>> (pvmw->address + nr_pages * PAGE_SIZE), right?
>>>>>>>>
>>>>>>>> Not sure my understanding is correct, if so do we have some reason
>>>>>>>> not to
>>>>>>>> skip the cleared range?
>>>>>>> I don’t quite understand your question. For nr_pages > 1 but not equal
>>>>>>> to folio_nr_pages(folio), page_vma_mapped_walk() will skip the
>>>>>>> nr_pages - 1 PTEs inside.
>>>>>>>
>>>>>>> take a look:
>>>>>>>
>>>>>>> next_pte:
>>>>>>>                  do {
>>>>>>>                          pvmw->address += PAGE_SIZE;
>>>>>>>                          if (pvmw->address >= end)
>>>>>>>                                  return not_found(pvmw);
>>>>>>>                          /* Did we cross page table boundary? */
>>>>>>>                          if ((pvmw->address & (PMD_SIZE - PAGE_SIZE))
>>>>>>> == 0) {
>>>>>>>                                  if (pvmw->ptl) {
>>>>>>>                                          spin_unlock(pvmw->ptl);
>>>>>>>                                          pvmw->ptl = NULL;
>>>>>>>                                  }
>>>>>>>                                  pte_unmap(pvmw->pte);
>>>>>>>                                  pvmw->pte = NULL;
>>>>>>>                                  pvmw->flags |= PVMW_PGTABLE_CROSSED;
>>>>>>>                                  goto restart;
>>>>>>>                          }
>>>>>>>                          pvmw->pte++;
>>>>>>>                  } while (pte_none(ptep_get(pvmw->pte)));
>>>>>>>
>>>>>> Yes, we do it in page_vma_mapped_walk() now. Since they are
>>>>>> pte_none(), they
>>>>>> will be skipped.
>>>>>>
>>>>>> I mean maybe we can skip it in try_to_unmap_one(), for example:
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 9e5bd4834481..ea1afec7c802 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio
>>>>>> *folio, struct vm_area_struct *vma,
>>>>>>                  */
>>>>>>                 if (nr_pages == folio_nr_pages(folio))
>>>>>>                         goto walk_done;
>>>>>> +             else {
>>>>>> +                     pvmw.address += PAGE_SIZE * (nr_pages - 1);
>>>>>> +                     pvmw.pte += nr_pages - 1;
>>>>>> +             }
>>>>>>                 continue;
>>>>>>    walk_abort:
>>>>>>                 ret = false;
>>>>> I am of the opinion that we should do something like this. In the
>>>>> internal pvmw code,
>>>> I am still not convinced that skipping PTEs in try_to_unmap_one()
>>>> is the right place. If we really want to skip certain PTEs early,
>>>> should we instead hint page_vma_mapped_walk()? That said, I don't
>>>> see much value in doing so, since in most cases nr is either 1 or
>>>> folio_nr_pages(folio).
>>>>
>>>>> we keep skipping ptes till the ptes are none. With my proposed
>>>>> uffd-fix [1], if the old
>>>>> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert
>>>>> all ptes from none
>>>>> to not none, and we will lose the batching effect. I also plan to
>>>>> extend support to
>>>>> anonymous folios (therefore generalizing for all types of memory)
>>>>> which will set a
>>>>> batch of ptes as swap, and the internal pvmw code won't be able to
>>>>> skip through the
>>>>> batch.
>>>> Thanks for catching this, Dev. I already filter out some of the more
>>>> complex cases, for example:
>>>> if (pte_unused(pte))
>>>>           return 1;
>>>>
>>>> Since the userfaultfd write-protection case is also a corner case,
>>>> could we filter it out as well?
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index c86f1135222b..6bb8ba6f046e 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1870,6 +1870,9 @@ static inline unsigned int
>>>> folio_unmap_pte_batch(struct folio *folio,
>>>>           if (pte_unused(pte))
>>>>                   return 1;
>>>>
>>>> +       if (userfaultfd_wp(vma))
>>>> +               return 1;
>>>> +
>>>>           return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
>>>> }
>>>>
>>>> Just offering a second option — yours is probably better.
>>>
>>> No. This is not an edge case. This is a case which gets exposed by your
>>> work, and
>>> I believe that if you intend to get the file folio batching thingy in,
>>> then you
>>> need to fix the uffd stuff too.
>>
>> Barry’s point isn’t that this is an edge case. I think he means that uffd
>> is not a common performance-sensitive scenario in production. Also, we
>> typically fall back to per-page handling for uffd cases (see
>> finish_fault() and alloc_anon_folio()). So I prefer to follow Barry’s
>> suggestion and filter out the uffd cases until we have test case to show
>> performance improvement.
> 
> I am of the opinion that you are making the wrong analogy here. The
> per-page fault fidelity is *required* for uffd.
> 
> When you say you want to support file folio batched unmapping, I think it's
> inappropriate to say "let us refuse to batch if the pte mapping the file
> folio is smeared with a particular bit and consider it a totally different
> case". Instead of getting folio (all memory types) batched unmapping in, we
> have already broken this to "lazyfree folio", then "file folio", the
> remaining being "anon folio". Now you intend to break "file folio" to
> "file folio non uffd" and "file folio uffd".

At least for me, I think this is a reasonable approach: break a complex 
problem into smaller features and address them step by step (possibly by 
different contributors in the community). This makes it easier for 
reviewers to focus and discuss. You can see that batched unmapping for 
anonymous folios still has ongoing discussion.

As I mentioned, since uffd is not a common performance-sensitive 
scenario in production, we need to continue discussing whether we 
actually need to support batched unmapping for uffd, and support the 
decision with technical feedback and performance data. So I’d prefer to 
discuss it in a separate patch.

David and Lorenzo, what do you think?

Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

Posted by Baolin Wang 3 weeks ago


On 1/16/26 10:28 PM, Barry Song wrote:
> On Fri, Jan 16, 2026 at 5:53 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>> On 07/01/26 7:16 am, Wei Yang wrote:
>>> On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>>>> On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>>>>> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>>>>>> Similar to folio_referenced_one(), we can apply batched unmapping for file
>>>>>> large folios to optimize the performance of file folios reclamation.
>>>>>>
>>>>>> Barry previously implemented batched unmapping for lazyfree anonymous large
>>>>>> folios[1] and did not further optimize anonymous large folios or file-backed
>>>>>> large folios at that stage. As for file-backed large folios, the batched
>>>>>> unmapping support is relatively straightforward, as we only need to clear
>>>>>> the consecutive (present) PTE entries for file-backed large folios.
>>>>>>
>>>>>> Performance testing:
>>>>>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>>>>>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>>>>>> 75% performance improvement on my Arm64 32-core server (and 50%+ improvement
>>>>>> on my X86 machine) with this patch.
>>>>>>
>>>>>> W/o patch:
>>>>>> real    0m1.018s
>>>>>> user    0m0.000s
>>>>>> sys     0m1.018s
>>>>>>
>>>>>> W/ patch:
>>>>>> real   0m0.249s
>>>>>> user   0m0.000s
>>>>>> sys    0m0.249s
>>>>>>
>>>>>> [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>> Acked-by: Barry Song <baohua@kernel.org>
>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>> ---
>>>>>> mm/rmap.c | 7 ++++---
>>>>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 985ab0b085ba..e1d16003c514 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>>>>>>        end_addr = pmd_addr_end(addr, vma->vm_end);
>>>>>>        max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>>>>>
>>>>>> -      /* We only support lazyfree batching for now ... */
>>>>>> -      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>>>>>> +      /* We only support lazyfree or file folios batching for now ... */
>>>>>> +      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>>>>>                return 1;
>>>>>> +
>>>>>>        if (pte_unused(pte))
>>>>>>                return 1;
>>>>>>
>>>>>> @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>>                         *
>>>>>>                         * See Documentation/mm/mmu_notifier.rst
>>>>>>                         */
>>>>>> -                      dec_mm_counter(mm, mm_counter_file(folio));
>>>>>> +                      add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
>>>>>>                }
>>>>>> discard:
>>>>>>                if (unlikely(folio_test_hugetlb(folio))) {
>>>>>> --
>>>>>> 2.47.3
>>>>>>
>>>>> Hi, Baolin
>>>>>
>>>>> When reading your patch, I come up one small question.
>>>>>
>>>>> Current try_to_unmap_one() has following structure:
>>>>>
>>>>>      try_to_unmap_one()
>>>>>          while (page_vma_mapped_walk(&pvmw)) {
>>>>>              nr_pages = folio_unmap_pte_batch()
>>>>>
>>>>>              if (nr_pages == folio_nr_pages(folio))
>>>>>                  goto walk_done;
>>>>>          }
>>>>>
>>>>> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
>>>>>
>>>>> If my understanding is correct, page_vma_mapped_walk() would start from
>>>>> (pvmw->address + PAGE_SIZE) in next iteration, but we have already cleared to
>>>>> (pvmw->address + nr_pages * PAGE_SIZE), right?
>>>>>
>>>>> Not sure my understanding is correct, if so do we have some reason not to
>>>>> skip the cleared range?
>>>> I don’t quite understand your question. For nr_pages > 1 but not equal
>>>> to folio_nr_pages(folio), page_vma_mapped_walk() will skip the nr_pages - 1
>>>> PTEs inside.
>>>>
>>>> take a look:
>>>>
>>>> next_pte:
>>>>                 do {
>>>>                         pvmw->address += PAGE_SIZE;
>>>>                         if (pvmw->address >= end)
>>>>                                 return not_found(pvmw);
>>>>                         /* Did we cross page table boundary? */
>>>>                         if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
>>>>                                 if (pvmw->ptl) {
>>>>                                         spin_unlock(pvmw->ptl);
>>>>                                         pvmw->ptl = NULL;
>>>>                                 }
>>>>                                 pte_unmap(pvmw->pte);
>>>>                                 pvmw->pte = NULL;
>>>>                                 pvmw->flags |= PVMW_PGTABLE_CROSSED;
>>>>                                 goto restart;
>>>>                         }
>>>>                         pvmw->pte++;
>>>>                 } while (pte_none(ptep_get(pvmw->pte)));
>>>>
>>> Yes, we do it in page_vma_mapped_walk() now. Since they are pte_none(), they
>>> will be skipped.
>>>
>>> I mean maybe we can skip it in try_to_unmap_one(), for example:
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 9e5bd4834481..ea1afec7c802 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>                 */
>>>                if (nr_pages == folio_nr_pages(folio))
>>>                        goto walk_done;
>>> +             else {
>>> +                     pvmw.address += PAGE_SIZE * (nr_pages - 1);
>>> +                     pvmw.pte += nr_pages - 1;
>>> +             }
>>>                continue;
>>>   walk_abort:
>>>                ret = false;
>>
>> I am of the opinion that we should do something like this. In the internal pvmw code,
> 
> I am still not convinced that skipping PTEs in try_to_unmap_one()
> is the right place. If we really want to skip certain PTEs early,
> should we instead hint page_vma_mapped_walk()? That said, I don't
> see much value in doing so, since in most cases nr is either 1 or
> folio_nr_pages(folio).
> 
>> we keep skipping ptes till the ptes are none. With my proposed uffd-fix [1], if the old
>> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert all ptes from none
>> to not none, and we will lose the batching effect. I also plan to extend support to
>> anonymous folios (therefore generalizing for all types of memory) which will set a
>> batch of ptes as swap, and the internal pvmw code won't be able to skip through the
>> batch.
> 
> Thanks for catching this, Dev. I already filter out some of the more
> complex cases, for example:
> if (pte_unused(pte))
>          return 1;

Hi Dev, thanks for the report[1], and for explaining why the mm-selftests
can pass.

[1] 
https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/


> Since the userfaultfd write-protection case is also a corner case,
> could we filter it out as well?
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index c86f1135222b..6bb8ba6f046e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1870,6 +1870,9 @@ static inline unsigned int
> folio_unmap_pte_batch(struct folio *folio,
>          if (pte_unused(pte))
>                  return 1;
> 
> +       if (userfaultfd_wp(vma))
> +               return 1;
> +
>          return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
> }

That small fix makes sense to me. I think Dev can continue to support 
the UFFD batch optimization, and we need more review and testing for the 
UFFD batched operations, as David suggested[2].

[2] 
https://lore.kernel.org/all/9edeeef1-5553-406b-8e56-30b11809eec5@kernel.org/
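
For readers following the thread, the order of checks discussed above can be
modelled in userspace. This is only a sketch under stated assumptions: the
struct names, model_unmap_batch() and the contiguous_nr parameter are
hypothetical stand-ins (contiguous_nr plays the role of whatever
folio_pte_batch() would return), and the uffd-wp check reflects the filter
proposed in this thread, not necessarily what is finally merged.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins for the kernel state the checks read. */
struct model_folio {
	bool anon;        /* models folio_test_anon() */
	bool swapbacked;  /* models folio_test_swapbacked() */
};

struct model_vma {
	bool uffd_wp;     /* models userfaultfd_wp(vma) */
};

/*
 * Return how many PTEs the unmap path may handle in one step.
 * contiguous_nr is an assumption of this model: it stands in for the
 * value folio_pte_batch() would compute for the mapping.
 */
static unsigned int model_unmap_batch(const struct model_folio *folio,
				      const struct model_vma *vma,
				      bool pte_unused,
				      unsigned int contiguous_nr)
{
	assert(contiguous_nr >= 1);

	/* Swap-backed anonymous folios are not batched yet. */
	if (folio->anon && folio->swapbacked)
		return 1;

	if (pte_unused)
		return 1;

	/* Proposed filter: fall back to per-PTE handling under uffd-wp. */
	if (vma->uffd_wp)
		return 1;

	return contiguous_nr;
}
```

In this model, file folios and lazyfree folios (anon but not swap-backed) in
a VMA without uffd-wp get the full batch, while every filtered case
degrades to the existing one-PTE-at-a-time behaviour.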

Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

Posted by Barry Song 3 weeks ago

>
> Thanks for catching this, Dev. I already filter out some of the more
> complex cases, for example:
> if (pte_unused(pte))
>         return 1;
>
> Since the userfaultfd write-protection case is also a corner case,
> could we filter it out as well?
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index c86f1135222b..6bb8ba6f046e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1870,6 +1870,9 @@ static inline unsigned int
> folio_unmap_pte_batch(struct folio *folio,
>         if (pte_unused(pte))
>                 return 1;
>
> +       if (userfaultfd_wp(vma))
> +               return 1;
> +
>         return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
> }
>
> Just offering a second option — yours is probably better.

Sorry for replying in the wrong place. The above reply was actually meant
for your fix-patch below[1]:

"mm: Fix uffd-wp bit loss when batching file folio unmapping"

[1] https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/

Thanks
Barry

Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

Posted by Lorenzo Stoakes 3 weeks ago

On Fri, Jan 16, 2026 at 03:23:02PM +0530, Dev Jain wrote:
> I am of the opinion that we should do something like this. In the internal pvmw code,
> we keep skipping ptes till the ptes are none. With my proposed uffd-fix [1], if the old
> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert all ptes from none
> to not none, and we will lose the batching effect. I also plan to extend support to
> anonymous folios (therefore generalizing for all types of memory) which will set a
> batch of ptes as swap, and the internal pvmw code won't be able to skip through the
> batch.
>
>
> [1] https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/

No, as I told you, the correct course is to make your suggestion here with
perhaps a suggested fix-patch, please let's not split the discussion
between _the actual series where the issue exists_ and an invalid patch
report, it makes it _super hard_ to track what on earth is going on here.

Now anybody responding will be inclined to reply there and it's a total
mess...

Thanks, Lorenzo

Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

Posted by Barry Song 1 month ago

On Wed, Jan 7, 2026 at 2:46 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
> >On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com> wrote:
> >>
> >> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
> >> >Similar to folio_referenced_one(), we can apply batched unmapping for file
> >> >large folios to optimize the performance of file folios reclamation.
> >> >
> >> >Barry previously implemented batched unmapping for lazyfree anonymous large
> >> >folios[1] and did not further optimize anonymous large folios or file-backed
> >> >large folios at that stage. As for file-backed large folios, the batched
> >> >unmapping support is relatively straightforward, as we only need to clear
> >> >the consecutive (present) PTE entries for file-backed large folios.
> >> >
> >> >Performance testing:
> >> >Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> >> >reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> >> >75% performance improvement on my Arm64 32-core server (and 50%+ improvement
> >> >on my X86 machine) with this patch.
> >> >
> >> >W/o patch:
> >> >real    0m1.018s
> >> >user    0m0.000s
> >> >sys     0m1.018s
> >> >
> >> >W/ patch:
> >> >real   0m0.249s
> >> >user   0m0.000s
> >> >sys    0m0.249s
> >> >
> >> >[1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
> >> >Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> >> >Acked-by: Barry Song <baohua@kernel.org>
> >> >Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >> >---
> >> > mm/rmap.c | 7 ++++---
> >> > 1 file changed, 4 insertions(+), 3 deletions(-)
> >> >
> >> >diff --git a/mm/rmap.c b/mm/rmap.c
> >> >index 985ab0b085ba..e1d16003c514 100644
> >> >--- a/mm/rmap.c
> >> >+++ b/mm/rmap.c
> >> >@@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> >> >       end_addr = pmd_addr_end(addr, vma->vm_end);
> >> >       max_nr = (end_addr - addr) >> PAGE_SHIFT;
> >> >
> >> >-      /* We only support lazyfree batching for now ... */
> >> >-      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
> >> >+      /* We only support lazyfree or file folios batching for now ... */
> >> >+      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
> >> >               return 1;
> >> >+
> >> >       if (pte_unused(pte))
> >> >               return 1;
> >> >
> >> >@@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >> >                        *
> >> >                        * See Documentation/mm/mmu_notifier.rst
> >> >                        */
> >> >-                      dec_mm_counter(mm, mm_counter_file(folio));
> >> >+                      add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
> >> >               }
> >> > discard:
> >> >               if (unlikely(folio_test_hugetlb(folio))) {
> >> >--
> >> >2.47.3
> >> >
> >>
> >> Hi, Baolin
> >>
> >> When reading your patch, I come up one small question.
> >>
> >> Current try_to_unmap_one() has following structure:
> >>
> >>     try_to_unmap_one()
> >>         while (page_vma_mapped_walk(&pvmw)) {
> >>             nr_pages = folio_unmap_pte_batch()
> >>
> >>             if (nr_pages == folio_nr_pages(folio))
> >>                 goto walk_done;
> >>         }
> >>
> >> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
> >>
> >> If my understanding is correct, page_vma_mapped_walk() would start from
> >> (pvmw->address + PAGE_SIZE) in next iteration, but we have already cleared to
> >> (pvmw->address + nr_pages * PAGE_SIZE), right?
> >>
> >> Not sure my understanding is correct, if so do we have some reason not to
> >> skip the cleared range?
> >
> >I don’t quite understand your question. For nr_pages > 1 but not equal
> >to folio_nr_pages(folio), page_vma_mapped_walk() will skip the nr_pages - 1
> >PTEs inside.
> >
> >take a look:
> >
> >next_pte:
> >                do {
> >                        pvmw->address += PAGE_SIZE;
> >                        if (pvmw->address >= end)
> >                                return not_found(pvmw);
> >                        /* Did we cross page table boundary? */
> >                        if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
> >                                if (pvmw->ptl) {
> >                                        spin_unlock(pvmw->ptl);
> >                                        pvmw->ptl = NULL;
> >                                }
> >                                pte_unmap(pvmw->pte);
> >                                pvmw->pte = NULL;
> >                                pvmw->flags |= PVMW_PGTABLE_CROSSED;
> >                                goto restart;
> >                        }
> >                        pvmw->pte++;
> >                } while (pte_none(ptep_get(pvmw->pte)));
> >
>
> Yes, we do it in page_vma_mapped_walk() now. Since they are pte_none(), they
> will be skipped.
>
> I mean maybe we can skip it in try_to_unmap_one(), for example:
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 9e5bd4834481..ea1afec7c802 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>                  */
>                 if (nr_pages == folio_nr_pages(folio))
>                         goto walk_done;
> +               else {
> +                       pvmw.address += PAGE_SIZE * (nr_pages - 1);
> +                       pvmw.pte += nr_pages - 1;
> +               }
>                 continue;
>  walk_abort:
>                 ret = false;


I feel this couples the PTE walk iteration with the unmap
operation, which does not seem fine to me. It also appears
to affect only corner cases.

>
> Not sure this is reasonable.
>

Thanks
Barry
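
The walk behaviour debated in this sub-thread can be illustrated with a
small userspace model. It is a sketch only: enum model_pte,
struct model_walk_result and model_walk_after_batch() are hypothetical
names, PTE_MARKER merely stands for any non-none entry left behind by a
batched unmap (for example a uffd-wp marker or a swap entry, as mentioned
elsewhere in the thread), and page-table boundaries and locking are
ignored.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_PTES 64

/* Hypothetical PTE states for this model; not kernel types. */
enum model_pte { PTE_NONE, PTE_PRESENT, PTE_MARKER };

struct model_walk_result {
	size_t stop;   /* slot where the walk stopped, or n at end of range */
	size_t steps;  /* number of slots the walk examined */
};

/*
 * Build a table of n present entries, replace the first nr of them with
 * `with` (as a batched unmap starting at slot 0 would), then resume the
 * walk from slot 0 the way the quoted next_pte loop does.
 */
static struct model_walk_result
model_walk_after_batch(size_t n, size_t nr, enum model_pte with)
{
	enum model_pte ptes[MODEL_MAX_PTES];
	struct model_walk_result res = { 0, 0 };

	assert(nr >= 1 && nr <= n && n <= MODEL_MAX_PTES);

	for (size_t i = 0; i < n; i++)
		ptes[i] = i < nr ? with : PTE_PRESENT;

	/* One slot per iteration; keep going only while entries are none. */
	do {
		res.stop++;
		res.steps++;
		if (res.stop >= n) {
			res.stop = n;
			break;
		}
	} while (ptes[res.stop] == PTE_NONE);

	return res;
}
```

With eight entries and a batch of four cleared to none, the model walk
examines four slots and stops at slot 4, so the cleared range is skipped
without returning to the caller. If the batch instead leaves non-none
entries behind, the walk stops again at slot 1, which is the situation
where the walker alone can no longer skip the batch.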

Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

Posted by Baolin Wang 1 month ago


On 1/7/26 10:21 AM, Barry Song wrote:
> On Wed, Jan 7, 2026 at 2:46 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>>
>> On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>>> On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>>>>
>>>> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>>>>> Similar to folio_referenced_one(), we can apply batched unmapping for file
>>>>> large folios to optimize the performance of file folios reclamation.
>>>>>
>>>>> Barry previously implemented batched unmapping for lazyfree anonymous large
>>>>> folios[1] and did not further optimize anonymous large folios or file-backed
>>>>> large folios at that stage. As for file-backed large folios, the batched
>>>>> unmapping support is relatively straightforward, as we only need to clear
>>>>> the consecutive (present) PTE entries for file-backed large folios.
>>>>>
>>>>> Performance testing:
>>>>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>>>>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>>>>> 75% performance improvement on my Arm64 32-core server (and 50%+ improvement
>>>>> on my X86 machine) with this patch.
>>>>>
>>>>> W/o patch:
>>>>> real    0m1.018s
>>>>> user    0m0.000s
>>>>> sys     0m1.018s
>>>>>
>>>>> W/ patch:
>>>>> real   0m0.249s
>>>>> user   0m0.000s
>>>>> sys    0m0.249s
>>>>>
>>>>> [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>> Acked-by: Barry Song <baohua@kernel.org>
>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>> ---
>>>>> mm/rmap.c | 7 ++++---
>>>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index 985ab0b085ba..e1d16003c514 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>>>>>        end_addr = pmd_addr_end(addr, vma->vm_end);
>>>>>        max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>>>>
>>>>> -      /* We only support lazyfree batching for now ... */
>>>>> -      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>>>>> +      /* We only support lazyfree or file folios batching for now ... */
>>>>> +      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>>>>                return 1;
>>>>> +
>>>>>        if (pte_unused(pte))
>>>>>                return 1;
>>>>>
>>>>> @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>                         *
>>>>>                         * See Documentation/mm/mmu_notifier.rst
>>>>>                         */
>>>>> -                      dec_mm_counter(mm, mm_counter_file(folio));
>>>>> +                      add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
>>>>>                }
>>>>> discard:
>>>>>                if (unlikely(folio_test_hugetlb(folio))) {
>>>>> --
>>>>> 2.47.3
>>>>>
>>>>
>>>> Hi, Baolin
>>>>
>>>> When reading your patch, I come up one small question.
>>>>
>>>> Current try_to_unmap_one() has following structure:
>>>>
>>>>      try_to_unmap_one()
>>>>          while (page_vma_mapped_walk(&pvmw)) {
>>>>              nr_pages = folio_unmap_pte_batch()
>>>>
>>>>              if (nr_pages == folio_nr_pages(folio))
>>>>                  goto walk_done;
>>>>          }
>>>>
>>>> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
>>>>
>>>> If my understanding is correct, page_vma_mapped_walk() would start from
>>>> (pvmw->address + PAGE_SIZE) in next iteration, but we have already cleared to
>>>> (pvmw->address + nr_pages * PAGE_SIZE), right?
>>>>
>>>> Not sure my understanding is correct, if so do we have some reason not to
>>>> skip the cleared range?
>>>
>>> I don’t quite understand your question. For nr_pages > 1 but not equal
>>> to folio_nr_pages(folio), page_vma_mapped_walk() will skip the nr_pages - 1
>>> PTEs inside.
>>>
>>> take a look:
>>>
>>> next_pte:
>>>                 do {
>>>                         pvmw->address += PAGE_SIZE;
>>>                         if (pvmw->address >= end)
>>>                                 return not_found(pvmw);
>>>                         /* Did we cross page table boundary? */
>>>                         if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
>>>                                 if (pvmw->ptl) {
>>>                                         spin_unlock(pvmw->ptl);
>>>                                         pvmw->ptl = NULL;
>>>                                 }
>>>                                 pte_unmap(pvmw->pte);
>>>                                 pvmw->pte = NULL;
>>>                                 pvmw->flags |= PVMW_PGTABLE_CROSSED;
>>>                                 goto restart;
>>>                         }
>>>                         pvmw->pte++;
>>>                 } while (pte_none(ptep_get(pvmw->pte)));
>>>
>>
>> Yes, we do it in page_vma_mapped_walk() now. Since they are pte_none(), they
>> will be skipped.
>>
>> I mean maybe we can skip it in try_to_unmap_one(), for example:
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 9e5bd4834481..ea1afec7c802 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>                   */
>>                  if (nr_pages == folio_nr_pages(folio))
>>                          goto walk_done;
>> +               else {
>> +                       pvmw.address += PAGE_SIZE * (nr_pages - 1);
>> +                       pvmw.pte += nr_pages - 1;
>> +               }
>>                  continue;
>>   walk_abort:
>>                  ret = false;
> 
> 
> I feel this couples the PTE walk iteration with the unmap
> operation, which does not seem fine to me. It also appears
> to affect only corner cases.

Agree. There may be no performance gains, so I also prefer to leave it 
as is.

Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios

Posted by Wei Yang 1 month ago

On Wed, Jan 07, 2026 at 10:29:18AM +0800, Baolin Wang wrote:
>
>
>On 1/7/26 10:21 AM, Barry Song wrote:
>> On Wed, Jan 7, 2026 at 2:46 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>> > 
>> > On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>> > > On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>> > > > 
>> > > > On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>> > > > > Similar to folio_referenced_one(), we can apply batched unmapping for file
>> > > > > large folios to optimize the performance of file folios reclamation.
>> > > > > 
>> > > > > Barry previously implemented batched unmapping for lazyfree anonymous large
>> > > > > folios[1] and did not further optimize anonymous large folios or file-backed
>> > > > > large folios at that stage. As for file-backed large folios, the batched
>> > > > > unmapping support is relatively straightforward, as we only need to clear
>> > > > > the consecutive (present) PTE entries for file-backed large folios.
>> > > > > 
>> > > > > Performance testing:
>> > > > > Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>> > > > > reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>> > > > > 75% performance improvement on my Arm64 32-core server (and 50%+ improvement
>> > > > > on my X86 machine) with this patch.
>> > > > > 
>> > > > > W/o patch:
>> > > > > real    0m1.018s
>> > > > > user    0m0.000s
>> > > > > sys     0m1.018s
>> > > > > 
>> > > > > W/ patch:
>> > > > > real   0m0.249s
>> > > > > user   0m0.000s
>> > > > > sys    0m0.249s
>> > > > > 
>> > > > > [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>> > > > > Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>> > > > > Acked-by: Barry Song <baohua@kernel.org>
>> > > > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> > > > > ---
>> > > > > mm/rmap.c | 7 ++++---
>> > > > > 1 file changed, 4 insertions(+), 3 deletions(-)
>> > > > > 
>> > > > > diff --git a/mm/rmap.c b/mm/rmap.c
>> > > > > index 985ab0b085ba..e1d16003c514 100644
>> > > > > --- a/mm/rmap.c
>> > > > > +++ b/mm/rmap.c
>> > > > > @@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>> > > > >        end_addr = pmd_addr_end(addr, vma->vm_end);
>> > > > >        max_nr = (end_addr - addr) >> PAGE_SHIFT;
>> > > > > 
>> > > > > -      /* We only support lazyfree batching for now ... */
>> > > > > -      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>> > > > > +      /* We only support lazyfree or file folios batching for now ... */
>> > > > > +      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>> > > > >                return 1;
>> > > > > +
>> > > > >        if (pte_unused(pte))
>> > > > >                return 1;
>> > > > > 
>> > > > > @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> > > > >                         *
>> > > > >                         * See Documentation/mm/mmu_notifier.rst
>> > > > >                         */
>> > > > > -                      dec_mm_counter(mm, mm_counter_file(folio));
>> > > > > +                      add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
>> > > > >                }
>> > > > > discard:
>> > > > >                if (unlikely(folio_test_hugetlb(folio))) {
>> > > > > --
>> > > > > 2.47.3
>> > > > > 
>> > > > 
>> > > > Hi, Baolin
>> > > > 
>> > > > When reading your patch, I came up with one small question.
>> > > > 
>> > > > The current try_to_unmap_one() has the following structure:
>> > > > 
>> > > >      try_to_unmap_one()
>> > > >          while (page_vma_mapped_walk(&pvmw)) {
>> > > >              nr_pages = folio_unmap_pte_batch()
>> > > > 
>> > > >              if (nr_pages == folio_nr_pages(folio))
>> > > >                  goto walk_done;
>> > > >          }
>> > > > 
>> > > > I am wondering what happens if nr_pages > 1 but nr_pages != folio_nr_pages().
>> > > > 
>> > > > If my understanding is correct, page_vma_mapped_walk() would start from
>> > > > (pvmw->address + PAGE_SIZE) in the next iteration, but we have already
>> > > > cleared up to (pvmw->address + nr_pages * PAGE_SIZE), right?
>> > > > 
>> > > > I am not sure my understanding is correct. If it is, do we have a reason
>> > > > not to skip the cleared range?
>> > > 
>> > > I don’t quite understand your question. For nr_pages > 1 but not equal
>> > > to folio_nr_pages(), page_vma_mapped_walk() will skip the nr_pages - 1
>> > > PTEs inside.
>> > > 
>> > > Take a look:
>> > > 
>> > > next_pte:
>> > >                 do {
>> > >                         pvmw->address += PAGE_SIZE;
>> > >                         if (pvmw->address >= end)
>> > >                                 return not_found(pvmw);
>> > >                         /* Did we cross page table boundary? */
>> > >                         if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
>> > >                                 if (pvmw->ptl) {
>> > >                                         spin_unlock(pvmw->ptl);
>> > >                                         pvmw->ptl = NULL;
>> > >                                 }
>> > >                                 pte_unmap(pvmw->pte);
>> > >                                 pvmw->pte = NULL;
>> > >                                 pvmw->flags |= PVMW_PGTABLE_CROSSED;
>> > >                                 goto restart;
>> > >                         }
>> > >                         pvmw->pte++;
>> > >                 } while (pte_none(ptep_get(pvmw->pte)));
>> > > 
>> > 
>> > Yes, we do it in page_vma_mapped_walk() now. Since they are pte_none(), they
>> > will be skipped.
>> > 
>> > I mean maybe we can skip it in try_to_unmap_one(), for example:
>> > 
>> > diff --git a/mm/rmap.c b/mm/rmap.c
>> > index 9e5bd4834481..ea1afec7c802 100644
>> > --- a/mm/rmap.c
>> > +++ b/mm/rmap.c
>> > @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> >                   */
>> >                  if (nr_pages == folio_nr_pages(folio))
>> >                          goto walk_done;
>> > +               else {
>> > +                       pvmw.address += PAGE_SIZE * (nr_pages - 1);
>> > +                       pvmw.pte += nr_pages - 1;
>> > +               }
>> >                  continue;
>> >   walk_abort:
>> >                  ret = false;
>> 
>> 
>> I feel this couples the PTE walk iteration with the unmap
>> operation, which does not seem fine to me. It also appears
>> to affect only corner cases.
>
>Agree. There may be no performance gains, so I also prefer to leave it as is.

Got it, thanks.

-- 
Wei Yang
Help you, Help me
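
The behaviour discussed above (a partial batch clears nr_pages entries, and the
walker then steps over the now-empty entries one at a time) can be illustrated
with a small userspace model. This is only a sketch under stated assumptions:
toy_walk, toy_walk_next(), toy_unmap_batch() and toy_unmap_range() are
hypothetical names, a "PTE" is reduced to a present/none flag, and locking,
PMD boundaries and everything else page_vma_mapped_walk() does are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these types or helpers exist in the kernel. */
struct toy_walk {
	bool *ptes;      /* toy page table: true = present, false = none */
	size_t nr_ptes;  /* number of entries in the toy table */
	size_t idx;      /* current slot, standing in for pvmw->pte */
	bool started;    /* false until the first call to toy_walk_next() */
	size_t visited;  /* present entries handed back to the caller */
	size_t skipped;  /* empty entries the walker stepped over */
};

/*
 * Mirrors the shape of the next_pte loop quoted above: advance one slot
 * at a time and keep going while the slot is empty.
 */
static bool toy_walk_next(struct toy_walk *w)
{
	if (!w->started)
		w->started = true;
	else
		w->idx++;

	while (w->idx < w->nr_ptes && !w->ptes[w->idx]) {
		w->skipped++;
		w->idx++;
	}
	if (w->idx >= w->nr_ptes)
		return false;

	w->visited++;
	return true;
}

/* Clear up to max_nr consecutive present entries starting at w->idx. */
static size_t toy_unmap_batch(struct toy_walk *w, size_t max_nr)
{
	size_t nr = 0;

	assert(w->idx < w->nr_ptes);
	while (nr < max_nr && w->idx + nr < w->nr_ptes && w->ptes[w->idx + nr]) {
		w->ptes[w->idx + nr] = false;
		nr++;
	}
	return nr;
}

/* Walk the whole toy table; the walker is NOT advanced past a batch. */
static size_t toy_unmap_range(struct toy_walk *w, size_t max_batch)
{
	size_t unmapped = 0;

	while (toy_walk_next(w))
		unmapped += toy_unmap_batch(w, max_batch);
	return unmapped;
}
```

In this model the caller never adjusts w->idx after a batch, so the entries
cleared by the batch are simply counted in w->skipped on the next call. That
is the extra work the proposed pvmw.address/pvmw.pte adjustment would avoid,
and it amounts to one cheap emptiness check per cleared entry.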

[PATCH] mm: rmap: skip batched unmapping for UFFD vmas

Posted by Baolin Wang 3 weeks ago

As Dev reported[1], batched unmapping is not yet ready for the uffd case.
Let's keep falling back to per-page unmapping for the uffd case.

[1] https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/
Reported-by: Dev Jain <dev.jain@arm.com>
Suggested-by: Barry Song <baohua@kernel.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/rmap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/rmap.c b/mm/rmap.c
index f13480cb9f2e..172643092dcf 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1953,6 +1953,9 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
 	if (pte_unused(pte))
 		return 1;
 
+	if (userfaultfd_wp(vma))
+		return 1;
+
 	return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
 }
 
-- 
2.47.3
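
Taken together with patch 5/5, the early exits in folio_unmap_pte_batch() can
be summarised with a small userspace model. This is a sketch, not the kernel
function: toy_folio, toy_ctx and toy_unmap_batch_size() are hypothetical names,
the flags stand in for folio_test_anon(), folio_test_swapbacked(),
pte_unused() and userfaultfd_wp(), and the "contiguous" field is an assumed
input replacing whatever folio_pte_batch() would compute.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; the real checks operate on struct folio, pte_t and the vma. */
struct toy_folio {
	bool anon;        /* models folio_test_anon() */
	bool swapbacked;  /* models folio_test_swapbacked() */
};

struct toy_ctx {
	bool pte_unused;          /* models pte_unused(pte) */
	bool uffd_wp;             /* models userfaultfd_wp(vma) */
	unsigned int max_nr;      /* entries left before the PMD/vma end */
	unsigned int contiguous;  /* assumed batch length, must be >= 1 */
};

/* Returns how many PTEs the model would unmap in one step (at least 1). */
static unsigned int toy_unmap_batch_size(const struct toy_folio *f,
					 const struct toy_ctx *c)
{
	/* Only lazyfree (anon, not swap-backed) or file folios are batched. */
	if (f->anon && f->swapbacked)
		return 1;

	if (c->pte_unused)
		return 1;

	/* Follow-up patch: fall back to per-page unmapping for uffd-wp vmas. */
	if (c->uffd_wp)
		return 1;

	assert(c->contiguous >= 1 && c->max_nr >= 1);
	return c->contiguous < c->max_nr ? c->contiguous : c->max_nr;
}
```

With these rules a file folio or a lazyfree anonymous folio gets a batch larger
than one only when the PTE is not marked unused and the vma is not uffd-wp;
every other combination is handled one page at a time.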
