[PATCH v2 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)

Posted by Qi Zheng 1 month ago
In pursuit of high performance, applications typically use user-mode
memory allocators such as jemalloc or tcmalloc. These allocators use
madvise(MADV_DONTNEED or MADV_FREE) to release physical memory, but
neither MADV_DONTNEED nor MADV_FREE frees the corresponding page table
pages, which can lead to huge page table memory usage.

The following is a memory usage snapshot of a process observed on one
of our servers:

        VIRT:  55t
        RES:   590g
        VmPTE: 110g

In this case, most of the page table entries are empty. A PTE page
whose entries are all empty can be freed back to the system for others
to use.

As a first step, this commit aims to synchronously free the empty PTE
pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
cases other than madvise(MADV_DONTNEED).

Once an empty PTE is detected, we first try to hold the pmd lock within
the pte lock. If successful, we clear the pmd entry directly (fast path).
Otherwise, we wait until the pte lock is released, then re-hold the pmd
and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
whether the PTE page is empty and free it (slow path).

For other cases, such as madvise(MADV_FREE), we may consider scanning
and freeing empty PTE pages asynchronously in the future.

The following code snippet demonstrates the effect of the optimization:

        mmap 50G
        while (1) {
                for (; i < 1024 * 25; i++) {
                        touch 2M memory
                        madvise MADV_DONTNEED 2M
                }
        }
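
For reference, a runnable C version of the above loop might look like
this (an illustrative sketch only; the sizes mirror the pseudocode and
error handling is omitted):

        #include <string.h>
        #include <sys/mman.h>

        int main(void)
        {
                size_t step = 2UL << 20;            /* 2M touch/release granularity */
                size_t len = 1024UL * 25 * step;    /* 50G anonymous mapping */
                char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                while (1) {
                        for (size_t i = 0; i < len; i += step) {
                                memset(buf + i, 1, step);               /* touch 2M memory */
                                madvise(buf + i, step, MADV_DONTNEED);  /* release it again */
                        }
                }
        }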

As we can see, the memory usage of VmPTE is reduced:

                        before                          after
VIRT                   50.0 GB                        50.0 GB
RES                     3.1 MB                         3.1 MB
VmPTE                102640 KB                         240 KB

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm.h |  1 +
 mm/Kconfig         | 15 ++++++++++
 mm/Makefile        |  1 +
 mm/internal.h      | 23 ++++++++++++++++
 mm/madvise.c       |  4 ++-
 mm/memory.c        | 45 +++++++++++++++++++++++++++++-
 mm/pt_reclaim.c    | 68 ++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 155 insertions(+), 2 deletions(-)
 create mode 100644 mm/pt_reclaim.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3e4bb43035953..ce3936590fe72 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2319,6 +2319,7 @@ extern void pagefault_out_of_memory(void);
 struct zap_details {
 	struct folio *single_folio;	/* Locked folio to be unmapped */
 	bool even_cows;			/* Zap COWed private pages too? */
+	bool reclaim_pt;		/* Need reclaim page tables? */
 	zap_flags_t zap_flags;		/* Extra flags for zapping */
 };
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 84000b0168086..681909e0a9fa3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1301,6 +1301,21 @@ config ARCH_HAS_USER_SHADOW_STACK
 	  The architecture has hardware support for userspace shadow call
           stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
 
+config ARCH_SUPPORTS_PT_RECLAIM
+	def_bool n
+
+config PT_RECLAIM
+	bool "reclaim empty user page table pages"
+	default y
+	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
+	select MMU_GATHER_RCU_TABLE_FREE
+	help
+	  Try to reclaim empty user page table pages in paths other that munmap
+	  and exit_mmap path.
+
+	  Note: now only empty user PTE page table pages will be reclaimed.
+
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index d5639b0361663..9d816323d247a 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -145,3 +145,4 @@ obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
+obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
diff --git a/mm/internal.h b/mm/internal.h
index d5b93c5b63648..7aba395a9940f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1508,4 +1508,27 @@ int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
 		unsigned long end, const struct mm_walk_ops *ops,
 		void *private);
 
+#ifdef CONFIG_PT_RECLAIM
+bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval);
+void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
+	      pmd_t pmdval);
+void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+		     struct mmu_gather *tlb);
+#else
+static inline bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd,
+					 pmd_t *pmdval)
+{
+	return false;
+}
+static inline void free_pte(struct mm_struct *mm, unsigned long addr,
+			    struct mmu_gather *tlb, pmd_t pmdval)
+{
+}
+static inline void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd,
+				   unsigned long addr, struct mmu_gather *tlb)
+{
+}
+#endif /* CONFIG_PT_RECLAIM */
+
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index 0ceae57da7dad..ee88652761d45 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -851,7 +851,9 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 					unsigned long start, unsigned long end)
 {
-	zap_page_range_single(vma, start, end - start, NULL);
+	struct zap_details details = {.reclaim_pt = true,};
+
+	zap_page_range_single(vma, start, end - start, &details);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 002aa4f454fa0..c4a8c18fbcfd7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1436,7 +1436,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 static inline bool should_zap_cows(struct zap_details *details)
 {
 	/* By default, zap all pages */
-	if (!details)
+	if (!details || details->reclaim_pt)
 		return true;
 
 	/* Or, we zap COWed pages only if the caller wants to */
@@ -1678,6 +1678,30 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
 					 details, rss);
 }
 
+static inline int count_pte_none(pte_t *pte, int nr)
+{
+	int none_nr = 0;
+
+	/*
+	 * If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be
+	 * re-installed, so we need to check pte_none() one by one.
+	 * Otherwise, checking a single PTE in a batch is sufficient.
+	 */
+#ifdef CONFIG_PTE_MARKER_UFFD_WP
+	for (;;) {
+		if (pte_none(ptep_get(pte)))
+			none_nr++;
+		if (--nr == 0)
+			break;
+		pte++;
+	}
+#else
+	if (pte_none(ptep_get(pte)))
+		none_nr = nr;
+#endif
+	return none_nr;
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
@@ -1689,8 +1713,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	spinlock_t *ptl;
 	pte_t *start_pte;
 	pte_t *pte;
+	pmd_t pmdval;
+	bool can_reclaim_pt = false;
+	bool direct_reclaim = false;
+	unsigned long start = addr;
+	int none_nr = 0;
 	int nr;
 
+	if (details && details->reclaim_pt && (end - start >= PMD_SIZE))
+		can_reclaim_pt = true;
+
 retry:
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	init_rss_vec(rss);
@@ -1706,12 +1738,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 		nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, rss,
 				      &force_flush, &force_break);
+		none_nr += count_pte_none(pte, nr);
 		if (unlikely(force_break)) {
 			addr += nr * PAGE_SIZE;
 			break;
 		}
 	} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
 
+	if (addr == end && can_reclaim_pt && (none_nr == PTRS_PER_PTE))
+		direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
+
 	add_mm_rss_vec(mm, rss);
 	arch_leave_lazy_mmu_mode();
 
@@ -1738,6 +1774,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		goto retry;
 	}
 
+	if (can_reclaim_pt) {
+		if (direct_reclaim)
+			free_pte(mm, start, tlb, pmdval);
+		else
+			try_to_free_pte(mm, pmd, start, tlb);
+	}
+
 	return addr;
 }
 
diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
new file mode 100644
index 0000000000000..fc055da40b615
--- /dev/null
+++ b/mm/pt_reclaim.c
@@ -0,0 +1,68 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/hugetlb.h>
+#include <asm-generic/tlb.h>
+#include <asm/pgalloc.h>
+
+#include "internal.h"
+
+bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval)
+{
+	spinlock_t *pml = pmd_lockptr(mm, pmd);
+
+	if (!spin_trylock(pml))
+		return false;
+
+	*pmdval = pmdp_get_lockless(pmd);
+	pmd_clear(pmd);
+	spin_unlock(pml);
+
+	return true;
+}
+
+void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
+	      pmd_t pmdval)
+{
+	pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
+	mm_dec_nr_ptes(mm);
+}
+
+void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+		     struct mmu_gather *tlb)
+{
+	pmd_t pmdval;
+	spinlock_t *pml, *ptl;
+	pte_t *start_pte, *pte;
+	int i;
+
+	start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
+	if (!start_pte)
+		return;
+
+	pml = pmd_lock(mm, pmd);
+	if (ptl != pml)
+		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
+		goto out_ptl;
+
+	/* Check if it is empty PTE page */
+	for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
+		if (!pte_none(ptep_get(pte)))
+			goto out_ptl;
+	}
+	pte_unmap(start_pte);
+
+	pmd_clear(pmd);
+
+	if (ptl != pml)
+		spin_unlock(ptl);
+	spin_unlock(pml);
+
+	free_pte(mm, addr, tlb, pmdval);
+
+	return;
+out_ptl:
+	pte_unmap_unlock(start_pte, ptl);
+	if (pml != ptl)
+		spin_unlock(pml);
+}
-- 
2.20.1
Re: [PATCH v2 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
Posted by Jann Horn 3 weeks, 3 days ago
On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> As a first step, this commit aims to synchronously free the empty PTE
> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
> cases other than madvise(MADV_DONTNEED).
>
> Once an empty PTE is detected, we first try to hold the pmd lock within
> the pte lock. If successful, we clear the pmd entry directly (fast path).
> Otherwise, we wait until the pte lock is released, then re-hold the pmd
> and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
> whether the PTE page is empty and free it (slow path).

How does this interact with move_pages_pte()? I am looking at your
series applied on top of next-20241106, and it looks to me like
move_pages_pte() uses pte_offset_map_rw_nolock() and assumes that the
PMD entry can't change. You can clearly see this assumption at the
WARN_ON_ONCE(pmd_none(*dst_pmd)). And if we race the wrong way, I
think for example move_present_pte() can end up moving a present PTE
into a page table that has already been scheduled for RCU freeing.

> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  include/linux/mm.h |  1 +
>  mm/Kconfig         | 15 ++++++++++
>  mm/Makefile        |  1 +
>  mm/internal.h      | 23 ++++++++++++++++
>  mm/madvise.c       |  4 ++-
>  mm/memory.c        | 45 +++++++++++++++++++++++++++++-
>  mm/pt_reclaim.c    | 68 ++++++++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 155 insertions(+), 2 deletions(-)
>  create mode 100644 mm/pt_reclaim.c
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 3e4bb43035953..ce3936590fe72 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2319,6 +2319,7 @@ extern void pagefault_out_of_memory(void);
>  struct zap_details {
>         struct folio *single_folio;     /* Locked folio to be unmapped */
>         bool even_cows;                 /* Zap COWed private pages too? */
> +       bool reclaim_pt;                /* Need reclaim page tables? */
>         zap_flags_t zap_flags;          /* Extra flags for zapping */
>  };
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 84000b0168086..681909e0a9fa3 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1301,6 +1301,21 @@ config ARCH_HAS_USER_SHADOW_STACK
>           The architecture has hardware support for userspace shadow call
>            stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>
> +config ARCH_SUPPORTS_PT_RECLAIM
> +       def_bool n
> +
> +config PT_RECLAIM
> +       bool "reclaim empty user page table pages"
> +       default y
> +       depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
> +       select MMU_GATHER_RCU_TABLE_FREE
> +       help
> +         Try to reclaim empty user page table pages in paths other that munmap

nit: s/other that/other than/

> +         and exit_mmap path.
> +
> +         Note: now only empty user PTE page table pages will be reclaimed.
[...]
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 0ceae57da7dad..ee88652761d45 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -851,7 +851,9 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
>                                         unsigned long start, unsigned long end)
>  {
> -       zap_page_range_single(vma, start, end - start, NULL);
> +       struct zap_details details = {.reclaim_pt = true,};
> +
> +       zap_page_range_single(vma, start, end - start, &details);
>         return 0;
>  }
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 002aa4f454fa0..c4a8c18fbcfd7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1436,7 +1436,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  static inline bool should_zap_cows(struct zap_details *details)
>  {
>         /* By default, zap all pages */
> -       if (!details)
> +       if (!details || details->reclaim_pt)
>                 return true;
>
>         /* Or, we zap COWed pages only if the caller wants to */

This looks hacky - when we have a "details" object, its ->even_cows
member is supposed to indicate whether COW pages should be zapped. So
please instead set .even_cows=true in madvise_dontneed_single_vma().

> @@ -1678,6 +1678,30 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
>                                          details, rss);
>  }
>
> +static inline int count_pte_none(pte_t *pte, int nr)
> +{
> +       int none_nr = 0;
> +
> +       /*
> +        * If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be
> +        * re-installed, so we need to check pte_none() one by one.
> +        * Otherwise, checking a single PTE in a batch is sufficient.
> +        */

Please also add a comment noting that count_pte_none() relies on this
invariant on top of do_zap_pte_range() or somewhere like that.

> +#ifdef CONFIG_PTE_MARKER_UFFD_WP
> +       for (;;) {
> +               if (pte_none(ptep_get(pte)))
> +                       none_nr++;
> +               if (--nr == 0)
> +                       break;
> +               pte++;
> +       }
> +#else
> +       if (pte_none(ptep_get(pte)))
> +               none_nr = nr;
> +#endif
> +       return none_nr;
> +}
> +
>  static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                                 struct vm_area_struct *vma, pmd_t *pmd,
>                                 unsigned long addr, unsigned long end,
> @@ -1689,8 +1713,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>         spinlock_t *ptl;
>         pte_t *start_pte;
>         pte_t *pte;
> +       pmd_t pmdval;
> +       bool can_reclaim_pt = false;
> +       bool direct_reclaim = false;
> +       unsigned long start = addr;
> +       int none_nr = 0;
>         int nr;
>
> +       if (details && details->reclaim_pt && (end - start >= PMD_SIZE))
> +               can_reclaim_pt = true;
> +
>  retry:
>         tlb_change_page_size(tlb, PAGE_SIZE);
>         init_rss_vec(rss);
> @@ -1706,12 +1738,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>
>                 nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, rss,
>                                       &force_flush, &force_break);
> +               none_nr += count_pte_none(pte, nr);
>                 if (unlikely(force_break)) {
>                         addr += nr * PAGE_SIZE;
>                         break;
>                 }
>         } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>
> +       if (addr == end && can_reclaim_pt && (none_nr == PTRS_PER_PTE))
> +               direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
> +
>         add_mm_rss_vec(mm, rss);
>         arch_leave_lazy_mmu_mode();
>
> @@ -1738,6 +1774,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                 goto retry;
>         }
>
> +       if (can_reclaim_pt) {
> +               if (direct_reclaim)
> +                       free_pte(mm, start, tlb, pmdval);
> +               else
> +                       try_to_free_pte(mm, pmd, start, tlb);
> +       }
> +
>         return addr;
>  }
>
> diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
> new file mode 100644
> index 0000000000000..fc055da40b615
> --- /dev/null
> +++ b/mm/pt_reclaim.c
> @@ -0,0 +1,68 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/hugetlb.h>
> +#include <asm-generic/tlb.h>
> +#include <asm/pgalloc.h>
> +
> +#include "internal.h"
> +
> +bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval)
> +{
> +       spinlock_t *pml = pmd_lockptr(mm, pmd);
> +
> +       if (!spin_trylock(pml))
> +               return false;
> +
> +       *pmdval = pmdp_get_lockless(pmd);
> +       pmd_clear(pmd);
> +       spin_unlock(pml);
> +
> +       return true;
> +}
> +
> +void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
> +             pmd_t pmdval)
> +{
> +       pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
> +       mm_dec_nr_ptes(mm);
> +}
> +
> +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
> +                    struct mmu_gather *tlb)
> +{
> +       pmd_t pmdval;
> +       spinlock_t *pml, *ptl;
> +       pte_t *start_pte, *pte;
> +       int i;
> +

If you swap the following two operations around (first pmd_lock(),
then pte_offset_map_rw_nolock(), then take the PTE lock):

> +       start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
> +       if (!start_pte)
> +               return;
> +
> +       pml = pmd_lock(mm, pmd);
> +       if (ptl != pml)
> +               spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +

Then I think this check can go away:

> +       if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
> +               goto out_ptl;
> +
> +       /* Check if it is empty PTE page */
> +       for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
> +               if (!pte_none(ptep_get(pte)))
> +                       goto out_ptl;
> +       }
> +       pte_unmap(start_pte);
> +
> +       pmd_clear(pmd);
> +
> +       if (ptl != pml)
> +               spin_unlock(ptl);
> +       spin_unlock(pml);
> +
> +       free_pte(mm, addr, tlb, pmdval);
> +
> +       return;
> +out_ptl:
> +       pte_unmap_unlock(start_pte, ptl);
> +       if (pml != ptl)
> +               spin_unlock(pml);
> +}
> --
> 2.20.1
>
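
For illustration, the suggested reordering might end up looking roughly
like this (a sketch, not the actual respin; ptl is assumed to be
initialized to NULL so that the early-exit path only drops the pmd
lock):

void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
		     struct mmu_gather *tlb)
{
	pmd_t pmdval;
	spinlock_t *pml, *ptl = NULL;
	pte_t *start_pte, *pte;
	int i;

	/* Take the pmd lock first, so the pmd entry cannot change under us. */
	pml = pmd_lock(mm, pmd);
	start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
	if (!start_pte)
		goto out_ptl;
	if (ptl != pml)
		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);

	/* Check if it is empty PTE page */
	for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
		if (!pte_none(ptep_get(pte)))
			goto out_ptl;
	}
	pte_unmap(start_pte);

	pmd_clear(pmd);

	if (ptl != pml)
		spin_unlock(ptl);
	spin_unlock(pml);

	free_pte(mm, addr, tlb, pmdval);

	return;
out_ptl:
	if (start_pte)
		pte_unmap_unlock(start_pte, ptl);
	if (pml != ptl)
		spin_unlock(pml);
}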
Re: [PATCH v2 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
Posted by Qi Zheng 3 weeks, 3 days ago
Hi Jann,

On 2024/11/8 07:35, Jann Horn wrote:
> On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>> As a first step, this commit aims to synchronously free the empty PTE
>> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
>> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
>> cases other than madvise(MADV_DONTNEED).
>>
>> Once an empty PTE is detected, we first try to hold the pmd lock within
>> the pte lock. If successful, we clear the pmd entry directly (fast path).
>> Otherwise, we wait until the pte lock is released, then re-hold the pmd
>> and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
>> whether the PTE page is empty and free it (slow path).
> 
> How does this interact with move_pages_pte()? I am looking at your
> series applied on top of next-20241106, and it looks to me like
> move_pages_pte() uses pte_offset_map_rw_nolock() and assumes that the
> PMD entry can't change. You can clearly see this assumption at the
> WARN_ON_ONCE(pmd_none(*dst_pmd)). And if we race the wrong way, I

In move_pages_pte(), the following conditions may indeed be triggered:

	/* Sanity checks before the operation */
	if (WARN_ON_ONCE(pmd_none(*dst_pmd)) ||	WARN_ON_ONCE(pmd_none(*src_pmd)) ||
	    WARN_ON_ONCE(pmd_trans_huge(*dst_pmd)) || 
WARN_ON_ONCE(pmd_trans_huge(*src_pmd))) {
		err = -EINVAL;
		goto out;
	}

But maybe we can just remove the WARN_ON_ONCE(), because...

> think for example move_present_pte() can end up moving a present PTE
> into a page table that has already been scheduled for RCU freeing.

...this situation cannot happen. Before performing the move
operation, the pte_same() check is performed after taking the
pte lock, which ensures that the PTE page is stable:

CPU 0                    CPU 1

zap_pte_range

			orig_src_pte = ptep_get(src_pte);

pmd_lock
pte_lock
check if all PTEs are pte_none()
--> clear pmd entry
unlock pte
unlock pmd

			src_pte_lock
			pte_same(orig_src_pte, ptep_get(src_pte))
			--> return false and will skip the move op


> 
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   include/linux/mm.h |  1 +
>>   mm/Kconfig         | 15 ++++++++++
>>   mm/Makefile        |  1 +
>>   mm/internal.h      | 23 ++++++++++++++++
>>   mm/madvise.c       |  4 ++-
>>   mm/memory.c        | 45 +++++++++++++++++++++++++++++-
>>   mm/pt_reclaim.c    | 68 ++++++++++++++++++++++++++++++++++++++++++++++
>>   7 files changed, 155 insertions(+), 2 deletions(-)
>>   create mode 100644 mm/pt_reclaim.c
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 3e4bb43035953..ce3936590fe72 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2319,6 +2319,7 @@ extern void pagefault_out_of_memory(void);
>>   struct zap_details {
>>          struct folio *single_folio;     /* Locked folio to be unmapped */
>>          bool even_cows;                 /* Zap COWed private pages too? */
>> +       bool reclaim_pt;                /* Need reclaim page tables? */
>>          zap_flags_t zap_flags;          /* Extra flags for zapping */
>>   };
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 84000b0168086..681909e0a9fa3 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -1301,6 +1301,21 @@ config ARCH_HAS_USER_SHADOW_STACK
>>            The architecture has hardware support for userspace shadow call
>>             stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>>
>> +config ARCH_SUPPORTS_PT_RECLAIM
>> +       def_bool n
>> +
>> +config PT_RECLAIM
>> +       bool "reclaim empty user page table pages"
>> +       default y
>> +       depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
>> +       select MMU_GATHER_RCU_TABLE_FREE
>> +       help
>> +         Try to reclaim empty user page table pages in paths other that munmap
> 
> nit: s/other that/other than/

will fix.

> 
>> +         and exit_mmap path.
>> +
>> +         Note: now only empty user PTE page table pages will be reclaimed.
> [...]
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 0ceae57da7dad..ee88652761d45 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -851,7 +851,9 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>>   static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
>>                                          unsigned long start, unsigned long end)
>>   {
>> -       zap_page_range_single(vma, start, end - start, NULL);
>> +       struct zap_details details = {.reclaim_pt = true,};
>> +
>> +       zap_page_range_single(vma, start, end - start, &details);
>>          return 0;
>>   }
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 002aa4f454fa0..c4a8c18fbcfd7 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1436,7 +1436,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>>   static inline bool should_zap_cows(struct zap_details *details)
>>   {
>>          /* By default, zap all pages */
>> -       if (!details)
>> +       if (!details || details->reclaim_pt)
>>                  return true;
>>
>>          /* Or, we zap COWed pages only if the caller wants to */
> 
> This looks hacky - when we have a "details" object, its ->even_cows
> member is supposed to indicate whether COW pages should be zapped. So
> please instead set .even_cows=true in madvise_dontneed_single_vma().

But the details->reclaim_pt should continue to be set, right? Because
we need to use .reclaim_pt to indicate whether the empty PTE page
should be reclaimed.

> 
>> @@ -1678,6 +1678,30 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
>>                                           details, rss);
>>   }
>>
>> +static inline int count_pte_none(pte_t *pte, int nr)
>> +{
>> +       int none_nr = 0;
>> +
>> +       /*
>> +        * If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be
>> +        * re-installed, so we need to check pte_none() one by one.
>> +        * Otherwise, checking a single PTE in a batch is sufficient.
>> +        */
> 
> Please also add a comment noting that count_pte_none() relies on this
> invariant on top of do_zap_pte_range() or somewhere like that.

OK, will add a comment above do_zap_pte_range() to explain this special
case.

> 
>> +#ifdef CONFIG_PTE_MARKER_UFFD_WP
>> +       for (;;) {
>> +               if (pte_none(ptep_get(pte)))
>> +                       none_nr++;
>> +               if (--nr == 0)
>> +                       break;
>> +               pte++;
>> +       }
>> +#else
>> +       if (pte_none(ptep_get(pte)))
>> +               none_nr = nr;
>> +#endif
>> +       return none_nr;
>> +}
>> +
>>   static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>                                  struct vm_area_struct *vma, pmd_t *pmd,
>>                                  unsigned long addr, unsigned long end,
>> @@ -1689,8 +1713,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>          spinlock_t *ptl;
>>          pte_t *start_pte;
>>          pte_t *pte;
>> +       pmd_t pmdval;
>> +       bool can_reclaim_pt = false;
>> +       bool direct_reclaim = false;
>> +       unsigned long start = addr;
>> +       int none_nr = 0;
>>          int nr;
>>
>> +       if (details && details->reclaim_pt && (end - start >= PMD_SIZE))
>> +               can_reclaim_pt = true;
>> +
>>   retry:
>>          tlb_change_page_size(tlb, PAGE_SIZE);
>>          init_rss_vec(rss);
>> @@ -1706,12 +1738,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>
>>                  nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, rss,
>>                                        &force_flush, &force_break);
>> +               none_nr += count_pte_none(pte, nr);
>>                  if (unlikely(force_break)) {
>>                          addr += nr * PAGE_SIZE;
>>                          break;
>>                  }
>>          } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>>
>> +       if (addr == end && can_reclaim_pt && (none_nr == PTRS_PER_PTE))
>> +               direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
>> +
>>          add_mm_rss_vec(mm, rss);
>>          arch_leave_lazy_mmu_mode();
>>
>> @@ -1738,6 +1774,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>                  goto retry;
>>          }
>>
>> +       if (can_reclaim_pt) {
>> +               if (direct_reclaim)
>> +                       free_pte(mm, start, tlb, pmdval);
>> +               else
>> +                       try_to_free_pte(mm, pmd, start, tlb);
>> +       }
>> +
>>          return addr;
>>   }
>>
>> diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
>> new file mode 100644
>> index 0000000000000..fc055da40b615
>> --- /dev/null
>> +++ b/mm/pt_reclaim.c
>> @@ -0,0 +1,68 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +#include <linux/hugetlb.h>
>> +#include <asm-generic/tlb.h>
>> +#include <asm/pgalloc.h>
>> +
>> +#include "internal.h"
>> +
>> +bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval)
>> +{
>> +       spinlock_t *pml = pmd_lockptr(mm, pmd);
>> +
>> +       if (!spin_trylock(pml))
>> +               return false;
>> +
>> +       *pmdval = pmdp_get_lockless(pmd);
>> +       pmd_clear(pmd);
>> +       spin_unlock(pml);
>> +
>> +       return true;
>> +}
>> +
>> +void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
>> +             pmd_t pmdval)
>> +{
>> +       pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
>> +       mm_dec_nr_ptes(mm);
>> +}
>> +
>> +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
>> +                    struct mmu_gather *tlb)
>> +{
>> +       pmd_t pmdval;
>> +       spinlock_t *pml, *ptl;
>> +       pte_t *start_pte, *pte;
>> +       int i;
>> +
> 
> If you swap the following two operations around (first pmd_lock(),
> then pte_offset_map_rw_nolock(), then take the PTE lock):

Indeed, will change to that ordering.

Thanks,
Qi

> 
>> +       start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
>> +       if (!start_pte)
>> +               return;
>> +
>> +       pml = pmd_lock(mm, pmd);
>> +       if (ptl != pml)
>> +               spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>> +
> 
> Then I think this check can go away:
> 
>> +       if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
>> +               goto out_ptl;
>> +
>> +       /* Check if it is empty PTE page */
>> +       for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
>> +               if (!pte_none(ptep_get(pte)))
>> +                       goto out_ptl;
>> +       }
>> +       pte_unmap(start_pte);
>> +
>> +       pmd_clear(pmd);
>> +
>> +       if (ptl != pml)
>> +               spin_unlock(ptl);
>> +       spin_unlock(pml);
>> +
>> +       free_pte(mm, addr, tlb, pmdval);
>> +
>> +       return;
>> +out_ptl:
>> +       pte_unmap_unlock(start_pte, ptl);
>> +       if (pml != ptl)
>> +               spin_unlock(pml);
>> +}
>> --
>> 2.20.1
>>
Re: [PATCH v2 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
Posted by Jann Horn 3 weeks, 2 days ago
On Fri, Nov 8, 2024 at 8:13 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> On 2024/11/8 07:35, Jann Horn wrote:
> > On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> >> As a first step, this commit aims to synchronously free the empty PTE
> >> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
> >> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
> >> cases other than madvise(MADV_DONTNEED).
> >>
> >> Once an empty PTE is detected, we first try to hold the pmd lock within
> >> the pte lock. If successful, we clear the pmd entry directly (fast path).
> >> Otherwise, we wait until the pte lock is released, then re-hold the pmd
> >> and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
> >> whether the PTE page is empty and free it (slow path).
> >
> > How does this interact with move_pages_pte()? I am looking at your
> > series applied on top of next-20241106, and it looks to me like
> > move_pages_pte() uses pte_offset_map_rw_nolock() and assumes that the
> > PMD entry can't change. You can clearly see this assumption at the
> > WARN_ON_ONCE(pmd_none(*dst_pmd)). And if we race the wrong way, I
>
> In move_pages_pte(), the following conditions may indeed be triggered:
>
>         /* Sanity checks before the operation */
>         if (WARN_ON_ONCE(pmd_none(*dst_pmd)) || WARN_ON_ONCE(pmd_none(*src_pmd)) ||
>             WARN_ON_ONCE(pmd_trans_huge(*dst_pmd)) ||
> WARN_ON_ONCE(pmd_trans_huge(*src_pmd))) {
>                 err = -EINVAL;
>                 goto out;
>         }
>
> But maybe we can just remove the WARN_ON_ONCE(), because...
>
> > think for example move_present_pte() can end up moving a present PTE
> > into a page table that has already been scheduled for RCU freeing.
>
> ...this situation cannot happen. Before performing the move
> operation, the pte_same() check is performed after taking the
> pte lock, which ensures that the PTE page is stable:
>
> CPU 0                    CPU 1
>
> zap_pte_range
>
>                         orig_src_pte = ptep_get(src_pte);
>
> pmd_lock
> pte_lock
> check if all PTEs are pte_none()
> --> clear pmd entry
> unlock pte
> unlock pmd
>
>                         src_pte_lock
>                         pte_same(orig_src_pte, ptep_get(src_pte))
>                         --> return false and will skip the move op

Yes, that works for the source PTE. But what about the destination?

Operations on the destination PTE in move_pages_pte() are, when moving
a normal present source PTE pointing to an order-0 page, and assuming
that the optimistic folio_trylock(src_folio) and
anon_vma_trylock_write(src_anon_vma) succeed:

dst_pte = pte_offset_map_rw_nolock(mm, dst_pmd, dst_addr,
&dummy_pmdval, &dst_ptl)
[check that dst_pte is non-NULL]
some racy WARN_ON_ONCE() checks
spin_lock(dst_ptl);
orig_dst_pte = ptep_get(dst_pte);
spin_unlock(dst_ptl);
[bail if orig_dst_pte isn't none]
double_pt_lock(dst_ptl, src_ptl)
[check pte_same(ptep_get(dst_pte), orig_dst_pte)]

and then we're past the point of no return. Note that there is a
pte_same() check against orig_dst_pte, but pte_none(orig_dst_pte) is
intentionally pte_none(), so the pte_same() check does not guarantee
that the destination page table is still linked in.

> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index 002aa4f454fa0..c4a8c18fbcfd7 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -1436,7 +1436,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> >>   static inline bool should_zap_cows(struct zap_details *details)
> >>   {
> >>          /* By default, zap all pages */
> >> -       if (!details)
> >> +       if (!details || details->reclaim_pt)
> >>                  return true;
> >>
> >>          /* Or, we zap COWed pages only if the caller wants to */
> >
> > This looks hacky - when we have a "details" object, its ->even_cows
> > member is supposed to indicate whether COW pages should be zapped. So
> > please instead set .even_cows=true in madvise_dontneed_single_vma().
>
> But the details->reclaim_pt should continue to be set, right? Because
> we need to use .reclaim_pt to indicate whether the empty PTE page
> should be reclaimed.

Yeah, you should set both.
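
Concretely, the agreed-on direction would roughly be (a sketch, not the
eventual respin):

static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
					unsigned long start, unsigned long end)
{
	/* Keep reclaim_pt, and make COW zapping explicit via even_cows. */
	struct zap_details details = {
		.reclaim_pt = true,
		.even_cows = true,
	};

	zap_page_range_single(vma, start, end - start, &details);
	return 0;
}

with should_zap_cows() left to check only !details and details->even_cows.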
Re: [PATCH v2 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
Posted by Qi Zheng 3 weeks, 2 days ago

On 2024/11/9 02:04, Jann Horn wrote:
> On Fri, Nov 8, 2024 at 8:13 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>> On 2024/11/8 07:35, Jann Horn wrote:
>>> On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>>> As a first step, this commit aims to synchronously free the empty PTE
>>>> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
>>>> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
>>>> cases other than madvise(MADV_DONTNEED).
>>>>
>>>> Once an empty PTE is detected, we first try to hold the pmd lock within
>>>> the pte lock. If successful, we clear the pmd entry directly (fast path).
>>>> Otherwise, we wait until the pte lock is released, then re-hold the pmd
>>>> and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
>>>> whether the PTE page is empty and free it (slow path).
>>>
>>> How does this interact with move_pages_pte()? I am looking at your
>>> series applied on top of next-20241106, and it looks to me like
>>> move_pages_pte() uses pte_offset_map_rw_nolock() and assumes that the
>>> PMD entry can't change. You can clearly see this assumption at the
>>> WARN_ON_ONCE(pmd_none(*dst_pmd)). And if we race the wrong way, I
>>
>> In move_pages_pte(), the following conditions may indeed be triggered:
>>
>>          /* Sanity checks before the operation */
>>          if (WARN_ON_ONCE(pmd_none(*dst_pmd)) || WARN_ON_ONCE(pmd_none(*src_pmd)) ||
>>              WARN_ON_ONCE(pmd_trans_huge(*dst_pmd)) ||
>> WARN_ON_ONCE(pmd_trans_huge(*src_pmd))) {
>>                  err = -EINVAL;
>>                  goto out;
>>          }
>>
>> But maybe we can just remove the WARN_ON_ONCE(), because...
>>
>>> think for example move_present_pte() can end up moving a present PTE
>>> into a page table that has already been scheduled for RCU freeing.
>>
>> ...this situation cannot happen. Before performing the move
>> operation, the pte_same() check is performed after taking the
>> pte lock, which ensures that the PTE page is stable:
>>
>> CPU 0                    CPU 1
>>
>> zap_pte_range
>>
>>                          orig_src_pte = ptep_get(src_pte);
>>
>> pmd_lock
>> pte_lock
>> check if all PTEs are pte_none()
>> --> clear pmd entry
>> unlock pte
>> unlock pmd
>>
>>                          src_pte_lock
>>                          pte_same(orig_src_pte, ptep_get(src_pte))
>>                          --> return false and will skip the move op
> 
> Yes, that works for the source PTE. But what about the destination?
> 
> Operations on the destination PTE in move_pages_pte() are, when moving
> a normal present source PTE pointing to an order-0 page, and assuming
> that the optimistic folio_trylock(src_folio) and
> anon_vma_trylock_write(src_anon_vma) succeed:
> 
> dst_pte = pte_offset_map_rw_nolock(mm, dst_pmd, dst_addr,
> &dummy_pmdval, &dst_ptl)
> [check that dst_pte is non-NULL]
> some racy WARN_ON_ONCE() checks
> spin_lock(dst_ptl);
> orig_dst_pte = ptep_get(dst_pte);
> spin_unlock(dst_ptl);
> [bail if orig_dst_pte isn't none]
> double_pt_lock(dst_ptl, src_ptl)
> [check pte_same(ptep_get(dst_pte), orig_dst_pte)]
> 
> and then we're past the point of no return. Note that there is a
> pte_same() check against orig_dst_pte, but pte_none(orig_dst_pte) is
> intentionally pte_none(), so the pte_same() check does not guarantee
> that the destination page table is still linked in.

OK, now I see what you mean. This is indeed a problem. In this case,
it is still necessary to recheck pmd_same() to ensure the stability
of the dst_pte page. Will fix it.
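
Something along these lines inside move_pages_pte(), after taking both
PTE locks (a rough sketch; dst_pmdval/src_pmdval stand for the PMD
values captured when the PTE tables were mapped, and the error path
mirrors the existing -EAGAIN retry):

	double_pt_lock(dst_ptl, src_ptl);
	/*
	 * Recheck that both PTE pages are still linked into their PMD
	 * entries now that the PTE locks are held; otherwise they may
	 * already be scheduled for RCU freeing.
	 */
	if (unlikely(!pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd)) ||
		     !pmd_same(src_pmdval, pmdp_get_lockless(src_pmd)))) {
		double_pt_unlock(dst_ptl, src_ptl);
		err = -EAGAIN;
		goto out;
	}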

> 
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 002aa4f454fa0..c4a8c18fbcfd7 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -1436,7 +1436,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>>>>    static inline bool should_zap_cows(struct zap_details *details)
>>>>    {
>>>>           /* By default, zap all pages */
>>>> -       if (!details)
>>>> +       if (!details || details->reclaim_pt)
>>>>                   return true;
>>>>
>>>>           /* Or, we zap COWed pages only if the caller wants to */
>>>
>>> This looks hacky - when we have a "details" object, its ->even_cows
>>> member is supposed to indicate whether COW pages should be zapped. So
>>> please instead set .even_cows=true in madvise_dontneed_single_vma().
>>
>> But the details->reclaim_pt should continue to be set, right? Because
>> we need to use .reclaim_pt to indicate whether the empty PTE page
>> should be reclaimed.
> 
> Yeah, you should set both.

OK.

Thanks!