[PATCH v1 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)

Posted by Qi Zheng 1 month, 1 week ago
In pursuit of high performance, applications mostly use user-mode
memory allocators such as jemalloc or tcmalloc. These allocators use
madvise(MADV_DONTNEED or MADV_FREE) to release physical memory, but
neither MADV_DONTNEED nor MADV_FREE releases page table memory, which
may result in huge page table memory usage.

The following is a memory usage snapshot of one process that we
actually observed on our server:

        VIRT:  55t
        RES:   590g
        VmPTE: 110g

In this case, most of the page table entries are empty. For a PTE
page whose entries are all empty, we can free it back to the system
for others to use.

As a first step, this commit synchronously frees empty PTE pages in
the madvise(MADV_DONTNEED) case. Empty PTE pages are detected and
freed in zap_pte_range(), and zap_details.reclaim_pt is added to
exclude cases other than madvise(MADV_DONTNEED).

Once an empty PTE page is detected, we first try to take the pmd lock
while still holding the pte lock. If that succeeds, we clear the pmd
entry directly (fast path). Otherwise, we wait until the pte lock has
been released, then re-take the pmd and pte locks and loop over all
PTRS_PER_PTE entries, checking pte_none() to re-verify that the PTE
page is still empty before freeing it (slow path).

For other cases such as madvise(MADV_FREE), we may consider scanning
and freeing empty PTE pages asynchronously in the future.

The following code snippet shows the effect of the optimization:

        mmap 50G
        while (1) {
                for (; i < 1024 * 25; i++) {
                        touch 2M memory
                        madvise MADV_DONTNEED 2M
                }
        }
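
A minimal userspace sketch of the above loop might look as follows
(the memset()-based touch and the exact sizes and loop bounds are
illustrative assumptions rather than our exact test program):

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		const size_t chunk = 2UL << 20;	/* touch 2M per iteration */
		const size_t len = 50UL << 30;	/* 50G anonymous mapping */
		size_t i;
		char *buf;

		buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* watch VmPTE in /proc/<pid>/status from another shell */
		for (;;) {
			/* 1024 * 25 chunks of 2M cover the whole 50G mapping */
			for (i = 0; i < 1024 * 25; i++) {
				memset(buf + i * chunk, 1, chunk);
				madvise(buf + i * chunk, chunk, MADV_DONTNEED);
			}
		}
		return 0;
	}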

As we can see, the memory usage of VmPTE is reduced:

                        before                          after
VIRT                   50.0 GB                        50.0 GB
RES                     3.1 MB                         3.1 MB
VmPTE                102640 KB                         240 KB

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm.h |  1 +
 mm/Kconfig         | 14 ++++++++++
 mm/Makefile        |  1 +
 mm/internal.h      | 29 ++++++++++++++++++++
 mm/madvise.c       |  4 ++-
 mm/memory.c        | 47 +++++++++++++++++++++++++++-----
 mm/pt_reclaim.c    | 68 ++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 156 insertions(+), 8 deletions(-)
 create mode 100644 mm/pt_reclaim.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index df0a5eac66b78..667a466bb4649 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2321,6 +2321,7 @@ extern void pagefault_out_of_memory(void);
 struct zap_details {
 	struct folio *single_folio;	/* Locked folio to be unmapped */
 	bool even_cows;			/* Zap COWed private pages too? */
+	bool reclaim_pt;
 	zap_flags_t zap_flags;		/* Extra flags for zapping */
 };
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 4b2a1ef9a161c..f5993b9cc2a9f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1302,6 +1302,20 @@ config ARCH_HAS_USER_SHADOW_STACK
 	  The architecture has hardware support for userspace shadow call
           stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
 
+config ARCH_SUPPORTS_PT_RECLAIM
+	def_bool n
+
+config PT_RECLAIM
+	bool "reclaim empty user page table pages"
+	default y
+	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
+	select MMU_GATHER_RCU_TABLE_FREE
+	help
+	  Try to reclaim empty user page table pages in paths other than the
+	  munmap and exit_mmap paths.
+
+	  Note: currently only empty user PTE page table pages are reclaimed.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index d5639b0361663..9d816323d247a 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -145,3 +145,4 @@ obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
+obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
diff --git a/mm/internal.h b/mm/internal.h
index 906da6280c2df..4adaaea0917c0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1445,4 +1445,33 @@ static inline void accept_page(struct page *page)
 }
 #endif /* CONFIG_UNACCEPTED_MEMORY */
 
+#ifdef CONFIG_PT_RECLAIM
+static inline void set_pt_unreclaimable(bool *can_reclaim_pt)
+{
+	*can_reclaim_pt = false;
+}
+bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval);
+void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
+	      pmd_t pmdval);
+void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+		     struct mmu_gather *tlb);
+#else
+static inline void set_pt_unreclaimable(bool *can_reclaim_pt)
+{
+}
+static inline bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd,
+					 pmd_t *pmdval)
+{
+	return false;
+}
+static inline void free_pte(struct mm_struct *mm, unsigned long addr,
+			    struct mmu_gather *tlb, pmd_t pmdval)
+{
+}
+static inline void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd,
+				   unsigned long addr, struct mmu_gather *tlb)
+{
+}
+#endif /* CONFIG_PT_RECLAIM */
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index e871a72a6c329..82a6d15429da7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -843,7 +843,9 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 					unsigned long start, unsigned long end)
 {
-	zap_page_range_single(vma, start, end - start, NULL);
+	struct zap_details details = {.reclaim_pt = true,};
+
+	zap_page_range_single(vma, start, end - start, &details);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index cc89ede8ce2ab..77774b34f2cde 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1437,7 +1437,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 static inline bool should_zap_cows(struct zap_details *details)
 {
 	/* By default, zap all pages */
-	if (!details)
+	if (!details || details->reclaim_pt)
 		return true;
 
 	/* Or, we zap COWed pages only if the caller wants to */
@@ -1611,8 +1611,18 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	pte_t *start_pte;
 	pte_t *pte;
 	swp_entry_t entry;
+	pmd_t pmdval;
+	bool can_reclaim_pt = false;
+	bool direct_reclaim;
+	unsigned long start = addr;
 	int nr;
 
+	if (details && details->reclaim_pt)
+		can_reclaim_pt = true;
+
+	if ((ALIGN_DOWN(end, PMD_SIZE)) - (ALIGN(start, PMD_SIZE)) < PMD_SIZE)
+		can_reclaim_pt = false;
+
 retry:
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	init_rss_vec(rss);
@@ -1641,6 +1651,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
 					      addr, details, rss, &force_flush,
 					      &force_break, &is_pt_unreclaimable);
+			if (is_pt_unreclaimable)
+				set_pt_unreclaimable(&can_reclaim_pt);
 			if (unlikely(force_break)) {
 				addr += nr * PAGE_SIZE;
 				break;
@@ -1653,8 +1665,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		    is_device_exclusive_entry(entry)) {
 			page = pfn_swap_entry_to_page(entry);
 			folio = page_folio(page);
-			if (unlikely(!should_zap_folio(details, folio)))
+			if (unlikely(!should_zap_folio(details, folio))) {
+				set_pt_unreclaimable(&can_reclaim_pt);
 				continue;
+			}
 			/*
 			 * Both device private/exclusive mappings should only
 			 * work with anonymous page so far, so we don't need to
@@ -1670,14 +1684,18 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			max_nr = (end - addr) / PAGE_SIZE;
 			nr = swap_pte_batch(pte, max_nr, ptent);
 			/* Genuine swap entries, hence a private anon pages */
-			if (!should_zap_cows(details))
+			if (!should_zap_cows(details)) {
+				set_pt_unreclaimable(&can_reclaim_pt);
 				continue;
+			}
 			rss[MM_SWAPENTS] -= nr;
 			free_swap_and_cache_nr(entry, nr);
 		} else if (is_migration_entry(entry)) {
 			folio = pfn_swap_entry_folio(entry);
-			if (!should_zap_folio(details, folio))
+			if (!should_zap_folio(details, folio)) {
+				set_pt_unreclaimable(&can_reclaim_pt);
 				continue;
+			}
 			rss[mm_counter(folio)]--;
 		} else if (pte_marker_entry_uffd_wp(entry)) {
 			/*
@@ -1685,21 +1703,29 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			 * drop the marker if explicitly requested.
 			 */
 			if (!vma_is_anonymous(vma) &&
-			    !zap_drop_file_uffd_wp(details))
+			    !zap_drop_file_uffd_wp(details)) {
+				set_pt_unreclaimable(&can_reclaim_pt);
 				continue;
+			}
 		} else if (is_hwpoison_entry(entry) ||
 			   is_poisoned_swp_entry(entry)) {
-			if (!should_zap_cows(details))
+			if (!should_zap_cows(details)) {
+				set_pt_unreclaimable(&can_reclaim_pt);
 				continue;
+			}
 		} else {
 			/* We should have covered all the swap entry types */
 			pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
 			WARN_ON_ONCE(1);
 		}
 		clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
-		zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
+		if (zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent))
+			set_pt_unreclaimable(&can_reclaim_pt);
 	} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
 
+	if (addr == end && can_reclaim_pt)
+		direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
+
 	add_mm_rss_vec(mm, rss);
 	arch_leave_lazy_mmu_mode();
 
@@ -1724,6 +1750,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		goto retry;
 	}
 
+	if (can_reclaim_pt) {
+		if (direct_reclaim)
+			free_pte(mm, start, tlb, pmdval);
+		else
+			try_to_free_pte(mm, pmd, start, tlb);
+	}
+
 	return addr;
 }
 
diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
new file mode 100644
index 0000000000000..fc055da40b615
--- /dev/null
+++ b/mm/pt_reclaim.c
@@ -0,0 +1,68 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/hugetlb.h>
+#include <asm-generic/tlb.h>
+#include <asm/pgalloc.h>
+
+#include "internal.h"
+
+bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval)
+{
+	spinlock_t *pml = pmd_lockptr(mm, pmd);
+
+	if (!spin_trylock(pml))
+		return false;
+
+	*pmdval = pmdp_get_lockless(pmd);
+	pmd_clear(pmd);
+	spin_unlock(pml);
+
+	return true;
+}
+
+void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
+	      pmd_t pmdval)
+{
+	pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
+	mm_dec_nr_ptes(mm);
+}
+
+void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+		     struct mmu_gather *tlb)
+{
+	pmd_t pmdval;
+	spinlock_t *pml, *ptl;
+	pte_t *start_pte, *pte;
+	int i;
+
+	start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
+	if (!start_pte)
+		return;
+
+	pml = pmd_lock(mm, pmd);
+	if (ptl != pml)
+		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
+		goto out_ptl;
+
+	/* Check if it is empty PTE page */
+	for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
+		if (!pte_none(ptep_get(pte)))
+			goto out_ptl;
+	}
+	pte_unmap(start_pte);
+
+	pmd_clear(pmd);
+
+	if (ptl != pml)
+		spin_unlock(ptl);
+	spin_unlock(pml);
+
+	free_pte(mm, addr, tlb, pmdval);
+
+	return;
+out_ptl:
+	pte_unmap_unlock(start_pte, ptl);
+	if (pml != ptl)
+		spin_unlock(pml);
+}
-- 
2.20.1
Re: [PATCH v1 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
Posted by Jann Horn 1 month, 1 week ago
+arm64 maintainers in case they have opinions on the break-before-make aspects

On Thu, Oct 17, 2024 at 11:48 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> Now in order to pursue high performance, applications mostly use some
> high-performance user-mode memory allocators, such as jemalloc or
> tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE)
> to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will
> release page table memory, which may cause huge page table memory usage.
>
> The following are a memory usage snapshot of one process which actually
> happened on our server:
>
>         VIRT:  55t
>         RES:   590g
>         VmPTE: 110g
>
> In this case, most of the page table entries are empty. For such a PTE
> page where all entries are empty, we can actually free it back to the
> system for others to use.
>
> As a first step, this commit aims to synchronously free the empty PTE
> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
> cases other than madvise(MADV_DONTNEED).
>
> Once an empty PTE is detected, we first try to hold the pmd lock within
> the pte lock. If successful, we clear the pmd entry directly (fast path).
> Otherwise, we wait until the pte lock is released, then re-hold the pmd
> and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
> whether the PTE page is empty and free it (slow path).
>
> For other cases such as madvise(MADV_FREE), consider scanning and freeing
> empty PTE pages asynchronously in the future.

One thing I find somewhat scary about this is that it makes it
possible to free page tables in anonymous mappings, and to free page
tables of VMAs with an ->anon_vma, which was not possible before. Have
you checked all the current users of pte_offset_map_ro_nolock(),
pte_offset_map_rw_nolock(), and pte_offset_map() to make sure none of
them assume that this can't happen?

For example, pte_offset_map_rw_nolock() is called from move_ptes(),
with a comment basically talking about how this is safe *because only
khugepaged can remove page tables*.

> diff --git a/mm/memory.c b/mm/memory.c
> index cc89ede8ce2ab..77774b34f2cde 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1437,7 +1437,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  static inline bool should_zap_cows(struct zap_details *details)
>  {
>         /* By default, zap all pages */
> -       if (!details)
> +       if (!details || details->reclaim_pt)
>                 return true;
>
>         /* Or, we zap COWed pages only if the caller wants to */
> @@ -1611,8 +1611,18 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>         pte_t *start_pte;
>         pte_t *pte;
>         swp_entry_t entry;
> +       pmd_t pmdval;
> +       bool can_reclaim_pt = false;
> +       bool direct_reclaim;
> +       unsigned long start = addr;
>         int nr;
>
> +       if (details && details->reclaim_pt)
> +               can_reclaim_pt = true;
> +
> +       if ((ALIGN_DOWN(end, PMD_SIZE)) - (ALIGN(start, PMD_SIZE)) < PMD_SIZE)
> +               can_reclaim_pt = false;

Does this check actually work? Assuming we're on x86, if you pass in
start=0x1000 and end=0x2000, if I understand correctly,
ALIGN_DOWN(end, PMD_SIZE) will be 0, while ALIGN(start, PMD_SIZE) will
be 0x200000, and so we will check:

if (0 - 0x200000 < PMD_SIZE)

which is

if (0xffffffffffe00000 < 0x200000)

which is false?
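
For reference, the wraparound can be reproduced in userspace with
simplified stand-ins for the kernel macros (PMD_SIZE hard-coded to
the x86-64 value purely for illustration):

	#include <stdio.h>

	#define PMD_SIZE	(1UL << 21)	/* 2M on x86-64 */
	#define ALIGN(x, a)	(((x) + (a) - 1) & ~((a) - 1))
	#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))

	int main(void)
	{
		unsigned long start = 0x1000, end = 0x2000;

		/* 0 - 0x200000 wraps to 0xffffffffffe00000, so this prints 0 */
		printf("%d\n",
		       (ALIGN_DOWN(end, PMD_SIZE) - ALIGN(start, PMD_SIZE)) < PMD_SIZE);
		return 0;
	}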

>  retry:
>         tlb_change_page_size(tlb, PAGE_SIZE);
>         init_rss_vec(rss);
> @@ -1641,6 +1651,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                         nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
>                                               addr, details, rss, &force_flush,
>                                               &force_break, &is_pt_unreclaimable);
> +                       if (is_pt_unreclaimable)
> +                               set_pt_unreclaimable(&can_reclaim_pt);
>                         if (unlikely(force_break)) {
>                                 addr += nr * PAGE_SIZE;
>                                 break;
> @@ -1653,8 +1665,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                     is_device_exclusive_entry(entry)) {
>                         page = pfn_swap_entry_to_page(entry);
>                         folio = page_folio(page);
> -                       if (unlikely(!should_zap_folio(details, folio)))
> +                       if (unlikely(!should_zap_folio(details, folio))) {
> +                               set_pt_unreclaimable(&can_reclaim_pt);
>                                 continue;
> +                       }
>                         /*
>                          * Both device private/exclusive mappings should only
>                          * work with anonymous page so far, so we don't need to
> @@ -1670,14 +1684,18 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                         max_nr = (end - addr) / PAGE_SIZE;
>                         nr = swap_pte_batch(pte, max_nr, ptent);
>                         /* Genuine swap entries, hence a private anon pages */
> -                       if (!should_zap_cows(details))
> +                       if (!should_zap_cows(details)) {
> +                               set_pt_unreclaimable(&can_reclaim_pt);
>                                 continue;
> +                       }
>                         rss[MM_SWAPENTS] -= nr;
>                         free_swap_and_cache_nr(entry, nr);
>                 } else if (is_migration_entry(entry)) {
>                         folio = pfn_swap_entry_folio(entry);
> -                       if (!should_zap_folio(details, folio))
> +                       if (!should_zap_folio(details, folio)) {
> +                               set_pt_unreclaimable(&can_reclaim_pt);
>                                 continue;
> +                       }
>                         rss[mm_counter(folio)]--;
>                 } else if (pte_marker_entry_uffd_wp(entry)) {
>                         /*
> @@ -1685,21 +1703,29 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                          * drop the marker if explicitly requested.
>                          */
>                         if (!vma_is_anonymous(vma) &&
> -                           !zap_drop_file_uffd_wp(details))
> +                           !zap_drop_file_uffd_wp(details)) {
> +                               set_pt_unreclaimable(&can_reclaim_pt);
>                                 continue;
> +                       }
>                 } else if (is_hwpoison_entry(entry) ||
>                            is_poisoned_swp_entry(entry)) {
> -                       if (!should_zap_cows(details))
> +                       if (!should_zap_cows(details)) {
> +                               set_pt_unreclaimable(&can_reclaim_pt);
>                                 continue;
> +                       }
>                 } else {
>                         /* We should have covered all the swap entry types */
>                         pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
>                         WARN_ON_ONCE(1);
>                 }
>                 clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
> -               zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
> +               if (zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent))
> +                       set_pt_unreclaimable(&can_reclaim_pt);
>         } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>
> +       if (addr == end && can_reclaim_pt)
> +               direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
> +
>         add_mm_rss_vec(mm, rss);
>         arch_leave_lazy_mmu_mode();
>
> @@ -1724,6 +1750,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                 goto retry;
>         }
>
> +       if (can_reclaim_pt) {
> +               if (direct_reclaim)
> +                       free_pte(mm, start, tlb, pmdval);
> +               else
> +                       try_to_free_pte(mm, pmd, start, tlb);
> +       }
> +
>         return addr;
>  }
>
> diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
> new file mode 100644
> index 0000000000000..fc055da40b615
> --- /dev/null
> +++ b/mm/pt_reclaim.c
> @@ -0,0 +1,68 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/hugetlb.h>
> +#include <asm-generic/tlb.h>
> +#include <asm/pgalloc.h>
> +
> +#include "internal.h"
> +
> +bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval)
> +{
> +       spinlock_t *pml = pmd_lockptr(mm, pmd);
> +
> +       if (!spin_trylock(pml))
> +               return false;
> +
> +       *pmdval = pmdp_get_lockless(pmd);
> +       pmd_clear(pmd);
> +       spin_unlock(pml);
> +
> +       return true;
> +}
> +
> +void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
> +             pmd_t pmdval)
> +{
> +       pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
> +       mm_dec_nr_ptes(mm);
> +}
> +
> +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
> +                    struct mmu_gather *tlb)
> +{
> +       pmd_t pmdval;
> +       spinlock_t *pml, *ptl;
> +       pte_t *start_pte, *pte;
> +       int i;
> +
> +       start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
> +       if (!start_pte)
> +               return;
> +
> +       pml = pmd_lock(mm, pmd);
> +       if (ptl != pml)
> +               spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +
> +       if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
> +               goto out_ptl;
> +
> +       /* Check if it is empty PTE page */
> +       for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
> +               if (!pte_none(ptep_get(pte)))
> +                       goto out_ptl;
> +       }
> +       pte_unmap(start_pte);
> +
> +       pmd_clear(pmd);
> +
> +       if (ptl != pml)
> +               spin_unlock(ptl);
> +       spin_unlock(pml);

At this point, you have cleared the PMD and dropped the locks
protecting against concurrency, but have not yet done a TLB flush. If
another thread concurrently repopulates the PMD at this point, can we
get incoherent TLB state in a way that violates the arm64
break-before-make rule?

Though I guess we can probably already violate break-before-make if
MADV_DONTNEED races with a pagefault, since zap_present_folio_ptes()
does not seem to set "force_flush" when zapping anon PTEs...

(I realize you're only enabling this for x86 for now, but we should
probably make sure the code is not arch-dependent in subtle
undocumented ways...)

> +       free_pte(mm, addr, tlb, pmdval);
> +
> +       return;
> +out_ptl:
> +       pte_unmap_unlock(start_pte, ptl);
> +       if (pml != ptl)
> +               spin_unlock(pml);
> +}
> --
> 2.20.1
>
Re: [PATCH v1 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
Posted by Will Deacon 1 month ago
On Thu, Oct 17, 2024 at 08:43:43PM +0200, Jann Horn wrote:
> +arm64 maintainers in case they have opinions on the break-before-make aspects

Thanks, Jann.

> On Thu, Oct 17, 2024 at 11:48 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> > +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
> > +                    struct mmu_gather *tlb)
> > +{
> > +       pmd_t pmdval;
> > +       spinlock_t *pml, *ptl;
> > +       pte_t *start_pte, *pte;
> > +       int i;
> > +
> > +       start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
> > +       if (!start_pte)
> > +               return;
> > +
> > +       pml = pmd_lock(mm, pmd);
> > +       if (ptl != pml)
> > +               spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> > +
> > +       if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
> > +               goto out_ptl;
> > +
> > +       /* Check if it is empty PTE page */
> > +       for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
> > +               if (!pte_none(ptep_get(pte)))
> > +                       goto out_ptl;
> > +       }
> > +       pte_unmap(start_pte);
> > +
> > +       pmd_clear(pmd);
> > +
> > +       if (ptl != pml)
> > +               spin_unlock(ptl);
> > +       spin_unlock(pml);
> 
> At this point, you have cleared the PMD and dropped the locks
> protecting against concurrency, but have not yet done a TLB flush. If
> another thread concurrently repopulates the PMD at this point, can we
> get incoherent TLB state in a way that violates the arm64
> break-before-make rule?

Sounds like it, yes, unless there's something that constrains the new
PMD value to be some function of what it was in the first place?

Will
Re: [PATCH v1 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
Posted by Qi Zheng 1 month ago
Hi Will,

On 2024/10/24 21:21, Will Deacon wrote:
> On Thu, Oct 17, 2024 at 08:43:43PM +0200, Jann Horn wrote:
>> +arm64 maintainers in case they have opinions on the break-before-make aspects
> 
> Thanks, Jann.
> 
>> On Thu, Oct 17, 2024 at 11:48 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>> +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
>>> +                    struct mmu_gather *tlb)
>>> +{
>>> +       pmd_t pmdval;
>>> +       spinlock_t *pml, *ptl;
>>> +       pte_t *start_pte, *pte;
>>> +       int i;
>>> +
>>> +       start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
>>> +       if (!start_pte)
>>> +               return;
>>> +
>>> +       pml = pmd_lock(mm, pmd);
>>> +       if (ptl != pml)
>>> +               spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>>> +
>>> +       if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
>>> +               goto out_ptl;
>>> +
>>> +       /* Check if it is empty PTE page */
>>> +       for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
>>> +               if (!pte_none(ptep_get(pte)))
>>> +                       goto out_ptl;
>>> +       }
>>> +       pte_unmap(start_pte);
>>> +
>>> +       pmd_clear(pmd);
>>> +
>>> +       if (ptl != pml)
>>> +               spin_unlock(ptl);
>>> +       spin_unlock(pml);
>>
>> At this point, you have cleared the PMD and dropped the locks
>> protecting against concurrency, but have not yet done a TLB flush. If
>> another thread concurrently repopulates the PMD at this point, can we
>> get incoherent TLB state in a way that violates the arm64
>> break-before-make rule?
> 
> Sounds like it, yes, unless there's something that constrains the new
> PMD value to be some function of what it was in the first place?

Thank you for taking a look at this! I have tried to detect this case
and flush the TLB in the page fault path. For details, please refer to
this RFC patch:

https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/

And more context here: 
https://lore.kernel.org/lkml/6f38cb19-9847-4f70-bbe7-06881bb016be@bytedance.com/

If necessary, I can rebase the RFC patch and resend it.

Thanks!

> 
> Will
Re: [PATCH v1 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
Posted by Qi Zheng 1 month, 1 week ago

On 2024/10/18 02:43, Jann Horn wrote:
> +arm64 maintainers in case they have opinions on the break-before-make aspects
> 
> On Thu, Oct 17, 2024 at 11:48 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>> Now in order to pursue high performance, applications mostly use some
>> high-performance user-mode memory allocators, such as jemalloc or
>> tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE)
>> to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will
>> release page table memory, which may cause huge page table memory usage.
>>
>> The following are a memory usage snapshot of one process which actually
>> happened on our server:
>>
>>          VIRT:  55t
>>          RES:   590g
>>          VmPTE: 110g
>>
>> In this case, most of the page table entries are empty. For such a PTE
>> page where all entries are empty, we can actually free it back to the
>> system for others to use.
>>
>> As a first step, this commit aims to synchronously free the empty PTE
>> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
>> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
>> cases other than madvise(MADV_DONTNEED).
>>
>> Once an empty PTE is detected, we first try to hold the pmd lock within
>> the pte lock. If successful, we clear the pmd entry directly (fast path).
>> Otherwise, we wait until the pte lock is released, then re-hold the pmd
>> and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
>> whether the PTE page is empty and free it (slow path).
>>
>> For other cases such as madvise(MADV_FREE), consider scanning and freeing
>> empty PTE pages asynchronously in the future.
> 
> One thing I find somewhat scary about this is that it makes it
> possible to free page tables in anonymous mappings, and to free page
> tables of VMAs with an ->anon_vma, which was not possible before. Have
> you checked all the current users of pte_offset_map_ro_nolock(),
> pte_offset_map_rw_nolock(), and pte_offset_map() to make sure none of
> them assume that this can't happen?

The users of pte_offset_map_ro_nolock() and pte_offset_map() only
perform read-only operations on the PTE page, and the rcu_read_lock()
taken in pte_offset_map_ro_nolock() and pte_offset_map() ensures that
the PTE page remains valid, so this is safe.

The users of pte_offset_map_rw_nolock() that perform a
pmd_same()/pte_same() check behave just like pte_offset_map_lock(),
so this is safe as well.

The users of pte_offset_map_rw_nolock() that do not perform a
pmd_same()/pte_same() check, namely:

1. copy_pte_range()
2. move_ptes()

both hold the mmap_lock exclusively, while we only hold the mmap_lock
in read mode when freeing page tables in anonymous mappings, so the
two cannot run concurrently; this is also safe.

> 
> For example, pte_offset_map_rw_nolock() is called from move_ptes(),
> with a comment basically talking about how this is safe *because only
> khugepaged can remove page tables*.

As mentioned above, it is also safe here.

> 
>> diff --git a/mm/memory.c b/mm/memory.c
>> index cc89ede8ce2ab..77774b34f2cde 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1437,7 +1437,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>>   static inline bool should_zap_cows(struct zap_details *details)
>>   {
>>          /* By default, zap all pages */
>> -       if (!details)
>> +       if (!details || details->reclaim_pt)
>>                  return true;
>>
>>          /* Or, we zap COWed pages only if the caller wants to */
>> @@ -1611,8 +1611,18 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>          pte_t *start_pte;
>>          pte_t *pte;
>>          swp_entry_t entry;
>> +       pmd_t pmdval;
>> +       bool can_reclaim_pt = false;
>> +       bool direct_reclaim;
>> +       unsigned long start = addr;
>>          int nr;
>>
>> +       if (details && details->reclaim_pt)
>> +               can_reclaim_pt = true;
>> +
>> +       if ((ALIGN_DOWN(end, PMD_SIZE)) - (ALIGN(start, PMD_SIZE)) < PMD_SIZE)
>> +               can_reclaim_pt = false;
> 
> Does this check actually work? Assuming we're on x86, if you pass in
> start=0x1000 and end=0x2000, if I understand correctly,
> ALIGN_DOWN(end, PMD_SIZE) will be 0, while ALIGN(start, PMD_SIZE) will
> be 0x200000, and so we will check:
> 
> if (0 - 0x200000 < PMD_SIZE)
> 
> which is
> 
> if (0xffffffffffe00000 < 0x200000)
> 
> which is false?

Oh, I missed this, it seems that we can just do:

if (end - start < PMD_SIZE)
	can_reclaim_pt = false;

> 
>>   retry:
>>          tlb_change_page_size(tlb, PAGE_SIZE);
>>          init_rss_vec(rss);
>> @@ -1641,6 +1651,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>                          nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
>>                                                addr, details, rss, &force_flush,
>>                                                &force_break, &is_pt_unreclaimable);
>> +                       if (is_pt_unreclaimable)
>> +                               set_pt_unreclaimable(&can_reclaim_pt);
>>                          if (unlikely(force_break)) {
>>                                  addr += nr * PAGE_SIZE;
>>                                  break;
>> @@ -1653,8 +1665,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>                      is_device_exclusive_entry(entry)) {
>>                          page = pfn_swap_entry_to_page(entry);
>>                          folio = page_folio(page);
>> -                       if (unlikely(!should_zap_folio(details, folio)))
>> +                       if (unlikely(!should_zap_folio(details, folio))) {
>> +                               set_pt_unreclaimable(&can_reclaim_pt);
>>                                  continue;
>> +                       }
>>                          /*
>>                           * Both device private/exclusive mappings should only
>>                           * work with anonymous page so far, so we don't need to
>> @@ -1670,14 +1684,18 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>                          max_nr = (end - addr) / PAGE_SIZE;
>>                          nr = swap_pte_batch(pte, max_nr, ptent);
>>                          /* Genuine swap entries, hence a private anon pages */
>> -                       if (!should_zap_cows(details))
>> +                       if (!should_zap_cows(details)) {
>> +                               set_pt_unreclaimable(&can_reclaim_pt);
>>                                  continue;
>> +                       }
>>                          rss[MM_SWAPENTS] -= nr;
>>                          free_swap_and_cache_nr(entry, nr);
>>                  } else if (is_migration_entry(entry)) {
>>                          folio = pfn_swap_entry_folio(entry);
>> -                       if (!should_zap_folio(details, folio))
>> +                       if (!should_zap_folio(details, folio)) {
>> +                               set_pt_unreclaimable(&can_reclaim_pt);
>>                                  continue;
>> +                       }
>>                          rss[mm_counter(folio)]--;
>>                  } else if (pte_marker_entry_uffd_wp(entry)) {
>>                          /*
>> @@ -1685,21 +1703,29 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>                           * drop the marker if explicitly requested.
>>                           */
>>                          if (!vma_is_anonymous(vma) &&
>> -                           !zap_drop_file_uffd_wp(details))
>> +                           !zap_drop_file_uffd_wp(details)) {
>> +                               set_pt_unreclaimable(&can_reclaim_pt);
>>                                  continue;
>> +                       }
>>                  } else if (is_hwpoison_entry(entry) ||
>>                             is_poisoned_swp_entry(entry)) {
>> -                       if (!should_zap_cows(details))
>> +                       if (!should_zap_cows(details)) {
>> +                               set_pt_unreclaimable(&can_reclaim_pt);
>>                                  continue;
>> +                       }
>>                  } else {
>>                          /* We should have covered all the swap entry types */
>>                          pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
>>                          WARN_ON_ONCE(1);
>>                  }
>>                  clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>> -               zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
>> +               if (zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent))
>> +                       set_pt_unreclaimable(&can_reclaim_pt);
>>          } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>>
>> +       if (addr == end && can_reclaim_pt)
>> +               direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
>> +
>>          add_mm_rss_vec(mm, rss);
>>          arch_leave_lazy_mmu_mode();
>>
>> @@ -1724,6 +1750,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>                  goto retry;
>>          }
>>
>> +       if (can_reclaim_pt) {
>> +               if (direct_reclaim)
>> +                       free_pte(mm, start, tlb, pmdval);
>> +               else
>> +                       try_to_free_pte(mm, pmd, start, tlb);
>> +       }
>> +
>>          return addr;
>>   }
>>
>> diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
>> new file mode 100644
>> index 0000000000000..fc055da40b615
>> --- /dev/null
>> +++ b/mm/pt_reclaim.c
>> @@ -0,0 +1,68 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +#include <linux/hugetlb.h>
>> +#include <asm-generic/tlb.h>
>> +#include <asm/pgalloc.h>
>> +
>> +#include "internal.h"
>> +
>> +bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval)
>> +{
>> +       spinlock_t *pml = pmd_lockptr(mm, pmd);
>> +
>> +       if (!spin_trylock(pml))
>> +               return false;
>> +
>> +       *pmdval = pmdp_get_lockless(pmd);
>> +       pmd_clear(pmd);
>> +       spin_unlock(pml);
>> +
>> +       return true;
>> +}
>> +
>> +void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
>> +             pmd_t pmdval)
>> +{
>> +       pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
>> +       mm_dec_nr_ptes(mm);
>> +}
>> +
>> +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
>> +                    struct mmu_gather *tlb)
>> +{
>> +       pmd_t pmdval;
>> +       spinlock_t *pml, *ptl;
>> +       pte_t *start_pte, *pte;
>> +       int i;
>> +
>> +       start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
>> +       if (!start_pte)
>> +               return;
>> +
>> +       pml = pmd_lock(mm, pmd);
>> +       if (ptl != pml)
>> +               spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>> +
>> +       if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
>> +               goto out_ptl;
>> +
>> +       /* Check if it is empty PTE page */
>> +       for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
>> +               if (!pte_none(ptep_get(pte)))
>> +                       goto out_ptl;
>> +       }
>> +       pte_unmap(start_pte);
>> +
>> +       pmd_clear(pmd);
>> +
>> +       if (ptl != pml)
>> +               spin_unlock(ptl);
>> +       spin_unlock(pml);
> 
> At this point, you have cleared the PMD and dropped the locks
> protecting against concurrency, but have not yet done a TLB flush. If
> another thread concurrently repopulates the PMD at this point, can we
> get incoherent TLB state in a way that violates the arm64
> break-before-make rule?
> 
> Though I guess we can probably already violate break-before-make if
> MADV_DONTNEED races with a pagefault, since zap_present_folio_ptes()
> does not seem to set "force_flush" when zapping anon PTEs...

Thanks for pointing this out! That's why I sent a separate patch
discussing this a while ago, but unfortunately I haven't gotten any
feedback yet. Please take a look:

https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/

Thanks!

> 
> (I realize you're only enabling this for x86 for now, but we should
> probably make sure the code is not arch-dependent in subtle
> undocumented ways...)
> 
>> +       free_pte(mm, addr, tlb, pmdval);
>> +
>> +       return;
>> +out_ptl:
>> +       pte_unmap_unlock(start_pte, ptl);
>> +       if (pml != ptl)
>> +               spin_unlock(pml);
>> +}
>> --
>> 2.20.1
>>
Re: [PATCH v1 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
Posted by Qi Zheng 1 month, 1 week ago

On 2024/10/18 10:53, Qi Zheng wrote:
> 
> 
> On 2024/10/18 02:43, Jann Horn wrote:
>> +arm64 maintainers in case they have opinions on the break-before-make 
>> aspects
>>

[snip]

>>> +
>>> +       pmd_clear(pmd);
>>> +
>>> +       if (ptl != pml)
>>> +               spin_unlock(ptl);
>>> +       spin_unlock(pml);
>>
>> At this point, you have cleared the PMD and dropped the locks
>> protecting against concurrency, but have not yet done a TLB flush. If
>> another thread concurrently repopulates the PMD at this point, can we
>> get incoherent TLB state in a way that violates the arm64
>> break-before-make rule?
>>
>> Though I guess we can probably already violate break-before-make if
>> MADV_DONTNEED races with a pagefault, since zap_present_folio_ptes()
>> does not seem to set "force_flush" when zapping anon PTEs...
> 
> Thanks for pointing this out! That's why I sent a separate patch
> discussing this a while ago, but unfortunately haven't gotten any
> feedback yet, please take a look:
> 
> https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/

More context here: 
https://lore.kernel.org/lkml/6f38cb19-9847-4f70-bbe7-06881bb016be@bytedance.com/

> 
> Thanks!
> 
>>
>> (I realize you're only enabling this for x86 for now, but we should
>> probably make sure the code is not arch-dependent in subtle
>> undocumented ways...)
>>
>>> +       free_pte(mm, addr, tlb, pmdval);
>>> +
>>> +       return;
>>> +out_ptl:
>>> +       pte_unmap_unlock(start_pte, ptl);
>>> +       if (pml != ptl)
>>> +               spin_unlock(pml);
>>> +}
>>> -- 
>>> 2.20.1
>>>