:p
atchew
Login
Hi all, Previously, we tried to use a completely asynchronous method to reclaim empty user PTE pages [1]. After discussing with David Hildenbrand, we decided to implement synchronous reclaimation in the case of madvise(MADV_DONTNEED) as the first step. So this series aims to synchronously scan and reclaim empty user PTE pages in zap_page_range_single() (madvise(MADV_DONTNEED) etc will invoke this). In zap_page_range_single(), mmu_gather is used to perform batch tlb flushing and page freeing operations. Therefore, if we want to free the empty PTE page in this path, the most natural way is to add it to mmu_gather as well. There are two problems that need to be solved here: 1. Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather will free page table pages by semi RCU: - batch table freeing: asynchronous free by RCU - single table freeing: IPI + synchronous free But this is not enough to free the empty PTE page table pages in paths other that munmap and exit_mmap path, because IPI cannot be synchronized with rcu_read_lock() in pte_offset_map{_lock}(). So we should let single table also be freed by RCU like batch table freeing. 2. When we use mmu_gather to batch flush tlb and free PTE pages, the TLB is not flushed before pmd lock is unlocked. This may result in the following two situations: 1) Userland can trigger page fault and fill a huge page, which will cause the existence of small size TLB and huge TLB for the same address. 2) Userland can also trigger page fault and fill a PTE page, which will cause the existence of two small size TLBs, but the PTE page they map are different. For case 1), according to Intel's TLB Application note (317080), some CPUs of x86 do not allow it: ``` If software modifies the paging structures so that the page size used for a 4-KByte range of linear addresses changes, the TLBs may subsequently contain both ordinary and large-page translations for the address range.12 A reference to a linear address in the address range may use either translation. Which of the two translations is used may vary from one execution to another and the choice may be implementation-specific. Software wishing to prevent this uncertainty should not write to a paging- structure entry in a way that would change, for any linear address, both the page size and either the page frame or attributes. It can instead use the following algorithm: first mark the relevant paging-structure entry (e.g., PDE) not present; then invalidate any translations for the affected linear addresses (see Section 5.2); and then modify the relevant paging-structure entry to mark it present and establish translation(s) for the new page size. ``` We can also learn more information from the comments above pmdp_invalidate() in __split_huge_pmd_locked(). For case 2), we can see from the comments above ptep_clear_flush() in wp_page_copy() that this situation is also not allowed. Even without this patch series, madvise(MADV_DONTNEED) can also cause this situation: CPU 0 CPU 1 madvise (MADV_DONTNEED) --> clear pte entry pte_unmap_unlock touch and tlb miss --> set pte entry mmu_gather flush tlb But strangely, I didn't see any relevant fix code, maybe I missed something, or is this guaranteed by userland? Anyway, this series defines the following two functions to be implemented by the architecture. If the architecture does not allow the above two situations, then define these two functions to flush the tlb before set_pmd_at(). - arch_flush_tlb_before_set_huge_page - arch_flush_tlb_before_set_pte_page As a first step, we supported this feature on x86_64 and selectd the newly introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM. In order to reduce overhead, we only handle the cases with a high probability of generating empty PTE pages, and other cases will be filtered out, such as: - hugetlb vma (unsuitable) - userfaultfd_wp vma (may reinstall the pte entry) - writable private file mapping case (COW-ed anon page is not zapped) - etc For userfaultfd_wp and writable private file mapping cases (and MADV_FREE case, of course), consider scanning and freeing empty PTE pages asynchronously in the future. This series is based on next-20240627. Comments and suggestions are welcome! Thanks, Qi [1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/ Qi Zheng (7): mm: pgtable: make pte_offset_map_nolock() return pmdval mm: introduce CONFIG_PT_RECLAIM mm: pass address information to pmd_install() mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single() x86: mm: free page table pages by RCU instead of semi RCU x86: mm: define arch_flush_tlb_before_set_huge_page x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64 Documentation/mm/split_page_table_lock.rst | 3 +- arch/arm/mm/fault-armv.c | 2 +- arch/powerpc/mm/pgtable.c | 2 +- arch/x86/Kconfig | 1 + arch/x86/include/asm/pgtable.h | 6 + arch/x86/include/asm/tlb.h | 23 ++++ arch/x86/kernel/paravirt.c | 7 ++ arch/x86/mm/pgtable.c | 15 ++- include/linux/hugetlb.h | 2 +- include/linux/mm.h | 13 +- include/linux/pgtable.h | 14 +++ mm/Kconfig | 14 +++ mm/Makefile | 1 + mm/debug_vm_pgtable.c | 2 +- mm/filemap.c | 4 +- mm/gup.c | 2 +- mm/huge_memory.c | 3 + mm/internal.h | 17 ++- mm/khugepaged.c | 24 +++- mm/memory.c | 21 ++-- mm/migrate_device.c | 2 +- mm/mmu_gather.c | 2 +- mm/mprotect.c | 8 +- mm/mremap.c | 4 +- mm/page_vma_mapped.c | 2 +- mm/pgtable-generic.c | 21 ++-- mm/pt_reclaim.c | 131 +++++++++++++++++++++ mm/userfaultfd.c | 10 +- mm/vmscan.c | 2 +- 29 files changed, 307 insertions(+), 51 deletions(-) create mode 100644 mm/pt_reclaim.c -- 2.20.1
Make pte_offset_map_nolock() return pmdval so that we can recheck the *pmd once the lock is taken. This is a preparation for freeing empty PTE pages, no functional changes are expected. Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> --- Documentation/mm/split_page_table_lock.rst | 3 ++- arch/arm/mm/fault-armv.c | 2 +- arch/powerpc/mm/pgtable.c | 2 +- include/linux/mm.h | 4 ++-- mm/filemap.c | 2 +- mm/khugepaged.c | 4 ++-- mm/memory.c | 4 ++-- mm/mremap.c | 2 +- mm/page_vma_mapped.c | 2 +- mm/pgtable-generic.c | 21 ++++++++++++--------- mm/userfaultfd.c | 4 ++-- mm/vmscan.c | 2 +- 12 files changed, 28 insertions(+), 24 deletions(-) diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst index XXXXXXX..XXXXXXX 100644 --- a/Documentation/mm/split_page_table_lock.rst +++ b/Documentation/mm/split_page_table_lock.rst @@ -XXX,XX +XXX,XX @@ There are helpers to lock/unlock a table and other accessor functions: pointer to its PTE table lock, or returns NULL if no PTE table; - pte_offset_map_nolock() maps PTE, returns pointer to PTE with pointer to its PTE table - lock (not taken), or returns NULL if no PTE table; + lock (not taken) and the value of its pmd entry, or returns NULL + if no PTE table; - pte_offset_map() maps PTE, returns pointer to PTE, or returns NULL if no PTE table; - pte_unmap() diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c index XXXXXXX..XXXXXXX 100644 --- a/arch/arm/mm/fault-armv.c +++ b/arch/arm/mm/fault-armv.c @@ -XXX,XX +XXX,XX @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address, * must use the nested version. This also means we need to * open-code the spin-locking. */ - pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl); + pte = pte_offset_map_nolock(vma->vm_mm, pmd, NULL, address, &ptl); if (!pte) return 0; diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index XXXXXXX..XXXXXXX 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -XXX,XX +XXX,XX @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr) */ if (pmd_none(*pmd)) return; - pte = pte_offset_map_nolock(mm, pmd, addr, &ptl); + pte = pte_offset_map_nolock(mm, pmd, NULL, addr, &ptl); BUG_ON(!pte); assert_spin_locked(ptl); pte_unmap(pte); diff --git a/include/linux/mm.h b/include/linux/mm.h index XXXXXXX..XXXXXXX 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -XXX,XX +XXX,XX @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd, return pte; } -pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd, - unsigned long addr, spinlock_t **ptlp); +pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdvalp, + unsigned long addr, spinlock_t **ptlp); #define pte_unmap_unlock(pte, ptl) do { \ spin_unlock(ptl); \ diff --git a/mm/filemap.c b/mm/filemap.c index XXXXXXX..XXXXXXX 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -XXX,XX +XXX,XX @@ static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf) if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID)) return 0; - ptep = pte_offset_map_nolock(vma->vm_mm, vmf->pmd, vmf->address, + ptep = pte_offset_map_nolock(vma->vm_mm, vmf->pmd, NULL, vmf->address, &vmf->ptl); if (unlikely(!ptep)) return VM_FAULT_NOPAGE; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index XXXXXXX..XXXXXXX 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -XXX,XX +XXX,XX @@ static int __collapse_huge_page_swapin(struct mm_struct *mm, }; if (!pte++) { - pte = pte_offset_map_nolock(mm, pmd, address, &ptl); + pte = pte_offset_map_nolock(mm, pmd, NULL, address, &ptl); if (!pte) { mmap_read_unlock(mm); result = SCAN_PMD_NULL; @@ -XXX,XX +XXX,XX @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED)) pml = pmd_lock(mm, pmd); - start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl); + start_pte = pte_offset_map_nolock(mm, pmd, NULL, haddr, &ptl); if (!start_pte) /* mmap_lock + page lock should prevent this */ goto abort; if (!pml) diff --git a/mm/memory.c b/mm/memory.c index XXXXXXX..XXXXXXX 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -XXX,XX +XXX,XX @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, ret = -ENOMEM; goto out; } - src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl); + src_pte = pte_offset_map_nolock(src_mm, src_pmd, NULL, addr, &src_ptl); if (!src_pte) { pte_unmap_unlock(dst_pte, dst_ptl); /* ret == 0 */ @@ -XXX,XX +XXX,XX @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) * it into a huge pmd: just retry later if so. */ vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd, - vmf->address, &vmf->ptl); + NULL, vmf->address, &vmf->ptl); if (unlikely(!vmf->pte)) return 0; vmf->orig_pte = ptep_get_lockless(vmf->pte); diff --git a/mm/mremap.c b/mm/mremap.c index XXXXXXX..XXXXXXX 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -XXX,XX +XXX,XX @@ static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, err = -EAGAIN; goto out; } - new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl); + new_pte = pte_offset_map_nolock(mm, new_pmd, NULL, new_addr, &new_ptl); if (!new_pte) { pte_unmap_unlock(old_pte, old_ptl); err = -EAGAIN; diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index XXXXXXX..XXXXXXX 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -XXX,XX +XXX,XX @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp) * Though, in most cases, page lock already protects this. */ pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd, - pvmw->address, ptlp); + NULL, pvmw->address, ptlp); if (!pvmw->pte) return false; diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index XXXXXXX..XXXXXXX 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -XXX,XX +XXX,XX @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp) return NULL; } -pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd, +pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdvalp, unsigned long addr, spinlock_t **ptlp) { pmd_t pmdval; @@ -XXX,XX +XXX,XX @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd, pte = __pte_offset_map(pmd, addr, &pmdval); if (likely(pte)) *ptlp = pte_lockptr(mm, &pmdval); + if (pmdvalp) + *pmdvalp = pmdval; return pte; } @@ -XXX,XX +XXX,XX @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd, * and disconnected table. Until pte_unmap(pte) unmaps and rcu_read_unlock()s * afterwards. * - * pte_offset_map_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map(); - * but when successful, it also outputs a pointer to the spinlock in ptlp - as - * pte_offset_map_lock() does, but in this case without locking it. This helps - * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time - * act on a changed *pmd: pte_offset_map_nolock() provides the correct spinlock - * pointer for the page table that it returns. In principle, the caller should - * recheck *pmd once the lock is taken; in practice, no callsite needs that - - * either the mmap_lock for write, or pte_same() check on contents, is enough. + * pte_offset_map_nolock(mm, pmd, pmdvalp, addr, ptlp), above, is like + * pte_offset_map(); but when successful, it also outputs a pointer to the + * spinlock in ptlp - as pte_offset_map_lock() does, but in this case without + * locking it. This helps the caller to avoid a later pte_lockptr(mm, *pmd), + * which might by that time act on a changed *pmd: pte_offset_map_nolock() + * provides the correct spinlock pointer for the page table that it returns. + * In principle, the caller should recheck *pmd once the lock is taken; But in + * most cases, either the mmap_lock for write, or pte_same() check on contents, + * is enough. * * Note that free_pgtables(), used after unmapping detached vmas, or when * exiting the whole mm, does not take page table lock before freeing a page diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index XXXXXXX..XXXXXXX 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -XXX,XX +XXX,XX @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, src_addr, src_addr + PAGE_SIZE); mmu_notifier_invalidate_range_start(&range); retry: - dst_pte = pte_offset_map_nolock(mm, dst_pmd, dst_addr, &dst_ptl); + dst_pte = pte_offset_map_nolock(mm, dst_pmd, NULL, dst_addr, &dst_ptl); /* Retry if a huge pmd materialized from under us */ if (unlikely(!dst_pte)) { @@ -XXX,XX +XXX,XX @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, goto out; } - src_pte = pte_offset_map_nolock(mm, src_pmd, src_addr, &src_ptl); + src_pte = pte_offset_map_nolock(mm, src_pmd, NULL, src_addr, &src_ptl); /* * We held the mmap_lock for reading so MADV_DONTNEED diff --git a/mm/vmscan.c b/mm/vmscan.c index XXXXXXX..XXXXXXX 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -XXX,XX +XXX,XX @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, DEFINE_MAX_SEQ(walk->lruvec); int old_gen, new_gen = lru_gen_from_seq(max_seq); - pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl); + pte = pte_offset_map_nolock(args->mm, pmd, NULL, start & PMD_MASK, &ptl); if (!pte) return false; if (!spin_trylock(ptl)) { -- 2.20.1
This configuration variable will be used to build the code needed to free empty user page table pages. This feature is not available on all architectures yet, so ARCH_SUPPORTS_PT_RECLAIM is needed. We can remove it once all architectures support this feature. Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> --- mm/Kconfig | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/mm/Kconfig b/mm/Kconfig index XXXXXXX..XXXXXXX 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -XXX,XX +XXX,XX @@ config IOMMU_MM_DATA config EXECMEM bool +config ARCH_SUPPORTS_PT_RECLAIM + def_bool n + +config PT_RECLAIM + bool "reclaim empty user page table pages" + default y + depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP + select MMU_GATHER_RCU_TABLE_FREE + help + Try to reclaim empty user page table pages in paths other that munmap + and exit_mmap path. + + Note: now only empty user PTE page table pages will be reclaimed. + source "mm/damon/Kconfig" endmenu -- 2.20.1
In the subsequent implementation of freeing empty page table pages, we need the address information to flush tlb, so pass address to pmd_install() in advance. No functional changes. Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> --- include/linux/hugetlb.h | 2 +- include/linux/mm.h | 9 +++++---- mm/debug_vm_pgtable.c | 2 +- mm/filemap.c | 2 +- mm/gup.c | 2 +- mm/internal.h | 3 ++- mm/memory.c | 15 ++++++++------- mm/migrate_device.c | 2 +- mm/mprotect.c | 8 ++++---- mm/mremap.c | 2 +- mm/userfaultfd.c | 6 +++--- 11 files changed, 28 insertions(+), 25 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index XXXXXXX..XXXXXXX 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -XXX,XX +XXX,XX @@ static inline pte_t *pte_offset_huge(pmd_t *pmd, unsigned long address) static inline pte_t *pte_alloc_huge(struct mm_struct *mm, pmd_t *pmd, unsigned long address) { - return pte_alloc(mm, pmd) ? NULL : pte_offset_huge(pmd, address); + return pte_alloc(mm, pmd, address) ? NULL : pte_offset_huge(pmd, address); } #endif diff --git a/include/linux/mm.h b/include/linux/mm.h index XXXXXXX..XXXXXXX 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -XXX,XX +XXX,XX @@ static inline void mm_inc_nr_ptes(struct mm_struct *mm) {} static inline void mm_dec_nr_ptes(struct mm_struct *mm) {} #endif -int __pte_alloc(struct mm_struct *mm, pmd_t *pmd); +int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long addr); int __pte_alloc_kernel(pmd_t *pmd); #if defined(CONFIG_MMU) @@ -XXX,XX +XXX,XX @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdvalp, pte_unmap(pte); \ } while (0) -#define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd)) +#define pte_alloc(mm, pmd, addr) \ + (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd, addr)) #define pte_alloc_map(mm, pmd, address) \ - (pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address)) + (pte_alloc(mm, pmd, address) ? NULL : pte_offset_map(pmd, address)) #define pte_alloc_map_lock(mm, pmd, address, ptlp) \ - (pte_alloc(mm, pmd) ? \ + (pte_alloc(mm, pmd, address) ? \ NULL : pte_offset_map_lock(mm, pmd, address, ptlp)) #define pte_alloc_kernel(pmd, address) \ diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index XXXXXXX..XXXXXXX 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -XXX,XX +XXX,XX @@ static int __init init_args(struct pgtable_debug_args *args) args->start_pmdp = pmd_offset(args->pudp, 0UL); WARN_ON(!args->start_pmdp); - if (pte_alloc(args->mm, args->pmdp)) { + if (pte_alloc(args->mm, args->pmdp, args->vaddr)) { pr_err("Failed to allocate pte entries\n"); ret = -ENOMEM; goto error; diff --git a/mm/filemap.c b/mm/filemap.c index XXXXXXX..XXXXXXX 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -XXX,XX +XXX,XX @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio, } if (pmd_none(*vmf->pmd) && vmf->prealloc_pte) - pmd_install(mm, vmf->pmd, &vmf->prealloc_pte); + pmd_install(mm, vmf->pmd, vmf->address, &vmf->prealloc_pte); return false; } diff --git a/mm/gup.c b/mm/gup.c index XXXXXXX..XXXXXXX 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -XXX,XX +XXX,XX @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, spin_unlock(ptl); split_huge_pmd(vma, pmd, address); /* If pmd was left empty, stuff a page table in there quickly */ - return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) : + return pte_alloc(mm, pmd, address) ? ERR_PTR(-ENOMEM) : follow_page_pte(vma, address, pmd, flags, &ctx->pgmap); } page = follow_huge_pmd(vma, address, pmd, flags, ctx); diff --git a/mm/internal.h b/mm/internal.h index XXXXXXX..XXXXXXX 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -XXX,XX +XXX,XX @@ void folio_activate(struct folio *folio); void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, struct vm_area_struct *start_vma, unsigned long floor, unsigned long ceiling, bool mm_wr_locked); -void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte); +void pmd_install(struct mm_struct *mm, pmd_t *pmd, unsigned long addr, + pgtable_t *pte); struct zap_details; void unmap_page_range(struct mmu_gather *tlb, diff --git a/mm/memory.c b/mm/memory.c index XXXXXXX..XXXXXXX 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -XXX,XX +XXX,XX @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, } while (vma); } -void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) +void pmd_install(struct mm_struct *mm, pmd_t *pmd, unsigned long addr, + pgtable_t *pte) { spinlock_t *ptl = pmd_lock(mm, pmd); @@ -XXX,XX +XXX,XX @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) spin_unlock(ptl); } -int __pte_alloc(struct mm_struct *mm, pmd_t *pmd) +int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) { pgtable_t new = pte_alloc_one(mm); if (!new) return -ENOMEM; - pmd_install(mm, pmd, &new); + pmd_install(mm, pmd, addr, &new); if (new) pte_free(mm, new); return 0; @@ -XXX,XX +XXX,XX @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr, /* Allocate the PTE if necessary; takes PMD lock once only. */ ret = -ENOMEM; - if (pte_alloc(mm, pmd)) + if (pte_alloc(mm, pmd, addr)) goto out; while (pages_to_write_in_pmd) { @@ -XXX,XX +XXX,XX @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) * Use pte_alloc() instead of pte_alloc_map(), so that OOM can * be distinguished from a transient failure of pte_offset_map(). */ - if (pte_alloc(vma->vm_mm, vmf->pmd)) + if (pte_alloc(vma->vm_mm, vmf->pmd, vmf->address)) return VM_FAULT_OOM; /* Use the zero-page for reads */ @@ -XXX,XX +XXX,XX @@ vm_fault_t finish_fault(struct vm_fault *vmf) } if (vmf->prealloc_pte) - pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte); - else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) + pmd_install(vma->vm_mm, vmf->pmd, vmf->address, &vmf->prealloc_pte); + else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd, vmf->address))) return VM_FAULT_OOM; } diff --git a/mm/migrate_device.c b/mm/migrate_device.c index XXXXXXX..XXXXXXX 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -XXX,XX +XXX,XX @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, goto abort; if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp)) goto abort; - if (pte_alloc(mm, pmdp)) + if (pte_alloc(mm, pmdp, addr)) goto abort; if (unlikely(anon_vma_prepare(vma))) goto abort; diff --git a/mm/mprotect.c b/mm/mprotect.c index XXXXXXX..XXXXXXX 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -XXX,XX +XXX,XX @@ pgtable_populate_needed(struct vm_area_struct *vma, unsigned long cp_flags) * allocation failures during page faults by kicking OOM and returning * error. */ -#define change_pmd_prepare(vma, pmd, cp_flags) \ +#define change_pmd_prepare(vma, pmd, addr, cp_flags) \ ({ \ long err = 0; \ if (unlikely(pgtable_populate_needed(vma, cp_flags))) { \ - if (pte_alloc(vma->vm_mm, pmd)) \ + if (pte_alloc(vma->vm_mm, pmd, addr)) \ err = -ENOMEM; \ } \ err; \ @@ -XXX,XX +XXX,XX @@ static inline long change_pmd_range(struct mmu_gather *tlb, again: next = pmd_addr_end(addr, end); - ret = change_pmd_prepare(vma, pmd, cp_flags); + ret = change_pmd_prepare(vma, pmd, addr, cp_flags); if (ret) { pages = ret; break; @@ -XXX,XX +XXX,XX @@ static inline long change_pmd_range(struct mmu_gather *tlb, * cleared; make sure pmd populated if * necessary, then fall-through to pte level. */ - ret = change_pmd_prepare(vma, pmd, cp_flags); + ret = change_pmd_prepare(vma, pmd, addr, cp_flags); if (ret) { pages = ret; break; diff --git a/mm/mremap.c b/mm/mremap.c index XXXXXXX..XXXXXXX 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -XXX,XX +XXX,XX @@ unsigned long move_page_tables(struct vm_area_struct *vma, } if (pmd_none(*old_pmd)) continue; - if (pte_alloc(new_vma->vm_mm, new_pmd)) + if (pte_alloc(new_vma->vm_mm, new_pmd, new_addr)) break; if (move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma, new_pmd, new_addr, need_rmap_locks) < 0) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index XXXXXXX..XXXXXXX 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -XXX,XX +XXX,XX @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, break; } if (unlikely(pmd_none(dst_pmdval)) && - unlikely(__pte_alloc(dst_mm, dst_pmd))) { + unlikely(__pte_alloc(dst_mm, dst_pmd, dst_addr))) { err = -ENOMEM; break; } @@ -XXX,XX +XXX,XX @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start, err = -ENOENT; break; } - if (unlikely(__pte_alloc(mm, src_pmd))) { + if (unlikely(__pte_alloc(mm, src_pmd, src_addr))) { err = -ENOMEM; break; } } - if (unlikely(pte_alloc(mm, dst_pmd))) { + if (unlikely(pte_alloc(mm, dst_pmd, dst_addr))) { err = -ENOMEM; break; } -- 2.20.1
Now in order to pursue high performance, applications mostly use some high-performance user-mode memory allocators, such as jemalloc or tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will release page table memory, which may cause huge page table memory usage. The following are a memory usage snapshot of one process which actually happened on our server: VIRT: 55t RES: 590g VmPTE: 110g In this case, most of the page table entries are empty. For such a PTE page where all entries are empty, we can actually free it back to the system for others to use. As a first step, this commit attempts to synchronously free the empty PTE pages in zap_page_range_single() (MADV_DONTNEED etc will invoke this). In order to reduce overhead, we only handle the cases with a high probability of generating empty PTE pages, and other cases will be filtered out, such as: - hugetlb vma (unsuitable) - userfaultfd_wp vma (may reinstall the pte entry) - writable private file mapping case (COW-ed anon page is not zapped) - etc For userfaultfd_wp and private file mapping cases (and MADV_FREE case, of course), consider scanning and freeing empty PTE pages asynchronously in the future. The following code snippet can show the effect of optimization: mmap 50G while (1) { for (; i < 1024 * 25; i++) { touch 2M memory madvise MADV_DONTNEED 2M } } As we can see, the memory usage of VmPTE is reduced: before after VIRT 50.0 GB 50.0 GB RES 3.1 MB 3.1 MB VmPTE 102640 KB 240 KB Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> --- include/linux/pgtable.h | 14 +++++ mm/Makefile | 1 + mm/huge_memory.c | 3 + mm/internal.h | 14 +++++ mm/khugepaged.c | 22 ++++++- mm/memory.c | 2 + mm/pt_reclaim.c | 131 ++++++++++++++++++++++++++++++++++++++++ 7 files changed, 186 insertions(+), 1 deletion(-) create mode 100644 mm/pt_reclaim.c diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index XXXXXXX..XXXXXXX 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -XXX,XX +XXX,XX @@ static inline void arch_check_zapped_pmd(struct vm_area_struct *vma, } #endif +#ifndef arch_flush_tlb_before_set_huge_page +static inline void arch_flush_tlb_before_set_huge_page(struct mm_struct *mm, + unsigned long addr) +{ +} +#endif + +#ifndef arch_flush_tlb_before_set_pte_page +static inline void arch_flush_tlb_before_set_pte_page(struct mm_struct *mm, + unsigned long addr) +{ +} +#endif + #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long address, diff --git a/mm/Makefile b/mm/Makefile index XXXXXXX..XXXXXXX 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -XXX,XX +XXX,XX @@ obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o obj-$(CONFIG_EXECMEM) += execmem.o +obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o diff --git a/mm/huge_memory.c b/mm/huge_memory.c index XXXXXXX..XXXXXXX 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -XXX,XX +XXX,XX @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf, folio_add_new_anon_rmap(folio, vma, haddr, RMAP_EXCLUSIVE); folio_add_lru_vma(folio, vma); pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable); + arch_flush_tlb_before_set_huge_page(vma->vm_mm, haddr); set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry); update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR); @@ -XXX,XX +XXX,XX @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm, entry = mk_pmd(&zero_folio->page, vma->vm_page_prot); entry = pmd_mkhuge(entry); pgtable_trans_huge_deposit(mm, pmd, pgtable); + arch_flush_tlb_before_set_huge_page(mm, haddr); set_pmd_at(mm, haddr, pmd, entry); mm_inc_nr_ptes(mm); } @@ -XXX,XX +XXX,XX @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, pgtable = NULL; } + arch_flush_tlb_before_set_huge_page(mm, addr); set_pmd_at(mm, addr, pmd, entry); update_mmu_cache_pmd(vma, addr, pmd); diff --git a/mm/internal.h b/mm/internal.h index XXXXXXX..XXXXXXX 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -XXX,XX +XXX,XX @@ void unlink_file_vma_batch_init(struct unlink_vma_file_batch *); void unlink_file_vma_batch_add(struct unlink_vma_file_batch *, struct vm_area_struct *); void unlink_file_vma_batch_final(struct unlink_vma_file_batch *); +#ifdef CONFIG_PT_RECLAIM +void try_to_reclaim_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma, + unsigned long start_addr, unsigned long end_addr, + struct zap_details *details); +#else +static inline void try_to_reclaim_pgtables(struct mmu_gather *tlb, + struct vm_area_struct *vma, + unsigned long start_addr, + unsigned long end_addr, + struct zap_details *details) +{ +} +#endif /* CONFIG_PT_RECLAIM */ + #endif /* __MM_INTERNAL_H */ diff --git a/mm/khugepaged.c b/mm/khugepaged.c index XXXXXXX..XXXXXXX 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -XXX,XX +XXX,XX @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED)) pml = pmd_lock(mm, pmd); - start_pte = pte_offset_map_nolock(mm, pmd, NULL, haddr, &ptl); + start_pte = pte_offset_map_nolock(mm, pmd, &pgt_pmd, haddr, &ptl); if (!start_pte) /* mmap_lock + page lock should prevent this */ goto abort; if (!pml) @@ -XXX,XX +XXX,XX @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, else if (ptl != pml) spin_lock_nested(ptl, SINGLE_DEPTH_NESTING); + /* pmd entry may be changed by others */ + if (unlikely(IS_ENABLED(CONFIG_PT_RECLAIM) && !pml && + !pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) + goto abort; + /* step 2: clear page table and adjust rmap */ for (i = 0, addr = haddr, pte = start_pte; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) { @@ -XXX,XX +XXX,XX @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, pml = pmd_lock(mm, pmd); if (ptl != pml) spin_lock_nested(ptl, SINGLE_DEPTH_NESTING); + + if (unlikely(IS_ENABLED(CONFIG_PT_RECLAIM) && + !pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) { + spin_unlock(ptl); + goto unlock; + } } pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd); pmdp_get_lockless_sync(); @@ -XXX,XX +XXX,XX @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, } if (start_pte) pte_unmap_unlock(start_pte, ptl); +unlock: if (pml && pml != ptl) spin_unlock(pml); if (notified) @@ -XXX,XX +XXX,XX @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) mmu_notifier_invalidate_range_start(&range); pml = pmd_lock(mm, pmd); +#ifdef CONFIG_PT_RECLAIM + /* check if the pmd is still valid */ + if (check_pmd_still_valid(mm, addr, pmd) != SCAN_SUCCEED) { + spin_unlock(pml); + mmu_notifier_invalidate_range_end(&range); + continue; + } +#endif ptl = pte_lockptr(mm, pmd); if (ptl != pml) spin_lock_nested(ptl, SINGLE_DEPTH_NESTING); diff --git a/mm/memory.c b/mm/memory.c index XXXXXXX..XXXXXXX 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -XXX,XX +XXX,XX @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, unsigned long addr, spinlock_t *ptl = pmd_lock(mm, pmd); if (likely(pmd_none(*pmd))) { /* Has another populated it ? */ + arch_flush_tlb_before_set_pte_page(mm, addr); mm_inc_nr_ptes(mm); /* * Ensure all pte setup (eg. pte page lock and page clearing) are @@ -XXX,XX +XXX,XX @@ void zap_page_range_single(struct vm_area_struct *vma, unsigned long address, * could have been expanded for hugetlb pmd sharing. */ unmap_single_vma(&tlb, vma, address, end, details, false); + try_to_reclaim_pgtables(&tlb, vma, address, end, details); mmu_notifier_invalidate_range_end(&range); tlb_finish_mmu(&tlb); hugetlb_zap_end(vma, details); diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c new file mode 100644 index XXXXXXX..XXXXXXX --- /dev/null +++ b/mm/pt_reclaim.c @@ -XXX,XX +XXX,XX @@ +// SPDX-License-Identifier: GPL-2.0 +#include <linux/pagewalk.h> +#include <linux/hugetlb.h> +#include <asm-generic/tlb.h> +#include <asm/pgalloc.h> + +#include "internal.h" + +/* + * Locking: + * - already held the mmap read lock to traverse the pgtable + * - use pmd lock for clearing pmd entry + * - use pte lock for checking empty PTE page, and release it after clearing + * pmd entry, then we can capture the changed pmd in pte_offset_map_lock() + * etc after holding this pte lock. Thanks to this, we don't need to hold the + * rmap-related locks. + * - users of pte_offset_map_lock() etc all expect the PTE page to be stable by + * using rcu lock, so PTE pages should be freed by RCU. + */ +static int reclaim_pgtables_pmd_entry(pmd_t *pmd, unsigned long addr, + unsigned long next, struct mm_walk *walk) +{ + struct mm_struct *mm = walk->mm; + struct mmu_gather *tlb = walk->private; + pte_t *start_pte, *pte; + pmd_t pmdval; + spinlock_t *pml = NULL, *ptl; + int i; + + start_pte = pte_offset_map_nolock(mm, pmd, &pmdval, addr, &ptl); + if (!start_pte) + return 0; + + pml = pmd_lock(mm, pmd); + if (ptl != pml) + spin_lock_nested(ptl, SINGLE_DEPTH_NESTING); + + if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) + goto out_ptl; + + /* Check if it is empty PTE page */ + for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) { + if (!pte_none(ptep_get(pte))) + goto out_ptl; + } + pte_unmap(start_pte); + + pmd_clear(pmd); + if (ptl != pml) + spin_unlock(ptl); + spin_unlock(pml); + + /* + * NOTE: + * In order to reuse mmu_gather to batch flush tlb and free PTE pages, + * here tlb is not flushed before pmd lock is unlocked. This may + * result in the following two situations: + * + * 1) Userland can trigger page fault and fill a huge page, which will + * cause the existence of small size TLB and huge TLB for the same + * address. + * + * 2) Userland can also trigger page fault and fill a PTE page, which + * will cause the existence of two small size TLBs, but the PTE + * page they map are different. + * + * Some CPUs do not allow these, to solve this, we can define + * arch_flush_tlb_before_set_{huge|pte}_page to detect this case and + * flush TLB before filling a huge page or a PTE page in page fault + * path. + */ + pte_free_tlb(tlb, pmd_pgtable(pmdval), addr); + mm_dec_nr_ptes(mm); + + return 0; + +out_ptl: + pte_unmap_unlock(start_pte, ptl); + if (pml != ptl) + spin_unlock(pml); + + return 0; +} + +static const struct mm_walk_ops reclaim_pgtables_walk_ops = { + .pmd_entry = reclaim_pgtables_pmd_entry, + .walk_lock = PGWALK_RDLOCK, +}; + +void try_to_reclaim_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma, + unsigned long start_addr, unsigned long end_addr, + struct zap_details *details) +{ + unsigned long start = max(vma->vm_start, start_addr); + unsigned long end; + + if (start >= vma->vm_end) + return; + end = min(vma->vm_end, end_addr); + if (end <= vma->vm_start) + return; + + /* Skip hugetlb case */ + if (is_vm_hugetlb_page(vma)) + return; + + /* Leave this to the THP path to handle */ + if (vma->vm_flags & VM_HUGEPAGE) + return; + + /* userfaultfd_wp case may reinstall the pte entry, also skip */ + if (userfaultfd_wp(vma)) + return; + + /* + * For private file mapping, the COW-ed page is an anon page, and it + * will not be zapped. For simplicity, skip the all writable private + * file mapping cases. + */ + if (details && !vma_is_anonymous(vma) && + !(vma->vm_flags & VM_MAYSHARE) && + (vma->vm_flags & VM_WRITE)) + return; + + start = ALIGN(start, PMD_SIZE); + end = ALIGN_DOWN(end, PMD_SIZE); + if (end - start < PMD_SIZE) + return; + + walk_page_range_vma(vma, start, end, &reclaim_pgtables_walk_ops, tlb); +} -- 2.20.1
Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, the page table pages will be freed by semi RCU, that is: - batch table freeing: asynchronous free by RCU - single table freeing: IPI + synchronous free In this way, the page table can be lockless traversed by disabling IRQ in paths such as fast GUP. But this is not enough to free the empty PTE page table pages in paths other that munmap and exit_mmap path, because IPI cannot be synchronized with rcu_read_lock() in pte_offset_map{_lock}(). In preparation for supporting empty PTE page table pages reclaimation, let single table also be freed by RCU like batch table freeing. Then we can also use pte_offset_map() etc to prevent PTE page from being freed. Like pte_free_defer(), we can also safely use ptdesc->pt_rcu_head to free the page table pages: - The pt_rcu_head is unioned with pt_list and pmd_huge_pte. - For pt_list, it is used to manage the PGD page in x86. Fortunately tlb_remove_table() will not be used for free PGD pages, so it is safe to use pt_rcu_head. - For pmd_huge_pte, we will do zap_deposited_table() before freeing the PMD page, so it is also safe. Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> --- arch/x86/include/asm/tlb.h | 23 +++++++++++++++++++++++ arch/x86/kernel/paravirt.c | 7 +++++++ arch/x86/mm/pgtable.c | 2 +- mm/mmu_gather.c | 2 +- 4 files changed, 32 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h index XXXXXXX..XXXXXXX 100644 --- a/arch/x86/include/asm/tlb.h +++ b/arch/x86/include/asm/tlb.h @@ -XXX,XX +XXX,XX @@ static inline void __tlb_remove_table(void *table) free_page_and_swap_cache(table); } +#ifndef CONFIG_PT_RECLAIM +static inline void __tlb_remove_table_one(void *table) +{ + free_page_and_swap_cache(table); +} +#else +static inline void __tlb_remove_table_one_rcu(struct rcu_head *head) +{ + struct page *page; + + page = container_of(head, struct page, rcu_head); + free_page_and_swap_cache(page); +} + +static inline void __tlb_remove_table_one(void *table) +{ + struct page *page; + + page = table; + call_rcu(&page->rcu_head, __tlb_remove_table_one_rcu); +} +#endif /* CONFIG_PT_RECLAIM */ + #endif /* _ASM_X86_TLB_H */ diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index XXXXXXX..XXXXXXX 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -XXX,XX +XXX,XX @@ void __init native_pv_lock_init(void) static_branch_disable(&virt_spin_lock_key); } +#ifndef CONFIG_PT_RECLAIM static void native_tlb_remove_table(struct mmu_gather *tlb, void *table) { tlb_remove_page(tlb, table); } +#else +static void native_tlb_remove_table(struct mmu_gather *tlb, void *table) +{ + tlb_remove_table(tlb, table); +} +#endif struct static_key paravirt_steal_enabled; struct static_key paravirt_steal_rq_enabled; diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index XXXXXXX..XXXXXXX 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -XXX,XX +XXX,XX @@ EXPORT_SYMBOL(physical_mask); #define PGTABLE_HIGHMEM 0 #endif -#ifndef CONFIG_PARAVIRT +#if !defined(CONFIG_PARAVIRT) && !defined(CONFIG_PT_RECLAIM) static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table) { diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c index XXXXXXX..XXXXXXX 100644 --- a/mm/mmu_gather.c +++ b/mm/mmu_gather.c @@ -XXX,XX +XXX,XX @@ static inline void tlb_table_invalidate(struct mmu_gather *tlb) static void tlb_remove_table_one(void *table) { tlb_remove_table_sync_one(); - __tlb_remove_table(table); + __tlb_remove_table_one(table); } static void tlb_table_flush(struct mmu_gather *tlb) -- 2.20.1
When we use mmu_gather to batch flush tlb and free PTE pages, the TLB is not flushed before pmd lock is unlocked. This may result in the following two situations: 1) Userland can trigger page fault and fill a huge page, which will cause the existence of small size TLB and huge TLB for the same address. 2) Userland can also trigger page fault and fill a PTE page, which will cause the existence of two small size TLBs, but the PTE page they map are different. According to Intel's TLB Application note (317080), some CPUs of x86 do not allow the 1) case, so define arch_flush_tlb_before_set_huge_page to detect and fix this issue. Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> --- arch/x86/include/asm/pgtable.h | 6 ++++++ arch/x86/mm/pgtable.c | 13 +++++++++++++ 2 files changed, 19 insertions(+) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index XXXXXXX..XXXXXXX 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -XXX,XX +XXX,XX @@ void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte); #define arch_check_zapped_pmd arch_check_zapped_pmd void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd); +#ifdef CONFIG_PT_RECLAIM +#define arch_flush_tlb_before_set_huge_page arch_flush_tlb_before_set_huge_page +void arch_flush_tlb_before_set_huge_page(struct mm_struct *mm, + unsigned long addr); +#endif + #ifdef CONFIG_XEN_PV #define arch_has_hw_nonleaf_pmd_young arch_has_hw_nonleaf_pmd_young static inline bool arch_has_hw_nonleaf_pmd_young(void) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index XXXXXXX..XXXXXXX 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -XXX,XX +XXX,XX @@ void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd) VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) && pmd_shstk(pmd)); } + +#ifdef CONFIG_PT_RECLAIM +void arch_flush_tlb_before_set_huge_page(struct mm_struct *mm, + unsigned long addr) +{ + if (atomic_read(&mm->tlb_flush_pending)) { + unsigned long start = ALIGN_DOWN(addr, PMD_SIZE); + unsigned long end = start + PMD_SIZE; + + flush_tlb_mm_range(mm, start, end, PAGE_SHIFT, false); + } +} +#endif /* CONFIG_PT_RECLAIM */ -- 2.20.1
Now, x86 has fully supported the CONFIG_PT_RECLAIM feature, and reclaiming PTE pages is profitable only on 64-bit systems, so select ARCH_SUPPORTS_PT_RECLAIM if X86_64. Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> --- arch/x86/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index XXXXXXX..XXXXXXX 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -XXX,XX +XXX,XX @@ config X86 select FUNCTION_ALIGNMENT_4B imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE + select ARCH_SUPPORTS_PT_RECLAIM if X86_64 config INSTRUCTION_DECODER def_bool y -- 2.20.1
Changes in v4: - update the process_addrs.rst in [PATCH v4 01/11] (suggested by Lorenzo Stoakes) - fix [PATCH v3 4/9] and move it after [PATCH v3 5/9] (pointed by David Hildenbrand) - change to use any_skipped instead of rechecking pte_none() to detect empty user PTE pages (suggested by David Hildenbrand) - rebase onto the next-20241203 Changes in v3: - recheck pmd state instead of pmd_same() in retract_page_tables() (suggested by Jann Horn) - recheck dst_pmd entry in move_pages_pte() (pointed by Jann Horn) - introduce new skip_none_ptes() (suggested by David Hildenbrand) - minor changes in [PATCH v2 5/7] - remove tlb_remove_table_sync_one() if CONFIG_PT_RECLAIM is enabled. - use put_page() instead of free_page_and_swap_cache() in __tlb_remove_table_one_rcu() (pointed by Jann Horn) - collect the Reviewed-bys and Acked-bys - rebase onto the next-20241112 Changes in v2: - fix [PATCH v1 1/7] (Jann Horn) - reset force_flush and force_break to false in [PATCH v1 2/7] (Jann Horn) - introduce zap_nonpresent_ptes() and do_zap_pte_range() - check pte_none() instead of can_reclaim_pt after the processing of PTEs (remove [PATCH v1 3/7] and [PATCH v1 4/7]) - reorder patches - rebase onto the next-20241031 Changes in v1: - replace [RFC PATCH 1/7] with a separate serise (already merge into mm-unstable): https://lore.kernel.org/lkml/cover.1727332572.git.zhengqi.arch@bytedance.com/ (suggested by David Hildenbrand) - squash [RFC PATCH 2/7] into [RFC PATCH 4/7] (suggested by David Hildenbrand) - change to scan and reclaim empty user PTE pages in zap_pte_range() (suggested by David Hildenbrand) - sent a separate RFC patch to track the tlb flushing issue, and remove that part form this series ([RFC PATCH 3/7] and [RFC PATCH 6/7]). link: https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/ - add [PATCH v1 1/7] into this series - drop RFC tag - rebase onto the next-20241011 Changes in RFC v2: - fix compilation errors in [RFC PATCH 5/7] and [RFC PATCH 7/7] reproted by kernel test robot - use pte_offset_map_nolock() + pmd_same() instead of check_pmd_still_valid() in retract_page_tables() (in [RFC PATCH 4/7]) - rebase onto the next-20240805 Hi all, Previously, we tried to use a completely asynchronous method to reclaim empty user PTE pages [1]. After discussing with David Hildenbrand, we decided to implement synchronous reclaimation in the case of madvise(MADV_DONTNEED) as the first step. So this series aims to synchronously free the empty PTE pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude cases other than madvise(MADV_DONTNEED). In zap_pte_range(), mmu_gather is used to perform batch tlb flushing and page freeing operations. Therefore, if we want to free the empty PTE page in this path, the most natural way is to add it to mmu_gather as well. Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather will free page table pages by semi RCU: - batch table freeing: asynchronous free by RCU - single table freeing: IPI + synchronous free But this is not enough to free the empty PTE page table pages in paths other that munmap and exit_mmap path, because IPI cannot be synchronized with rcu_read_lock() in pte_offset_map{_lock}(). So we should let single table also be freed by RCU like batch table freeing. As a first step, we supported this feature on x86_64 and selectd the newly introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM. For other cases such as madvise(MADV_FREE), consider scanning and freeing empty PTE pages asynchronously in the future. This series is based on next-20241112 (which contains the series [2]). Note: issues related to TLB flushing are not new to this series and are tracked in the separate RFC patch [3]. And more context please refer to this thread [4]. Comments and suggestions are welcome! Thanks, Qi [1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/ [2]. https://lore.kernel.org/lkml/cover.1727332572.git.zhengqi.arch@bytedance.com/ [3]. https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/ [4]. https://lore.kernel.org/lkml/6f38cb19-9847-4f70-bbe7-06881bb016be@bytedance.com/ Qi Zheng (11): mm: khugepaged: recheck pmd state in retract_page_tables() mm: userfaultfd: recheck dst_pmd entry in move_pages_pte() mm: introduce zap_nonpresent_ptes() mm: introduce do_zap_pte_range() mm: skip over all consecutive none ptes in do_zap_pte_range() mm: zap_install_uffd_wp_if_needed: return whether uffd-wp pte has been re-installed mm: do_zap_pte_range: return any_skipped information to the caller mm: make zap_pte_range() handle full within-PMD range mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED) x86: mm: free page table pages by RCU instead of semi RCU x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64 Documentation/mm/process_addrs.rst | 4 + arch/x86/Kconfig | 1 + arch/x86/include/asm/tlb.h | 20 +++ arch/x86/kernel/paravirt.c | 7 + arch/x86/mm/pgtable.c | 10 +- include/linux/mm.h | 1 + include/linux/mm_inline.h | 11 +- include/linux/mm_types.h | 4 +- mm/Kconfig | 15 ++ mm/Makefile | 1 + mm/internal.h | 19 +++ mm/khugepaged.c | 45 +++-- mm/madvise.c | 7 +- mm/memory.c | 253 ++++++++++++++++++----------- mm/mmu_gather.c | 9 +- mm/pt_reclaim.c | 71 ++++++++ mm/userfaultfd.c | 51 ++++-- 17 files changed, 397 insertions(+), 132 deletions(-) create mode 100644 mm/pt_reclaim.c -- 2.20.1
In retract_page_tables(), the lock of new_folio is still held, we will be blocked in the page fault path, which prevents the pte entries from being set again. So even though the old empty PTE page may be concurrently freed and a new PTE page is filled into the pmd entry, it is still empty and can be removed. So just refactor the retract_page_tables() a little bit and recheck the pmd state after holding the pmd lock. Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> --- Documentation/mm/process_addrs.rst | 4 +++ mm/khugepaged.c | 45 ++++++++++++++++++++---------- 2 files changed, 35 insertions(+), 14 deletions(-) diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst index XXXXXXX..XXXXXXX 100644 --- a/Documentation/mm/process_addrs.rst +++ b/Documentation/mm/process_addrs.rst @@ -XXX,XX +XXX,XX @@ are extra requirements for accessing them: new page table has been installed in the same location and filled with entries. Writers normally need to take the PTE lock and revalidate that the PMD entry still refers to the same PTE-level page table. + If the writer does not care whether it is the same PTE-level page table, it + can take the PMD lock and revalidate that the contents of pmd entry still meet + the requirements. In particular, this also happens in :c:func:`!retract_page_tables` + when handling :c:macro:`!MADV_COLLAPSE`. To access PTE-level page tables, a helper like :c:func:`!pte_offset_map_lock` or :c:func:`!pte_offset_map` can be used depending on stability requirements. diff --git a/mm/khugepaged.c b/mm/khugepaged.c index XXXXXXX..XXXXXXX 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -XXX,XX +XXX,XX @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, return SCAN_SUCCEED; } -static int find_pmd_or_thp_or_none(struct mm_struct *mm, - unsigned long address, - pmd_t **pmd) +static inline int check_pmd_state(pmd_t *pmd) { - pmd_t pmde; + pmd_t pmde = pmdp_get_lockless(pmd); - *pmd = mm_find_pmd(mm, address); - if (!*pmd) - return SCAN_PMD_NULL; - - pmde = pmdp_get_lockless(*pmd); if (pmd_none(pmde)) return SCAN_PMD_NONE; if (!pmd_present(pmde)) @@ -XXX,XX +XXX,XX @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm, return SCAN_SUCCEED; } +static int find_pmd_or_thp_or_none(struct mm_struct *mm, + unsigned long address, + pmd_t **pmd) +{ + *pmd = mm_find_pmd(mm, address); + if (!*pmd) + return SCAN_PMD_NULL; + + return check_pmd_state(*pmd); +} + static int check_pmd_still_valid(struct mm_struct *mm, unsigned long address, pmd_t *pmd) @@ -XXX,XX +XXX,XX @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) pmd_t *pmd, pgt_pmd; spinlock_t *pml; spinlock_t *ptl; - bool skipped_uffd = false; + bool success = false; /* * Check vma->anon_vma to exclude MAP_PRIVATE mappings that @@ -XXX,XX +XXX,XX @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) mmu_notifier_invalidate_range_start(&range); pml = pmd_lock(mm, pmd); + /* + * The lock of new_folio is still held, we will be blocked in + * the page fault path, which prevents the pte entries from + * being set again. So even though the old empty PTE page may be + * concurrently freed and a new PTE page is filled into the pmd + * entry, it is still empty and can be removed. + * + * So here we only need to recheck if the state of pmd entry + * still meets our requirements, rather than checking pmd_same() + * like elsewhere. + */ + if (check_pmd_state(pmd) != SCAN_SUCCEED) + goto drop_pml; ptl = pte_lockptr(mm, pmd); if (ptl != pml) spin_lock_nested(ptl, SINGLE_DEPTH_NESTING); @@ -XXX,XX +XXX,XX @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) * repeating the anon_vma check protects from one category, * and repeating the userfaultfd_wp() check from another. */ - if (unlikely(vma->anon_vma || userfaultfd_wp(vma))) { - skipped_uffd = true; - } else { + if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) { pgt_pmd = pmdp_collapse_flush(vma, addr, pmd); pmdp_get_lockless_sync(); + success = true; } if (ptl != pml) spin_unlock(ptl); +drop_pml: spin_unlock(pml); mmu_notifier_invalidate_range_end(&range); - if (!skipped_uffd) { + if (success) { mm_dec_nr_ptes(mm); page_table_check_pte_clear_range(mm, addr, pgt_pmd); pte_free_defer(mm, pmd_pgtable(pgt_pmd)); -- 2.20.1
In move_pages_pte(), since dst_pte needs to be none, the subsequent pte_same() check cannot prevent the dst_pte page from being freed concurrently, so we also need to abtain dst_pmdval and recheck pmd_same(). Otherwise, once we support empty PTE page reclaimation for anonymous pages, it may result in moving the src_pte page into the dts_pte page that is about to be freed by RCU. Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> --- mm/userfaultfd.c | 51 +++++++++++++++++++++++++++++++----------------- 1 file changed, 33 insertions(+), 18 deletions(-) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index XXXXXXX..XXXXXXX 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -XXX,XX +XXX,XX @@ void double_pt_unlock(spinlock_t *ptl1, __release(ptl2); } +static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte, + pte_t orig_dst_pte, pte_t orig_src_pte, + pmd_t *dst_pmd, pmd_t dst_pmdval) +{ + return pte_same(ptep_get(src_pte), orig_src_pte) && + pte_same(ptep_get(dst_pte), orig_dst_pte) && + pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd)); +} static int move_present_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma, @@ -XXX,XX +XXX,XX @@ static int move_present_pte(struct mm_struct *mm, unsigned long dst_addr, unsigned long src_addr, pte_t *dst_pte, pte_t *src_pte, pte_t orig_dst_pte, pte_t orig_src_pte, + pmd_t *dst_pmd, pmd_t dst_pmdval, spinlock_t *dst_ptl, spinlock_t *src_ptl, struct folio *src_folio) { @@ -XXX,XX +XXX,XX @@ static int move_present_pte(struct mm_struct *mm, double_pt_lock(dst_ptl, src_ptl); - if (!pte_same(ptep_get(src_pte), orig_src_pte) || - !pte_same(ptep_get(dst_pte), orig_dst_pte)) { + if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte, + dst_pmd, dst_pmdval)) { err = -EAGAIN; goto out; } @@ -XXX,XX +XXX,XX @@ static int move_swap_pte(struct mm_struct *mm, unsigned long dst_addr, unsigned long src_addr, pte_t *dst_pte, pte_t *src_pte, pte_t orig_dst_pte, pte_t orig_src_pte, + pmd_t *dst_pmd, pmd_t dst_pmdval, spinlock_t *dst_ptl, spinlock_t *src_ptl) { if (!pte_swp_exclusive(orig_src_pte)) @@ -XXX,XX +XXX,XX @@ static int move_swap_pte(struct mm_struct *mm, double_pt_lock(dst_ptl, src_ptl); - if (!pte_same(ptep_get(src_pte), orig_src_pte) || - !pte_same(ptep_get(dst_pte), orig_dst_pte)) { + if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte, + dst_pmd, dst_pmdval)) { double_pt_unlock(dst_ptl, src_ptl); return -EAGAIN; } @@ -XXX,XX +XXX,XX @@ static int move_zeropage_pte(struct mm_struct *mm, unsigned long dst_addr, unsigned long src_addr, pte_t *dst_pte, pte_t *src_pte, pte_t orig_dst_pte, pte_t orig_src_pte, + pmd_t *dst_pmd, pmd_t dst_pmdval, spinlock_t *dst_ptl, spinlock_t *src_ptl) { pte_t zero_pte; double_pt_lock(dst_ptl, src_ptl); - if (!pte_same(ptep_get(src_pte), orig_src_pte) || - !pte_same(ptep_get(dst_pte), orig_dst_pte)) { + if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte, + dst_pmd, dst_pmdval)) { double_pt_unlock(dst_ptl, src_ptl); return -EAGAIN; } @@ -XXX,XX +XXX,XX @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pte_t *src_pte = NULL; pte_t *dst_pte = NULL; pmd_t dummy_pmdval; + pmd_t dst_pmdval; struct folio *src_folio = NULL; struct anon_vma *src_anon_vma = NULL; struct mmu_notifier_range range; @@ -XXX,XX +XXX,XX @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, retry: /* * Use the maywrite version to indicate that dst_pte will be modified, - * but since we will use pte_same() to detect the change of the pte - * entry, there is no need to get pmdval, so just pass a dummy variable - * to it. + * since dst_pte needs to be none, the subsequent pte_same() check + * cannot prevent the dst_pte page from being freed concurrently, so we + * also need to abtain dst_pmdval and recheck pmd_same() later. */ - dst_pte = pte_offset_map_rw_nolock(mm, dst_pmd, dst_addr, &dummy_pmdval, + dst_pte = pte_offset_map_rw_nolock(mm, dst_pmd, dst_addr, &dst_pmdval, &dst_ptl); /* Retry if a huge pmd materialized from under us */ @@ -XXX,XX +XXX,XX @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, goto out; } - /* same as dst_pte */ + /* + * Unlike dst_pte, the subsequent pte_same() check can ensure the + * stability of the src_pte page, so there is no need to get pmdval, + * just pass a dummy variable to it. + */ src_pte = pte_offset_map_rw_nolock(mm, src_pmd, src_addr, &dummy_pmdval, &src_ptl); @@ -XXX,XX +XXX,XX @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, err = move_zeropage_pte(mm, dst_vma, src_vma, dst_addr, src_addr, dst_pte, src_pte, orig_dst_pte, orig_src_pte, - dst_ptl, src_ptl); + dst_pmd, dst_pmdval, dst_ptl, src_ptl); goto out; } @@ -XXX,XX +XXX,XX @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, err = move_present_pte(mm, dst_vma, src_vma, dst_addr, src_addr, dst_pte, src_pte, - orig_dst_pte, orig_src_pte, - dst_ptl, src_ptl, src_folio); + orig_dst_pte, orig_src_pte, dst_pmd, + dst_pmdval, dst_ptl, src_ptl, src_folio); } else { entry = pte_to_swp_entry(orig_src_pte); if (non_swap_entry(entry)) { @@ -XXX,XX +XXX,XX @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, goto out; } - err = move_swap_pte(mm, dst_addr, src_addr, - dst_pte, src_pte, - orig_dst_pte, orig_src_pte, - dst_ptl, src_ptl); + err = move_swap_pte(mm, dst_addr, src_addr, dst_pte, src_pte, + orig_dst_pte, orig_src_pte, dst_pmd, + dst_pmdval, dst_ptl, src_ptl); } out: -- 2.20.1
Similar to zap_present_ptes(), let's introduce zap_nonpresent_ptes() to handle non-present ptes, which can improve code readability. No functional change. Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Jann Horn <jannh@google.com> Acked-by: David Hildenbrand <david@redhat.com> --- mm/memory.c | 136 ++++++++++++++++++++++++++++------------------------ 1 file changed, 73 insertions(+), 63 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index XXXXXXX..XXXXXXX 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -XXX,XX +XXX,XX @@ static inline int zap_present_ptes(struct mmu_gather *tlb, return 1; } +static inline int zap_nonpresent_ptes(struct mmu_gather *tlb, + struct vm_area_struct *vma, pte_t *pte, pte_t ptent, + unsigned int max_nr, unsigned long addr, + struct zap_details *details, int *rss) +{ + swp_entry_t entry; + int nr = 1; + + entry = pte_to_swp_entry(ptent); + if (is_device_private_entry(entry) || + is_device_exclusive_entry(entry)) { + struct page *page = pfn_swap_entry_to_page(entry); + struct folio *folio = page_folio(page); + + if (unlikely(!should_zap_folio(details, folio))) + return 1; + /* + * Both device private/exclusive mappings should only + * work with anonymous page so far, so we don't need to + * consider uffd-wp bit when zap. For more information, + * see zap_install_uffd_wp_if_needed(). + */ + WARN_ON_ONCE(!vma_is_anonymous(vma)); + rss[mm_counter(folio)]--; + if (is_device_private_entry(entry)) + folio_remove_rmap_pte(folio, page, vma); + folio_put(folio); + } else if (!non_swap_entry(entry)) { + /* Genuine swap entries, hence a private anon pages */ + if (!should_zap_cows(details)) + return 1; + + nr = swap_pte_batch(pte, max_nr, ptent); + rss[MM_SWAPENTS] -= nr; + free_swap_and_cache_nr(entry, nr); + } else if (is_migration_entry(entry)) { + struct folio *folio = pfn_swap_entry_folio(entry); + + if (!should_zap_folio(details, folio)) + return 1; + rss[mm_counter(folio)]--; + } else if (pte_marker_entry_uffd_wp(entry)) { + /* + * For anon: always drop the marker; for file: only + * drop the marker if explicitly requested. + */ + if (!vma_is_anonymous(vma) && !zap_drop_markers(details)) + return 1; + } else if (is_guard_swp_entry(entry)) { + /* + * Ordinary zapping should not remove guard PTE + * markers. Only do so if we should remove PTE markers + * in general. + */ + if (!zap_drop_markers(details)) + return 1; + } else if (is_hwpoison_entry(entry) || is_poisoned_swp_entry(entry)) { + if (!should_zap_cows(details)) + return 1; + } else { + /* We should have covered all the swap entry types */ + pr_alert("unrecognized swap entry 0x%lx\n", entry.val); + WARN_ON_ONCE(1); + } + clear_not_present_full_ptes(vma->vm_mm, addr, pte, nr, tlb->fullmm); + zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent); + + return nr; +} + static unsigned long zap_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, @@ -XXX,XX +XXX,XX @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, spinlock_t *ptl; pte_t *start_pte; pte_t *pte; - swp_entry_t entry; int nr; tlb_change_page_size(tlb, PAGE_SIZE); @@ -XXX,XX +XXX,XX @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, arch_enter_lazy_mmu_mode(); do { pte_t ptent = ptep_get(pte); - struct folio *folio; - struct page *page; int max_nr; nr = 1; @@ -XXX,XX +XXX,XX @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, if (need_resched()) break; + max_nr = (end - addr) / PAGE_SIZE; if (pte_present(ptent)) { - max_nr = (end - addr) / PAGE_SIZE; nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr, addr, details, rss, &force_flush, &force_break); @@ -XXX,XX +XXX,XX @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, addr += nr * PAGE_SIZE; break; } - continue; - } - - entry = pte_to_swp_entry(ptent); - if (is_device_private_entry(entry) || - is_device_exclusive_entry(entry)) { - page = pfn_swap_entry_to_page(entry); - folio = page_folio(page); - if (unlikely(!should_zap_folio(details, folio))) - continue; - /* - * Both device private/exclusive mappings should only - * work with anonymous page so far, so we don't need to - * consider uffd-wp bit when zap. For more information, - * see zap_install_uffd_wp_if_needed(). - */ - WARN_ON_ONCE(!vma_is_anonymous(vma)); - rss[mm_counter(folio)]--; - if (is_device_private_entry(entry)) - folio_remove_rmap_pte(folio, page, vma); - folio_put(folio); - } else if (!non_swap_entry(entry)) { - max_nr = (end - addr) / PAGE_SIZE; - nr = swap_pte_batch(pte, max_nr, ptent); - /* Genuine swap entries, hence a private anon pages */ - if (!should_zap_cows(details)) - continue; - rss[MM_SWAPENTS] -= nr; - free_swap_and_cache_nr(entry, nr); - } else if (is_migration_entry(entry)) { - folio = pfn_swap_entry_folio(entry); - if (!should_zap_folio(details, folio)) - continue; - rss[mm_counter(folio)]--; - } else if (pte_marker_entry_uffd_wp(entry)) { - /* - * For anon: always drop the marker; for file: only - * drop the marker if explicitly requested. - */ - if (!vma_is_anonymous(vma) && - !zap_drop_markers(details)) - continue; - } else if (is_guard_swp_entry(entry)) { - /* - * Ordinary zapping should not remove guard PTE - * markers. Only do so if we should remove PTE markers - * in general. - */ - if (!zap_drop_markers(details)) - continue; - } else if (is_hwpoison_entry(entry) || - is_poisoned_swp_entry(entry)) { - if (!should_zap_cows(details)) - continue; } else { - /* We should have covered all the swap entry types */ - pr_alert("unrecognized swap entry 0x%lx\n", entry.val); - WARN_ON_ONCE(1); + nr = zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, + addr, details, rss); } - clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); - zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent); } while (pte += nr, addr += PAGE_SIZE * nr, addr != end); add_mm_rss_vec(mm, rss); -- 2.20.1
This commit introduces do_zap_pte_range() to actually zap the PTEs, which will help improve code readability and facilitate secondary checking of the processed PTEs in the future. No functional change. Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Jann Horn <jannh@google.com> Acked-by: David Hildenbrand <david@redhat.com> --- mm/memory.c | 45 ++++++++++++++++++++++++++------------------- 1 file changed, 26 insertions(+), 19 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index XXXXXXX..XXXXXXX 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -XXX,XX +XXX,XX @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb, return nr; } +static inline int do_zap_pte_range(struct mmu_gather *tlb, + struct vm_area_struct *vma, pte_t *pte, + unsigned long addr, unsigned long end, + struct zap_details *details, int *rss, + bool *force_flush, bool *force_break) +{ + pte_t ptent = ptep_get(pte); + int max_nr = (end - addr) / PAGE_SIZE; + + if (pte_none(ptent)) + return 1; + + if (pte_present(ptent)) + return zap_present_ptes(tlb, vma, pte, ptent, max_nr, + addr, details, rss, force_flush, + force_break); + + return zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr, + details, rss); +} + static unsigned long zap_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, @@ -XXX,XX +XXX,XX @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); do { - pte_t ptent = ptep_get(pte); - int max_nr; - - nr = 1; - if (pte_none(ptent)) - continue; - if (need_resched()) break; - max_nr = (end - addr) / PAGE_SIZE; - if (pte_present(ptent)) { - nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr, - addr, details, rss, &force_flush, - &force_break); - if (unlikely(force_break)) { - addr += nr * PAGE_SIZE; - break; - } - } else { - nr = zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, - addr, details, rss); + nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, rss, + &force_flush, &force_break); + if (unlikely(force_break)) { + addr += nr * PAGE_SIZE; + break; } } while (pte += nr, addr += PAGE_SIZE * nr, addr != end); -- 2.20.1
Skip over all consecutive none ptes in do_zap_pte_range(), which helps optimize away need_resched() + force_break + incremental pte/addr increments etc. Suggested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> --- mm/memory.c | 27 ++++++++++++++++++++------- 1 file changed, 20 insertions(+), 7 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index XXXXXXX..XXXXXXX 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -XXX,XX +XXX,XX @@ static inline int do_zap_pte_range(struct mmu_gather *tlb, { pte_t ptent = ptep_get(pte); int max_nr = (end - addr) / PAGE_SIZE; + int nr = 0; - if (pte_none(ptent)) - return 1; + /* Skip all consecutive none ptes */ + if (pte_none(ptent)) { + for (nr = 1; nr < max_nr; nr++) { + ptent = ptep_get(pte + nr); + if (!pte_none(ptent)) + break; + } + max_nr -= nr; + if (!max_nr) + return nr; + pte += nr; + addr += nr * PAGE_SIZE; + } if (pte_present(ptent)) - return zap_present_ptes(tlb, vma, pte, ptent, max_nr, - addr, details, rss, force_flush, - force_break); + nr += zap_present_ptes(tlb, vma, pte, ptent, max_nr, addr, + details, rss, force_flush, force_break); + else + nr += zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr, + details, rss); - return zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr, - details, rss); + return nr; } static unsigned long zap_pte_range(struct mmu_gather *tlb, -- 2.20.1
In some cases, we'll replace the none pte with an uffd-wp swap special pte marker when necessary. Let's expose this information to the caller through the return value, so that subsequent commits can use this information to detect whether the PTE page is empty. Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> --- include/linux/mm_inline.h | 11 +++++++---- mm/memory.c | 16 ++++++++++++---- 2 files changed, 19 insertions(+), 8 deletions(-) diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index XXXXXXX..XXXXXXX 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -XXX,XX +XXX,XX @@ static inline pte_marker copy_pte_marker( * Must be called with pgtable lock held so that no thread will see the none * pte, and if they see it, they'll fault and serialize at the pgtable lock. * - * This function is a no-op if PTE_MARKER_UFFD_WP is not enabled. + * Returns true if an uffd-wp pte was installed, false otherwise. */ -static inline void +static inline bool pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr, pte_t *pte, pte_t pteval) { @@ -XXX,XX +XXX,XX @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr, * with a swap pte. There's no way of leaking the bit. */ if (vma_is_anonymous(vma) || !userfaultfd_wp(vma)) - return; + return false; /* A uffd-wp wr-protected normal pte */ if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval))) @@ -XXX,XX +XXX,XX @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr, if (unlikely(pte_swp_uffd_wp_any(pteval))) arm_uffd_pte = true; - if (unlikely(arm_uffd_pte)) + if (unlikely(arm_uffd_pte)) { set_pte_at(vma->vm_mm, addr, pte, make_pte_marker(PTE_MARKER_UFFD_WP)); + return true; + } #endif + return false; } static inline bool vma_has_recency(struct vm_area_struct *vma) diff --git a/mm/memory.c b/mm/memory.c index XXXXXXX..XXXXXXX 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -XXX,XX +XXX,XX @@ static inline bool zap_drop_markers(struct zap_details *details) /* * This function makes sure that we'll replace the none pte with an uffd-wp * swap special pte marker when necessary. Must be with the pgtable lock held. + * + * Returns true if uffd-wp ptes was installed, false otherwise. */ -static inline void +static inline bool zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr, pte_t *pte, int nr, struct zap_details *details, pte_t pteval) { + bool was_installed = false; + +#ifdef CONFIG_PTE_MARKER_UFFD_WP /* Zap on anonymous always means dropping everything */ if (vma_is_anonymous(vma)) - return; + return false; if (zap_drop_markers(details)) - return; + return false; for (;;) { /* the PFN in the PTE is irrelevant. */ - pte_install_uffd_wp_if_needed(vma, addr, pte, pteval); + if (pte_install_uffd_wp_if_needed(vma, addr, pte, pteval)) + was_installed = true; if (--nr == 0) break; pte++; addr += PAGE_SIZE; } +#endif + return was_installed; } static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb, -- 2.20.1
Let the caller of do_zap_pte_range() know whether we skip zap ptes or reinstall uffd-wp ptes through any_skipped parameter, so that subsequent commits can use this information in zap_pte_range() to detect whether the PTE page can be reclaimed. Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> --- mm/memory.c | 36 +++++++++++++++++++++--------------- 1 file changed, 21 insertions(+), 15 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index XXXXXXX..XXXXXXX 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -XXX,XX +XXX,XX @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb, struct vm_area_struct *vma, struct folio *folio, struct page *page, pte_t *pte, pte_t ptent, unsigned int nr, unsigned long addr, struct zap_details *details, int *rss, - bool *force_flush, bool *force_break) + bool *force_flush, bool *force_break, bool *any_skipped) { struct mm_struct *mm = tlb->mm; bool delay_rmap = false; @@ -XXX,XX +XXX,XX @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb, arch_check_zapped_pte(vma, ptent); tlb_remove_tlb_entries(tlb, pte, nr, addr); if (unlikely(userfaultfd_pte_wp(vma, ptent))) - zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, - ptent); + *any_skipped = zap_install_uffd_wp_if_needed(vma, addr, pte, + nr, details, ptent); if (!delay_rmap) { folio_remove_rmap_ptes(folio, page, nr, vma); @@ -XXX,XX +XXX,XX @@ static inline int zap_present_ptes(struct mmu_gather *tlb, struct vm_area_struct *vma, pte_t *pte, pte_t ptent, unsigned int max_nr, unsigned long addr, struct zap_details *details, int *rss, bool *force_flush, - bool *force_break) + bool *force_break, bool *any_skipped) { const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY; struct mm_struct *mm = tlb->mm; @@ -XXX,XX +XXX,XX @@ static inline int zap_present_ptes(struct mmu_gather *tlb, arch_check_zapped_pte(vma, ptent); tlb_remove_tlb_entry(tlb, pte, addr); if (userfaultfd_pte_wp(vma, ptent)) - zap_install_uffd_wp_if_needed(vma, addr, pte, 1, - details, ptent); + *any_skipped = zap_install_uffd_wp_if_needed(vma, addr, + pte, 1, details, ptent); ksm_might_unmap_zero_page(mm, ptent); return 1; } folio = page_folio(page); - if (unlikely(!should_zap_folio(details, folio))) + if (unlikely(!should_zap_folio(details, folio))) { + *any_skipped = true; return 1; + } /* * Make sure that the common "small folio" case is as fast as possible @@ -XXX,XX +XXX,XX @@ static inline int zap_present_ptes(struct mmu_gather *tlb, zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr, addr, details, rss, force_flush, - force_break); + force_break, any_skipped); return nr; } zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, 1, addr, - details, rss, force_flush, force_break); + details, rss, force_flush, force_break, any_skipped); return 1; } static inline int zap_nonpresent_ptes(struct mmu_gather *tlb, struct vm_area_struct *vma, pte_t *pte, pte_t ptent, unsigned int max_nr, unsigned long addr, - struct zap_details *details, int *rss) + struct zap_details *details, int *rss, bool *any_skipped) { swp_entry_t entry; int nr = 1; + *any_skipped = true; entry = pte_to_swp_entry(ptent); if (is_device_private_entry(entry) || is_device_exclusive_entry(entry)) { @@ -XXX,XX +XXX,XX @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb, WARN_ON_ONCE(1); } clear_not_present_full_ptes(vma->vm_mm, addr, pte, nr, tlb->fullmm); - zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent); + *any_skipped = zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent); return nr; } @@ -XXX,XX +XXX,XX @@ static inline int do_zap_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pte_t *pte, unsigned long addr, unsigned long end, struct zap_details *details, int *rss, - bool *force_flush, bool *force_break) + bool *force_flush, bool *force_break, + bool *any_skipped) { pte_t ptent = ptep_get(pte); int max_nr = (end - addr) / PAGE_SIZE; @@ -XXX,XX +XXX,XX @@ static inline int do_zap_pte_range(struct mmu_gather *tlb, if (pte_present(ptent)) nr += zap_present_ptes(tlb, vma, pte, ptent, max_nr, addr, - details, rss, force_flush, force_break); + details, rss, force_flush, force_break, + any_skipped); else nr += zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr, - details, rss); + details, rss, any_skipped); return nr; } @@ -XXX,XX +XXX,XX @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, struct zap_details *details) { bool force_flush = false, force_break = false; + bool any_skipped = false; struct mm_struct *mm = tlb->mm; int rss[NR_MM_COUNTERS]; spinlock_t *ptl; @@ -XXX,XX +XXX,XX @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, break; nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, rss, - &force_flush, &force_break); + &force_flush, &force_break, &any_skipped); if (unlikely(force_break)) { addr += nr * PAGE_SIZE; break; -- 2.20.1
In preparation for reclaiming empty PTE pages, this commit first makes zap_pte_range() to handle the full within-PMD range, so that we can more easily detect and free PTE pages in this function in subsequent commits. Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Jann Horn <jannh@google.com> --- mm/memory.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/mm/memory.c b/mm/memory.c index XXXXXXX..XXXXXXX 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -XXX,XX +XXX,XX @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, pte_t *pte; int nr; +retry: tlb_change_page_size(tlb, PAGE_SIZE); init_rss_vec(rss); start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl); @@ -XXX,XX +XXX,XX @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, if (force_flush) tlb_flush_mmu(tlb); + if (addr != end) { + cond_resched(); + force_flush = false; + force_break = false; + goto retry; + } + return addr; } -- 2.20.1
Now in order to pursue high performance, applications mostly use some high-performance user-mode memory allocators, such as jemalloc or tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will release page table memory, which may cause huge page table memory usage. The following are a memory usage snapshot of one process which actually happened on our server: VIRT: 55t RES: 590g VmPTE: 110g In this case, most of the page table entries are empty. For such a PTE page where all entries are empty, we can actually free it back to the system for others to use. As a first step, this commit aims to synchronously free the empty PTE pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude cases other than madvise(MADV_DONTNEED). Once an empty PTE is detected, we first try to hold the pmd lock within the pte lock. If successful, we clear the pmd entry directly (fast path). Otherwise, we wait until the pte lock is released, then re-hold the pmd and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect whether the PTE page is empty and free it (slow path). For other cases such as madvise(MADV_FREE), consider scanning and freeing empty PTE pages asynchronously in the future. The following code snippet can show the effect of optimization: mmap 50G while (1) { for (; i < 1024 * 25; i++) { touch 2M memory madvise MADV_DONTNEED 2M } } As we can see, the memory usage of VmPTE is reduced: before after VIRT 50.0 GB 50.0 GB RES 3.1 MB 3.1 MB VmPTE 102640 KB 240 KB Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> --- include/linux/mm.h | 1 + mm/Kconfig | 15 ++++++++++ mm/Makefile | 1 + mm/internal.h | 19 +++++++++++++ mm/madvise.c | 7 ++++- mm/memory.c | 21 ++++++++++++-- mm/pt_reclaim.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++ 7 files changed, 132 insertions(+), 3 deletions(-) create mode 100644 mm/pt_reclaim.c diff --git a/include/linux/mm.h b/include/linux/mm.h index XXXXXXX..XXXXXXX 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -XXX,XX +XXX,XX @@ extern void pagefault_out_of_memory(void); struct zap_details { struct folio *single_folio; /* Locked folio to be unmapped */ bool even_cows; /* Zap COWed private pages too? */ + bool reclaim_pt; /* Need reclaim page tables? */ zap_flags_t zap_flags; /* Extra flags for zapping */ }; diff --git a/mm/Kconfig b/mm/Kconfig index XXXXXXX..XXXXXXX 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -XXX,XX +XXX,XX @@ config ARCH_HAS_USER_SHADOW_STACK The architecture has hardware support for userspace shadow call stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss). +config ARCH_SUPPORTS_PT_RECLAIM + def_bool n + +config PT_RECLAIM + bool "reclaim empty user page table pages" + default y + depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP + select MMU_GATHER_RCU_TABLE_FREE + help + Try to reclaim empty user page table pages in paths other than munmap + and exit_mmap path. + + Note: now only empty user PTE page table pages will be reclaimed. + + source "mm/damon/Kconfig" endmenu diff --git a/mm/Makefile b/mm/Makefile index XXXXXXX..XXXXXXX 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -XXX,XX +XXX,XX @@ obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o obj-$(CONFIG_EXECMEM) += execmem.o obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o +obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o diff --git a/mm/internal.h b/mm/internal.h index XXXXXXX..XXXXXXX 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -XXX,XX +XXX,XX @@ int walk_page_range_mm(struct mm_struct *mm, unsigned long start, unsigned long end, const struct mm_walk_ops *ops, void *private); +/* pt_reclaim.c */ +bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval); +void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb, + pmd_t pmdval); +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr, + struct mmu_gather *tlb); + +#ifdef CONFIG_PT_RECLAIM +bool reclaim_pt_is_enabled(unsigned long start, unsigned long end, + struct zap_details *details); +#else +static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end, + struct zap_details *details) +{ + return false; +} +#endif /* CONFIG_PT_RECLAIM */ + + #endif /* __MM_INTERNAL_H */ diff --git a/mm/madvise.c b/mm/madvise.c index XXXXXXX..XXXXXXX 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -XXX,XX +XXX,XX @@ static int madvise_free_single_vma(struct vm_area_struct *vma, static long madvise_dontneed_single_vma(struct vm_area_struct *vma, unsigned long start, unsigned long end) { - zap_page_range_single(vma, start, end - start, NULL); + struct zap_details details = { + .reclaim_pt = true, + .even_cows = true, + }; + + zap_page_range_single(vma, start, end - start, &details); return 0; } diff --git a/mm/memory.c b/mm/memory.c index XXXXXXX..XXXXXXX 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -XXX,XX +XXX,XX @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) static inline bool should_zap_cows(struct zap_details *details) { /* By default, zap all pages */ - if (!details) + if (!details || details->reclaim_pt) return true; /* Or, we zap COWed pages only if the caller wants to */ @@ -XXX,XX +XXX,XX @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, struct zap_details *details) { bool force_flush = false, force_break = false; - bool any_skipped = false; struct mm_struct *mm = tlb->mm; int rss[NR_MM_COUNTERS]; spinlock_t *ptl; pte_t *start_pte; pte_t *pte; + pmd_t pmdval; + unsigned long start = addr; + bool can_reclaim_pt = reclaim_pt_is_enabled(start, end, details); + bool direct_reclaim = false; int nr; retry: @@ -XXX,XX +XXX,XX @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); do { + bool any_skipped = false; + if (need_resched()) break; nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, rss, &force_flush, &force_break, &any_skipped); + if (any_skipped) + can_reclaim_pt = false; if (unlikely(force_break)) { addr += nr * PAGE_SIZE; break; } } while (pte += nr, addr += PAGE_SIZE * nr, addr != end); + if (can_reclaim_pt && addr == end) + direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval); + add_mm_rss_vec(mm, rss); arch_leave_lazy_mmu_mode(); @@ -XXX,XX +XXX,XX @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, goto retry; } + if (can_reclaim_pt) { + if (direct_reclaim) + free_pte(mm, start, tlb, pmdval); + else + try_to_free_pte(mm, pmd, start, tlb); + } + return addr; } diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c new file mode 100644 index XXXXXXX..XXXXXXX --- /dev/null +++ b/mm/pt_reclaim.c @@ -XXX,XX +XXX,XX @@ +// SPDX-License-Identifier: GPL-2.0 +#include <linux/hugetlb.h> +#include <asm-generic/tlb.h> +#include <asm/pgalloc.h> + +#include "internal.h" + +bool reclaim_pt_is_enabled(unsigned long start, unsigned long end, + struct zap_details *details) +{ + return details && details->reclaim_pt && (end - start >= PMD_SIZE); +} + +bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval) +{ + spinlock_t *pml = pmd_lockptr(mm, pmd); + + if (!spin_trylock(pml)) + return false; + + *pmdval = pmdp_get_lockless(pmd); + pmd_clear(pmd); + spin_unlock(pml); + + return true; +} + +void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb, + pmd_t pmdval) +{ + pte_free_tlb(tlb, pmd_pgtable(pmdval), addr); + mm_dec_nr_ptes(mm); +} + +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr, + struct mmu_gather *tlb) +{ + pmd_t pmdval; + spinlock_t *pml, *ptl; + pte_t *start_pte, *pte; + int i; + + pml = pmd_lock(mm, pmd); + start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl); + if (!start_pte) + goto out_ptl; + if (ptl != pml) + spin_lock_nested(ptl, SINGLE_DEPTH_NESTING); + + /* Check if it is empty PTE page */ + for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) { + if (!pte_none(ptep_get(pte))) + goto out_ptl; + } + pte_unmap(start_pte); + + pmd_clear(pmd); + + if (ptl != pml) + spin_unlock(ptl); + spin_unlock(pml); + + free_pte(mm, addr, tlb, pmdval); + + return; +out_ptl: + if (start_pte) + pte_unmap_unlock(start_pte, ptl); + if (ptl != pml) + spin_unlock(pml); +} -- 2.20.1
Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, the page table pages will be freed by semi RCU, that is: - batch table freeing: asynchronous free by RCU - single table freeing: IPI + synchronous free In this way, the page table can be lockless traversed by disabling IRQ in paths such as fast GUP. But this is not enough to free the empty PTE page table pages in paths other that munmap and exit_mmap path, because IPI cannot be synchronized with rcu_read_lock() in pte_offset_map{_lock}(). In preparation for supporting empty PTE page table pages reclaimation, let single table also be freed by RCU like batch table freeing. Then we can also use pte_offset_map() etc to prevent PTE page from being freed. Like pte_free_defer(), we can also safely use ptdesc->pt_rcu_head to free the page table pages: - The pt_rcu_head is unioned with pt_list and pmd_huge_pte. - For pt_list, it is used to manage the PGD page in x86. Fortunately tlb_remove_table() will not be used for free PGD pages, so it is safe to use pt_rcu_head. - For pmd_huge_pte, it is used for THPs, so it is safe. After applying this patch, if CONFIG_PT_RECLAIM is enabled, the function call of free_pte() is as follows: free_pte pte_free_tlb __pte_free_tlb ___pte_free_tlb paravirt_tlb_remove_table tlb_remove_table [!CONFIG_PARAVIRT, Xen PV, Hyper-V, KVM] [no-free-memory slowpath:] tlb_table_invalidate tlb_remove_table_one __tlb_remove_table_one [frees via RCU] [fastpath:] tlb_table_flush tlb_remove_table_free [frees via RCU] native_tlb_remove_table [CONFIG_PARAVIRT on native] tlb_remove_table [see above] Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: x86@kernel.org Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> --- arch/x86/include/asm/tlb.h | 20 ++++++++++++++++++++ arch/x86/kernel/paravirt.c | 7 +++++++ arch/x86/mm/pgtable.c | 10 +++++++++- include/linux/mm_types.h | 4 +++- mm/mmu_gather.c | 9 ++++++++- 5 files changed, 47 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h index XXXXXXX..XXXXXXX 100644 --- a/arch/x86/include/asm/tlb.h +++ b/arch/x86/include/asm/tlb.h @@ -XXX,XX +XXX,XX @@ static inline void __tlb_remove_table(void *table) free_page_and_swap_cache(table); } +#ifdef CONFIG_PT_RECLAIM +static inline void __tlb_remove_table_one_rcu(struct rcu_head *head) +{ + struct page *page; + + page = container_of(head, struct page, rcu_head); + put_page(page); +} + +static inline void __tlb_remove_table_one(void *table) +{ + struct page *page; + + page = table; + call_rcu(&page->rcu_head, __tlb_remove_table_one_rcu); +} +#define __tlb_remove_table_one __tlb_remove_table_one +#endif /* CONFIG_PT_RECLAIM */ + static inline void invlpg(unsigned long addr) { asm volatile("invlpg (%0)" ::"r" (addr) : "memory"); } + #endif /* _ASM_X86_TLB_H */ diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index XXXXXXX..XXXXXXX 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -XXX,XX +XXX,XX @@ void __init native_pv_lock_init(void) static_branch_enable(&virt_spin_lock_key); } +#ifndef CONFIG_PT_RECLAIM static void native_tlb_remove_table(struct mmu_gather *tlb, void *table) { tlb_remove_page(tlb, table); } +#else +static void native_tlb_remove_table(struct mmu_gather *tlb, void *table) +{ + tlb_remove_table(tlb, table); +} +#endif struct static_key paravirt_steal_enabled; struct static_key paravirt_steal_rq_enabled; diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index XXXXXXX..XXXXXXX 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -XXX,XX +XXX,XX @@ EXPORT_SYMBOL(physical_mask); #endif #ifndef CONFIG_PARAVIRT +#ifndef CONFIG_PT_RECLAIM static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table) { tlb_remove_page(tlb, table); } -#endif +#else +static inline +void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table) +{ + tlb_remove_table(tlb, table); +} +#endif /* !CONFIG_PT_RECLAIM */ +#endif /* !CONFIG_PARAVIRT */ gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM; diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index XXXXXXX..XXXXXXX 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -XXX,XX +XXX,XX @@ FOLIO_MATCH(compound_head, _head_2a); * struct ptdesc - Memory descriptor for page tables. * @__page_flags: Same as page flags. Powerpc only. * @pt_rcu_head: For freeing page table pages. - * @pt_list: List of used page tables. Used for s390 and x86. + * @pt_list: List of used page tables. Used for s390 gmap shadow pages + * (which are not linked into the user page tables) and x86 + * pgds. * @_pt_pad_1: Padding that aliases with page's compound head. * @pmd_huge_pte: Protected by ptdesc->ptl, used for THPs. * @__page_mapping: Aliases with page->mapping. Unused for page tables. diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c index XXXXXXX..XXXXXXX 100644 --- a/mm/mmu_gather.c +++ b/mm/mmu_gather.c @@ -XXX,XX +XXX,XX @@ static inline void tlb_table_invalidate(struct mmu_gather *tlb) } } -static void tlb_remove_table_one(void *table) +#ifndef __tlb_remove_table_one +static inline void __tlb_remove_table_one(void *table) { tlb_remove_table_sync_one(); __tlb_remove_table(table); } +#endif + +static void tlb_remove_table_one(void *table) +{ + __tlb_remove_table_one(table); +} static void tlb_table_flush(struct mmu_gather *tlb) { -- 2.20.1
Now, x86 has fully supported the CONFIG_PT_RECLAIM feature, and reclaiming PTE pages is profitable only on 64-bit systems, so select ARCH_SUPPORTS_PT_RECLAIM if X86_64. Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: x86@kernel.org Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> --- arch/x86/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index XXXXXXX..XXXXXXX 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -XXX,XX +XXX,XX @@ config X86 select FUNCTION_ALIGNMENT_4B imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE + select ARCH_SUPPORTS_PT_RECLAIM if X86_64 config INSTRUCTION_DECODER def_bool y -- 2.20.1