From: Qi Zheng
To: david@redhat.com, hughd@google.com, willy@infradead.org, mgorman@suse.de, muchun.song@linux.dev, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Qi Zheng
Subject: [RFC PATCH 1/3] mm: pgtable: move pte_free_defer() out of CONFIG_TRANSPARENT_HUGEPAGE
Date: Thu, 13 Jun 2024 16:38:08 +0800
Message-Id: <7864fd8186075ae12fd227f13f4191f3d1bc6764.1718267194.git.zhengqi.arch@bytedance.com>

In order to reuse pte_free_defer() in the subsequent work of freeing
empty user PTE pages, move it out of the CONFIG_TRANSPARENT_HUGEPAGE
range.

No functional change intended.

Signed-off-by: Qi Zheng
---
 arch/powerpc/mm/pgtable-frag.c | 2 --
 arch/s390/mm/pgalloc.c         | 2 --
 arch/sparc/mm/init_64.c        | 2 +-
 mm/pgtable-generic.c           | 2 +-
 4 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 8c31802f97e8..46d8f4bec85e 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -133,7 +133,6 @@ void pte_fragment_free(unsigned long *table, int kernel)
 	}
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
 {
 	struct page *page;
@@ -142,4 +141,3 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
 	SetPageActive(page);
 	pte_fragment_free((unsigned long *)pgtable, 0);
 }
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index abb629d7e131..6415379bd3fd 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -204,7 +204,6 @@ void __tlb_remove_table(void *table)
 	pagetable_pte_dtor_free(ptdesc);
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static void pte_free_now(struct rcu_head *head)
 {
 	struct ptdesc *ptdesc = container_of(head, struct ptdesc, pt_rcu_head);
@@ -223,7 +222,6 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
 	 */
 	WARN_ON_ONCE(mm_has_pgste(mm));
 }
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 /*
  * Base infrastructure required to generate basic asces, region, segment,
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 53d7cb5bbffe..20aaf123c9fc 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2939,7 +2939,6 @@ void pgtable_free(void *table, bool is_page)
 		kmem_cache_free(pgtable_cache, table);
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static void pte_free_now(struct rcu_head *head)
 {
 	struct page *page;
@@ -2956,6 +2955,7 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
 	call_rcu(&page->rcu_head, pte_free_now);
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
 			  pmd_t *pmd)
 {
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a78a4adf711a..197937495a0a 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -233,6 +233,7 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 	return pmd;
 }
 #endif
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 /* arch define pte_free_defer in asm/pgalloc.h for its own implementation */
 #ifndef pte_free_defer
@@ -252,7 +253,6 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
 	call_rcu(&page->rcu_head, pte_free_now);
 }
 #endif /* pte_free_defer */
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
 	(defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
--
2.20.1
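A minimal sketch of the usage this move enables (illustration only, not
part of the series; the helper name example_retract_pte_table is
hypothetical). With pte_free_defer() built unconditionally, any pgtable
user can clear a pmd entry and hand the detached PTE page to RCU, so
lockless walkers still inside their RCU read sections never see it freed
under them:

	/* Hypothetical helper, mirroring how patch 3/3 uses the API. */
	static void example_retract_pte_table(struct mm_struct *mm, pmd_t *pmd,
					      pmd_t pmdval)
	{
		pmd_clear(pmd);			/* detach the PTE page */
		mm_dec_nr_ptes(mm);		/* fix pgtable accounting */
		/* RCU-deferred free instead of an immediate pte_free() */
		pte_free_defer(mm, pmd_pgtable(pmdval));
	}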
From: Qi Zheng
To: david@redhat.com, hughd@google.com, willy@infradead.org, mgorman@suse.de, muchun.song@linux.dev, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Qi Zheng
Subject: [RFC PATCH 2/3] mm: pgtable: make pte_offset_map_nolock() return pmdval
Date: Thu, 13 Jun 2024 16:38:09 +0800

Make pte_offset_map_nolock() return pmdval so that we can recheck the
*pmd once the lock is taken. This is a preparation for freeing empty
PTE pages; no functional changes are expected.

Signed-off-by: Qi Zheng
---
 Documentation/mm/split_page_table_lock.rst |  3 ++-
 arch/arm/mm/fault-armv.c                   |  2 +-
 arch/powerpc/mm/pgtable.c                  |  2 +-
 include/linux/mm.h                         |  4 ++--
 mm/filemap.c                               |  2 +-
 mm/khugepaged.c                            |  4 ++--
 mm/memory.c                                |  4 ++--
 mm/mremap.c                                |  2 +-
 mm/page_vma_mapped.c                       |  2 +-
 mm/pgtable-generic.c                       | 21 ++++++++++++---------
 mm/userfaultfd.c                           |  4 ++--
 mm/vmscan.c                                |  2 +-
 12 files changed, 28 insertions(+), 24 deletions(-)

diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index e4f6972eb6c0..e6a47d57531c 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -18,7 +18,8 @@ There are helpers to lock/unlock a table and other accessor functions:
 	pointer to its PTE table lock, or returns NULL if no PTE table;
 - pte_offset_map_nolock()
 	maps PTE, returns pointer to PTE with pointer to its PTE table
-	lock (not taken), or returns NULL if no PTE table;
+	lock (not taken) and the value of its pmd entry, or returns NULL
+	if no PTE table;
 - pte_offset_map()
 	maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
 - pte_unmap()
diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index 2286c2ea60ec..3e4ed99b9330 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -117,7 +117,7 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
 	 * must use the nested version.  This also means we need to
 	 * open-code the spin-locking.
 	 */
-	pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
+	pte = pte_offset_map_nolock(vma->vm_mm, pmd, NULL, address, &ptl);
 	if (!pte)
 		return 0;
 
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 9e7ba9c3851f..ab0250f1b226 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -350,7 +350,7 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
 	 */
 	if (pmd_none(*pmd))
 		return;
-	pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
+	pte = pte_offset_map_nolock(mm, pmd, NULL, addr, &ptl);
 	BUG_ON(!pte);
 	assert_spin_locked(ptl);
 	pte_unmap(pte);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 106bb0310352..d5550c3dc550 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2969,8 +2969,8 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
 	return pte;
 }
 
-pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
-			     unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdvalp,
+			     unsigned long addr, spinlock_t **ptlp);
 
 #define pte_unmap_unlock(pte, ptl)	do {		\
 	spin_unlock(ptl);				\
diff --git a/mm/filemap.c b/mm/filemap.c
index 37061aafd191..7eb2e3599966 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3231,7 +3231,7 @@ static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
 	if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID))
 		return 0;
 
-	ptep = pte_offset_map_nolock(vma->vm_mm, vmf->pmd, vmf->address,
+	ptep = pte_offset_map_nolock(vma->vm_mm, vmf->pmd, NULL, vmf->address,
 				     &vmf->ptl);
 	if (unlikely(!ptep))
 		return VM_FAULT_NOPAGE;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 774a97e6e2da..2a8703ee876c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -992,7 +992,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 		};
 
 		if (!pte++) {
-			pte = pte_offset_map_nolock(mm, pmd, address, &ptl);
+			pte = pte_offset_map_nolock(mm, pmd, NULL, address, &ptl);
 			if (!pte) {
 				mmap_read_unlock(mm);
 				result = SCAN_PMD_NULL;
@@ -1581,7 +1581,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
 		pml = pmd_lock(mm, pmd);
 
-	start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
+	start_pte = pte_offset_map_nolock(mm, pmd, NULL, haddr, &ptl);
 	if (!start_pte)		/* mmap_lock + page lock should prevent this */
 		goto abort;
 	if (!pml)
diff --git a/mm/memory.c b/mm/memory.c
index 1bd2ffb76ec2..694c0989a1d8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1108,7 +1108,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		ret = -ENOMEM;
 		goto out;
 	}
-	src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
+	src_pte = pte_offset_map_nolock(src_mm, src_pmd, NULL, addr, &src_ptl);
 	if (!src_pte) {
 		pte_unmap_unlock(dst_pte, dst_ptl);
 		/* ret == 0 */
@@ -5486,7 +5486,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		 * it into a huge pmd: just retry later if so.
 		 */
 		vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
-						 vmf->address, &vmf->ptl);
+						 NULL, vmf->address, &vmf->ptl);
 		if (unlikely(!vmf->pte))
 			return 0;
 		vmf->orig_pte = ptep_get_lockless(vmf->pte);
diff --git a/mm/mremap.c b/mm/mremap.c
index e7ae140fc640..f672d0218a6f 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -175,7 +175,7 @@ static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		err = -EAGAIN;
 		goto out;
 	}
-	new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
+	new_pte = pte_offset_map_nolock(mm, new_pmd, NULL, new_addr, &new_ptl);
 	if (!new_pte) {
 		pte_unmap_unlock(old_pte, old_ptl);
 		err = -EAGAIN;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index ae5cc42aa208..507701b7bcc1 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -33,7 +33,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
 	 * Though, in most cases, page lock already protects this.
 	 */
 	pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
-					  pvmw->address, ptlp);
+					  NULL, pvmw->address, ptlp);
 	if (!pvmw->pte)
 		return false;
 
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 197937495a0a..b8b28715cb4f 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -305,7 +305,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 	return NULL;
 }
 
-pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
+pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdvalp,
 			     unsigned long addr, spinlock_t **ptlp)
 {
 	pmd_t pmdval;
@@ -314,6 +314,8 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 	pte = __pte_offset_map(pmd, addr, &pmdval);
 	if (likely(pte))
 		*ptlp = pte_lockptr(mm, &pmdval);
+	if (pmdvalp)
+		*pmdvalp = pmdval;
 	return pte;
 }
 
@@ -347,14 +349,15 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
  * and disconnected table.  Until pte_unmap(pte) unmaps and rcu_read_unlock()s
  * afterwards.
  *
- * pte_offset_map_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
- * but when successful, it also outputs a pointer to the spinlock in ptlp - as
- * pte_offset_map_lock() does, but in this case without locking it.  This helps
- * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
- * act on a changed *pmd: pte_offset_map_nolock() provides the correct spinlock
- * pointer for the page table that it returns.  In principle, the caller should
- * recheck *pmd once the lock is taken; in practice, no callsite needs that -
- * either the mmap_lock for write, or pte_same() check on contents, is enough.
+ * pte_offset_map_nolock(mm, pmd, pmdvalp, addr, ptlp), above, is like
+ * pte_offset_map(); but when successful, it also outputs a pointer to the
+ * spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
+ * locking it.  This helps the caller to avoid a later pte_lockptr(mm, *pmd),
+ * which might by that time act on a changed *pmd: pte_offset_map_nolock()
+ * provides the correct spinlock pointer for the page table that it returns.
+ * In principle, the caller should recheck *pmd once the lock is taken; but in
+ * most cases, either the mmap_lock for write, or pte_same() check on contents,
+ * is enough.
  *
  * Note that free_pgtables(), used after unmapping detached vmas, or when
  * exiting the whole mm, does not take page table lock before freeing a page
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 5e7f2801698a..9c77271d499c 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1143,7 +1143,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 				src_addr, src_addr + PAGE_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
 retry:
-	dst_pte = pte_offset_map_nolock(mm, dst_pmd, dst_addr, &dst_ptl);
+	dst_pte = pte_offset_map_nolock(mm, dst_pmd, NULL, dst_addr, &dst_ptl);
 
 	/* Retry if a huge pmd materialized from under us */
 	if (unlikely(!dst_pte)) {
@@ -1151,7 +1151,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 		goto out;
 	}
 
-	src_pte = pte_offset_map_nolock(mm, src_pmd, src_addr, &src_ptl);
+	src_pte = pte_offset_map_nolock(mm, src_pmd, NULL, src_addr, &src_ptl);
 
 	/*
 	 * We held the mmap_lock for reading so MADV_DONTNEED
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c0429fd6c573..56727caa907b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3374,7 +3374,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 	DEFINE_MAX_SEQ(walk->lruvec);
 	int old_gen, new_gen = lru_gen_from_seq(max_seq);
 
-	pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
+	pte = pte_offset_map_nolock(args->mm, pmd, NULL, start & PMD_MASK, &ptl);
 	if (!pte)
 		return false;
 	if (!spin_trylock(ptl)) {
--
2.20.1
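A minimal caller sketch for the extended interface (illustration only,
not part of the series; the function name example_walk_ptes is
hypothetical and the locking shown is the simplest case). The caller
records the pmd value at map time and rechecks it with pmd_same() once
the PTE lock is held, so a PTE page freed or replaced in the meantime
is detected:

	static void example_walk_ptes(struct mm_struct *mm, pmd_t *pmd,
				      unsigned long addr)
	{
		spinlock_t *ptl;
		pmd_t pmdval;
		pte_t *pte;

		pte = pte_offset_map_nolock(mm, pmd, &pmdval, addr, &ptl);
		if (!pte)
			return;
		spin_lock(ptl);
		if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
			/* The PTE page changed under us; bail out. */
			pte_unmap_unlock(pte, ptl);
			return;
		}
		/* ... operate on the mapped PTE page ... */
		pte_unmap_unlock(pte, ptl);
	}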
From: Qi Zheng
To: david@redhat.com, hughd@google.com, willy@infradead.org, mgorman@suse.de, muchun.song@linux.dev, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Qi Zheng
Subject: [RFC PATCH 3/3] mm: free empty user PTE pages
Date: Thu, 13 Jun 2024 16:38:10 +0800

In pursuit of high performance, applications mostly use high-performance
user-mode memory allocators such as jemalloc or tcmalloc. These
allocators release physical memory with madvise(MADV_DONTNEED or
MADV_FREE), but neither MADV_DONTNEED nor MADV_FREE releases page table
memory, which can lead to huge page table memory usage. The following is
a memory usage snapshot of one process that actually occurred on our
server:

	VIRT:  55t
	RES:  590g
	VmPTE: 110g

In this case, most of the page table entries are empty. For such a PTE
page, where all entries are empty, we can free it back to the system for
others to use.

Similar to numa_balancing, this commit adds a task_work that scans the
address space of the user process when it returns to user space. If a
suitable empty PTE page is found, it is released.
The following test snippet shows the effect of the optimization:

	mmap 50G
	while (1) {
		for (; i < 1024 * 25; i++) {
			touch 2M memory
			madvise MADV_DONTNEED 2M
		}
	}

As we can see, the memory usage of VmPTE is reduced:

		before		after
	VIRT	50.0 GB		50.0 GB
	RES	 3.1 MB		 3.1 MB
	VmPTE	102640 kB	756 kB (even less)

Signed-off-by: Qi Zheng
---
 include/linux/mm_types.h |   4 +
 include/linux/pgtable.h  |  14 +++
 include/linux/sched.h    |   1 +
 kernel/sched/core.c      |   1 +
 kernel/sched/fair.c      |   2 +
 mm/Makefile              |   2 +-
 mm/freept.c              | 180 +++++++++++++++++++++++++++++++++++++++
 mm/khugepaged.c          |  18 +++-
 8 files changed, 220 insertions(+), 2 deletions(-)
 create mode 100644 mm/freept.c

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ef09c4eef6d3..bbc697fa4a83 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -839,6 +839,10 @@ struct mm_struct {
 #endif
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* size of all page tables */
+		/* Next mm_pgtable scan (in jiffies) */
+		unsigned long mm_pgtable_next_scan;
+		/* Restart point for scanning and freeing empty user PTE pages */
+		unsigned long mm_pgtable_scan_offset;
 #endif
 		int map_count;			/* number of VMAs */
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index fbff20070ca3..4d1cfaa92422 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1589,6 +1589,20 @@ static inline unsigned long my_zero_pfn(unsigned long addr)
 }
 #endif /* CONFIG_MMU */
 
+#ifdef CONFIG_MMU
+#define MM_PGTABLE_SCAN_DELAY	100	/* 100ms */
+#define MM_PGTABLE_SCAN_SIZE	256	/* 256MB */
+void init_mm_pgtable_work(struct task_struct *p);
+void task_tick_mm_pgtable(struct task_struct *curr);
+#else
+static inline void init_mm_pgtable_work(struct task_struct *p)
+{
+}
+static inline void task_tick_mm_pgtable(struct task_struct *curr)
+{
+}
+#endif
+
 #ifdef CONFIG_MMU
 
 #ifndef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 73c874e051f7..5c0f3d96d608 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1485,6 +1485,7 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct		*oom_reaper_list;
 	struct timer_list		oom_reaper_timer;
+	struct callback_head		pgtable_work;
 #endif
 #ifdef CONFIG_VMAP_STACK
 	struct vm_struct		*stack_vm_area;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c663075c86fb..d5f6df6f5c32 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4359,6 +4359,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->migration_pending = NULL;
 #endif
 	init_sched_mm_cid(p);
+	init_mm_pgtable_work(p);
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41b58387023d..bbc7cbf22eaa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12696,6 +12696,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
 
+	task_tick_mm_pgtable(curr);
+
 	update_misfit_status(curr, rq);
 	check_update_overutilized_status(task_rq(curr));
 
diff --git a/mm/Makefile b/mm/Makefile
index 8fb85acda1b1..af1a324aa65e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -54,7 +54,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   mm_init.o percpu.o slab_common.o \
 			   compaction.o show_mem.o shmem_quota.o\
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o gup.o mmap_lock.o $(mmu-y)
+			   debug.o gup.o mmap_lock.o freept.o $(mmu-y)
 
 # Give 'page_alloc' its own module-parameter namespace
 page-alloc-y := page_alloc.o
diff --git a/mm/freept.c b/mm/freept.c
new file mode 100644
index 000000000000..ed1ea5535e03
--- /dev/null
+++ b/mm/freept.c
@@ -0,0 +1,180 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/pagewalk.h>
+#include <linux/task_work.h>
+#include <linux/mmu_notifier.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/hugetlb.h>
+
+void task_tick_mm_pgtable(struct task_struct *curr)
+{
+	struct callback_head *work = &curr->pgtable_work;
+	unsigned long now = jiffies;
+
+	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
+	    work->next != work)
+		return;
+
+	if (time_before(now, READ_ONCE(curr->mm->mm_pgtable_next_scan)))
+		return;
+
+	task_work_add(curr, work, TWA_RESUME);
+}
+
+/*
+ * Locking:
+ * - already held the mmap read lock to traverse the vma tree and pgtable
+ * - use pmd lock for clearing pmd entry
+ * - use pte lock for checking empty PTE page, and release it after clearing
+ *   pmd entry, then we can capture the changed pmd in pte_offset_map_lock()
+ *   etc after holding this pte lock. Thanks to this, we don't need to hold the
+ *   rmap-related locks.
+ * - users of pte_offset_map_lock() etc all expect the PTE page to be stable by
+ *   using rcu lock, so use pte_free_defer() to free PTE pages.
+ */
+static int freept_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next,
+			    struct mm_walk *walk)
+{
+	struct mmu_notifier_range range;
+	struct mm_struct *mm = walk->mm;
+	pte_t *start_pte, *pte;
+	pmd_t pmdval;
+	spinlock_t *pml = NULL, *ptl;
+	unsigned long haddr = addr;
+	int i;
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
+				haddr, haddr + PMD_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
+
+	start_pte = pte_offset_map_nolock(mm, pmd, &pmdval, haddr, &ptl);
+	if (!start_pte)
+		goto out;
+
+	pml = pmd_lock(mm, pmd);
+	if (ptl != pml)
+		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
+		goto out_ptl;
+
+	/* Check if it is empty PTE page */
+	for (i = 0, addr = haddr, pte = start_pte;
+	     i < PTRS_PER_PTE; i++, addr += PAGE_SIZE, pte++) {
+		if (!pte_none(ptep_get(pte)))
+			goto out_ptl;
+	}
+	pte_unmap(start_pte);
+
+	pmd_clear(pmd);
+	flush_tlb_range(walk->vma, haddr, haddr + PMD_SIZE);
+	pmdp_get_lockless_sync();
+	if (ptl != pml)
+		spin_unlock(ptl);
+	spin_unlock(pml);
+
+	mmu_notifier_invalidate_range_end(&range);
+
+	mm_dec_nr_ptes(mm);
+	pte_free_defer(mm, pmd_pgtable(pmdval));
+
+	return 0;
+
+out_ptl:
+	pte_unmap_unlock(start_pte, ptl);
+	if (pml != ptl)
+		spin_unlock(pml);
+out:
+	mmu_notifier_invalidate_range_end(&range);
+
+	return 0;
+}
+
+static const struct mm_walk_ops mm_pgtable_walk_ops = {
+	.pmd_entry	= freept_pmd_entry,
+	.walk_lock	= PGWALK_RDLOCK,
+};
+
+static void task_mm_pgtable_work(struct callback_head *work)
+{
+	unsigned long now = jiffies, old_scan, next_scan;
+	struct task_struct *p = current;
+	struct mm_struct *mm = p->mm;
+	struct vm_area_struct *vma;
+	unsigned long start, end;
+	struct vma_iterator vmi;
+
+	work->next = work;	/* Prevent double-add */
+	if (p->flags & PF_EXITING)
+		return;
+
+	if (!mm->mm_pgtable_next_scan) {
+		mm->mm_pgtable_next_scan = now + msecs_to_jiffies(MM_PGTABLE_SCAN_DELAY);
+		return;
+	}
+
+	old_scan = mm->mm_pgtable_next_scan;
+	if (time_before(now, old_scan))
+		return;
+
+	next_scan = now + msecs_to_jiffies(MM_PGTABLE_SCAN_DELAY);
+	if (!try_cmpxchg(&mm->mm_pgtable_next_scan, &old_scan, next_scan))
+		return;
+
+	if (!mmap_read_trylock(mm))
+		return;
+
+	start = mm->mm_pgtable_scan_offset;
+	vma_iter_init(&vmi, mm, start);
+	vma = vma_next(&vmi);
+	if (!vma) {
+		mm->mm_pgtable_scan_offset = 0;
+		start = 0;
+		vma_iter_set(&vmi, start);
+		vma = vma_next(&vmi);
+	}
+
+	do {
+		/* Skip hugetlb case */
+		if (is_vm_hugetlb_page(vma))
+			continue;
+
+		/* Leave this to the THP path to handle */
+		if (vma->vm_flags & VM_HUGEPAGE)
+			continue;
+
+		/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
+		if (userfaultfd_wp(vma))
+			continue;
+
+		/* Only consider PTE pages that do not cross vmas */
+		start = ALIGN(vma->vm_start, PMD_SIZE);
+		end = ALIGN_DOWN(vma->vm_end, PMD_SIZE);
+		if (end - start < PMD_SIZE)
+			continue;
+
+		walk_page_range_vma(vma, start, end, &mm_pgtable_walk_ops, NULL);
+
+		if (end - mm->mm_pgtable_scan_offset >= (MM_PGTABLE_SCAN_SIZE << 20))
+			goto out;
+
+		cond_resched();
+	} for_each_vma(vmi, vma);
+
+out:
+	mm->mm_pgtable_scan_offset = vma ? end : 0;
+	mmap_read_unlock(mm);
+}
+
+void init_mm_pgtable_work(struct task_struct *p)
+{
+	struct mm_struct *mm = p->mm;
+	int mm_users = 0;
+
+	if (mm) {
+		mm_users = atomic_read(&mm->mm_users);
+		if (mm_users == 1)
+			mm->mm_pgtable_next_scan = jiffies + msecs_to_jiffies(MM_PGTABLE_SCAN_DELAY);
+	}
+	p->pgtable_work.next = &p->pgtable_work;	/* Protect against double add */
+	init_task_work(&p->pgtable_work, task_mm_pgtable_work);
+}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2a8703ee876c..a2b96f4ba737 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1581,7 +1581,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
 		pml = pmd_lock(mm, pmd);
 
-	start_pte = pte_offset_map_nolock(mm, pmd, NULL, haddr, &ptl);
+	start_pte = pte_offset_map_nolock(mm, pmd, &pgt_pmd, haddr, &ptl);
 	if (!start_pte)		/* mmap_lock + page lock should prevent this */
 		goto abort;
 	if (!pml)
@@ -1589,6 +1589,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	else if (ptl != pml)
 		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 
+	/* pmd entry may be changed by others */
+	if (unlikely(!pml && !pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
+		goto abort;
+
 	/* step 2: clear page table and adjust rmap */
 	for (i = 0, addr = haddr, pte = start_pte;
 	     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
@@ -1636,6 +1640,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		pml = pmd_lock(mm, pmd);
 		if (ptl != pml)
 			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+		if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
+			spin_unlock(ptl);
+			goto unlock;
+		}
 	}
 	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
 	pmdp_get_lockless_sync();
@@ -1663,6 +1672,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	}
 	if (start_pte)
 		pte_unmap_unlock(start_pte, ptl);
+unlock:
 	if (pml && pml != ptl)
 		spin_unlock(pml);
 	if (notified)
@@ -1722,6 +1732,12 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		mmu_notifier_invalidate_range_start(&range);
 
 		pml = pmd_lock(mm, pmd);
+		/* check if the pmd is still valid */
+		if (check_pmd_still_valid(mm, addr, pmd) != SCAN_SUCCEED) {
+			spin_unlock(pml);
+			mmu_notifier_invalidate_range_end(&range);
+			continue;
+		}
 		ptl = pte_lockptr(mm, pmd);
 		if (ptl != pml)
 			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
--
2.20.1
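A runnable userspace approximation of the test snippet from the patch 3/3
commit message (a hypothetical harness, not part of the series; chunk
count and sizes taken from that snippet). It touches 2 MiB chunks and
immediately releases them with MADV_DONTNEED, which frees the pages but,
without this series, leaves the now-empty PTE tables behind as VmPTE in
/proc/<pid>/status:

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	#define CHUNK	(2UL << 20)		/* 2 MiB */
	#define NCHUNKS	(1024UL * 25)		/* 25 * 1024 * 2 MiB = 50 GiB */

	int main(void)
	{
		char *base = mmap(NULL, CHUNK * NCHUNKS, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
		if (base == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		for (unsigned long i = 0; i < NCHUNKS; i++) {
			char *chunk = base + i * CHUNK;
			memset(chunk, 1, CHUNK);		/* populate PTEs */
			madvise(chunk, CHUNK, MADV_DONTNEED);	/* frees pages, not PTE tables */
		}
		getchar();	/* pause here and inspect VmPTE in /proc/<pid>/status */
		return 0;
	}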