From nobody Mon Jun 8 04:24:22 2026 Received: from out-176.mta1.migadu.com (out-176.mta1.migadu.com [95.215.58.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 649503F39FC for ; Tue, 2 Jun 2026 14:25:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.176 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410358; cv=none; b=fvmTi+NXUk2JtSLfNjxDiQaTFsDFw5IzDR1GZL4Jv6WaZyl/UP6CAeIsKc9bwkcncFJO/WSemaZck6rwLYJ0tQVUwpQRd6UbmyOdkqdEGuxb7mucMLqvsB36KPa7tscEPEYQ15ZA7O6IIBmMlZqiqpfbqhdm0WZ3lf/rGpPWDDA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410358; c=relaxed/simple; bh=fASsLWpchBDrn1WDQ+m/YiVw/Q+a20SLDuxNasYwaDE=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=Xuu1z5aJAx2ISZdTZ1ZGpFpeh2I9a5nGMZY5zQXojKDIPHDIjFue8edAZxcaATB5eOp/leiCtbDZgHmWSmaJfy/ebd9/HHye4u130m5AuguQTFed7bzhjK3eF3ig4kvzDkBVdbq240G2qwQZFSN8adQRkp6SNURaj8oyi8icEdA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=FiQ60Fcz; arc=none smtp.client-ip=95.215.58.176 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="FiQ60Fcz" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1780410354; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Vz/wWPCZYCNnLyRczeohRvebmJAcFv6dSFI8rMdsXis=; b=FiQ60FcztJplSnJxuCbZlUmDxp6sKZSQBmN2NBggSduL8HBuutEcZw3c80OfBeDdMfKRxB wVRUZpFgkR2TBSqh3pKfMopMjbXwVeYOACM1/1RM9tzbzAO8xP0BDBU+bXF+j44m8jG9i8 BA2rlnr4dI64lUMce5hvZUgoYz1ey5g= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: ying.huang@linux.alibaba.com, Baoquan He , willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam R. Howlett , ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [v2 01/16] mm: add softleaf_to_pmd() and convert existing callers Date: Tue, 2 Jun 2026 07:24:09 -0700 Message-ID: <20260602142537.198755-2-usama.arif@linux.dev> In-Reply-To: <20260602142537.198755-1-usama.arif@linux.dev> References: <20260602142537.198755-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" Add softleaf_to_pmd() as the PMD counterpart to softleaf_to_pte(), completing the symmetry of the softleaf abstraction for page table leaf entries. The upcoming PMD swap entry support needs to construct PMD entries from swap entries. 
Converting existing swp_entry_to_pmd() callers to softleaf_to_pmd() in a prep patch keeps the feature patches focused on new functionality rather than mixing refactoring with new code. Acked-by: David Hildenbrand (Arm) Signed-off-by: Usama Arif --- include/linux/leafops.h | 20 ++++++++++++++++++++ mm/debug_vm_pgtable.c | 4 ++-- mm/huge_memory.c | 12 ++++++------ mm/migrate_device.c | 2 +- 4 files changed, 29 insertions(+), 9 deletions(-) diff --git a/include/linux/leafops.h b/include/linux/leafops.h index 992cd8bd8ed0..803d312437df 100644 --- a/include/linux/leafops.h +++ b/include/linux/leafops.h @@ -108,6 +108,21 @@ static inline softleaf_t softleaf_from_pmd(pmd_t pmd) return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry)); } =20 +/** + * softleaf_to_pmd() - Obtain a PMD entry from a leaf entry. + * @entry: Leaf entry. + * + * This generates an architecture-specific PMD entry that can be utilised = to + * encode the metadata the leaf entry encodes. + * + * Returns: Architecture-specific PMD entry encoding leaf entry. + */ +static inline pmd_t softleaf_to_pmd(softleaf_t entry) +{ + /* Temporary until swp_entry_t eliminated. 
*/ + return swp_entry_to_pmd(entry); +} + #else =20 static inline softleaf_t softleaf_from_pmd(pmd_t pmd) @@ -115,6 +130,11 @@ static inline softleaf_t softleaf_from_pmd(pmd_t pmd) return softleaf_mk_none(); } =20 +static inline pmd_t softleaf_to_pmd(softleaf_t entry) +{ + return __pmd(0); +} + #endif =20 /** diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 23dc3ee09561..18411fb09aab 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -758,7 +758,7 @@ static void __init pmd_leaf_soft_dirty_tests(struct pgt= able_debug_args *args) return; =20 pr_debug("Validating PMD swap soft dirty\n"); - pmd =3D swp_entry_to_pmd(args->leaf_entry); + pmd =3D softleaf_to_pmd(args->leaf_entry); WARN_ON(!pmd_is_huge(pmd)); WARN_ON(!pmd_is_valid_softleaf(pmd)); =20 @@ -829,7 +829,7 @@ static void __init pmd_softleaf_tests(struct pgtable_de= bug_args *args) return; =20 pr_debug("Validating PMD swap\n"); - pmd1 =3D swp_entry_to_pmd(args->leaf_entry); + pmd1 =3D softleaf_to_pmd(args->leaf_entry); WARN_ON(!pmd_is_huge(pmd1)); WARN_ON(!pmd_is_valid_softleaf(pmd1)); =20 diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 7f172f3257e8..15913a37b6df 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1820,7 +1820,7 @@ static void copy_huge_non_present_pmd( if (softleaf_is_migration_write(entry) || softleaf_is_migration_read_exclusive(entry)) { entry =3D make_readable_migration_entry(swp_offset(entry)); - pmd =3D swp_entry_to_pmd(entry); + pmd =3D softleaf_to_pmd(entry); if (pmd_swp_soft_dirty(*src_pmd)) pmd =3D pmd_swp_mksoft_dirty(pmd); if (pmd_swp_uffd_wp(*src_pmd)) @@ -1833,7 +1833,7 @@ static void copy_huge_non_present_pmd( */ if (softleaf_is_device_private_write(entry)) { entry =3D make_readable_device_private_entry(swp_offset(entry)); - pmd =3D swp_entry_to_pmd(entry); + pmd =3D softleaf_to_pmd(entry); =20 if (pmd_swp_soft_dirty(*src_pmd)) pmd =3D pmd_swp_mksoft_dirty(pmd); @@ -2571,12 +2571,12 @@ static void change_non_present_huge_pmd(struct mm_s= 
truct *mm, entry =3D make_readable_exclusive_migration_entry(swp_offset(entry)); else entry =3D make_readable_migration_entry(swp_offset(entry)); - newpmd =3D swp_entry_to_pmd(entry); + newpmd =3D softleaf_to_pmd(entry); if (pmd_swp_soft_dirty(*pmd)) newpmd =3D pmd_swp_mksoft_dirty(newpmd); } else if (softleaf_is_device_private_write(entry)) { entry =3D make_readable_device_private_entry(swp_offset(entry)); - newpmd =3D swp_entry_to_pmd(entry); + newpmd =3D softleaf_to_pmd(entry); if (pmd_swp_uffd_wp(*pmd)) newpmd =3D pmd_swp_mkuffd_wp(newpmd); } else { @@ -4901,7 +4901,7 @@ int set_pmd_migration_entry(struct page_vma_mapped_wa= lk *pvmw, entry =3D make_migration_entry_young(entry); if (pmd_dirty(pmdval)) entry =3D make_migration_entry_dirty(entry); - pmdswp =3D swp_entry_to_pmd(entry); + pmdswp =3D softleaf_to_pmd(entry); if (pmd_soft_dirty(pmdval)) pmdswp =3D pmd_swp_mksoft_dirty(pmdswp); if (pmd_uffd_wp(pmdval)) @@ -4952,7 +4952,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk= *pvmw, struct page *new) else entry =3D make_readable_device_private_entry( page_to_pfn(new)); - pmde =3D swp_entry_to_pmd(entry); + pmde =3D softleaf_to_pmd(entry); =20 if (pmd_swp_soft_dirty(*pvmw->pmd)) pmde =3D pmd_swp_mksoft_dirty(pmde); diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 554754eb26ff..ab93a8d11b70 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -835,7 +835,7 @@ static int migrate_vma_insert_huge_pmd_page(struct migr= ate_vma *migrate, else swp_entry =3D make_readable_device_private_entry( page_to_pfn(page)); - entry =3D swp_entry_to_pmd(swp_entry); + entry =3D softleaf_to_pmd(swp_entry); } else { if (folio_is_zone_device(folio) && !folio_is_device_coherent(folio)) { --=20 2.52.0 From nobody Mon Jun 8 04:24:22 2026 Received: from out-172.mta1.migadu.com (out-172.mta1.migadu.com [95.215.58.172]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) 
with ESMTPS id CE4E13F44F4 for ; Tue, 2 Jun 2026 14:26:09 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.172 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410371; cv=none; b=fWSFKoSN2ZjkDZyjOLaSXHwYjWhpV9tbl0eKNxVHYJPwp5tG+KE4XIOWCcCRicVzqJmrPwv8fYkhNr/2ctCmP7naEkGmcPeU3+KC3D4EqheVG/WbZmwJvFoAkL2A3VgZ1CZ0837sBwZqDQoIPmt6+1Ax8HBrkE5QSifjREl0uE0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410371; c=relaxed/simple; bh=6maPEhjrpUdckQ2aCYWcZwWlNpHHVRSC389I+flZ+g4=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=mxSKneKUYfpDTf/lRS9FYYe5UIzlv7vDMUzcfylaQYh4jcoTSYpYTPoLw89g7a9FtgRxCOejKpMqHosRn7Uy5IL9LQIsaeEnlp0ES4mQYH94tk2VT+Am20gEEeokPaUiH1wYzJ4qv+tyrMRofFsjaLM0I4mD3a8DGm67i2MRzMM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=L2RI+gNB; arc=none smtp.client-ip=95.215.58.172 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="L2RI+gNB" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1780410367; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=aaCCVm5RLQadoWEaT8DmWxpLPBnStemVS7Mt3XQQms8=; b=L2RI+gNBcMGevj0O6kgQ8FIn4IwANqNs7GPGJG7QXneLD7ooNbxp4zZf0UOO7jAR9KTg7Z zNaXOQctDLp+OunulhuUgj4mGHRL+FTZd+xBL8ALS6Ar0on4DhVaxoJMO8JS94nFh7dvBh 1rwDuKvlYpDaQ63Xla/iDQVnDe6zH2w= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: ying.huang@linux.alibaba.com, Baoquan He , willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam R. Howlett , ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [v2 02/16] mm: extract mm_prepare_for_swap_entries() helper Date: Tue, 2 Jun 2026 07:24:10 -0700 Message-ID: <20260602142537.198755-3-usama.arif@linux.dev> In-Reply-To: <20260602142537.198755-1-usama.arif@linux.dev> References: <20260602142537.198755-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" When a swap entry is installed in a page table, the mm must be added to init_mm.mmlist so that swapoff can find and unuse its swap entries. This double-checked locking pattern is currently open-coded in try_to_unmap_one() and copy_nonpresent_pte(). 
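For illustration only, the double-checked pattern described above can be sketched in userspace C. Everything named toy_* is a hypothetical stand-in (a pthread mutex instead of mmlist_lock, a hand-rolled circular list instead of list_head, a static list head instead of init_mm.mmlist); it is not the kernel implementation, and the unlocked first check is a plain read that a real concurrent program would have to justify separately.

```c
/* Userspace sketch only: toy_* names are assumptions, not kernel APIs. */
#include <pthread.h>
#include <stdbool.h>

struct toy_list {
	struct toy_list *next, *prev;
};

struct toy_mm {
	struct toy_list mmlist;
};

/* Stand-ins for init_mm.mmlist and mmlist_lock. */
static struct toy_list toy_init_mmlist = { &toy_init_mmlist, &toy_init_mmlist };
static pthread_mutex_t toy_mmlist_lock = PTHREAD_MUTEX_INITIALIZER;

static void toy_list_init(struct toy_list *node)
{
	node->next = node;
	node->prev = node;
}

/* A node that points at itself is on no list. */
static bool toy_list_empty(const struct toy_list *node)
{
	return node->next == node;
}

/* Link @node directly after @head in the circular list. */
static void toy_list_add(struct toy_list *node, struct toy_list *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

/* Ensure @mm is on the global list; cheap when it already is. */
static void toy_mm_prepare_for_swap_entries(struct toy_mm *mm)
{
	if (toy_list_empty(&mm->mmlist)) {		/* unlocked fast path */
		pthread_mutex_lock(&toy_mmlist_lock);
		if (toy_list_empty(&mm->mmlist))	/* recheck under the lock */
			toy_list_add(&mm->mmlist, &toy_init_mmlist);
		pthread_mutex_unlock(&toy_mmlist_lock);
	}
}

/* Walk the whole list from the global head and count its members. */
static int toy_mmlist_count(void)
{
	const struct toy_list *pos;
	int n = 0;

	for (pos = toy_init_mmlist.next; pos != &toy_init_mmlist; pos = pos->next)
		n++;
	return n;
}
```

Calling toy_mm_prepare_for_swap_entries() repeatedly on the same toy_mm links it only once, which is the property the callers rely on.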
Move it into mm_prepare_for_swap_entries() in mm/internal.h and convert both callers so it can be reused by upcoming PMD-level swap entry code paths that also need to register the mm with swapoff. copy_nonpresent_pte() previously inserted into &src_mm->mmlist rather than &init_mm.mmlist, but the insertion point is irrelevant, mmlist is a circular list and swapoff walks it entirely from init_mm.mmlist, so only membership matters, not position. Reviewed-by: Dev Jain Signed-off-by: Usama Arif --- mm/internal.h | 13 +++++++++++++ mm/memory.c | 9 +-------- mm/rmap.c | 7 +------ 3 files changed, 15 insertions(+), 14 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 181e79f1d6a2..ace2f8ef1d35 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1951,4 +1951,17 @@ static inline int get_sysctl_max_map_count(void) bool may_expand_vm(struct mm_struct *mm, const vma_flags_t *vma_flags, unsigned long npages); =20 +/* + * Ensure @mm is on the init_mm.mmlist so swapoff can find it. + */ +static inline void mm_prepare_for_swap_entries(struct mm_struct *mm) +{ + if (list_empty(&mm->mmlist)) { + spin_lock(&mmlist_lock); + if (list_empty(&mm->mmlist)) + list_add(&mm->mmlist, &init_mm.mmlist); + spin_unlock(&mmlist_lock); + } +} + #endif /* __MM_INTERNAL_H */ diff --git a/mm/memory.c b/mm/memory.c index 56be920c56d7..137f34c3fd32 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -953,14 +953,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct m= m_struct *src_mm, if (swap_dup_entry_direct(entry) < 0) return -EIO; =20 - /* make sure dst_mm is on swapoff's mmlist. */ - if (unlikely(list_empty(&dst_mm->mmlist))) { - spin_lock(&mmlist_lock); - if (list_empty(&dst_mm->mmlist)) - list_add(&dst_mm->mmlist, - &src_mm->mmlist); - spin_unlock(&mmlist_lock); - } + mm_prepare_for_swap_entries(dst_mm); /* Mark the swap entry as shared. 
*/ if (pte_swp_exclusive(orig_pte)) { pte =3D pte_swp_clear_exclusive(orig_pte); diff --git a/mm/rmap.c b/mm/rmap.c index 1c77d5dc06e9..b93caabd186f 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -2304,12 +2304,7 @@ static bool try_to_unmap_one(struct folio *folio, st= ruct vm_area_struct *vma, set_pte_at(mm, address, pvmw.pte, pteval); goto walk_abort; } - if (list_empty(&mm->mmlist)) { - spin_lock(&mmlist_lock); - if (list_empty(&mm->mmlist)) - list_add(&mm->mmlist, &init_mm.mmlist); - spin_unlock(&mmlist_lock); - } + mm_prepare_for_swap_entries(mm); dec_mm_counter(mm, MM_ANONPAGES); inc_mm_counter(mm, MM_SWAPENTS); swp_pte =3D swp_entry_to_pte(entry); --=20 2.52.0 From nobody Mon Jun 8 04:24:22 2026 Received: from out-180.mta1.migadu.com (out-180.mta1.migadu.com [95.215.58.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 46F183F4DCC for ; Tue, 2 Jun 2026 14:26:17 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.180 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410378; cv=none; b=YWC9bz+y4csYirzrk0fIBjZ+bxb5uu8EzVeGWTjKNjkGV3ScPBz2q0nF8CP2lXkBKkfHNlED/vqwCJ6JcN/KvALxW+uRKy3/efROIiNvdv/CjbNLQZz/1gdoLrdXsZf7pozUXqHuR9gD3dCnAloOaBaDsiSQMfVDHS0BD3/u7FI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410378; c=relaxed/simple; bh=C88lGAINIFP+1Ky7ku+04w6HD7zfMD64jYRRFDOdK/A=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=fdrsq+ItsEo/q4Ah/v1zGf1Begrvc4alTgNa9thzNK+kMRA+c428DQzAuXw8zUEZKetC01bDDviz0rYtBuad6w75Ejvf7uhff6hI1Ku1W3xaX6Btx/Jsshu36fEWCsecC+H/y8Bha04gHTg+LxBUawO4dyK48wj+0eYGn3kRqnQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=l+oMCYbo; 
arc=none smtp.client-ip=95.215.58.180 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="l+oMCYbo" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1780410375; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=+sVKInhmZvUV0YIB7MuEGVbtRZhztIPVEWbwWdNoYq4=; b=l+oMCYboba/geZJscTiEbt3R5wKJH/gWrBwfaaMUczY5D8oVKSRZbuqBGTt/mYSh7JPepP OWNUKECfDjwTc2H4HfjV/EyE9Wr+XjwxzpeDRUKgue9/utl8oF9Sa0oqFr1XqALh/yz7FZ +RSw55gpt38pM6CfyBfSjdC6n3n2gBg= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: ying.huang@linux.alibaba.com, Baoquan He , willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam R. 
Howlett , ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [v2 03/16] fs/proc: use softleaf_has_pfn() in pagemap PMD walker Date: Tue, 2 Jun 2026 07:24:11 -0700 Message-ID: <20260602142537.198755-4-usama.arif@linux.dev> In-Reply-To: <20260602142537.198755-1-usama.arif@linux.dev> References: <20260602142537.198755-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" pagemap_pmd_range_thp() assumes that every non-present PMD is a migration entry and unconditionally calls softleaf_to_page(). This will crash on any non-present PMD type that does not encode a PFN, such as the upcoming PMD-level swap entries. Guard the page lookup with softleaf_has_pfn(), matching how pte_to_pagemap_entry() already handles non-present PTEs. 
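A minimal userspace sketch of the guard, for illustration only. The toy_* types and the three entry kinds are simplifying assumptions (the real softleaf type set is larger), and toy_mem_map is a hypothetical stand-in for the PFN-to-page mapping; the point is just that the lookup is attempted only when the entry's offset is actually a PFN.

```c
/* Toy model only: toy_* names are assumptions, not kernel APIs. */
#include <stdbool.h>
#include <stddef.h>

enum toy_leaf_type {
	TOY_LEAF_SWAP,			/* offset is a swap offset */
	TOY_LEAF_MIGRATION,		/* offset is a PFN */
	TOY_LEAF_DEVICE_PRIVATE,	/* offset is a PFN */
};

struct toy_leaf {
	enum toy_leaf_type type;
	unsigned long offset;
};

struct toy_page {
	int id;
};

#define TOY_NR_PAGES 4
static struct toy_page toy_mem_map[TOY_NR_PAGES];

/* True only for entry kinds whose offset encodes a PFN. */
static bool toy_leaf_has_pfn(struct toy_leaf entry)
{
	return entry.type == TOY_LEAF_MIGRATION ||
	       entry.type == TOY_LEAF_DEVICE_PRIVATE;
}

/* Only meaningful when toy_leaf_has_pfn(entry) is true. */
static struct toy_page *toy_leaf_to_page(struct toy_leaf entry)
{
	return &toy_mem_map[entry.offset];
}

/* Guarded lookup: entries without a PFN leave the page pointer NULL. */
static struct toy_page *toy_pagemap_lookup(struct toy_leaf entry)
{
	struct toy_page *page = NULL;

	if (toy_leaf_has_pfn(entry))
		page = toy_leaf_to_page(entry);
	return page;
}
```

With the guard in place, a swap-type entry with an arbitrarily large offset never indexes toy_mem_map, while migration entries resolve as before.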
Acked-by: David Hildenbrand (Arm) Signed-off-by: Usama Arif --- fs/proc/task_mmu.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index d32408f7cd5e..1fb5acd88ad0 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -2129,7 +2129,8 @@ static int pagemap_pmd_range_thp(pmd_t *pmdp, unsigne= d long addr, flags |=3D PM_SOFT_DIRTY; if (pmd_swp_uffd_wp(pmd)) flags |=3D PM_UFFD_WP; - page =3D softleaf_to_page(entry); + if (softleaf_has_pfn(entry)) + page =3D softleaf_to_page(entry); } =20 if (page) { --=20 2.52.0 From nobody Mon Jun 8 04:24:22 2026 Received: from out-189.mta0.migadu.com (out-189.mta0.migadu.com [91.218.175.189]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D3D583F0AA8 for ; Tue, 2 Jun 2026 14:26:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.189 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410384; cv=none; b=nm2xF44oHumtkF/AmyabRFZrva1uaIm/rGbNmC6McwXV2QtGuSu9KV3I8lh4PCuybIg014Myuz787tGCUBoQN1p9Y9FU0iLFJ4anZ+WUPPVGQGn1PcUIDIe7+2X9TyBH4ODaAJTUm3uTdGgnyQNFclWz42sZVVFAU55ux70W03I= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410384; c=relaxed/simple; bh=hKT+qTsdPyhdE4sNVeEYAAj2+MvGbcPWZN74/zC1yUA=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=BppBDCDkbeMrV2eAtGrNDE8MUvnvYGICU1ihFfCyy+K6YvGoGIh01hyVk/zKAfN+mMiE2ypJUv/fn3X6RsoxSjGZ1TmAzGoZTVlgrtYI6PMgmv1pN0iYKI0lr2vh2u09zf48Y7LOAS9xC7TYB1pxLdoBD21wRvxwLkkbUd3weaw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=AVOw/ISZ; arc=none smtp.client-ip=91.218.175.189 Authentication-Results: smtp.subspace.kernel.org; 
dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="AVOw/ISZ" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1780410380; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=pGQmOJINkpIXV0ZvH+Jm+oHlVpPUGDMIu+1lnED7ios=; b=AVOw/ISZBXBVizlZbXDxSwTDf5thD0B0csl6a/y8G+wGupAGChJXkHLIGLsqpfvmoTOG+8 Q3FV+EKJhJemeJmLl/oGWv99F4ur3FR3ki4VDol7GJazVUxtrOB3GhUn6HIcNDFrBL1j8M Z2+/d+F9J03eBO8eXdsx7Fdn62LZij4= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: ying.huang@linux.alibaba.com, Baoquan He , willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam R. 
Howlett , ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [v2 04/16] mm/huge_memory: move softleaf_to_folio() inside migration branch Date: Tue, 2 Jun 2026 07:24:12 -0700 Message-ID: <20260602142537.198755-5-usama.arif@linux.dev> In-Reply-To: <20260602142537.198755-1-usama.arif@linux.dev> References: <20260602142537.198755-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" change_non_present_huge_pmd() calls softleaf_to_folio() unconditionally at the top of the function. softleaf_to_folio() extracts a PFN from the entry and converts it to a folio pointer, which is only meaningful for migration and device_private entries that encode a real PFN. A swap entry encodes a swap offset instead, so softleaf_to_folio() would produce a bogus pointer and crash on mprotect() when a PMD swap entry is present. Move the call into the migration_write branch where the folio is actually used, so the function is safe for any non-present PMD type. 
Acked-by: David Hildenbrand (Arm) Reviewed-by: Dev Jain Signed-off-by: Usama Arif --- mm/huge_memory.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 15913a37b6df..b7b76eef6617 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2558,11 +2558,12 @@ static void change_non_present_huge_pmd(struct mm_s= truct *mm, bool uffd_wp_resolve) { softleaf_t entry =3D softleaf_from_pmd(*pmd); - const struct folio *folio =3D softleaf_to_folio(entry); pmd_t newpmd; =20 VM_WARN_ON(!pmd_is_valid_softleaf(*pmd)); if (softleaf_is_migration_write(entry)) { + const struct folio *folio =3D softleaf_to_folio(entry); + /* * A protection check is difficult so * just be safe and disable write --=20 2.52.0 From nobody Mon Jun 8 04:24:22 2026 Received: from out-178.mta1.migadu.com (out-178.mta1.migadu.com [95.215.58.178]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3D8323EB7FF for ; Tue, 2 Jun 2026 14:26:28 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.178 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410389; cv=none; b=iZlt+uRsKNANhVR9vp0nn9tJ/B2TVfqslwf7bkZGmV93FzHMUZ64AbCXv1DZMiWF7arxBvj6X4NR/TBQVI3OvzUiXNsP/AqPKV26LdbRWD3cti/OGDIWpELhzoeaCpmeIV6Fuz1E+2NE/60rWHAbUb3TiL1Pgit8MqUzyF6brbI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410389; c=relaxed/simple; bh=SLVh7Y6I2+dhFNn4Z2lYGwskxfVtdSRzkKsSoKml3Y0=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=nK/AuNirjrjq4T41pbunl/RQmCihIHNu82XZ22JDIB1vuWxM/3G2dPkrWEVF8gaTL/BC9dsHUvUF8vE4QVjUCa6F4L+DMt+xSLBydO4cpdKv4EGMeQXqwpDu2jNzEdNaimSzSyyOUGOzqHILS2HuVQaHL3Lc9Iq2coWFLOHsz2s= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; 
dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=g0La4iAm; arc=none smtp.client-ip=95.215.58.178 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="g0La4iAm" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1780410386; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=tos+vodqa+d+iP0mvZTDizOcd3cdK8I9xTBHz+4UXaU=; b=g0La4iAmfWib39OKPUOtDJcNLL1k3rDzma1A8q2qoMlQ+k3Qzf/Dv2GDNu/drFM9MW224G LBJUnof6pTM8YDGi3EenpPZaLKxD19VFvPPII+Uhln4Q6kv4CxzuRfAvJG7IDGBqlb8+SO nFq+PD3l4sNSmn4wY7TJc/2/zSK++R4= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: ying.huang@linux.alibaba.com, Baoquan He , willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam R. 
Howlett , ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [v2 05/16] mm/migrate_device: move softleaf_to_folio() inside device-private branch Date: Tue, 2 Jun 2026 07:24:13 -0700 Message-ID: <20260602142537.198755-6-usama.arif@linux.dev> In-Reply-To: <20260602142537.198755-1-usama.arif@linux.dev> References: <20260602142537.198755-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" migrate_vma_collect_pmd() calls softleaf_to_folio() on a non-present PMD before checking the entry's type. softleaf_to_folio() converts the entry's offset to a PFN, which is only meaningful for migration or device-private entries. A PMD swap entry's offset is a swap offset, not a PFN, so the lookup would either return a bogus folio pointer or trip pfn_to_page validation on a debug kernel. In the non-device-private path the returned folio is then unused (the OR short-circuits to migrate_vma_collect_skip()), but the lookup itself is already unsafe. Move the softleaf_to_folio() call inside the device-private branch where the folio is actually needed, mirroring the equivalent change_non_present_huge_pmd() fix. 
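The reordering can be sketched in userspace C as follows. This is an illustration under stated assumptions, not the kernel code: toy_entry, toy_folio and toy_lookups are hypothetical, and the lookup counter exists only to make visible that the skip paths no longer touch the folio table.

```c
/* Toy model only: toy_* names are assumptions, not kernel APIs. */
#include <stdbool.h>
#include <stddef.h>

struct toy_folio {
	const void *owner;
};

struct toy_entry {
	bool device_private;	/* true: offset is an index into toy_folios */
	unsigned long offset;	/* otherwise: an unrelated (swap) offset */
};

#define TOY_NR_FOLIOS 2
static struct toy_folio toy_folios[TOY_NR_FOLIOS];
static int toy_lookups;	/* counts folio lookups, for demonstration */

/* Only valid for device-private entries. */
static struct toy_folio *toy_entry_to_folio(struct toy_entry entry)
{
	toy_lookups++;
	return &toy_folios[entry.offset];
}

/*
 * Returns true when the entry should be collected, false when it
 * should be skipped. The type and flag checks run first, so the
 * folio lookup happens only for device-private entries.
 */
static bool toy_should_collect(struct toy_entry entry,
			       bool select_device_private,
			       const void *owner)
{
	if (!entry.device_private || !select_device_private)
		return false;

	return toy_entry_to_folio(entry)->owner == owner;
}
```

Splitting the single compound condition into an early return plus a later owner check costs one extra branch but keeps the unsafe lookup off every path that does not need it.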
Signed-off-by: Usama Arif --- mm/migrate_device.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/mm/migrate_device.c b/mm/migrate_device.c index ab93a8d11b70..87f079b64265 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -166,11 +166,14 @@ static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, = unsigned long start, } else if (!pmd_present(*pmdp)) { const softleaf_t entry =3D softleaf_from_pmd(*pmdp); =20 - folio =3D softleaf_to_folio(entry); - if (!softleaf_is_device_private(entry) || - !(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) || - (folio->pgmap->owner !=3D migrate->pgmap_owner)) { + !(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE)) { + spin_unlock(ptl); + return migrate_vma_collect_skip(start, end, walk); + } + + folio =3D softleaf_to_folio(entry); + if (folio->pgmap->owner !=3D migrate->pgmap_owner) { spin_unlock(ptl); return migrate_vma_collect_skip(start, end, walk); } --=20 2.52.0 From nobody Mon Jun 8 04:24:22 2026 Received: from out-177.mta1.migadu.com (out-177.mta1.migadu.com [95.215.58.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5F3B43EB7FF for ; Tue, 2 Jun 2026 14:26:34 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.177 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410396; cv=none; b=Gx0EeQtjbssPr5gkTS0aXDKlb93xXpu1pseCdIY9/XRW6N8dQac7noNsKKaDop9K8DLK/+x85ir2gTKmVG8eiZYb7LRa6qasyPMjBrJjb9qIpihcCmDLa4TNmHv7qJ7y6M0oZwDKR+5HhhTx1JBBPuaX6otC5hHei3qi3Uk/GQY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410396; c=relaxed/simple; bh=PpHp3J3L1HGoNjC4ZY9idfFlwtxR4b504F8ZfgJuEKU=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; 
b=sUbl5XWLdwwGkCEdH5moTQVJAW+ef+/zqnixKC7r9shfI0QkKq6DCYBoa16yTQf6I6J3Nlko73PAr8fh6JCYVZCoDUedIYp+d684mKiSp/XVGC33kNp4FZGjQVypV4Y9pC7iwrfqG/WEhndLo5Gv17qDtb+BMdvfBpjtKRN/4Ow= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=egJNq3Bg; arc=none smtp.client-ip=95.215.58.177 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="egJNq3Bg" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1780410392; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=wE4UcwvgayUPLuZnwTsVe5z2OmDVFbZwS/rfHWRIGmo=; b=egJNq3BgFV7jDRcrUZ50PXnEBGS4oU9415lyf5cnWPp0RWPjuK4D+tBKxwjhNJgGke37yt zYRUwL/7oaXgd19HwgnKm1vzJYcexVtT7F9iOEs4vNeAzMM4W69a3ZqiUAfjUSsIYOHtlX 1CXlYomMoqTGZ1IaspSwHEuvw1f+uWI= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: ying.huang@linux.alibaba.com, Baoquan He , willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam R. 
Howlett , ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [v2 06/16] mm: rename ARCH_ENABLE_THP_MIGRATION to ARCH_SUPPORTS_PMD_SOFTLEAF Date: Tue, 2 Jun 2026 07:24:14 -0700 Message-ID: <20260602142537.198755-7-usama.arif@linux.dev> In-Reply-To: <20260602142537.198755-1-usama.arif@linux.dev> References: <20260602142537.198755-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" CONFIG_ARCH_ENABLE_THP_MIGRATION started life gating just PMD-level migration entries, but has grown to gate the entire PMD-level softleaf machinery: migration entries, device-private entries, and soon swap entries. Rename CONFIG_ARCH_ENABLE_THP_MIGRATION to CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF to make this clear. This is a pure rename: the set of selecting architectures (x86, arm64, s390, riscv, loongarch, and powerpc on PPC_BOOK3S_64) and the gating semantics are unchanged. No functional change intended. 
Signed-off-by: Usama Arif --- arch/arm64/Kconfig | 2 +- arch/arm64/include/asm/pgtable.h | 4 ++-- arch/loongarch/Kconfig | 2 +- arch/powerpc/include/asm/book3s/64/pgtable.h | 2 +- arch/powerpc/platforms/Kconfig.cputype | 2 +- arch/riscv/Kconfig | 2 +- arch/riscv/include/asm/pgtable.h | 8 ++++---- arch/s390/Kconfig | 2 +- arch/s390/include/asm/pgtable.h | 2 +- arch/x86/Kconfig | 2 +- arch/x86/include/asm/pgtable.h | 2 +- include/linux/huge_mm.h | 2 +- include/linux/leafops.h | 8 ++++---- include/linux/pgtable.h | 2 +- include/linux/swapops.h | 6 +++--- mm/Kconfig | 2 +- mm/debug_vm_pgtable.c | 8 ++++---- mm/hmm.c | 4 ++-- mm/huge_memory.c | 2 +- mm/migrate.c | 4 ++-- mm/migrate_device.c | 6 +++--- mm/rmap.c | 2 +- 22 files changed, 38 insertions(+), 38 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index fe60738e5943..c6da904b0339 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -17,7 +17,7 @@ config ARM64 select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION select ARCH_ENABLE_MEMORY_HOTPLUG select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2 - select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE + select ARCH_SUPPORTS_PMD_SOFTLEAF if TRANSPARENT_HUGEPAGE select ARCH_HAS_CACHE_LINE_SIZE select ARCH_HAS_CC_PLATFORM select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgta= ble.h index 4dfa42b7d053..623099303c7b 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -1534,10 +1534,10 @@ static inline pmd_t pmdp_establish(struct vm_area_s= truct *vma, #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(swp) ((pte_t) { (swp).val }) =20 -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION +#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF #define __pmd_to_swp_entry(pmd) ((swp_entry_t) { pmd_val(pmd) }) #define __swp_entry_to_pmd(swp) __pmd((swp).val) -#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */ 
+#endif /* CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */ =20 /* * Ensure that there are not more swap files than can be encoded in the ke= rnel diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig index 606597da46b8..20ea972a876c 100644 --- a/arch/loongarch/Kconfig +++ b/arch/loongarch/Kconfig @@ -12,7 +12,7 @@ config LOONGARCH select ARCH_NEEDS_DEFER_KASAN select ARCH_DISABLE_KASAN_INLINE select ARCH_ENABLE_MEMORY_HOTPLUG - select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE + select ARCH_SUPPORTS_PMD_SOFTLEAF if TRANSPARENT_HUGEPAGE select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI select ARCH_HAS_CPU_FINALIZE_INIT select ARCH_HAS_CURRENT_STACK_POINTER diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/in= clude/asm/book3s/64/pgtable.h index e67e64ac6e8c..6f30aa8a6490 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -1060,7 +1060,7 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd) #define pmd_mksoft_dirty(pmd) pte_pmd(pte_mksoft_dirty(pmd_pte(pmd))) #define pmd_clear_soft_dirty(pmd) pte_pmd(pte_clear_soft_dirty(pmd_pte(pmd= ))) =20 -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION +#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF #define pmd_swp_mksoft_dirty(pmd) pte_pmd(pte_swp_mksoft_dirty(pmd_pte(pmd= ))) #define pmd_swp_soft_dirty(pmd) pte_swp_soft_dirty(pmd_pte(pmd)) #define pmd_swp_clear_soft_dirty(pmd) pte_pmd(pte_swp_clear_soft_dirty(pmd= _pte(pmd))) diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platform= s/Kconfig.cputype index bac02c83bb3e..4a0fa681bf98 100644 --- a/arch/powerpc/platforms/Kconfig.cputype +++ b/arch/powerpc/platforms/Kconfig.cputype @@ -112,7 +112,7 @@ config PPC_THP depends on PPC_RADIX_MMU || (PPC_64S_HASH_MMU && PAGE_SIZE_64KB) select HAVE_ARCH_TRANSPARENT_HUGEPAGE select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD - select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE + select ARCH_SUPPORTS_PMD_SOFTLEAF if TRANSPARENT_HUGEPAGE =20 choice prompt "CPU 
selection" diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index c5754942cf85..de463524dab1 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -22,7 +22,7 @@ config RISCV select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM_VMEMMAP select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2 - select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE + select ARCH_SUPPORTS_PMD_SOFTLEAF if TRANSPARENT_HUGEPAGE select ARCH_HAS_BINFMT_FLAT select ARCH_HAS_CURRENT_STACK_POINTER select ARCH_HAS_DEBUG_VIRTUAL if MMU diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgta= ble.h index a1a7c6520a09..52cfd7df228b 100644 --- a/arch/riscv/include/asm/pgtable.h +++ b/arch/riscv/include/asm/pgtable.h @@ -936,7 +936,7 @@ static inline pmd_t pmd_clear_soft_dirty(pmd_t pmd) return pte_pmd(pte_clear_soft_dirty(pmd_pte(pmd))); } =20 -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION +#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF static inline bool pmd_swp_soft_dirty(pmd_t pmd) { return pte_swp_soft_dirty(pmd_pte(pmd)); @@ -951,7 +951,7 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd) { return pte_pmd(pte_swp_clear_soft_dirty(pmd_pte(pmd))); } -#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */ +#endif /* CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */ #endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */ =20 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr, @@ -1198,10 +1198,10 @@ static inline pte_t pte_swp_clear_exclusive(pte_t p= te) return __pte(pte_val(pte) & ~_PAGE_SWP_EXCLUSIVE); } =20 -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION +#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF #define __pmd_to_swp_entry(pmd) ((swp_entry_t) { pmd_val(pmd) }) #define __swp_entry_to_pmd(swp) __pmd((swp).val) -#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */ +#endif /* CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */ =20 /* * In the RV64 Linux scheme, we give the user half of the virtual-address = space diff --git a/arch/s390/Kconfig 
b/arch/s390/Kconfig index ecbcbb781e40..046866a0b44d 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -85,7 +85,7 @@ config S390 select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2 - select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE + select ARCH_SUPPORTS_PMD_SOFTLEAF if TRANSPARENT_HUGEPAGE select ARCH_HAS_CC_CAN_LINK select ARCH_HAS_CPU_FINALIZE_INIT select ARCH_HAS_CURRENT_STACK_POINTER diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtabl= e.h index 2c6cee8241e0..83d4516825f0 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -903,7 +903,7 @@ static inline pmd_t pmd_clear_soft_dirty(pmd_t pmd) return clear_pmd_bit(pmd, __pgprot(_SEGMENT_ENTRY_SOFT_DIRTY)); } =20 -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION +#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF #define pmd_swp_soft_dirty(pmd) pmd_soft_dirty(pmd) #define pmd_swp_mksoft_dirty(pmd) pmd_mksoft_dirty(pmd) #define pmd_swp_clear_soft_dirty(pmd) pmd_clear_soft_dirty(pmd) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index f3f7cb01d69d..33c6920555b1 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -70,7 +70,7 @@ config X86 select ARCH_ENABLE_HUGEPAGE_MIGRATION if X86_64 && HUGETLB_PAGE && MIGRAT= ION select ARCH_ENABLE_MEMORY_HOTPLUG if X86_64 select ARCH_ENABLE_SPLIT_PMD_PTLOCK if (PGTABLE_LEVELS > 2) && (X86_64 ||= X86_PAE) - select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE + select ARCH_SUPPORTS_PMD_SOFTLEAF if X86_64 && TRANSPARENT_HUGEPAGE select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI select ARCH_HAS_CPU_ATTACK_VECTORS if CPU_MITIGATIONS select ARCH_HAS_CACHE_LINE_SIZE diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 2187e9cfcefa..6efc7980c95a 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -1533,7 +1533,7 @@ static inline pte_t 
pte_swp_clear_soft_dirty(pte_t pt= e) return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY); } =20 -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION +#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd) { return pmd_set_flags(pmd, _PAGE_SWP_SOFT_DIRTY); diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index ad20f7f8c179..1487bf4af1a7 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -567,7 +567,7 @@ static inline struct folio *get_persistent_huge_zero_fo= lio(void) =20 static inline bool thp_migration_supported(void) { - return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION); + return IS_ENABLED(CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF); } =20 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addre= ss, diff --git a/include/linux/leafops.h b/include/linux/leafops.h index 803d312437df..88888daeb018 100644 --- a/include/linux/leafops.h +++ b/include/linux/leafops.h @@ -81,7 +81,7 @@ static inline pte_t softleaf_to_pte(softleaf_t entry) return swp_entry_to_pte(entry); } =20 -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION +#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF /** * softleaf_from_pmd() - Obtain a leaf entry from a PMD entry. * @pmd: PMD entry. 
@@ -587,7 +587,7 @@ static inline bool pte_is_uffd_marker(pte_t pte) return false; } =20 -#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATIO= N) +#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_SUPPORTS_PMD_SOFTLE= AF) =20 /** * pmd_is_device_private_entry() - Check if PMD contains a device private = swap @@ -606,14 +606,14 @@ static inline bool pmd_is_device_private_entry(pmd_t = pmd) return softleaf_is_device_private(softleaf_from_pmd(pmd)); } =20 -#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */ +#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */ =20 static inline bool pmd_is_device_private_entry(pmd_t pmd) { return false; } =20 -#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */ +#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */ =20 /** * pmd_is_migration_entry() - Does this PMD entry encode a migration entry? diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index cdd68ed3ae1a..5ee80194e052 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1781,7 +1781,7 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot= , pgprot_t newprot) #endif =20 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY -#ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION +#ifndef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd) { return pmd; diff --git a/include/linux/swapops.h b/include/linux/swapops.h index 8cfc966eae48..705a84154d28 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -321,7 +321,7 @@ static inline swp_entry_t make_guard_swp_entry(void) =20 struct page_vma_mapped_walk; =20 -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION +#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF extern int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw, struct page *page); =20 @@ -338,7 +338,7 @@ static inline pmd_t swp_entry_to_pmd(swp_entry_t entry) return __swp_entry_to_pmd(arch_entry); } =20 -#else /* CONFIG_ARCH_ENABLE_THP_MIGRATION 
*/ +#else /* CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */ static inline int set_pmd_migration_entry(struct page_vma_mapped_walk *pvm= w, struct page *page) { @@ -358,7 +358,7 @@ static inline pmd_t swp_entry_to_pmd(swp_entry_t entry) return __pmd(0); } =20 -#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */ +#endif /* CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */ =20 #endif /* CONFIG_MMU */ #endif /* _LINUX_SWAPOPS_H */ diff --git a/mm/Kconfig b/mm/Kconfig index 776b67c66e82..3a3bbe000f85 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -650,7 +650,7 @@ config DEVICE_MIGRATION config ARCH_ENABLE_HUGEPAGE_MIGRATION bool =20 -config ARCH_ENABLE_THP_MIGRATION +config ARCH_SUPPORTS_PMD_SOFTLEAF bool =20 config HUGETLB_PAGE_SIZE_VARIABLE diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 18411fb09aab..507fbd1ae7e5 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -751,7 +751,7 @@ static void __init pmd_leaf_soft_dirty_tests(struct pgt= able_debug_args *args) pmd_t pmd; =20 if (!pgtable_supports_soft_dirty() || - !IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION)) + !IS_ENABLED(CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF)) return; =20 if (!has_transparent_hugepage()) @@ -819,7 +819,7 @@ static void __init pte_swap_tests(struct pgtable_debug_= args *args) WARN_ON(memcmp(&pte1, &pte2, sizeof(pte1))); } =20 -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION +#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF static void __init pmd_softleaf_tests(struct pgtable_debug_args *args) { swp_entry_t arch_entry; @@ -837,9 +837,9 @@ static void __init pmd_softleaf_tests(struct pgtable_de= bug_args *args) pmd2 =3D __swp_entry_to_pmd(arch_entry); WARN_ON(memcmp(&pmd1, &pmd2, sizeof(pmd1))); } -#else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */ +#else /* !CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */ static void __init pmd_softleaf_tests(struct pgtable_debug_args *args) { } -#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */ +#endif /* CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */ =20 static void __init swap_migration_tests(struct 
pgtable_debug_args *args) { diff --git a/mm/hmm.c b/mm/hmm.c index 5955f2f0c83d..cabf111f2ed2 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -331,7 +331,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, uns= igned long addr, return hmm_vma_fault(addr, end, required_fault, walk); } =20 -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION +#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF static int hmm_vma_handle_absent_pmd(struct mm_walk *walk, unsigned long s= tart, unsigned long end, unsigned long *hmm_pfns, pmd_t pmd) @@ -391,7 +391,7 @@ static int hmm_vma_handle_absent_pmd(struct mm_walk *wa= lk, unsigned long start, return -EFAULT; return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); } -#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */ +#endif /* CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */ =20 static int hmm_vma_walk_pmd(pmd_t *pmdp, unsigned long start, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index b7b76eef6617..af6a9c20131a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -4861,7 +4861,7 @@ static int __init split_huge_pages_debugfs(void) late_initcall(split_huge_pages_debugfs); #endif =20 -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION +#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw, struct page *page) { diff --git a/mm/migrate.c b/mm/migrate.c index d9b23909d716..6f6518960882 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -362,7 +362,7 @@ static bool remove_migration_pte(struct folio *folio, idx =3D linear_page_index(vma, pvmw.address) - pvmw.pgoff; new =3D folio_page(folio, idx); =20 -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION +#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF /* PMD-mapped THP migration entry */ if (!pvmw.pte) { VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) || @@ -545,7 +545,7 @@ void migration_entry_wait_huge(struct vm_area_struct *v= ma, unsigned long addr, p } #endif =20 -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION +#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t 
*pmd) { spinlock_t *ptl; diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 87f079b64265..af336bcedeb3 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -771,7 +771,7 @@ int migrate_vma_setup(struct migrate_vma *args) } EXPORT_SYMBOL(migrate_vma_setup); =20 -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION +#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF /** * migrate_vma_insert_huge_pmd_page: Insert a huge folio into @migrate->vm= a->vm_mm * at @addr. folio is already allocated as a part of the migration process= with @@ -926,7 +926,7 @@ static int migrate_vma_split_unmapped_folio(struct migr= ate_vma *migrate, migrate->src[i+idx] =3D migrate_pfn(pfn + i) | flags; return ret; } -#else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */ +#else /* !CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF */ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate, unsigned long addr, struct page *page, @@ -947,7 +947,7 @@ static int migrate_vma_split_unmapped_folio(struct migr= ate_vma *migrate, static unsigned long migrate_vma_nr_pages(unsigned long *src) { unsigned long nr =3D 1; -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION +#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF if (*src & MIGRATE_PFN_COMPOUND) nr =3D HPAGE_PMD_NR; #else diff --git a/mm/rmap.c b/mm/rmap.c index b93caabd186f..0fb7a1b82cf3 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -2472,7 +2472,7 @@ static bool try_to_migrate_one(struct folio *folio, s= truct vm_area_struct *vma, page_vma_mapped_walk_restart(&pvmw); continue; } -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION +#ifdef CONFIG_ARCH_SUPPORTS_PMD_SOFTLEAF pmdval =3D pmdp_get(pvmw.pmd); if (likely(pmd_present(pmdval))) pfn =3D pmd_pfn(pmdval); --=20 2.52.0 From nobody Mon Jun 8 04:24:22 2026 Received: from out-177.mta0.migadu.com (out-177.mta0.migadu.com [91.218.175.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 03DCD3FBEC5 for ; Tue, 2 Jun 2026 14:26:38 +0000 
(UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.177 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410400; cv=none; b=FYeBFQxjARtQUoEmc/yz6xEBzL4k/q32esLucJq9ZkuLbgUSp1u8Jso8p+6ilqZ0fRLfT9Iq/HqMVUp0DOEUj+LJShH/3ULAb3xpiQimHKC0qzaQOwED6aX5E7hKDEwvM8NPBpw02tOhOEMP4JtSMbekeg9YnfSFSlV/cErph5c= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410400; c=relaxed/simple; bh=wT8mDDoDLRCic9Uyb9/vi+tomRknlADC6TfhWSKcPUE=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=ZX/aunp8ah1Bjq8OK1wATaOxPf6kAip2t8MBLPMkinA54oP19Y9SO3MUQ4X8xUvVLCq5cM9HFu6Jm4lD2XkjETvOzxRZV5NHRnSbtvPfFunGS9uIoXyE/ijd1IXyoSPmhq1xHgvkVNlC+PrPReV4pQvfDzlnmeYHOXcrdNHF9J4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=TuHfr6OU; arc=none smtp.client-ip=91.218.175.177 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="TuHfr6OU" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1780410396; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=X7aHPix5o7hPRGrukSr5icc3CKnHLskakBYo1SfIj/o=; b=TuHfr6OU8gFJKAe4vFMHGbrkEJ6utK1yuBKmdwHgIjDgJ8LOAQtaOxP6ZRTpZQXWztVow6 bPHs1QPNo55itfkXN6VyFXjBU0wZvmFpBQ40gLDpQeR7jITZcdd+9lT7Cc8W1a6WTY0jRg FZtQnAblpxUXsmk7zDO2+P/aW7vk3kQ= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: ying.huang@linux.alibaba.com, Baoquan He , willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam R. Howlett , ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [v2 07/16] mm: add PMD swap entry detection support Date: Tue, 2 Jun 2026 07:24:15 -0700 Message-ID: <20260602142537.198755-8-usama.arif@linux.dev> In-Reply-To: <20260602142537.198755-1-usama.arif@linux.dev> References: <20260602142537.198755-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" Currently when a PMD-mapped THP is swapped out, the PMD is always split into 512 PTE-level swap entries. To preserve huge page information across swap cycles, later patches will install a single PMD-level swap entry instead. This patch adds the infrastructure to detect those entries. 
Teach the softleaf layer to recognise PMD swap entries: pmd_is_swap_entry() detects them and softleaf_is_valid_pmd_entry() accepts them as a valid non-present type. Clear the exclusive overlay bit in softleaf_from_pmd() before decoding, matching how soft_dirty and uffd_wp bits are already stripped. Add pmd_swp_mkexclusive(), pmd_swp_exclusive(), and pmd_swp_clear_exclusive() helpers to each architecture that supports THP migration (x86, arm64, s390, riscv, loongarch, powerpc), mirroring the existing PTE swap exclusive helpers in each arch's pgtable.h. Signed-off-by: Usama Arif --- arch/arm64/include/asm/pgtable.h | 4 ++++ arch/loongarch/include/asm/pgtable.h | 17 ++++++++++++++ arch/powerpc/include/asm/book3s/64/pgtable.h | 15 ++++++++++++ arch/riscv/include/asm/pgtable.h | 15 ++++++++++++ arch/s390/include/asm/pgtable.h | 15 ++++++++++++ arch/x86/include/asm/pgtable.h | 15 ++++++++++++ include/linux/leafops.h | 24 ++++++++++++++++---- 7 files changed, 100 insertions(+), 5 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgta= ble.h index 623099303c7b..2f0d95ce341d 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -601,6 +601,10 @@ static inline int pmd_protnone(pmd_t pmd) #define pmd_swp_clear_uffd_wp(pmd) \ pte_pmd(pte_swp_clear_uffd_wp(pmd_pte(pmd))) #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */ +#define pmd_swp_exclusive(pmd) pte_swp_exclusive(pmd_pte(pmd)) +#define pmd_swp_mkexclusive(pmd) pte_pmd(pte_swp_mkexclusive(pmd_pte(pmd))) +#define pmd_swp_clear_exclusive(pmd) \ + pte_pmd(pte_swp_clear_exclusive(pmd_pte(pmd))) =20 #define pmd_write(pmd) pte_write(pmd_pte(pmd)) =20 diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/= asm/pgtable.h index 2a0b63ae421f..33bdfa1e8bbb 100644 --- a/arch/loongarch/include/asm/pgtable.h +++ b/arch/loongarch/include/asm/pgtable.h @@ -357,6 +357,23 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte) return pte; } =20 +static 
inline pmd_t pmd_swp_mkexclusive(pmd_t pmd) +{ + pmd_val(pmd) |=3D _PAGE_SWP_EXCLUSIVE; + return pmd; +} + +static inline bool pmd_swp_exclusive(pmd_t pmd) +{ + return pmd_val(pmd) & _PAGE_SWP_EXCLUSIVE; +} + +static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd) +{ + pmd_val(pmd) &=3D ~_PAGE_SWP_EXCLUSIVE; + return pmd; +} + #define pte_none(pte) (!(pte_val(pte) & ~_PAGE_GLOBAL)) #define pte_present(pte) (pte_val(pte) & (_PAGE_PRESENT | _PAGE_PROTNONE)) #define pte_no_exec(pte) (pte_val(pte) & _PAGE_NO_EXEC) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/in= clude/asm/book3s/64/pgtable.h index 6f30aa8a6490..e8467ea4f4de 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -699,6 +699,21 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte) return __pte_raw(pte_raw(pte) & cpu_to_be64(~_PAGE_SWP_EXCLUSIVE)); } =20 +static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd) +{ + return __pmd_raw(pmd_raw(pmd) | cpu_to_be64(_PAGE_SWP_EXCLUSIVE)); +} + +static inline bool pmd_swp_exclusive(pmd_t pmd) +{ + return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_SWP_EXCLUSIVE)); +} + +static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd) +{ + return __pmd_raw(pmd_raw(pmd) & cpu_to_be64(~_PAGE_SWP_EXCLUSIVE)); +} + static inline bool check_pte_access(unsigned long access, unsigned long pt= ev) { /* diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgta= ble.h index 52cfd7df228b..0717b514a615 100644 --- a/arch/riscv/include/asm/pgtable.h +++ b/arch/riscv/include/asm/pgtable.h @@ -920,6 +920,21 @@ static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd) } #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */ =20 +static inline bool pmd_swp_exclusive(pmd_t pmd) +{ + return pte_swp_exclusive(pmd_pte(pmd)); +} + +static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd) +{ + return pte_pmd(pte_swp_mkexclusive(pmd_pte(pmd))); +} + +static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd) +{ + return 
pte_pmd(pte_swp_clear_exclusive(pmd_pte(pmd))); +} + #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY static inline bool pmd_soft_dirty(pmd_t pmd) { diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtabl= e.h index 83d4516825f0..88f2465fc482 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -870,6 +870,21 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte) return clear_pte_bit(pte, __pgprot(_PAGE_SWP_EXCLUSIVE)); } =20 +static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd) +{ + return set_pmd_bit(pmd, __pgprot(_PAGE_SWP_EXCLUSIVE)); +} + +static inline bool pmd_swp_exclusive(pmd_t pmd) +{ + return pmd_val(pmd) & _PAGE_SWP_EXCLUSIVE; +} + +static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd) +{ + return clear_pmd_bit(pmd, __pgprot(_PAGE_SWP_EXCLUSIVE)); +} + static inline int pte_soft_dirty(pte_t pte) { return pte_val(pte) & _PAGE_SOFT_DIRTY; diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 6efc7980c95a..c5c273bfcd04 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -1517,6 +1517,21 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pt= e) return pte_clear_flags(pte, _PAGE_SWP_EXCLUSIVE); } =20 +static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd) +{ + return pmd_set_flags(pmd, _PAGE_SWP_EXCLUSIVE); +} + +static inline int pmd_swp_exclusive(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_SWP_EXCLUSIVE; +} + +static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd) +{ + return pmd_clear_flags(pmd, _PAGE_SWP_EXCLUSIVE); +} + #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY static inline pte_t pte_swp_mksoft_dirty(pte_t pte) { diff --git a/include/linux/leafops.h b/include/linux/leafops.h index 88888daeb018..988e59c6fa8a 100644 --- a/include/linux/leafops.h +++ b/include/linux/leafops.h @@ -102,6 +102,8 @@ static inline softleaf_t softleaf_from_pmd(pmd_t pmd) pmd =3D pmd_swp_clear_soft_dirty(pmd); if (pmd_swp_uffd_wp(pmd)) pmd =3D pmd_swp_clear_uffd_wp(pmd); + if 
(pmd_swp_exclusive(pmd)) + pmd =3D pmd_swp_clear_exclusive(pmd); arch_entry =3D __pmd_to_swp_entry(pmd); =20 /* Temporary until swp_entry_t eliminated. */ @@ -634,18 +636,30 @@ static inline bool pmd_is_migration_entry(pmd_t pmd) */ static inline bool softleaf_is_valid_pmd_entry(softleaf_t entry) { - /* Only device private, migration entries valid for PMD. */ + /* Device private, migration, and swap entries valid for PMD. */ return softleaf_is_device_private(entry) || - softleaf_is_migration(entry); + softleaf_is_migration(entry) || + softleaf_is_swap(entry); +} + +/** + * pmd_is_swap_entry() - Does this PMD entry encode an actual swap entry? + * @pmd: PMD entry. + * + * Returns: true if the PMD encodes a swap entry, otherwise false. + */ +static inline bool pmd_is_swap_entry(pmd_t pmd) +{ + return softleaf_is_swap(softleaf_from_pmd(pmd)); } =20 /** * pmd_is_valid_softleaf() - Is this PMD entry a valid softleaf entry? * @pmd: PMD entry. * - * PMD leaf entries are valid only if they are device private or migration - * entries. This function asserts that a PMD leaf entry is valid in this - * respect. + * PMD leaf entries are valid only if they are device private, migration, + * or swap entries. This function asserts that a PMD leaf entry is valid + * in this respect. * * Returns: true if the PMD entry is a valid leaf entry, otherwise false. 
*/ --=20 2.52.0 From nobody Mon Jun 8 04:24:22 2026 Received: from out-185.mta0.migadu.com (out-185.mta0.migadu.com [91.218.175.185]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 616C03F5BD3 for ; Tue, 2 Jun 2026 14:26:44 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.185 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410405; cv=none; b=qgPb+L9iM55rwl9WmqYH1KuJ6DQS4V7GDgfO4nToeWGdES6l/INQux9l6GzO6wYirIjvRGYaK1DuM4YVqKp6Mr6DyaFtkQJc8Wr8C+kHVTrCame5tLmoVul2Co00NscZs8Gg2iNT0o7/3fdNYJCE3PvF9DeFyOmUv4gxvFoaZIE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410405; c=relaxed/simple; bh=g8fXFVvJ0sBwhRN4cmWnUh3DT3FnZaQP4kruSTH5CgM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=tA1wxHxk9I405xZmS/UCE7urUudRMvNuZazLoJGSkHMYSfhv9rGuHbuwgQYF+shx8Si0K6P9mB8UNgjx5FmS9yeljoev+uD/3xa4k5Pe6PLL1OsxUKIslfotyBFqmAKzVhMZS44fA5MSW/1Epfx7rtKDl84TWcS3pr1mnEcdLGk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=OSwv9tii; arc=none smtp.client-ip=91.218.175.185 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="OSwv9tii" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1780410402; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=UoRYKXRwctRxeiN8IuNvbapEIDidQ7fOG8OMVHSTk9c=; b=OSwv9tiieOZAw8uPxayyWCtZQKicqUWhHhLedJmLmiqEIUkciBN9wm/QHQLU2gM1YTlgXj wp8GvYuT9Qea5XT4j9jwxI2neJ4zW4n9wSIlUKt3ZAnCdTnd4tw78aJn1zz8tm619kr4QB Upq9B7LTov4H9OqWzUrYWb1BDzd5nBM= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: ying.huang@linux.alibaba.com, Baoquan He , willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam R. Howlett , ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [v2 08/16] mm: add PMD swap entry splitting support Date: Tue, 2 Jun 2026 07:24:16 -0700 Message-ID: <20260602142537.198755-9-usama.arif@linux.dev> In-Reply-To: <20260602142537.198755-1-usama.arif@linux.dev> References: <20260602142537.198755-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" Add a swap branch in __split_huge_pmd_locked() that splits a PMD swap entry into 512 PTE swap entries. Unlike migration splits, no folio reference is needed because swap entries point to swap slots, not pages. Each PTE inherits the correct sub-slot offset and preserves soft_dirty, uffd_wp, and exclusive flags. 
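For illustration only, the split arithmetic described above can be modelled in userspace roughly as follows. This is a minimal sketch, not the kernel code: the toy_* names, the bit layout, and the flag positions are all hypothetical and do not match any architecture's real swap entry encoding. It only shows that sub-entry i keeps the swap type, takes offset base + i, and carries the same soft_dirty, uffd_wp, and exclusive bits as the PMD entry.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy encoding, NOT a real architecture's swap entry layout. */
#define TOY_PMD_NR       512u                 /* PTE entries produced by one split */
#define TOY_SOFT_DIRTY   (UINT64_C(1) << 0)   /* stand-in for the soft_dirty bit */
#define TOY_UFFD_WP      (UINT64_C(1) << 1)   /* stand-in for the uffd_wp bit */
#define TOY_EXCLUSIVE    (UINT64_C(1) << 2)   /* stand-in for the exclusive bit */
#define TOY_FLAG_MASK    (TOY_SOFT_DIRTY | TOY_UFFD_WP | TOY_EXCLUSIVE)
#define TOY_OFFSET_SHIFT 3
#define TOY_TYPE_SHIFT   58                   /* leaves 6 bits for the swap type */
#define TOY_OFFSET_MASK  ((UINT64_C(1) << (TOY_TYPE_SHIFT - TOY_OFFSET_SHIFT)) - 1)

/* Pack a swap type, a slot offset, and overlay flags into one 64-bit entry. */
static uint64_t toy_mk_entry(unsigned int type, uint64_t offset, uint64_t flags)
{
	return ((uint64_t)type << TOY_TYPE_SHIFT) |
	       ((offset & TOY_OFFSET_MASK) << TOY_OFFSET_SHIFT) |
	       (flags & TOY_FLAG_MASK);
}

static unsigned int toy_type(uint64_t entry)
{
	return (unsigned int)(entry >> TOY_TYPE_SHIFT);
}

static uint64_t toy_offset(uint64_t entry)
{
	return (entry >> TOY_OFFSET_SHIFT) & TOY_OFFSET_MASK;
}

static uint64_t toy_flags(uint64_t entry)
{
	return entry & TOY_FLAG_MASK;
}

/*
 * Split one PMD-level entry into TOY_PMD_NR PTE-level entries.
 * Entry i keeps the type, uses offset base + i, and inherits all flags.
 */
static void toy_split_pmd_swap(uint64_t pmd_entry, uint64_t ptes[TOY_PMD_NR])
{
	unsigned int type = toy_type(pmd_entry);
	uint64_t base = toy_offset(pmd_entry);
	uint64_t flags = toy_flags(pmd_entry);
	unsigned int i;

	for (i = 0; i < TOY_PMD_NR; i++)
		ptes[i] = toy_mk_entry(type, base + i, flags);
}
```

In this model no page or folio is touched at any point, which mirrors the statement above that the swap split needs no folio reference: the entries name swap slots, not pages.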
This branch is reached from the explicit __split_huge_pmd() callers that hit a non-present PMD: partial-range mprotect / munmap, the wp_huge_pmd() PMD-COW fallback, and the swap-in / swapoff fallbacks added in later patches when the cached folio is no longer PMD-sized. page_vma_mapped_walk() does not iterate PMD swap entries, so try_to_unmap_one() and try_to_migrate_one() do not reach this branch and freeze=3Dtrue cannot occur in this branch today. page and folio are therefore left uninitialized in the swap branch; a VM_WARN_ON_ONCE(freeze) catches any future caller that breaks this invariant before the freeze path dereferences page_to_pfn(page + i) or put_page(page). Signed-off-by: Usama Arif --- mm/huge_memory.c | 27 ++++++++++++++++++++++++++- 1 file changed, 26 insertions(+), 1 deletion(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index af6a9c20131a..7cb1afde46e1 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3144,6 +3144,12 @@ static void __split_huge_pmd_locked(struct vm_area_s= truct *vma, pmd_t *pmd, folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR, vma, haddr, rmap_flags); } + } else if (pmd_is_swap_entry(*pmd)) { + VM_WARN_ON_ONCE(freeze); + old_pmd =3D *pmd; + soft_dirty =3D pmd_swp_soft_dirty(old_pmd); + uffd_wp =3D pmd_swp_uffd_wp(old_pmd); + anon_exclusive =3D pmd_swp_exclusive(old_pmd); } else { /* * Up to this point the pmd is present and huge and userland has @@ -3280,6 +3286,25 @@ static void __split_huge_pmd_locked(struct vm_area_s= truct *vma, pmd_t *pmd, VM_WARN_ON(!pte_none(ptep_get(pte + i))); set_pte_at(mm, addr, pte + i, entry); } + } else if (pmd_is_swap_entry(old_pmd)) { + softleaf_t sl_entry =3D softleaf_from_pmd(old_pmd); + pte_t swp_pte; + swp_entry_t sub_entry; + + for (i =3D 0, addr =3D haddr; i < HPAGE_PMD_NR; + i++, addr +=3D PAGE_SIZE) { + sub_entry =3D swp_entry(swp_type(sl_entry), + swp_offset(sl_entry) + i); + swp_pte =3D swp_entry_to_pte(sub_entry); + if (soft_dirty) + swp_pte =3D 
pte_swp_mksoft_dirty(swp_pte); + if (uffd_wp) + swp_pte =3D pte_swp_mkuffd_wp(swp_pte); + if (anon_exclusive) + swp_pte =3D pte_swp_mkexclusive(swp_pte); + VM_WARN_ON(!pte_none(ptep_get(pte + i))); + set_pte_at(mm, addr, pte + i, swp_pte); + } } else { pte_t entry; =20 @@ -3303,7 +3328,7 @@ static void __split_huge_pmd_locked(struct vm_area_st= ruct *vma, pmd_t *pmd, } pte_unmap(pte); =20 - if (!pmd_is_migration_entry(*pmd)) + if (!pmd_is_migration_entry(*pmd) && !pmd_is_swap_entry(*pmd)) folio_remove_rmap_pmd(folio, page, vma); if (freeze) put_page(page); --=20 2.52.0 From nobody Mon Jun 8 04:24:22 2026 Received: from out-173.mta0.migadu.com (out-173.mta0.migadu.com [91.218.175.173]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8AAD23F5BD3 for ; Tue, 2 Jun 2026 14:26:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.173 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410411; cv=none; b=W9fIBqLCv6hFETfvEq0QpCh/wSaIdE0Pp8wGIbje9jOgJwtrGzTtTdllGnRhshToY8f08Q/23MVbgv9w8z3xgiMvMvYuE6uWP6PGmZwm69nejEmvqLT4d8Up3xVauyMSgyLgqxb164dlVtSj4HkmNQHAc9ts/4oDM875SmKgyrQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410411; c=relaxed/simple; bh=jIiMDBdEXNLSj3uLEKvFZNVmn8WieiDCI6AxNkihYQQ=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=IOpesmMUEntZjes11VXxcXtr8FHnC4EJLCIgMHdXMHTATb0+TTrQPMbr+mPgXhlFZCEwyiq1bprx+pwGFQh5KSBE1ZNa/+zocFfFXXtdEq9a+67+3JVz4qqmqK+J5ShyqRAVV6jM3+7SLpuSr3ZWF6sL0yMg06yUO83EIF049PU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=CcR00fPm; arc=none smtp.client-ip=91.218.175.173 Authentication-Results: smtp.subspace.kernel.org; 
dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="CcR00fPm" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1780410407; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=JltyrHvN/NZNrOWmCbN4CJ5eXFJwXOO7msUXDDwYl98=; b=CcR00fPm0xXFx1+e43Qr561rtZRIGJI1IBsaPf/KoHfUvSGv7wcX5+g0dxnfeFd/mLVc0P HlXBv/ccrIZ/VtAKIIaZOM7+P1FYDzi6ftjYyVQDPrlR2n1OZN0EcAHtjFRrYXrIyG0HIh Ek7OZCc2wx1eeRrzuz6Dnm4VrW24sgc= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: ying.huang@linux.alibaba.com, Baoquan He , willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam R. 
Howlett , ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [v2 09/16] mm: handle PMD swap entries in fork path Date: Tue, 2 Jun 2026 07:24:17 -0700 Message-ID: <20260602142537.198755-10-usama.arif@linux.dev> In-Reply-To: <20260602142537.198755-1-usama.arif@linux.dev> References: <20260602142537.198755-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" Teach copy_huge_pmd()/copy_huge_non_present_pmd() about swap entries, mirroring copy_nonpresent_pte(). swap_dup_entry_direct() gains a nr parameter (and is renamed to swap_dup_entries_direct()) so it can duplicate a contiguous range of swap slots in one call, matching the existing swap_put_entries_direct(entry, nr) API. Existing callers pass 1. copy_huge_non_present_pmd() "copies" PMD swap entries during fork instead of splitting, preserving the THP. This mirrors copy_nonpresent_pte() which duplicates the swap slot refcount, clears the exclusive bit on the source, and adds the destination mm to mmlist. If swap_dup_entries_direct() fails (GFP_ATOMIC table alloc), copy_huge_pmd() retries after swap_retry_table_alloc() with GFP_KERNEL, matching the PTE retry in copy_pte_range(). The PMD is stable across the retry because dup_mmap() holds write mmap_lock on both mm_structs. 
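The retry flow described above can be sketched in userspace as follows. It is a simplified model under stated assumptions: struct model_cluster and the model_*() functions are hypothetical, a single counter stands in for the per-slot swap counts, and locking is only indicated in comments.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one swap cluster and its extend table. */
struct model_cluster {
	bool table_present;	/* extend table already allocated? */
	bool atomic_alloc_ok;	/* would a non-sleeping allocation succeed? */
	bool sleeping_alloc_ok;	/* would a sleeping allocation succeed? */
	int refs;		/* simplified: one counter for all slots */
	int attempts;		/* number of duplicate attempts made */
};

/* Models the duplicate step: may need the table, must not sleep. */
static int model_dup_entries(struct model_cluster *c, int nr)
{
	c->attempts++;
	if (!c->table_present) {
		if (!c->atomic_alloc_ok)
			return -ENOMEM;
		c->table_present = true;
	}
	c->refs += nr;
	return 0;
}

/* Models the fallback allocation, done with no page table lock held. */
static int model_retry_table_alloc(struct model_cluster *c)
{
	if (!c->sleeping_alloc_ok)
		return -ENOMEM;
	c->table_present = true;
	return 0;
}

/* Models the copy path: duplicate, and on -ENOMEM allocate and retry. */
static int model_copy_pmd_swap(struct model_cluster *c, int nr)
{
	assert(c != NULL);

	for (;;) {
		/* both PMD locks would be taken here */
		int ret = model_dup_entries(c, nr);
		/* both PMD locks would be dropped here on failure */

		if (ret != -ENOMEM)
			return ret;
		if (model_retry_table_alloc(c))
			return -ENOMEM;	/* fallback failed: give up */
	}
}
```

In this model the second attempt always succeeds once the fallback allocation has succeeded, so the loop runs at most twice.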
Signed-off-by: Usama Arif --- include/linux/swap.h | 4 ++-- mm/huge_memory.c | 52 +++++++++++++++++++++++++++++++++++++++----- mm/memory.c | 2 +- mm/swapfile.c | 7 +++--- 4 files changed, 53 insertions(+), 12 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 6d72778e6cc3..8a5ec5f0a7c7 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -458,7 +458,7 @@ sector_t swap_folio_sector(struct folio *folio); * All entries must be allocated by folio_alloc_swap(). And they must have * a swap count > 1. See comments of folio_*_swap helpers for more info. */ -int swap_dup_entry_direct(swp_entry_t entry); +int swap_dup_entries_direct(swp_entry_t entry, int nr); void swap_put_entries_direct(swp_entry_t entry, int nr); =20 /* @@ -502,7 +502,7 @@ static inline void free_swap_cache(struct folio *folio) { } =20 -static inline int swap_dup_entry_direct(swp_entry_t ent) +static inline int swap_dup_entries_direct(swp_entry_t ent, int nr) { return 0; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 7cb1afde46e1..a525417d13f6 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1806,7 +1806,7 @@ bool touch_pmd(struct vm_area_struct *vma, unsigned l= ong addr, return false; } =20 -static void copy_huge_non_present_pmd( +static int copy_huge_non_present_pmd( struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, @@ -1852,14 +1852,35 @@ static void copy_huge_non_present_pmd( */ folio_try_dup_anon_rmap_pmd(src_folio, &src_folio->page, dst_vma, src_vma); + } else if (softleaf_is_swap(entry)) { + int err; + + /* + * PMD swap entry: duplicate swap references and clear + * exclusive on source, matching copy_nonpresent_pte(). 
+ */ + err =3D swap_dup_entries_direct(entry, HPAGE_PMD_NR); + if (err < 0) + return err; + + mm_prepare_for_swap_entries(dst_mm); + + if (pmd_swp_exclusive(pmd)) { + pmd =3D pmd_swp_clear_exclusive(pmd); + set_pmd_at(src_mm, addr, src_pmd, pmd); + } } =20 - add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); + if (softleaf_is_swap(entry)) + add_mm_counter(dst_mm, MM_SWAPENTS, HPAGE_PMD_NR); + else + add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); mm_inc_nr_ptes(dst_mm); pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); if (!userfaultfd_wp(dst_vma)) pmd =3D pmd_swp_clear_uffd_wp(pmd); set_pmd_at(dst_mm, addr, dst_pmd, pmd); + return 0; } =20 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, @@ -1900,6 +1921,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm= _struct *src_mm, if (unlikely(!pgtable)) goto out; =20 +retry: dst_ptl =3D pmd_lock(dst_mm, dst_pmd); src_ptl =3D pmd_lockptr(src_mm, src_pmd); spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); @@ -1907,10 +1929,28 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct = mm_struct *src_mm, ret =3D -EAGAIN; pmd =3D *src_pmd; =20 - if (unlikely(thp_migration_supported() && - pmd_is_valid_softleaf(pmd))) { - copy_huge_non_present_pmd(dst_mm, src_mm, dst_pmd, src_pmd, addr, - dst_vma, src_vma, pmd, pgtable); + if (unlikely(pmd_is_valid_softleaf(pmd))) { + ret =3D copy_huge_non_present_pmd(dst_mm, src_mm, dst_pmd, src_pmd, + addr, dst_vma, src_vma, pmd, + pgtable); + if (ret) { + spin_unlock(src_ptl); + spin_unlock(dst_ptl); + /* + * For PMD swap entries -ENOMEM means the per-cluster + * swap-extend table couldn't be GFP_ATOMIC-allocated. + * try the GFP_KERNEL fallback once before giving up. 
+ */ + if (ret =3D=3D -ENOMEM) { + softleaf_t entry =3D softleaf_from_pmd(pmd); + + if (softleaf_is_swap(entry) && + !swap_retry_table_alloc(entry, GFP_KERNEL)) + goto retry; + } + pte_free(dst_mm, pgtable); + goto out; + } ret =3D 0; goto out_unlock; } diff --git a/mm/memory.c b/mm/memory.c index 137f34c3fd32..5cf02e394c92 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -950,7 +950,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm= _struct *src_mm, struct page *page; =20 if (likely(softleaf_is_swap(entry))) { - if (swap_dup_entry_direct(entry) < 0) + if (swap_dup_entries_direct(entry, 1) < 0) return -EIO; =20 mm_prepare_for_swap_entries(dst_mm); diff --git a/mm/swapfile.c b/mm/swapfile.c index e3d126602a1e..37408905490e 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3899,8 +3899,9 @@ void si_swapinfo(struct sysinfo *val) } =20 /* - * swap_dup_entry_direct() - Increase reference count of a swap entry by o= ne. + * swap_dup_entries_direct() - Increase reference count of swap entries by= one. * @entry: first swap entry from which we want to increase the refcount. + * @nr: number of contiguous swap entries to duplicate. * * Returns 0 for success, or -ENOMEM if the extend table is required * but could not be atomically allocated. Returns -EINVAL if the swap @@ -3912,7 +3913,7 @@ void si_swapinfo(struct sysinfo *val) * Also the swap entry must have a count >=3D 1. Otherwise folio_dup_swap = should * be used. 
*/ -int swap_dup_entry_direct(swp_entry_t entry) +int swap_dup_entries_direct(swp_entry_t entry, int nr) { struct swap_info_struct *si; =20 @@ -3929,7 +3930,7 @@ int swap_dup_entry_direct(swp_entry_t entry) */ VM_WARN_ON_ONCE(!swap_entry_swapped(si, entry)); =20 - return swap_dup_entries_cluster(si, swp_offset(entry), 1); + return swap_dup_entries_cluster(si, swp_offset(entry), nr); } =20 #if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP) --=20 2.52.0 From nobody Mon Jun 8 04:24:22 2026 Received: from out-179.mta0.migadu.com (out-179.mta0.migadu.com [91.218.175.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 97DDD3FD12E for ; Tue, 2 Jun 2026 14:26:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.179 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410416; cv=none; b=N+0R9hcN6pN+QVJMBzl6ujgNLDTp67MsoMXu+tPjSRcb/b+GG5Rou0Xv60Vvz6rXUK+7ygPzXJ+4TWR8YyLyeSFZ+cIXsdrigepMCSbrIQ5aq1TYMnJonNmPWbXnHlsuHlQYWvbDCHl5bLSpywPTV+9B5JGVIR+NqokixQO0+eE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410416; c=relaxed/simple; bh=cqk3+taSyLhkf5kgJagAZFhI1o1VvjytPYh6XpUZsY4=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=czk6hAipwsB74MTKeItFfab6ba53TLUJHjglJ/zWpUAVxUD92bxgwIIjufBP2uESGdVW1nvne+w3L0gBfwRlK/aMZdw3wnml9nNlb6c+sN9uobLEXCl8ZhodP+TWzpcOFJWzJ/H2HQSfc6A+AL0xAKekp2Gd9wQ01i4yRbKlOec= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=ucGs9R8n; arc=none smtp.client-ip=91.218.175.179 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass 
smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="ucGs9R8n" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1780410412; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=06MxZOv/Uvd5gtLmvPmMkSAulI9G01NYpUuX43wXZOE=; b=ucGs9R8nk+jgS+wZZhDdMWXHeq85nzUx9Wp4QkaYCkzxiyLS6vNERSOqT60XvIEknrW14W brMwF1izcCZFfFzz+sOYRiuYCkdQ6b9kQExo5WZ6nCDaT59CeMVdwvLvdgEEr5Msfb9xVi hMNpLwIoQgABdEuGe/4NCkqoaLUyVFI= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: ying.huang@linux.alibaba.com, Baoquan He , willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam R. 
Howlett , ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [v2 10/16] mm: swap in PMD swap entries as whole THPs during swapoff Date: Tue, 2 Jun 2026 07:24:18 -0700 Message-ID: <20260602142537.198755-11-usama.arif@linux.dev> In-Reply-To: <20260602142537.198755-1-usama.arif@linux.dev> References: <20260602142537.198755-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" Add unuse_pmd() and call it from unuse_pmd_range() to swap in PMD-level swap entries as whole THPs during swapoff. This mirrors the existing unuse_pte_range() but operates at PMD granularity. If the PMD-order folio cannot be allocated, the cached folio is no longer PMD-sized (e.g. split in the swap cache by deferred_split_scan() or memory_failure() while the PMD swap entry was installed), or the folio is not uptodate, the PMD swap entry is split into PTE-level entries via __split_huge_pmd() and a non-zero error is returned so unuse_pmd_range() falls through to unuse_pte_range(), which handles the individual entries at order-0. Signed-off-by: Usama Arif --- mm/swapfile.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 145 insertions(+) diff --git a/mm/swapfile.c b/mm/swapfile.c index 37408905490e..56454e486324 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -42,6 +42,7 @@ #include #include #include +#include =20 #include #include @@ -2641,6 +2642,138 @@ static int unuse_pte_range(struct vm_area_struct *v= ma, pmd_t *pmd, return 0; } =20 +/* + * unuse_pmd - Map a locked folio at PMD granularity during swapoff. + * + * The caller provides a locked, swapped-in folio. Returns 0 on success + * (PMD was mapped). 
Returns -EAGAIN if the swap cache folio no longer + * matches the entry or the PMD changed under the lock (try_to_unuse will + * rescan). Returns -EIO if the folio is not uptodate; in that case the + * PMD is split so unuse_pte_range() can handle individual pages. + */ +static int unuse_pmd(struct vm_area_struct *vma, pmd_t *pmd, + unsigned long addr, softleaf_t entry, + struct folio *folio) +{ + struct mm_struct *mm =3D vma->vm_mm; + struct page *page; + pmd_t new_pmd, old_pmd; + spinlock_t *ptl; + rmap_t rmap_flags =3D RMAP_NONE; + bool exclusive; + + if (unlikely(!folio_matches_swap_entry(folio, entry))) + return -EAGAIN; + + if (unlikely(!folio_test_uptodate(folio))) { + __split_huge_pmd(vma, pmd, addr, false); + return -EIO; + } + + page =3D folio_page(folio, 0); + + ptl =3D pmd_lock(mm, pmd); + old_pmd =3D pmdp_get(pmd); + + if (!pmd_is_swap_entry(old_pmd) || + softleaf_from_pmd(old_pmd).val !=3D entry.val) { + spin_unlock(ptl); + return -EAGAIN; + } + + exclusive =3D pmd_swp_exclusive(old_pmd); + + /* + * Some architectures may have to restore extra metadata to the folio + * when reading from swap. This metadata may be indexed by swap entry + * so this must be called before folio_put_swap(). 
+ */ + arch_swap_restore(folio_swap(entry, folio), folio); + + add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR); + add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR); + + new_pmd =3D folio_mk_pmd(folio, vma->vm_page_prot); + new_pmd =3D pmd_mkold(new_pmd); + if (pmd_swp_soft_dirty(old_pmd)) + new_pmd =3D pmd_mksoft_dirty(new_pmd); + if (pmd_swp_uffd_wp(old_pmd)) + new_pmd =3D pmd_mkuffd_wp(new_pmd); + + if (exclusive) + rmap_flags |=3D RMAP_EXCLUSIVE; + + folio_get(folio); + if (!folio_test_anon(folio)) + folio_add_new_anon_rmap(folio, vma, addr, rmap_flags); + else + folio_add_anon_rmap_pmd(folio, page, vma, addr, rmap_flags); + + set_pmd_at(mm, addr, pmd, new_pmd); + folio_put_swap(folio, NULL); + + spin_unlock(ptl); + + folio_free_swap(folio); + return 0; +} + +/* + * Try to swap in a PMD swap entry as a whole THP. Returns 0 on success. + * Returns -ENOMEM if the PMD-order folio could not be allocated/charged, + * -EIO if swap-in failed, or -EAGAIN if the cached folio is no longer + * PMD-sized; in all of these the PMD is split so the caller can fall + * back to unuse_pte_range(). Otherwise propagates the error from + * unuse_pmd(). + */ +static int unuse_pmd_entry(struct vm_area_struct *vma, pmd_t *pmd, + unsigned long addr, softleaf_t entry) +{ + struct folio *folio; + int ret; + + folio =3D swap_cache_get_folio(entry); + if (!folio) { + struct vm_fault vmf =3D { + .vma =3D vma, + .address =3D addr, + .real_address =3D addr, + .pmd =3D pmd, + }; + + folio =3D swapin_sync(entry, GFP_HIGHUSER_MOVABLE, + BIT(HPAGE_PMD_ORDER), &vmf, NULL, 0); + if (IS_ERR_OR_NULL(folio)) { + ret =3D -ENOMEM; + goto split_fallback; + } + } + + folio_lock(folio); + folio_wait_writeback(folio); + /* + * If the cached folio is no longer PMD-sized (e.g. split in the + * swap cache by deferred_split_scan() or memory_failure() while + * the PMD swap entry was installed), the PMD swap entry no longer + * maps a single contiguous folio. 
Split the PMD swap entry so + * unuse_pte_range() can swap the per-slot folios in individually. + */ + if (folio_nr_pages(folio) !=3D HPAGE_PMD_NR) { + folio_unlock(folio); + folio_put(folio); + ret =3D -EAGAIN; + goto split_fallback; + } + ret =3D unuse_pmd(vma, pmd, addr, entry, folio); + folio_unlock(folio); + folio_put(folio); + return ret; + +split_fallback: + __split_huge_pmd(vma, pmd, addr, false); + return ret; +} + static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud, unsigned long addr, unsigned long end, unsigned int type) @@ -2653,6 +2786,18 @@ static inline int unuse_pmd_range(struct vm_area_str= uct *vma, pud_t *pud, do { cond_resched(); next =3D pmd_addr_end(addr, end); + + pmd_t pmdval =3D pmdp_get(pmd); + + if (pmd_is_swap_entry(pmdval)) { + softleaf_t sl =3D softleaf_from_pmd(pmdval); + + if (swp_type(sl) =3D=3D type) { + if (!unuse_pmd_entry(vma, pmd, addr, sl)) + continue; + } + } + ret =3D unuse_pte_range(vma, pmd, addr, next, type); if (ret) return ret; --=20 2.52.0 From nobody Mon Jun 8 04:24:22 2026 Received: from out-178.mta0.migadu.com (out-178.mta0.migadu.com [91.218.175.178]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 992803FE35F for ; Tue, 2 Jun 2026 14:26:59 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.178 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410421; cv=none; b=ElhSUY8yXypnT/9clC5apCMd1GlLjqzX1WkerN4EkbocYogYuoL0zsq09ORCJppn4UKIQLeFlw+L1WtJFt0ysqLWpbEahNWsMHVgwGW/DIC/0s5AewyRIRgL58GXVozzLVkWsVc56yEZEOxzVe3e4qwIfdK6tZdLJoaS7MfwPR8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410421; c=relaxed/simple; bh=E8zHRLikc1vrEZa1/QE/5ObknipANw0iGHdI89fnPS8=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; 
b=jwD4XacHUUZF/sCeHgbGnqwCjFuMK6+vTHRUIioRyszZBbmtLx6rchvzVClTQUARoyO0tC7ylsyF9N2VDILCj7XKYSPweZLV2tt9tVwnSBx0WTW/doVgVJSGo+3UFC4ofYWbwyzAHd3nmXr08tGYCZp/Fly0OE2KWd/9GIBea9M= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=mwmgp0gw; arc=none smtp.client-ip=91.218.175.178 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="mwmgp0gw" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1780410417; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=h/ZJc+dELxqKLj3TOn9ZdLZTgRymolGB/1+56SdxJ+g=; b=mwmgp0gw8F71axX6nONQDU1lZMSpj8WM3EI36R3NUliXu7XyAB3ky1PsWAcnCzR8rCySyZ /FcqbTfDwwj09ahAiqSwYmzh0xiKilQ+NsjORXFeSXHVdN/s3dmdYvgM8Mm2xZEYG6OQ6P aaPp2dvt/TOIiD1HAi/0z2RiOq1VFY8= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: ying.huang@linux.alibaba.com, Baoquan He , willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam R. 
Howlett , ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [v2 11/16] mm: handle PMD swap entries in non-present PMD walkers Date: Tue, 2 Jun 2026 07:24:19 -0700 Message-ID: <20260602142537.198755-12-usama.arif@linux.dev> In-Reply-To: <20260602142537.198755-1-usama.arif@linux.dev> References: <20260602142537.198755-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" Teach the remaining non-present PMD walkers about swap entries, mirroring the PTE-level equivalents. smaps_pmd_entry() accounts swap and swap_pss via a new shared smaps_account_swap() helper used by both PTE and PMD paths. zap_huge_pmd() frees swap slots via swap_put_entries_direct(), matching zap_nonpresent_ptes(). change_non_present_huge_pmd() skips write-permission changes for swap entries and only updates uffd_wp, matching change_softleaf_pte(). move_soft_dirty_pmd(), clear_soft_dirty_pmd(), make_uffd_wp_pmd(), pagemap_pmd_range_thp(), and change_huge_pmd() handle swap entries alongside migration entries. madvise_cold_or_pageout_pte_range() extends its non-present PMD VM_BUG_ON to allow swap entries; without this, hitting a PMD swap entry on a DEBUG_VM kernel would BUG(). mincore_pte_range() routes the pmd_trans_huge_lock() branch through mincore_swap() for non-present PMDs, matching how the PTE path already calls mincore_swap() for non-present PTEs. Without this, a swapped-out PMD-mapped THP would be reported as resident, because pmd_is_huge() (and therefore pmd_trans_huge_lock()) accepts any non-present non-none PMD and the old branch unconditionally did memset(vec, 1, nr).
mincore_swap() returns 1 for migration / device-private entries (preserving the prior behavior for those) and checks swap-cache residency for swap entries. queue_folios_pmd() in mempolicy silently skips swap entries, matching the PTE walker which only counts migration entries as failures. Without this, mbind(MPOL_MF_STRICT) would spuriously return -EIO on a swapped-out THP. madvise_free_huge_pmd() handles PMD swap entries directly: for a full-range MADV_FREE it clears the PMD, frees the deposited page table, and releases the swap slots; for a partial range it splits to PTE swap entries. Without this, MADV_FREE silently becomes a no-op on swapped-out THPs, leaking swap slots. check_pmd_state() in khugepaged returns SCAN_PMD_MAPPED for PMD swap entries, treating a swapped-out THP as still being a THP from khugepaged's perspective and matching the existing migration-entry handling. hmm_vma_handle_absent_pmd() faults in PMD swap entries via hmm_vma_fault() instead of returning -EFAULT. The first per-page handle_mm_fault() call triggers do_huge_pmd_swap_page(), which maps the entire folio; subsequent calls become harmless huge_pmd_set_accessed() and the walker retries with a present PMD. 
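The swap and swap_pss accounting that the shared smaps helper performs can be illustrated with a small userspace model. The names struct model_stats, MODEL_PSS_SHIFT and model_account_swap() are hypothetical, and the shift value of 12 is an assumption about the fixed-point scale rather than something stated in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: proportional sizes are kept with 12 fractional bits. */
#define MODEL_PSS_SHIFT 12

/* Hypothetical subset of the per-VMA statistics. */
struct model_stats {
	uint64_t swap;		/* bytes swapped out */
	uint64_t swap_pss;	/* proportional share, fixed point */
};

/*
 * Account 'size' bytes of swap. When the slot is shared by two or more
 * mappings, the proportional share is divided by the share count;
 * otherwise the full size is charged.
 */
static void model_account_swap(struct model_stats *s, uint64_t size,
			       int swapcount)
{
	uint64_t pss = size << MODEL_PSS_SHIFT;

	assert(s != NULL);

	s->swap += size;
	if (swapcount >= 2)
		pss /= (uint64_t)swapcount;
	s->swap_pss += pss;
}
```

Passing the size as a parameter is what lets one routine serve both granularities: the PTE path would pass the base page size and the PMD path the huge page size.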
Signed-off-by: Usama Arif --- fs/proc/task_mmu.c | 43 +++++++++++++++++++++------------- mm/hmm.c | 3 ++- mm/huge_memory.c | 58 +++++++++++++++++++++++++++++++++++----------- mm/khugepaged.c | 6 +++++ mm/madvise.c | 5 ++-- mm/mempolicy.c | 2 ++ mm/mincore.c | 14 ++++++++++- 7 files changed, 98 insertions(+), 33 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 1fb5acd88ad0..f85899eec80f 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1046,6 +1046,23 @@ static void smaps_pte_hole_lookup(unsigned long addr= , struct mm_walk *walk) #endif } =20 +static void smaps_account_swap(struct mem_size_stats *mss, + softleaf_t entry, unsigned long size) +{ + int mapcount; + + mss->swap +=3D size; + mapcount =3D swp_swapcount(entry); + if (mapcount >=3D 2) { + u64 pss_delta =3D (u64)size << PSS_SHIFT; + + do_div(pss_delta, mapcount); + mss->swap_pss +=3D pss_delta; + } else { + mss->swap_pss +=3D (u64)size << PSS_SHIFT; + } +} + static void smaps_pte_entry(pte_t *pte, unsigned long addr, struct mm_walk *walk) { @@ -1067,18 +1084,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned lon= g addr, const softleaf_t entry =3D softleaf_from_pte(ptent); =20 if (softleaf_is_swap(entry)) { - int mapcount; - - mss->swap +=3D PAGE_SIZE; - mapcount =3D swp_swapcount(entry); - if (mapcount >=3D 2) { - u64 pss_delta =3D (u64)PAGE_SIZE << PSS_SHIFT; - - do_div(pss_delta, mapcount); - mss->swap_pss +=3D pss_delta; - } else { - mss->swap_pss +=3D (u64)PAGE_SIZE << PSS_SHIFT; - } + smaps_account_swap(mss, entry, PAGE_SIZE); } else if (softleaf_has_pfn(entry)) { if (softleaf_is_device_private(entry)) present =3D true; @@ -1108,9 +1114,13 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned lon= g addr, if (pmd_present(*pmd)) { page =3D vm_normal_page_pmd(vma, addr, *pmd); present =3D true; - } else if (unlikely(thp_migration_supported())) { + } else { const softleaf_t entry =3D softleaf_from_pmd(*pmd); =20 + if (softleaf_is_swap(entry)) { + smaps_account_swap(mss, 
entry, HPAGE_PMD_SIZE); + return; + } if (softleaf_has_pfn(entry)) page =3D softleaf_to_page(entry); } @@ -1752,7 +1762,7 @@ static inline void clear_soft_dirty_pmd(struct vm_are= a_struct *vma, pmd =3D pmd_clear_soft_dirty(pmd); =20 set_pmd_at(vma->vm_mm, addr, pmdp, pmd); - } else if (pmd_is_migration_entry(pmd)) { + } else if (pmd_is_migration_entry(pmd) || pmd_is_swap_entry(pmd)) { pmd =3D pmd_swp_clear_soft_dirty(pmd); set_pmd_at(vma->vm_mm, addr, pmdp, pmd); } @@ -2112,7 +2122,8 @@ static int pagemap_pmd_range_thp(pmd_t *pmdp, unsigne= d long addr, flags |=3D PM_UFFD_WP; if (pm->show_pfn) frame =3D pmd_pfn(pmd) + idx; - } else if (thp_migration_supported()) { + } else if (pmd_is_swap_entry(pmd) || + (thp_migration_supported() && pmd_is_migration_entry(pmd))) { const softleaf_t entry =3D softleaf_from_pmd(pmd); unsigned long offset; =20 @@ -2550,7 +2561,7 @@ static void make_uffd_wp_pmd(struct vm_area_struct *v= ma, old =3D pmdp_invalidate_ad(vma, addr, pmdp); pmd =3D pmd_mkuffd_wp(old); set_pmd_at(vma->vm_mm, addr, pmdp, pmd); - } else if (pmd_is_migration_entry(pmd)) { + } else if (pmd_is_migration_entry(pmd) || pmd_is_swap_entry(pmd)) { pmd =3D pmd_swp_mkuffd_wp(pmd); set_pmd_at(vma->vm_mm, addr, pmdp, pmd); } diff --git a/mm/hmm.c b/mm/hmm.c index cabf111f2ed2..b5fa7549c183 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -370,7 +370,8 @@ static int hmm_vma_handle_absent_pmd(struct mm_walk *wa= lk, unsigned long start, required_fault =3D hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0); if (required_fault) { - if (softleaf_is_device_private(entry)) + if (softleaf_is_device_private(entry) || + softleaf_is_swap(entry)) return hmm_vma_fault(addr, end, required_fault, walk); else return -EFAULT; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index a525417d13f6..1d6d3817046d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2314,6 +2314,14 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vm= f) return 0; } =20 +static inline void 
zap_deposited_table(struct mm_struct *mm, pmd_t *pmd) +{ + pgtable_t pgtable; + + pgtable =3D pgtable_trans_huge_withdraw(mm, pmd); + pte_free(mm, pgtable); + mm_dec_nr_ptes(mm); +} /* * Return true if we do MADV_FREE successfully on entire pmd page. * Otherwise, return false. @@ -2338,8 +2346,23 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, s= truct vm_area_struct *vma, goto out; =20 if (unlikely(!pmd_present(orig_pmd))) { + if (pmd_is_swap_entry(orig_pmd)) { + if (next - addr !=3D HPAGE_PMD_SIZE) { + spin_unlock(ptl); + __split_huge_pmd(vma, pmd, addr, false); + goto out_unlocked; + } + softleaf_t sl =3D softleaf_from_pmd(orig_pmd); + + pmdp_huge_get_and_clear(mm, addr, pmd); + zap_deposited_table(mm, pmd); + spin_unlock(ptl); + swap_put_entries_direct(sl, HPAGE_PMD_NR); + add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR); + return true; + } VM_BUG_ON(thp_migration_supported() && - !pmd_is_migration_entry(orig_pmd)); + !pmd_is_migration_entry(orig_pmd)); goto out; } =20 @@ -2388,15 +2411,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, s= truct vm_area_struct *vma, return ret; } =20 -static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd) -{ - pgtable_t pgtable; - - pgtable =3D pgtable_trans_huge_withdraw(mm, pmd); - pte_free(mm, pgtable); - mm_dec_nr_ptes(mm); -} - static void zap_huge_pmd_folio(struct mm_struct *mm, struct vm_area_struct= *vma, pmd_t pmdval, struct folio *folio, bool is_present) { @@ -2489,6 +2503,16 @@ bool zap_huge_pmd(struct mmu_gather *tlb, struct vm_= area_struct *vma, arch_check_zapped_pmd(vma, orig_pmd); tlb_remove_pmd_tlb_entry(tlb, pmd, addr); =20 + if (pmd_is_swap_entry(orig_pmd)) { + softleaf_t sl =3D softleaf_from_pmd(orig_pmd); + + zap_deposited_table(mm, pmd); + spin_unlock(ptl); + swap_put_entries_direct(sl, HPAGE_PMD_NR); + add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR); + return true; + } + is_present =3D pmd_present(orig_pmd); folio =3D normal_or_softleaf_folio_pmd(vma, addr, orig_pmd, is_present); 
has_deposit =3D has_deposited_pgtable(vma, orig_pmd, folio); @@ -2521,7 +2545,8 @@ static inline int pmd_move_must_withdraw(spinlock_t *= new_pmd_ptl, static pmd_t move_soft_dirty_pmd(pmd_t pmd) { if (pgtable_supports_soft_dirty()) { - if (unlikely(pmd_is_migration_entry(pmd))) + if (unlikely(pmd_is_migration_entry(pmd) || + pmd_is_swap_entry(pmd))) pmd =3D pmd_swp_mksoft_dirty(pmd); else if (pmd_present(pmd)) pmd =3D pmd_mksoft_dirty(pmd); @@ -2601,7 +2626,14 @@ static void change_non_present_huge_pmd(struct mm_st= ruct *mm, pmd_t newpmd; =20 VM_WARN_ON(!pmd_is_valid_softleaf(*pmd)); - if (softleaf_is_migration_write(entry)) { + + /* + * PMD swap entries don't encode write permission in the entry type, + * so only uffd_wp flag changes apply. No folio lookup needed. + */ + if (softleaf_is_swap(entry)) { + newpmd =3D *pmd; + } else if (softleaf_is_migration_write(entry)) { const struct folio *folio =3D softleaf_to_folio(entry); =20 /* @@ -2660,7 +2692,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm= _area_struct *vma, if (!ptl) return 0; =20 - if (thp_migration_supported() && pmd_is_valid_softleaf(*pmd)) { + if (pmd_is_valid_softleaf(*pmd)) { change_non_present_huge_pmd(mm, addr, pmd, uffd_wp, uffd_wp_resolve); goto unlock; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 8ffb47f1e845..bb63700519ab 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1111,6 +1111,12 @@ static inline enum scan_result check_pmd_state(pmd_t= *pmd) */ if (pmd_is_migration_entry(pmde)) return SCAN_PMD_MAPPED; + /* + * A PMD-mapped THP that has been swapped out is still a THP from + * khugepaged's perspective; treat it like a present huge PMD. 
+ */ + if (pmd_is_swap_entry(pmde)) + return SCAN_PMD_MAPPED; if (!pmd_present(pmde)) return SCAN_NO_PTE_TABLE; if (pmd_trans_huge(pmde)) diff --git a/mm/madvise.c b/mm/madvise.c index cd9bb077072c..00539022f804 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -390,7 +390,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, =20 if (unlikely(!pmd_present(orig_pmd))) { VM_BUG_ON(thp_migration_supported() && - !pmd_is_migration_entry(orig_pmd)); + !pmd_is_migration_entry(orig_pmd) && + !pmd_is_swap_entry(orig_pmd)); goto huge_unlock; } =20 @@ -666,7 +667,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned = long addr, int nr, max_nr; =20 next =3D pmd_addr_end(addr, end); - if (pmd_trans_huge(*pmd)) + if (pmd_trans_huge(*pmd) || pmd_is_swap_entry(*pmd)) if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next)) return 0; =20 diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 36699fabd3c2..25d929b2037e 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -658,6 +658,8 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk= *walk) qp->nr_failed++; return; } + if (unlikely(pmd_is_swap_entry(*pmd))) + return; folio =3D pmd_folio(*pmd); if (is_huge_zero_folio(folio)) { walk->action =3D ACTION_CONTINUE; diff --git a/mm/mincore.c b/mm/mincore.c index e5d13eea9234..3fee8a7b9d9d 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -172,7 +172,19 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long= addr, unsigned long end, =20 ptl =3D pmd_trans_huge_lock(pmd, vma); if (ptl) { - memset(vec, 1, nr); + if (pmd_present(*pmd)) { + memset(vec, 1, nr); + } else { + /* + * Non-present PMD: migration, device-private, or PMD + * swap entry. Route through mincore_swap() the same way + * the PTE path does -- the swap entry covers all 512 + * slots, so the whole vec gets the same answer. 
+ */ + softleaf_t entry =3D softleaf_from_pmd(*pmd); + + memset(vec, mincore_swap(entry, false), nr); + } spin_unlock(ptl); goto out; } --=20 2.52.0 From nobody Mon Jun 8 04:24:22 2026 Received: from out-181.mta0.migadu.com (out-181.mta0.migadu.com [91.218.175.181]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2B08C3FE640 for ; Tue, 2 Jun 2026 14:27:05 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.181 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410426; cv=none; b=k62T70hR6RHt8+zF/+20cZoaIbg00GtmOCNVya5IDVXstnU7XuM160lnXiyWBZllSpVqatu3Hkp9RgpiZpwQaC2PsqSQf8jxD3KBYk8dUQf5H6uxtiaG8iBlY36InD0o44IPquKij3jfT81alHZS0eE4MvW97Ep/qI8IMEDh30A= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410426; c=relaxed/simple; bh=XpZXmWOLKdEl0ky1Wups9KNcUWfZu/4dTSGfdTcbrFk=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=inMtfWZlSbAf9qwhMqRpQk9AaRY41BIcGKyeMIWzKlatjKmWVX/BnXeIh5f/5IjOR2w4a5MSxupauhuqk4I/E+JjpxTlmY6dKZWMLb9s7L6AcY+tfRVgsBtNPPDzqSZYQ2zrkg9YWIvTc+GmHM6Z73J28B4djWeqVGS8oF2s8jo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=hkVMII0r; arc=none smtp.client-ip=91.218.175.181 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="hkVMII0r" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1780410423; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=MblwLlwqCbAerbDko1BchwsCrXD3VZb6izB3pry9d1g=; b=hkVMII0r5J58OUGsynBUD9a4EDBaYUk8vlKTh8qkbpS9Wc6wyGhDqShzR0iau+RmYStX7h 1VlDt4g5kmDI1tWhdYlLPRn5V/JSneJu6i+TzOVl8CwV5De1zbrWpxGdFEbcORVEYpKJr6 Ug+LzRVHMjuIsMYNKJ2j1+bqNQ21mBA= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: ying.huang@linux.alibaba.com, Baoquan He , willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam R. Howlett , ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [v2 12/16] mm: handle PMD swap entries in MADV_WILLNEED Date: Tue, 2 Jun 2026 07:24:20 -0700 Message-ID: <20260602142537.198755-13-usama.arif@linux.dev> In-Reply-To: <20260602142537.198755-1-usama.arif@linux.dev> References: <20260602142537.198755-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" swapin_walk_pmd_entry() walks PTEs and skips non-present PMDs, so MADV_WILLNEED is a no-op on a PMD swap entry. Read the whole 2 MB folio in at PMD order via swapin_sync(BIT(HPAGE_PMD_ORDER)) so the subsequent fault hits do_huge_pmd_swap_page() and restores the THP mapping; an order-0 read-ahead would force the fault to split. 
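For illustration only, a minimal userspace sketch of how a caller might issue the hint this patch acts on. It is not part of the patch. The helper names and the 2 MB PMD size are assumptions (the size matches x86-64 with 4 KB base pages); only madvise() and MADV_WILLNEED are real interfaces. The patch does not require the range to be PMD-aligned; the sketch trims to whole-PMD chunks only so that each hinted chunk corresponds to one potential PMD swap entry.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumption: 2 MB PMD mappings (e.g. x86-64 with 4 KB base pages). */
#define ASSUMED_PMD_SIZE ((uintptr_t)2 << 20)

/*
 * Hypothetical helper: shrink [start, start + len) to its largest
 * PMD-aligned subrange. Stores the subrange start in *out and returns
 * its length in bytes, or 0 if the range contains no whole PMD chunk.
 * Address overflow near the top of the address space is not handled.
 */
size_t pmd_aligned_subrange(uintptr_t start, size_t len, uintptr_t *out)
{
	uintptr_t end = start + len;
	uintptr_t lo = (start + ASSUMED_PMD_SIZE - 1) & ~(ASSUMED_PMD_SIZE - 1);
	uintptr_t hi = end & ~(ASSUMED_PMD_SIZE - 1);

	assert(out);
	*out = lo;
	return hi > lo ? (size_t)(hi - lo) : 0;
}

/*
 * Hypothetical helper: ask the kernel to read back every whole-PMD
 * chunk of the buffer. Returns madvise()'s result, or 0 if the buffer
 * holds no whole chunk.
 */
int willneed_whole_pmds(void *buf, size_t len)
{
	uintptr_t lo;
	size_t n = pmd_aligned_subrange((uintptr_t)buf, len, &lo);

	return n ? madvise((void *)lo, n, MADV_WILLNEED) : 0;
}
```

A caller would pass the start and length of an anonymous mapping that may have been swapped out as a THP; whether the read-ahead actually happens at PMD order depends on the kernel carrying this series.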
Signed-off-by: Usama Arif --- mm/madvise.c | 40 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+) diff --git a/mm/madvise.c b/mm/madvise.c index 00539022f804..25f40542b951 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -193,6 +193,46 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned = long start, spinlock_t *ptl; unsigned long addr; =20 + ptl =3D pmd_trans_huge_lock(pmd, vma); + if (ptl) { + pmd_t pmdval =3D *pmd; + + if (pmd_is_swap_entry(pmdval)) { + softleaf_t entry =3D softleaf_from_pmd(pmdval); + struct vm_fault vmf =3D { + .vma =3D vma, + .address =3D start, + .real_address =3D start, + .pmd =3D pmd, + }; + struct swap_info_struct *si; + struct folio *folio; + + /* + * Pin the swap device under the PMD lock so the + * lookup is atomic with the PMD-swap-entry + * observation; swapin_sync() requires its caller to + * keep the device valid for the duration of the call. + */ + si =3D get_swap_device(entry); + spin_unlock(ptl); + if (!si) { + cond_resched(); + return 0; + } + + folio =3D swapin_sync(entry, GFP_HIGHUSER_MOVABLE, + BIT(HPAGE_PMD_ORDER), &vmf, + NULL, 0); + if (!IS_ERR_OR_NULL(folio)) + folio_put(folio); + put_swap_device(si); + cond_resched(); + return 0; + } + spin_unlock(ptl); + } + for (addr =3D start; addr < end; addr +=3D PAGE_SIZE) { pte_t pte; softleaf_t entry; --=20 2.52.0 From nobody Mon Jun 8 04:24:22 2026 Received: from out-179.mta1.migadu.com (out-179.mta1.migadu.com [95.215.58.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D0EFC3ECBF1 for ; Tue, 2 Jun 2026 14:27:17 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.179 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410439; cv=none; 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1780410435; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=79Y1dm8QMIeKr6vwxY04mtBDb5Hwz+TNhbxhdTYS+94=; b=eGO8zd31atZZc4clcYt7ciELk8Srv/oRldxylMcDDleaR/QloRPHQvnQ8qgJOnE2EhhN1U bz/u5NaJqXmuvZlglJEMaLL6sMtnL8BuZ2viAF75rxha1a5tFX53W9AW786+yIWtXE1p4k HCKB0+wtgKEjUSJpW9MM0Jn5eQKnzWY= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: ying.huang@linux.alibaba.com, Baoquan He , willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam R. Howlett , ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [v2 13/16] mm: handle PMD swap entries in UFFDIO_MOVE Date: Tue, 2 Jun 2026 07:24:21 -0700 Message-ID: <20260602142537.198755-14-usama.arif@linux.dev> In-Reply-To: <20260602142537.198755-1-usama.arif@linux.dev> References: <20260602142537.198755-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" move_pages_huge_pmd() returned -ENOENT for any non-trans_huge, non-migration PMD, which fails aligned UFFDIO_MOVE on a swapped-out THP -- the PMD swap entry is a perfectly valid mapping that should move whole. 
Splitting via the move_pages_ptes() fallback isn't a substitute either: __split_huge_pmd_locked() splits a PMD swap entry into HPAGE_PMD_NR PTE swap entries pointing at the same swap-cache folio, but move_swap_pte() refuses any swap-cache folio that is still large and returns -EBUSY. Add move_swap_pmd(), modeled on move_swap_pte(), that moves the swap entry whole-PMD and re-anchors the swap-cache folio's anon rmap to the destination VMA. Reject !pmd_swp_exclusive() entries with -EBUSY to preserve UFFDIO_MOVE's single-owner semantics, propagate soft-dirty, and carry the deposited page table across with the entry. The dispatcher in move_pages_huge_pmd() now waits for migration on a PMD migration entry (matching the PTE path) and routes PMD swap entries through move_swap_pmd() after pinning the swap device, fetching and locking any cached folio, and arming an mmu_notifier range so secondary MMUs see the move. If the swap-cache folio was split (e.g. by deferred_split_scan or memory_failure) between swap-out and UFFDIO_MOVE, src_folio is no longer PMD-sized but the PMD swap entry still covers all 512 slots. Moving the entry whole would only re-anchor one folio's anon rmap, leaving the other 511 with a stale anon_vma. Return -EBUSY in this case, matching move_pages_pte()'s rejection of large folios, so the caller falls back to PTE-level moves. Signed-off-by: Usama Arif --- mm/huge_memory.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 112 insertions(+), 1 deletion(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 1d6d3817046d..f1379c8a92e5 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2812,6 +2812,62 @@ int change_huge_pud(struct mmu_gather *tlb, struct v= m_area_struct *vma, #endif =20 #ifdef CONFIG_USERFAULTFD +/* + * Move a PMD-level swap entry from src_pmd to dst_pmd. Both PMD locks are + * acquired here; src_folio (if present) must already be locked. 
The depos= ited + * page table backing the source THP is moved across with the entry. + */ +static int move_swap_pmd(struct mm_struct *mm, struct vm_area_struct *dst_= vma, + unsigned long dst_addr, unsigned long src_addr, + pmd_t *dst_pmd, pmd_t *src_pmd, + pmd_t orig_dst_pmd, pmd_t orig_src_pmd, + spinlock_t *dst_ptl, spinlock_t *src_ptl, + struct folio *src_folio, swp_entry_t entry) +{ + pgtable_t src_pgtable; + pmd_t moved_pmd; + + /* + * The folio may have been freed and reused for a different swap entry + * while it was unlocked. Re-verify the association. + */ + if (src_folio && unlikely(!folio_test_swapcache(src_folio) || + entry.val !=3D src_folio->swap.val)) + return -EAGAIN; + + double_pt_lock(dst_ptl, src_ptl); + + if (!pmd_same(*src_pmd, orig_src_pmd) || + !pmd_same(*dst_pmd, orig_dst_pmd)) { + double_pt_unlock(dst_ptl, src_ptl); + return -EAGAIN; + } + + /* + * If the folio is in the swap cache, re-anchor its anon rmap to the + * destination VMA so a future swap-in fault at dst_addr finds it. + * Otherwise, re-check that no folio was newly inserted under us. + */ + if (src_folio) { + folio_move_anon_rmap(src_folio, dst_vma); + src_folio->index =3D linear_page_index(dst_vma, dst_addr); + } else if (swap_cache_has_folio(entry)) { + double_pt_unlock(dst_ptl, src_ptl); + return -EAGAIN; + } + + moved_pmd =3D pmdp_huge_get_and_clear(mm, src_addr, src_pmd); + if (pgtable_supports_soft_dirty()) + moved_pmd =3D pmd_swp_mksoft_dirty(moved_pmd); + set_pmd_at(mm, dst_addr, dst_pmd, moved_pmd); + + src_pgtable =3D pgtable_trans_huge_withdraw(mm, src_pmd); + pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable); + + double_pt_unlock(dst_ptl, src_ptl); + return 0; +} + /* * The PT lock for src_pmd and dst_vma/src_vma (for reading) are locked by * the caller, but it must return after releasing the page_table_lock. 
@@ -2846,11 +2902,66 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t= *dst_pmd, pmd_t *src_pmd, pm } =20 if (!pmd_trans_huge(src_pmdval)) { - spin_unlock(src_ptl); if (pmd_is_migration_entry(src_pmdval)) { + spin_unlock(src_ptl); pmd_migration_entry_wait(mm, &src_pmdval); return -EAGAIN; } + if (pmd_is_swap_entry(src_pmdval)) { + swp_entry_t entry; + struct swap_info_struct *si; + + /* + * UFFDIO_MOVE on anon mappings requires single-owner + * semantics; refuse to move a shared swap entry. + */ + if (!pmd_swp_exclusive(src_pmdval)) { + spin_unlock(src_ptl); + return -EBUSY; + } + + entry =3D softleaf_from_pmd(src_pmdval); + spin_unlock(src_ptl); + + /* Pin the swap device against a racing swapoff. */ + si =3D get_swap_device(entry); + if (unlikely(!si)) + return -EAGAIN; + + src_folio =3D swap_cache_get_folio(entry); + + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, + mm, src_addr, + src_addr + HPAGE_PMD_SIZE); + mmu_notifier_invalidate_range_start(&range); + + if (src_folio) { + folio_lock(src_folio); + if (folio_nr_pages(src_folio) !=3D HPAGE_PMD_NR) { + err =3D -EBUSY; + folio_unlock(src_folio); + folio_put(src_folio); + mmu_notifier_invalidate_range_end(&range); + put_swap_device(si); + return err; + } + } + + dst_ptl =3D pmd_lockptr(mm, dst_pmd); + err =3D move_swap_pmd(mm, dst_vma, dst_addr, src_addr, + dst_pmd, src_pmd, dst_pmdval, + src_pmdval, dst_ptl, src_ptl, + src_folio, entry); + + mmu_notifier_invalidate_range_end(&range); + if (src_folio) { + folio_unlock(src_folio); + folio_put(src_folio); + } + put_swap_device(si); + return err; + } + spin_unlock(src_ptl); return -ENOENT; } =20 --=20 2.52.0 From nobody Mon Jun 8 04:24:22 2026 Received: from out-170.mta1.migadu.com (out-170.mta1.migadu.com [95.215.58.170]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1386A3ED12B for ; Tue, 2 Jun 2026 14:27:24 +0000 (UTC) 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1780410442; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=0DNJ948G4yQK3LLY5Be7Hdmo3DK8ncCXtQ0JZVhnZpU=; b=PSlfhHBPW2w4zsJQcjuvwYmwpSwRgv7zk6hwlcrmLrzpT4v+s4L5bI7h0QQ7uT4M1aVWFf 2h5AU9xdbtccPNX3Bssx/gtXMqk3Ml7MtgrypLb0RCHfx25GiFRo0ny95LU63H7e2E8wsn vnG3XL5iIP4pH0mOGGANXDfReWGG9Qo= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: ying.huang@linux.alibaba.com, Baoquan He , willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam R. Howlett , ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [v2 14/16] mm: handle PMD swap entry faults on swap-in Date: Tue, 2 Jun 2026 07:24:22 -0700 Message-ID: <20260602142537.198755-15-usama.arif@linux.dev> In-Reply-To: <20260602142537.198755-1-usama.arif@linux.dev> References: <20260602142537.198755-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Add do_huge_pmd_swap_page() and dispatch to it from __handle_mm_fault() when vmf->orig_pmd encodes a swap entry. 
The handler resolves the entire 2 MB mapping in one shot, mirroring do_swap_page() (PTE path) at PMD granularity: - Look up the folio in the swap cache; on a miss, allocate a PMD-order folio via swap_cache_alloc_folio() and read from swap. - After locking, re-validate that the folio still corresponds to our entry and is still PMD-sized. Between the unlocked cache lookup and the lock, a racing swap-in on the same entry may have removed it from the cache via folio_free_swap(), or reclaim / memory_failure / deferred-split may have split the folio into smaller folios. - Restore soft_dirty and uffd_wp from the swap PMD. Map writable only when the entry was exclusive, the VMA permits writes, and uffd-wp is not armed. Drop the exclusive marker when the cached folio is under writeback to an SWP_STABLE_WRITES backend (zram, encrypted) so the PMD is mapped read-only; a later write COWs into a fresh folio rather than corrupting the in-flight writeback. Mirrors do_swap_page(). - When the resulting PMD is read-only but the fault was a write, update vmf->orig_pmd and call wp_huge_pmd() in the same handler to COW immediately rather than forcing a second fault. Mask VM_FAULT_FALLBACK from its return: a PMD-COW that splits to PTE-level is normal, but the bit is part of VM_FAULT_ERROR and arch fault handlers BUG() on it without SIGBUS/HWPOISON/SIGSEGV. Requires exposing wp_huge_pmd() via mm/internal.h. - Free the swap slot via should_try_to_free_swap() (hoisted from mm/memory.c into mm/internal.h so PTE- and PMD-level swap-in share the heuristic). When PMD-order resources are unavailable (folio allocation fails, the cached folio was split, memcg charge fails, or swapin_folio() races) split the PMD swap entry into 512 PTE swap entries via __split_huge_pmd() and return 0. The fault retries and do_swap_page() takes over per-PTE. This avoids returning VM_FAULT_OOM for transient PMD-order allocation failures. 
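As a rough illustration of the mapping decision described in the bullets above, here is a small userspace C model. It is not kernel code and every name in it is hypothetical. It is also simplified: the handler in the diff additionally treats a folio with a reference count of 1 as exclusive and checks soft-dirty write-protection, both of which are omitted here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the inputs the handler looks at. */
struct pmd_swapin_state {
	bool entry_exclusive;	/* swap PMD carried the exclusive marker */
	bool folio_writeback;	/* cached folio is still under writeback */
	bool stable_writes;	/* backend has SWP_STABLE_WRITES set */
	bool vma_writable;	/* VMA has VM_WRITE */
	bool uffd_wp_armed;	/* uffd-wp applies to the restored PMD */
	bool write_fault;	/* fault carried FAULT_FLAG_WRITE */
};

enum pmd_swapin_result {
	PMD_MAP_WRITABLE,		/* map the PMD writable */
	PMD_MAP_READONLY,		/* map read-only, fault is done */
	PMD_MAP_READONLY_THEN_COW,	/* map read-only, then COW at once */
};

enum pmd_swapin_result pmd_swapin_decide(const struct pmd_swapin_state *s)
{
	bool exclusive;

	assert(s);
	exclusive = s->entry_exclusive;

	/* Writeback to a stable-writes backend must not see modifications. */
	if (exclusive && s->folio_writeback && s->stable_writes)
		exclusive = false;

	if (exclusive && s->vma_writable && !s->uffd_wp_armed)
		return PMD_MAP_WRITABLE;

	/* A write fault on a read-only PMD is resolved in the same handler. */
	return s->write_fault ? PMD_MAP_READONLY_THEN_COW : PMD_MAP_READONLY;
}
```

The third outcome corresponds to the follow-up wp_huge_pmd() call; masking VM_FAULT_FALLBACK from its return value is not modeled.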
Signed-off-by: Usama Arif --- include/linux/huge_mm.h | 9 ++ mm/huge_memory.c | 198 ++++++++++++++++++++++++++++++++++++++++ mm/internal.h | 36 ++++++++ mm/memory.c | 40 +------- 4 files changed, 247 insertions(+), 36 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 1487bf4af1a7..9ec475ccfc91 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -531,6 +531,15 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf); =20 vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf); =20 +#ifdef CONFIG_THP_SWAP +vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf); +#else +static inline vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf) +{ + return 0; +} +#endif + extern struct folio *huge_zero_folio; extern unsigned long huge_zero_pfn; =20 diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f1379c8a92e5..3fc2f6e5eafa 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2314,6 +2314,204 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *v= mf) return 0; } =20 +#ifdef CONFIG_THP_SWAP +/** + * do_huge_pmd_swap_page() - Handle a fault on a PMD-level swap entry. + * @vmf: Fault context. vmf->orig_pmd contains the swap PMD. + * + * Looks up the folio in the swap cache, and if it is a PMD-sized folio, + * maps it directly at the PMD level. If the folio is not in the swap + * cache, allocates a PMD-sized folio and reads from swap. On allocation + * failure, splits the PMD swap entry into PTE-level entries and retries + * at PTE granularity. + * + * Return: VM_FAULT_* flags. 
+ */ +vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf) +{ + struct vm_area_struct *vma =3D vmf->vma; + struct mm_struct *mm =3D vma->vm_mm; + struct folio *folio; + struct page *page; + struct swap_info_struct *si; + unsigned long haddr =3D vmf->address & HPAGE_PMD_MASK; + softleaf_t entry; + swp_entry_t swp_entry; + pmd_t pmd; + vm_fault_t ret =3D 0; + bool exclusive; + rmap_t rmap_flags =3D RMAP_NONE; + + entry =3D softleaf_from_pmd(vmf->orig_pmd); + if (unlikely(!softleaf_is_swap(entry))) + return 0; + + swp_entry =3D entry; + + /* Prevent swapoff from happening to us. */ + si =3D get_swap_device(swp_entry); + if (unlikely(!si)) + return 0; + + folio =3D swap_cache_get_folio(swp_entry); + if (!folio) { + folio =3D swapin_sync(swp_entry, GFP_HIGHUSER_MOVABLE, + BIT(HPAGE_PMD_ORDER), vmf, NULL, 0); + if (IS_ERR_OR_NULL(folio)) + goto split_fallback; + + /* Had to read from swap area: Major fault */ + ret =3D VM_FAULT_MAJOR; + count_vm_event(PGMAJFAULT); + count_memcg_event_mm(mm, PGMAJFAULT); + } + + ret |=3D folio_lock_or_retry(folio, vmf); + if (ret & VM_FAULT_RETRY) + goto out_release; + + /* Verify the folio is still in swap cache and matches our entry */ + if (unlikely(!folio_matches_swap_entry(folio, swp_entry))) + goto out_page; + + /* + * Folio should be PMD-sized; if not (e.g. split in swap cache), + * split the PMD swap entry and retry at PTE level. 
+ */ + if (folio_nr_pages(folio) !=3D HPAGE_PMD_NR) { + folio_unlock(folio); + folio_put(folio); + goto split_fallback; + } + + if (unlikely(!folio_test_uptodate(folio))) { + ret =3D VM_FAULT_SIGBUS; + goto out_page; + } + + page =3D folio_page(folio, 0); + arch_swap_restore(folio_swap(swp_entry, folio), folio); + + if ((vmf->flags & FAULT_FLAG_WRITE) && !folio_test_lru(folio)) + lru_add_drain(); + + folio_throttle_swaprate(folio, GFP_KERNEL); + + /* Lock the PMD and verify it hasn't changed */ + vmf->ptl =3D pmd_lock(mm, vmf->pmd); + if (unlikely(!pmd_same(vmf->orig_pmd, pmdp_get(vmf->pmd)))) { + spin_unlock(vmf->ptl); + goto out_page; + } + + exclusive =3D pmd_swp_exclusive(vmf->orig_pmd); + + /* + * Some swap backends (e.g. zram) don't support concurrent page + * modifications while under writeback. If we map exclusive on such + * a backend while the folio is still under writeback, the writeback + * may see partial modifications and corrupt the swap slot. Drop the + * exclusive marker and only map R/O for that case; further GUP + * references can't appear once the page is fully unmapped, so this + * is safe. + */ + if (exclusive && folio_test_writeback(folio) && + data_race(si->flags & SWP_STABLE_WRITES)) + exclusive =3D false; + + /* + * Set up the PMD mapping. Similar to do_swap_page() but at PMD level. + */ + add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR); + add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR); + + pmd =3D folio_mk_pmd(folio, vma->vm_page_prot); + pmd =3D pmd_mkyoung(pmd); + + if (pmd_swp_soft_dirty(vmf->orig_pmd)) + pmd =3D pmd_mksoft_dirty(pmd); + if (pmd_swp_uffd_wp(vmf->orig_pmd)) + pmd =3D pmd_mkuffd_wp(pmd); + + /* + * Check exclusivity to determine if we can map writable. 
+ */ + if (exclusive || folio_ref_count(folio) =3D=3D 1) { + if ((vma->vm_flags & VM_WRITE) && + !userfaultfd_huge_pmd_wp(vma, pmd) && + !pmd_needs_soft_dirty_wp(vma, pmd)) { + pmd =3D pmd_mkwrite(pmd, vma); + if (vmf->flags & FAULT_FLAG_WRITE) { + pmd =3D pmd_mkdirty(pmd); + vmf->flags &=3D ~FAULT_FLAG_WRITE; + } + } + rmap_flags |=3D RMAP_EXCLUSIVE; + } + + flush_icache_pages(vma, page, HPAGE_PMD_NR); + + if (!folio_test_anon(folio)) + folio_add_new_anon_rmap(folio, vma, haddr, rmap_flags); + else + folio_add_anon_rmap_pmd(folio, page, vma, haddr, rmap_flags); + + folio_put_swap(folio, NULL); + + set_pmd_at(mm, haddr, vmf->pmd, pmd); + update_mmu_cache_pmd(vma, haddr, vmf->pmd); + + /* Update orig_pmd for any follow-up wp_huge_pmd() below. */ + vmf->orig_pmd =3D pmd; + + /* + * Conditionally try to free up the swap cache. Do it after mapping, + * so raced page faults will likely see the folio in swap cache and + * wait on the folio lock. + */ + if (should_try_to_free_swap(si, folio, vma, 1, vmf->flags)) + folio_free_swap(folio); + + spin_unlock(vmf->ptl); + + folio_unlock(folio); + put_swap_device(si); + + /* + * If the write fault wasn't satisfied above (folio is shared without + * exclusivity), fall through to wp_huge_pmd to handle COW or + * userfaultfd-wp without forcing a second fault. + * + * wp_huge_pmd() may return VM_FAULT_FALLBACK if it had to split the + * PMD; that's a normal outcome =E2=80=94 the natural PTE-level refault w= ill + * complete the COW. Mask it so callers (and the arch fault handler) + * don't see VM_FAULT_FALLBACK as a fatal VM_FAULT_ERROR. 
+ */ + if (vmf->flags & FAULT_FLAG_WRITE) { + vm_fault_t wp_ret =3D wp_huge_pmd(vmf); + + wp_ret &=3D ~VM_FAULT_FALLBACK; + ret |=3D wp_ret; + if (ret & VM_FAULT_ERROR) + ret &=3D VM_FAULT_ERROR; + } + + return ret; + +out_page: + folio_unlock(folio); +out_release: + folio_put(folio); + put_swap_device(si); + return ret; + +split_fallback: + __split_huge_pmd(vma, vmf->pmd, haddr, false); + put_swap_device(si); + return 0; +} +#endif /* CONFIG_THP_SWAP */ + static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd) { pgtable_t pgtable; diff --git a/mm/internal.h b/mm/internal.h index ace2f8ef1d35..574dafd18709 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -499,6 +499,42 @@ static inline vm_fault_t vmf_anon_prepare(struct vm_fa= ult *vmf) } =20 vm_fault_t do_swap_page(struct vm_fault *vmf); +vm_fault_t wp_huge_pmd(struct vm_fault *vmf); + +/* + * Check if we should call folio_free_swap to free the swap cache. + * folio_free_swap only frees the swap cache to release the slot if swap + * count is zero, so we don't need to check the swap count here. + */ +static inline bool should_try_to_free_swap(struct swap_info_struct *si, + struct folio *folio, + struct vm_area_struct *vma, + unsigned int extra_refs, + unsigned int fault_flags) +{ + if (!folio_test_swapcache(folio)) + return false; + /* + * Always try to free swap cache for SWP_SYNCHRONOUS_IO devices. Swap + * cache can help save some IO or memory overhead, but these devices + * are fast, and meanwhile, swap cache pinning the slot deferring the + * release of metadata or fragmentation is a more critical issue. + */ + if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) + return true; + if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) || + folio_test_mlocked(folio)) + return true; + /* + * If we want to map a page that's in the swapcache writable, we + * have to detect via the refcount if we're really the exclusive + * user. 
Try freeing the swapcache to get rid of the swapcache + * reference only in case it's likely that we'll be the exclusive user. + */ + return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) && + folio_ref_count(folio) =3D=3D (extra_refs + folio_nr_pages(folio)); +} + void folio_rotate_reclaimable(struct folio *folio); bool __folio_end_writeback(struct folio *folio); void deactivate_file_folio(struct folio *folio); diff --git a/mm/memory.c b/mm/memory.c index 5cf02e394c92..7272a10a0fe0 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4497,40 +4497,6 @@ static vm_fault_t remove_device_exclusive_entry(stru= ct vm_fault *vmf) return 0; } =20 -/* - * Check if we should call folio_free_swap to free the swap cache. - * folio_free_swap only frees the swap cache to release the slot if swap - * count is zero, so we don't need to check the swap count here. - */ -static inline bool should_try_to_free_swap(struct swap_info_struct *si, - struct folio *folio, - struct vm_area_struct *vma, - unsigned int extra_refs, - unsigned int fault_flags) -{ - if (!folio_test_swapcache(folio)) - return false; - /* - * Always try to free swap cache for SWP_SYNCHRONOUS_IO devices. Swap - * cache can help save some IO or memory overhead, but these devices - * are fast, and meanwhile, swap cache pinning the slot deferring the - * release of metadata or fragmentation is a more critical issue. - */ - if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) - return true; - if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) || - folio_test_mlocked(folio)) - return true; - /* - * If we want to map a page that's in the swapcache writable, we - * have to detect via the refcount if we're really the exclusive - * user. Try freeing the swapcache to get rid of the swapcache - * reference only in case it's likely that we'll be the exclusive user. 
- */ - return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) && - folio_ref_count(folio) =3D=3D (extra_refs + folio_nr_pages(folio)); -} - static vm_fault_t pte_marker_clear(struct vm_fault *vmf) { vmf->pte =3D pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, @@ -6200,8 +6166,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fa= ult *vmf) return VM_FAULT_FALLBACK; } =20 -/* `inline' is required to avoid gcc 4.1.2 build error */ -static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) +vm_fault_t wp_huge_pmd(struct vm_fault *vmf) { struct vm_area_struct *vma =3D vmf->vma; const bool unshare =3D vmf->flags & FAULT_FLAG_UNSHARE; @@ -6486,6 +6451,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_st= ruct *vma, =20 if (pmd_is_migration_entry(vmf.orig_pmd)) pmd_migration_entry_wait(mm, vmf.pmd); + else if (IS_ENABLED(CONFIG_THP_SWAP) && + pmd_is_swap_entry(vmf.orig_pmd)) + return do_huge_pmd_swap_page(&vmf); return 0; } if (pmd_trans_huge(vmf.orig_pmd)) { --=20 2.52.0 From nobody Mon Jun 8 04:24:22 2026 Received: from out-170.mta1.migadu.com (out-170.mta1.migadu.com [95.215.58.170]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6D3C1406291 for ; Tue, 2 Jun 2026 14:27:30 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.170 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410452; cv=none; b=CGipWMAditgI0e35TVhxEHmbNs9O8e+1hQN1fivGIKzMiAWvYN2UDpSoKIURz7G1pPq7iKQ8msfFU5rllwSXQkx/tuJfcoj0yEy6HDTmUOFgB8BTe/7PrvNohwWd0fPvFAA29OpANubHiqGAtQCdPyhn8wY3gxV/Z+TztnZXifQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410452; c=relaxed/simple; bh=SWpkr6utVeEF1wWmzjFpWa7Sxr0a1BmKerweln1Bsk0=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; 
b=FD/ZWrHMxVBLusQubJTvLHm4hjYoL9AJkbbObA8nYaxyRR+ef/J/FOroYnrFZu/3YqIWG3v+r3r7hP+diWxyNnLa2kr/rzlfnj18lqZFKKo/EBElt4awlj1MFJ91p/aLKUSLeWsVLbty9q5qXj5OlN8U/+PnvsOXe6PvlZ+2B0A= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=KbUaIlqD; arc=none smtp.client-ip=95.215.58.170 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="KbUaIlqD" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1780410448; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=OTIAIEmEt5WXIs8yFowJcZexewO31Bt2GOy4KhyGYJI=; b=KbUaIlqD4WRVVDQw/JasvqfuUPoKXcJBVoFdGiojGTLmPA+K2u8pQye/B7euenH2yL0BG0 U/9eYS3B8XaOMLo6yqnuSaAFVttHTJKZQaVeXvn0exAjLHjRk0UVTkDI47ZU1W3+8it3rf nB2NGwnKQBqkT2xApF8oA3JjE3xeiys= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: ying.huang@linux.alibaba.com, Baoquan He , willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam R. 
Howlett , ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [v2 15/16] mm: install PMD swap entries on swap-out Date: Tue, 2 Jun 2026 07:24:23 -0700 Message-ID: <20260602142537.198755-16-usama.arif@linux.dev> In-Reply-To: <20260602142537.198755-1-usama.arif@linux.dev> References: <20260602142537.198755-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" Reclaim today splits a PMD-mapped anonymous THP into 512 PTE swap entries before unmap, losing the huge mapping across the swap round-trip and forcing khugepaged to rebuild it later. The contiguous swap range was already secured when the folio was added to the swap cache (a non-contiguous allocation would have split the folio earlier), so the PMD can be replaced by a single PMD-level swap entry instead. This patch mirrors the existing PTE swap-out path at PMD granularity: - shrink_folio_list() drops TTU_SPLIT_HUGE_PMD for PMD-mappable swapcache folios, gated on zswap_never_enabled() since zswap cannot reconstruct a 2 MB folio from per-page blobs (the zswap case is best handled separately). - try_to_unmap_one() now has a PMD branch that calls set_pmd_swap_entry() and adjusts MM_ANONPAGES / MM_SWAPENTS by HPAGE_PMD_NR before jumping to walk_done. TTU_SPLIT_HUGE_PMD remains the fallback. - set_pmd_swap_entry() is the installer. 
Mirroring the PTE swap-out sequence at PMD granularity, it clears the present mapping (keeping the original for rollback), bumps the swap_map refcount for the folio's 512 slots, drops the exclusive mark if the page was anon-exclusive, propagates the dirty bit to the folio so writeback is not lost, and installs a swap PMD that preserves the original soft-dirty / uffd-wp / exclusive bits. Any failing step rolls back the present mapping. The swap entry value matches what 512 PTE swap entries would encode, so swap_map refcounting is unchanged: each of the 512 slots carries a count of 1, released individually on later split or together on swap-in. Signed-off-by: Usama Arif --- include/linux/huge_mm.h | 2 + include/linux/vm_event_item.h | 1 + mm/huge_memory.c | 78 +++++++++++++++++++++++++++++++++++ mm/rmap.c | 20 +++++++++ mm/vmscan.c | 14 ++++++- mm/vmstat.c | 1 + 6 files changed, 115 insertions(+), 1 deletion(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 9ec475ccfc91..b746f8c8db69 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -533,6 +533,8 @@ vm_fault_t do_huge_pmd_device_private(struct vm_fault *= vmf); =20 #ifdef CONFIG_THP_SWAP vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf); +int set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw, + struct folio *folio); #else static inline vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf) { diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 03fe95f5a020..7267c06674c0 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -108,6 +108,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, THP_ZERO_PAGE_ALLOC_FAILED, THP_SWPOUT, THP_SWPOUT_FALLBACK, + THP_SWPOUT_PMD, #endif #ifdef CONFIG_BALLOON BALLOON_INFLATE, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 3fc2f6e5eafa..1fed86065fd9 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -5385,3 +5385,81 @@ void remove_migration_pmd(struct 
page_vma_mapped_wal= k *pvmw, struct page *new) trace_remove_migration_pmd(address, pmd_val(pmde)); } #endif + +#ifdef CONFIG_THP_SWAP +/** + * set_pmd_swap_entry() - Replace a PMD mapping with a PMD-level swap entr= y. + * @pvmw: Page vma mapped walk context, must have pvmw->pmd set and + * pvmw->pte NULL (i.e. PMD-mapped). + * @folio: The folio being swapped out. Must be in the swap cache. + * + * This installs a PMD-level swap entry in place of a present PMD mapping, + * avoiding the need to split the PMD into PTE-level swap entries. + * + * Return: 0 on success, negative error code on failure. + */ +int set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw, + struct folio *folio) +{ + struct vm_area_struct *vma =3D pvmw->vma; + struct mm_struct *mm =3D vma->vm_mm; + unsigned long address =3D pvmw->address; + unsigned long haddr =3D address & HPAGE_PMD_MASK; + struct page *page =3D folio_page(folio, 0); + bool anon_exclusive; + pmd_t pmdval; + swp_entry_t entry; + pmd_t pmdswp; + + if (!(pvmw->pmd && !pvmw->pte)) + return 0; + + VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio); + VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio); + + if (unlikely(folio_test_swapbacked(folio) !=3D + folio_test_swapcache(folio))) { + WARN_ON_ONCE(1); + return -EBUSY; + } + + flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE); + + pmdval =3D pmdp_invalidate(vma, haddr, pvmw->pmd); + + /* Update high watermark before we lower rss */ + update_hiwater_rss(mm); + + if (folio_dup_swap(folio, NULL) < 0) { + set_pmd_at(mm, haddr, pvmw->pmd, pmdval); + return -ENOMEM; + } + + /* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. 
*/ + anon_exclusive =3D PageAnonExclusive(page); + if (anon_exclusive && folio_try_share_anon_rmap_pmd(folio, page)) { + folio_put_swap(folio, NULL); + set_pmd_at(mm, haddr, pvmw->pmd, pmdval); + return -EBUSY; + } + + if (pmd_dirty(pmdval)) + folio_mark_dirty(folio); + + entry =3D folio->swap; + pmdswp =3D softleaf_to_pmd(entry); + if (pmd_soft_dirty(pmdval)) + pmdswp =3D pmd_swp_mksoft_dirty(pmdswp); + if (pmd_uffd_wp(pmdval)) + pmdswp =3D pmd_swp_mkuffd_wp(pmdswp); + if (anon_exclusive) + pmdswp =3D pmd_swp_mkexclusive(pmdswp); + set_pmd_at(mm, haddr, pvmw->pmd, pmdswp); + + folio_remove_rmap_pmd(folio, page, vma); + folio_put(folio); + + count_vm_event(THP_SWPOUT_PMD); + return 0; +} +#endif /* CONFIG_THP_SWAP */ diff --git a/mm/rmap.c b/mm/rmap.c index 0fb7a1b82cf3..ffc7aa62a29e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -2079,6 +2079,26 @@ static bool try_to_unmap_one(struct folio *folio, st= ruct vm_area_struct *vma, goto walk_abort; } =20 +#ifdef CONFIG_THP_SWAP + /* + * If the folio is in the swap cache and we're not + * asked to split, install a PMD-level swap entry. 
+ */ + if (!(flags & TTU_SPLIT_HUGE_PMD) && + folio_test_anon(folio) && + folio_test_swapcache(folio)) { + if (set_pmd_swap_entry(&pvmw, folio)) + goto walk_abort; + + mm_prepare_for_swap_entries(mm); + add_mm_counter(mm, MM_ANONPAGES, + -HPAGE_PMD_NR); + add_mm_counter(mm, MM_SWAPENTS, + HPAGE_PMD_NR); + goto walk_done; + } +#endif + if (flags & TTU_SPLIT_HUGE_PMD) { /* * We temporarily have to drop the PTL and diff --git a/mm/vmscan.c b/mm/vmscan.c index e8a90911bf88..0f376fbf9bb3 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -64,6 +64,7 @@ =20 #include #include +#include =20 #include "internal.h" #include "swap.h" @@ -1332,7 +1333,18 @@ static unsigned int shrink_folio_list(struct list_he= ad *folio_list, enum ttu_flags flags =3D TTU_BATCH_FLUSH; bool was_swapbacked =3D folio_test_swapbacked(folio); =20 - if (folio_test_pmd_mappable(folio)) + /* + * With THP_SWAP, PMD-mappable folios already in the + * swap cache can be unmapped with a PMD-level swap + * entry, avoiding the cost of splitting the PMD. + * Skip this when zswap has been enabled because + * zswap stores pages individually and cannot + * reconstruct a large folio on swap-in. 
+ */ + if (folio_test_pmd_mappable(folio) && + !(IS_ENABLED(CONFIG_THP_SWAP) && + folio_test_swapcache(folio) && + zswap_never_enabled())) flags |=3D TTU_SPLIT_HUGE_PMD; /* * Without TTU_SYNC, try_to_unmap will only begin to diff --git a/mm/vmstat.c b/mm/vmstat.c index f534972f517d..9b4963a7eb04 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1421,6 +1421,7 @@ const char * const vmstat_text[] =3D { [I(THP_ZERO_PAGE_ALLOC_FAILED)] =3D "thp_zero_page_alloc_failed", [I(THP_SWPOUT)] =3D "thp_swpout", [I(THP_SWPOUT_FALLBACK)] =3D "thp_swpout_fallback", + [I(THP_SWPOUT_PMD)] =3D "thp_swpout_pmd", #endif #ifdef CONFIG_BALLOON [I(BALLOON_INFLATE)] =3D "balloon_inflate", --=20 2.52.0 From nobody Mon Jun 8 04:24:22 2026 Received: from out-182.mta1.migadu.com (out-182.mta1.migadu.com [95.215.58.182]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 049EA423A70 for ; Tue, 2 Jun 2026 14:27:37 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.182 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410461; cv=none; b=HhPP0YaX3C3uzOO5tVVnj1Vg6rW9Sq3Izz/nbms6HUkAZ7wjwqF0LIBJN+pu7qynTGcljgeCL/U15NMv0xqpVN/bZZo8t5t+HUD48qhtXQBCU4SKhoe2iUxI1d4VWLfV3/JiVz+QfmxMWL2LEutOAb+69CpvkLOCKwlK3jAnD+8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780410461; c=relaxed/simple; bh=ljlR2XbucwKqlj//HgNcdjuzYPZauDdAAkvOh50xJKs=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=kF0gtHGYhrg9NQExLRsgU02fGI0YyU+d65v2lpIlvCdwxtq89jqB1L/yhfq57AOND83JedjeLj0T7z4da1koOrLGFxewDJMN4YrW09w7KaT+ykhQ1sIhZ4gsiofIKZHYBtVXI7e7rNrNOx/g0QVSzt1cNYv/4HIv3a+3NMQ7QgE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev 
header.i=@linux.dev header.b=BVRVkvO1; arc=none smtp.client-ip=95.215.58.182 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="BVRVkvO1" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1780410456; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Lm8cM4gU0O0d+34abwdIkA8S4Xs8Gv6LbedA+Vjmex8=; b=BVRVkvO1JMlbW9GUi8bxP86+M7f9IbjtX/4PLiFLHOl6mUiFqIEDXnNQfCQwEyTcV/Q/Rp 2MAxIqqbD+nbhQCfVE6PxZmm9JZzv5whhmVCljg98P+hhsk9EfldzZZPkkJPnHG+361LxI 9Gz88VqIC/tY88MzL7wJ2C1yP+nJlTA= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: ying.huang@linux.alibaba.com, Baoquan He , willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam R. 
Howlett , ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [v2 16/16] selftests/mm: add PMD swap entry tests Date: Tue, 2 Jun 2026 07:24:24 -0700 Message-ID: <20260602142537.198755-17-usama.arif@linux.dev> In-Reply-To: <20260602142537.198755-1-usama.arif@linux.dev> References: <20260602142537.198755-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Exercise the PMD swap entry paths. The tests allocate a PMD-mapped THP, write a known pattern, swap it out via MADV_PAGEOUT, and then exercise different code paths: - swap-out / swap-in round-trip with data verification - fork with read-only access from both parent and child - fork with writes in both processes to verify COW isolation - repeated swap cycles to try and catch reference counting issues - write fault on a swapped PMD to verify dirty handling - munmap of a swapped PMD (zap_huge_pmd swap slot cleanup) - mprotect on a swapped PMD (change_non_present_huge_pmd) - mremap of a swapped PMD (move_soft_dirty_pmd) - pagemap reading (pagemap_pmd_range_thp softleaf_has_pfn guard) - MADV_FREE on a swapped PMD: verifies swap slots are freed via pagemap and the memory reads back as zero - UFFDIO_MOVE on a swapped PMD (move_pages_huge_pmd swap path); verifies the entry transfers without splitting and that the destination faults back in as a THP - swapoff with active PMD swap entries (unuse_pmd_range split) Signed-off-by: Usama Arif --- tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/pmd_swap.c | 672 ++++++++++++++++++++++++++ 2 files changed, 673 insertions(+) create mode 100644 tools/testing/selftests/mm/pmd_swap.c diff --git a/tools/testing/selftests/mm/Makefile 
b/tools/testing/selftests/= mm/Makefile index e6df968f0971..d442dac8460c 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -105,6 +105,7 @@ TEST_GEN_FILES +=3D guard-regions TEST_GEN_FILES +=3D merge TEST_GEN_FILES +=3D rmap TEST_GEN_FILES +=3D folio_split_race_test +TEST_GEN_FILES +=3D pmd_swap =20 ifneq ($(ARCH),arm64) TEST_GEN_FILES +=3D soft-dirty diff --git a/tools/testing/selftests/mm/pmd_swap.c b/tools/testing/selftest= s/mm/pmd_swap.c new file mode 100644 index 000000000000..01897bfa17dd --- /dev/null +++ b/tools/testing/selftests/mm/pmd_swap.c @@ -0,0 +1,672 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Test PMD-level swap entries. + * + * Verifies that when a PMD-mapped THP is swapped out the kernel installs + * a single PMD-level swap entry (instead of splitting into 512 PTE-level + * entries), and that operations on the swapped region behave correctly: + * basic - swap out + swap in preserves data + * fork - parent and child both see the data + * fork_cow - COW after fork keeps parent's data isolated + * cycles - repeated swap out/in does not corrupt data + * write - faulting in via a write keeps the rest of the THP + * munmap - munmap on a PMD swap entry frees swap slots cleanly + * mprotect - mprotect on a PMD swap entry preserves data + * mremap - mremap on a PMD swap entry preserves data + * pagemap - pagemap reports the entries as swapped + * madvise_free - MADV_FREE on a PMD swap entry does not crash + * madvise_willneed - MADV_WILLNEED reads the THP in at PMD order + * uffdio_move - UFFDIO_MOVE moves a PMD swap entry whole-PMD + * swapoff - swapoff faults the THP back in (needs PMD_SWAP_DEVIC= E) + */ +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "kselftest_harness.h" +#include "vm_util.h" + +static bool check_swapped(int pagemap_fd, char *addr, unsigned long size) 
+{ + unsigned long off; + + for (off =3D 0; off < size; off +=3D getpagesize()) + if (!pagemap_is_swapped(pagemap_fd, addr + off)) + return false; + return true; +} + +static bool swap_available(int pagemap_fd) +{ + char *p; + bool ret; + + p =3D mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (p =3D=3D MAP_FAILED) + return false; + + memset(p, 0xab, getpagesize()); + madvise(p, getpagesize(), MADV_PAGEOUT); + ret =3D pagemap_is_swapped(pagemap_fd, p); + munmap(p, getpagesize()); + return ret; +} + +static unsigned long read_vm_event(const char *name) +{ + char line[256]; + size_t name_len =3D strlen(name); + unsigned long val =3D 0; + FILE *f; + + f =3D fopen("/proc/vmstat", "r"); + if (!f) + return 0; + while (fgets(line, sizeof(line), f)) { + if (!strncmp(line, name, name_len) && line[name_len] =3D=3D ' ') { + val =3D strtoul(line + name_len + 1, NULL, 10); + break; + } + } + fclose(f); + return val; +} + +static bool read_pmd_mthp_stat(unsigned long pmd_size, const char *name, + unsigned long *val) +{ + char path[256]; + FILE *f; + int ret; + + ret =3D snprintf(path, sizeof(path), + "/sys/kernel/mm/transparent_hugepage/hugepages-%lukB/stats/%s", + pmd_size >> 10, name); + if (ret < 0 || ret >=3D sizeof(path)) + return false; + + f =3D fopen(path, "r"); + if (!f) + return false; + + ret =3D fscanf(f, "%lu", val); + fclose(f); + return ret =3D=3D 1; +} + +static unsigned int random_seed(void) +{ + unsigned int seed; + + if (getrandom(&seed, sizeof(seed), 0) !=3D sizeof(seed)) + seed =3D (unsigned int)time(NULL); + return seed; +} + +static unsigned char pattern_byte(unsigned int seed, unsigned long off) +{ + return (unsigned char)(seed + off); +} + +static void fill_pattern(char *buf, unsigned long size, unsigned int seed) +{ + unsigned long i; + + for (i =3D 0; i < size; i++) + buf[i] =3D (char)pattern_byte(seed, i); +} + +static bool verify_pattern(char *buf, unsigned long size, unsigned int see= d) +{ + unsigned 
long i; + + for (i =3D 0; i < size; i++) + if ((unsigned char)buf[i] !=3D pattern_byte(seed, i)) + return false; + return true; +} + +/* + * mmap an anonymous PMD-aligned region of pmd_size bytes. Over-allocates + * by one PMD and trims the unaligned head/tail so the returned address is + * PMD-aligned (required for whole-PMD UFFDIO_MOVE). + */ +static char *mmap_pmd_aligned(unsigned long pmd_size) +{ + unsigned long pad =3D pmd_size; + char *raw, *aligned; + + raw =3D mmap(NULL, pmd_size + pad, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (raw =3D=3D MAP_FAILED) + return MAP_FAILED; + + aligned =3D (char *)(((uintptr_t)raw + pmd_size - 1) & ~(pmd_size - 1)); + if (aligned !=3D raw) + munmap(raw, aligned - raw); + if (aligned + pmd_size !=3D raw + pmd_size + pad) + munmap(aligned + pmd_size, + (raw + pmd_size + pad) - (aligned + pmd_size)); + return aligned; +} + +/* + * mmap a PMD-aligned PMD-sized region, request THP, fill with a pattern, + * and swap it out. Verifies via the thp_swpout_pmd vmstat counter that + * the swap-out installed a PMD swap entry rather than splitting to PTEs. 
+ */ +static char *alloc_fill_swap_thp(unsigned long pmd_size, int pagemap_fd, + unsigned int seed) +{ + unsigned long pmd_before, pmd_after; + char *mem; + + mem =3D mmap_pmd_aligned(pmd_size); + if (mem =3D=3D MAP_FAILED) + return MAP_FAILED; + + madvise(mem, pmd_size, MADV_HUGEPAGE); + fill_pattern(mem, pmd_size, seed); + + pmd_before =3D read_vm_event("thp_swpout_pmd"); + + if (madvise(mem, pmd_size, MADV_PAGEOUT) || + !check_swapped(pagemap_fd, mem, pmd_size)) { + munmap(mem, pmd_size); + return MAP_FAILED; + } + + pmd_after =3D read_vm_event("thp_swpout_pmd"); + printf("# thp_swpout_pmd: %lu -> %lu\n", pmd_before, pmd_after); + if (pmd_after - pmd_before < 1) { + munmap(mem, pmd_size); + return MAP_FAILED; + } + return mem; +} + +FIXTURE(pmd_swap) +{ + unsigned long pmd_size; + int pagemap_fd; + unsigned int seed; +}; + +FIXTURE_SETUP(pmd_swap) +{ + self->pagemap_fd =3D -1; + + self->pmd_size =3D read_pmd_pagesize(); + if (!self->pmd_size) + SKIP(return, "Cannot determine PMD size\n"); + + self->pagemap_fd =3D open("/proc/self/pagemap", O_RDONLY); + if (self->pagemap_fd < 0) + SKIP(return, "Cannot open /proc/self/pagemap\n"); + + if (!swap_available(self->pagemap_fd)) + SKIP(return, "Swap not available or not working\n"); + + self->seed =3D random_seed(); +} + +FIXTURE_TEARDOWN(pmd_swap) +{ + if (self->pagemap_fd >=3D 0) + close(self->pagemap_fd); +} + +/* + * Allocate a PMD-sized THP, write a pattern, swap it out, read it back, + * verify the pattern. + */ +TEST_F(pmd_swap, basic) +{ + char *mem; + + mem =3D alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed); + if (mem =3D=3D MAP_FAILED) + SKIP(return, "Could not create swapped THP\n"); + + ASSERT_TRUE(verify_pattern(mem, self->pmd_size, self->seed)); + + munmap(mem, self->pmd_size); +} + +/* + * Allocate a THP, swap it out, fork, verify both parent and child see + * the correct data. 
+ */ +TEST_F(pmd_swap, fork) +{ + char *mem; + pid_t pid; + int status; + + mem =3D alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed); + if (mem =3D=3D MAP_FAILED) + SKIP(return, "Could not create swapped THP\n"); + + pid =3D fork(); + ASSERT_GE(pid, 0); + + if (pid =3D=3D 0) { + _exit(verify_pattern(mem, self->pmd_size, self->seed) ? 0 : 1); + } + + ASSERT_TRUE(verify_pattern(mem, self->pmd_size, self->seed)); + + ASSERT_EQ(waitpid(pid, &status, 0), pid); + ASSERT_TRUE(WIFEXITED(status)); + ASSERT_EQ(WEXITSTATUS(status), 0); + + munmap(mem, self->pmd_size); +} + +/* + * Swap out, fork, then have parent and child write different patterns. + * Exercises COW on shared PMD swap entries: writes after fork must + * trigger copy-on-write so the parent's data stays isolated. + */ +TEST_F(pmd_swap, fork_cow) +{ + unsigned int parent_seed =3D self->seed; + unsigned int child_seed =3D ~self->seed; + char *mem; + pid_t pid; + int status; + + mem =3D alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, parent_seed= ); + if (mem =3D=3D MAP_FAILED) + SKIP(return, "Could not create swapped THP\n"); + + pid =3D fork(); + ASSERT_GE(pid, 0); + + if (pid =3D=3D 0) { + fill_pattern(mem, self->pmd_size, child_seed); + _exit(verify_pattern(mem, self->pmd_size, child_seed) ? 0 : 1); + } + + ASSERT_EQ(waitpid(pid, &status, 0), pid); + + ASSERT_TRUE(verify_pattern(mem, self->pmd_size, parent_seed)); + ASSERT_TRUE(WIFEXITED(status)); + ASSERT_EQ(WEXITSTATUS(status), 0); + + munmap(mem, self->pmd_size); +} + +/* + * Swap a THP out and in repeatedly without data corruption. 
+ */ +TEST_F(pmd_swap, cycles) +{ + const int num_cycles =3D 5; + char *mem; + int cycle; + + for (cycle =3D 0; cycle < num_cycles; cycle++) { + unsigned int seed =3D self->seed + cycle; + + mem =3D alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed); + if (mem =3D=3D MAP_FAILED) + SKIP(return, "Could not create swapped THP at cycle %d\n", + cycle); + + ASSERT_TRUE(verify_pattern(mem, self->pmd_size, seed)); + + munmap(mem, self->pmd_size); + } +} + +/* + * Swap out, fault in via a write to the first page, verify the write + * sticks and the rest of the THP is preserved. + */ +TEST_F(pmd_swap, write) +{ + unsigned int seed =3D self->seed; + char *mem; + unsigned long i; + + mem =3D alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed); + if (mem =3D=3D MAP_FAILED) + SKIP(return, "Could not create swapped THP\n"); + + mem[0] =3D 0xbb; + ASSERT_EQ(mem[0], (char)0xbb); + + for (i =3D 1; i < self->pmd_size; i++) + ASSERT_EQ((unsigned char)mem[i], pattern_byte(seed, i)); + + munmap(mem, self->pmd_size); +} + +/* + * munmap while the folio is swapped out. Exercises zap_huge_pmd() on a + * PMD swap entry =E2=80=94 must free the swap slots without trying to loo= k up + * a folio. + */ +TEST_F(pmd_swap, munmap) +{ + char *mem; + + mem =3D alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed); + if (mem =3D=3D MAP_FAILED) + SKIP(return, "Could not create swapped THP\n"); + + munmap(mem, self->pmd_size); +} + +/* + * Change protection on a swapped PMD entry, then fault back in and + * verify data. Exercises change_non_present_huge_pmd(). 
+ */ +TEST_F(pmd_swap, mprotect) +{ + unsigned int seed =3D self->seed; + char *mem; + + mem =3D alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed); + if (mem =3D=3D MAP_FAILED) + SKIP(return, "Could not create swapped THP\n"); + + ASSERT_EQ(mprotect(mem, self->pmd_size, PROT_READ), 0); + ASSERT_EQ(mprotect(mem, self->pmd_size, PROT_READ | PROT_WRITE), 0); + + ASSERT_TRUE(verify_pattern(mem, self->pmd_size, seed)); + + munmap(mem, self->pmd_size); +} + +/* + * UFFDIO_MOVE a PMD swap entry from src to a registered dst. Exercises + * move_pages_huge_pmd() handling of pmd_is_swap_entry: the whole PMD swap + * entry must move to dst without splitting, and the destination must + * read back the original pattern after a swap-in fault. + */ +TEST_F(pmd_swap, uffdio_move) +{ + unsigned int seed =3D self->seed; + struct uffdio_register reg =3D {}; + struct uffdio_move move =3D {}; + struct uffdio_api api =3D {}; + char *src, *dst; + int uffd; + + dst =3D mmap_pmd_aligned(self->pmd_size); + if (dst =3D=3D MAP_FAILED) + SKIP(return, "Could not mmap aligned dst\n"); + + src =3D alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed); + if (src =3D=3D MAP_FAILED) { + munmap(dst, self->pmd_size); + SKIP(return, "Could not create swapped THP\n"); + } + if ((uintptr_t)src & (self->pmd_size - 1)) { + munmap(src, self->pmd_size); + munmap(dst, self->pmd_size); + SKIP(return, "src not PMD-aligned\n"); + } + + uffd =3D syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); + if (uffd < 0) { + munmap(src, self->pmd_size); + munmap(dst, self->pmd_size); + SKIP(return, "userfaultfd unavailable\n"); + } + + api.api =3D UFFD_API; + api.features =3D UFFD_FEATURE_MOVE; + if (ioctl(uffd, UFFDIO_API, &api) || + !(api.features & UFFD_FEATURE_MOVE)) { + close(uffd); + munmap(src, self->pmd_size); + munmap(dst, self->pmd_size); + SKIP(return, "UFFD_FEATURE_MOVE unsupported\n"); + } + + reg.range.start =3D (unsigned long)dst; + reg.range.len =3D self->pmd_size; + reg.mode =3D 
UFFDIO_REGISTER_MODE_MISSING; + if (ioctl(uffd, UFFDIO_REGISTER, &reg)) { + close(uffd); + munmap(src, self->pmd_size); + munmap(dst, self->pmd_size); + SKIP(return, "UFFDIO_REGISTER failed\n"); + } + + move.dst =3D (unsigned long)dst; + move.src =3D (unsigned long)src; + move.len =3D self->pmd_size; + if (ioctl(uffd, UFFDIO_MOVE, &move)) { + close(uffd); + munmap(src, self->pmd_size); + munmap(dst, self->pmd_size); + ASSERT_EQ(errno, 0); + } + ASSERT_EQ(move.move, self->pmd_size); + + /* + * dst inherits the PMD swap entry; reading it must fault the THP + * back in via do_huge_pmd_swap_page() and yield the original data. + */ + ASSERT_TRUE(check_swapped(self->pagemap_fd, dst, self->pmd_size)); + ASSERT_TRUE(verify_pattern(dst, self->pmd_size, seed)); + /* The whole-PMD path must reinstate a THP, not 512 PTE folios. */ + ASSERT_TRUE(check_huge_anon(dst, 1, self->pmd_size)); + + close(uffd); + munmap(src, self->pmd_size); + munmap(dst, self->pmd_size); +} + +/* + * Move a swapped PMD entry to a new address, fault in, verify data. + * Exercises move_huge_pmd() and move_soft_dirty_pmd(). + */ +TEST_F(pmd_swap, mremap) +{ + unsigned int seed =3D self->seed; + char *mem, *new_mem; + + mem =3D alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed); + if (mem =3D=3D MAP_FAILED) + SKIP(return, "Could not create swapped THP\n"); + + new_mem =3D mremap(mem, self->pmd_size, self->pmd_size, MREMAP_MAYMOVE); + if (new_mem =3D=3D MAP_FAILED) { + munmap(mem, self->pmd_size); + ASSERT_NE(new_mem, MAP_FAILED); + } + + ASSERT_TRUE(verify_pattern(new_mem, self->pmd_size, seed)); + + munmap(new_mem, self->pmd_size); +} + +/* + * Read /proc/self/pagemap on a PMD swap entry. Exercises the pagemap + * PMD walker which must handle PMD swap entries without trying to + * convert them to a page via softleaf_to_page(). 
+ */ +TEST_F(pmd_swap, pagemap) +{ + char *mem; + uint64_t entry; + unsigned long off; + + mem =3D alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed); + if (mem =3D=3D MAP_FAILED) + SKIP(return, "Could not create swapped THP\n"); + + for (off =3D 0; off < self->pmd_size; off +=3D getpagesize()) { + entry =3D pagemap_get_entry(self->pagemap_fd, mem + off); + /* Bit 62 =3D swapped */ + ASSERT_TRUE(entry & (1ULL << 62)); + } + + munmap(mem, self->pmd_size); +} + +/* + * MADV_FREE on a swapped-out PMD must free the swap slots and clear the + * entry. After the call, pagemap must no longer report the pages as + * swapped, and accessing the region must yield zero pages. + */ +TEST_F(pmd_swap, madvise_free) +{ + char *mem; + unsigned long i; + + mem =3D alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed); + if (mem =3D=3D MAP_FAILED) + SKIP(return, "Could not create swapped THP\n"); + + ASSERT_TRUE(check_swapped(self->pagemap_fd, mem, self->pmd_size)); + ASSERT_EQ(madvise(mem, self->pmd_size, MADV_FREE), 0); + ASSERT_FALSE(check_swapped(self->pagemap_fd, mem, self->pmd_size)); + + for (i =3D 0; i < self->pmd_size; i +=3D getpagesize()) + ASSERT_EQ(mem[i], 0); + + munmap(mem, self->pmd_size); +} + +/* + * MADV_WILLNEED on a swapped-out PMD-mapped THP must not split the + * mapping. After WILLNEED + a first-touch fault, the region must come + * back as a single PMD-sized THP with the original data intact. 
+ */ +TEST_F(pmd_swap, madvise_willneed) +{ + unsigned long swpin_before, swpin_after; + volatile char c; + char *mem; + + mem =3D alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed); + if (mem =3D=3D MAP_FAILED) + SKIP(return, "Could not create swapped THP\n"); + + if (!read_pmd_mthp_stat(self->pmd_size, "swpin", &swpin_before)) { + munmap(mem, self->pmd_size); + SKIP(return, "Cannot read PMD-sized THP swpin stat\n"); + } + + ASSERT_EQ(madvise(mem, self->pmd_size, MADV_WILLNEED), 0); + ASSERT_TRUE(read_pmd_mthp_stat(self->pmd_size, "swpin", + &swpin_after)) { + munmap(mem, self->pmd_size); + } + ASSERT_GT(swpin_after, swpin_before) { + munmap(mem, self->pmd_size); + } + + /* First touch faults the THP back in via do_huge_pmd_swap_page(). */ + c =3D mem[0]; + (void)c; + + ASSERT_TRUE(check_huge_anon(mem, 1, self->pmd_size)); + ASSERT_TRUE(verify_pattern(mem, self->pmd_size, self->seed)); + + munmap(mem, self->pmd_size); +} + +/* + * swapoff requires a dedicated swap device path. Use a separate fixture + * that picks the device up from the PMD_SWAP_DEVICE environment variable + * and skips when unset. + */ +FIXTURE(pmd_swap_swapoff) +{ + unsigned long pmd_size; + int pagemap_fd; + const char *swap_dev; + unsigned int seed; +}; + +FIXTURE_SETUP(pmd_swap_swapoff) +{ + self->pagemap_fd =3D -1; + self->swap_dev =3D getenv("PMD_SWAP_DEVICE"); + if (!self->swap_dev) + SKIP(return, "PMD_SWAP_DEVICE env var not set\n"); + + self->pmd_size =3D read_pmd_pagesize(); + if (!self->pmd_size) + SKIP(return, "Cannot determine PMD size\n"); + + self->pagemap_fd =3D open("/proc/self/pagemap", O_RDONLY); + if (self->pagemap_fd < 0) + SKIP(return, "Cannot open /proc/self/pagemap\n"); + + if (!swap_available(self->pagemap_fd)) + SKIP(return, "Swap not available or not working\n"); + + self->seed =3D random_seed(); +} + +FIXTURE_TEARDOWN(pmd_swap_swapoff) +{ + if (self->pagemap_fd >=3D 0) + close(self->pagemap_fd); +} + +/* + * Swap out a THP, then turn off swap. 
The kernel must fault the entire + * THP back in via unuse_pmd(), preserving the huge mapping. Verify data + * is intact and the THP mapping is preserved. + */ +TEST_F(pmd_swap_swapoff, basic) +{ + unsigned int seed =3D self->seed; + char *mem; + int ret, err; + + mem =3D alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed); + if (mem =3D=3D MAP_FAILED) + SKIP(return, "Could not create swapped THP\n"); + + ret =3D swapoff(self->swap_dev); + err =3D errno; + ASSERT_EQ(ret, 0) { + TH_LOG("swapoff(%s) failed: %s", self->swap_dev, strerror(err)); + munmap(mem, self->pmd_size); + } + + ASSERT_TRUE(verify_pattern(mem, self->pmd_size, seed)) { + swapon(self->swap_dev, 0); + munmap(mem, self->pmd_size); + } + + ASSERT_TRUE(check_huge_anon(mem, 1, self->pmd_size)) { + swapon(self->swap_dev, 0); + munmap(mem, self->pmd_size); + } + + ret =3D swapon(self->swap_dev, 0); + err =3D errno; + ASSERT_EQ(ret, 0) { + TH_LOG("swapon(%s) failed: %s", self->swap_dev, strerror(err)); + munmap(mem, self->pmd_size); + } + + munmap(mem, self->pmd_size); +} + +TEST_HARNESS_MAIN --=20 2.52.0