From nobody Wed Jun 17 07:20:38 2026 Received: from out-179.mta0.migadu.com (out-179.mta0.migadu.com [91.218.175.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9EF293B47FC for ; Mon, 27 Apr 2026 10:06:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.179 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777284371; cv=none; b=STrrz6BDms2DASoGmnNuKe+wUkKGcrivm6E3at/zHYS31jYKcz4gC+oVy2hY8OTfeLJFLnf4SiKCiq5erp9yzpVJhygsgNpShSy7Nx8IHFpSiPYL/V0kIhFByUyGjIPGZ/O9K7cXU63+zuVRTzHht0mtMiRAJCx9eUpJzk1soEw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777284371; c=relaxed/simple; bh=WweONkIumLSUYabmPPCg/IvIGtOaTuFA+E9wualiiA8=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=GPZhPr/MxmjXShN2Ck4YiU08J2HyG2qlj9SS3p+cl1thWIquTXrt/zMBt8yUBREYxpYJtyULRdvtISBcsSJTHMGqq4Lsl01bv3ATU8PJEdjZtQDDQfHnwu/+S7RobIrsIKtWl5UghoLx+R4gpqbgpq8qIwLM32quzTU7ZR6O1Lg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=cS/0OFS5; arc=none smtp.client-ip=91.218.175.179 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="cS/0OFS5" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1;
	t=1777284366;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=sAZdcYZD7LnVvePw8lU3o9errwwTUHtluzPg5XjFce8=;
	b=cS/0OFS5DqTZHjFAoNbUaX83gi0D7pWus6CixP8hQPp+v3BpD0fKJjJTPHMmBFfwmKG/CR
	 HG0cIiX7ZeYcfeMVErEi1jWiHlL3QPhh50WzPdPlNaGQpFUMJsT6hfhswTKAXTg7p2FVXw
	 9qxynUFcQ4TDL4cs/8+C7aUkNvz95zs=
From: Usama Arif
To: Andrew Morton, david@kernel.org, chrisl@kernel.org, kasong@tencent.com,
	ljs@kernel.org, ziy@nvidia.com
Cc: bhe@redhat.com, willy@infradead.org, youngjun.park@lge.com,
	hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
	alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com,
	baolin.wang@linux.alibaba.com, npache@redhat.com,
	Liam.Howlett@oracle.com, ryan.roberts@arm.com, Vlastimil Babka,
	lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com,
	shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif
Subject: [PATCH 01/13] mm: add softleaf_to_pmd() and convert existing callers
Date: Mon, 27 Apr 2026 03:01:50 -0700
Message-ID: <20260427100553.2754667-2-usama.arif@linux.dev>
In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev>
References: <20260427100553.2754667-1-usama.arif@linux.dev>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Migadu-Flow: FLOW_OUT
Content-Type: text/plain; charset="utf-8"

Add softleaf_to_pmd() as the PMD counterpart to softleaf_to_pte(),
completing the symmetry of the softleaf abstraction for page table leaf
entries.

The upcoming PMD swap entry support needs to construct PMD entries from
swap entries.
Converting existing swp_entry_to_pmd() callers to softleaf_to_pmd() in a
prep patch keeps the feature patches focused on new functionality rather
than mixing refactoring with new code.

Signed-off-by: Usama Arif
Acked-by: David Hildenbrand (Arm)
---
 include/linux/leafops.h | 20 ++++++++++++++++++++
 mm/huge_memory.c        | 12 ++++++------
 2 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/include/linux/leafops.h b/include/linux/leafops.h
index 992cd8bd8ed0..803d312437df 100644
--- a/include/linux/leafops.h
+++ b/include/linux/leafops.h
@@ -108,6 +108,21 @@ static inline softleaf_t softleaf_from_pmd(pmd_t pmd)
 	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
 }
 
+/**
+ * softleaf_to_pmd() - Obtain a PMD entry from a leaf entry.
+ * @entry: Leaf entry.
+ *
+ * This generates an architecture-specific PMD entry that can be utilised to
+ * encode the metadata the leaf entry encodes.
+ *
+ * Returns: Architecture-specific PMD entry encoding leaf entry.
+ */
+static inline pmd_t softleaf_to_pmd(softleaf_t entry)
+{
+	/* Temporary until swp_entry_t eliminated. */
+	return swp_entry_to_pmd(entry);
+}
+
 #else
 
 static inline softleaf_t softleaf_from_pmd(pmd_t pmd)
@@ -115,6 +130,11 @@ static inline softleaf_t softleaf_from_pmd(pmd_t pmd)
 	return softleaf_mk_none();
 }
 
+static inline pmd_t softleaf_to_pmd(softleaf_t entry)
+{
+	return __pmd(0);
+}
+
 #endif
 
 /**
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 970e077019b7..49da0746b8ca 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1881,7 +1881,7 @@ static void copy_huge_non_present_pmd(
 	if (softleaf_is_migration_write(entry) ||
 	    softleaf_is_migration_read_exclusive(entry)) {
 		entry = make_readable_migration_entry(swp_offset(entry));
-		pmd = swp_entry_to_pmd(entry);
+		pmd = softleaf_to_pmd(entry);
 		if (pmd_swp_soft_dirty(*src_pmd))
 			pmd = pmd_swp_mksoft_dirty(pmd);
 		if (pmd_swp_uffd_wp(*src_pmd))
@@ -1894,7 +1894,7 @@ static void copy_huge_non_present_pmd(
 		 */
 		if (softleaf_is_device_private_write(entry)) {
 			entry = make_readable_device_private_entry(swp_offset(entry));
-			pmd = swp_entry_to_pmd(entry);
+			pmd = softleaf_to_pmd(entry);
 
 			if (pmd_swp_soft_dirty(*src_pmd))
 				pmd = pmd_swp_mksoft_dirty(pmd);
@@ -2632,12 +2632,12 @@ static void change_non_present_huge_pmd(struct mm_struct *mm,
 			entry = make_readable_exclusive_migration_entry(swp_offset(entry));
 		else
 			entry = make_readable_migration_entry(swp_offset(entry));
-		newpmd = swp_entry_to_pmd(entry);
+		newpmd = softleaf_to_pmd(entry);
 		if (pmd_swp_soft_dirty(*pmd))
 			newpmd = pmd_swp_mksoft_dirty(newpmd);
 	} else if (softleaf_is_device_private_write(entry)) {
 		entry = make_readable_device_private_entry(swp_offset(entry));
-		newpmd = swp_entry_to_pmd(entry);
+		newpmd = softleaf_to_pmd(entry);
 	} else {
 		newpmd = *pmd;
 	}
@@ -5014,7 +5014,7 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 		entry = make_migration_entry_young(entry);
 	if (pmd_dirty(pmdval))
 		entry = make_migration_entry_dirty(entry);
-	pmdswp = swp_entry_to_pmd(entry);
+	pmdswp = softleaf_to_pmd(entry);
 	if (pmd_soft_dirty(pmdval))
 		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
 	if (pmd_uffd_wp(pmdval))
@@ -5065,7 +5065,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 		else
 			entry = make_readable_device_private_entry(
 							page_to_pfn(new));
-		pmde = swp_entry_to_pmd(entry);
+		pmde = softleaf_to_pmd(entry);
 
 		if (pmd_swp_soft_dirty(*pvmw->pmd))
 			pmde = pmd_swp_mksoft_dirty(pmde);
-- 
2.52.0

From nobody Wed Jun 17 07:20:38 2026
From: Usama Arif
Subject: [PATCH 02/13] mm: extract ensure_on_mmlist() helper
Date: Mon, 27 Apr 2026 03:01:51 -0700
Message-ID: <20260427100553.2754667-3-usama.arif@linux.dev>
In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev>
References: <20260427100553.2754667-1-usama.arif@linux.dev>

When a swap entry is installed in a page table, the mm must be added to
init_mm.mmlist so that swapoff can find and unuse its swap entries.
This double-checked locking pattern is currently open-coded in
try_to_unmap_one() and copy_nonpresent_pte(). Move it into
ensure_on_mmlist() in mm/internal.h and convert both callers so it can
be reused by upcoming PMD-level swap entry code paths that also need to
register the mm with swapoff.

copy_nonpresent_pte() previously inserted into &src_mm->mmlist rather
than &init_mm.mmlist, but the insertion point is irrelevant: mmlist is a
circular list and swapoff walks all of it starting from init_mm.mmlist,
so only membership matters, not position.

Signed-off-by: Usama Arif
Reviewed-by: Dev Jain
---
 mm/internal.h | 13 +++++++++++++
 mm/memory.c   |  9 +--------
 mm/rmap.c     |  7 +------
 3 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..7de489689f54 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1952,4 +1952,17 @@ static inline int get_sysctl_max_map_count(void)
 bool may_expand_vm(struct mm_struct *mm, const vma_flags_t *vma_flags,
 		   unsigned long npages);
 
+/*
+ * Ensure @mm is on the init_mm.mmlist so swapoff can find it.
+ */
+static inline void ensure_on_mmlist(struct mm_struct *mm)
+{
+	if (list_empty(&mm->mmlist)) {
+		spin_lock(&mmlist_lock);
+		if (list_empty(&mm->mmlist))
+			list_add(&mm->mmlist, &init_mm.mmlist);
+		spin_unlock(&mmlist_lock);
+	}
+}
+
 #endif /* __MM_INTERNAL_H */
diff --git a/mm/memory.c b/mm/memory.c
index ea6568571131..33d7cc274e23 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -937,14 +937,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		if (swap_dup_entry_direct(entry) < 0)
 			return -EIO;
 
-		/* make sure dst_mm is on swapoff's mmlist. */
-		if (unlikely(list_empty(&dst_mm->mmlist))) {
-			spin_lock(&mmlist_lock);
-			if (list_empty(&dst_mm->mmlist))
-				list_add(&dst_mm->mmlist,
-						&src_mm->mmlist);
-			spin_unlock(&mmlist_lock);
-		}
+		ensure_on_mmlist(dst_mm);
 		/* Mark the swap entry as shared. */
 		if (pte_swp_exclusive(orig_pte)) {
 			pte = pte_swp_clear_exclusive(orig_pte);
diff --git a/mm/rmap.c b/mm/rmap.c
index 78b7fb5f367c..057e18cb80b0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2302,12 +2302,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				goto walk_abort;
 			}
-			if (list_empty(&mm->mmlist)) {
-				spin_lock(&mmlist_lock);
-				if (list_empty(&mm->mmlist))
-					list_add(&mm->mmlist, &init_mm.mmlist);
-				spin_unlock(&mmlist_lock);
-			}
+			ensure_on_mmlist(mm);
 			dec_mm_counter(mm, MM_ANONPAGES);
 			inc_mm_counter(mm, MM_SWAPENTS);
 			swp_pte = swp_entry_to_pte(entry);
-- 
2.52.0

From nobody Wed Jun 17 07:20:38 2026
From: Usama Arif
Subject: [PATCH 03/13] fs/proc: use softleaf_has_pfn() in pagemap PMD walker
Date: Mon, 27 Apr 2026 03:01:52 -0700
Message-ID: <20260427100553.2754667-4-usama.arif@linux.dev>
In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev>
References: <20260427100553.2754667-1-usama.arif@linux.dev>

pagemap_pmd_range_thp() assumes that every non-present PMD is a
migration entry and unconditionally calls softleaf_to_page(). This will
crash on any non-present PMD type that does not encode a PFN, such as
the upcoming PMD-level swap entries.

Guard the page lookup with softleaf_has_pfn(), matching how
pte_to_pagemap_entry() already handles non-present PTEs.

Signed-off-by: Usama Arif
Acked-by: David Hildenbrand (Arm)
---
 fs/proc/task_mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 751b9ba160fb..6d9f43881e62 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2042,8 +2042,8 @@ static int pagemap_pmd_range_thp(pmd_t *pmdp, unsigned long addr,
 			flags |= PM_SOFT_DIRTY;
 		if (pmd_swp_uffd_wp(pmd))
 			flags |= PM_UFFD_WP;
-		VM_WARN_ON_ONCE(!pmd_is_migration_entry(pmd));
-		page = softleaf_to_page(entry);
+		if (softleaf_has_pfn(entry))
+			page = softleaf_to_page(entry);
 	}
 
 	if (page) {
-- 
2.52.0

From nobody Wed Jun 17 07:20:38 2026
From: Usama Arif
Subject: [PATCH 04/13] mm/huge_memory: move softleaf_to_folio() inside migration branch
Date: Mon, 27 Apr 2026 03:01:53 -0700
Message-ID: <20260427100553.2754667-5-usama.arif@linux.dev>
In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev>
References: <20260427100553.2754667-1-usama.arif@linux.dev>

change_non_present_huge_pmd() calls softleaf_to_folio() unconditionally
at the top of the function. softleaf_to_folio() extracts a PFN from the
entry and converts it to a folio pointer, which is only meaningful for
migration and device_private entries that encode a real PFN. A swap
entry encodes a swap offset instead, so softleaf_to_folio() would
produce a bogus pointer and crash on mprotect() when a PMD swap entry is
present.

Move the call into the migration_write branch where the folio is
actually used, so the function is safe for any non-present PMD type.
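
The hazard described above can be sketched in user space. This is an illustration only, not kernel code: the toy_* names, the enum, and the folio table are invented for the example. The payload of a non-present entry is a PFN only for some entry kinds, so the folio lookup is placed inside the branch that needs it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical entry kinds; the kernel's softleaf types are richer. */
enum toy_kind { TOY_SWAP, TOY_MIGRATION_WRITE, TOY_DEVICE_PRIVATE };

/* Hypothetical decoded entry: payload is either a PFN or a swap offset. */
struct toy_entry {
	enum toy_kind kind;
	unsigned long payload;
};

struct toy_folio {
	bool anon;
};

#define TOY_NR_FOLIOS 4
static struct toy_folio toy_folios[TOY_NR_FOLIOS];

/* In this model only non-swap entries carry a PFN. */
static bool toy_has_pfn(struct toy_entry e)
{
	return e.kind != TOY_SWAP;
}

/* Only valid when toy_has_pfn(e); indexes the folio table by PFN. */
static struct toy_folio *toy_to_folio(struct toy_entry e)
{
	assert(toy_has_pfn(e));
	assert(e.payload < TOY_NR_FOLIOS);
	return &toy_folios[e.payload];
}

/*
 * Same shape as the fix: the folio is looked up inside the branch that
 * uses it, so a swap entry never reaches toy_to_folio().
 * Returns true if a folio was consulted.
 */
static bool toy_change_prot(struct toy_entry e)
{
	if (e.kind == TOY_MIGRATION_WRITE) {
		const struct toy_folio *folio = toy_to_folio(e);

		(void)folio->anon; /* would select the readable entry type */
		return true;
	}
	return false;
}
```

Calling toy_change_prot() with a TOY_SWAP entry whose payload is a large swap offset returns false without touching the folio table, whereas an unconditional lookup at the top of the function would trip the bounds assertion.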

Signed-off-by: Usama Arif
Acked-by: David Hildenbrand (Arm)
Reviewed-by: Dev Jain
---
 mm/huge_memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 49da0746b8ca..d82a19b5e276 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2619,11 +2619,12 @@ static void change_non_present_huge_pmd(struct mm_struct *mm,
 		bool uffd_wp_resolve)
 {
 	softleaf_t entry = softleaf_from_pmd(*pmd);
-	const struct folio *folio = softleaf_to_folio(entry);
 	pmd_t newpmd;
 
 	VM_WARN_ON(!pmd_is_valid_softleaf(*pmd));
 	if (softleaf_is_migration_write(entry)) {
+		const struct folio *folio = softleaf_to_folio(entry);
+
 		/*
 		 * A protection check is difficult so
 		 * just be safe and disable write
-- 
2.52.0

From nobody Wed Jun 17 07:20:38 2026
From: Usama Arif
Subject: [PATCH 05/13] mm: add PMD swap entry detection support
Date: Mon, 27 Apr 2026 03:01:54 -0700
Message-ID: <20260427100553.2754667-6-usama.arif@linux.dev>
In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev>
References: <20260427100553.2754667-1-usama.arif@linux.dev>

Currently when a PMD-mapped THP is swapped out, the PMD is always split
into 512 PTE-level swap entries. To preserve huge page information
across swap cycles, later patches will install a single PMD-level swap
entry instead. This patch adds the infrastructure to detect those
entries.

Teach the softleaf layer to recognise PMD swap entries:
pmd_is_swap_entry() detects them and softleaf_is_valid_pmd_entry()
accepts them as a valid non-present type. Clear the exclusive overlay
bit in softleaf_from_pmd() before decoding, matching how soft_dirty and
uffd_wp bits are already stripped.

Add pmd_swp_mkexclusive(), pmd_swp_exclusive(), and
pmd_swp_clear_exclusive() helpers to each architecture that supports THP
migration (x86, arm64, s390, riscv, loongarch), mirroring the existing
PTE swap exclusive helpers in each arch's pgtable.h.
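
A minimal user-space sketch of the set / test / clear helper trio and of stripping overlay bits before decoding. The toy_pmd_t type, the TOY_* bit positions, and every toy_* name are assumptions made for this illustration; the real bit positions are architecture-specific and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for pmd_t. */
typedef struct {
	uint64_t val;
} toy_pmd_t;

/* Hypothetical overlay bits carried alongside the swap type/offset. */
#define TOY_SWP_SOFT_DIRTY	(UINT64_C(1) << 1)
#define TOY_SWP_EXCLUSIVE	(UINT64_C(1) << 2)
#define TOY_SWP_OVERLAY_MASK	(TOY_SWP_SOFT_DIRTY | TOY_SWP_EXCLUSIVE)

static toy_pmd_t toy_pmd_swp_mkexclusive(toy_pmd_t pmd)
{
	pmd.val |= TOY_SWP_EXCLUSIVE;
	return pmd;
}

static bool toy_pmd_swp_exclusive(toy_pmd_t pmd)
{
	return (pmd.val & TOY_SWP_EXCLUSIVE) != 0;
}

static toy_pmd_t toy_pmd_swp_clear_exclusive(toy_pmd_t pmd)
{
	pmd.val &= ~TOY_SWP_EXCLUSIVE;
	return pmd;
}

/*
 * Decode step: drop the overlay bits first so they are not mistaken for
 * part of the encoded type/offset.
 */
static uint64_t toy_decode(toy_pmd_t pmd)
{
	return pmd.val & ~TOY_SWP_OVERLAY_MASK;
}
```

With these definitions, an entry and its exclusive-marked copy decode to the same value, which is the property the decode path relies on when it clears the overlay bit first.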

Signed-off-by: Usama Arif
---
 arch/arm64/include/asm/pgtable.h     |  4 ++++
 arch/loongarch/include/asm/pgtable.h | 17 +++++++++++++++++
 arch/riscv/include/asm/pgtable.h     | 15 +++++++++++++++
 arch/s390/include/asm/pgtable.h      | 15 +++++++++++++++
 arch/x86/include/asm/pgtable.h       | 15 +++++++++++++++
 include/linux/leafops.h              | 18 ++++++++++++++++--
 6 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 9029b81ccbe8..ecb0ef6994cb 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -601,6 +601,10 @@ static inline int pmd_protnone(pmd_t pmd)
 #define pmd_swp_clear_uffd_wp(pmd) \
 	pte_pmd(pte_swp_clear_uffd_wp(pmd_pte(pmd)))
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+#define pmd_swp_exclusive(pmd)	pte_swp_exclusive(pmd_pte(pmd))
+#define pmd_swp_mkexclusive(pmd)	pte_pmd(pte_swp_mkexclusive(pmd_pte(pmd)))
+#define pmd_swp_clear_exclusive(pmd) \
+	pte_pmd(pte_swp_clear_exclusive(pmd_pte(pmd)))
 
 #define pmd_write(pmd)		pte_write(pmd_pte(pmd))
 
diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
index 155f70e93460..f8e7761eb54e 100644
--- a/arch/loongarch/include/asm/pgtable.h
+++ b/arch/loongarch/include/asm/pgtable.h
@@ -345,6 +345,23 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
 	return pte;
 }
 
+static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
+{
+	pmd_val(pmd) |= _PAGE_SWP_EXCLUSIVE;
+	return pmd;
+}
+
+static inline bool pmd_swp_exclusive(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_SWP_EXCLUSIVE;
+}
+
+static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
+{
+	pmd_val(pmd) &= ~_PAGE_SWP_EXCLUSIVE;
+	return pmd;
+}
+
 #define pte_none(pte)		(!(pte_val(pte) & ~_PAGE_GLOBAL))
 #define pte_present(pte)	(pte_val(pte) & (_PAGE_PRESENT | _PAGE_PROTNONE))
 #define pte_no_exec(pte)	(pte_val(pte) & _PAGE_NO_EXEC)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index a6e0eaba2653..f4cd59ebab58 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -935,6 +935,21 @@ static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
 }
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
+static inline bool pmd_swp_exclusive(pmd_t pmd)
+{
+	return pte_swp_exclusive(pmd_pte(pmd));
+}
+
+static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
+{
+	return pte_pmd(pte_swp_mkexclusive(pmd_pte(pmd)));
+}
+
+static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
+{
+	return pte_pmd(pte_swp_clear_exclusive(pmd_pte(pmd)));
+}
+
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 static inline bool pmd_soft_dirty(pmd_t pmd)
 {
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 40a6fb19dd1d..9b05fd3e4df0 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -868,6 +868,21 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
 	return clear_pte_bit(pte, __pgprot(_PAGE_SWP_EXCLUSIVE));
 }
 
+static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
+{
+	return set_pmd_bit(pmd, __pgprot(_PAGE_SWP_EXCLUSIVE));
+}
+
+static inline bool pmd_swp_exclusive(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_SWP_EXCLUSIVE;
+}
+
+static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
+{
+	return clear_pmd_bit(pmd, __pgprot(_PAGE_SWP_EXCLUSIVE));
+}
+
 static inline int pte_soft_dirty(pte_t pte)
 {
 	return pte_val(pte) & _PAGE_SOFT_DIRTY;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 13e3e9a054cb..eb8b7a6f4bb4 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1517,6 +1517,21 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
 	return pte_clear_flags(pte, _PAGE_SWP_EXCLUSIVE);
 }
 
+static inline pmd_t pmd_swp_mkexclusive(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_SWP_EXCLUSIVE);
+}
+
+static inline int pmd_swp_exclusive(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_SWP_EXCLUSIVE;
+}
+
+static inline pmd_t pmd_swp_clear_exclusive(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_SWP_EXCLUSIVE);
+}
+
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
 {
diff --git a/include/linux/leafops.h b/include/linux/leafops.h
index 803d312437df..79e04db45bfb 100644
--- a/include/linux/leafops.h
+++ b/include/linux/leafops.h
@@ -102,6 +102,8 @@ static inline softleaf_t softleaf_from_pmd(pmd_t pmd)
 		pmd = pmd_swp_clear_soft_dirty(pmd);
 	if (pmd_swp_uffd_wp(pmd))
 		pmd = pmd_swp_clear_uffd_wp(pmd);
+	if (pmd_swp_exclusive(pmd))
+		pmd = pmd_swp_clear_exclusive(pmd);
 	arch_entry = __pmd_to_swp_entry(pmd);
 
 	/* Temporary until swp_entry_t eliminated. */
@@ -634,9 +636,21 @@ static inline bool pmd_is_migration_entry(pmd_t pmd)
  */
 static inline bool softleaf_is_valid_pmd_entry(softleaf_t entry)
 {
-	/* Only device private, migration entries valid for PMD. */
+	/* Device private, migration, and swap entries valid for PMD. */
 	return softleaf_is_device_private(entry) ||
-		softleaf_is_migration(entry);
+		softleaf_is_migration(entry) ||
+		softleaf_is_swap(entry);
+}
+
+/**
+ * pmd_is_swap_entry() - Does this PMD entry encode an actual swap entry?
+ * @pmd: PMD entry.
+ *
+ * Returns: true if the PMD encodes a swap entry, otherwise false.
+ */
+static inline bool pmd_is_swap_entry(pmd_t pmd)
+{
+	return softleaf_is_swap(softleaf_from_pmd(pmd));
 }
 
 /**
-- 
2.52.0

From nobody Wed Jun 17 07:20:38 2026
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1777284390; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=9Kp1QkYRXSgpi6gPGEgyWsBAfNXkVOPJekD3kea3f8U=; b=XbbNN6vI/PR6UOfzX8QZpOY22tWsYNCrev2+c0JZdgJY6eXgPdO5L77h6uoBAGi4E1O7vQ oFlj8f7QR0p3MHBO5TceOw1H7YLqhhitXGWfJXeKG0xH1JyscrXkRnG2FTk4wFnoJ47TtJ mX0PtfvtotW7COZ2K6HoX7mDWvQKivc= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: bhe@redhat.com, willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam.Howlett@oracle.com, ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [PATCH 06/13] mm: add PMD swap entry splitting support Date: Mon, 27 Apr 2026 03:01:55 -0700 Message-ID: <20260427100553.2754667-7-usama.arif@linux.dev> In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev> References: <20260427100553.2754667-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" Add a swap branch in __split_huge_pmd_locked() that splits a PMD swap entry into 512 PTE swap entries. Unlike migration splits, no folio reference is needed because swap entries point to swap slots, not pages. Each PTE inherits the correct sub-slot offset and preserves soft_dirty, uffd_wp, and exclusive flags. 
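For illustration only, the sub-slot arithmetic described above can be sketched in userspace C. This is a toy model, not the kernel code: toy_swp_entry, toy_swp_pte, TOY_PMD_NR and toy_split_pmd_swap() are hypothetical names, and TOY_PMD_NR assumes 4 KiB base pages with 2 MiB PMDs (HPAGE_PMD_NR == 512).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 4 KiB base pages and 2 MiB PMDs, so HPAGE_PMD_NR is 512. */
#define TOY_PMD_NR 512

/* Hypothetical stand-in for a swap entry: swap device type plus slot offset. */
struct toy_swp_entry {
	unsigned int type;
	unsigned long offset;
};

/* Hypothetical stand-in for a swap PTE or swap PMD: entry plus software bits. */
struct toy_swp_pte {
	struct toy_swp_entry entry;
	bool soft_dirty;
	bool uffd_wp;
	bool exclusive;
};

/*
 * Expand one PMD-level swap entry into nr PTE-level swap entries.
 * Entry i points at slot (offset + i) on the same swap device and
 * carries the same soft_dirty, uffd_wp and exclusive bits as the PMD.
 */
static void toy_split_pmd_swap(struct toy_swp_pte pmd,
			       struct toy_swp_pte *ptes, int nr)
{
	int i;

	assert(nr > 0);
	for (i = 0; i < nr; i++) {
		ptes[i] = pmd;				/* inherit type and flags */
		ptes[i].entry.offset = pmd.entry.offset + i;	/* i-th sub-slot */
	}
}
```

In this model no reference count changes during the split, which matches the point above that swap entries name swap slots rather than pages: the same TOY_PMD_NR slots are referenced before and after, only at a finer granularity.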
This branch is reached from the explicit __split_huge_pmd() callers that hit a non-present PMD: partial-range mprotect / munmap, the wp_huge_pmd() PMD-COW fallback, and the swap-in / swapoff fallbacks added in later patches when the cached folio is no longer PMD-sized. page_vma_mapped_walk() does not iterate PMD swap entries, so try_to_unmap_one() and try_to_migrate_one() do not reach this branch and freeze=3Dtrue cannot occur in this branch today. page and folio are therefore left uninitialized in the swap branch; a VM_WARN_ON_ONCE(freeze) catches any future caller that breaks this invariant before the freeze path dereferences page_to_pfn(page + i) or put_page(page). Signed-off-by: Usama Arif --- include/linux/leafops.h | 6 +++--- mm/huge_memory.c | 27 ++++++++++++++++++++++++++- 2 files changed, 29 insertions(+), 4 deletions(-) diff --git a/include/linux/leafops.h b/include/linux/leafops.h index 79e04db45bfb..2c0dfce6d0f0 100644 --- a/include/linux/leafops.h +++ b/include/linux/leafops.h @@ -657,9 +657,9 @@ static inline bool pmd_is_swap_entry(pmd_t pmd) * pmd_is_valid_softleaf() - Is this PMD entry a valid softleaf entry? * @pmd: PMD entry. * - * PMD leaf entries are valid only if they are device private or migration - * entries. This function asserts that a PMD leaf entry is valid in this - * respect. + * PMD leaf entries are valid only if they are device private, migration, + * or swap entries. This function asserts that a PMD leaf entry is valid + * in this respect. * * Returns: true if the PMD entry is a valid leaf entry, otherwise false. 
*/ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d82a19b5e276..9f67638e43c8 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3201,6 +3201,12 @@ static void __split_huge_pmd_locked(struct vm_area_s= truct *vma, pmd_t *pmd, folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR, vma, haddr, rmap_flags); } + } else if (pmd_is_swap_entry(*pmd)) { + VM_WARN_ON_ONCE(freeze); + old_pmd =3D *pmd; + soft_dirty =3D pmd_swp_soft_dirty(old_pmd); + uffd_wp =3D pmd_swp_uffd_wp(old_pmd); + anon_exclusive =3D pmd_swp_exclusive(old_pmd); } else { /* * Up to this point the pmd is present and huge and userland has @@ -3337,6 +3343,25 @@ static void __split_huge_pmd_locked(struct vm_area_s= truct *vma, pmd_t *pmd, VM_WARN_ON(!pte_none(ptep_get(pte + i))); set_pte_at(mm, addr, pte + i, entry); } + } else if (pmd_is_swap_entry(old_pmd)) { + softleaf_t sl_entry =3D softleaf_from_pmd(old_pmd); + pte_t swp_pte; + swp_entry_t sub_entry; + + for (i =3D 0, addr =3D haddr; i < HPAGE_PMD_NR; + i++, addr +=3D PAGE_SIZE) { + sub_entry =3D swp_entry(swp_type(sl_entry), + swp_offset(sl_entry) + i); + swp_pte =3D swp_entry_to_pte(sub_entry); + if (soft_dirty) + swp_pte =3D pte_swp_mksoft_dirty(swp_pte); + if (uffd_wp) + swp_pte =3D pte_swp_mkuffd_wp(swp_pte); + if (anon_exclusive) + swp_pte =3D pte_swp_mkexclusive(swp_pte); + VM_WARN_ON(!pte_none(ptep_get(pte + i))); + set_pte_at(mm, addr, pte + i, swp_pte); + } } else { pte_t entry; =20 @@ -3360,7 +3385,7 @@ static void __split_huge_pmd_locked(struct vm_area_st= ruct *vma, pmd_t *pmd, } pte_unmap(pte); =20 - if (!pmd_is_migration_entry(*pmd)) + if (!pmd_is_migration_entry(*pmd) && !pmd_is_swap_entry(*pmd)) folio_remove_rmap_pmd(folio, page, vma); if (freeze) put_page(page); --=20 2.52.0 From nobody Wed Jun 17 07:20:38 2026 Received: from out-184.mta0.migadu.com (out-184.mta0.migadu.com [91.218.175.184]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org 
(Postfix) with ESMTPS id 6B3E63B5302 for ; Mon, 27 Apr 2026 10:06:40 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.184 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777284401; cv=none; b=AlmXtQIbD8TPq50xsn8wucLqdE72cgGGJ6fyc9/asise9vt6LolxTh1hQp79ypX1gZ/16L0zCB7OtQIHmeEGHQqGrXKn4gnzyjV4hXM1ZQu52VZp/BhrZi0UIeuQ3cqrrjFIOJsVN6Cebl/RHql966L+0KbWyQ4Vi48BNnsSJrI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777284401; c=relaxed/simple; bh=zDlCo/gURcB3MCQ38WY6s1akjLIZ7ZEDyf4/dOFlSww=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=Z3Cvfv3Akpo/tpFYbx8lUeqdWLILHbj98FL3JhfNiFE+lyDrmiyo8exFsyBZrb4YTqHQLzwQxpYiMGwROKNsAhoXOpbxCoQJ4pozadGf5WjYuwUSMkNUHLTiAgl1qMdXdtAzDqfNWDNoAAkXM+vWhyEwe3yIOCrdFIkxDDzlKLc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=s/qjK6BO; arc=none smtp.client-ip=91.218.175.184 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="s/qjK6BO" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1777284398; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=gJkjOT8BwQKJuS2bsgBl30J2qwKscAkUJmfwgmoo1rY=; b=s/qjK6BOloqnUK1nNH6yKWUY7apRFwwujDkvuqx05P6TrIIN5boBkUZrlGLjAMQmmrjy/Z mCR2ZcgEt6i260AUAqhvLln13xAXzP5shvK0YjXzVm1mmfG5oy3HSWMI76ucZy1SU7yiOe I9WNyxpJrYVm+IrlDYjtcPdlceMI3SI= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: bhe@redhat.com, willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam.Howlett@oracle.com, ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [PATCH 07/13] mm: handle PMD swap entries in fork path Date: Mon, 27 Apr 2026 03:01:56 -0700 Message-ID: <20260427100553.2754667-8-usama.arif@linux.dev> In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev> References: <20260427100553.2754667-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" Teach copy_huge_pmd()/copy_huge_non_present_pmd() about swap entries, mirroring copy_nonpresent_pte(). swap_dup_entry_direct() gains a nr parameter (and is renamed to swap_dup_entries_direct()) so it can duplicate a contiguous range of swap slots in one call, matching the existing swap_put_entries_direct(entry, nr) API. Existing callers pass 1. 
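As a rough illustration of what "duplicate a contiguous range of swap slots in one call" means, here is a userspace toy model. It is a sketch under stated assumptions, not the swapfile.c implementation: toy_swap_count[], TOY_COUNT_MAX and toy_swap_dup_entries() are hypothetical, the all-or-nothing behaviour is an assumption of the model, and the -ENOMEM case only stands in for "the extend table was needed but could not be allocated".

```c
#include <assert.h>
#include <errno.h>

#define TOY_NR_SLOTS	2048
/* Tiny inline counter limit, chosen only so the failure path is easy to hit. */
#define TOY_COUNT_MAX	3

/* Hypothetical per-slot reference counts; 0 means the slot is not in use. */
static unsigned char toy_swap_count[TOY_NR_SLOTS];

/*
 * Take one extra reference on each of nr contiguous slots starting at offset.
 * Returns 0 on success, -EINVAL if any slot is unused, or -ENOMEM if any
 * slot's inline counter is already saturated. On failure no count changes.
 */
static int toy_swap_dup_entries(unsigned long offset, int nr)
{
	int i;

	assert(nr > 0 && offset + nr <= TOY_NR_SLOTS);

	/* Pass 1: validate the whole range before touching anything. */
	for (i = 0; i < nr; i++) {
		if (toy_swap_count[offset + i] == 0)
			return -EINVAL;
		if (toy_swap_count[offset + i] == TOY_COUNT_MAX)
			return -ENOMEM;
	}

	/* Pass 2: the range is valid, so bump every slot by one. */
	for (i = 0; i < nr; i++)
		toy_swap_count[offset + i]++;

	return 0;
}
```

With this shape a PTE-level caller passes nr == 1 and a PMD-level caller passes HPAGE_PMD_NR, and a caller that sees -ENOMEM can retry the whole range after a sleeping allocation without first undoing a partial update.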
copy_huge_non_present_pmd() "copies" PMD swap entries during fork instead of splitting, preserving the THP. This mirrors copy_nonpresent_pte() which duplicates the swap slot refcount, clears the exclusive bit on the source, and adds the destination mm to mmlist. If swap_dup_entries_direct() fails (GFP_ATOMIC table alloc), copy_huge_pmd() retries after swap_retry_table_alloc() with GFP_KERNEL, matching the PTE retry in copy_pte_range(). The PMD is stable across the retry because dup_mmap() holds write mmap_lock on both mm_structs. Signed-off-by: Usama Arif --- include/linux/swap.h | 4 ++-- mm/huge_memory.c | 52 +++++++++++++++++++++++++++++++++++++++----- mm/memory.c | 2 +- mm/swapfile.c | 7 +++--- 4 files changed, 53 insertions(+), 12 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 1930f81e6be4..2f12c20baba1 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -457,7 +457,7 @@ sector_t swap_folio_sector(struct folio *folio); * All entries must be allocated by folio_alloc_swap(). And they must have * a swap count > 1. See comments of folio_*_swap helpers for more info. 
*/ -int swap_dup_entry_direct(swp_entry_t entry); +int swap_dup_entries_direct(swp_entry_t entry, int nr); void swap_put_entries_direct(swp_entry_t entry, int nr); =20 /* @@ -501,7 +501,7 @@ static inline void free_swap_cache(struct folio *folio) { } =20 -static inline int swap_dup_entry_direct(swp_entry_t ent) +static inline int swap_dup_entries_direct(swp_entry_t ent, int nr) { return 0; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 9f67638e43c8..42887cf518cd 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1867,7 +1867,7 @@ bool touch_pmd(struct vm_area_struct *vma, unsigned l= ong addr, return false; } =20 -static void copy_huge_non_present_pmd( +static int copy_huge_non_present_pmd( struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, @@ -1913,14 +1913,35 @@ static void copy_huge_non_present_pmd( */ folio_try_dup_anon_rmap_pmd(src_folio, &src_folio->page, dst_vma, src_vma); + } else if (softleaf_is_swap(entry)) { + int err; + + /* + * PMD swap entry: duplicate swap references and clear + * exclusive on source, matching copy_nonpresent_pte(). 
+ */ + err =3D swap_dup_entries_direct(entry, HPAGE_PMD_NR); + if (err < 0) + return err; + + ensure_on_mmlist(dst_mm); + + if (pmd_swp_exclusive(pmd)) { + pmd =3D pmd_swp_clear_exclusive(pmd); + set_pmd_at(src_mm, addr, src_pmd, pmd); + } } =20 - add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); + if (softleaf_is_swap(entry)) + add_mm_counter(dst_mm, MM_SWAPENTS, HPAGE_PMD_NR); + else + add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); mm_inc_nr_ptes(dst_mm); pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); if (!userfaultfd_wp(dst_vma)) pmd =3D pmd_swp_clear_uffd_wp(pmd); set_pmd_at(dst_mm, addr, dst_pmd, pmd); + return 0; } =20 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, @@ -1961,6 +1982,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm= _struct *src_mm, if (unlikely(!pgtable)) goto out; =20 +retry: dst_ptl =3D pmd_lock(dst_mm, dst_pmd); src_ptl =3D pmd_lockptr(src_mm, src_pmd); spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); @@ -1968,10 +1990,28 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct = mm_struct *src_mm, ret =3D -EAGAIN; pmd =3D *src_pmd; =20 - if (unlikely(thp_migration_supported() && - pmd_is_valid_softleaf(pmd))) { - copy_huge_non_present_pmd(dst_mm, src_mm, dst_pmd, src_pmd, addr, - dst_vma, src_vma, pmd, pgtable); + if (unlikely(pmd_is_valid_softleaf(pmd))) { + ret =3D copy_huge_non_present_pmd(dst_mm, src_mm, dst_pmd, src_pmd, + addr, dst_vma, src_vma, pmd, + pgtable); + if (ret) { + spin_unlock(src_ptl); + spin_unlock(dst_ptl); + /* + * For PMD swap entries -ENOMEM means the per-cluster + * swap-extend table couldn't be GFP_ATOMIC-allocated. + * try the GFP_KERNEL fallback once before giving up. 
+ */ + if (ret =3D=3D -ENOMEM) { + softleaf_t entry =3D softleaf_from_pmd(pmd); + + if (softleaf_is_swap(entry) && + !swap_retry_table_alloc(entry, GFP_KERNEL)) + goto retry; + } + pte_free(dst_mm, pgtable); + goto out; + } ret =3D 0; goto out_unlock; } diff --git a/mm/memory.c b/mm/memory.c index 33d7cc274e23..8aa90afd601a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -934,7 +934,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm= _struct *src_mm, struct page *page; =20 if (likely(softleaf_is_swap(entry))) { - if (swap_dup_entry_direct(entry) < 0) + if (swap_dup_entries_direct(entry, 1) < 0) return -EIO; =20 ensure_on_mmlist(dst_mm); diff --git a/mm/swapfile.c b/mm/swapfile.c index c7e173b93e11..390f191be9a6 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3801,8 +3801,9 @@ void si_swapinfo(struct sysinfo *val) } =20 /* - * swap_dup_entry_direct() - Increase reference count of a swap entry by o= ne. + * swap_dup_entries_direct() - Increase reference count of swap entries by= one. * @entry: first swap entry from which we want to increase the refcount. + * @nr: number of contiguous swap entries to duplicate. * * Returns 0 for success, or -ENOMEM if the extend table is required * but could not be atomically allocated. Returns -EINVAL if the swap @@ -3814,7 +3815,7 @@ void si_swapinfo(struct sysinfo *val) * Also the swap entry must have a count >=3D 1. Otherwise folio_dup_swap = should * be used. 
*/ -int swap_dup_entry_direct(swp_entry_t entry) +int swap_dup_entries_direct(swp_entry_t entry, int nr) { struct swap_info_struct *si; =20 @@ -3831,7 +3832,7 @@ int swap_dup_entry_direct(swp_entry_t entry) */ VM_WARN_ON_ONCE(!swap_entry_swapped(si, entry)); =20 - return swap_dup_entries_cluster(si, swp_offset(entry), 1); + return swap_dup_entries_cluster(si, swp_offset(entry), nr); } =20 #if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP) --=20 2.52.0 From nobody Wed Jun 17 07:20:38 2026 Received: from out-171.mta0.migadu.com (out-171.mta0.migadu.com [91.218.175.171]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 98FE43B4E9D for ; Mon, 27 Apr 2026 10:06:46 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.171 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777284408; cv=none; b=MRhINZlLqMV70LmAr/THoj81YeI0WF3NFLxifrQw26kuezNPlYrSyN0Q5w2O3bNR+oV/xmEuLql+Im7n+zgMKbjw8z8tKbIYPjgGZH1Wc+Mexpt2Wfqq8n2Y953zPU/RGcxhkFTgAZKPEcwYYComL/nKd4dT1s+5lsnslyKOz2o= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777284408; c=relaxed/simple; bh=/PJE3YaW1aCo8LXbm6xsPpbObQVtZXHypva7Z/wdmjw=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=k+a1TspI7RmH8lfOiRU/R5fgR9fIj1dDgrkIpTPS15G3HsSXYJW3ORsoN1ZemyVqBT5NlrHFGnUEpdPgI9bBSSwl/vkd2pO5zI7AAa/buL6BX+wLtBCjTRcFgAIBTedL0MOgcOubV/SAOQuTir3OxiyimhezDsM8eX311wfw1E8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=a1tJJucI; arc=none smtp.client-ip=91.218.175.171 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass 
smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="a1tJJucI" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1777284404; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Yfd9GmH0gizrPfR9x8NMn3XlSbw9wtpjS9uv7XLZVWM=; b=a1tJJucI3EgISTT/MGYrnUxmZUqNbnUf5tluskBmlEPN/ya/MjGpPjo/lROhaGifYRTder 9oAQd4nSIrK/ugb4+byJ41MvZHlkhA7Q2u8I6m1y7q8ICf1z1MCiU/T1NnclxeDu07ZpNn cF09AzoCYQUJFvR8hh1YJGPIyp9jLQE= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: bhe@redhat.com, willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam.Howlett@oracle.com, ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [PATCH 08/13] mm: swap in PMD swap entries as whole THPs during swapoff Date: Mon, 27 Apr 2026 03:01:57 -0700 Message-ID: <20260427100553.2754667-9-usama.arif@linux.dev> In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev> References: <20260427100553.2754667-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" Add unuse_pmd() and call it from unuse_pmd_range() to swap in PMD-level swap entries as whole THPs during swapoff. 
This mirrors the existing unuse_pte_range() but operates at PMD granularity. If the PMD-order folio cannot be allocated, the cached folio is no longer PMD-sized (e.g. split in the swap cache by deferred_split_scan() or memory_failure() while the PMD swap entry was installed), or the folio is not uptodate, the PMD swap entry is split into PTE-level entries via __split_huge_pmd() and a non-zero error is returned so unuse_pmd_range() falls through to unuse_pte_range(), which handles the individual entries at order-0. swapin_alloc_pmd_folio() is a separate function in swap_state.c as it will be reused in swapin in a later patch. Signed-off-by: Usama Arif --- mm/swap.h | 7 +++ mm/swap_state.c | 35 +++++++++++++ mm/swapfile.c | 137 ++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 179 insertions(+) diff --git a/mm/swap.h b/mm/swap.h index a77016f2423b..76752df71693 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -301,6 +301,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry,= gfp_t flag, struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag, struct vm_fault *vmf); struct folio *swapin_folio(swp_entry_t entry, struct folio *folio); +struct folio *swapin_alloc_pmd_folio(swp_entry_t entry, struct mm_struct *= mm); void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma, unsigned long addr); =20 @@ -438,6 +439,12 @@ static inline struct folio *swapin_folio(swp_entry_t e= ntry, struct folio *folio) return NULL; } =20 +static inline struct folio *swapin_alloc_pmd_folio(swp_entry_t entry, + struct mm_struct *mm) +{ + return NULL; +} + static inline void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma, unsigned long addr) { diff --git a/mm/swap_state.c b/mm/swap_state.c index 1415a5c54a43..c2e8c76658f5 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -584,6 +584,41 @@ struct folio *swapin_folio(swp_entry_t entry, struct f= olio *folio) return swapcache; } =20 +#ifdef CONFIG_THP_SWAP +/** + * 
swapin_alloc_pmd_folio - allocate, charge, and read a PMD-sized swap fo= lio. + * @entry: starting swap entry to swap in + * @mm: mm to charge for the swap-in + * + * Allocate a HPAGE_PMD_ORDER folio, charge it to @mm's memcg for @entry, = and + * issue the swap-in via swapin_folio(). Used by callers that need to map a + * PMD swap entry as a whole THP (PMD swapoff). + * + * Return: the swapped-in folio, or NULL on alloc/charge/swapin failure (in + * which case the caller should fall back to splitting the PMD). + */ +struct folio *swapin_alloc_pmd_folio(swp_entry_t entry, struct mm_struct *= mm) +{ + struct folio *folio; + + folio =3D folio_alloc(GFP_HIGHUSER_MOVABLE, HPAGE_PMD_ORDER); + if (!folio) + return NULL; + + if (mem_cgroup_swapin_charge_folio(folio, mm, GFP_KERNEL, entry)) { + folio_put(folio); + return NULL; + } + + if (!swapin_folio(entry, folio)) { + folio_put(folio); + return NULL; + } + + return folio; +} +#endif /* CONFIG_THP_SWAP */ + /* * Locate a page of swap in physical memory, reserving swap cache space * and reading the disk if it is not already cached. diff --git a/mm/swapfile.c b/mm/swapfile.c index 390f191be9a6..7256edf4ce66 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -42,6 +42,7 @@ #include #include #include +#include =20 #include #include @@ -2519,6 +2520,130 @@ static int unuse_pte_range(struct vm_area_struct *v= ma, pmd_t *pmd, return 0; } =20 +/* + * unuse_pmd - Map a locked folio at PMD granularity during swapoff. + * + * The caller provides a locked, swapped-in folio. Returns 0 on success + * (PMD was mapped). Returns -EAGAIN if the swap cache folio no longer + * matches the entry or the PMD changed under the lock (try_to_unuse will + * rescan). Returns -EIO if the folio is not uptodate; in that case the + * PMD is split so unuse_pte_range() can handle individual pages. 
+ */ +static int unuse_pmd(struct vm_area_struct *vma, pmd_t *pmd, + unsigned long addr, softleaf_t entry, + struct folio *folio) +{ + struct mm_struct *mm =3D vma->vm_mm; + struct page *page; + pmd_t new_pmd, old_pmd; + spinlock_t *ptl; + rmap_t rmap_flags =3D RMAP_NONE; + bool exclusive; + + if (unlikely(!folio_matches_swap_entry(folio, entry))) + return -EAGAIN; + + if (unlikely(!folio_test_uptodate(folio))) { + __split_huge_pmd(vma, pmd, addr, false); + return -EIO; + } + + page =3D folio_page(folio, 0); + + ptl =3D pmd_lock(mm, pmd); + old_pmd =3D pmdp_get(pmd); + + if (!pmd_is_swap_entry(old_pmd) || + softleaf_from_pmd(old_pmd).val !=3D entry.val) { + spin_unlock(ptl); + return -EAGAIN; + } + + exclusive =3D pmd_swp_exclusive(old_pmd); + + /* + * Some architectures may have to restore extra metadata to the folio + * when reading from swap. This metadata may be indexed by swap entry + * so this must be called before folio_put_swap(). + */ + arch_swap_restore(folio_swap(entry, folio), folio); + + add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR); + add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR); + + new_pmd =3D folio_mk_pmd(folio, vma->vm_page_prot); + new_pmd =3D pmd_mkold(new_pmd); + if (pmd_swp_soft_dirty(old_pmd)) + new_pmd =3D pmd_mksoft_dirty(new_pmd); + if (pmd_swp_uffd_wp(old_pmd)) + new_pmd =3D pmd_mkuffd_wp(new_pmd); + + if (exclusive) + rmap_flags |=3D RMAP_EXCLUSIVE; + + folio_get(folio); + if (!folio_test_anon(folio)) + folio_add_new_anon_rmap(folio, vma, addr, rmap_flags); + else + folio_add_anon_rmap_pmd(folio, page, vma, addr, rmap_flags); + + set_pmd_at(mm, addr, pmd, new_pmd); + folio_put_swap(folio, NULL); + + spin_unlock(ptl); + + folio_free_swap(folio); + return 0; +} + +/* + * Try to swap in a PMD swap entry as a whole THP. Returns 0 on success. 
+ * Returns -ENOMEM if the PMD-order folio could not be allocated/charged, + * -EIO if swap-in failed, or -EAGAIN if the cached folio is no longer + * PMD-sized; in all of these the PMD is split so the caller can fall + * back to unuse_pte_range(). Otherwise propagates the error from + * unuse_pmd(). + */ +static int unuse_pmd_entry(struct vm_area_struct *vma, pmd_t *pmd, + unsigned long addr, softleaf_t entry) +{ + struct folio *folio; + int ret; + + folio =3D swap_cache_get_folio(entry); + if (!folio) { + folio =3D swapin_alloc_pmd_folio(entry, vma->vm_mm); + if (!folio) { + ret =3D -ENOMEM; + goto split_fallback; + } + } + + folio_lock(folio); + folio_wait_writeback(folio); + /* + * If the cached folio is no longer PMD-sized (e.g. split in the + * swap cache by deferred_split_scan() or memory_failure() while + * the PMD swap entry was installed), the PMD swap entry no longer + * maps a single contiguous folio. Split the PMD swap entry so + * unuse_pte_range() can swap the per-slot folios in individually. 
+ */ + if (folio_nr_pages(folio) !=3D HPAGE_PMD_NR) { + folio_unlock(folio); + folio_put(folio); + ret =3D -EAGAIN; + goto split_fallback; + } + ret =3D unuse_pmd(vma, pmd, addr, entry, folio); + folio_unlock(folio); + folio_put(folio); + return ret; + +split_fallback: + __split_huge_pmd(vma, pmd, addr, false); + return ret; +} + static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud, unsigned long addr, unsigned long end, unsigned int type) @@ -2531,6 +2656,18 @@ static inline int unuse_pmd_range(struct vm_area_str= uct *vma, pud_t *pud, do { cond_resched(); next =3D pmd_addr_end(addr, end); + + pmd_t pmdval =3D pmdp_get(pmd); + + if (pmd_is_swap_entry(pmdval)) { + softleaf_t sl =3D softleaf_from_pmd(pmdval); + + if (swp_type(sl) =3D=3D type) { + if (!unuse_pmd_entry(vma, pmd, addr, sl)) + continue; + } + } + ret =3D unuse_pte_range(vma, pmd, addr, next, type); if (ret) return ret; --=20 2.52.0 From nobody Wed Jun 17 07:20:38 2026 Received: from out-174.mta0.migadu.com (out-174.mta0.migadu.com [91.218.175.174]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C320E3B4E9E for ; Mon, 27 Apr 2026 10:06:52 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.174 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777284414; cv=none; b=kk2cdRs5YQ899MkbOMVUkLbb3rWm4wNNN6gy5II8VTWw7/Uz+xaUO17NiAdtgirL4S+6Fn3DDGGRaMFeN6Lk+hGywwks9cnDYDZq8OwFUFPCx543B/NcQAe1ArGyKeQRoTSnlGUqy096zZ3FTxH+2BNF9k+iijsaM/NNEgc7INA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777284414; c=relaxed/simple; bh=JtwhvLivMy5r7v190apI6BIMQtAmrFXFhq8Ept26g30=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; 
b=e3jh4zvW5fXQ8o+wBgfXqJRDtHKnAzYjjKD7aGu3PMGsEWFIslw0p0qX/HE+hfY4F0V5oRczCkl0QUOlOhNGKsjYY0aLw5OyhkVKIYw/hj3f4FM2lzW0UogTEYrDfvV1IEtpr2gIX1F5WcSdK4G6FZ4ixBommznCk6zLCQg6jCw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=qwHRgJLz; arc=none smtp.client-ip=91.218.175.174 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="qwHRgJLz" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1777284410; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=QxbxalWwIq4ZtbVZrbrFAeHNq1hYKuXrYJvnHUfj2zo=; b=qwHRgJLzgBSofs8QiSEF1elLzpXl23bYF6dvp53q0EncRblBOyiNb03bxZACVBKe/njB1F TPHlV/hiSu0lY8hbAPv5EZYpii/b8yfor3b5tbhzd1OEewnfXiN6XuA4HyYRG5reDVRSZ3 YR5xcgeT694jb2FUBRjrR78JeDO3KJY= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: bhe@redhat.com, willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam.Howlett@oracle.com, ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [PATCH 09/13] mm: handle PMD swap entries in 
non-present PMD walkers Date: Mon, 27 Apr 2026 03:01:58 -0700 Message-ID: <20260427100553.2754667-10-usama.arif@linux.dev> In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev> References: <20260427100553.2754667-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" Teach the remaining non-present PMD walkers about swap entries, mirroring the PTE-level equivalents. smaps_pmd_entry() accounts swap and swap_pss via a new shared smaps_account_swap() helper used by both PTE and PMD paths. zap_huge_pmd() frees swap slots via swap_put_entries_direct(), matching zap_nonpresent_ptes(). change_non_present_huge_pmd() skips write-permission changes for swap entries and only updates uffd_wp, matching change_softleaf_pte(). move_soft_dirty_pmd(), clear_soft_dirty_pmd(), make_uffd_wp_pmd(), pagemap_pmd_range_thp(), and change_huge_pmd() handle swap entries alongside migration entries. madvise_cold_or_pageout_pmd_range() extends its non-present PMD VM_BUG_ON to allow swap entries; without this, hitting a PMD swap entry on a DEBUG_VM kernel would BUG(). queue_folios_pmd() in mempolicy silently skips swap entries, matching the PTE walker, which only counts migration entries as failures. Without this, mbind(MPOL_MF_STRICT) would spuriously return -EIO on a swapped-out THP. madvise_free_huge_pmd() handles PMD swap entries directly: for a full-range MADV_FREE it clears the PMD, frees the deposited page table, and releases the swap slots; for a partial range it splits to PTE swap entries. Without this, MADV_FREE silently becomes a no-op on swapped-out THPs, leaking swap slots. hmm_vma_handle_absent_pmd() faults in PMD swap entries via hmm_vma_fault() instead of returning -EFAULT.
The first per-page handle_mm_fault() call triggers do_huge_pmd_swap_page(), which maps the entire folio; subsequent calls become harmless huge_pmd_set_accessed() and the walker retries with a present PMD. Signed-off-by: Usama Arif --- fs/proc/task_mmu.c | 43 +++++++++++++++++++++------------- mm/hmm.c | 3 ++- mm/huge_memory.c | 58 +++++++++++++++++++++++++++++++++++----------- mm/khugepaged.c | 6 +++++ mm/madvise.c | 5 ++-- mm/mempolicy.c | 2 ++ 6 files changed, 85 insertions(+), 32 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 6d9f43881e62..a6dd91d4cf24 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1015,6 +1015,23 @@ static void smaps_pte_hole_lookup(unsigned long addr= , struct mm_walk *walk) #endif } =20 +static void smaps_account_swap(struct mem_size_stats *mss, + softleaf_t entry, unsigned long size) +{ + int mapcount; + + mss->swap +=3D size; + mapcount =3D swp_swapcount(entry); + if (mapcount >=3D 2) { + u64 pss_delta =3D (u64)size << PSS_SHIFT; + + do_div(pss_delta, mapcount); + mss->swap_pss +=3D pss_delta; + } else { + mss->swap_pss +=3D (u64)size << PSS_SHIFT; + } +} + static void smaps_pte_entry(pte_t *pte, unsigned long addr, struct mm_walk *walk) { @@ -1036,18 +1053,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned lon= g addr, const softleaf_t entry =3D softleaf_from_pte(ptent); =20 if (softleaf_is_swap(entry)) { - int mapcount; - - mss->swap +=3D PAGE_SIZE; - mapcount =3D swp_swapcount(entry); - if (mapcount >=3D 2) { - u64 pss_delta =3D (u64)PAGE_SIZE << PSS_SHIFT; - - do_div(pss_delta, mapcount); - mss->swap_pss +=3D pss_delta; - } else { - mss->swap_pss +=3D (u64)PAGE_SIZE << PSS_SHIFT; - } + smaps_account_swap(mss, entry, PAGE_SIZE); } else if (softleaf_has_pfn(entry)) { if (softleaf_is_device_private(entry)) present =3D true; @@ -1077,9 +1083,13 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned lon= g addr, if (pmd_present(*pmd)) { page =3D vm_normal_page_pmd(vma, addr, *pmd); present =3D true; - } 
else if (unlikely(thp_migration_supported())) { + } else { const softleaf_t entry =3D softleaf_from_pmd(*pmd); =20 + if (softleaf_is_swap(entry)) { + smaps_account_swap(mss, entry, HPAGE_PMD_SIZE); + return; + } if (softleaf_has_pfn(entry)) page =3D softleaf_to_page(entry); } @@ -1665,7 +1675,7 @@ static inline void clear_soft_dirty_pmd(struct vm_are= a_struct *vma, pmd =3D pmd_clear_soft_dirty(pmd); =20 set_pmd_at(vma->vm_mm, addr, pmdp, pmd); - } else if (pmd_is_migration_entry(pmd)) { + } else if (pmd_is_migration_entry(pmd) || pmd_is_swap_entry(pmd)) { pmd =3D pmd_swp_clear_soft_dirty(pmd); set_pmd_at(vma->vm_mm, addr, pmdp, pmd); } @@ -2025,7 +2035,8 @@ static int pagemap_pmd_range_thp(pmd_t *pmdp, unsigne= d long addr, flags |=3D PM_UFFD_WP; if (pm->show_pfn) frame =3D pmd_pfn(pmd) + idx; - } else if (thp_migration_supported()) { + } else if (pmd_is_swap_entry(pmd) || + (thp_migration_supported() && pmd_is_migration_entry(pmd))) { const softleaf_t entry =3D softleaf_from_pmd(pmd); unsigned long offset; =20 @@ -2463,7 +2474,7 @@ static void make_uffd_wp_pmd(struct vm_area_struct *v= ma, old =3D pmdp_invalidate_ad(vma, addr, pmdp); pmd =3D pmd_mkuffd_wp(old); set_pmd_at(vma->vm_mm, addr, pmdp, pmd); - } else if (pmd_is_migration_entry(pmd)) { + } else if (pmd_is_migration_entry(pmd) || pmd_is_swap_entry(pmd)) { pmd =3D pmd_swp_mkuffd_wp(pmd); set_pmd_at(vma->vm_mm, addr, pmdp, pmd); } diff --git a/mm/hmm.c b/mm/hmm.c index 5955f2f0c83d..2bd3ebd1b8d6 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -370,7 +370,8 @@ static int hmm_vma_handle_absent_pmd(struct mm_walk *wa= lk, unsigned long start, required_fault =3D hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0); if (required_fault) { - if (softleaf_is_device_private(entry)) + if (softleaf_is_device_private(entry) || + softleaf_is_swap(entry)) return hmm_vma_fault(addr, end, required_fault, walk); else return -EFAULT; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 42887cf518cd..109e4dc4a167 100644 --- 
a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2375,6 +2375,14 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vm= f) return 0; } =20 +static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd) +{ + pgtable_t pgtable; + + pgtable =3D pgtable_trans_huge_withdraw(mm, pmd); + pte_free(mm, pgtable); + mm_dec_nr_ptes(mm); +} /* * Return true if we do MADV_FREE successfully on entire pmd page. * Otherwise, return false. @@ -2399,8 +2407,23 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, s= truct vm_area_struct *vma, goto out; =20 if (unlikely(!pmd_present(orig_pmd))) { + if (pmd_is_swap_entry(orig_pmd)) { + if (next - addr !=3D HPAGE_PMD_SIZE) { + spin_unlock(ptl); + __split_huge_pmd(vma, pmd, addr, false); + goto out_unlocked; + } + softleaf_t sl =3D softleaf_from_pmd(orig_pmd); + + pmdp_huge_get_and_clear(mm, addr, pmd); + zap_deposited_table(mm, pmd); + spin_unlock(ptl); + swap_put_entries_direct(sl, HPAGE_PMD_NR); + add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR); + return true; + } VM_BUG_ON(thp_migration_supported() && - !pmd_is_migration_entry(orig_pmd)); + !pmd_is_migration_entry(orig_pmd)); goto out; } =20 @@ -2449,15 +2472,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, s= truct vm_area_struct *vma, return ret; } =20 -static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd) -{ - pgtable_t pgtable; - - pgtable =3D pgtable_trans_huge_withdraw(mm, pmd); - pte_free(mm, pgtable); - mm_dec_nr_ptes(mm); -} - static void zap_huge_pmd_folio(struct mm_struct *mm, struct vm_area_struct= *vma, pmd_t pmdval, struct folio *folio, bool is_present) { @@ -2550,6 +2564,16 @@ bool zap_huge_pmd(struct mmu_gather *tlb, struct vm_= area_struct *vma, arch_check_zapped_pmd(vma, orig_pmd); tlb_remove_pmd_tlb_entry(tlb, pmd, addr); =20 + if (pmd_is_swap_entry(orig_pmd)) { + softleaf_t sl =3D softleaf_from_pmd(orig_pmd); + + zap_deposited_table(mm, pmd); + spin_unlock(ptl); + swap_put_entries_direct(sl, HPAGE_PMD_NR); + add_mm_counter(mm, 
MM_SWAPENTS, -HPAGE_PMD_NR); + return true; + } + is_present =3D pmd_present(orig_pmd); folio =3D normal_or_softleaf_folio_pmd(vma, addr, orig_pmd, is_present); has_deposit =3D has_deposited_pgtable(vma, orig_pmd, folio); @@ -2582,7 +2606,8 @@ static inline int pmd_move_must_withdraw(spinlock_t *= new_pmd_ptl, static pmd_t move_soft_dirty_pmd(pmd_t pmd) { if (pgtable_supports_soft_dirty()) { - if (unlikely(pmd_is_migration_entry(pmd))) + if (unlikely(pmd_is_migration_entry(pmd) || + pmd_is_swap_entry(pmd))) pmd =3D pmd_swp_mksoft_dirty(pmd); else if (pmd_present(pmd)) pmd =3D pmd_mksoft_dirty(pmd); @@ -2662,7 +2687,14 @@ static void change_non_present_huge_pmd(struct mm_st= ruct *mm, pmd_t newpmd; =20 VM_WARN_ON(!pmd_is_valid_softleaf(*pmd)); - if (softleaf_is_migration_write(entry)) { + + /* + * PMD swap entries don't encode write permission in the entry type, + * so only uffd_wp flag changes apply. No folio lookup needed. + */ + if (softleaf_is_swap(entry)) { + newpmd =3D *pmd; + } else if (softleaf_is_migration_write(entry)) { const struct folio *folio =3D softleaf_to_folio(entry); =20 /* @@ -2719,7 +2751,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm= _area_struct *vma, if (!ptl) return 0; =20 - if (thp_migration_supported() && pmd_is_valid_softleaf(*pmd)) { + if (pmd_is_valid_softleaf(*pmd)) { change_non_present_huge_pmd(mm, addr, pmd, uffd_wp, uffd_wp_resolve); goto unlock; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index b8452dbdb043..a7cc65c3d06a 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -950,6 +950,12 @@ static inline enum scan_result check_pmd_state(pmd_t *= pmd) */ if (pmd_is_migration_entry(pmde)) return SCAN_PMD_MAPPED; + /* + * A PMD-mapped THP that has been swapped out is still a THP from + * khugepaged's perspective; treat it like a present huge PMD. 
+ */ + if (pmd_is_swap_entry(pmde)) + return SCAN_PMD_MAPPED; if (!pmd_present(pmde)) return SCAN_NO_PTE_TABLE; if (pmd_trans_huge(pmde)) diff --git a/mm/madvise.c b/mm/madvise.c index 69708e953cf5..2702eb0b1134 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -390,7 +390,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, =20 if (unlikely(!pmd_present(orig_pmd))) { VM_BUG_ON(thp_migration_supported() && - !pmd_is_migration_entry(orig_pmd)); + !pmd_is_migration_entry(orig_pmd) && + !pmd_is_swap_entry(orig_pmd)); goto huge_unlock; } =20 @@ -666,7 +667,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned = long addr, int nr, max_nr; =20 next =3D pmd_addr_end(addr, end); - if (pmd_trans_huge(*pmd)) + if (pmd_trans_huge(*pmd) || pmd_is_swap_entry(*pmd)) if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next)) return 0; =20 diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 4e4421b22b59..55b38fe13a63 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -658,6 +658,8 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk= *walk) qp->nr_failed++; return; } + if (unlikely(pmd_is_swap_entry(*pmd))) + return; folio =3D pmd_folio(*pmd); if (is_huge_zero_folio(folio)) { walk->action =3D ACTION_CONTINUE; --=20 2.52.0 From nobody Wed Jun 17 07:20:38 2026 Received: from out-188.mta0.migadu.com (out-188.mta0.migadu.com [91.218.175.188]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7B54E3B6BE8 for ; Mon, 27 Apr 2026 10:06:57 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.188 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777284419; cv=none; b=pnJEm015plKRMtHh56WQ8+RY/aHIMbjtgyAmd55yygDK3v6EUgp0O25any7Y0VhD8zUQOpJvoXHJA40RC5bvEGr+Gy7wbe+FWwncslDuPx8V5G3fcR0UVkVrUBPp2HuZ0FCLN9ZtUxNLBXtZGBZLmov+Rmlfb488NXp18kNzMS0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; 
s=arc-20240116; t=1777284419; c=relaxed/simple; bh=OFR4XWZMU1nlgfgmqIvh0ii0g2O4NYvCx1qmywnoUDY=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=j6UR8359/itCTx0Fj9RH3GZwDJWCns73FGjKoJHP8MkbMB2rUtLSSYkq4dVWs+WLpCDiQ7zpVpnL8UZnDQJV9qPKLnAJ7+WAtu/zwfVTo6uyS7duGuEnq6V59WIcoiM60dQQuJPZMFzhLiyZ0SSLi0o9i23X8OOxeitVKhOqv/4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=gCJdOCkZ; arc=none smtp.client-ip=91.218.175.188 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="gCJdOCkZ" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1777284415; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=AZBl4ylAGDSqk3dPTEed++o2pIaK/pmYovXJ16f2ROQ=; b=gCJdOCkZWPm8fxEN5Ih7QIgc7Bzg9s540KZ6+YL6Y18IXoknJI+jDA0f7Ow3T3tm/a2igc FmSlQN7CG/oKEJkGf3StTReqe08eyzpREydIRLodOzMBCjtatnTAwQIJOgWfhuwRzXwcYn 1RafuI6xLz1QXXWxmPr2crxpfAw4cNo= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: bhe@redhat.com, willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam.Howlett@oracle.com, ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [PATCH 10/13] mm: handle PMD swap entries in UFFDIO_MOVE Date: Mon, 27 Apr 2026 03:01:59 -0700 Message-ID: <20260427100553.2754667-11-usama.arif@linux.dev> In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev> References: <20260427100553.2754667-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" move_pages_huge_pmd() returned -ENOENT for any non-trans_huge, non-migration PMD, which fails aligned UFFDIO_MOVE on a swapped-out THP -- the PMD swap entry is a perfectly valid mapping that should move whole. 
Splitting via the move_pages_ptes() fallback isn't a substitute either: __split_huge_pmd_locked() splits a PMD swap entry into HPAGE_PMD_NR PTE swap entries pointing at the same swap-cache folio, but move_swap_pte() refuses any swap-cache folio that is still large and returns -EBUSY. Add move_swap_pmd(), modeled on move_swap_pte(), that moves the swap entry whole-PMD and re-anchors the swap-cache folio's anon rmap to the destination VMA. Reject !pmd_swp_exclusive() entries with -EBUSY to preserve UFFDIO_MOVE's single-owner semantics, propagate soft-dirty, and carry the deposited page table across with the entry. The dispatcher in move_pages_huge_pmd() now waits for migration on a PMD migration entry (matching the PTE path) and routes PMD swap entries through move_swap_pmd() after pinning the swap device, fetching and locking any cached folio, and arming an mmu_notifier range so secondary MMUs see the move. If the swap-cache folio was split (e.g. by deferred_split_scan or memory_failure) between swap-out and UFFDIO_MOVE, src_folio is no longer PMD-sized but the PMD swap entry still covers all 512 slots. Moving the entry whole would only re-anchor one folio's anon rmap, leaving the other 511 with a stale anon_vma. Return -EBUSY in this case, matching move_pages_pte()'s rejection of large folios, so the caller falls back to PTE-level moves. Signed-off-by: Usama Arif --- mm/huge_memory.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 112 insertions(+), 1 deletion(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 109e4dc4a167..bfcc9b274be7 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2871,6 +2871,62 @@ int change_huge_pud(struct mmu_gather *tlb, struct v= m_area_struct *vma, #endif =20 #ifdef CONFIG_USERFAULTFD +/* + * Move a PMD-level swap entry from src_pmd to dst_pmd. Both PMD locks are + * acquired here; src_folio (if present) must already be locked. 
The depos= ited + * page table backing the source THP is moved across with the entry. + */ +static int move_swap_pmd(struct mm_struct *mm, struct vm_area_struct *dst_= vma, + unsigned long dst_addr, unsigned long src_addr, + pmd_t *dst_pmd, pmd_t *src_pmd, + pmd_t orig_dst_pmd, pmd_t orig_src_pmd, + spinlock_t *dst_ptl, spinlock_t *src_ptl, + struct folio *src_folio, swp_entry_t entry) +{ + pgtable_t src_pgtable; + pmd_t moved_pmd; + + /* + * The folio may have been freed and reused for a different swap entry + * while it was unlocked. Re-verify the association. + */ + if (src_folio && unlikely(!folio_test_swapcache(src_folio) || + entry.val !=3D src_folio->swap.val)) + return -EAGAIN; + + double_pt_lock(dst_ptl, src_ptl); + + if (!pmd_same(*src_pmd, orig_src_pmd) || + !pmd_same(*dst_pmd, orig_dst_pmd)) { + double_pt_unlock(dst_ptl, src_ptl); + return -EAGAIN; + } + + /* + * If the folio is in the swap cache, re-anchor its anon rmap to the + * destination VMA so a future swap-in fault at dst_addr finds it. + * Otherwise, re-check that no folio was newly inserted under us. + */ + if (src_folio) { + folio_move_anon_rmap(src_folio, dst_vma); + src_folio->index =3D linear_page_index(dst_vma, dst_addr); + } else if (swap_cache_has_folio(entry)) { + double_pt_unlock(dst_ptl, src_ptl); + return -EAGAIN; + } + + moved_pmd =3D pmdp_huge_get_and_clear(mm, src_addr, src_pmd); + if (pgtable_supports_soft_dirty()) + moved_pmd =3D pmd_swp_mksoft_dirty(moved_pmd); + set_pmd_at(mm, dst_addr, dst_pmd, moved_pmd); + + src_pgtable =3D pgtable_trans_huge_withdraw(mm, src_pmd); + pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable); + + double_pt_unlock(dst_ptl, src_ptl); + return 0; +} + /* * The PT lock for src_pmd and dst_vma/src_vma (for reading) are locked by * the caller, but it must return after releasing the page_table_lock. 
@@ -2905,11 +2961,66 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t= *dst_pmd, pmd_t *src_pmd, pm } =20 if (!pmd_trans_huge(src_pmdval)) { - spin_unlock(src_ptl); if (pmd_is_migration_entry(src_pmdval)) { + spin_unlock(src_ptl); pmd_migration_entry_wait(mm, &src_pmdval); return -EAGAIN; } + if (pmd_is_swap_entry(src_pmdval)) { + swp_entry_t entry; + struct swap_info_struct *si; + + /* + * UFFDIO_MOVE on anon mappings requires single-owner + * semantics; refuse to move a shared swap entry. + */ + if (!pmd_swp_exclusive(src_pmdval)) { + spin_unlock(src_ptl); + return -EBUSY; + } + + entry =3D softleaf_from_pmd(src_pmdval); + spin_unlock(src_ptl); + + /* Pin the swap device against a racing swapoff. */ + si =3D get_swap_device(entry); + if (unlikely(!si)) + return -EAGAIN; + + src_folio =3D swap_cache_get_folio(entry); + + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, + mm, src_addr, + src_addr + HPAGE_PMD_SIZE); + mmu_notifier_invalidate_range_start(&range); + + if (src_folio) { + folio_lock(src_folio); + if (folio_nr_pages(src_folio) !=3D HPAGE_PMD_NR) { + err =3D -EBUSY; + folio_unlock(src_folio); + folio_put(src_folio); + mmu_notifier_invalidate_range_end(&range); + put_swap_device(si); + return err; + } + } + + dst_ptl =3D pmd_lockptr(mm, dst_pmd); + err =3D move_swap_pmd(mm, dst_vma, dst_addr, src_addr, + dst_pmd, src_pmd, dst_pmdval, + src_pmdval, dst_ptl, src_ptl, + src_folio, entry); + + mmu_notifier_invalidate_range_end(&range); + if (src_folio) { + folio_unlock(src_folio); + folio_put(src_folio); + } + put_swap_device(si); + return err; + } + spin_unlock(src_ptl); return -ENOENT; } =20 --=20 2.52.0 From nobody Wed Jun 17 07:20:38 2026 Received: from out-182.mta0.migadu.com (out-182.mta0.migadu.com [91.218.175.182]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B64773B6C09 for ; Mon, 27 Apr 2026 10:07:02 +0000 (UTC) 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1777284420; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=XFcOlvA4XPLMVWGfM2EDnm9H4rzCw4SoXZ6Ff4QTzhk=; b=cnLcfiotY+wY2qWGQvpTkmUlzm+WmFoBE5tQYjkRJ7HXk8dzY/GrHc2XM22UFcFbHMiVCE rdPDPsx363qNp1kZlOSafVFA133H1LIeOnO75N1anR5ba1gtDLnGOkfhQZBZNLzitZnWRO ad0Hq/U/fP4ZCo3o4DbI8qNRM+dWU70= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: bhe@redhat.com, willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam.Howlett@oracle.com, ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [PATCH 11/13] mm: handle PMD swap entry faults on swap-in Date: Mon, 27 Apr 2026 03:02:00 -0700 Message-ID: <20260427100553.2754667-12-usama.arif@linux.dev> In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev> References: <20260427100553.2754667-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Add do_huge_pmd_swap_page() and dispatch to it from __handle_mm_fault() when vmf->orig_pmd encodes a swap entry. 
The handler resolves the entire 2 MB mapping in one shot, mirroring
do_swap_page() (PTE path) at PMD granularity:

- Look up the folio in the swap cache; on a miss, allocate a PMD-order
  folio and read from swap (shared with unuse_pmd_entry() via
  swapin_alloc_pmd_folio() in mm/swap_state.c).

- After locking, re-validate that the folio still corresponds to our
  entry and is still PMD-sized. Between the unlocked cache lookup and
  the lock, a racing swap-in on the same entry may have removed it from
  the cache via folio_free_swap(), or reclaim / memory_failure /
  deferred-split may have split the folio into smaller folios.

- Restore soft_dirty and uffd_wp from the swap PMD. Map writable only
  when the entry was exclusive, the VMA permits writes, and uffd-wp is
  not armed. Drop the exclusive marker when the cached folio is under
  writeback to an SWP_STABLE_WRITES backend (zram, encrypted) so the
  PMD is mapped read-only; a later write COWs into a fresh folio rather
  than corrupting the in-flight writeback. Mirrors do_swap_page().

- When the resulting PMD is read-only but the fault was a write, update
  vmf->orig_pmd and call wp_huge_pmd() in the same handler to COW
  immediately rather than forcing a second fault. Mask
  VM_FAULT_FALLBACK from its return: a PMD-COW that splits to PTE-level
  is normal, but the bit is part of VM_FAULT_ERROR and arch fault
  handlers BUG() on it without SIGBUS/HWPOISON/SIGSEGV. Requires
  exposing wp_huge_pmd() via mm/internal.h.

- Free the swap slot via should_try_to_free_swap() (hoisted from
  mm/memory.c into mm/internal.h so PTE- and PMD-level swap-in share
  the heuristic).

When PMD-order resources are unavailable (folio allocation fails, the
cached folio was split, memcg charge fails, or swapin_folio() races)
split the PMD swap entry into 512 PTE swap entries via
__split_huge_pmd() and return 0. The fault retries and do_swap_page()
takes over per-PTE. This avoids returning VM_FAULT_OOM for transient
PMD-order allocation failures.

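The VM_FAULT_FALLBACK masking described above can be sketched in
userspace as follows. This is only an illustration of the bit
arithmetic, not the kernel code: fault_bits_t, the SK_FAULT_* names,
their numeric values, and merge_wp_result() are hypothetical stand-ins
introduced for this sketch, and SK_FAULT_ERROR deliberately models only
a subset of the bits a real error mask would cover.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-ins for vm_fault_t bits. The names and numeric
 * values are assumptions made for this sketch only.
 */
typedef uint32_t fault_bits_t;

#define SK_FAULT_OOM      0x0001u
#define SK_FAULT_SIGBUS   0x0002u
#define SK_FAULT_MAJOR    0x0004u
#define SK_FAULT_FALLBACK 0x0800u

/* Simplified error mask: FALLBACK counts as an "error" bit here too. */
#define SK_FAULT_ERROR \
	(SK_FAULT_OOM | SK_FAULT_SIGBUS | SK_FAULT_FALLBACK)

/*
 * Merge the result of the follow-up write-protect handling (wp_ret)
 * into the swap-in result (ret):
 *  - drop FALLBACK from wp_ret, since a split to PTE level is a normal
 *    outcome that the next fault resolves;
 *  - if any real error bit remains, report only the error bits.
 */
static fault_bits_t merge_wp_result(fault_bits_t ret, fault_bits_t wp_ret)
{
	wp_ret &= ~SK_FAULT_FALLBACK;
	ret |= wp_ret;
	if (ret & SK_FAULT_ERROR)
		ret &= SK_FAULT_ERROR;
	return ret;
}
```

With these stand-ins, a major fault whose COW step only reported a
fallback stays a plain major fault, while a genuine error such as an
OOM from the COW step replaces the informational bits.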
Signed-off-by: Usama Arif --- include/linux/huge_mm.h | 9 ++ mm/huge_memory.c | 197 ++++++++++++++++++++++++++++++++++++++++ mm/internal.h | 36 ++++++++ mm/memory.c | 40 +------- mm/swap_state.c | 2 +- 5 files changed, 247 insertions(+), 37 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 2949e5acff35..93ee6c36d6ea 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -522,6 +522,15 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf); =20 vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf); =20 +#ifdef CONFIG_THP_SWAP +vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf); +#else +static inline vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf) +{ + return 0; +} +#endif + extern struct folio *huge_zero_folio; extern unsigned long huge_zero_pfn; =20 diff --git a/mm/huge_memory.c b/mm/huge_memory.c index bfcc9b274be7..141ab45adee4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2375,6 +2375,203 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *v= mf) return 0; } =20 +#ifdef CONFIG_THP_SWAP +/** + * do_huge_pmd_swap_page() - Handle a fault on a PMD-level swap entry. + * @vmf: Fault context. vmf->orig_pmd contains the swap PMD. + * + * Looks up the folio in the swap cache, and if it is a PMD-sized folio, + * maps it directly at the PMD level. If the folio is not in the swap + * cache, allocates a PMD-sized folio and reads from swap. On allocation + * failure, splits the PMD swap entry into PTE-level entries and retries + * at PTE granularity. + * + * Return: VM_FAULT_* flags. 
+ */ +vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf) +{ + struct vm_area_struct *vma =3D vmf->vma; + struct mm_struct *mm =3D vma->vm_mm; + struct folio *folio; + struct page *page; + struct swap_info_struct *si; + unsigned long haddr =3D vmf->address & HPAGE_PMD_MASK; + softleaf_t entry; + swp_entry_t swp_entry; + pmd_t pmd; + vm_fault_t ret =3D 0; + bool exclusive; + rmap_t rmap_flags =3D RMAP_NONE; + + entry =3D softleaf_from_pmd(vmf->orig_pmd); + if (unlikely(!softleaf_is_swap(entry))) + return 0; + + swp_entry =3D entry; + + /* Prevent swapoff from happening to us. */ + si =3D get_swap_device(swp_entry); + if (unlikely(!si)) + return 0; + + folio =3D swap_cache_get_folio(swp_entry); + if (!folio) { + folio =3D swapin_alloc_pmd_folio(swp_entry, mm); + if (!folio) + goto split_fallback; + + /* Had to read from swap area: Major fault */ + ret =3D VM_FAULT_MAJOR; + count_vm_event(PGMAJFAULT); + count_memcg_event_mm(mm, PGMAJFAULT); + } + + ret |=3D folio_lock_or_retry(folio, vmf); + if (ret & VM_FAULT_RETRY) + goto out_release; + + /* Verify the folio is still in swap cache and matches our entry */ + if (unlikely(!folio_matches_swap_entry(folio, swp_entry))) + goto out_page; + + /* + * Folio should be PMD-sized; if not (e.g. split in swap cache), + * split the PMD swap entry and retry at PTE level. 
+ */ + if (folio_nr_pages(folio) !=3D HPAGE_PMD_NR) { + folio_unlock(folio); + folio_put(folio); + goto split_fallback; + } + + if (unlikely(!folio_test_uptodate(folio))) { + ret =3D VM_FAULT_SIGBUS; + goto out_page; + } + + page =3D folio_page(folio, 0); + arch_swap_restore(folio_swap(swp_entry, folio), folio); + + if ((vmf->flags & FAULT_FLAG_WRITE) && !folio_test_lru(folio)) + lru_add_drain(); + + folio_throttle_swaprate(folio, GFP_KERNEL); + + /* Lock the PMD and verify it hasn't changed */ + vmf->ptl =3D pmd_lock(mm, vmf->pmd); + if (unlikely(!pmd_same(vmf->orig_pmd, pmdp_get(vmf->pmd)))) { + spin_unlock(vmf->ptl); + goto out_page; + } + + exclusive =3D pmd_swp_exclusive(vmf->orig_pmd); + + /* + * Some swap backends (e.g. zram) don't support concurrent page + * modifications while under writeback. If we map exclusive on such + * a backend while the folio is still under writeback, the writeback + * may see partial modifications and corrupt the swap slot. Drop the + * exclusive marker and only map R/O for that case; further GUP + * references can't appear once the page is fully unmapped, so this + * is safe. + */ + if (exclusive && folio_test_writeback(folio) && + data_race(si->flags & SWP_STABLE_WRITES)) + exclusive =3D false; + + /* + * Set up the PMD mapping. Similar to do_swap_page() but at PMD level. + */ + add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR); + add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR); + + pmd =3D folio_mk_pmd(folio, vma->vm_page_prot); + pmd =3D pmd_mkyoung(pmd); + + if (pmd_swp_soft_dirty(vmf->orig_pmd)) + pmd =3D pmd_mksoft_dirty(pmd); + if (pmd_swp_uffd_wp(vmf->orig_pmd)) + pmd =3D pmd_mkuffd_wp(pmd); + + /* + * Check exclusivity to determine if we can map writable. 
+ */ + if (exclusive || folio_ref_count(folio) =3D=3D 1) { + if ((vma->vm_flags & VM_WRITE) && + !userfaultfd_huge_pmd_wp(vma, pmd) && + !pmd_needs_soft_dirty_wp(vma, pmd)) { + pmd =3D pmd_mkwrite(pmd, vma); + if (vmf->flags & FAULT_FLAG_WRITE) { + pmd =3D pmd_mkdirty(pmd); + vmf->flags &=3D ~FAULT_FLAG_WRITE; + } + } + rmap_flags |=3D RMAP_EXCLUSIVE; + } + + flush_icache_pages(vma, page, HPAGE_PMD_NR); + + if (!folio_test_anon(folio)) + folio_add_new_anon_rmap(folio, vma, haddr, rmap_flags); + else + folio_add_anon_rmap_pmd(folio, page, vma, haddr, rmap_flags); + + folio_put_swap(folio, NULL); + + set_pmd_at(mm, haddr, vmf->pmd, pmd); + update_mmu_cache_pmd(vma, haddr, vmf->pmd); + + /* Update orig_pmd for any follow-up wp_huge_pmd() below. */ + vmf->orig_pmd =3D pmd; + + /* + * Conditionally try to free up the swap cache. Do it after mapping, + * so raced page faults will likely see the folio in swap cache and + * wait on the folio lock. + */ + if (should_try_to_free_swap(si, folio, vma, 1, vmf->flags)) + folio_free_swap(folio); + + spin_unlock(vmf->ptl); + + folio_unlock(folio); + put_swap_device(si); + + /* + * If the write fault wasn't satisfied above (folio is shared without + * exclusivity), fall through to wp_huge_pmd to handle COW or + * userfaultfd-wp without forcing a second fault. + * + * wp_huge_pmd() may return VM_FAULT_FALLBACK if it had to split the + * PMD; that's a normal outcome =E2=80=94 the natural PTE-level refault w= ill + * complete the COW. Mask it so callers (and the arch fault handler) + * don't see VM_FAULT_FALLBACK as a fatal VM_FAULT_ERROR. 
+ */ + if (vmf->flags & FAULT_FLAG_WRITE) { + vm_fault_t wp_ret =3D wp_huge_pmd(vmf); + + wp_ret &=3D ~VM_FAULT_FALLBACK; + ret |=3D wp_ret; + if (ret & VM_FAULT_ERROR) + ret &=3D VM_FAULT_ERROR; + } + + return ret; + +out_page: + folio_unlock(folio); +out_release: + folio_put(folio); + put_swap_device(si); + return ret; + +split_fallback: + __split_huge_pmd(vma, vmf->pmd, haddr, false); + put_swap_device(si); + return 0; +} +#endif /* CONFIG_THP_SWAP */ + static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd) { pgtable_t pgtable; diff --git a/mm/internal.h b/mm/internal.h index 7de489689f54..c522bff72688 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -508,6 +508,42 @@ static inline vm_fault_t vmf_anon_prepare(struct vm_fa= ult *vmf) } =20 vm_fault_t do_swap_page(struct vm_fault *vmf); +vm_fault_t wp_huge_pmd(struct vm_fault *vmf); + +/* + * Check if we should call folio_free_swap to free the swap cache. + * folio_free_swap only frees the swap cache to release the slot if swap + * count is zero, so we don't need to check the swap count here. + */ +static inline bool should_try_to_free_swap(struct swap_info_struct *si, + struct folio *folio, + struct vm_area_struct *vma, + unsigned int extra_refs, + unsigned int fault_flags) +{ + if (!folio_test_swapcache(folio)) + return false; + /* + * Always try to free swap cache for SWP_SYNCHRONOUS_IO devices. Swap + * cache can help save some IO or memory overhead, but these devices + * are fast, and meanwhile, swap cache pinning the slot deferring the + * release of metadata or fragmentation is a more critical issue. + */ + if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) + return true; + if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) || + folio_test_mlocked(folio)) + return true; + /* + * If we want to map a page that's in the swapcache writable, we + * have to detect via the refcount if we're really the exclusive + * user. 
Try freeing the swapcache to get rid of the swapcache + * reference only in case it's likely that we'll be the exclusive user. + */ + return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) && + folio_ref_count(folio) =3D=3D (extra_refs + folio_nr_pages(folio)); +} + void folio_rotate_reclaimable(struct folio *folio); bool __folio_end_writeback(struct folio *folio); void deactivate_file_folio(struct folio *folio); diff --git a/mm/memory.c b/mm/memory.c index 8aa90afd601a..3006e1bc2bd7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4481,40 +4481,6 @@ static vm_fault_t remove_device_exclusive_entry(stru= ct vm_fault *vmf) return 0; } =20 -/* - * Check if we should call folio_free_swap to free the swap cache. - * folio_free_swap only frees the swap cache to release the slot if swap - * count is zero, so we don't need to check the swap count here. - */ -static inline bool should_try_to_free_swap(struct swap_info_struct *si, - struct folio *folio, - struct vm_area_struct *vma, - unsigned int extra_refs, - unsigned int fault_flags) -{ - if (!folio_test_swapcache(folio)) - return false; - /* - * Always try to free swap cache for SWP_SYNCHRONOUS_IO devices. Swap - * cache can help save some IO or memory overhead, but these devices - * are fast, and meanwhile, swap cache pinning the slot deferring the - * release of metadata or fragmentation is a more critical issue. - */ - if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) - return true; - if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) || - folio_test_mlocked(folio)) - return true; - /* - * If we want to map a page that's in the swapcache writable, we - * have to detect via the refcount if we're really the exclusive - * user. Try freeing the swapcache to get rid of the swapcache - * reference only in case it's likely that we'll be the exclusive user. 
- */ - return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) && - folio_ref_count(folio) =3D=3D (extra_refs + folio_nr_pages(folio)); -} - static vm_fault_t pte_marker_clear(struct vm_fault *vmf) { vmf->pte =3D pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, @@ -6233,8 +6199,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fa= ult *vmf) return VM_FAULT_FALLBACK; } =20 -/* `inline' is required to avoid gcc 4.1.2 build error */ -static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) +vm_fault_t wp_huge_pmd(struct vm_fault *vmf) { struct vm_area_struct *vma =3D vmf->vma; const bool unshare =3D vmf->flags & FAULT_FLAG_UNSHARE; @@ -6518,6 +6483,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_st= ruct *vma, =20 if (pmd_is_migration_entry(vmf.orig_pmd)) pmd_migration_entry_wait(mm, vmf.pmd); + else if (IS_ENABLED(CONFIG_THP_SWAP) && + pmd_is_swap_entry(vmf.orig_pmd)) + return do_huge_pmd_swap_page(&vmf); return 0; } if (pmd_trans_huge(vmf.orig_pmd)) { diff --git a/mm/swap_state.c b/mm/swap_state.c index c2e8c76658f5..19c6759006bb 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -592,7 +592,7 @@ struct folio *swapin_folio(swp_entry_t entry, struct fo= lio *folio) * * Allocate a HPAGE_PMD_ORDER folio, charge it to @mm's memcg for @entry, = and * issue the swap-in via swapin_folio(). Used by callers that need to map a - * PMD swap entry as a whole THP (PMD swapoff). + * PMD swap entry as a whole THP (PMD swap-in fault and swapoff). * * Return: the swapped-in folio, or NULL on alloc/charge/swapin failure (in * which case the caller should fall back to splitting the PMD). 
--=20 2.52.0 From nobody Wed Jun 17 07:20:38 2026 Received: from out-179.mta1.migadu.com (out-179.mta1.migadu.com [95.215.58.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6EFCD37472D for ; Mon, 27 Apr 2026 10:07:15 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.179 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777284437; cv=none; b=GFhJ4gHbpwnYq0yK/HL8AUXe2AQMslnRwN1uEi4s0uqSsiTX4Os4o4NIgi/USi0cQpJ0L4LH3oCFQm+Do6m+sNf3Qjq4aXzoJuKDH0S6K9YMVul/awZgqiuL5O12VTfO0I0bWhLe76OMb5njjD4ajGCRFjkQdaUGDCBCiqFMiwo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777284437; c=relaxed/simple; bh=uNyFiUDKTHdwcO0ZbABM/GsAEnvBT6gP+5TFX9YOnzc=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=CaGpqTLPwHkLmtfstiaI7Lh5vEC44SuC0QNz/VsMA7wVarLGRqwD2E6SWpgAFH7L7KI1Z+3RaHcCbURHXXDUtdwGmFwKq4n+1y/T6WNJV9tB9FYeN1xb6NK3zndtadyorG1EhPInmoJpXfVjB/LI4sxwvPbhSyOt3gLefsH7c4E= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=TQ0yJd0I; arc=none smtp.client-ip=95.215.58.179 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="TQ0yJd0I" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1777284433; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=y1OHVsAXxMFfWk6JE9Gj2xlQe5BNdiMFZPhyECznPTk=; b=TQ0yJd0IdaebYVoyjiHAwFAEFZqDT99p5QksXJBrsRhtNnntwM0NTxhmeuk4jIBbYoGW5M 4QqB16pCLPMiFucc6NI6Lvtg5WsBWNOA9GNDg81AN+jhkYRLNw24cGmDHRCOCmIsfDu8B8 ZoUQ8bR2ajSzAj5NJvk3Tqx6/W8j0sQ= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: bhe@redhat.com, willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam.Howlett@oracle.com, ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [PATCH 12/13] mm: install PMD swap entries on swap-out Date: Mon, 27 Apr 2026 03:02:01 -0700 Message-ID: <20260427100553.2754667-13-usama.arif@linux.dev> In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev> References: <20260427100553.2754667-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" Reclaim today splits a PMD-mapped anonymous THP into 512 PTE swap entries before unmap, losing the huge mapping across the swap round-trip and forcing khugepaged to rebuild it later. 
The contiguous swap range was already secured when the folio was added
to the swap cache (a non-contiguous allocation would have split the
folio earlier), so the PMD can be replaced by a single PMD-level swap
entry instead.

This patch mirrors the existing PTE swap-out path at PMD granularity:

- shrink_folio_list() drops TTU_SPLIT_HUGE_PMD for PMD-mappable
  swapcache folios, gated on zswap_never_enabled() since zswap cannot
  reconstruct a 2 MB folio from per-page blobs (the zswap case is best
  handled separately).

- try_to_unmap_one() now has a PMD branch that calls
  set_pmd_swap_entry() and adjusts MM_ANONPAGES / MM_SWAPENTS by
  HPAGE_PMD_NR before walk_done. TTU_SPLIT_HUGE_PMD remains the
  fallback.

- set_pmd_swap_entry() is the installer. Mirroring the PTE swap-out
  sequence at PMD granularity, it clears the present mapping (keeping
  the original for rollback), bumps the swap_map refcount for the
  folio's 512 slots, drops the exclusive mark if the page was
  anon-exclusive, propagates the dirty bit to the folio so writeback is
  not lost, and installs a swap PMD that preserves the original
  soft-dirty / uffd-wp / exclusive bits. Any failing step rolls back
  the present mapping.

The swap entry value matches what 512 PTE swap entries would encode, so
swap_map refcounting is unchanged: each of the 512 slots carries a count
of 1, released individually on later split or together on swap-in.
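
As a rough illustration of the installer sequence described above (not
kernel code), the following userspace C model walks through the same
steps and the same rollback points. The names model_pmd, model_state and
model_set_pmd_swap_entry() are hypothetical stand-ins invented for this
sketch, the two boolean parameters merely simulate a failure of the swap
reference step and of the exclusivity step, and the counter update that
the patch performs in try_to_unmap_one() is folded into the function for
brevity:

```c
#include <assert.h>
#include <stdbool.h>

#define SLOTS_PER_PMD 512	/* slots covered by one PMD, per the text above */

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct model_pmd {
	bool present, dirty, soft_dirty, uffd_wp, exclusive;
};

struct model_state {
	struct model_pmd pmd;
	int swap_count[SLOTS_PER_PMD];	/* per-slot swap reference count */
	bool folio_dirty;
	long anon_pages, swap_ents;	/* MM_ANONPAGES / MM_SWAPENTS analogues */
};

/* Returns 0 on success; on failure the present mapping is restored. */
static int model_set_pmd_swap_entry(struct model_state *s, bool dup_fails,
				    bool share_fails)
{
	struct model_pmd old = s->pmd;	/* kept for rollback */
	int i;

	assert(old.present);		/* only a present PMD is swapped out */

	s->pmd.present = false;		/* step 1: clear the present mapping */

	if (dup_fails) {		/* step 2: one reference per slot */
		s->pmd = old;
		return -1;
	}
	for (i = 0; i < SLOTS_PER_PMD; i++)
		s->swap_count[i]++;

	if (old.exclusive && share_fails) {	/* step 3: drop exclusivity */
		for (i = 0; i < SLOTS_PER_PMD; i++)
			s->swap_count[i]--;
		s->pmd = old;
		return -1;
	}

	if (old.dirty)			/* step 4: do not lose dirty data */
		s->folio_dirty = true;

	/* step 5: install the swap PMD, carrying the software bits over */
	s->pmd = (struct model_pmd){
		.present = false,
		.soft_dirty = old.soft_dirty,
		.uffd_wp = old.uffd_wp,
		.exclusive = old.exclusive,
	};

	/* caller-side accounting, folded in here for brevity */
	s->anon_pages -= SLOTS_PER_PMD;
	s->swap_ents += SLOTS_PER_PMD;
	return 0;
}
```

In this model a failed step leaves the mapping present and every slot
count untouched, which is the property the rollback paths in the patch
are meant to provide.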
Signed-off-by: Usama Arif
---
 include/linux/huge_mm.h       |  2 +
 include/linux/vm_event_item.h |  1 +
 mm/huge_memory.c              | 78 +++++++++++++++++++++++++++++++++++
 mm/rmap.c                     | 20 +++++++++
 mm/vmscan.c                   | 14 ++++++-
 mm/vmstat.c                   |  1 +
 6 files changed, 115 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 93ee6c36d6ea..cbfac4720fc9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -524,6 +524,8 @@ vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
 
 #ifdef CONFIG_THP_SWAP
 vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf);
+int set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+		       struct folio *folio);
 #else
 static inline vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf)
 {
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03fe95f5a020..7267c06674c0 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -108,6 +108,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_ZERO_PAGE_ALLOC_FAILED,
 		THP_SWPOUT,
 		THP_SWPOUT_FALLBACK,
+		THP_SWPOUT_PMD,
 #endif
 #ifdef CONFIG_BALLOON
 		BALLOON_INFLATE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 141ab45adee4..47ff7fb9ee9b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -5497,3 +5497,81 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	trace_remove_migration_pmd(address, pmd_val(pmde));
 }
 #endif
+
+#ifdef CONFIG_THP_SWAP
+/**
+ * set_pmd_swap_entry() - Replace a PMD mapping with a PMD-level swap entry.
+ * @pvmw: Page vma mapped walk context, must have pvmw->pmd set and
+ *        pvmw->pte NULL (i.e. PMD-mapped).
+ * @folio: The folio being swapped out. Must be in the swap cache.
+ *
+ * This installs a PMD-level swap entry in place of a present PMD mapping,
+ * avoiding the need to split the PMD into PTE-level swap entries.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+		       struct folio *folio)
+{
+	struct vm_area_struct *vma = pvmw->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address = pvmw->address;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	struct page *page = folio_page(folio, 0);
+	bool anon_exclusive;
+	pmd_t pmdval;
+	swp_entry_t entry;
+	pmd_t pmdswp;
+
+	if (!(pvmw->pmd && !pvmw->pte))
+		return 0;
+
+	VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
+	VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
+
+	if (unlikely(folio_test_swapbacked(folio) !=
+		     folio_test_swapcache(folio))) {
+		WARN_ON_ONCE(1);
+		return -EBUSY;
+	}
+
+	flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+
+	pmdval = pmdp_invalidate(vma, haddr, pvmw->pmd);
+
+	/* Update high watermark before we lower rss */
+	update_hiwater_rss(mm);
+
+	if (folio_dup_swap(folio, NULL) < 0) {
+		set_pmd_at(mm, haddr, pvmw->pmd, pmdval);
+		return -ENOMEM;
+	}
+
+	/* See folio_try_share_anon_rmap_pmd(): invalidate PMD first.
*/
+	anon_exclusive = PageAnonExclusive(page);
+	if (anon_exclusive && folio_try_share_anon_rmap_pmd(folio, page)) {
+		folio_put_swap(folio, NULL);
+		set_pmd_at(mm, haddr, pvmw->pmd, pmdval);
+		return -EBUSY;
+	}
+
+	if (pmd_dirty(pmdval))
+		folio_mark_dirty(folio);
+
+	entry = folio->swap;
+	pmdswp = softleaf_to_pmd(entry);
+	if (pmd_soft_dirty(pmdval))
+		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
+	if (pmd_uffd_wp(pmdval))
+		pmdswp = pmd_swp_mkuffd_wp(pmdswp);
+	if (anon_exclusive)
+		pmdswp = pmd_swp_mkexclusive(pmdswp);
+	set_pmd_at(mm, haddr, pvmw->pmd, pmdswp);
+
+	folio_remove_rmap_pmd(folio, page, vma);
+	folio_put(folio);
+
+	count_vm_event(THP_SWPOUT_PMD);
+	return 0;
+}
+#endif /* CONFIG_THP_SWAP */
diff --git a/mm/rmap.c b/mm/rmap.c
index 057e18cb80b0..b188213648c5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2077,6 +2077,26 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				goto walk_abort;
 			}
 
+#ifdef CONFIG_THP_SWAP
+			/*
+			 * If the folio is in the swap cache and we're not
+			 * asked to split, install a PMD-level swap entry.
+			 */
+			if (!(flags & TTU_SPLIT_HUGE_PMD) &&
+			    folio_test_anon(folio) &&
+			    folio_test_swapcache(folio)) {
+				if (set_pmd_swap_entry(&pvmw, folio))
+					goto walk_abort;
+
+				ensure_on_mmlist(mm);
+				add_mm_counter(mm, MM_ANONPAGES,
+					       -HPAGE_PMD_NR);
+				add_mm_counter(mm, MM_SWAPENTS,
+					       HPAGE_PMD_NR);
+				goto walk_done;
+			}
+#endif
+
 			if (flags & TTU_SPLIT_HUGE_PMD) {
 				/*
 				 * We temporarily have to drop the PTL and
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd1b1aa12581..e895aaade8f2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -64,6 +64,7 @@
 
 #include
 #include
+#include
 
 #include "internal.h"
 #include "swap.h"
@@ -1330,7 +1331,18 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			enum ttu_flags flags = TTU_BATCH_FLUSH;
 			bool was_swapbacked = folio_test_swapbacked(folio);
 
-			if (folio_test_pmd_mappable(folio))
+			/*
+			 * With THP_SWAP, PMD-mappable folios already in the
+			 * swap cache can be unmapped with a PMD-level swap
+			 * entry, avoiding the cost of splitting the PMD.
+			 * Skip this when zswap has been enabled because
+			 * zswap stores pages individually and cannot
+			 * reconstruct a large folio on swap-in.
+			 */
+			if (folio_test_pmd_mappable(folio) &&
+			    !(IS_ENABLED(CONFIG_THP_SWAP) &&
+			      folio_test_swapcache(folio) &&
+			      zswap_never_enabled()))
 				flags |= TTU_SPLIT_HUGE_PMD;
 			/*
 			 * Without TTU_SYNC, try_to_unmap will only begin to
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f534972f517d..9b4963a7eb04 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1421,6 +1421,7 @@ const char * const vmstat_text[] = {
 	[I(THP_ZERO_PAGE_ALLOC_FAILED)] = "thp_zero_page_alloc_failed",
 	[I(THP_SWPOUT)] = "thp_swpout",
 	[I(THP_SWPOUT_FALLBACK)] = "thp_swpout_fallback",
+	[I(THP_SWPOUT_PMD)] = "thp_swpout_pmd",
 #endif
 #ifdef CONFIG_BALLOON
 	[I(BALLOON_INFLATE)] = "balloon_inflate",
-- 
2.52.0

From nobody Wed Jun 17 07:20:38 2026
Received: from out-171.mta0.migadu.com (out-171.mta0.migadu.com [91.218.175.171]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0D8493B52F0 for ; Mon, 27 Apr 2026 10:07:19 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.171
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777284442; cv=none; b=IEPsE+PFChjvDrZgh2k7UKk9P8B35LAVsPbo7c+JKf37ZHYENbAgYf3hmygQDzg0f/wKTUc4hTUpIfIMnhOkUDItBPAS6xZh+tGd6gh6Gt+vTjmdrjxAphxnTcB7XlkEj1+tZd7heWC7Zk8WM/ZE32qlw76sLvUXlL+Ra6w/sco=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777284442; c=relaxed/simple; bh=oEr8hMyMbk+VFnjWzuFoMS1tCB5ku9gnWNSl51mPVe4=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=V1rep04htDUSn0eSmKYoDNBhMWSgLtd1d6Iuu0WDCNAHKHLQKLmt3VMNnOCMkcc8I45cO0iGyyvZO1lfNf7S+a3b5Gh04ZV8n3yITpGVNnpdsbbdDT4YDqZ/ly58Cbg9QGtBOsCpX+hAuzopAMvU/HRF271tu9lmRwKFsO7ZqAs=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev
header.i=@linux.dev header.b=MX32xG9x; arc=none smtp.client-ip=91.218.175.171 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="MX32xG9x" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1777284438; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=LB9xxhYruP7Jnqp/VrL48F9nmOjxcoZB/RwHUFrTRKI=; b=MX32xG9xMPv3ghoGAZUBSVapS/Mo/l6iNnHWg7+gN/hDkPnTZGEFJLiGNUNCybiYZqkQzJ S0JiITy0pNzi20SSzCvg4geCgY9fiCoZ1m1jv7nVZoXHMgeN7jrYkO8zcPF+JX7lev7W2K nFjNjr1vixZBfaQplHPEmRWEOa6g/8Y= From: Usama Arif To: Andrew Morton , david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com Cc: bhe@redhat.com, willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam.Howlett@oracle.com, ryan.roberts@arm.com, Vlastimil Babka , lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif Subject: [PATCH 13/13] selftests/mm: add PMD swap entry tests Date: Mon, 27 Apr 2026 03:02:02 -0700 Message-ID: <20260427100553.2754667-14-usama.arif@linux.dev> In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev> References: <20260427100553.2754667-1-usama.arif@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: 
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
X-Migadu-Flow: FLOW_OUT

Exercise the PMD swap entry paths. The tests allocate a PMD-mapped THP,
write a known pattern, swap it out via MADV_PAGEOUT, and then exercise
different code paths:

- swap-out / swap-in round-trip with data verification
- fork with read-only access from both parent and child
- fork with writes in both processes to verify COW isolation
- repeated swap cycles to try and catch reference counting issues
- write fault on a swapped PMD to verify dirty handling
- munmap of a swapped PMD (zap_huge_pmd swap slot cleanup)
- mprotect on a swapped PMD (change_non_present_huge_pmd)
- mremap of a swapped PMD (move_soft_dirty_pmd)
- pagemap reading (pagemap_pmd_range_thp softleaf_has_pfn guard)
- MADV_FREE on a swapped PMD: verifies swap slots are freed via pagemap
  and the memory reads back as zero
- UFFDIO_MOVE on a swapped PMD (move_pages_huge_pmd swap path); verifies
  the entry transfers without splitting and that the destination faults
  back in as a THP
- swapoff with active PMD swap entries (unuse_pmd_range split)

Signed-off-by: Usama Arif
---
 tools/testing/selftests/mm/Makefile   |   1 +
 tools/testing/selftests/mm/pmd_swap.c | 607 ++++++++++++++++++++++++++
 2 files changed, 608 insertions(+)
 create mode 100644 tools/testing/selftests/mm/pmd_swap.c

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index cd24596cdd27..3c753dba863f 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -106,6 +106,7 @@ TEST_GEN_FILES += guard-regions
 TEST_GEN_FILES += merge
 TEST_GEN_FILES += rmap
 TEST_GEN_FILES += folio_split_race_test
+TEST_GEN_FILES += pmd_swap
 
 ifneq ($(ARCH),arm64)
 TEST_GEN_FILES += soft-dirty
diff --git a/tools/testing/selftests/mm/pmd_swap.c b/tools/testing/selftests/mm/pmd_swap.c
new file mode 100644
index 000000000000..28147ddd824c
--- /dev/null
+++
b/tools/testing/selftests/mm/pmd_swap.c
@@ -0,0 +1,607 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test PMD-level swap entries.
+ *
+ * Verifies that when a PMD-mapped THP is swapped out the kernel installs
+ * a single PMD-level swap entry (instead of splitting into 512 PTE-level
+ * entries), and that operations on the swapped region behave correctly:
+ *   basic        - swap out + swap in preserves data
+ *   fork         - parent and child both see the data
+ *   fork_cow     - COW after fork keeps parent's data isolated
+ *   cycles       - repeated swap out/in does not corrupt data
+ *   write        - faulting in via a write keeps the rest of the THP
+ *   munmap       - munmap on a PMD swap entry frees swap slots cleanly
+ *   mprotect     - mprotect on a PMD swap entry preserves data
+ *   mremap       - mremap on a PMD swap entry preserves data
+ *   pagemap      - pagemap reports the entries as swapped
+ *   madvise_free - MADV_FREE on a PMD swap entry does not crash
+ *   uffdio_move  - UFFDIO_MOVE moves a PMD swap entry whole-PMD
+ *   swapoff      - swapoff faults the THP back in (needs PMD_SWAP_DEVICE)
+ */
+#define _GNU_SOURCE
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "kselftest_harness.h"
+#include "vm_util.h"
+
+static bool check_swapped(int pagemap_fd, char *addr, unsigned long size)
+{
+	unsigned long off;
+
+	for (off = 0; off < size; off += getpagesize())
+		if (!pagemap_is_swapped(pagemap_fd, addr + off))
+			return false;
+	return true;
+}
+
+static bool swap_available(int pagemap_fd)
+{
+	char *p;
+	bool ret;
+
+	p = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
+		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (p == MAP_FAILED)
+		return false;
+
+	memset(p, 0xab, getpagesize());
+	madvise(p, getpagesize(), MADV_PAGEOUT);
+	ret = pagemap_is_swapped(pagemap_fd, p);
+	munmap(p, getpagesize());
+	return ret;
+}
+
+static unsigned long read_vm_event(const char *name)
+{
+	char line[256];
+	size_t
name_len = strlen(name);
+	unsigned long val = 0;
+	FILE *f;
+
+	f = fopen("/proc/vmstat", "r");
+	if (!f)
+		return 0;
+	while (fgets(line, sizeof(line), f)) {
+		if (!strncmp(line, name, name_len) && line[name_len] == ' ') {
+			val = strtoul(line + name_len + 1, NULL, 10);
+			break;
+		}
+	}
+	fclose(f);
+	return val;
+}
+
+static unsigned int random_seed(void)
+{
+	unsigned int seed;
+
+	if (getrandom(&seed, sizeof(seed), 0) != sizeof(seed))
+		seed = (unsigned int)time(NULL);
+	return seed;
+}
+
+static unsigned char pattern_byte(unsigned int seed, unsigned long off)
+{
+	return (unsigned char)(seed + off);
+}
+
+static void fill_pattern(char *buf, unsigned long size, unsigned int seed)
+{
+	unsigned long i;
+
+	for (i = 0; i < size; i++)
+		buf[i] = (char)pattern_byte(seed, i);
+}
+
+static bool verify_pattern(char *buf, unsigned long size, unsigned int seed)
+{
+	unsigned long i;
+
+	for (i = 0; i < size; i++)
+		if ((unsigned char)buf[i] != pattern_byte(seed, i))
+			return false;
+	return true;
+}
+
+/*
+ * mmap a PMD-sized region, request THP, fill with a pattern, and swap
+ * it out. Verifies via the thp_swpout_pmd vmstat counter that the
+ * swap-out installed a PMD swap entry rather than splitting to PTEs.
+ */
+static char *alloc_fill_swap_thp(unsigned long pmd_size, int pagemap_fd,
+				 unsigned int seed)
+{
+	unsigned long pmd_before, pmd_after;
+	char *mem;
+
+	mem = mmap(NULL, pmd_size, PROT_READ | PROT_WRITE,
+		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (mem == MAP_FAILED)
+		return MAP_FAILED;
+
+	madvise(mem, pmd_size, MADV_HUGEPAGE);
+	fill_pattern(mem, pmd_size, seed);
+
+	pmd_before = read_vm_event("thp_swpout_pmd");
+
+	if (madvise(mem, pmd_size, MADV_PAGEOUT) ||
+	    !check_swapped(pagemap_fd, mem, pmd_size)) {
+		munmap(mem, pmd_size);
+		return MAP_FAILED;
+	}
+
+	pmd_after = read_vm_event("thp_swpout_pmd");
+	printf("# thp_swpout_pmd: %lu -> %lu\n", pmd_before, pmd_after);
+	if (pmd_after - pmd_before < 1) {
+		munmap(mem, pmd_size);
+		return MAP_FAILED;
+	}
+	return mem;
+}
+
+FIXTURE(pmd_swap)
+{
+	unsigned long pmd_size;
+	int pagemap_fd;
+	unsigned int seed;
+};
+
+FIXTURE_SETUP(pmd_swap)
+{
+	self->pagemap_fd = -1;
+
+	self->pmd_size = read_pmd_pagesize();
+	if (!self->pmd_size)
+		SKIP(return, "Cannot determine PMD size\n");
+
+	self->pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+	if (self->pagemap_fd < 0)
+		SKIP(return, "Cannot open /proc/self/pagemap\n");
+
+	if (!swap_available(self->pagemap_fd))
+		SKIP(return, "Swap not available or not working\n");
+
+	self->seed = random_seed();
+}
+
+FIXTURE_TEARDOWN(pmd_swap)
+{
+	if (self->pagemap_fd >= 0)
+		close(self->pagemap_fd);
+}
+
+/*
+ * Allocate a PMD-sized THP, write a pattern, swap it out, read it back,
+ * verify the pattern.
+ */
+TEST_F(pmd_swap, basic)
+{
+	char *mem;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	ASSERT_TRUE(verify_pattern(mem, self->pmd_size, self->seed));
+
+	munmap(mem, self->pmd_size);
+}
+
+/*
+ * Allocate a THP, swap it out, fork, verify both parent and child see
+ * the correct data.
+ */
+TEST_F(pmd_swap, fork)
+{
+	char *mem;
+	pid_t pid;
+	int status;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		_exit(verify_pattern(mem, self->pmd_size, self->seed) ? 0 : 1);
+	}
+
+	ASSERT_TRUE(verify_pattern(mem, self->pmd_size, self->seed));
+
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+	ASSERT_TRUE(WIFEXITED(status));
+	ASSERT_EQ(WEXITSTATUS(status), 0);
+
+	munmap(mem, self->pmd_size);
+}
+
+/*
+ * Swap out, fork, then have parent and child write different patterns.
+ * Exercises COW on shared PMD swap entries: writes after fork must
+ * trigger copy-on-write so the parent's data stays isolated.
+ */
+TEST_F(pmd_swap, fork_cow)
+{
+	unsigned int parent_seed = self->seed;
+	unsigned int child_seed = ~self->seed;
+	char *mem;
+	pid_t pid;
+	int status;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, parent_seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		fill_pattern(mem, self->pmd_size, child_seed);
+		_exit(verify_pattern(mem, self->pmd_size, child_seed) ? 0 : 1);
+	}
+
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+
+	ASSERT_TRUE(verify_pattern(mem, self->pmd_size, parent_seed));
+	ASSERT_TRUE(WIFEXITED(status));
+	ASSERT_EQ(WEXITSTATUS(status), 0);
+
+	munmap(mem, self->pmd_size);
+}
+
+/*
+ * Swap a THP out and in repeatedly without data corruption.
+ */
+TEST_F(pmd_swap, cycles)
+{
+	const int num_cycles = 5;
+	char *mem;
+	int cycle;
+
+	for (cycle = 0; cycle < num_cycles; cycle++) {
+		unsigned int seed = self->seed + cycle;
+
+		mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+		if (mem == MAP_FAILED)
+			SKIP(return, "Could not create swapped THP at cycle %d\n",
+			     cycle);
+
+		ASSERT_TRUE(verify_pattern(mem, self->pmd_size, seed));
+
+		munmap(mem, self->pmd_size);
+	}
+}
+
+/*
+ * Swap out, fault in via a write to the first page, verify the write
+ * sticks and the rest of the THP is preserved.
+ */
+TEST_F(pmd_swap, write)
+{
+	unsigned int seed = self->seed;
+	char *mem;
+	unsigned long i;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	mem[0] = 0xbb;
+	ASSERT_EQ(mem[0], (char)0xbb);
+
+	for (i = 1; i < self->pmd_size; i++)
+		ASSERT_EQ((unsigned char)mem[i], pattern_byte(seed, i));
+
+	munmap(mem, self->pmd_size);
+}
+
+/*
+ * munmap while the folio is swapped out. Exercises zap_huge_pmd() on a
+ * PMD swap entry; it must free the swap slots without trying to look up
+ * a folio.
+ */
+TEST_F(pmd_swap, munmap)
+{
+	char *mem;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	munmap(mem, self->pmd_size);
+}
+
+/*
+ * Change protection on a swapped PMD entry, then fault back in and
+ * verify data. Exercises change_non_present_huge_pmd().
+ */
+TEST_F(pmd_swap, mprotect)
+{
+	unsigned int seed = self->seed;
+	char *mem;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	ASSERT_EQ(mprotect(mem, self->pmd_size, PROT_READ), 0);
+	ASSERT_EQ(mprotect(mem, self->pmd_size, PROT_READ | PROT_WRITE), 0);
+
+	ASSERT_TRUE(verify_pattern(mem, self->pmd_size, seed));
+
+	munmap(mem, self->pmd_size);
+}
+
+/*
+ * mmap an anonymous PMD-aligned region of pmd_size bytes. Over-allocates
+ * by one PMD and trims the unaligned head/tail so the returned address is
+ * PMD-aligned (required for whole-PMD UFFDIO_MOVE).
+ */
+static char *mmap_pmd_aligned(unsigned long pmd_size)
+{
+	unsigned long pad = pmd_size;
+	char *raw, *aligned;
+
+	raw = mmap(NULL, pmd_size + pad, PROT_READ | PROT_WRITE,
+		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (raw == MAP_FAILED)
+		return MAP_FAILED;
+
+	aligned = (char *)(((uintptr_t)raw + pmd_size - 1) & ~(pmd_size - 1));
+	if (aligned != raw)
+		munmap(raw, aligned - raw);
+	if (aligned + pmd_size != raw + pmd_size + pad)
+		munmap(aligned + pmd_size,
+		       (raw + pmd_size + pad) - (aligned + pmd_size));
+	return aligned;
+}
+
+/*
+ * UFFDIO_MOVE a PMD swap entry from src to a registered dst. Exercises
+ * move_pages_huge_pmd() handling of pmd_is_swap_entry: the whole PMD swap
+ * entry must move to dst without splitting, and the destination must
+ * read back the original pattern after a swap-in fault.
+ */
+TEST_F(pmd_swap, uffdio_move)
+{
+	unsigned int seed = self->seed;
+	struct uffdio_register reg = {};
+	struct uffdio_move move = {};
+	struct uffdio_api api = {};
+	char *src, *dst;
+	int uffd;
+
+	dst = mmap_pmd_aligned(self->pmd_size);
+	if (dst == MAP_FAILED)
+		SKIP(return, "Could not mmap aligned dst\n");
+
+	src = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+	if (src == MAP_FAILED) {
+		munmap(dst, self->pmd_size);
+		SKIP(return, "Could not create swapped THP\n");
+	}
+	if ((uintptr_t)src & (self->pmd_size - 1)) {
+		munmap(src, self->pmd_size);
+		munmap(dst, self->pmd_size);
+		SKIP(return, "src not PMD-aligned\n");
+	}
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+	if (uffd < 0) {
+		munmap(src, self->pmd_size);
+		munmap(dst, self->pmd_size);
+		SKIP(return, "userfaultfd unavailable\n");
+	}
+
+	api.api = UFFD_API;
+	api.features = UFFD_FEATURE_MOVE;
+	if (ioctl(uffd, UFFDIO_API, &api) ||
+	    !(api.features & UFFD_FEATURE_MOVE)) {
+		close(uffd);
+		munmap(src, self->pmd_size);
+		munmap(dst, self->pmd_size);
+		SKIP(return, "UFFD_FEATURE_MOVE unsupported\n");
+	}
+
+	reg.range.start = (unsigned long)dst;
+	reg.range.len = self->pmd_size;
+	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
+	if (ioctl(uffd, UFFDIO_REGISTER, &reg)) {
+		close(uffd);
+		munmap(src, self->pmd_size);
+		munmap(dst, self->pmd_size);
+		SKIP(return, "UFFDIO_REGISTER failed\n");
+	}
+
+	move.dst = (unsigned long)dst;
+	move.src = (unsigned long)src;
+	move.len = self->pmd_size;
+	if (ioctl(uffd, UFFDIO_MOVE, &move)) {
+		close(uffd);
+		munmap(src, self->pmd_size);
+		munmap(dst, self->pmd_size);
+		ASSERT_EQ(errno, 0);
+	}
+	ASSERT_EQ(move.move, self->pmd_size);
+
+	/*
+	 * dst inherits the PMD swap entry; reading it must fault the THP
+	 * back in via do_huge_pmd_swap_page() and yield the original data.
+	 */
+	ASSERT_TRUE(check_swapped(self->pagemap_fd, dst, self->pmd_size));
+	ASSERT_TRUE(verify_pattern(dst, self->pmd_size, seed));
+	/* The whole-PMD path must reinstate a THP, not 512 PTE folios. */
+	ASSERT_TRUE(check_huge_anon(dst, 1, self->pmd_size));
+
+	close(uffd);
+	munmap(src, self->pmd_size);
+	munmap(dst, self->pmd_size);
+}
+
+/*
+ * Move a swapped PMD entry to a new address, fault in, verify data.
+ * Exercises move_huge_pmd() and move_soft_dirty_pmd().
+ */
+TEST_F(pmd_swap, mremap)
+{
+	unsigned int seed = self->seed;
+	char *mem, *new_mem;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	new_mem = mremap(mem, self->pmd_size, self->pmd_size, MREMAP_MAYMOVE);
+	if (new_mem == MAP_FAILED) {
+		munmap(mem, self->pmd_size);
+		ASSERT_NE(new_mem, MAP_FAILED);
+	}
+
+	ASSERT_TRUE(verify_pattern(new_mem, self->pmd_size, seed));
+
+	munmap(new_mem, self->pmd_size);
+}
+
+/*
+ * Read /proc/self/pagemap on a PMD swap entry. Exercises the pagemap
+ * PMD walker which must handle PMD swap entries without trying to
+ * convert them to a page via softleaf_to_page().
+ */
+TEST_F(pmd_swap, pagemap)
+{
+	char *mem;
+	uint64_t entry;
+	unsigned long off;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	for (off = 0; off < self->pmd_size; off += getpagesize()) {
+		entry = pagemap_get_entry(self->pagemap_fd, mem + off);
+		/* Bit 62 = swapped */
+		ASSERT_TRUE(entry & (1ULL << 62));
+	}
+
+	munmap(mem, self->pmd_size);
+}
+
+/*
+ * MADV_FREE on a swapped-out PMD must free the swap slots and clear the
+ * entry. After the call, pagemap must no longer report the pages as
+ * swapped, and accessing the region must yield zero pages.
+ */
+TEST_F(pmd_swap, madvise_free)
+{
+	char *mem;
+	unsigned long i;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, self->seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	ASSERT_TRUE(check_swapped(self->pagemap_fd, mem, self->pmd_size));
+	ASSERT_EQ(madvise(mem, self->pmd_size, MADV_FREE), 0);
+	ASSERT_FALSE(check_swapped(self->pagemap_fd, mem, self->pmd_size));
+
+	for (i = 0; i < self->pmd_size; i += getpagesize())
+		ASSERT_EQ(mem[i], 0);
+
+	munmap(mem, self->pmd_size);
+}
+
+/*
+ * swapoff requires a dedicated swap device path. Use a separate fixture
+ * that picks the device up from the PMD_SWAP_DEVICE environment variable
+ * and skips when unset.
+ */
+FIXTURE(pmd_swap_swapoff)
+{
+	unsigned long pmd_size;
+	int pagemap_fd;
+	const char *swap_dev;
+	unsigned int seed;
+};
+
+FIXTURE_SETUP(pmd_swap_swapoff)
+{
+	self->pagemap_fd = -1;
+	self->swap_dev = getenv("PMD_SWAP_DEVICE");
+	if (!self->swap_dev)
+		SKIP(return, "PMD_SWAP_DEVICE env var not set\n");
+
+	self->pmd_size = read_pmd_pagesize();
+	if (!self->pmd_size)
+		SKIP(return, "Cannot determine PMD size\n");
+
+	self->pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+	if (self->pagemap_fd < 0)
+		SKIP(return, "Cannot open /proc/self/pagemap\n");
+
+	if (!swap_available(self->pagemap_fd))
+		SKIP(return, "Swap not available or not working\n");
+
+	self->seed = random_seed();
+}
+
+FIXTURE_TEARDOWN(pmd_swap_swapoff)
+{
+	if (self->pagemap_fd >= 0)
+		close(self->pagemap_fd);
+}
+
+/*
+ * Swap out a THP, then turn off swap. The kernel must fault the entire
+ * THP back in via unuse_pmd(), preserving the huge mapping. Verify data
+ * is intact and the THP mapping is preserved.
+ */
+TEST_F(pmd_swap_swapoff, basic)
+{
+	unsigned int seed = self->seed;
+	char *mem;
+
+	mem = alloc_fill_swap_thp(self->pmd_size, self->pagemap_fd, seed);
+	if (mem == MAP_FAILED)
+		SKIP(return, "Could not create swapped THP\n");
+
+	if (swapoff(self->swap_dev)) {
+		munmap(mem, self->pmd_size);
+		ASSERT_EQ(swapoff(self->swap_dev), 0);
+	}
+
+	if (!verify_pattern(mem, self->pmd_size, seed)) {
+		swapon(self->swap_dev, 0);
+		munmap(mem, self->pmd_size);
+		ASSERT_TRUE(verify_pattern(mem, self->pmd_size, seed));
+	}
+
+	if (!check_huge_anon(mem, 1, self->pmd_size)) {
+		swapon(self->swap_dev, 0);
+		munmap(mem, self->pmd_size);
+		ASSERT_TRUE(check_huge_anon(mem, 1, self->pmd_size));
+	}
+
+	if (swapon(self->swap_dev, 0))
+		fprintf(stderr, "Warning: swapon failed: %s\n",
+			strerror(errno));
+
+	munmap(mem, self->pmd_size);
+}
+
+TEST_HARNESS_MAIN
-- 
2.52.0