From nobody Wed Apr 1 09:44:32 2026
From: Ryan Roberts
To: Catalin Marinas, Will Deacon, "David Hildenbrand (Arm)", Dev Jain,
	Yang Shi, Suzuki K Poulose, Jinjiang Tu, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org
Subject: [PATCH v2 1/3] arm64: mm: Fix rodata=full block mapping support for realm guests
Date: Mon, 30 Mar 2026 17:17:02 +0100
Message-ID: <20260330161705.3349825-2-ryan.roberts@arm.com>
In-Reply-To: <20260330161705.3349825-1-ryan.roberts@arm.com>
References: <20260330161705.3349825-1-ryan.roberts@arm.com>

Commit a166563e7ec37 ("arm64: mm: support large block mapping when
rodata=full") enabled the linear map to be mapped by block/cont entries
while still allowing granular permission changes on BBML2_NOABORT
systems by lazily splitting the live mappings. This mechanism was
intended to be usable by realm guests, since they need to dynamically
share DMA buffers with the host by "decrypting" them - which, for Arm
CCA, means marking them as shared in the page tables.

However, it turns out that the mechanism was failing for realm guests
because realms need to share their DMA buffers (via
__set_memory_enc_dec()) much earlier during boot than
split_kernel_leaf_mapping() was able to handle. The report linked below
showed that the GIC's ITS was one such user, but during the
investigation I found other callsites that could not meet the
split_kernel_leaf_mapping() constraints.
The problem is that we block-map the linear map based on the boot CPU
supporting BBML2_NOABORT, then check that all the other CPUs support it
too when finalizing the caps. If they don't, we stop_machine() and
split to ptes. For safety, split_kernel_leaf_mapping() previously
wouldn't permit splitting until after the caps were finalized. That
ensured that if any secondary cpus that didn't support BBML2_NOABORT
were running, we wouldn't risk breaking them.

I've fixed this problem by reducing the black-out period where we
refuse to split; there are now two windows. The first runs from T0
until the page allocator is initialized: splitting allocates memory
from the page allocator, so it must be available. The second covers the
period from starting to online the secondary cpus until the system caps
are finalized (a very small window). All of the problematic callers
call __set_memory_enc_dec() before the secondary cpus come online, so
this solves the problem.

However, one of these callers, swiotlb_update_mem_attributes(), was
trying to split before the page allocator was initialized. So I have
moved this call from arch_mm_preinit() to mem_init(), which solves the
ordering issue.

I've added warnings, and we now return an error, if any attempt is made
to split during the black-out windows. Note there are other issues
which prevent booting all the way to user space; these will be fixed in
subsequent patches.
Reported-by: Jinjiang Tu
Closes: https://lore.kernel.org/all/0b2a4ae5-fc51-4d77-b177-b2e9db74f11d@huawei.com/
Fixes: a166563e7ec37 ("arm64: mm: support large block mapping when rodata=full")
Cc: stable@vger.kernel.org
Reviewed-by: Kevin Brodsky
Signed-off-by: Ryan Roberts
Reviewed-by: Suzuki K Poulose
Tested-by: Suzuki K Poulose
---
 arch/arm64/include/asm/mmu.h |  2 ++
 arch/arm64/mm/init.c         |  9 +++++++-
 arch/arm64/mm/mmu.c          | 45 +++++++++++++++++++++++++-----------
 3 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 137a173df1ff8..472610433aaea 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -112,5 +112,7 @@ void kpti_install_ng_mappings(void);
 static inline void kpti_install_ng_mappings(void) {}
 #endif
 
+extern bool page_alloc_available;
+
 #endif /* !__ASSEMBLER__ */
 #endif
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 96711b8578fd0..b9b248d24fd10 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -350,7 +350,6 @@ void __init arch_mm_preinit(void)
 	}
 
 	swiotlb_init(swiotlb, flags);
-	swiotlb_update_mem_attributes();
 
 	/*
 	 * Check boundaries twice: Some fundamental inconsistencies can be
@@ -377,6 +376,14 @@ void __init arch_mm_preinit(void)
 	}
 }
 
+bool page_alloc_available __ro_after_init;
+
+void __init mem_init(void)
+{
+	page_alloc_available = true;
+	swiotlb_update_mem_attributes();
+}
+
 void free_initmem(void)
 {
 	void *lm_init_begin = lm_alias(__init_begin);
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index a6a00accf4f93..223947487a223 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -768,30 +768,51 @@ static inline bool force_pte_mapping(void)
 }
 
 static DEFINE_MUTEX(pgtable_split_lock);
+static bool linear_map_requires_bbml2;
 
 int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
 {
 	int ret;
 
-	/*
-	 * !BBML2_NOABORT systems should not be trying to change permissions on
-	 * anything that is not pte-mapped in the first place. Just return early
-	 * and let the permission change code raise a warning if not already
-	 * pte-mapped.
-	 */
-	if (!system_supports_bbml2_noabort())
-		return 0;
-
 	/*
 	 * If the region is within a pte-mapped area, there is no need to try to
 	 * split. Additionally, CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE may
 	 * change permissions from atomic context so for those cases (which are
 	 * always pte-mapped), we must not go any further because taking the
-	 * mutex below may sleep.
+	 * mutex below may sleep. Do not call force_pte_mapping() here because
+	 * it could return a confusing result if called from a secondary cpu
+	 * prior to finalizing caps. Instead, linear_map_requires_bbml2 gives us
+	 * what we need.
	 */
-	if (force_pte_mapping() || is_kfence_address((void *)start))
+	if (!linear_map_requires_bbml2 || is_kfence_address((void *)start))
 		return 0;
 
+	if (!system_supports_bbml2_noabort()) {
+		/*
+		 * !BBML2_NOABORT systems should not be trying to change
+		 * permissions on anything that is not pte-mapped in the first
+		 * place. Just return early and let the permission change code
+		 * raise a warning if not already pte-mapped.
+		 */
+		if (system_capabilities_finalized())
+			return 0;
+
+		/*
+		 * Boot-time: split_kernel_leaf_mapping_locked() allocates from
+		 * page allocator. Can't split until it's available.
+		 */
+		if (WARN_ON(!page_alloc_available))
+			return -EBUSY;
+
+		/*
+		 * Boot-time: Started secondary cpus but don't know if they
+		 * support BBML2_NOABORT yet. Can't allow splitting in this
+		 * window in case they don't.
+		 */
+		if (WARN_ON(num_online_cpus() > 1))
+			return -EBUSY;
+	}
+
 	/*
 	 * Ensure start and end are at least page-aligned since this is the
	 * finest granularity we can split to.
	 */
@@ -891,8 +912,6 @@ static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp)
 	return ret;
 }
 
-static bool linear_map_requires_bbml2 __initdata;
-
 u32 idmap_kpti_bbml2_flag;
 
 static void __init init_idmap_kpti_bbml2_flag(void)
-- 
2.43.0

From nobody Wed Apr 1 09:44:32 2026
From: Ryan Roberts
To: Catalin Marinas, Will Deacon, "David Hildenbrand (Arm)", Dev Jain,
	Yang Shi, Suzuki K Poulose, Jinjiang Tu, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org
Subject: [PATCH v2 2/3] arm64: mm: Handle invalid large leaf mappings correctly
Date: Mon, 30 Mar 2026 17:17:03 +0100
Message-ID: <20260330161705.3349825-3-ryan.roberts@arm.com>
In-Reply-To: <20260330161705.3349825-1-ryan.roberts@arm.com>
References: <20260330161705.3349825-1-ryan.roberts@arm.com>

It has been possible for a long time to mark ptes in the linear map as
invalid. This is done for secretmem, kfence, realm DMA memory un/share,
and others, by simply clearing the PTE_VALID bit. But until commit
a166563e7ec37 ("arm64: mm: support large block mapping when
rodata=full"), large leaf mappings were never made invalid in this way.
It turns out various parts of the code base are not equipped to handle
invalid large leaf mappings (in the way they are currently encoded) and
I've observed a kernel panic while booting a realm guest on a
BBML2_NOABORT system as a result:

[ 15.432706] software IO TLB: Memory encryption is active and system is using DMA bounce buffers
[ 15.476896] Unable to handle kernel paging request at virtual address ffff000019600000
[ 15.513762] Mem abort info:
[ 15.527245]   ESR = 0x0000000096000046
[ 15.548553]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 15.572146]   SET = 0, FnV = 0
[ 15.592141]   EA = 0, S1PTW = 0
[ 15.612694]   FSC = 0x06: level 2 translation fault
[ 15.640644] Data abort info:
[ 15.661983]   ISV = 0, ISS = 0x00000046, ISS2 = 0x00000000
[ 15.694875]   CM = 0, WnR = 1, TnD = 0, TagAccess = 0
[ 15.723740]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 15.755776] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000081f3f000
[ 15.800410] [ffff000019600000] pgd=0000000000000000, p4d=180000009ffff403, pud=180000009fffe403, pmd=00e8000199600704
[ 15.855046] Internal error: Oops: 0000000096000046 [#1] SMP
[ 15.886394] Modules linked in:
[ 15.900029] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc4-dirty #4 PREEMPT
[ 15.935258] Hardware name: linux,dummy-virt (DT)
[ 15.955612] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 15.986009] pc : __pi_memcpy_generic+0x128/0x22c
[ 16.006163] lr : swiotlb_bounce+0xf4/0x158
[ 16.024145] sp : ffff80008000b8f0
[ 16.038896] x29: ffff80008000b8f0 x28: 0000000000000000 x27: 0000000000000000
[ 16.069953] x26: ffffb3976d261ba8 x25: 0000000000000000 x24: ffff000019600000
[ 16.100876] x23: 0000000000000001 x22: ffff0000043430d0 x21: 0000000000007ff0
[ 16.131946] x20: 0000000084570010 x19: 0000000000000000 x18: ffff00001ffe3fcc
[ 16.163073] x17: 0000000000000000 x16: 00000000003fffff x15: 646e612065766974
[ 16.194131] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 16.225059] x11: 0000000000000000 x10: 0000000000000010 x9 : 0000000000000018
[ 16.256113] x8 : 0000000000000018 x7 : 0000000000000000 x6 : 0000000000000000
[ 16.287203] x5 : ffff000019607ff0 x4 : ffff000004578000 x3 : ffff000019600000
[ 16.318145] x2 : 0000000000007ff0 x1 : ffff000004570010 x0 : ffff000019600000
[ 16.349071] Call trace:
[ 16.360143]  __pi_memcpy_generic+0x128/0x22c (P)
[ 16.380310]  swiotlb_tbl_map_single+0x154/0x2b4
[ 16.400282]  swiotlb_map+0x5c/0x228
[ 16.415984]  dma_map_phys+0x244/0x2b8
[ 16.432199]  dma_map_page_attrs+0x44/0x58
[ 16.449782]  virtqueue_map_page_attrs+0x38/0x44
[ 16.469596]  virtqueue_map_single_attrs+0xc0/0x130
[ 16.490509]  virtnet_rq_alloc.isra.0+0xa4/0x1fc
[ 16.510355]  try_fill_recv+0x2a4/0x584
[ 16.526989]  virtnet_open+0xd4/0x238
[ 16.542775]  __dev_open+0x110/0x24c
[ 16.558280]  __dev_change_flags+0x194/0x20c
[ 16.576879]  netif_change_flags+0x24/0x6c
[ 16.594489]  dev_change_flags+0x48/0x7c
[ 16.611462]  ip_auto_config+0x258/0x1114
[ 16.628727]  do_one_initcall+0x80/0x1c8
[ 16.645590]  kernel_init_freeable+0x208/0x2f0
[ 16.664917]  kernel_init+0x24/0x1e0
[ 16.680295]  ret_from_fork+0x10/0x20
[ 16.696369] Code: 927cec03 cb0e0021 8b0e0042 a9411c26 (a900340c)
[ 16.723106] ---[ end trace 0000000000000000 ]---
[ 16.752866] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 16.792556] Kernel Offset: 0x3396ea200000 from 0xffff800080000000
[ 16.818966] PHYS_OFFSET: 0xfff1000080000000
[ 16.837237] CPU features: 0x0000000,00060005,13e38581,957e772f
[ 16.862904] Memory Limit: none
[ 16.876526] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

This panic occurs because the swiotlb memory was previously shared to
the host (__set_memory_enc_dec()), which involves transitioning the
(large) leaf mappings to invalid, sharing to the host, then marking the
mappings valid again.
But pageattr_p[mu]d_entry() would only update the entry if it was a
section mapping; otherwise it concluded it must be a table entry that
shouldn't be modified. And p[mu]d_sect() only returns true if the entry
is valid. So the result was that the large leaf entry was made invalid
in the first pass, then ignored in the second pass. It remained invalid
until the above code tried to access it and blew up.

The simple fix would be to update pageattr_pmd_entry() to use
!pmd_table() instead of pmd_sect(). That would solve this problem. But
the ptdump code also suffers from a similar issue: it checks pmd_leaf()
and doesn't call into the arch-specific note_page() machinery if it
returns false. As a result, ptdump wasn't even able to show the invalid
large leaf mappings; it looked like they were valid, which made this
super fun to debug. The ptdump code is core-mm and pmd_table() is
arm64-specific, so we can't use the same trick to solve that.

But we already support the concept of "present-invalid" for user space
entries. And even better, pmd_leaf() will return true for a leaf
mapping that is marked present-invalid. So let's just use that encoding
for present-invalid kernel mappings too. Then we can use pmd_leaf()
where we previously used pmd_sect() and everything is magically fixed.

Additionally, from inspection, kernel_page_present() was broken in a
similar way, so I'm also updating that to use pmd_leaf().

The transitional page tables component was also similarly broken; it
creates a copy of the kernel page tables, making RO leaf mappings RW in
the process. It also makes invalid (but-not-none) pte mappings valid.
But it was not doing this for large leaf mappings. This could have
resulted in crashes at kexec- or hibernate-time. This code is fixed to
flip "present-invalid" mappings back to "present-valid" at all levels.
Finally, I have hardened split_pmd()/split_pud() so that if it is
passed a "present-invalid" leaf, it will maintain that property in the
split leaves, since I wasn't able to convince myself that it would only
ever be called for "present-valid" leaves.

Fixes: a166563e7ec37 ("arm64: mm: support large block mapping when rodata=full")
Cc: stable@vger.kernel.org
Signed-off-by: Ryan Roberts
---
 arch/arm64/include/asm/pgtable-prot.h |  2 ++
 arch/arm64/include/asm/pgtable.h      |  9 +++--
 arch/arm64/mm/mmu.c                   |  4 +++
 arch/arm64/mm/pageattr.c              | 50 +++++++++++++++------------
 arch/arm64/mm/trans_pgd.c             | 42 ++++------------------
 5 files changed, 48 insertions(+), 59 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index f560e64202674..212ce1b02e15e 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -25,6 +25,8 @@
  */
 #define PTE_PRESENT_INVALID	(PTE_NG)	/* only when !PTE_VALID */
 
+#define PTE_PRESENT_VALID_KERNEL	(PTE_VALID | PTE_MAYBE_NG)
+
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 #define PTE_UFFD_WP		(_AT(pteval_t, 1) << 58) /* uffd-wp tracking */
 #define PTE_SWP_UFFD_WP		(_AT(pteval_t, 1) << 3)	 /* only for swp ptes */
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b3e58735c49bd..dd062179b9b66 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -322,9 +322,11 @@ static inline pte_t pte_mknoncont(pte_t pte)
 	return clear_pte_bit(pte, __pgprot(PTE_CONT));
 }
 
-static inline pte_t pte_mkvalid(pte_t pte)
+static inline pte_t pte_mkvalid_k(pte_t pte)
 {
-	return set_pte_bit(pte, __pgprot(PTE_VALID));
+	pte = clear_pte_bit(pte, __pgprot(PTE_PRESENT_INVALID));
+	pte = set_pte_bit(pte, __pgprot(PTE_PRESENT_VALID_KERNEL));
+	return pte;
 }
 
 static inline pte_t pte_mkinvalid(pte_t pte)
@@ -594,6 +596,7 @@ static inline int pmd_protnone(pmd_t pmd)
 #define pmd_mkclean(pmd)	pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
+#define pmd_mkvalid_k(pmd)	pte_pmd(pte_mkvalid_k(pmd_pte(pmd)))
 #define pmd_mkinvalid(pmd)	pte_pmd(pte_mkinvalid(pmd_pte(pmd)))
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 #define pmd_uffd_wp(pmd)	pte_uffd_wp(pmd_pte(pmd))
@@ -635,6 +638,8 @@ static inline pmd_t pmd_mkspecial(pmd_t pmd)
 
 #define pud_young(pud)		pte_young(pud_pte(pud))
 #define pud_mkyoung(pud)	pte_pud(pte_mkyoung(pud_pte(pud)))
+#define pud_mkwrite_novma(pud)	pte_pud(pte_mkwrite_novma(pud_pte(pud)))
+#define pud_mkvalid_k(pud)	pte_pud(pte_mkvalid_k(pud_pte(pud)))
 #define pud_write(pud)		pte_write(pud_pte(pud))
 
 static inline pud_t pud_mkhuge(pud_t pud)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 223947487a223..1575680675d8d 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -602,6 +602,8 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
 		tableprot |= PMD_TABLE_PXN;
 
 	prot = __pgprot((pgprot_val(prot) & ~PTE_TYPE_MASK) | PTE_TYPE_PAGE);
+	if (!pmd_valid(pmd))
+		prot = pte_pgprot(pte_mkinvalid(pfn_pte(0, prot)));
 	prot = __pgprot(pgprot_val(prot) & ~PTE_CONT);
 	if (to_cont)
 		prot = __pgprot(pgprot_val(prot) | PTE_CONT);
@@ -647,6 +649,8 @@ static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
 		tableprot |= PUD_TABLE_PXN;
 
 	prot = __pgprot((pgprot_val(prot) & ~PMD_TYPE_MASK) | PMD_TYPE_SECT);
+	if (!pud_valid(pud))
+		prot = pmd_pgprot(pmd_mkinvalid(pfn_pmd(0, prot)));
 	prot = __pgprot(pgprot_val(prot) & ~PTE_CONT);
 	if (to_cont)
 		prot = __pgprot(pgprot_val(prot) | PTE_CONT);
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 358d1dc9a576f..ce035e1b4eaf6 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -25,6 +25,11 @@ static ptdesc_t set_pageattr_masks(ptdesc_t val, struct mm_walk *walk)
 {
 	struct page_change_data *masks = walk->private;
 
+	/*
+	 * Some users clear and set bits which alias each other (e.g. PTE_NG and
+	 * PTE_PRESENT_INVALID). It is therefore important that we always clear
+	 * first then set.
+	 */
 	val &= ~(pgprot_val(masks->clear_mask));
 	val |= (pgprot_val(masks->set_mask));
 
@@ -36,7 +41,7 @@ static int pageattr_pud_entry(pud_t *pud, unsigned long addr,
 {
 	pud_t val = pudp_get(pud);
 
-	if (pud_sect(val)) {
+	if (pud_leaf(val)) {
 		if (WARN_ON_ONCE((next - addr) != PUD_SIZE))
 			return -EINVAL;
 		val = __pud(set_pageattr_masks(pud_val(val), walk));
@@ -52,7 +57,7 @@ static int pageattr_pmd_entry(pmd_t *pmd, unsigned long addr,
 {
 	pmd_t val = pmdp_get(pmd);
 
-	if (pmd_sect(val)) {
+	if (pmd_leaf(val)) {
 		if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
 			return -EINVAL;
 		val = __pmd(set_pageattr_masks(pmd_val(val), walk));
@@ -132,11 +137,12 @@ static int __change_memory_common(unsigned long start, unsigned long size,
 	ret = update_range_prot(start, size, set_mask, clear_mask);
 
 	/*
-	 * If the memory is being made valid without changing any other bits
-	 * then a TLBI isn't required as a non-valid entry cannot be cached in
-	 * the TLB.
+	 * If the memory is being switched from present-invalid to valid without
+	 * changing any other bits then a TLBI isn't required as a non-valid
+	 * entry cannot be cached in the TLB.
 	 */
-	if (pgprot_val(set_mask) != PTE_VALID || pgprot_val(clear_mask))
+	if (pgprot_val(set_mask) != PTE_PRESENT_VALID_KERNEL ||
+	    pgprot_val(clear_mask) != PTE_PRESENT_INVALID)
 		flush_tlb_kernel_range(start, start + size);
 	return ret;
 }
@@ -237,18 +243,18 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
 {
 	if (enable)
 		return __change_memory_common(addr, PAGE_SIZE * numpages,
-					__pgprot(PTE_VALID),
-					__pgprot(0));
+					__pgprot(PTE_PRESENT_VALID_KERNEL),
+					__pgprot(PTE_PRESENT_INVALID));
 	else
 		return __change_memory_common(addr, PAGE_SIZE * numpages,
-					__pgprot(0),
-					__pgprot(PTE_VALID));
+					__pgprot(PTE_PRESENT_INVALID),
+					__pgprot(PTE_PRESENT_VALID_KERNEL));
 }
 
 int set_direct_map_invalid_noflush(struct page *page)
 {
-	pgprot_t clear_mask = __pgprot(PTE_VALID);
-	pgprot_t set_mask = __pgprot(0);
+	pgprot_t clear_mask = __pgprot(PTE_PRESENT_VALID_KERNEL);
+	pgprot_t set_mask = __pgprot(PTE_PRESENT_INVALID);
 
 	if (!can_set_direct_map())
 		return 0;
@@ -259,8 +265,8 @@ int set_direct_map_invalid_noflush(struct page *page)
 
 int set_direct_map_default_noflush(struct page *page)
 {
-	pgprot_t set_mask = __pgprot(PTE_VALID | PTE_WRITE);
-	pgprot_t clear_mask = __pgprot(PTE_RDONLY);
+	pgprot_t set_mask = __pgprot(PTE_PRESENT_VALID_KERNEL | PTE_WRITE);
+	pgprot_t clear_mask = __pgprot(PTE_PRESENT_INVALID | PTE_RDONLY);
 
 	if (!can_set_direct_map())
 		return 0;
@@ -296,8 +302,8 @@ static int __set_memory_enc_dec(unsigned long addr,
 	 * entries or Synchronous External Aborts caused by RIPAS_EMPTY
 	 */
 	ret = __change_memory_common(addr, PAGE_SIZE * numpages,
-				     __pgprot(set_prot),
-				     __pgprot(clear_prot | PTE_VALID));
+				     __pgprot(set_prot | PTE_PRESENT_INVALID),
+				     __pgprot(clear_prot | PTE_PRESENT_VALID_KERNEL));
 
 	if (ret)
 		return ret;
@@ -311,8 +317,8 @@ static int __set_memory_enc_dec(unsigned long addr,
 		return ret;
 
 	return __change_memory_common(addr, PAGE_SIZE * numpages,
-				      __pgprot(PTE_VALID),
-				      __pgprot(0));
+				      __pgprot(PTE_PRESENT_VALID_KERNEL),
+				      __pgprot(PTE_PRESENT_INVALID));
 }
 
 static int realm_set_memory_encrypted(unsigned long addr, int numpages)
@@ -404,15 +410,15 @@ bool kernel_page_present(struct page *page)
 	pud = READ_ONCE(*pudp);
 	if (pud_none(pud))
 		return false;
-	if (pud_sect(pud))
-		return true;
+	if (pud_leaf(pud))
+		return pud_valid(pud);
 
 	pmdp = pmd_offset(pudp, addr);
 	pmd = READ_ONCE(*pmdp);
 	if (pmd_none(pmd))
 		return false;
-	if (pmd_sect(pmd))
-		return true;
+	if (pmd_leaf(pmd))
+		return pmd_valid(pmd);
 
 	ptep = pte_offset_kernel(pmdp, addr);
 	return pte_valid(__ptep_get(ptep));
diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 18543b603c77b..cca9706a875c3 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -31,36 +31,6 @@ static void *trans_alloc(struct trans_pgd_info *info)
 	return info->trans_alloc_page(info->trans_alloc_arg);
 }
 
-static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
-{
-	pte_t pte = __ptep_get(src_ptep);
-
-	if (pte_valid(pte)) {
-		/*
-		 * Resume will overwrite areas that may be marked
-		 * read only (code, rodata). Clear the RDONLY bit from
-		 * the temporary mappings we use during restore.
-		 */
-		__set_pte(dst_ptep, pte_mkwrite_novma(pte));
-	} else if (!pte_none(pte)) {
-		/*
-		 * debug_pagealloc will removed the PTE_VALID bit if
-		 * the page isn't in use by the resume kernel. It may have
-		 * been in use by the original kernel, in which case we need
-		 * to put it back in our copy to do the restore.
-		 *
-		 * Other cases include kfence / vmalloc / memfd_secret which
-		 * may call `set_direct_map_invalid_noflush()`.
-		 *
-		 * Before marking this entry valid, check the pfn should
-		 * be mapped.
-		 */
-		BUG_ON(!pfn_valid(pte_pfn(pte)));
-
-		__set_pte(dst_ptep, pte_mkvalid(pte_mkwrite_novma(pte)));
-	}
-}
-
 static int copy_pte(struct trans_pgd_info *info, pmd_t *dst_pmdp,
 		    pmd_t *src_pmdp, unsigned long start, unsigned long end)
 {
@@ -76,7 +46,11 @@ static int copy_pte(struct trans_pgd_info *info, pmd_t *dst_pmdp,
 
 	src_ptep = pte_offset_kernel(src_pmdp, start);
 	do {
-		_copy_pte(dst_ptep, src_ptep, addr);
+		pte_t pte = __ptep_get(src_ptep);
+
+		if (pte_none(pte))
+			continue;
+		__set_pte(dst_ptep, pte_mkvalid_k(pte_mkwrite_novma(pte)));
 	} while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr != end);
 
 	return 0;
@@ -109,8 +83,7 @@ static int copy_pmd(struct trans_pgd_info *info, pud_t *dst_pudp,
 			if (copy_pte(info, dst_pmdp, src_pmdp, addr, next))
 				return -ENOMEM;
 		} else {
-			set_pmd(dst_pmdp,
-				__pmd(pmd_val(pmd) & ~PMD_SECT_RDONLY));
+			set_pmd(dst_pmdp, pmd_mkvalid_k(pmd_mkwrite_novma(pmd)));
 		}
 	} while (dst_pmdp++, src_pmdp++, addr = next, addr != end);
 
@@ -145,8 +118,7 @@ static int copy_pud(struct trans_pgd_info *info, p4d_t *dst_p4dp,
 			if (copy_pmd(info, dst_pudp, src_pudp, addr, next))
 				return -ENOMEM;
 		} else {
-			set_pud(dst_pudp,
-				__pud(pud_val(pud) & ~PUD_SECT_RDONLY));
+			set_pud(dst_pudp, pud_mkvalid_k(pud_mkwrite_novma(pud)));
 		}
 	} while (dst_pudp++, src_pudp++, addr = next, addr != end);
 
-- 
2.43.0

From nobody Wed Apr 1 09:44:32 2026
From: Ryan Roberts
To: Catalin Marinas, Will Deacon, "David Hildenbrand (Arm)", Dev Jain,
	Yang Shi, Suzuki K Poulose, Jinjiang Tu, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v2 3/3] arm64: mm: Remove pmd_sect() and pud_sect()
Date: Mon, 30 Mar 2026 17:17:04 +0100
Message-ID: <20260330161705.3349825-4-ryan.roberts@arm.com>
In-Reply-To: <20260330161705.3349825-1-ryan.roberts@arm.com>
References: <20260330161705.3349825-1-ryan.roberts@arm.com>

The semantics of pXd_leaf() are very similar to pXd_sect(). The only
difference is that pXd_sect() only considers it a section if PTE_VALID
is set, whereas pXd_leaf() permits both "valid" and "present-invalid"
types. Using pXd_sect() has caused issues now that large leaf entries
can be present-invalid since commit a166563e7ec37 ("arm64: mm: support
large block mapping when rodata=full"), so let's just remove the API
and standardize on pXd_leaf().

There are a few callsites of the form pXd_leaf(READ_ONCE(*pXdp)). This
was previously fine for the pXd_sect() macro because it only evaluated
its argument once. But pXd_leaf() evaluates its argument multiple
times. So let's avoid unintended side effects by reimplementing
pXd_leaf() as an inline function.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 19 ++++++++++++-------
 arch/arm64/mm/mmu.c              | 18 +++++++++---------
 2 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index dd062179b9b66..5bc42b85acfc0 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -784,9 +784,13 @@ extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
 
 #define pmd_table(pmd)	((pmd_val(pmd) & PMD_TYPE_MASK) == \
			 PMD_TYPE_TABLE)
-#define pmd_sect(pmd)	((pmd_val(pmd) & PMD_TYPE_MASK) == \
-			 PMD_TYPE_SECT)
-#define pmd_leaf(pmd)	(pmd_present(pmd) && !pmd_table(pmd))
+
+#define pmd_leaf pmd_leaf
+static inline bool pmd_leaf(pmd_t pmd)
+{
+	return pmd_present(pmd) && !pmd_table(pmd);
+}
+
 #define pmd_bad(pmd)	(!pmd_table(pmd))
 
 #define pmd_leaf_size(pmd)	(pmd_cont(pmd) ? CONT_PMD_SIZE : PMD_SIZE)
@@ -804,11 +808,8 @@ static inline int pmd_trans_huge(pmd_t pmd)
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3
-static inline bool pud_sect(pud_t pud) { return false; }
 static inline bool pud_table(pud_t pud) { return true; }
 #else
-#define pud_sect(pud)	((pud_val(pud) & PUD_TYPE_MASK) == \
-			 PUD_TYPE_SECT)
 #define pud_table(pud)	((pud_val(pud) & PUD_TYPE_MASK) == \
			 PUD_TYPE_TABLE)
 #endif
@@ -878,7 +879,11 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
			 PUD_TYPE_TABLE)
 #define pud_present(pud)	pte_present(pud_pte(pud))
 #ifndef __PAGETABLE_PMD_FOLDED
-#define pud_leaf(pud)	(pud_present(pud) && !pud_table(pud))
+#define pud_leaf pud_leaf
+static inline bool pud_leaf(pud_t pud)
+{
+	return pud_present(pud) && !pud_table(pud);
+}
 #else
 #define pud_leaf(pud)	false
 #endif
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 1575680675d8d..dcee56bb622ad 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -204,7 +204,7 @@ static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 	pmd_t pmd = READ_ONCE(*pmdp);
 	pte_t *ptep;
 
-	BUG_ON(pmd_sect(pmd));
+	BUG_ON(pmd_leaf(pmd));
 	if (pmd_none(pmd)) {
 		pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN | PMD_TABLE_AF;
 		phys_addr_t pte_phys;
@@ -303,7 +303,7 @@ static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 	/*
 	 * Check for initial section mappings in the pgd/pud.
 	 */
-	BUG_ON(pud_sect(pud));
+	BUG_ON(pud_leaf(pud));
 	if (pud_none(pud)) {
 		pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN | PUD_TABLE_AF;
 		phys_addr_t pmd_phys;
@@ -1503,7 +1503,7 @@ static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
 			continue;
 
 		WARN_ON(!pmd_present(pmd));
-		if (pmd_sect(pmd)) {
+		if (pmd_leaf(pmd)) {
 			pmd_clear(pmdp);
 
 			/*
@@ -1536,7 +1536,7 @@ static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr,
 			continue;
 
 		WARN_ON(!pud_present(pud));
-		if (pud_sect(pud)) {
+		if (pud_leaf(pud)) {
 			pud_clear(pudp);
 
 			/*
@@ -1650,7 +1650,7 @@ static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
 		if (pmd_none(pmd))
 			continue;
 
-		WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd));
+		WARN_ON(!pmd_present(pmd) || !pmd_table(pmd));
 		free_empty_pte_table(pmdp, addr, next, floor, ceiling);
 	} while (addr = next, addr < end);
 
@@ -1690,7 +1690,7 @@ static void free_empty_pud_table(p4d_t *p4dp, unsigned long addr,
 		if (pud_none(pud))
 			continue;
 
-		WARN_ON(!pud_present(pud) || !pud_table(pud) || pud_sect(pud));
+		WARN_ON(!pud_present(pud) || !pud_table(pud));
 		free_empty_pmd_table(pudp, addr, next, floor, ceiling);
 	} while (addr = next, addr < end);
 
@@ -1786,7 +1786,7 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
 {
 	vmemmap_verify((pte_t *)pmdp, node, addr, next);
 
-	return pmd_sect(READ_ONCE(*pmdp));
+	return pmd_leaf(READ_ONCE(*pmdp));
 }
 
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
@@ -1850,7 +1850,7 @@ void p4d_clear_huge(p4d_t *p4dp)
 
 int pud_clear_huge(pud_t *pudp)
 {
-	if (!pud_sect(READ_ONCE(*pudp)))
+	if (!pud_leaf(READ_ONCE(*pudp)))
 		return 0;
 	pud_clear(pudp);
 	return 1;
@@ -1858,7 +1858,7 @@ int pud_clear_huge(pud_t *pudp)
 
 int pmd_clear_huge(pmd_t *pmdp)
 {
-	if (!pmd_sect(READ_ONCE(*pmdp)))
+	if (!pmd_leaf(READ_ONCE(*pmdp)))
 		return 0;
 	pmd_clear(pmdp);
 	return 1;
-- 
2.43.0