This is a proposal to leverage protection keys (pkeys) to harden critical kernel data, by making it mostly read-only. The series includes a simple framework called "kpkeys" to manipulate pkeys for in-kernel use, as well as a page table hardening feature based on that framework, "kpkeys_hardened_pgtables". Both are implemented on arm64 as a proof of concept, but they are designed to be compatible with any architecture that supports pkeys.

The proposed approach is a typical use of pkeys: the data to protect is mapped with a given pkey P, and the pkey register is initially configured to grant read-only access to P. Where the protected data needs to be written to, the pkey register is temporarily switched to grant write access to P on the current CPU.

The key fact this approach relies on is that the target data is only written to via a limited and well-defined API. This makes it possible to explicitly switch the pkey register where needed, without introducing excessively invasive changes, and only for a small amount of trusted code.

Page tables were chosen as they are a popular (and critical) target for attacks, but there are of course many others - this is only a starting point (see section "Further use-cases"). It has become more and more common for accesses to such target data to be mediated by a hypervisor in vendor kernels; the hope is that kpkeys can provide much of that protection in a simpler and cheaper manner. A rough performance estimation has been performed on a modern arm64 system, see section "Performance".

This series has similarities with the "PKS write protected page tables" series posted by Rick Edgecombe a few years ago [1], but it is not specific to x86/PKS - the approach is meant to be generic.

kpkeys
======

The use of pkeys involves two separate mechanisms: assigning a pkey to pages, and defining the pkeys -> permissions mapping via the pkey register. This is implemented through the following interface:

- Pages in the linear mapping are assigned a pkey using set_memory_pkey(). This is sufficient for this series, but of course higher-level interfaces can be introduced later to ask allocators to return pages marked with a given pkey. It should also be possible to extend this to vmalloc() if needed.

- The pkey register is configured based on a *kpkeys level*. kpkeys levels are simple integers that correspond to a given configuration, for instance:

    KPKEYS_LVL_DEFAULT:
        RW access to KPKEYS_PKEY_DEFAULT
        RO access to any other KPKEYS_PKEY_*

    KPKEYS_LVL_<FEAT>:
        RW access to KPKEYS_PKEY_DEFAULT
        RW access to KPKEYS_PKEY_<FEAT>
        RO access to any other KPKEYS_PKEY_*

  Only pkeys that are managed by the kpkeys framework are impacted; permissions for other pkeys are left unchanged (this allows for other schemes using pkeys to be used in parallel, and arch-specific use of certain pkeys).

  The kpkeys level is changed by calling kpkeys_set_level(), setting the pkey register accordingly and returning the original value. A subsequent call to kpkeys_restore_pkey_reg() restores the kpkeys level.

The numeric value of KPKEYS_LVL_* (kpkeys level) is purely symbolic and thus generic, however each architecture is free to define KPKEYS_PKEY_* (pkey value).

kpkeys_hardened_pgtables
========================

The kpkeys_hardened_pgtables feature uses the interface above to make the (kernel and user) page tables read-only by default, enabling write access only in helpers such as set_pte().
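As a rough illustration of how these pieces fit together - the names kpkeys_set_level(), kpkeys_restore_pkey_reg() and KPKEYS_LVL_PGTABLES come from this cover letter, but the helper below and its exact types/signature are assumptions for illustration, not code taken from the series - a guarded page table write boils down to:

    /* Sketch only: the return type of kpkeys_set_level() is assumed. */
    static __always_inline void hardened_set_pte(pte_t *ptep, pte_t pte)
    {
            u64 saved_pkey_reg;

            /* Grant RW access to KPKEYS_PKEY_PGTABLES, on this CPU only */
            saved_pkey_reg = kpkeys_set_level(KPKEYS_LVL_PGTABLES);

            WRITE_ONCE(*ptep, pte);         /* the actual page table write */

            /* Back to the previous configuration: page tables are RO again */
            kpkeys_restore_pkey_reg(saved_pkey_reg);
    }

The series wraps this pattern in a guard object (guard(kpkeys_hardened_pgtables)) gated on a static key, so that the level switch only happens once the feature is enabled.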
One complication is that those helpers as well as page table allocators are used very early, before kpkeys become available. Enabling kpkeys_hardened_pgtables, if and when kpkeys become available, is therefore done as follows:

1. A static key is turned on. This enables a transition to KPKEYS_LVL_PGTABLES in all helpers writing to page tables, and also impacts page table allocators (step 3).

2. swapper_pg_dir is walked to set all early page table pages to KPKEYS_PKEY_PGTABLES.

3. Page table allocators set the returned pages to KPKEYS_PKEY_PGTABLES (and the pkey is reset upon freeing). This ensures that all page tables are mapped with that privileged pkey.

This series
===========

The series is composed of two parts:

- The kpkeys framework (patch 1-9). The main API is introduced in <linux/kpkeys.h>, and it is implemented on arm64 using the POE (Permission Overlay Extension) feature.

- The kpkeys_hardened_pgtables feature (patch 10-18). <linux/kpkeys.h> is extended with an API to set page table pages to a given pkey and a guard object to switch kpkeys level accordingly, both gated on a static key. This is then used in generic and arm64 pgtable handling code as needed. Finally a simple KUnit-based test suite is added to demonstrate the page table protection.

pkey register management
========================

The kpkeys model relies on the kernel pkey register being set to a specific value for the duration of a relatively small section of code, and otherwise to the default value. Accordingly, the arm64 implementation based on POE handles its pkey register (POR_EL1) as follows:

- POR_EL1 is saved and reset to its default value on exception entry, and restored on exception return. This ensures that exception handling code runs in a fixed kpkeys state.

- POR_EL1 is context-switched per-thread. This allows sections of code that run at a non-default kpkeys level to sleep (e.g. when locking a mutex). For kpkeys_hardened_pgtables, only involuntary preemption is relevant and the previous point already handles that; however sleeping is likely to occur in more advanced uses of kpkeys.

An important assumption is that all kpkeys levels allow RW access to the default pkey (0). Otherwise, saving POR_EL1 before resetting it on exception entry would be at best difficult, and so would context-switching it.

Performance
===========

No arm64 hardware currently implements POE. To estimate the performance impact of kpkeys_hardened_pgtables, a mock implementation of kpkeys has been used, replacing accesses to the POR_EL1 register with accesses to another system register that is otherwise unused (CONTEXTIDR_EL1), and leaving everything else unchanged. Most of the kpkeys overhead is expected to originate from the barrier (ISB) that is required after writing to POR_EL1, and from setting the POIndex (pkey) in page tables; both of these are done exactly in the same way in the mock implementation.

The original implementation of kpkeys_hardened_pgtables is very inefficient when many PTEs are changed at once, as the kpkeys level is switched twice for every PTE (two ISBs per PTE). Patch 18 introduces an optimisation that makes use of the lazy_mmu mode to batch those switches:

1. switch to KPKEYS_LVL_PGTABLES on arch_enter_lazy_mmu_mode(),
2. skip any kpkeys switch while in that section, and
3. restore the kpkeys level on arch_leave_lazy_mmu_mode().

When that last function already issues an ISB (when updating kernel page tables), we get a further optimisation as we can skip the ISB when restoring the kpkeys level.
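A sketch of what that batching could look like - the per-thread field and the hook bodies below are assumptions made for illustration (TIF_LAZY_MMU is the flag name mentioned in the changelog further down), not the actual patch 18:

    void arch_enter_lazy_mmu_mode(void)
    {
            /* One kpkeys switch for the whole section, not two per PTE */
            current->thread.kpkeys_saved = kpkeys_set_level(KPKEYS_LVL_PGTABLES);
            set_thread_flag(TIF_LAZY_MMU);
    }

    void arch_leave_lazy_mmu_mode(void)
    {
            clear_thread_flag(TIF_LAZY_MMU);
            /*
             * The ISB can be skipped here if the kernel page table update
             * already issued one.
             */
            kpkeys_restore_pkey_reg(current->thread.kpkeys_saved);
    }

    /* In each page table write helper: */
    if (!test_thread_flag(TIF_LAZY_MMU))            /* not inside a batch */
            saved = kpkeys_set_level(KPKEYS_LVL_PGTABLES);

(Per the v5 changelog further down, batching is skipped in interrupt context, since POR_EL1 is reset on exception entry.)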
Both implementations (without and with batching) were evaluated on an Amazon EC2 M7g instance (Graviton3), using a variety of benchmarks that involve heavy page table manipulations. The results shown below are relative to the baseline for this series, which is 6.17-rc1. The branches used for all three sets of results (baseline, with/without batching) are available in a repository, see next section.

Caveat: these numbers should be seen as a lower bound for the overhead of a real POE-based protection. The hardware checks added by POE are however not expected to incur significant extra overhead.

Reading example: for the fix_size_alloc_test benchmark, using 1 page per iteration (no hugepage), kpkeys_hardened_pgtables incurs 17.35% overhead without batching, and 14.62% overhead with batching. Both results are considered statistically significant (95% confidence interval), indicated by "(R)".

+-------------------+----------------------------------+------------------+---------------+
| Benchmark         | Result Class                     | Without batching | With batching |
+===================+==================================+==================+===============+
| mmtests/kernbench | real time                        |            0.30% |         0.11% |
|                   | system time                      |        (R) 3.97% |     (R) 2.17% |
|                   | user time                        |            0.12% |         0.02% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/fork      | fork: h:0                        |      (R) 217.31% |        -0.97% |
|                   | fork: h:1                        |      (R) 275.25% |     (R) 2.25% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/munmap    | munmap: h:0                      |       (R) 15.57% |        -1.95% |
|                   | munmap: h:1                      |      (R) 169.53% |     (R) 6.53% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R) 17.35% |    (R) 14.62% |
|                   | fix_size_alloc_test: p:4, h:0    |       (R) 37.54% |     (R) 9.35% |
|                   | fix_size_alloc_test: p:16, h:0   |       (R) 66.08% |     (R) 3.15% |
|                   | fix_size_alloc_test: p:64, h:0   |       (R) 82.94% |        -0.39% |
|                   | fix_size_alloc_test: p:256, h:0  |       (R) 87.85% |        -1.67% |
|                   | fix_size_alloc_test: p:16, h:1   |       (R) 50.31% |         3.00% |
|                   | fix_size_alloc_test: p:64, h:1   |       (R) 59.73% |         2.23% |
|                   | fix_size_alloc_test: p:256, h:1  |       (R) 62.14% |         1.51% |
|                   | random_size_alloc_test: p:1, h:0 |       (R) 77.82% |        -0.21% |
|                   | vm_map_ram_test: p:1, h:0        |       (R) 30.66% |    (R) 27.30% |
+-------------------+----------------------------------+------------------+---------------+

Benchmarks:
- mmtests/kernbench: running kernbench (kernel build) [4].
- micromm/{fork,munmap}: from David Hildenbrand's benchmark suite. A 1 GB mapping is created and then fork/unmap is called. The mapping is created using either page-sized (h:0) or hugepage folios (h:1); in all cases the memory is PTE-mapped.
- micromm/vmalloc: from test_vmalloc.ko, varying the number of pages (p:) and whether huge pages are used (h:).

On a "real-world" and fork-heavy workload like kernbench, the estimated overhead of kpkeys_hardened_pgtables is reasonable: 4% system time overhead without batching, and about half that figure (2.2%) with batching. The real time overhead is negligible.

Microbenchmarks show large overheads without batching, which increase with the number of pages being manipulated. Batching drastically reduces that overhead, almost negating it for micromm/fork.
Because all PTEs in the mapping are modified in the same lazy_mmu section, the kpkeys level is changed just twice regardless of the mapping size; as a result the relative overhead actually decreases as the size increases for fix_size_alloc_test.

Note: the performance impact of set_memory_pkey() is likely to be relatively low on arm64 because the linear mapping uses PTE-level descriptors only. This means that set_memory_pkey() simply changes the attributes of some PTE descriptors. However, some systems may be able to use higher-level descriptors in the future [5], meaning that set_memory_pkey() may have to split mappings. Allocating page tables from a contiguous cache of pages could help minimise the overhead, as proposed for x86 in [1].

Branches
========

To make reviewing and testing easier, this series is available here:

  https://gitlab.arm.com/linux-arm/linux-kb

The following branches are available:

- kpkeys/rfc-v5 - this series, as posted
- kpkeys/rfc-v5-base - the baseline for this series, that is 6.17-rc1
- kpkeys/rfc-v5-bench-batching - this series + patch for benchmarking on a regular arm64 system (see section above)
- kpkeys/rfc-v5-bench-no-batching - this series without patch 18 (batching) + benchmarking patch

Threat model
============

The proposed scheme aims at mitigating data-only attacks (e.g. use-after-free/cross-cache attacks). In other words, it is assumed that control flow is not corrupted, and that the attacker does not achieve arbitrary code execution. Nothing prevents the pkey register from being set to its most permissive state - the assumption is that the register is only modified on legitimate code paths.

A few related notes:

- Functions that set the pkey register are all implemented inline. Besides performance considerations, this is meant to avoid creating a function that can be used as a straightforward gadget to set the pkey register to an arbitrary value.

- kpkeys_set_level() only accepts a compile-time constant as argument, as a variable could be manipulated by an attacker. This could be relaxed but it seems unlikely that a variable kpkeys level would be needed in practice.

Further use-cases
=================

It should be possible to harden various targets using kpkeys, including:

- struct cred - kpkeys-based cred hardening is now available in a separate series [6]
- fixmap (occasionally used even after early boot, e.g. set_swapper_pgd() in arch/arm64/mm/mmu.c)
- eBPF programs (preventing direct access to core kernel code/data)
- SELinux state (e.g. struct selinux_state::initialized)

... and many others. kpkeys could also be used to strengthen the confidentiality of secret data by making it completely inaccessible by default, and granting read-only or read-write access as needed. This requires such data to be rarely accessed (or via a limited interface only). One example on arm64 is the pointer authentication keys in thread_struct, whose leakage to userspace would lead to pointer authentication being easily defeated.

Open questions
==============

A few aspects in this RFC that are debatable and/or worth discussing:

- There is currently no restriction on how kpkeys levels map to pkeys permissions. A typical approach is to allocate one pkey per level and make it writable at that level only. As the number of levels increases, we may however run out of pkeys, especially on arm64 (just 8 pkeys with POE). Depending on the use-cases, it may be acceptable to use the same pkey for the data associated to multiple levels.
  Another potential concern is that a given piece of code may require write access to multiple privileged pkeys. This could be addressed by introducing a notion of hierarchy in trust levels, where Tn is able to write to memory owned by Tm if n >= m, for instance.

- kpkeys_set_level() and kpkeys_restore_pkey_reg() are not symmetric: the former takes a kpkeys level and returns a pkey register value, to be consumed by the latter. It would be more intuitive to manipulate kpkeys levels only. However this assumes that there is a 1:1 mapping between kpkeys levels and pkey register values, while in principle the mapping is 1:n (certain pkeys may be used outside the kpkeys framework).

- An architecture that supports kpkeys is expected to select CONFIG_ARCH_HAS_KPKEYS and always enable them if available - there is no CONFIG_KPKEYS to control this behaviour. Since this creates no significant overhead (at least on arm64), it seemed better to keep it simple. Each hardening feature does have its own option and arch opt-in if needed (CONFIG_KPKEYS_HARDENED_PGTABLES, CONFIG_ARCH_HAS_KPKEYS_HARDENED_PGTABLES).

Any comment or feedback will be highly appreciated, be it on the high-level approach or implementation choices!

- Kevin

---

Changelog

RFC v4..v5:
- Rebased on v6.17-rc1.
- Cover letter: re-ran benchmarks on top of v6.17-rc1, made various small improvements especially to the "Performance" section.
- Patch 18: disable batching while in interrupt, since POR_EL1 is reset on exception entry, making the TIF_LAZY_MMU flag meaningless. This fixes a crash that may occur when a page table page is freed while in interrupt context.
- Patch 17: ensure that the target kernel address is actually PTE-mapped. Certain mappings (e.g. code) may be PMD-mapped instead - this explains why the change made in v4 was required.

RFC v4: https://lore.kernel.org/linux-mm/20250411091631.954228-1-kevin.brodsky@arm.com/

RFC v3..v4:
- Added appropriate handling of the arm64 pkey register (POR_EL1): context-switching between threads and resetting on exception entry (patch 7 and 8). See section "pkey register management" above for more details. A new POR_EL1_INIT macro is introduced to make the default value available to assembly (where POR_EL1 is reset on exception entry); it is updated in each patch allocating new keys.
- Added patch 18 making use of the lazy_mmu mode to batch switches to KPKEYS_LVL_PGTABLES - just once per lazy_mmu section rather than on every pgtable write. See section "Performance" for details.
- Rebased on top of [2]. No direct impact on the patches, but it ensures that the ctor/dtor is always called for kernel pgtables. This is an important fix as kernel PTEs allocated after boot were not protected by kpkeys_hardened_pgtables in v3 - a new test was added to patch 17 to ensure that pgtables created by vmalloc are protected too.
- Rebased on top of [3]. The batching of kpkeys level switches in patch 18 relies on the last patch in [3].
- Moved kpkeys guard definitions out of <linux/kpkeys.h> and to a relevant header for each subsystem (e.g. <asm/pgtable.h> for the kpkeys_hardened_pgtables guard).
- Patch 1,5: marked kpkeys_{set_level,restore_pkey_reg} as __always_inline to ensure that no callable gadget is created. [Maxwell Bland's suggestion]
- Patch 5: added helper __kpkeys_set_pkey_reg_nosync().
- Patch 10: marked kernel_pgtables_set_pkey() and related helpers as __init. [Linus Walleij's suggestion]
- Patch 11: added helper kpkeys_hardened_pgtables_enabled(), renamed the static key to kpkeys_hardened_pgtables_key.
- Patch 17: followed the KUnit conventions more closely. [Kees Cook's suggestion]
- Patch 17: changed the address used in the write_linear_map_pte() test. It seems that the PTEs that map some functions are allocated in ZONE_DMA and read-only (unclear why exactly). This doesn't seem to occur for global variables.
- Various minor fixes/improvements.
- Rebased on v6.15-rc1. This includes [7], which renames a few POE symbols: s/POE_RXW/POE_RWX/ and s/por_set_pkey_perms/por_elx_set_pkey_perms/

RFC v3: https://lore.kernel.org/linux-hardening/20250203101839.1223008-1-kevin.brodsky@arm.com/

RFC v2..v3:
- Patch 1: kpkeys_set_level() may now return KPKEYS_PKEY_REG_INVAL to indicate that the pkey register wasn't written to, and as a result that kpkeys_restore_pkey_reg() should do nothing. This simplifies the conditional guard macro and also allows architectures to skip writes to the pkey register if the target value is the same as the current one.
- Patch 1: introduced additional KPKEYS_GUARD* macros to cover more use-cases and reduce duplication.
- Patch 6: reject pkey value above arch_max_pkey().
- Patch 13: added missing guard(kpkeys_hardened_pgtables) in __clear_young_dirty_pte().
- Rebased on v6.14-rc1.

RFC v2: https://lore.kernel.org/linux-hardening/20250108103250.3188419-1-kevin.brodsky@arm.com/

RFC v1..v2:
- A new approach is used to set the pkey of page table pages. Thanks to Qi Zheng's and my own series [8][9], pagetable_*_ctor is systematically called when a PTP is allocated at any level (PTE to PGD), and pagetable_*_dtor when it is freed, on all architectures. Patch 11 makes use of this to call kpkeys_{,un}protect_pgtable_memory from the common ctor/dtor helper. The arm64 patches from v1 (patch 12 and 13) are dropped as they are no longer needed. Patch 10 is introduced to allow pagetable_*_ctor to fail at all levels, since kpkeys_protect_pgtable_memory may itself fail. [Original suggestion by Peter Zijlstra]
- Changed the prototype of kpkeys_{,un}protect_pgtable_memory in patch 9 to take a struct folio * for more convenience, and implemented them out-of-line to avoid a circular dependency with <linux/mm.h>.
- Rebased on next-20250107, which includes [8] and [9].
- Added locking in patch 8.
  [Peter Zijlstra's suggestion]

RFC v1: https://lore.kernel.org/linux-hardening/20241206101110.1646108-1-kevin.brodsky@arm.com/

---

References

[1] https://lore.kernel.org/all/20210830235927.6443-1-rick.p.edgecombe@intel.com/
[2] https://lore.kernel.org/linux-mm/20250408095222.860601-1-kevin.brodsky@arm.com/
[3] https://lore.kernel.org/linux-mm/20250304150444.3788920-1-ryan.roberts@arm.com/
[4] https://github.com/gormanm/mmtests/blob/master/shellpack_src/src/kernbench/kernbench-bench
[5] https://lore.kernel.org/all/20250724221216.1998696-1-yang@os.amperecomputing.com/
[6] https://lore.kernel.org/linux-mm/?q=s%3Apkeys+s%3Acred+s%3A0
[7] https://lore.kernel.org/linux-arm-kernel/20250219164029.2309119-1-kevin.brodsky@arm.com/
[8] https://lore.kernel.org/linux-mm/cover.1736317725.git.zhengqi.arch@bytedance.com/
[9] https://lore.kernel.org/linux-mm/20250103184415.2744423-1-kevin.brodsky@arm.com/

---

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maxwell Bland <mbland@motorola.com>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pierre Langlois <pierre.langlois@arm.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org

---

Kevin Brodsky (18):
  mm: Introduce kpkeys
  set_memory: Introduce set_memory_pkey() stub
  arm64: mm: Enable overlays for all EL1 indirect permissions
  arm64: Introduce por_elx_set_pkey_perms() helper
  arm64: Implement asm/kpkeys.h using POE
  arm64: set_memory: Implement set_memory_pkey()
  arm64: Reset POR_EL1 on exception entry
  arm64: Context-switch POR_EL1
  arm64: Enable kpkeys
  mm: Introduce kernel_pgtables_set_pkey()
  mm: Introduce kpkeys_hardened_pgtables
  mm: Allow __pagetable_ctor() to fail
  mm: Map page tables with privileged pkey
  arm64: kpkeys: Support KPKEYS_LVL_PGTABLES
  arm64: mm: Guard page table writes with kpkeys
  arm64: Enable kpkeys_hardened_pgtables support
  mm: Add basic tests for kpkeys_hardened_pgtables
  arm64: mm: Batch kpkeys level switches

 arch/arm64/Kconfig                        |   2 +
 arch/arm64/include/asm/kpkeys.h           |  62 +++++++++
 arch/arm64/include/asm/pgtable-prot.h     |  16 +--
 arch/arm64/include/asm/pgtable.h          |  57 +++++++-
 arch/arm64/include/asm/por.h              |  11 ++
 arch/arm64/include/asm/processor.h        |   2 +
 arch/arm64/include/asm/ptrace.h           |   4 +
 arch/arm64/include/asm/set_memory.h       |   4 +
 arch/arm64/kernel/asm-offsets.c           |   3 +
 arch/arm64/kernel/cpufeature.c            |   5 +-
 arch/arm64/kernel/entry.S                 |  24 +++-
 arch/arm64/kernel/process.c               |   9 ++
 arch/arm64/kernel/smp.c                   |   2 +
 arch/arm64/mm/fault.c                     |   2 +
 arch/arm64/mm/mmu.c                       |  26 ++--
 arch/arm64/mm/pageattr.c                  |  25 ++++
 include/asm-generic/kpkeys.h              |  21 +++
 include/asm-generic/pgalloc.h             |  15 ++-
 include/linux/kpkeys.h                    | 157 ++++++++++++++++++++++
 include/linux/mm.h                        |  27 ++--
 include/linux/set_memory.h                |   7 +
 mm/Kconfig                                |   5 +
 mm/Makefile                               |   2 +
 mm/kpkeys_hardened_pgtables.c             |  44 ++++++
 mm/memory.c                               | 137 +++++++++++++++
 mm/tests/kpkeys_hardened_pgtables_kunit.c | 106 +++++++++++++++
 security/Kconfig.hardening                |  24 ++++
 27 files changed, 758 insertions(+), 41 deletions(-)
 create mode 100644 arch/arm64/include/asm/kpkeys.h
 create mode 100644 include/asm-generic/kpkeys.h
 create mode 100644 include/linux/kpkeys.h
 create mode 100644 mm/kpkeys_hardened_pgtables.c
 create mode 100644 mm/tests/kpkeys_hardened_pgtables_kunit.c

base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
-- 
2.47.0
On 15/08/2025 10:54, Kevin Brodsky wrote: > [...] > > Performance > =========== > > No arm64 hardware currently implements POE. To estimate the performance > impact of kpkeys_hardened_pgtables, a mock implementation of kpkeys has > been used, replacing accesses to the POR_EL1 register with accesses to > another system register that is otherwise unused (CONTEXTIDR_EL1), and > leaving everything else unchanged. Most of the kpkeys overhead is > expected to originate from the barrier (ISB) that is required after > writing to POR_EL1, and from setting the POIndex (pkey) in page tables; > both of these are done exactly in the same way in the mock > implementation. It turns out this wasn't the case regarding the pkey setting - because patch 6 gates set_memory_pkey() on system_supports_poe() and not arch_kpkeys_enabled(), the mock implementation turned set_memory_pkey() into a no-op. Many thanks to Rick Edgecombe for highlighting that the overheads were suspiciously low for some benchmarks! > The original implementation of kpkeys_hardened_pgtables is very > inefficient when many PTEs are changed at once, as the kpkeys level is > switched twice for every PTE (two ISBs per PTE). Patch 18 introduces > an optimisation that makes use of the lazy_mmu mode to batch those > switches: 1. switch to KPKEYS_LVL_PGTABLES on arch_enter_lazy_mmu_mode(), > 2. skip any kpkeys switch while in that section, and 3. restore the > kpkeys level on arch_leave_lazy_mmu_mode(). When that last function > already issues an ISB (when updating kernel page tables), we get a > further optimisation as we can skip the ISB when restoring the kpkeys > level. > > Both implementations (without and with batching) were evaluated on an > Amazon EC2 M7g instance (Graviton3), using a variety of benchmarks that > involve heavy page table manipulations. The results shown below are > relative to the baseline for this series, which is 6.17-rc1. The > branches used for all three sets of results (baseline, with/without > batching) are available in a repository, see next section. > > Caveat: these numbers should be seen as a lower bound for the overhead > of a real POE-based protection. The hardware checks added by POE are > however not expected to incur significant extra overhead. > > Reading example: for the fix_size_alloc_test benchmark, using 1 page per > iteration (no hugepage), kpkeys_hardened_pgtables incurs 17.35% overhead > without batching, and 14.62% overhead with batching. Both results are > considered statistically significant (95% confidence interval), > indicated by "(R)". 
> > +-------------------+----------------------------------+------------------+---------------+ > | Benchmark | Result Class | Without batching | With batching | > +===================+==================================+==================+===============+ > | mmtests/kernbench | real time | 0.30% | 0.11% | > | | system time | (R) 3.97% | (R) 2.17% | > | | user time | 0.12% | 0.02% | > +-------------------+----------------------------------+------------------+---------------+ > | micromm/fork | fork: h:0 | (R) 217.31% | -0.97% | > | | fork: h:1 | (R) 275.25% | (R) 2.25% | > +-------------------+----------------------------------+------------------+---------------+ > | micromm/munmap | munmap: h:0 | (R) 15.57% | -1.95% | > | | munmap: h:1 | (R) 169.53% | (R) 6.53% | > +-------------------+----------------------------------+------------------+---------------+ > | micromm/vmalloc | fix_size_alloc_test: p:1, h:0 | (R) 17.35% | (R) 14.62% | > | | fix_size_alloc_test: p:4, h:0 | (R) 37.54% | (R) 9.35% | > | | fix_size_alloc_test: p:16, h:0 | (R) 66.08% | (R) 3.15% | > | | fix_size_alloc_test: p:64, h:0 | (R) 82.94% | -0.39% | > | | fix_size_alloc_test: p:256, h:0 | (R) 87.85% | -1.67% | > | | fix_size_alloc_test: p:16, h:1 | (R) 50.31% | 3.00% | > | | fix_size_alloc_test: p:64, h:1 | (R) 59.73% | 2.23% | > | | fix_size_alloc_test: p:256, h:1 | (R) 62.14% | 1.51% | > | | random_size_alloc_test: p:1, h:0 | (R) 77.82% | -0.21% | > | | vm_map_ram_test: p:1, h:0 | (R) 30.66% | (R) 27.30% | > +-------------------+----------------------------------+------------------+---------------+ These numbers therefore correspond to set_memory_pkey() being a no-op, in other words they represent the overhead of switching the pkey register only. I have amended the mock implementation so that set_memory_pkey() is run as it would on a real POE implementation (i.e. actually setting the PTE bits). 
Here are the new results, representing the overhead of both pkey register switching and setting the pkey of page table pages (PTPs) on alloc/free:

+-------------------+----------------------------------+------------------+---------------+
| Benchmark         | Result Class                     | Without batching | With batching |
+===================+==================================+==================+===============+
| mmtests/kernbench | real time                        |            0.32% |         0.35% |
|                   | system time                      |        (R) 4.18% |     (R) 3.18% |
|                   | user time                        |            0.08% |         0.20% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/fork      | fork: h:0                        |      (R) 221.39% |     (R) 3.35% |
|                   | fork: h:1                        |      (R) 282.89% |     (R) 6.99% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/munmap    | munmap: h:0                      |       (R) 17.37% |        -0.28% |
|                   | munmap: h:1                      |      (R) 172.61% |     (R) 8.08% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R) 15.54% |    (R) 12.57% |
|                   | fix_size_alloc_test: p:4, h:0    |       (R) 39.18% |     (R) 9.13% |
|                   | fix_size_alloc_test: p:16, h:0   |       (R) 65.81% |         2.97% |
|                   | fix_size_alloc_test: p:64, h:0   |       (R) 83.39% |        -0.49% |
|                   | fix_size_alloc_test: p:256, h:0  |       (R) 87.85% |    (I) -2.04% |
|                   | fix_size_alloc_test: p:16, h:1   |       (R) 51.21% |         3.77% |
|                   | fix_size_alloc_test: p:64, h:1   |       (R) 60.02% |         0.99% |
|                   | fix_size_alloc_test: p:256, h:1  |       (R) 63.82% |         1.16% |
|                   | random_size_alloc_test: p:1, h:0 |       (R) 77.79% |        -0.51% |
|                   | vm_map_ram_test: p:1, h:0        |       (R) 30.67% |    (R) 27.09% |
+-------------------+----------------------------------+------------------+---------------+

Those results are overall very similar to the original ones. micromm/fork is however clearly impacted - around 4% additional overhead from set_memory_pkey(); it makes sense considering that forking requires duplicating (and therefore allocating) a full set of page tables. kernbench is also a fork-heavy workload and it gets a 1% hit in system time (with batching).

It seems fair to conclude that, on arm64, setting the pkey whenever a PTP is allocated/freed is not particularly expensive. The situation may well be different on x86 as Rick pointed out, and it may also change on newer arm64 systems as I noted further down. Allocating/freeing PTPs in bulk should help if setting the pkey in the pgtable ctor/dtor proves too expensive.

- Kevin

> Benchmarks:
> - mmtests/kernbench: running kernbench (kernel build) [4].
> - micromm/{fork,munmap}: from David Hildenbrand's benchmark suite. A
> 1 GB mapping is created and then fork/unmap is called. The mapping is
> created using either page-sized (h:0) or hugepage folios (h:1); in all
> cases the memory is PTE-mapped.
> - micromm/vmalloc: from test_vmalloc.ko, varying the number of pages
> (p:) and whether huge pages are used (h:).
>
> On a "real-world" and fork-heavy workload like kernbench, the estimated
> overhead of kpkeys_hardened_pgtables is reasonable: 4% system time
> overhead without batching, and about half that figure (2.2%) with
> batching. The real time overhead is negligible.
>
> Microbenchmarks show large overheads without batching, which increase
> with the number of pages being manipulated. Batching drastically reduces
> that overhead, almost negating it for micromm/fork.
Because all PTEs in > the mapping are modified in the same lazy_mmu section, the kpkeys level > is changed just twice regardless of the mapping size; as a result the > relative overhead actually decreases as the size increases for > fix_size_alloc_test. > > Note: the performance impact of set_memory_pkey() is likely to be > relatively low on arm64 because the linear mapping uses PTE-level > descriptors only. This means that set_memory_pkey() simply changes the > attributes of some PTE descriptors. However, some systems may be able to > use higher-level descriptors in the future [5], meaning that > set_memory_pkey() may have to split mappings. Allocating page tables > from a contiguous cache of pages could help minimise the overhead, as > proposed for x86 in [1]. > > [...]
On 20/08/2025 17:53, Kevin Brodsky wrote: > On 15/08/2025 10:54, Kevin Brodsky wrote: >> [...] >> >> Performance >> =========== >> >> No arm64 hardware currently implements POE. To estimate the performance >> impact of kpkeys_hardened_pgtables, a mock implementation of kpkeys has >> been used, replacing accesses to the POR_EL1 register with accesses to >> another system register that is otherwise unused (CONTEXTIDR_EL1), and >> leaving everything else unchanged. Most of the kpkeys overhead is >> expected to originate from the barrier (ISB) that is required after >> writing to POR_EL1, and from setting the POIndex (pkey) in page tables; >> both of these are done exactly in the same way in the mock >> implementation. > It turns out this wasn't the case regarding the pkey setting - because > patch 6 gates set_memory_pkey() on system_supports_poe() and not > arch_kpkeys_enabled(), the mock implementation turned set_memory_pkey() > into a no-op. Many thanks to Rick Edgecombe for highlighting that the > overheads were suspiciously low for some benchmarks! > >> The original implementation of kpkeys_hardened_pgtables is very >> inefficient when many PTEs are changed at once, as the kpkeys level is >> switched twice for every PTE (two ISBs per PTE). Patch 18 introduces >> an optimisation that makes use of the lazy_mmu mode to batch those >> switches: 1. switch to KPKEYS_LVL_PGTABLES on arch_enter_lazy_mmu_mode(), >> 2. skip any kpkeys switch while in that section, and 3. restore the >> kpkeys level on arch_leave_lazy_mmu_mode(). When that last function >> already issues an ISB (when updating kernel page tables), we get a >> further optimisation as we can skip the ISB when restoring the kpkeys >> level. >> >> Both implementations (without and with batching) were evaluated on an >> Amazon EC2 M7g instance (Graviton3), using a variety of benchmarks that >> involve heavy page table manipulations. The results shown below are >> relative to the baseline for this series, which is 6.17-rc1. The >> branches used for all three sets of results (baseline, with/without >> batching) are available in a repository, see next section. >> >> Caveat: these numbers should be seen as a lower bound for the overhead >> of a real POE-based protection. The hardware checks added by POE are >> however not expected to incur significant extra overhead. >> >> Reading example: for the fix_size_alloc_test benchmark, using 1 page per >> iteration (no hugepage), kpkeys_hardened_pgtables incurs 17.35% overhead >> without batching, and 14.62% overhead with batching. Both results are >> considered statistically significant (95% confidence interval), >> indicated by "(R)". 
>> >> +-------------------+----------------------------------+------------------+---------------+ >> | Benchmark | Result Class | Without batching | With batching | >> +===================+==================================+==================+===============+ >> | mmtests/kernbench | real time | 0.30% | 0.11% | >> | | system time | (R) 3.97% | (R) 2.17% | >> | | user time | 0.12% | 0.02% | >> +-------------------+----------------------------------+------------------+---------------+ >> | micromm/fork | fork: h:0 | (R) 217.31% | -0.97% | >> | | fork: h:1 | (R) 275.25% | (R) 2.25% | >> +-------------------+----------------------------------+------------------+---------------+ >> | micromm/munmap | munmap: h:0 | (R) 15.57% | -1.95% | >> | | munmap: h:1 | (R) 169.53% | (R) 6.53% | >> +-------------------+----------------------------------+------------------+---------------+ >> | micromm/vmalloc | fix_size_alloc_test: p:1, h:0 | (R) 17.35% | (R) 14.62% | >> | | fix_size_alloc_test: p:4, h:0 | (R) 37.54% | (R) 9.35% | >> | | fix_size_alloc_test: p:16, h:0 | (R) 66.08% | (R) 3.15% | >> | | fix_size_alloc_test: p:64, h:0 | (R) 82.94% | -0.39% | >> | | fix_size_alloc_test: p:256, h:0 | (R) 87.85% | -1.67% | >> | | fix_size_alloc_test: p:16, h:1 | (R) 50.31% | 3.00% | >> | | fix_size_alloc_test: p:64, h:1 | (R) 59.73% | 2.23% | >> | | fix_size_alloc_test: p:256, h:1 | (R) 62.14% | 1.51% | >> | | random_size_alloc_test: p:1, h:0 | (R) 77.82% | -0.21% | >> | | vm_map_ram_test: p:1, h:0 | (R) 30.66% | (R) 27.30% | >> +-------------------+----------------------------------+------------------+---------------+ > These numbers therefore correspond to set_memory_pkey() being a no-op, > in other words they represent the overhead of switching the pkey > register only. > > I have amended the mock implementation so that set_memory_pkey() is run > as it would on a real POE implementation (i.e. actually setting the PTE > bits). 
Here are the new results, representing the overhead of both pkey > register switching and setting the pkey of page table pages (PTPs) on > alloc/free: > > +-------------------+----------------------------------+------------------+---------------+ > | Benchmark | Result Class | Without > batching | With batching | > +===================+==================================+==================+===============+ > | mmtests/kernbench | real time | > 0.32% | 0.35% | > | | system time | (R) > 4.18% | (R) 3.18% | > | | user time | > 0.08% | 0.20% | > +-------------------+----------------------------------+------------------+---------------+ > | micromm/fork | fork: h:0 | (R) > 221.39% | (R) 3.35% | > | | fork: h:1 | (R) > 282.89% | (R) 6.99% | > +-------------------+----------------------------------+------------------+---------------+ > | micromm/munmap | munmap: h:0 | (R) > 17.37% | -0.28% | > | | munmap: h:1 | (R) > 172.61% | (R) 8.08% | > +-------------------+----------------------------------+------------------+---------------+ > | micromm/vmalloc | fix_size_alloc_test: p:1, h:0 | (R) > 15.54% | (R) 12.57% | > | | fix_size_alloc_test: p:4, h:0 | (R) > 39.18% | (R) 9.13% | > | | fix_size_alloc_test: p:16, h:0 | (R) > 65.81% | 2.97% | > | | fix_size_alloc_test: p:64, h:0 | (R) > 83.39% | -0.49% | > | | fix_size_alloc_test: p:256, h:0 | (R) > 87.85% | (I) -2.04% | > | | fix_size_alloc_test: p:16, h:1 | (R) > 51.21% | 3.77% | > | | fix_size_alloc_test: p:64, h:1 | (R) > 60.02% | 0.99% | > | | fix_size_alloc_test: p:256, h:1 | (R) > 63.82% | 1.16% | > | | random_size_alloc_test: p:1, h:0 | (R) > 77.79% | -0.51% | > | | vm_map_ram_test: p:1, h:0 | (R) > 30.67% | (R) 27.09% | > +-------------------+----------------------------------+------------------+---------------+

Apologies, Thunderbird helpfully decided to wrap around that table... Here's the unmangled table:

+-------------------+----------------------------------+------------------+---------------+
| Benchmark         | Result Class                     | Without batching | With batching |
+===================+==================================+==================+===============+
| mmtests/kernbench | real time                        |            0.32% |         0.35% |
|                   | system time                      |        (R) 4.18% |     (R) 3.18% |
|                   | user time                        |            0.08% |         0.20% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/fork      | fork: h:0                        |      (R) 221.39% |     (R) 3.35% |
|                   | fork: h:1                        |      (R) 282.89% |     (R) 6.99% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/munmap    | munmap: h:0                      |       (R) 17.37% |        -0.28% |
|                   | munmap: h:1                      |      (R) 172.61% |     (R) 8.08% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R) 15.54% |    (R) 12.57% |
|                   | fix_size_alloc_test: p:4, h:0    |       (R) 39.18% |     (R) 9.13% |
|                   | fix_size_alloc_test: p:16, h:0   |       (R) 65.81% |         2.97% |
|                   | fix_size_alloc_test: p:64, h:0   |       (R) 83.39% |        -0.49% |
|                   | fix_size_alloc_test: p:256, h:0  |       (R) 87.85% |    (I) -2.04% |
|                   | fix_size_alloc_test: p:16, h:1   |       (R) 51.21% |         3.77% |
|                   | fix_size_alloc_test: p:64, h:1   |       (R) 60.02% |         0.99% |
|                   | fix_size_alloc_test: p:256, h:1  |       (R) 63.82% |         1.16% |
|                   | random_size_alloc_test: p:1, h:0 |       (R) 77.79% |        -0.51% |
|                   | vm_map_ram_test: p:1, h:0        |       (R) 30.67% |    (R) 27.09% |
+-------------------+----------------------------------+------------------+---------------+

> Those results are overall very similar to the original ones.
> micromm/fork is however clearly impacted - around 4% additional overhead > from set_memory_pkey(); it makes sense considering that forking requires > duplicating (and therefore allocating) a full set of page tables. > kernbench is also a fork-heavy workload and it gets a 1% hit in system > time (with batching). > > It seems fair to conclude that, on arm64, setting the pkey whenever a > PTP is allocated/freed is not particularly expensive. The situation may > well be different on x86 as Rick pointed out, and it may also change on > newer arm64 systems as I noted further down. Allocating/freeing PTPs in > bulk should help if setting the pkey in the pgtable ctor/dtor proves too > expensive. > > - Kevin > >> Benchmarks: >> - mmtests/kernbench: running kernbench (kernel build) [4]. >> - micromm/{fork,munmap}: from David Hildenbrand's benchmark suite. A >> 1 GB mapping is created and then fork/unmap is called. The mapping is >> created using either page-sized (h:0) or hugepage folios (h:1); in all >> cases the memory is PTE-mapped. >> - micromm/vmalloc: from test_vmalloc.ko, varying the number of pages >> (p:) and whether huge pages are used (h:). >> >> On a "real-world" and fork-heavy workload like kernbench, the estimated >> overhead of kpkeys_hardened_pgtables is reasonable: 4% system time >> overhead without batching, and about half that figure (2.2%) with >> batching. The real time overhead is negligible. >> >> Microbenchmarks show large overheads without batching, which increase >> with the number of pages being manipulated. Batching drastically reduces >> that overhead, almost negating it for micromm/fork. Because all PTEs in >> the mapping are modified in the same lazy_mmu section, the kpkeys level >> is changed just twice regardless of the mapping size; as a result the >> relative overhead actually decreases as the size increases for >> fix_size_alloc_test. >> >> Note: the performance impact of set_memory_pkey() is likely to be >> relatively low on arm64 because the linear mapping uses PTE-level >> descriptors only. This means that set_memory_pkey() simply changes the >> attributes of some PTE descriptors. However, some systems may be able to >> use higher-level descriptors in the future [5], meaning that >> set_memory_pkey() may have to split mappings. Allocating page tables >> from a contiguous cache of pages could help minimise the overhead, as >> proposed for x86 in [1]. >> >> [...]
On Wed, 2025-08-20 at 18:01 +0200, Kevin Brodsky wrote: > Apologies, Thunderbird helpfully decided to wrap around that table... > Here's the unmangled table: > > +-------------------+----------------------------------+------------------+---------------+ > > Benchmark | Result Class | Without batching | With batching | > +===================+==================================+==================+===============+ > > mmtests/kernbench | real time | 0.32% | 0.35% | > > | system time | (R) 4.18% | (R) 3.18% | > > | user time | 0.08% | 0.20% | > +-------------------+----------------------------------+------------------+---------------+ > > micromm/fork | fork: h:0 | (R) 221.39% | (R) 3.35% | > > | fork: h:1 | (R) 282.89% | (R) 6.99% | > +-------------------+----------------------------------+------------------+---------------+ > > micromm/munmap | munmap: h:0 | (R) 17.37% | -0.28% | > > | munmap: h:1 | (R) 172.61% | (R) 8.08% | > +-------------------+----------------------------------+------------------+---------------+ > > micromm/vmalloc | fix_size_alloc_test: p:1, h:0 | (R) 15.54% | (R) 12.57% | Both this and the previous one have the 95% confidence interval. So it saw a 16% speed up with direct map modification. Possible? > > | fix_size_alloc_test: p:4, h:0 | (R) 39.18% | (R) 9.13% | > > | fix_size_alloc_test: p:16, h:0 | (R) 65.81% | 2.97% | > > | fix_size_alloc_test: p:64, h:0 | (R) 83.39% | -0.49% | > > | fix_size_alloc_test: p:256, h:0 | (R) 87.85% | (I) -2.04% | > > | fix_size_alloc_test: p:16, h:1 | (R) 51.21% | 3.77% | > > | fix_size_alloc_test: p:64, h:1 | (R) 60.02% | 0.99% | > > | fix_size_alloc_test: p:256, h:1 | (R) 63.82% | 1.16% | > > | random_size_alloc_test: p:1, h:0 | (R) 77.79% | -0.51% | > > | vm_map_ram_test: p:1, h:0 | (R) 30.67% | (R) 27.09% | > +-------------------+----------------------------------+------------------+---------------+ Hmm, still surprisingly low to me, but ok. It would be good have x86 and arm work the same, but I don't think we have line of sight to x86 currently. And I actually never did real benchmarks.
On 20/08/2025 18:18, Edgecombe, Rick P wrote: > On Wed, 2025-08-20 at 18:01 +0200, Kevin Brodsky wrote: >> Apologies, Thunderbird helpfully decided to wrap around that table... >> Here's the unmangled table: >> >> +-------------------+----------------------------------+------------------+---------------+ >>> Benchmark | Result Class | Without batching | With batching | >> +===================+==================================+==================+===============+ >>> mmtests/kernbench | real time | 0.32% | 0.35% | >>> | system time | (R) 4.18% | (R) 3.18% | >>> | user time | 0.08% | 0.20% | >> +-------------------+----------------------------------+------------------+---------------+ >>> micromm/fork | fork: h:0 | (R) 221.39% | (R) 3.35% | >>> | fork: h:1 | (R) 282.89% | (R) 6.99% | >> +-------------------+----------------------------------+------------------+---------------+ >>> micromm/munmap | munmap: h:0 | (R) 17.37% | -0.28% | >>> | munmap: h:1 | (R) 172.61% | (R) 8.08% | >> +-------------------+----------------------------------+------------------+---------------+ >>> micromm/vmalloc | fix_size_alloc_test: p:1, h:0 | (R) 15.54% | (R) 12.57% | > Both this and the previous one have the 95% confidence interval. So it saw a 16% > speed up with direct map modification. Possible? Positive numbers mean performance degradation ("(R)" actually stands for regression), so in that case the protection is adding a 16%/13% overhead. Here this is mainly due to the added pkey register switching (+ barrier) happening on every call to vmalloc() and vfree(), which has a large relative impact since only one page is being allocated/freed. >>> | fix_size_alloc_test: p:4, h:0 | (R) 39.18% | (R) 9.13% | >>> | fix_size_alloc_test: p:16, h:0 | (R) 65.81% | 2.97% | >>> | fix_size_alloc_test: p:64, h:0 | (R) 83.39% | -0.49% | >>> | fix_size_alloc_test: p:256, h:0 | (R) 87.85% | (I) -2.04% | >>> | fix_size_alloc_test: p:16, h:1 | (R) 51.21% | 3.77% | >>> | fix_size_alloc_test: p:64, h:1 | (R) 60.02% | 0.99% | >>> | fix_size_alloc_test: p:256, h:1 | (R) 63.82% | 1.16% | >>> | random_size_alloc_test: p:1, h:0 | (R) 77.79% | -0.51% | >>> | vm_map_ram_test: p:1, h:0 | (R) 30.67% | (R) 27.09% | >> +-------------------+----------------------------------+------------------+---------------+ > Hmm, still surprisingly low to me, but ok. It would be good have x86 and arm > work the same, but I don't think we have line of sight to x86 currently. And I > actually never did real benchmarks. It would certainly be good to get numbers on x86 as well - I'm hoping that someone with a better understanding of x86 than myself could implement kpkeys on x86 at some point, so that we can run the same benchmarks there. - Kevin
Hi Kevin,

On 8/15/25 1:54 AM, Kevin Brodsky wrote:
> This is a proposal to leverage protection keys (pkeys) to harden
> critical kernel data, by making it mostly read-only. The series includes
> a simple framework called "kpkeys" to manipulate pkeys for in-kernel use,
> as well as a page table hardening feature based on that framework,
> "kpkeys_hardened_pgtables". Both are implemented on arm64 as a proof of
> concept, but they are designed to be compatible with any architecture
> that supports pkeys.

[...]

> Note: the performance impact of set_memory_pkey() is likely to be
> relatively low on arm64 because the linear mapping uses PTE-level
> descriptors only. This means that set_memory_pkey() simply changes the
> attributes of some PTE descriptors. However, some systems may be able to
> use higher-level descriptors in the future [5], meaning that
> set_memory_pkey() may have to split mappings. Allocating page tables

I suppose the page table hardening feature will be opt-in due to its overhead? If so I think you can just keep kernel linear mapping using PTE, just like debug page alloc.

> from a contiguous cache of pages could help minimise the overhead, as
> proposed for x86 in [1].

I'm a little bit confused about how this can work. The contiguous cache of pages should be some large page, for example, 2M. But the page table pages allocated from the cache may have different permissions if I understand correctly. The default permission is RO, but some of them may become R/W at some point, for example, when calling set_pte_at(). You still need to split the linear mapping, right?

Regards,
Yang
On 21/08/2025 19:29, Yang Shi wrote:
> Hi Kevin,
>
> On 8/15/25 1:54 AM, Kevin Brodsky wrote:
>> This is a proposal to leverage protection keys (pkeys) to harden
>> critical kernel data, by making it mostly read-only. The series includes
>> a simple framework called "kpkeys" to manipulate pkeys for in-kernel use,
>> as well as a page table hardening feature based on that framework,
>> "kpkeys_hardened_pgtables". Both are implemented on arm64 as a proof of
>> concept, but they are designed to be compatible with any architecture
>> that supports pkeys.
>
> [...]
>
>> Note: the performance impact of set_memory_pkey() is likely to be
>> relatively low on arm64 because the linear mapping uses PTE-level
>> descriptors only. This means that set_memory_pkey() simply changes the
>> attributes of some PTE descriptors. However, some systems may be able to
>> use higher-level descriptors in the future [5], meaning that
>> set_memory_pkey() may have to split mappings. Allocating page tables
>
> I suppose the page table hardening feature will be opt-in due to
> its overhead? If so I think you can just keep kernel linear mapping
> using PTE, just like debug page alloc.

Indeed, I don't expect it to be turned on by default (in defconfig). If the overhead proves too large when block mappings are used, it seems reasonable to force PTE mappings when kpkeys_hardened_pgtables is enabled.

>> from a contiguous cache of pages could help minimise the overhead, as
>> proposed for x86 in [1].
>
> I'm a little bit confused about how this can work. The contiguous
> cache of pages should be some large page, for example, 2M. But the
> page table pages allocated from the cache may have different
> permissions if I understand correctly. The default permission is RO,
> but some of them may become R/W at some point, for example, when calling
> set_pte_at(). You still need to split the linear mapping, right?

When such a helper is called, *all* PTPs become writeable - there is no per-PTP permission switching. PTPs remain mapped RW (i.e. the base permissions set at the PTE level are RW). With this series, they are also all mapped with the same pkey (1). By default, the pkey register is configured so that pkey 1 provides RO access. The net result is that PTPs are RO by default, since the pkey restricts the effective permissions.

When calling e.g. set_pte(), the pkey register is modified to enable RW access to pkey 1, making it possible to write to any PTP. Its value is restored when the function exits so that PTPs are once again RO.

- Kevin
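To illustrate the allocation-time side of what is described above - purely a sketch: kpkeys_hardened_pgtables_enabled() and kpkeys_{,un}protect_pgtable_memory() are the helper names mentioned in the changelog, but the hook shown here, its name and its exact signature are assumptions rather than the series' code:

    /* Called from the common pagetable ctor path when a PTP is allocated. */
    static inline int kpkeys_pgtable_ctor(struct ptdesc *ptdesc)
    {
            if (!kpkeys_hardened_pgtables_enabled())        /* static key */
                    return 0;

            /*
             * Tag the page(s) with KPKEYS_PKEY_PGTABLES; the default pkey
             * register configuration only grants RO access to that pkey,
             * so the PTP becomes read-only outside the guarded helpers.
             */
            return kpkeys_protect_pgtable_memory(ptdesc_folio(ptdesc));
    }

The dtor side would undo this with kpkeys_unprotect_pgtable_memory() before the page is freed, so that the page returns to the default pkey.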
On 25/08/2025 09:31, Kevin Brodsky wrote:
>>> Note: the performance impact of set_memory_pkey() is likely to be
>>> relatively low on arm64 because the linear mapping uses PTE-level
>>> descriptors only. This means that set_memory_pkey() simply changes the
>>> attributes of some PTE descriptors. However, some systems may be able to
>>> use higher-level descriptors in the future [5], meaning that
>>> set_memory_pkey() may have to split mappings. Allocating page tables
>> I suppose the page table hardening feature will be opt-in due to
>> its overhead? If so I think you can just keep kernel linear mapping
>> using PTE, just like debug page alloc.
> Indeed, I don't expect it to be turned on by default (in defconfig). If
> the overhead proves too large when block mappings are used, it seems
> reasonable to force PTE mappings when kpkeys_hardened_pgtables is enabled.

I had a closer look at what happens when the linear map uses block mappings, rebasing this series on top of [1]. Unfortunately, this is worse than I thought: it does not work at all as things stand.

The main issue is that calling set_memory_pkey() in pagetable_*_ctor() can cause the linear map to be split, which requires new PTP(s) to be allocated, which means more nested call(s) to set_memory_pkey(). This explodes as a non-recursive lock is taken on that path. More fundamentally, this cannot work unless we can explicitly allocate PTPs from either:

1. A pool of PTE-mapped pages
2. A pool of memory that is already mapped with the right pkey (at any level)

This is where I have to apologise to Rick for not having studied his series more thoroughly, as patch 17 [2] covers this issue very well in the commit message.

It seems fair to say there is no ideal or simple solution, though. Rick's patch reserves enough (PTE-mapped) memory for fully splitting the linear map, which is relatively simple but not very pleasant. Chatting with Ryan Roberts, we figured out another approach, improving on solution 1 mentioned in [2]. It would rely on allocating all PTPs from a special pool (without using set_memory_pkey() in pagetable_*_ctor), along those lines:

1. 2 pages are reserved at all times (with the appropriate pkey)
2. Try to allocate a 2M block. If needed, use a reserved page as PMD to split a PUD. If successful, set its pkey - the entire block can now be used for PTPs. Replenish the reserve from the block if needed.
3. If no block is available, make an order-2 allocation (4 pages). If needed, use 1-2 reserved pages to split PUD/PMD. Set the pkey of the 4 pages, take 1-2 pages to replenish the reserve if needed.

This ensures that we never run out of PTPs for splitting. We may get into an OOM situation more easily due to the order-2 requirement, but the risk remains low compared to requiring a 2M block. A bigger concern is concurrency - do we need a per-CPU cache? Reserving a 2M block per CPU could be very much overkill.

No matter which solution is used, this clearly increases the complexity of kpkeys_hardened_pgtables. Mike Rapoport has posted a number of RFCs [3][4] that aim at addressing this problem more generally, but no consensus seems to have emerged and I'm not sure they would completely solve this specific problem either.

For now, my plan is to stick to solution 3 from [2], i.e. force the linear map to be PTE-mapped. This is easily done on arm64 as it is the default, and is required for rodata=full, unless [1] is applied and the system supports BBML2_NOABORT.
See [1] for the potential performance improvements we'd be missing out on (~5% ballpark). I'm not quite sure what the picture looks like on x86 - it may well be more significant as Rick suggested.

- Kevin

[1] https://lore.kernel.org/all/20250829115250.2395585-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/all/20210830235927.6443-18-rick.p.edgecombe@intel.com/
[3] https://lore.kernel.org/lkml/20210823132513.15836-1-rppt@kernel.org/
[4] https://lore.kernel.org/all/20230308094106.227365-1-rppt@kernel.org/
On Thu, 2025-09-18 at 16:15 +0200, Kevin Brodsky wrote: > This is where I have to apologise to Rick for not having studied his > series more thoroughly, as patch 17 [2] covers this issue very well in > the commit message. > > It seems fair to say there is no ideal or simple solution, though. > Rick's patch reserves enough (PTE-mapped) memory for fully splitting the > linear map, which is relatively simple but not very pleasant. Chatting > with Ryan Roberts, we figured another approach, improving on solution 1 > mentioned in [2]. It would rely on allocating all PTPs from a special > pool (without using set_memory_pkey() in pagetable_*_ctor), along those > lines: Oh I didn't realize ARM split the direct map now at runtime. IIRC it used to just map at 4k if there were any permissions configured. > > 1. 2 pages are reserved at all times (with the appropriate pkey) > 2. Try to allocate a 2M block. If needed, use a reserved page as PMD to > split a PUD. If successful, set its pkey - the entire block can now be > used for PTPs. Replenish the reserve from the block if needed. > 3. If no block is available, make an order-2 allocation (4 pages). If > needed, use 1-2 reserved pages to split PUD/PMD. Set the pkey of the 4 > pages, take 1-2 pages to replenish the reserve if needed. Oh, good idea! > > This ensures that we never run out of PTPs for splitting. We may get > into an OOM situation more easily due to the order-2 requirement, but > the risk remains low compared to requiring a 2M block. A bigger concern > is concurrency - do we need a per-CPU cache? Reserving a 2M block per > CPU could be very much overkill. > > No matter which solution is used, this clearly increases the complexity > of kpkeys_hardened_pgtables. Mike Rapoport has posted a number of RFCs > [3][4] that aim at addressing this problem more generally, but no > consensus seems to have emerged and I'm not sure they would completely > solve this specific problem either. > > For now, my plan is to stick to solution 3 from [2], i.e. force the > linear map to be PTE-mapped. This is easily done on arm64 as it is the > default, and is required for rodata=full, unless [1] is applied and the > system supports BBML2_NOABORT. See [1] for the potential performance > improvements we'd be missing out on (~5% ballpark). > I continue to be surprised that allocation time pkey conversion is not a performance disaster, even with the directmap pre-split. > I'm not quite sure > what the picture looks like on x86 - it may well be more significant as > Rick suggested. I think having more efficient direct map permissions is a solvable problem, but each usage is just a little too small to justify the infrastructure for a good solution. And each simple solution is a little too much overhead to justify the usage. So there is a long tail of blocked usages: - pkeys usages (page tables and secret protection) - kernel shadow stacks - More efficient executable code allocations (BPF, kprobe trampolines, etc) Although the BPF folks started doing their own thing for this. But I don't think there are any fundamentally unsolvable problems for a generic solution. It's a question of a leading killer usage to justify the infrastructure. Maybe it will be kernel shadow stack.
On 18/09/2025 19:31, Edgecombe, Rick P wrote: > On Thu, 2025-09-18 at 16:15 +0200, Kevin Brodsky wrote: >> This is where I have to apologise to Rick for not having studied his >> series more thoroughly, as patch 17 [2] covers this issue very well in >> the commit message. >> >> It seems fair to say there is no ideal or simple solution, though. >> Rick's patch reserves enough (PTE-mapped) memory for fully splitting the >> linear map, which is relatively simple but not very pleasant. Chatting >> with Ryan Roberts, we figured another approach, improving on solution 1 >> mentioned in [2]. It would rely on allocating all PTPs from a special >> pool (without using set_memory_pkey() in pagetable_*_ctor), along those >> lines: > Oh I didn't realize ARM split the direct map now at runtime. IIRC it used to > just map at 4k if there were any permissions configured. Until recently the linear map was always PTE-mapped on arm64 if rodata=full (default) or in other situations (e.g. DEBUG_PAGEALLOC), so that it never needed to be split at runtime. Since [1b] landed though, there is support for setting permissions at the block level and splitting, meaning that the linear map can be block-mapped in most cases (see force_pte_mapping() in patch 3 for details). This is only enabled on systems with the BBML2_NOABORT feature though. [1b] https://lore.kernel.org/all/20250917190323.3828347-1-yang@os.amperecomputing.com/ >> 1. 2 pages are reserved at all times (with the appropriate pkey) >> 2. Try to allocate a 2M block. If needed, use a reserved page as PMD to >> split a PUD. If successful, set its pkey - the entire block can now be >> used for PTPs. Replenish the reserve from the block if needed. >> 3. If no block is available, make an order-2 allocation (4 pages). If >> needed, use 1-2 reserved pages to split PUD/PMD. Set the pkey of the 4 >> pages, take 1-2 pages to replenish the reserve if needed. > Oh, good idea! > >> This ensures that we never run out of PTPs for splitting. We may get >> into an OOM situation more easily due to the order-2 requirement, but >> the risk remains low compared to requiring a 2M block. A bigger concern >> is concurrency - do we need a per-CPU cache? Reserving a 2M block per >> CPU could be very much overkill. >> >> No matter which solution is used, this clearly increases the complexity >> of kpkeys_hardened_pgtables. Mike Rapoport has posted a number of RFCs >> [3][4] that aim at addressing this problem more generally, but no >> consensus seems to have emerged and I'm not sure they would completely >> solve this specific problem either. >> >> For now, my plan is to stick to solution 3 from [2], i.e. force the >> linear map to be PTE-mapped. This is easily done on arm64 as it is the >> default, and is required for rodata=full, unless [1] is applied and the >> system supports BBML2_NOABORT. See [1] for the potential performance >> improvements we'd be missing out on (~5% ballpark). >> > I continue to be surprised that allocation time pkey conversion is not a > performance disaster, even with the directmap pre-split. > >> I'm not quite sure >> what the picture looks like on x86 - it may well be more significant as >> Rick suggested. > I think having more efficient direct map permissions is a solvable problem, but > each usage is just a little too small to justify the infrastructure for a good > solution. And each simple solution is a little too much overhead to justify the > usage. 
> So there is a long tail of blocked usages: > - pkeys usages (page tables and secret protection) > - kernel shadow stacks > - More efficient executable code allocations (BPF, kprobe trampolines, etc) > > Although the BPF folks started doing their own thing for this. But I don't think > there are any fundamentally unsolvable problems for a generic solution. It's a > question of a leading killer usage to justify the infrastructure. Maybe it will > be kernel shadow stack. That seems to be exactly the situation, yes. Given Will's feedback, I'll give implementing such a dedicated allocator another try (based on the scheme I suggested above) and see how it goes. Hopefully that will create more momentum for a generic infrastructure :) - Kevin
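As an aside, for readers less familiar with the arm64 side: the force_pte_mapping() decision Kevin refers to a couple of messages above boils down to something like the following. This is a conceptual sketch only, not the code from [1b]; system_supports_bbml2_noabort() is an assumed helper name and the real function takes additional cases into account.

static bool force_pte_mapping(void)
{
	/*
	 * Conceptually: PTE-map the whole linear map when fine-grained
	 * permission changes will be needed (e.g. rodata=full) and the
	 * system cannot split block mappings safely at runtime.
	 */
	return rodata_full && !system_supports_bbml2_noabort();
}

In other words, with [1b] applied and BBML2_NOABORT present, the linear map stays block-mapped even with rodata=full - which is exactly the configuration kpkeys_hardened_pgtables now has to cope with.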
On Thu, Sep 18, 2025 at 04:15:52PM +0200, Kevin Brodsky wrote: > On 25/08/2025 09:31, Kevin Brodsky wrote: > >>> Note: the performance impact of set_memory_pkey() is likely to be > >>> relatively low on arm64 because the linear mapping uses PTE-level > >>> descriptors only. This means that set_memory_pkey() simply changes the > >>> attributes of some PTE descriptors. However, some systems may be able to > >>> use higher-level descriptors in the future [5], meaning that > >>> set_memory_pkey() may have to split mappings. Allocating page tables > >> I'm supposed the page table hardening feature will be opt-in due to > >> its overhead? If so I think you can just keep kernel linear mapping > >> using PTE, just like debug page alloc. > > Indeed, I don't expect it to be turned on by default (in defconfig). If > > the overhead proves too large when block mappings are used, it seems > > reasonable to force PTE mappings when kpkeys_hardened_pgtables is enabled. > > I had a closer look at what happens when the linear map uses block > mappings, rebasing this series on top of [1]. Unfortunately, this is > worse than I thought: it does not work at all as things stand. > > The main issue is that calling set_memory_pkey() in pagetable_*_ctor() > can cause the linear map to be split, which requires new PTP(s) to be > allocated, which means more nested call(s) to set_memory_pkey(). This > explodes as a non-recursive lock is taken on that path. > > More fundamentally, this cannot work unless we can explicitly allocate > PTPs from either: > 1. A pool of PTE-mapped pages > 2. A pool of memory that is already mapped with the right pkey (at any > level) > > This is where I have to apologise to Rick for not having studied his > series more thoroughly, as patch 17 [2] covers this issue very well in > the commit message. > > It seems fair to say there is no ideal or simple solution, though. > Rick's patch reserves enough (PTE-mapped) memory for fully splitting the > linear map, which is relatively simple but not very pleasant. Chatting > with Ryan Roberts, we figured another approach, improving on solution 1 > mentioned in [2]. It would rely on allocating all PTPs from a special > pool (without using set_memory_pkey() in pagetable_*_ctor), along those > lines: > > 1. 2 pages are reserved at all times (with the appropriate pkey) > 2. Try to allocate a 2M block. If needed, use a reserved page as PMD to > split a PUD. If successful, set its pkey - the entire block can now be > used for PTPs. Replenish the reserve from the block if needed. > 3. If no block is available, make an order-2 allocation (4 pages). If > needed, use 1-2 reserved pages to split PUD/PMD. Set the pkey of the 4 > pages, take 1-2 pages to replenish the reserve if needed. > > This ensures that we never run out of PTPs for splitting. We may get > into an OOM situation more easily due to the order-2 requirement, but > the risk remains low compared to requiring a 2M block. A bigger concern > is concurrency - do we need a per-CPU cache? Reserving a 2M block per > CPU could be very much overkill. > > No matter which solution is used, this clearly increases the complexity > of kpkeys_hardened_pgtables. Mike Rapoport has posted a number of RFCs > [3][4] that aim at addressing this problem more generally, but no > consensus seems to have emerged and I'm not sure they would completely > solve this specific problem either. > > For now, my plan is to stick to solution 3 from [2], i.e. force the > linear map to be PTE-mapped. 
> This is easily done on arm64 as it is the > default, and is required for rodata=full, unless [1] is applied and the > system supports BBML2_NOABORT. See [1] for the potential performance > improvements we'd be missing out on (~5% ballpark). I'm not quite sure > what the picture looks like on x86 - it may well be more significant as > Rick suggested. Just as a data point: forcing the linear map down to 4k would likely prevent us from being able to enable this on Android. We've measured a considerable (double-digit %) increase in CPU power consumption for some real-life camera workloads when mapping the linear map at 4k granularity, due to the additional memory traffic from the PTW (page table walker). At some point, KFENCE required 4k granularity for the linear map, but we fixed it. rodata=full requires 4k granularity, but there are patches to fix that too. So I think we should avoid making the same mistake from the start for this series. Will
On 18/09/2025 16:57, Will Deacon wrote: > On Thu, Sep 18, 2025 at 04:15:52PM +0200, Kevin Brodsky wrote: >> [...] >> >> For now, my plan is to stick to solution 3 from [2], i.e. force the >> linear map to be PTE-mapped. This is easily done on arm64 as it is the >> default, and is required for rodata=full, unless [1] is applied and the >> system supports BBML2_NOABORT. See [1] for the potential performance >> improvements we'd be missing out on (~5% ballpark). I'm not quite sure >> what the picture looks like on x86 - it may well be more significant as >> Rick suggested. > Just as a data point, but forcing the linear map down to 4k would likely > prevent us from being able to enable this on Android. We've measured a > considerable (double digit %) increase in CPU power consumption for some > real-life camera workloads when mapping the linear map at 4k granularity > thanks to the additional memory traffic from the PTW. Good to know! > At some point, KFENCE required 4k granularity for the linear map, but we > fixed it. rodata=full requires 4k granularity, but there are patches to > fix that too. So I think we should avoid making the same mistake from > the start for this series. Understood, makes sense. I'll be looking into implementing the custom PTP allocator I suggested above then. - Kevin
On 8/25/25 12:31 AM, Kevin Brodsky wrote: > On 21/08/2025 19:29, Yang Shi wrote: >> Hi Kevin, >> >> On 8/15/25 1:54 AM, Kevin Brodsky wrote: >>> This is a proposal to leverage protection keys (pkeys) to harden >>> critical kernel data, by making it mostly read-only. The series includes >>> a simple framework called "kpkeys" to manipulate pkeys for in-kernel >>> use, >>> as well as a page table hardening feature based on that framework, >>> "kpkeys_hardened_pgtables". Both are implemented on arm64 as a proof of >>> concept, but they are designed to be compatible with any architecture >>> that supports pkeys. >> [...] >> >>> Note: the performance impact of set_memory_pkey() is likely to be >>> relatively low on arm64 because the linear mapping uses PTE-level >>> descriptors only. This means that set_memory_pkey() simply changes the >>> attributes of some PTE descriptors. However, some systems may be able to >>> use higher-level descriptors in the future [5], meaning that >>> set_memory_pkey() may have to split mappings. Allocating page tables >> I'm supposed the page table hardening feature will be opt-in due to >> its overhead? If so I think you can just keep kernel linear mapping >> using PTE, just like debug page alloc. > Indeed, I don't expect it to be turned on by default (in defconfig). If > the overhead proves too large when block mappings are used, it seems > reasonable to force PTE mappings when kpkeys_hardened_pgtables is enabled. > >>> from a contiguous cache of pages could help minimise the overhead, as >>> proposed for x86 in [1]. >> I'm a little bit confused about how this can work. The contiguous >> cache of pages should be some large page, for example, 2M. But the >> page table pages allocated from the cache may have different >> permissions if I understand correctly. The default permission is RO, >> but some of them may become R/W at sometime, for example, when calling >> set_pte_at(). You still need to split the linear mapping, right? > When such a helper is called, *all* PTPs become writeable - there is no > per-PTP permission switching. OK, so all PTPs in the same contiguous cache will become writeable even though the helper (i.e. set_pte_at()) is just called on one of the PTPs. But doesn't it compromise the page table hardening somehow? The PTPs from the same cache may belong to different processes. Thanks, Yang > > PTPs remain mapped RW (i.e. the base permissions set at the PTE level > are RW). With this series, they are also all mapped with the same pkey > (1). By default, the pkey register is configured so that pkey 1 provides > RO access. The net result is that PTPs are RO by default, since the pkey > restricts the effective permissions. > > When calling e.g. set_pte(), the pkey register is modified to enable RW > access to pkey 1, making it possible to write to any PTP. Its value is > restored when the function exit so that PTPs are once again RO. > > - Kevin
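To make the mechanism quoted above concrete, the pattern in a helper such as set_pte() is roughly the following. This is a sketch only: kpkeys_set_level()/kpkeys_restore_pkey_reg() are the interface provided by the series, but the level name (KPKEYS_LVL_PGTABLES), the return type and the exact way the helper is wrapped are assumptions.

static inline void set_pte(pte_t *ptep, pte_t pte)
{
	/* Switch to a level that grants RW access to the page-table pkey. */
	u64 saved_pkey_reg = kpkeys_set_level(KPKEYS_LVL_PGTABLES);

	/* The write only succeeds while the above level is in force. */
	WRITE_ONCE(*ptep, pte);

	/* Back to the previous level: PTPs are effectively RO again. */
	kpkeys_restore_pkey_reg(saved_pkey_reg);
}

Nothing about the PTP's own mapping changes here; it is only the value of the pkey register that gates the write, as Kevin explains above.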
On 26/08/2025 21:18, Yang Shi wrote: >> >>>> from a contiguous cache of pages could help minimise the overhead, as >>>> proposed for x86 in [1]. >>> I'm a little bit confused about how this can work. The contiguous >>> cache of pages should be some large page, for example, 2M. But the >>> page table pages allocated from the cache may have different >>> permissions if I understand correctly. The default permission is RO, >>> but some of them may become R/W at sometime, for example, when calling >>> set_pte_at(). You still need to split the linear mapping, right? >> When such a helper is called, *all* PTPs become writeable - there is no >> per-PTP permission switching. > > OK, so all PTPs in the same contiguous cache will become writeable > even though the helper (i.e. set_pte_at()) is just called on one of > the PTPs. But doesn't it compromise the page table hardening somehow? > The PTPs from the same cache may belong to different processes. First just a note that this is true regardless of how the PTPs are allocated (i.e. this is already the case in this version of the series). Either way, yes you are right, this approach does not introduce any isolation *between* page tables - pgtable helpers are able to write to all page tables. In principle it should be possible to use a different pkey for kernel and user page tables, but that would make the kpkeys level switching in helpers quite a bit more complicated. Isolating further is impractical as we have so few pkeys (just 8 on arm64). That said, what kpkeys really tries to protect against is the direct corruption of critical data by arbitrary (unprivileged) code. If the attacker is able to manipulate calls to set_pte() and the likes, kpkeys cannot provide much protection - even if we restricted the writes to a specific set of page tables, the attacker would still be able to insert a translation to any arbitrary physical page. - Kevin
On 8/27/25 9:09 AM, Kevin Brodsky wrote: > On 26/08/2025 21:18, Yang Shi wrote: >>>>> from a contiguous cache of pages could help minimise the overhead, as >>>>> proposed for x86 in [1]. >>>> I'm a little bit confused about how this can work. The contiguous >>>> cache of pages should be some large page, for example, 2M. But the >>>> page table pages allocated from the cache may have different >>>> permissions if I understand correctly. The default permission is RO, >>>> but some of them may become R/W at sometime, for example, when calling >>>> set_pte_at(). You still need to split the linear mapping, right? >>> When such a helper is called, *all* PTPs become writeable - there is no >>> per-PTP permission switching. >> OK, so all PTPs in the same contiguous cache will become writeable >> even though the helper (i.e. set_pte_at()) is just called on one of >> the PTPs. But doesn't it compromise the page table hardening somehow? >> The PTPs from the same cache may belong to different processes. > First just a note that this is true regardless of how the PTPs are > allocated (i.e. this is already the case in this version of the series). > > Either way, yes you are right, this approach does not introduce any > isolation *between* page tables - pgtable helpers are able to write to > all page tables. In principle it should be possible to use a different > pkey for kernel and user page tables, but that would make the kpkeys > level switching in helpers quite a bit more complicated. Isolating > further is impractical as we have so few pkeys (just 8 on arm64). > > That said, what kpkeys really tries to protect against is the direct > corruption of critical data by arbitrary (unprivileged) code. If the > attacker is able to manipulate calls to set_pte() and the likes, kpkeys > cannot provide much protection - even if we restricted the writes to a > specific set of page tables, the attacker would still be able to insert > a translation to any arbitrary physical page. I see. Thanks for elaborating this. Yang > > - Kevin