[RFC PATCH v5 00/18] pkeys-based page table hardening
Posted by Kevin Brodsky 1 month, 2 weeks ago
This is a proposal to leverage protection keys (pkeys) to harden
critical kernel data, by making it mostly read-only. The series includes
a simple framework called "kpkeys" to manipulate pkeys for in-kernel use,
as well as a page table hardening feature based on that framework,
"kpkeys_hardened_pgtables". Both are implemented on arm64 as a proof of
concept, but they are designed to be compatible with any architecture
that supports pkeys.

The proposed approach is a typical use of pkeys: the data to protect is
mapped with a given pkey P, and the pkey register is initially configured
to grant read-only access to P. Where the protected data needs to be
written to, the pkey register is temporarily switched to grant write
access to P on the current CPU.

The key fact this approach relies on is that the target data is
only written to via a limited and well-defined API. This makes it
possible to explicitly switch the pkey register where needed, without
introducing excessively invasive changes, and only for a small amount of
trusted code.

Page tables were chosen as they are a popular (and critical) target for
attacks, but there are of course many others - this is only a starting
point (see section "Further use-cases"). It has become increasingly
common for vendor kernels to mediate accesses to such target data via a
hypervisor; the hope is that kpkeys can provide much of that protection
in a simpler and cheaper manner. A rough performance
estimation has been performed on a modern arm64 system, see section
"Performance".

This series has similarities with the "PKS write protected page tables"
series posted by Rick Edgecombe a few years ago [1], but it is not
specific to x86/PKS - the approach is meant to be generic.

kpkeys
======

The use of pkeys involves two separate mechanisms: assigning a pkey to
pages, and defining the pkeys -> permissions mapping via the pkey
register. This is implemented through the following interface:

- Pages in the linear mapping are assigned a pkey using set_memory_pkey().
  This is sufficient for this series, but of course higher-level
  interfaces can be introduced later to ask allocators to return pages
  marked with a given pkey. It should also be possible to extend this to
  vmalloc() if needed.

- The pkey register is configured based on a *kpkeys level*. kpkeys
  levels are simple integers that correspond to a given configuration,
  for instance:

  KPKEYS_LVL_DEFAULT:
        RW access to KPKEYS_PKEY_DEFAULT
        RO access to any other KPKEYS_PKEY_*

  KPKEYS_LVL_<FEAT>:
        RW access to KPKEYS_PKEY_DEFAULT
        RW access to KPKEYS_PKEY_<FEAT>
        RO access to any other KPKEYS_PKEY_*

  Only pkeys that are managed by the kpkeys framework are impacted;
  permissions for other pkeys are left unchanged (this allows for other
  schemes using pkeys to be used in parallel, and arch-specific use of
  certain pkeys).

  The kpkeys level is changed by calling kpkeys_set_level(), setting the
  pkey register accordingly and returning the original value. A
  subsequent call to kpkeys_restore_pkey_reg() restores the kpkeys
  level. The numeric value of KPKEYS_LVL_* (kpkeys level) is purely
  symbolic and thus generic, however each architecture is free to define
  KPKEYS_PKEY_* (pkey value).
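
To make this more concrete, here is a hedged sketch of how a
hypothetical feature could protect a page of data using the interface
above (the set_memory_pkey() prototype and the KPKEYS_*_FEAT names are
illustrative assumptions, not the series' exact definitions):

    /* Once: assign the feature's pkey to the page holding the data.
     * The data then becomes read-only at KPKEYS_LVL_DEFAULT.
     */
    set_memory_pkey((unsigned long)data, 1, KPKEYS_PKEY_FEAT);

    /* On each legitimate write: raise the kpkeys level, write, restore. */
    u64 pkey_reg = kpkeys_set_level(KPKEYS_LVL_FEAT);
    data->counter = new_value;
    kpkeys_restore_pkey_reg(pkey_reg);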

kpkeys_hardened_pgtables
========================

The kpkeys_hardened_pgtables feature uses the interface above to make
the (kernel and user) page tables read-only by default, enabling write
access only in helpers such as set_pte(). One complication is that those
helpers, as well as page table allocators, are used very early, before
kpkeys become available. kpkeys_hardened_pgtables is therefore enabled
as follows, if and when kpkeys become available:

1. A static key is turned on. This enables a transition to
   KPKEYS_LVL_PGTABLES in all helpers writing to page tables, and also
   impacts page table allocators (step 3).

2. swapper_pg_dir is walked to set all early page table pages to
   KPKEYS_PKEY_PGTABLES.

3. Page table allocators set the returned pages to KPKEYS_PKEY_PGTABLES
   (and the pkey is reset upon freeing). This ensures that all page
   tables are mapped with that privileged pkey.
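
As an illustration of how the static key gates the hardening, a pgtable
write helper might end up looking roughly like this (hedged sketch; the
series actually uses a guard object defined in <asm/pgtable.h>, and
KPKEYS_PKEY_REG_INVAL turns the restore into a no-op):

    static inline void set_pte(pte_t *ptep, pte_t pte)
    {
            u64 pkey_reg = KPKEYS_PKEY_REG_INVAL;

            /* Static key: no-op until kpkeys_hardened_pgtables is enabled */
            if (kpkeys_hardened_pgtables_enabled())
                    pkey_reg = kpkeys_set_level(KPKEYS_LVL_PGTABLES);

            WRITE_ONCE(*ptep, pte);

            /* Does nothing if pkey_reg == KPKEYS_PKEY_REG_INVAL */
            kpkeys_restore_pkey_reg(pkey_reg);
    }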

This series
===========

The series is composed of two parts:

- The kpkeys framework (patch 1-9). The main API is introduced in
  <linux/kpkeys.h>, and it is implemented on arm64 using the POE
  (Permission Overlay Extension) feature.

- The kpkeys_hardened_pgtables feature (patch 10-18). <linux/kpkeys.h>
  is extended with an API to set page table pages to a given pkey and a
  guard object to switch kpkeys level accordingly, both gated on a
  static key. This is then used in generic and arm64 pgtable handling
  code as needed. Finally a simple KUnit-based test suite is added to
  demonstrate the page table protection.

pkey register management
========================

The kpkeys model relies on the kernel pkey register being set to a
specific value for the duration of a relatively small section of code,
and otherwise to the default value. Accordingly, the arm64
implementation based on POE handles its pkey register (POR_EL1) as
follows:

- POR_EL1 is saved and reset to its default value on exception entry,
  and restored on exception return. This ensures that exception handling
  code runs in a fixed kpkeys state.

- POR_EL1 is context-switched per-thread. This allows sections of code
  that run at a non-default kpkeys level to sleep (e.g. when locking a
  mutex). For kpkeys_hardened_pgtables, only involuntary preemption is
  relevant and the previous point already handles that; however sleeping
  is likely to occur in more advanced uses of kpkeys.
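
  As a rough illustration of this second point, the context-switch hook
  could look like the sketch below (hedged; the thread_struct field and
  helper names are assumptions, the actual code is in patch 8):

    static void por_el1_thread_switch(struct task_struct *next)
    {
            if (!system_supports_poe())
                    return;

            /* Save the outgoing thread's kpkeys state... */
            current->thread.por_el1 = read_sysreg_s(SYS_POR_EL1);
            /* ...and install the incoming thread's one. */
            if (next->thread.por_el1 != current->thread.por_el1) {
                    write_sysreg_s(next->thread.por_el1, SYS_POR_EL1);
                    isb();
            }
    }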

An important assumption is that all kpkeys levels allow RW access to the
default pkey (0). Otherwise, saving POR_EL1 before resetting it on
exception entry would be at best difficult, and so would
context-switching it.

Performance
===========

No arm64 hardware currently implements POE. To estimate the performance
impact of kpkeys_hardened_pgtables, a mock implementation of kpkeys has
been used, replacing accesses to the POR_EL1 register with accesses to
another system register that is otherwise unused (CONTEXTIDR_EL1), and
leaving everything else unchanged. Most of the kpkeys overhead is
expected to originate from the barrier (ISB) that is required after
writing to POR_EL1, and from setting the POIndex (pkey) in page tables;
both of these are done exactly in the same way in the mock
implementation.

The original implementation of kpkeys_hardened_pgtables is very
inefficient when many PTEs are changed at once, as the kpkeys level is
switched twice for every PTE (two ISBs per PTE). Patch 18 introduces
an optimisation that makes use of the lazy_mmu mode to batch those
switches: 1. switch to KPKEYS_LVL_PGTABLES on arch_enter_lazy_mmu_mode(),
2. skip any kpkeys switch while in that section, and 3. restore the
kpkeys level on arch_leave_lazy_mmu_mode(). Where that last function
already issues an ISB (i.e. when updating kernel page tables), we get a
further optimisation, as the ISB can be skipped when restoring the
kpkeys level.
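
A hedged sketch of how those hooks might fit together is shown below
(helper and field names such as lazy_mmu_pkey_reg are assumptions, and
the real hooks also keep the pre-existing lazy_mmu bookkeeping from [3]):

    static inline void arch_enter_lazy_mmu_mode(void)
    {
            /* Batching is disabled in interrupt context (see changelog) */
            if (!kpkeys_hardened_pgtables_enabled() || in_interrupt())
                    return;

            /* One level switch for the whole section, not two per PTE */
            current->thread.lazy_mmu_pkey_reg =
                    kpkeys_set_level(KPKEYS_LVL_PGTABLES);
            set_thread_flag(TIF_LAZY_MMU);
    }

    static inline void arch_leave_lazy_mmu_mode(void)
    {
            if (test_and_clear_thread_flag(TIF_LAZY_MMU))
                    kpkeys_restore_pkey_reg(current->thread.lazy_mmu_pkey_reg);
    }

Helpers such as set_pte() can then skip their own kpkeys switch whenever
TIF_LAZY_MMU is set.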

Both implementations (without and with batching) were evaluated on an
Amazon EC2 M7g instance (Graviton3), using a variety of benchmarks that
involve heavy page table manipulations. The results shown below are
relative to the baseline for this series, which is 6.17-rc1. The
branches used for all three sets of results (baseline, with/without
batching) are available in a repository, see next section.

Caveat: these numbers should be seen as a lower bound for the overhead
of a real POE-based protection. The hardware checks added by POE are
however not expected to incur significant extra overhead.

Reading example: for the fix_size_alloc_test benchmark, using 1 page per
iteration (no hugepage), kpkeys_hardened_pgtables incurs 17.35% overhead
without batching, and 14.62% overhead with batching. Both results are
considered statistically significant (95% confidence interval),
indicated by "(R)".

+-------------------+----------------------------------+------------------+---------------+
| Benchmark         | Result Class                     | Without batching | With batching |
+===================+==================================+==================+===============+
| mmtests/kernbench | real time                        |            0.30% |         0.11% |
|                   | system time                      |        (R) 3.97% |     (R) 2.17% |
|                   | user time                        |            0.12% |         0.02% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/fork      | fork: h:0                        |      (R) 217.31% |        -0.97% |
|                   | fork: h:1                        |      (R) 275.25% |     (R) 2.25% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/munmap    | munmap: h:0                      |       (R) 15.57% |        -1.95% |
|                   | munmap: h:1                      |      (R) 169.53% |     (R) 6.53% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R) 17.35% |    (R) 14.62% |
|                   | fix_size_alloc_test: p:4, h:0    |       (R) 37.54% |     (R) 9.35% |
|                   | fix_size_alloc_test: p:16, h:0   |       (R) 66.08% |     (R) 3.15% |
|                   | fix_size_alloc_test: p:64, h:0   |       (R) 82.94% |        -0.39% |
|                   | fix_size_alloc_test: p:256, h:0  |       (R) 87.85% |        -1.67% |
|                   | fix_size_alloc_test: p:16, h:1   |       (R) 50.31% |         3.00% |
|                   | fix_size_alloc_test: p:64, h:1   |       (R) 59.73% |         2.23% |
|                   | fix_size_alloc_test: p:256, h:1  |       (R) 62.14% |         1.51% |
|                   | random_size_alloc_test: p:1, h:0 |       (R) 77.82% |        -0.21% |
|                   | vm_map_ram_test: p:1, h:0        |       (R) 30.66% |    (R) 27.30% |
+-------------------+----------------------------------+------------------+---------------+

Benchmarks:
- mmtests/kernbench: running kernbench (kernel build) [4].
- micromm/{fork,munmap}: from David Hildenbrand's benchmark suite. A
  1 GB mapping is created and then fork/unmap is called. The mapping is
  created using either page-sized (h:0) or hugepage folios (h:1); in all
  cases the memory is PTE-mapped.
- micromm/vmalloc: from test_vmalloc.ko, varying the number of pages
  (p:) and whether huge pages are used (h:).

On a "real-world" and fork-heavy workload like kernbench, the estimated
overhead of kpkeys_hardened_pgtables is reasonable: 4% system time
overhead without batching, and about half that figure (2.2%) with
batching. The real time overhead is negligible.

Microbenchmarks show large overheads without batching, which increase
with the number of pages being manipulated. Batching drastically reduces
that overhead, almost negating it for micromm/fork. Because all PTEs in
the mapping are modified in the same lazy_mmu section, the kpkeys level
is changed just twice regardless of the mapping size; as a result the
relative overhead actually decreases as the size increases for
fix_size_alloc_test.

Note: the performance impact of set_memory_pkey() is likely to be
relatively low on arm64 because the linear mapping uses PTE-level
descriptors only. This means that set_memory_pkey() simply changes the
attributes of some PTE descriptors. However, some systems may be able to
use higher-level descriptors in the future [5], meaning that
set_memory_pkey() may have to split mappings. Allocating page tables
from a contiguous cache of pages could help minimise the overhead, as
proposed for x86 in [1].

Branches
========

To make reviewing and testing easier, this series is available here:

  https://gitlab.arm.com/linux-arm/linux-kb

The following branches are available:

- kpkeys/rfc-v5 - this series, as posted

- kpkeys/rfc-v5-base - the baseline for this series, that is 6.17-rc1

- kpkeys/rfc-v5-bench-batching - this series + patch for benchmarking on
  a regular arm64 system (see section above)

- kpkeys/rfc-v5-bench-no-batching - this series without patch 18
  (batching) + benchmarking patch

Threat model
============

The proposed scheme aims at mitigating data-only attacks (e.g.
use-after-free/cross-cache attacks). In other words, it is assumed that
control flow is not corrupted, and that the attacker does not achieve
arbitrary code execution. Nothing prevents the pkey register from being
set to its most permissive state - the assumption is that the register
is only modified on legitimate code paths.

A few related notes:

- Functions that set the pkey register are all implemented inline.
  Besides performance considerations, this is meant to avoid creating
  a function that can be used as a straightforward gadget to set the
  pkey register to an arbitrary value.

- kpkeys_set_level() only accepts a compile-time constant as argument,
  as a variable could be manipulated by an attacker. This could be
  relaxed but it seems unlikely that a variable kpkeys level would be
  needed in practice.
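
  A minimal illustration of that restriction (hedged sketch; the
  arch_kpkeys_set_level() hook is an assumed name, not necessarily the
  series' exact split):

    static __always_inline u64 kpkeys_set_level(int level)
    {
            /* Reject any level that is not a compile-time constant */
            BUILD_BUG_ON(!__builtin_constant_p(level));
            return arch_kpkeys_set_level(level);
    }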

Further use-cases
=================

It should be possible to harden various targets using kpkeys, including:

- struct cred - kpkeys-based cred hardening is now available in a
  separate series [6]

- fixmap (occasionally used even after early boot, e.g.
  set_swapper_pgd() in arch/arm64/mm/mmu.c)

- eBPF programs (preventing direct access to core kernel code/data)

- SELinux state (e.g. struct selinux_state::initialized)

... and many others.

kpkeys could also be used to strengthen the confidentiality of secret
data by making it completely inaccessible by default, and granting
read-only or read-write access as needed. This requires such data to be
rarely accessed (or via a limited interface only). One example on arm64
is the pointer authentication keys in thread_struct, whose leakage to
userspace would lead to pointer authentication being easily defeated.

Open questions
==============

A few aspects in this RFC that are debatable and/or worth discussing:

- There is currently no restriction on how kpkeys levels map to pkeys
  permissions. A typical approach is to allocate one pkey per level and
  make it writable at that level only. As the number of levels
  increases, we may however run out of pkeys, especially on arm64 (just
  8 pkeys with POE). Depending on the use-cases, it may be acceptable to
  use the same pkey for the data associated to multiple levels.

  Another potential concern is that a given piece of code may require
  write access to multiple privileged pkeys. This could be addressed by
  introducing a notion of hierarchy in trust levels, where Tn is able to
  write to memory owned by Tm if n >= m, for instance.

- kpkeys_set_level() and kpkeys_restore_pkey_reg() are not symmetric:
  the former takes a kpkeys level and returns a pkey register value, to
  be consumed by the latter. It would be more intuitive to manipulate
  kpkeys levels only. However this assumes that there is a 1:1 mapping
  between kpkeys levels and pkey register values, while in principle
  the mapping is 1:n (certain pkeys may be used outside the kpkeys
  framework).

- An architecture that supports kpkeys is expected to select
  CONFIG_ARCH_HAS_KPKEYS and always enable them if available - there is
  no CONFIG_KPKEYS to control this behaviour. Since this creates no
  significant overhead (at least on arm64), it seemed better to keep it
  simple. Each hardening feature does have its own option and arch
  opt-in if needed (CONFIG_KPKEYS_HARDENED_PGTABLES,
  CONFIG_ARCH_HAS_KPKEYS_HARDENED_PGTABLES).


Any comment or feedback will be highly appreciated, be it on the
high-level approach or implementation choices!

- Kevin

---
Changelog

RFC v4..v5:

- Rebased on v6.17-rc1.

- Cover letter: re-ran benchmarks on top of v6.17-rc1, made various
  small improvements especially to the "Performance" section.

- Patch 18: disable batching while in interrupt, since POR_EL1 is reset
  on exception entry, making the TIF_LAZY_MMU flag meaningless. This
  fixes a crash that may occur when a page table page is freed while in
  interrupt context.

- Patch 17: ensure that the target kernel address is actually
  PTE-mapped. Certain mappings (e.g. code) may be PMD-mapped instead -
  this explains why the change made in v4 was required.


RFC v4: https://lore.kernel.org/linux-mm/20250411091631.954228-1-kevin.brodsky@arm.com/

RFC v3..v4:

- Added appropriate handling of the arm64 pkey register (POR_EL1):
  context-switching between threads and resetting on exception entry
  (patch 7 and 8). See section "pkey register management" above for more
  details. A new POR_EL1_INIT macro is introduced to make the default
  value available to assembly (where POR_EL1 is reset on exception
  entry); it is updated in each patch allocating new keys.

- Added patch 18 making use of the lazy_mmu mode to batch switches to
  KPKEYS_LVL_PGTABLES - just once per lazy_mmu section rather than on
  every pgtable write. See section "Performance" for details.

- Rebased on top of [2]. No direct impact on the patches, but it ensures that
  the ctor/dtor is always called for kernel pgtables. This is an
  important fix as kernel PTEs allocated after boot were not protected
  by kpkeys_hardened_pgtables in v3 - a new test was added to patch 17
  to ensure that pgtables created by vmalloc are protected too.

- Rebased on top of [3]. The batching of kpkeys level switches in patch
  18 relies on the last patch in [3].

- Moved kpkeys guard definitions out of <linux/kpkeys.h> and to a relevant
  header for each subsystem (e.g. <asm/pgtable.h> for the
  kpkeys_hardened_pgtables guard).

- Patch 1,5: marked kpkeys_{set_level,restore_pkey_reg} as
  __always_inline to ensure that no callable gadget is created.
  [Maxwell Bland's suggestion]

- Patch 5: added helper __kpkeys_set_pkey_reg_nosync().

- Patch 10: marked kernel_pgtables_set_pkey() and related helpers as
  __init. [Linus Walleij's suggestion]

- Patch 11: added helper kpkeys_hardened_pgtables_enabled(), renamed the
  static key to kpkeys_hardened_pgtables_key.

- Patch 17: followed the KUnit conventions more closely. [Kees Cook's
  suggestion]

- Patch 17: changed the address used in the write_linear_map_pte()
  test. It seems that the PTEs that map some functions are allocated in
  ZONE_DMA and read-only (unclear why exactly). This doesn't seem to
  occur for global variables.

- Various minor fixes/improvements.

- Rebased on v6.15-rc1. This includes [7], which renames a few POE
  symbols: s/POE_RXW/POE_RWX/ and
  s/por_set_pkey_perms/por_elx_set_pkey_perms/


RFC v3: https://lore.kernel.org/linux-hardening/20250203101839.1223008-1-kevin.brodsky@arm.com/

RFC v2..v3:

- Patch 1: kpkeys_set_level() may now return KPKEYS_PKEY_REG_INVAL to indicate
  that the pkey register wasn't written to, and as a result that
  kpkeys_restore_pkey_reg() should do nothing. This simplifies the conditional
  guard macro and also allows architectures to skip writes to the pkey
  register if the target value is the same as the current one.

- Patch 1: introduced additional KPKEYS_GUARD* macros to cover more use-cases
  and reduce duplication.

- Patch 6: reject pkey value above arch_max_pkey().

- Patch 13: added missing guard(kpkeys_hardened_pgtables) in
  __clear_young_dirty_pte().

- Rebased on v6.14-rc1.

RFC v2: https://lore.kernel.org/linux-hardening/20250108103250.3188419-1-kevin.brodsky@arm.com/

RFC v1..v2:

- A new approach is used to set the pkey of page table pages. Thanks to
  Qi Zheng's and my own series [8][9], pagetable_*_ctor is
  systematically called when a PTP is allocated at any level (PTE to
  PGD), and pagetable_*_dtor when it is freed, on all architectures.
  Patch 11 makes use of this to call kpkeys_{,un}protect_pgtable_memory
  from the common ctor/dtor helper. The arm64 patches from v1 (patch 12
  and 13) are dropped as they are no longer needed. Patch 10 is
  introduced to allow pagetable_*_ctor to fail at all levels, since
  kpkeys_protect_pgtable_memory may itself fail.
  [Original suggestion by Peter Zijlstra]

- Changed the prototype of kpkeys_{,un}protect_pgtable_memory in patch 9
  to take a struct folio * for more convenience, and implemented them
  out-of-line to avoid a circular dependency with <linux/mm.h>.

- Rebased on next-20250107, which includes [8] and [9].

- Added locking in patch 8. [Peter Zijlstra's suggestion]

RFC v1: https://lore.kernel.org/linux-hardening/20241206101110.1646108-1-kevin.brodsky@arm.com/
---
References

[1] https://lore.kernel.org/all/20210830235927.6443-1-rick.p.edgecombe@intel.com/
[2] https://lore.kernel.org/linux-mm/20250408095222.860601-1-kevin.brodsky@arm.com/
[3] https://lore.kernel.org/linux-mm/20250304150444.3788920-1-ryan.roberts@arm.com/
[4] https://github.com/gormanm/mmtests/blob/master/shellpack_src/src/kernbench/kernbench-bench
[5] https://lore.kernel.org/all/20250724221216.1998696-1-yang@os.amperecomputing.com/
[6] https://lore.kernel.org/linux-mm/?q=s%3Apkeys+s%3Acred+s%3A0
[7] https://lore.kernel.org/linux-arm-kernel/20250219164029.2309119-1-kevin.brodsky@arm.com/
[8] https://lore.kernel.org/linux-mm/cover.1736317725.git.zhengqi.arch@bytedance.com/
[9] https://lore.kernel.org/linux-mm/20250103184415.2744423-1-kevin.brodsky@arm.com/
---
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maxwell Bland <mbland@motorola.com>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pierre Langlois <pierre.langlois@arm.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
---
Kevin Brodsky (18):
  mm: Introduce kpkeys
  set_memory: Introduce set_memory_pkey() stub
  arm64: mm: Enable overlays for all EL1 indirect permissions
  arm64: Introduce por_elx_set_pkey_perms() helper
  arm64: Implement asm/kpkeys.h using POE
  arm64: set_memory: Implement set_memory_pkey()
  arm64: Reset POR_EL1 on exception entry
  arm64: Context-switch POR_EL1
  arm64: Enable kpkeys
  mm: Introduce kernel_pgtables_set_pkey()
  mm: Introduce kpkeys_hardened_pgtables
  mm: Allow __pagetable_ctor() to fail
  mm: Map page tables with privileged pkey
  arm64: kpkeys: Support KPKEYS_LVL_PGTABLES
  arm64: mm: Guard page table writes with kpkeys
  arm64: Enable kpkeys_hardened_pgtables support
  mm: Add basic tests for kpkeys_hardened_pgtables
  arm64: mm: Batch kpkeys level switches

 arch/arm64/Kconfig                        |   2 +
 arch/arm64/include/asm/kpkeys.h           |  62 +++++++++
 arch/arm64/include/asm/pgtable-prot.h     |  16 +--
 arch/arm64/include/asm/pgtable.h          |  57 +++++++-
 arch/arm64/include/asm/por.h              |  11 ++
 arch/arm64/include/asm/processor.h        |   2 +
 arch/arm64/include/asm/ptrace.h           |   4 +
 arch/arm64/include/asm/set_memory.h       |   4 +
 arch/arm64/kernel/asm-offsets.c           |   3 +
 arch/arm64/kernel/cpufeature.c            |   5 +-
 arch/arm64/kernel/entry.S                 |  24 +++-
 arch/arm64/kernel/process.c               |   9 ++
 arch/arm64/kernel/smp.c                   |   2 +
 arch/arm64/mm/fault.c                     |   2 +
 arch/arm64/mm/mmu.c                       |  26 ++--
 arch/arm64/mm/pageattr.c                  |  25 ++++
 include/asm-generic/kpkeys.h              |  21 +++
 include/asm-generic/pgalloc.h             |  15 ++-
 include/linux/kpkeys.h                    | 157 ++++++++++++++++++++++
 include/linux/mm.h                        |  27 ++--
 include/linux/set_memory.h                |   7 +
 mm/Kconfig                                |   5 +
 mm/Makefile                               |   2 +
 mm/kpkeys_hardened_pgtables.c             |  44 ++++++
 mm/memory.c                               | 137 +++++++++++++++++++
 mm/tests/kpkeys_hardened_pgtables_kunit.c | 106 +++++++++++++++
 security/Kconfig.hardening                |  24 ++++
 27 files changed, 758 insertions(+), 41 deletions(-)
 create mode 100644 arch/arm64/include/asm/kpkeys.h
 create mode 100644 include/asm-generic/kpkeys.h
 create mode 100644 include/linux/kpkeys.h
 create mode 100644 mm/kpkeys_hardened_pgtables.c
 create mode 100644 mm/tests/kpkeys_hardened_pgtables_kunit.c


base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
-- 
2.47.0
Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
Posted by Kevin Brodsky 1 month, 2 weeks ago
On 15/08/2025 10:54, Kevin Brodsky wrote:
> [...]
>
> Performance
> ===========
>
> No arm64 hardware currently implements POE. To estimate the performance
> impact of kpkeys_hardened_pgtables, a mock implementation of kpkeys has
> been used, replacing accesses to the POR_EL1 register with accesses to
> another system register that is otherwise unused (CONTEXTIDR_EL1), and
> leaving everything else unchanged. Most of the kpkeys overhead is
> expected to originate from the barrier (ISB) that is required after
> writing to POR_EL1, and from setting the POIndex (pkey) in page tables;
> both of these are done exactly in the same way in the mock
> implementation.

It turns out this wasn't the case regarding the pkey setting - because
patch 6 gates set_memory_pkey() on system_supports_poe() and not
arch_kpkeys_enabled(), the mock implementation turned set_memory_pkey()
into a no-op. Many thanks to Rick Edgecombe for highlighting that the
overheads were suspiciously low for some benchmarks!

> The original implementation of kpkeys_hardened_pgtables is very
> inefficient when many PTEs are changed at once, as the kpkeys level is
> switched twice for every PTE (two ISBs per PTE). Patch 18 introduces
> an optimisation that makes use of the lazy_mmu mode to batch those
> switches: 1. switch to KPKEYS_LVL_PGTABLES on arch_enter_lazy_mmu_mode(),
> 2. skip any kpkeys switch while in that section, and 3. restore the
> kpkeys level on arch_leave_lazy_mmu_mode(). When that last function
> already issues an ISB (when updating kernel page tables), we get a
> further optimisation as we can skip the ISB when restoring the kpkeys
> level.
>
> Both implementations (without and with batching) were evaluated on an
> Amazon EC2 M7g instance (Graviton3), using a variety of benchmarks that
> involve heavy page table manipulations. The results shown below are
> relative to the baseline for this series, which is 6.17-rc1. The
> branches used for all three sets of results (baseline, with/without
> batching) are available in a repository, see next section.
>
> Caveat: these numbers should be seen as a lower bound for the overhead
> of a real POE-based protection. The hardware checks added by POE are
> however not expected to incur significant extra overhead.
>
> Reading example: for the fix_size_alloc_test benchmark, using 1 page per
> iteration (no hugepage), kpkeys_hardened_pgtables incurs 17.35% overhead
> without batching, and 14.62% overhead with batching. Both results are
> considered statistically significant (95% confidence interval),
> indicated by "(R)".
>
> +-------------------+----------------------------------+------------------+---------------+
> | Benchmark         | Result Class                     | Without batching | With batching |
> +===================+==================================+==================+===============+
> | mmtests/kernbench | real time                        |            0.30% |         0.11% |
> |                   | system time                      |        (R) 3.97% |     (R) 2.17% |
> |                   | user time                        |            0.12% |         0.02% |
> +-------------------+----------------------------------+------------------+---------------+
> | micromm/fork      | fork: h:0                        |      (R) 217.31% |        -0.97% |
> |                   | fork: h:1                        |      (R) 275.25% |     (R) 2.25% |
> +-------------------+----------------------------------+------------------+---------------+
> | micromm/munmap    | munmap: h:0                      |       (R) 15.57% |        -1.95% |
> |                   | munmap: h:1                      |      (R) 169.53% |     (R) 6.53% |
> +-------------------+----------------------------------+------------------+---------------+
> | micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R) 17.35% |    (R) 14.62% |
> |                   | fix_size_alloc_test: p:4, h:0    |       (R) 37.54% |     (R) 9.35% |
> |                   | fix_size_alloc_test: p:16, h:0   |       (R) 66.08% |     (R) 3.15% |
> |                   | fix_size_alloc_test: p:64, h:0   |       (R) 82.94% |        -0.39% |
> |                   | fix_size_alloc_test: p:256, h:0  |       (R) 87.85% |        -1.67% |
> |                   | fix_size_alloc_test: p:16, h:1   |       (R) 50.31% |         3.00% |
> |                   | fix_size_alloc_test: p:64, h:1   |       (R) 59.73% |         2.23% |
> |                   | fix_size_alloc_test: p:256, h:1  |       (R) 62.14% |         1.51% |
> |                   | random_size_alloc_test: p:1, h:0 |       (R) 77.82% |        -0.21% |
> |                   | vm_map_ram_test: p:1, h:0        |       (R) 30.66% |    (R) 27.30% |
> +-------------------+----------------------------------+------------------+---------------+

These numbers therefore correspond to set_memory_pkey() being a no-op,
in other words they represent the overhead of switching the pkey
register only.

I have amended the mock implementation so that set_memory_pkey() is run
as it would on a real POE implementation (i.e. actually setting the PTE
bits). Here are the new results, representing the overhead of both pkey
register switching and setting the pkey of page table pages (PTPs) on
alloc/free:

+-------------------+----------------------------------+------------------+---------------+
| Benchmark         | Result Class                     | Without
batching | With batching |
+===================+==================================+==================+===============+
| mmtests/kernbench | real time                        |           
0.32% |         0.35% |
|                   | system time                      |        (R)
4.18% |     (R) 3.18% |
|                   | user time                        |           
0.08% |         0.20% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/fork      | fork: h:0                        |      (R)
221.39% |     (R) 3.35% |
|                   | fork: h:1                        |      (R)
282.89% |     (R) 6.99% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/munmap    | munmap: h:0                      |       (R)
17.37% |        -0.28% |
|                   | munmap: h:1                      |      (R)
172.61% |     (R) 8.08% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R)
15.54% |    (R) 12.57% |
|                   | fix_size_alloc_test: p:4, h:0    |       (R)
39.18% |     (R) 9.13% |
|                   | fix_size_alloc_test: p:16, h:0   |       (R)
65.81% |         2.97% |
|                   | fix_size_alloc_test: p:64, h:0   |       (R)
83.39% |        -0.49% |
|                   | fix_size_alloc_test: p:256, h:0  |       (R)
87.85% |    (I) -2.04% |
|                   | fix_size_alloc_test: p:16, h:1   |       (R)
51.21% |         3.77% |
|                   | fix_size_alloc_test: p:64, h:1   |       (R)
60.02% |         0.99% |
|                   | fix_size_alloc_test: p:256, h:1  |       (R)
63.82% |         1.16% |
|                   | random_size_alloc_test: p:1, h:0 |       (R)
77.79% |        -0.51% |
|                   | vm_map_ram_test: p:1, h:0        |       (R)
30.67% |    (R) 27.09% |
+-------------------+----------------------------------+------------------+---------------+

Those results are overall very similar to the original ones.
micromm/fork is however clearly impacted - around 4% additional overhead
from set_memory_pkey(); it makes sense considering that forking requires
duplicating (and therefore allocating) a full set of page tables.
kernbench is also a fork-heavy workload and it gets a 1% hit in system
time (with batching).

It seems fair to conclude that, on arm64, setting the pkey whenever a
PTP is allocated/freed is not particularly expensive. The situation may
well be different on x86 as Rick pointed out, and it may also change on
newer arm64 systems as I noted further down. Allocating/freeing PTPs in
bulk should help if setting the pkey in the pgtable ctor/dtor proves too
expensive.

- Kevin

> Benchmarks:
> - mmtests/kernbench: running kernbench (kernel build) [4].
> - micromm/{fork,munmap}: from David Hildenbrand's benchmark suite. A
>   1 GB mapping is created and then fork/unmap is called. The mapping is
>   created using either page-sized (h:0) or hugepage folios (h:1); in all
>   cases the memory is PTE-mapped.
> - micromm/vmalloc: from test_vmalloc.ko, varying the number of pages
>   (p:) and whether huge pages are used (h:).
>
> On a "real-world" and fork-heavy workload like kernbench, the estimated
> overhead of kpkeys_hardened_pgtables is reasonable: 4% system time
> overhead without batching, and about half that figure (2.2%) with
> batching. The real time overhead is negligible.
>
> Microbenchmarks show large overheads without batching, which increase
> with the number of pages being manipulated. Batching drastically reduces
> that overhead, almost negating it for micromm/fork. Because all PTEs in
> the mapping are modified in the same lazy_mmu section, the kpkeys level
> is changed just twice regardless of the mapping size; as a result the
> relative overhead actually decreases as the size increases for
> fix_size_alloc_test.
>
> Note: the performance impact of set_memory_pkey() is likely to be
> relatively low on arm64 because the linear mapping uses PTE-level
> descriptors only. This means that set_memory_pkey() simply changes the
> attributes of some PTE descriptors. However, some systems may be able to
> use higher-level descriptors in the future [5], meaning that
> set_memory_pkey() may have to split mappings. Allocating page tables
> from a contiguous cache of pages could help minimise the overhead, as
> proposed for x86 in [1].
>
> [...]
Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
Posted by Kevin Brodsky 1 month, 2 weeks ago
On 20/08/2025 17:53, Kevin Brodsky wrote:
> On 15/08/2025 10:54, Kevin Brodsky wrote:
>> [...]
>>
>> Performance
>> ===========
>>
>> No arm64 hardware currently implements POE. To estimate the performance
>> impact of kpkeys_hardened_pgtables, a mock implementation of kpkeys has
>> been used, replacing accesses to the POR_EL1 register with accesses to
>> another system register that is otherwise unused (CONTEXTIDR_EL1), and
>> leaving everything else unchanged. Most of the kpkeys overhead is
>> expected to originate from the barrier (ISB) that is required after
>> writing to POR_EL1, and from setting the POIndex (pkey) in page tables;
>> both of these are done exactly in the same way in the mock
>> implementation.
> It turns out this wasn't the case regarding the pkey setting - because
> patch 6 gates set_memory_pkey() on system_supports_poe() and not
> arch_kpkeys_enabled(), the mock implementation turned set_memory_pkey()
> into a no-op. Many thanks to Rick Edgecombe for highlighting that the
> overheads were suspiciously low for some benchmarks!
>
>> The original implementation of kpkeys_hardened_pgtables is very
>> inefficient when many PTEs are changed at once, as the kpkeys level is
>> switched twice for every PTE (two ISBs per PTE). Patch 18 introduces
>> an optimisation that makes use of the lazy_mmu mode to batch those
>> switches: 1. switch to KPKEYS_LVL_PGTABLES on arch_enter_lazy_mmu_mode(),
>> 2. skip any kpkeys switch while in that section, and 3. restore the
>> kpkeys level on arch_leave_lazy_mmu_mode(). When that last function
>> already issues an ISB (when updating kernel page tables), we get a
>> further optimisation as we can skip the ISB when restoring the kpkeys
>> level.
>>
>> Both implementations (without and with batching) were evaluated on an
>> Amazon EC2 M7g instance (Graviton3), using a variety of benchmarks that
>> involve heavy page table manipulations. The results shown below are
>> relative to the baseline for this series, which is 6.17-rc1. The
>> branches used for all three sets of results (baseline, with/without
>> batching) are available in a repository, see next section.
>>
>> Caveat: these numbers should be seen as a lower bound for the overhead
>> of a real POE-based protection. The hardware checks added by POE are
>> however not expected to incur significant extra overhead.
>>
>> Reading example: for the fix_size_alloc_test benchmark, using 1 page per
>> iteration (no hugepage), kpkeys_hardened_pgtables incurs 17.35% overhead
>> without batching, and 14.62% overhead with batching. Both results are
>> considered statistically significant (95% confidence interval),
>> indicated by "(R)".
>>
>> +-------------------+----------------------------------+------------------+---------------+
>> | Benchmark         | Result Class                     | Without batching | With batching |
>> +===================+==================================+==================+===============+
>> | mmtests/kernbench | real time                        |            0.30% |         0.11% |
>> |                   | system time                      |        (R) 3.97% |     (R) 2.17% |
>> |                   | user time                        |            0.12% |         0.02% |
>> +-------------------+----------------------------------+------------------+---------------+
>> | micromm/fork      | fork: h:0                        |      (R) 217.31% |        -0.97% |
>> |                   | fork: h:1                        |      (R) 275.25% |     (R) 2.25% |
>> +-------------------+----------------------------------+------------------+---------------+
>> | micromm/munmap    | munmap: h:0                      |       (R) 15.57% |        -1.95% |
>> |                   | munmap: h:1                      |      (R) 169.53% |     (R) 6.53% |
>> +-------------------+----------------------------------+------------------+---------------+
>> | micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R) 17.35% |    (R) 14.62% |
>> |                   | fix_size_alloc_test: p:4, h:0    |       (R) 37.54% |     (R) 9.35% |
>> |                   | fix_size_alloc_test: p:16, h:0   |       (R) 66.08% |     (R) 3.15% |
>> |                   | fix_size_alloc_test: p:64, h:0   |       (R) 82.94% |        -0.39% |
>> |                   | fix_size_alloc_test: p:256, h:0  |       (R) 87.85% |        -1.67% |
>> |                   | fix_size_alloc_test: p:16, h:1   |       (R) 50.31% |         3.00% |
>> |                   | fix_size_alloc_test: p:64, h:1   |       (R) 59.73% |         2.23% |
>> |                   | fix_size_alloc_test: p:256, h:1  |       (R) 62.14% |         1.51% |
>> |                   | random_size_alloc_test: p:1, h:0 |       (R) 77.82% |        -0.21% |
>> |                   | vm_map_ram_test: p:1, h:0        |       (R) 30.66% |    (R) 27.30% |
>> +-------------------+----------------------------------+------------------+---------------+
> These numbers therefore correspond to set_memory_pkey() being a no-op,
> in other words they represent the overhead of switching the pkey
> register only.
>
> I have amended the mock implementation so that set_memory_pkey() is run
> as it would on a real POE implementation (i.e. actually setting the PTE
> bits). Here are the new results, representing the overhead of both pkey
> register switching and setting the pkey of page table pages (PTPs) on
> alloc/free:
>
> +-------------------+----------------------------------+------------------+---------------+
> | Benchmark         | Result Class                     | Without
> batching | With batching |
> +===================+==================================+==================+===============+
> | mmtests/kernbench | real time                        |           
> 0.32% |         0.35% |
> |                   | system time                      |        (R)
> 4.18% |     (R) 3.18% |
> |                   | user time                        |           
> 0.08% |         0.20% |
> +-------------------+----------------------------------+------------------+---------------+
> | micromm/fork      | fork: h:0                        |      (R)
> 221.39% |     (R) 3.35% |
> |                   | fork: h:1                        |      (R)
> 282.89% |     (R) 6.99% |
> +-------------------+----------------------------------+------------------+---------------+
> | micromm/munmap    | munmap: h:0                      |       (R)
> 17.37% |        -0.28% |
> |                   | munmap: h:1                      |      (R)
> 172.61% |     (R) 8.08% |
> +-------------------+----------------------------------+------------------+---------------+
> | micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R)
> 15.54% |    (R) 12.57% |
> |                   | fix_size_alloc_test: p:4, h:0    |       (R)
> 39.18% |     (R) 9.13% |
> |                   | fix_size_alloc_test: p:16, h:0   |       (R)
> 65.81% |         2.97% |
> |                   | fix_size_alloc_test: p:64, h:0   |       (R)
> 83.39% |        -0.49% |
> |                   | fix_size_alloc_test: p:256, h:0  |       (R)
> 87.85% |    (I) -2.04% |
> |                   | fix_size_alloc_test: p:16, h:1   |       (R)
> 51.21% |         3.77% |
> |                   | fix_size_alloc_test: p:64, h:1   |       (R)
> 60.02% |         0.99% |
> |                   | fix_size_alloc_test: p:256, h:1  |       (R)
> 63.82% |         1.16% |
> |                   | random_size_alloc_test: p:1, h:0 |       (R)
> 77.79% |        -0.51% |
> |                   | vm_map_ram_test: p:1, h:0        |       (R)
> 30.67% |    (R) 27.09% |
> +-------------------+----------------------------------+------------------+---------------+

Apologies, Thunderbird helpfully decided to wrap around that table...
Here's the unmangled table:

+-------------------+----------------------------------+------------------+---------------+
| Benchmark         | Result Class                     | Without batching | With batching |
+===================+==================================+==================+===============+
| mmtests/kernbench | real time                        |            0.32% |         0.35% |
|                   | system time                      |        (R) 4.18% |     (R) 3.18% |
|                   | user time                        |            0.08% |         0.20% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/fork      | fork: h:0                        |      (R) 221.39% |     (R) 3.35% |
|                   | fork: h:1                        |      (R) 282.89% |     (R) 6.99% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/munmap    | munmap: h:0                      |       (R) 17.37% |        -0.28% |
|                   | munmap: h:1                      |      (R) 172.61% |     (R) 8.08% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R) 15.54% |    (R) 12.57% |
|                   | fix_size_alloc_test: p:4, h:0    |       (R) 39.18% |     (R) 9.13% |
|                   | fix_size_alloc_test: p:16, h:0   |       (R) 65.81% |         2.97% |
|                   | fix_size_alloc_test: p:64, h:0   |       (R) 83.39% |        -0.49% |
|                   | fix_size_alloc_test: p:256, h:0  |       (R) 87.85% |    (I) -2.04% |
|                   | fix_size_alloc_test: p:16, h:1   |       (R) 51.21% |         3.77% |
|                   | fix_size_alloc_test: p:64, h:1   |       (R) 60.02% |         0.99% |
|                   | fix_size_alloc_test: p:256, h:1  |       (R) 63.82% |         1.16% |
|                   | random_size_alloc_test: p:1, h:0 |       (R) 77.79% |        -0.51% |
|                   | vm_map_ram_test: p:1, h:0        |       (R) 30.67% |    (R) 27.09% |
+-------------------+----------------------------------+------------------+---------------+

> Those results are overall very similar to the original ones.
> micromm/fork is however clearly impacted - around 4% additional overhead
> from set_memory_pkey(); it makes sense considering that forking requires
> duplicating (and therefore allocating) a full set of page tables.
> kernbench is also a fork-heavy workload and it gets a 1% hit in system
> time (with batching).
>
> It seems fair to conclude that, on arm64, setting the pkey whenever a
> PTP is allocated/freed is not particularly expensive. The situation may
> well be different on x86 as Rick pointed out, and it may also change on
> newer arm64 systems as I noted further down. Allocating/freeing PTPs in
> bulk should help if setting the pkey in the pgtable ctor/dtor proves too
> expensive.
>
> - Kevin
>
>> Benchmarks:
>> - mmtests/kernbench: running kernbench (kernel build) [4].
>> - micromm/{fork,munmap}: from David Hildenbrand's benchmark suite. A
>>   1 GB mapping is created and then fork/unmap is called. The mapping is
>>   created using either page-sized (h:0) or hugepage folios (h:1); in all
>>   cases the memory is PTE-mapped.
>> - micromm/vmalloc: from test_vmalloc.ko, varying the number of pages
>>   (p:) and whether huge pages are used (h:).
>>
>> On a "real-world" and fork-heavy workload like kernbench, the estimated
>> overhead of kpkeys_hardened_pgtables is reasonable: 4% system time
>> overhead without batching, and about half that figure (2.2%) with
>> batching. The real time overhead is negligible.
>>
>> Microbenchmarks show large overheads without batching, which increase
>> with the number of pages being manipulated. Batching drastically reduces
>> that overhead, almost negating it for micromm/fork. Because all PTEs in
>> the mapping are modified in the same lazy_mmu section, the kpkeys level
>> is changed just twice regardless of the mapping size; as a result the
>> relative overhead actually decreases as the size increases for
>> fix_size_alloc_test.
>>
>> Note: the performance impact of set_memory_pkey() is likely to be
>> relatively low on arm64 because the linear mapping uses PTE-level
>> descriptors only. This means that set_memory_pkey() simply changes the
>> attributes of some PTE descriptors. However, some systems may be able to
>> use higher-level descriptors in the future [5], meaning that
>> set_memory_pkey() may have to split mappings. Allocating page tables
>> from a contiguous cache of pages could help minimise the overhead, as
>> proposed for x86 in [1].
>>
>> [...]
Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
Posted by Edgecombe, Rick P 1 month, 2 weeks ago
On Wed, 2025-08-20 at 18:01 +0200, Kevin Brodsky wrote:
> Apologies, Thunderbird helpfully decided to wrap around that table...
> Here's the unmangled table:
> 
> +-------------------+----------------------------------+------------------+---------------+
> > Benchmark         | Result Class                     | Without batching | With batching |
> +===================+==================================+==================+===============+
> > mmtests/kernbench | real time                        |            0.32% |         0.35% |
> >                    | system time                      |        (R) 4.18% |     (R) 3.18% |
> >                    | user time                        |            0.08% |         0.20% |
> +-------------------+----------------------------------+------------------+---------------+
> > micromm/fork      | fork: h:0                        |      (R) 221.39% |     (R) 3.35% |
> >                    | fork: h:1                        |      (R) 282.89% |     (R) 6.99% |
> +-------------------+----------------------------------+------------------+---------------+
> > micromm/munmap    | munmap: h:0                      |       (R) 17.37% |        -0.28% |
> >                    | munmap: h:1                      |      (R) 172.61% |     (R) 8.08% |
> +-------------------+----------------------------------+------------------+---------------+
> > micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R) 15.54% |    (R) 12.57% |

Both this and the previous one have the 95% confidence interval. So it saw a 16%
speed up with direct map modification. Possible?

> >                    | fix_size_alloc_test: p:4, h:0    |       (R) 39.18% |     (R) 9.13% |
> >                    | fix_size_alloc_test: p:16, h:0   |       (R) 65.81% |         2.97% |
> >                    | fix_size_alloc_test: p:64, h:0   |       (R) 83.39% |        -0.49% |
> >                    | fix_size_alloc_test: p:256, h:0  |       (R) 87.85% |    (I) -2.04% |
> >                    | fix_size_alloc_test: p:16, h:1   |       (R) 51.21% |         3.77% |
> >                    | fix_size_alloc_test: p:64, h:1   |       (R) 60.02% |         0.99% |
> >                    | fix_size_alloc_test: p:256, h:1  |       (R) 63.82% |         1.16% |
> >                    | random_size_alloc_test: p:1, h:0 |       (R) 77.79% |        -0.51% |
> >                    | vm_map_ram_test: p:1, h:0        |       (R) 30.67% |    (R) 27.09% |
> +-------------------+----------------------------------+------------------+---------------+

Hmm, still surprisingly low to me, but ok. It would be good to have x86 and arm
work the same, but I don't think we have line of sight to x86 currently. And I
actually never did real benchmarks.
Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
Posted by Kevin Brodsky 1 month, 2 weeks ago
On 20/08/2025 18:18, Edgecombe, Rick P wrote:
> On Wed, 2025-08-20 at 18:01 +0200, Kevin Brodsky wrote:
>> Apologies, Thunderbird helpfully decided to wrap around that table...
>> Here's the unmangled table:
>>
>> +-------------------+----------------------------------+------------------+---------------+
>>> Benchmark         | Result Class                     | Without batching | With batching |
>> +===================+==================================+==================+===============+
>>> mmtests/kernbench | real time                        |            0.32% |         0.35% |
>>>                    | system time                      |        (R) 4.18% |     (R) 3.18% |
>>>                    | user time                        |            0.08% |         0.20% |
>> +-------------------+----------------------------------+------------------+---------------+
>>> micromm/fork      | fork: h:0                        |      (R) 221.39% |     (R) 3.35% |
>>>                    | fork: h:1                        |      (R) 282.89% |     (R) 6.99% |
>> +-------------------+----------------------------------+------------------+---------------+
>>> micromm/munmap    | munmap: h:0                      |       (R) 17.37% |        -0.28% |
>>>                    | munmap: h:1                      |      (R) 172.61% |     (R) 8.08% |
>> +-------------------+----------------------------------+------------------+---------------+
>>> micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R) 15.54% |    (R) 12.57% |
> Both this and the previous one have the 95% confidence interval. So it saw a 16%
> speed up with direct map modification. Possible?

Positive numbers mean performance degradation ("(R)" actually stands for
regression), so in that case the protection is adding a 16%/13%
overhead. Here this is mainly due to the added pkey register switching
(+ barrier) happening on every call to vmalloc() and vfree(), which has
a large relative impact since only one page is being allocated/freed.

>>>                    | fix_size_alloc_test: p:4, h:0    |       (R) 39.18% |     (R) 9.13% |
>>>                    | fix_size_alloc_test: p:16, h:0   |       (R) 65.81% |         2.97% |
>>>                    | fix_size_alloc_test: p:64, h:0   |       (R) 83.39% |        -0.49% |
>>>                    | fix_size_alloc_test: p:256, h:0  |       (R) 87.85% |    (I) -2.04% |
>>>                    | fix_size_alloc_test: p:16, h:1   |       (R) 51.21% |         3.77% |
>>>                    | fix_size_alloc_test: p:64, h:1   |       (R) 60.02% |         0.99% |
>>>                    | fix_size_alloc_test: p:256, h:1  |       (R) 63.82% |         1.16% |
>>>                    | random_size_alloc_test: p:1, h:0 |       (R) 77.79% |        -0.51% |
>>>                    | vm_map_ram_test: p:1, h:0        |       (R) 30.67% |    (R) 27.09% |
>> +-------------------+----------------------------------+------------------+---------------+
> Hmm, still surprisingly low to me, but ok. It would be good have x86 and arm
> work the same, but I don't think we have line of sight to x86 currently. And I
> actually never did real benchmarks.

It would certainly be good to get numbers on x86 as well - I'm hoping
that someone with a better understanding of x86 than myself could
implement kpkeys on x86 at some point, so that we can run the same
benchmarks there.

- Kevin
Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
Posted by Yang Shi 1 month, 1 week ago
Hi Kevin,

On 8/15/25 1:54 AM, Kevin Brodsky wrote:
> This is a proposal to leverage protection keys (pkeys) to harden
> critical kernel data, by making it mostly read-only. The series includes
> a simple framework called "kpkeys" to manipulate pkeys for in-kernel use,
> as well as a page table hardening feature based on that framework,
> "kpkeys_hardened_pgtables". Both are implemented on arm64 as a proof of
> concept, but they are designed to be compatible with any architecture
> that supports pkeys.

[...]

>
> Note: the performance impact of set_memory_pkey() is likely to be
> relatively low on arm64 because the linear mapping uses PTE-level
> descriptors only. This means that set_memory_pkey() simply changes the
> attributes of some PTE descriptors. However, some systems may be able to
> use higher-level descriptors in the future [5], meaning that
> set_memory_pkey() may have to split mappings. Allocating page tables

I suppose the page table hardening feature will be opt-in due to its
overhead? If so, I think you can just keep the kernel linear mapping
using PTEs, just like debug page alloc.

> from a contiguous cache of pages could help minimise the overhead, as
> proposed for x86 in [1].

I'm a little bit confused about how this can work. The contiguous cache
of pages would presumably be some large page, for example 2M. But the
page table pages allocated from the cache may have different permissions
if I understand correctly. The default permission is RO, but some of
them may become R/W at some point, for example when calling
set_pte_at(). You still need to split the linear mapping, right?

Regards,
Yang

Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
Posted by Kevin Brodsky 1 month, 1 week ago
On 21/08/2025 19:29, Yang Shi wrote:
> Hi Kevin,
>
> On 8/15/25 1:54 AM, Kevin Brodsky wrote:
>> This is a proposal to leverage protection keys (pkeys) to harden
>> critical kernel data, by making it mostly read-only. The series includes
>> a simple framework called "kpkeys" to manipulate pkeys for in-kernel
>> use,
>> as well as a page table hardening feature based on that framework,
>> "kpkeys_hardened_pgtables". Both are implemented on arm64 as a proof of
>> concept, but they are designed to be compatible with any architecture
>> that supports pkeys.
>
> [...]
>
>>
>> Note: the performance impact of set_memory_pkey() is likely to be
>> relatively low on arm64 because the linear mapping uses PTE-level
>> descriptors only. This means that set_memory_pkey() simply changes the
>> attributes of some PTE descriptors. However, some systems may be able to
>> use higher-level descriptors in the future [5], meaning that
>> set_memory_pkey() may have to split mappings. Allocating page tables
>
> I'm supposed the page table hardening feature will be opt-in due to
> its overhead? If so I think you can just keep kernel linear mapping
> using PTE, just like debug page alloc.

Indeed, I don't expect it to be turned on by default (in defconfig). If
the overhead proves too large when block mappings are used, it seems
reasonable to force PTE mappings when kpkeys_hardened_pgtables is enabled.

>
>> from a contiguous cache of pages could help minimise the overhead, as
>> proposed for x86 in [1].
>
> I'm a little bit confused about how this can work. The contiguous
> cache of pages should be some large page, for example, 2M. But the
> page table pages allocated from the cache may have different
> permissions if I understand correctly. The default permission is RO,
> but some of them may become R/W at sometime, for example, when calling
> set_pte_at(). You still need to split the linear mapping, right?

When such a helper is called, *all* PTPs become writeable - there is no
per-PTP permission switching.

PTPs remain mapped RW (i.e. the base permissions set at the PTE level
are RW). With this series, they are also all mapped with the same pkey
(1). By default, the pkey register is configured so that pkey 1 provides
RO access. The net result is that PTPs are RO by default, since the pkey
restricts the effective permissions.

When calling e.g. set_pte(), the pkey register is modified to enable RW
access to pkey 1, making it possible to write to any PTP. Its value is
restored when the function exits, so that PTPs are once again RO.
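
In (simplified) pseudo-C, a hardened helper then looks something like
this (helper names approximate, not necessarily what the series ends up
using):

/*
 * Conceptual sketch only; helper names (kpkeys_set_level(),
 * kpkeys_restore_pkey_reg(), KPKEYS_LVL_PGTABLES) are approximate.
 */
static inline void guarded_set_pte(pte_t *ptep, pte_t pte)
{
        /* Switch the pkey register: pkey 1 becomes writeable on this CPU. */
        u64 prev = kpkeys_set_level(KPKEYS_LVL_PGTABLES);

        WRITE_ONCE(*ptep, pte);         /* PTPs are writeable only in this window */

        /* Restore the previous value: pkey 1 is read-only again. */
        kpkeys_restore_pkey_reg(prev);
}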

- Kevin
Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
Posted by Kevin Brodsky 2 weeks, 1 day ago
On 25/08/2025 09:31, Kevin Brodsky wrote:
>>> Note: the performance impact of set_memory_pkey() is likely to be
>>> relatively low on arm64 because the linear mapping uses PTE-level
>>> descriptors only. This means that set_memory_pkey() simply changes the
>>> attributes of some PTE descriptors. However, some systems may be able to
>>> use higher-level descriptors in the future [5], meaning that
>>> set_memory_pkey() may have to split mappings. Allocating page tables
>> I'm supposed the page table hardening feature will be opt-in due to
>> its overhead? If so I think you can just keep kernel linear mapping
>> using PTE, just like debug page alloc.
> Indeed, I don't expect it to be turned on by default (in defconfig). If
> the overhead proves too large when block mappings are used, it seems
> reasonable to force PTE mappings when kpkeys_hardened_pgtables is enabled.

I had a closer look at what happens when the linear map uses block
mappings, rebasing this series on top of [1]. Unfortunately, this is
worse than I thought: it does not work at all as things stand.

The main issue is that calling set_memory_pkey() in pagetable_*_ctor()
can cause the linear map to be split, which requires new PTP(s) to be
allocated, which means more nested call(s) to set_memory_pkey(). This
explodes as a non-recursive lock is taken on that path.

More fundamentally, this cannot work unless we can explicitly allocate
PTPs from either:
1. A pool of PTE-mapped pages
2. A pool of memory that is already mapped with the right pkey (at any
level)

This is where I have to apologise to Rick for not having studied his
series more thoroughly, as patch 17 [2] covers this issue very well in
the commit message.

It seems fair to say there is no ideal or simple solution, though.
Rick's patch reserves enough (PTE-mapped) memory for fully splitting the
linear map, which is relatively simple but not very pleasant. Chatting
with Ryan Roberts, we figured another approach, improving on solution 1
mentioned in [2]. It would rely on allocating all PTPs from a special
pool (without using set_memory_pkey() in pagetable_*_ctor), along those
lines:

1. 2 pages are reserved at all times (with the appropriate pkey)
2. Try to allocate a 2M block. If needed, use a reserved page as PMD to
split a PUD. If successful, set its pkey - the entire block can now be
used for PTPs. Replenish the reserve from the block if needed.
3. If no block is available, make an order-2 allocation (4 pages). If
needed, use 1-2 reserved pages to split PUD/PMD. Set the pkey of the 4
pages, take 1-2 pages to replenish the reserve if needed.

This ensures that we never run out of PTPs for splitting. We may get
into an OOM situation more easily due to the order-2 requirement, but
the risk remains low compared to requiring a 2M block. A bigger concern
is concurrency - do we need a per-CPU cache? Reserving a 2M block per
CPU could be very much overkill.
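
To make this a bit more concrete, here is a very rough sketch of what
the refill path could look like - every identifier below is hypothetical
and nothing like this exists yet:

#include <linux/gfp.h>
#include <linux/mm.h>

/* Entirely hypothetical; locking / per-CPU policy deliberately left open. */
struct ptp_pool {
        struct list_head free_pages;    /* pages already carrying the PTP pkey */
        struct page *reserve[2];        /* step 1: kept back for splitting */
        unsigned int nr_reserve;
};

static int ptp_pool_refill(struct ptp_pool *pool)
{
        unsigned int order = PMD_SHIFT - PAGE_SHIFT;    /* step 2: PMD-sized block (2M with 4K pages) */
        struct page *pages;

        pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
        if (!pages) {
                order = 2;                              /* step 3: fall back to 4 pages */
                pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
                if (!pages)
                        return -ENOMEM;
        }

        /*
         * Splitting the linear map around the new block only ever consumes
         * pages from pool->reserve (at most one per level), never a fresh
         * allocation, so the nested-allocation problem described above
         * cannot occur.
         */
        ptp_split_linear_map(pool, pages, order);               /* hypothetical */
        set_memory_pkey((unsigned long)page_address(pages),     /* signature assumed */
                        1 << order, KPKEYS_PKEY_PGTABLES);      /* pkey name assumed */

        ptp_pool_add(pool, pages, 1 << order);                  /* hypothetical */
        ptp_pool_top_up_reserve(pool);                          /* hypothetical */

        return 0;
}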

No matter which solution is used, this clearly increases the complexity
of kpkeys_hardened_pgtables. Mike Rapoport has posted a number of RFCs
[3][4] that aim at addressing this problem more generally, but no
consensus seems to have emerged and I'm not sure they would completely
solve this specific problem either.

For now, my plan is to stick to solution 3 from [2], i.e. force the
linear map to be PTE-mapped. This is easily done on arm64 as it is the
default, and is required for rodata=full, unless [1] is applied and the
system supports BBML2_NOABORT. See [1] for the potential performance
improvements we'd be missing out on (~5% ballpark). I'm not quite sure
what the picture looks like on x86 - it may well be more significant as
Rick suggested.

- Kevin

[1]
https://lore.kernel.org/all/20250829115250.2395585-1-ryan.roberts@arm.com/
[2]
https://lore.kernel.org/all/20210830235927.6443-18-rick.p.edgecombe@intel.com/
[3] https://lore.kernel.org/lkml/20210823132513.15836-1-rppt@kernel.org/
[4] https://lore.kernel.org/all/20230308094106.227365-1-rppt@kernel.org/
Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
Posted by Edgecombe, Rick P 2 weeks, 1 day ago
On Thu, 2025-09-18 at 16:15 +0200, Kevin Brodsky wrote:
> This is where I have to apologise to Rick for not having studied his
> series more thoroughly, as patch 17 [2] covers this issue very well in
> the commit message.
> 
> It seems fair to say there is no ideal or simple solution, though.
> Rick's patch reserves enough (PTE-mapped) memory for fully splitting the
> linear map, which is relatively simple but not very pleasant. Chatting
> with Ryan Roberts, we figured another approach, improving on solution 1
> mentioned in [2]. It would rely on allocating all PTPs from a special
> pool (without using set_memory_pkey() in pagetable_*_ctor), along those
> lines:

Oh, I didn't realize ARM splits the direct map at runtime now. IIRC it used to
just map at 4k if there were any permissions configured.

> 
> 1. 2 pages are reserved at all times (with the appropriate pkey)
> 2. Try to allocate a 2M block. If needed, use a reserved page as PMD to
> split a PUD. If successful, set its pkey - the entire block can now be
> used for PTPs. Replenish the reserve from the block if needed.
> 3. If no block is available, make an order-2 allocation (4 pages). If
> needed, use 1-2 reserved pages to split PUD/PMD. Set the pkey of the 4
> pages, take 1-2 pages to replenish the reserve if needed.

Oh, good idea!

> 
> This ensures that we never run out of PTPs for splitting. We may get
> into an OOM situation more easily due to the order-2 requirement, but
> the risk remains low compared to requiring a 2M block. A bigger concern
> is concurrency - do we need a per-CPU cache? Reserving a 2M block per
> CPU could be very much overkill.
> 
> No matter which solution is used, this clearly increases the complexity
> of kpkeys_hardened_pgtables. Mike Rapoport has posted a number of RFCs
> [3][4] that aim at addressing this problem more generally, but no
> consensus seems to have emerged and I'm not sure they would completely
> solve this specific problem either.
> 
> For now, my plan is to stick to solution 3 from [2], i.e. force the
> linear map to be PTE-mapped. This is easily done on arm64 as it is the
> default, and is required for rodata=full, unless [1] is applied and the
> system supports BBML2_NOABORT. See [1] for the potential performance
> improvements we'd be missing out on (~5% ballpark).
> 

I continue to be surprised that allocation time pkey conversion is not a
performance disaster, even with the directmap pre-split.

> I'm not quite sure
> what the picture looks like on x86 - it may well be more significant as
> Rick suggested.

I think having more efficient direct map permissions is a solvable problem, but
each usage is just a little too small to justify the infrastructure for a good
solution. And each simple solution is a little too much overhead to justify the
usage. So there is a long tail of blocked usages:
 - pkeys usages (page tables and secret protection)
 - kernel shadow stacks
 - More efficient executable code allocations (BPF, kprobe trampolines, etc)

The BPF folks have started doing their own thing for this, but I don't think
there are any fundamentally unsolvable problems for a generic solution. It's a
question of a leading killer usage to justify the infrastructure. Maybe it will
be kernel shadow stack.
Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
Posted by Kevin Brodsky 3 days ago
On 18/09/2025 19:31, Edgecombe, Rick P wrote:
> On Thu, 2025-09-18 at 16:15 +0200, Kevin Brodsky wrote:
>> This is where I have to apologise to Rick for not having studied his
>> series more thoroughly, as patch 17 [2] covers this issue very well in
>> the commit message.
>>
>> It seems fair to say there is no ideal or simple solution, though.
>> Rick's patch reserves enough (PTE-mapped) memory for fully splitting the
>> linear map, which is relatively simple but not very pleasant. Chatting
>> with Ryan Roberts, we figured another approach, improving on solution 1
>> mentioned in [2]. It would rely on allocating all PTPs from a special
>> pool (without using set_memory_pkey() in pagetable_*_ctor), along those
>> lines:
> Oh I didn't realize ARM split the direct map now at runtime. IIRC it used to
> just map at 4k if there were any permissions configured.

Until recently the linear map was always PTE-mapped on arm64 if
rodata=full (default) or in other situations (e.g. DEBUG_PAGEALLOC), so
that it never needed to be split at runtime. Since [1b] landed though,
there is support for setting permissions at the block level and
splitting, meaning that the linear map can be block-mapped in most cases
(see force_pte_mapping() in patch 3 for details). This is only enabled
on systems with the BBML2_NOABORT feature, however.

[1b]
https://lore.kernel.org/all/20250917190323.3828347-1-yang@os.amperecomputing.com/

>> 1. 2 pages are reserved at all times (with the appropriate pkey)
>> 2. Try to allocate a 2M block. If needed, use a reserved page as PMD to
>> split a PUD. If successful, set its pkey - the entire block can now be
>> used for PTPs. Replenish the reserve from the block if needed.
>> 3. If no block is available, make an order-2 allocation (4 pages). If
>> needed, use 1-2 reserved pages to split PUD/PMD. Set the pkey of the 4
>> pages, take 1-2 pages to replenish the reserve if needed.
> Oh, good idea!
>
>> This ensures that we never run out of PTPs for splitting. We may get
>> into an OOM situation more easily due to the order-2 requirement, but
>> the risk remains low compared to requiring a 2M block. A bigger concern
>> is concurrency - do we need a per-CPU cache? Reserving a 2M block per
>> CPU could be very much overkill.
>>
>> No matter which solution is used, this clearly increases the complexity
>> of kpkeys_hardened_pgtables. Mike Rapoport has posted a number of RFCs
>> [3][4] that aim at addressing this problem more generally, but no
>> consensus seems to have emerged and I'm not sure they would completely
>> solve this specific problem either.
>>
>> For now, my plan is to stick to solution 3 from [2], i.e. force the
>> linear map to be PTE-mapped. This is easily done on arm64 as it is the
>> default, and is required for rodata=full, unless [1] is applied and the
>> system supports BBML2_NOABORT. See [1] for the potential performance
>> improvements we'd be missing out on (~5% ballpark).
>>
> I continue to be surprised that allocation time pkey conversion is not a
> performance disaster, even with the directmap pre-split.
>
>> I'm not quite sure
>> what the picture looks like on x86 - it may well be more significant as
>> Rick suggested.
> I think having more efficient direct map permissions is a solvable problem, but
> each usage is just a little too small to justify the infrastructure for a good
> solution. And each simple solution is a little too much overhead to justify the
> usage. So there is a long tail of blocked usages:
>  - pkeys usages (page tables and secret protection)
>  - kernel shadow stacks
>  - More efficient executable code allocations (BPF, kprobe trampolines, etc)
>
> Although the BPF folks started doing their own thing for this. But I don't think
> there are any fundamentally unsolvable problems for a generic solution. It's a
> question of a leading killer usage to justify the infrastructure. Maybe it will
> be kernel shadow stack.
It seems to be exactly the situation, yes. Given Will's feedback, I'll
take another shot at implementing such a dedicated allocator (based on
the scheme I suggested above) and see how it goes. Hopefully that will
create more momentum for a generic infrastructure :)

- Kevin
Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
Posted by Will Deacon 2 weeks, 1 day ago
On Thu, Sep 18, 2025 at 04:15:52PM +0200, Kevin Brodsky wrote:
> On 25/08/2025 09:31, Kevin Brodsky wrote:
> >>> Note: the performance impact of set_memory_pkey() is likely to be
> >>> relatively low on arm64 because the linear mapping uses PTE-level
> >>> descriptors only. This means that set_memory_pkey() simply changes the
> >>> attributes of some PTE descriptors. However, some systems may be able to
> >>> use higher-level descriptors in the future [5], meaning that
> >>> set_memory_pkey() may have to split mappings. Allocating page tables
> >> I'm supposed the page table hardening feature will be opt-in due to
> >> its overhead? If so I think you can just keep kernel linear mapping
> >> using PTE, just like debug page alloc.
> > Indeed, I don't expect it to be turned on by default (in defconfig). If
> > the overhead proves too large when block mappings are used, it seems
> > reasonable to force PTE mappings when kpkeys_hardened_pgtables is enabled.
> 
> I had a closer look at what happens when the linear map uses block
> mappings, rebasing this series on top of [1]. Unfortunately, this is
> worse than I thought: it does not work at all as things stand.
> 
> The main issue is that calling set_memory_pkey() in pagetable_*_ctor()
> can cause the linear map to be split, which requires new PTP(s) to be
> allocated, which means more nested call(s) to set_memory_pkey(). This
> explodes as a non-recursive lock is taken on that path.
> 
> More fundamentally, this cannot work unless we can explicitly allocate
> PTPs from either:
> 1. A pool of PTE-mapped pages
> 2. A pool of memory that is already mapped with the right pkey (at any
> level)
> 
> This is where I have to apologise to Rick for not having studied his
> series more thoroughly, as patch 17 [2] covers this issue very well in
> the commit message.
> 
> It seems fair to say there is no ideal or simple solution, though.
> Rick's patch reserves enough (PTE-mapped) memory for fully splitting the
> linear map, which is relatively simple but not very pleasant. Chatting
> with Ryan Roberts, we figured another approach, improving on solution 1
> mentioned in [2]. It would rely on allocating all PTPs from a special
> pool (without using set_memory_pkey() in pagetable_*_ctor), along those
> lines:
> 
> 1. 2 pages are reserved at all times (with the appropriate pkey)
> 2. Try to allocate a 2M block. If needed, use a reserved page as PMD to
> split a PUD. If successful, set its pkey - the entire block can now be
> used for PTPs. Replenish the reserve from the block if needed.
> 3. If no block is available, make an order-2 allocation (4 pages). If
> needed, use 1-2 reserved pages to split PUD/PMD. Set the pkey of the 4
> pages, take 1-2 pages to replenish the reserve if needed.
> 
> This ensures that we never run out of PTPs for splitting. We may get
> into an OOM situation more easily due to the order-2 requirement, but
> the risk remains low compared to requiring a 2M block. A bigger concern
> is concurrency - do we need a per-CPU cache? Reserving a 2M block per
> CPU could be very much overkill.
> 
> No matter which solution is used, this clearly increases the complexity
> of kpkeys_hardened_pgtables. Mike Rapoport has posted a number of RFCs
> [3][4] that aim at addressing this problem more generally, but no
> consensus seems to have emerged and I'm not sure they would completely
> solve this specific problem either.
> 
> For now, my plan is to stick to solution 3 from [2], i.e. force the
> linear map to be PTE-mapped. This is easily done on arm64 as it is the
> default, and is required for rodata=full, unless [1] is applied and the
> system supports BBML2_NOABORT. See [1] for the potential performance
> improvements we'd be missing out on (~5% ballpark). I'm not quite sure
> what the picture looks like on x86 - it may well be more significant as
> Rick suggested.

Just as a data point: forcing the linear map down to 4k would likely
prevent us from being able to enable this on Android. We've measured a
considerable (double digit %) increase in CPU power consumption for some
real-life camera workloads when mapping the linear map at 4k granularity
thanks to the additional memory traffic from the PTW.

At some point, KFENCE required 4k granularity for the linear map, but we
fixed it. rodata=full requires 4k granularity, but there are patches to
fix that too. So I think we should avoid making the same mistake from
the start for this series.

Will
Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
Posted by Kevin Brodsky 3 days ago
On 18/09/2025 16:57, Will Deacon wrote:
> On Thu, Sep 18, 2025 at 04:15:52PM +0200, Kevin Brodsky wrote:
>> [...]
>>
>> For now, my plan is to stick to solution 3 from [2], i.e. force the
>> linear map to be PTE-mapped. This is easily done on arm64 as it is the
>> default, and is required for rodata=full, unless [1] is applied and the
>> system supports BBML2_NOABORT. See [1] for the potential performance
>> improvements we'd be missing out on (~5% ballpark). I'm not quite sure
>> what the picture looks like on x86 - it may well be more significant as
>> Rick suggested.
> Just as a data point, but forcing the linear map down to 4k would likely
> prevent us from being able to enable this on Android. We've measured a
> considerable (double digit %) increase in CPU power consumption for some
> real-life camera workloads when mapping the linear map at 4k granularity
> thanks to the additional memory traffic from the PTW.

Good to know!

> At some point, KFENCE required 4k granularity for the linear map, but we
> fixed it. rodata=full requires 4k granularity, but there are patches to
> fix that too. So I think we should avoid making the same mistake from
> the start for this series.

Understood, makes sense. I'll be looking into implementing the custom
PTP allocator I suggested above then.

- Kevin
Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
Posted by Yang Shi 1 month, 1 week ago

On 8/25/25 12:31 AM, Kevin Brodsky wrote:
> On 21/08/2025 19:29, Yang Shi wrote:
>> Hi Kevin,
>>
>> On 8/15/25 1:54 AM, Kevin Brodsky wrote:
>>> This is a proposal to leverage protection keys (pkeys) to harden
>>> critical kernel data, by making it mostly read-only. The series includes
>>> a simple framework called "kpkeys" to manipulate pkeys for in-kernel
>>> use,
>>> as well as a page table hardening feature based on that framework,
>>> "kpkeys_hardened_pgtables". Both are implemented on arm64 as a proof of
>>> concept, but they are designed to be compatible with any architecture
>>> that supports pkeys.
>> [...]
>>
>>> Note: the performance impact of set_memory_pkey() is likely to be
>>> relatively low on arm64 because the linear mapping uses PTE-level
>>> descriptors only. This means that set_memory_pkey() simply changes the
>>> attributes of some PTE descriptors. However, some systems may be able to
>>> use higher-level descriptors in the future [5], meaning that
>>> set_memory_pkey() may have to split mappings. Allocating page tables
>> I'm supposed the page table hardening feature will be opt-in due to
>> its overhead? If so I think you can just keep kernel linear mapping
>> using PTE, just like debug page alloc.
> Indeed, I don't expect it to be turned on by default (in defconfig). If
> the overhead proves too large when block mappings are used, it seems
> reasonable to force PTE mappings when kpkeys_hardened_pgtables is enabled.
>
>>> from a contiguous cache of pages could help minimise the overhead, as
>>> proposed for x86 in [1].
>> I'm a little bit confused about how this can work. The contiguous
>> cache of pages should be some large page, for example, 2M. But the
>> page table pages allocated from the cache may have different
>> permissions if I understand correctly. The default permission is RO,
>> but some of them may become R/W at sometime, for example, when calling
>> set_pte_at(). You still need to split the linear mapping, right?
> When such a helper is called, *all* PTPs become writeable - there is no
> per-PTP permission switching.

OK, so all PTPs in the same contiguous cache will become writeable even 
though the helper (i.e. set_pte_at()) is just called on one of the 
PTPs.  But doesn't it compromise the page table hardening somehow? The 
PTPs from the same cache may belong to different processes.

Thanks,
Yang

>
> PTPs remain mapped RW (i.e. the base permissions set at the PTE level
> are RW). With this series, they are also all mapped with the same pkey
> (1). By default, the pkey register is configured so that pkey 1 provides
> RO access. The net result is that PTPs are RO by default, since the pkey
> restricts the effective permissions.
>
> When calling e.g. set_pte(), the pkey register is modified to enable RW
> access to pkey 1, making it possible to write to any PTP. Its value is
> restored when the function exit so that PTPs are once again RO.
>
> - Kevin

Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
Posted by Kevin Brodsky 1 month, 1 week ago
On 26/08/2025 21:18, Yang Shi wrote:
>>
>>>> from a contiguous cache of pages could help minimise the overhead, as
>>>> proposed for x86 in [1].
>>> I'm a little bit confused about how this can work. The contiguous
>>> cache of pages should be some large page, for example, 2M. But the
>>> page table pages allocated from the cache may have different
>>> permissions if I understand correctly. The default permission is RO,
>>> but some of them may become R/W at sometime, for example, when calling
>>> set_pte_at(). You still need to split the linear mapping, right?
>> When such a helper is called, *all* PTPs become writeable - there is no
>> per-PTP permission switching.
>
> OK, so all PTPs in the same contiguous cache will become writeable
> even though the helper (i.e. set_pte_at()) is just called on one of
> the PTPs.  But doesn't it compromise the page table hardening somehow?
> The PTPs from the same cache may belong to different processes. 

First just a note that this is true regardless of how the PTPs are
allocated (i.e. this is already the case in this version of the series).

Either way, yes you are right, this approach does not introduce any
isolation *between* page tables - pgtable helpers are able to write to
all page tables. In principle it should be possible to use a different
pkey for kernel and user page tables, but that would make the kpkeys
level switching in helpers quite a bit more complicated. Isolating
further is impractical as we have so few pkeys (just 8 on arm64).

That said, what kpkeys really tries to protect against is the direct
corruption of critical data by arbitrary (unprivileged) code. If the
attacker is able to manipulate calls to set_pte() and the likes, kpkeys
cannot provide much protection - even if we restricted the writes to a
specific set of page tables, the attacker would still be able to insert
a translation to any arbitrary physical page.

- Kevin
Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
Posted by Yang Shi 1 month ago

On 8/27/25 9:09 AM, Kevin Brodsky wrote:
> On 26/08/2025 21:18, Yang Shi wrote:
>>>>> from a contiguous cache of pages could help minimise the overhead, as
>>>>> proposed for x86 in [1].
>>>> I'm a little bit confused about how this can work. The contiguous
>>>> cache of pages should be some large page, for example, 2M. But the
>>>> page table pages allocated from the cache may have different
>>>> permissions if I understand correctly. The default permission is RO,
>>>> but some of them may become R/W at sometime, for example, when calling
>>>> set_pte_at(). You still need to split the linear mapping, right?
>>> When such a helper is called, *all* PTPs become writeable - there is no
>>> per-PTP permission switching.
>> OK, so all PTPs in the same contiguous cache will become writeable
>> even though the helper (i.e. set_pte_at()) is just called on one of
>> the PTPs.  But doesn't it compromise the page table hardening somehow?
>> The PTPs from the same cache may belong to different processes.
> First just a note that this is true regardless of how the PTPs are
> allocated (i.e. this is already the case in this version of the series).
>
> Either way, yes you are right, this approach does not introduce any
> isolation *between* page tables - pgtable helpers are able to write to
> all page tables. In principle it should be possible to use a different
> pkey for kernel and user page tables, but that would make the kpkeys
> level switching in helpers quite a bit more complicated. Isolating
> further is impractical as we have so few pkeys (just 8 on arm64).
>
> That said, what kpkeys really tries to protect against is the direct
> corruption of critical data by arbitrary (unprivileged) code. If the
> attacker is able to manipulate calls to set_pte() and the likes, kpkeys
> cannot provide much protection - even if we restricted the writes to a
> specific set of page tables, the attacker would still be able to insert
> a translation to any arbitrary physical page.

I see. Thanks for elaborating this.

Yang

>
> - Kevin