.:: What? Why?
This series adds support for efficiently allocating pages that are not
present in the direct map. This is instrumental to two different
immediate goals:
1. This supports the effort to remove guest_memfd memory from the direct
map [0]. One of the challenges faced in that effort has been
efficiently eliminating TLB entries; this series offers a solution to
that problem.
2. Address Space Isolation (ASI) [1] also needs an efficient way to
allocate pages that are missing from the direct map. Although for ASI
the needs are slightly different (in that case, the pages need only
be removed from ASI's special pagetables), the most interesting mm
challenges are basically the same.
So, __GFP_UNMAPPED serves as a Trojan horse to get the page allocator
into a state where adding ASI's features "Should Be Easy".
This series _also_ serves as a Trojan horse for the "mermap" (details
below) which is also a key building block for making ASI efficient.
Longer term, there are a wide range of security techniques unlocked by
being able to efficiently remove pages from the kernel's address space.
There may also be non-security usecases for this feature, for example
at LPC Sumit Garg presented an issue with memory-firewalled client
devices that could be remediated by __GFP_UNMAPPED [2].
.:: Design
The key design elements introduced here are just repurposed from
previous attempts to directly introduce ASI's needs to the page
allocator [3]. The only real difference is that now these support
totally unmapping stuff from the direct map, instead of only unmapping
it from ASI's special pagetables.
.:::: Design: Introducing "freetypes"
The biggest challenge for efficiently getting stuff out of the direct
map is TLB flushing. Pushing this problem into the page allocator turns
out to enable amortising that flush cost into almost nothing. The core
idea is to have pools of already-unmapped pages. We'd like those pages
to be physically contiguous so they don't unduly fragment the pagetables
around them, and we'd like to be able to efficiently look up these
already-unmapped pages during allocation. The page allocator already has
deeply-ingrained functionality for physically grouping pages by a
certain attribute, and then indexing free pages by that attribute, this
mechanism is: migratetypes.
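To make the amortisation argument concrete, here is a toy userspace model. It is not code from the series: the pool structure, the function names and the 512-pages-per-pageblock figure are all assumptions chosen for illustration. It only counts how many flushes are paid for when pages are unmapped a whole pageblock at a time and then handed out one by one.

```c
#include <stddef.h>

/* Hypothetical toy model: none of these names come from the series. */
#define PAGES_PER_BLOCK 512 /* assumed: 2 MiB pageblock of 4 KiB pages */

struct unmapped_pool {
	size_t free_pages;  /* already-unmapped pages ready to hand out */
	size_t tlb_flushes; /* flushes paid for so far */
};

/* Unmap one whole pageblock and pay for a single flush covering all of it. */
static void pool_refill(struct unmapped_pool *pool)
{
	pool->free_pages += PAGES_PER_BLOCK;
	pool->tlb_flushes++;
}

/* Hand out one unmapped page; only an empty pool triggers a flush. */
static void pool_alloc_page(struct unmapped_pool *pool)
{
	if (pool->free_pages == 0)
		pool_refill(pool);
	pool->free_pages--;
}

/* Number of flushes needed to serve n single-page allocations. */
static size_t flushes_for(size_t n)
{
	struct unmapped_pool pool = { 0, 0 };

	for (size_t i = 0; i < n; i++)
		pool_alloc_page(&pool);
	return pool.tlb_flushes;
}
```

In this model the flush count grows with the number of pageblocks converted rather than with the number of pages allocated. The model ignores frees and fragmentation, which the real allocator has to deal with.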
So basically, this series extends the concepts of migratetypes in the
allocator so that as well as just representing mobility, they can
represent other properties of the page too. (Actually, migratetypes are
already sort of overloaded, but the main extension is to be able to
represent _orthogonal_ properties). In order to avoid further
overloading the concept of a migratetype, this extension is done by
adding a new concept on top of migratetype: the _freetype_. A freetype
is basically just a migratetype plus some flags, and it replaces
migratetypes wherever the latter is currently used to index free
pages.
The first freetype flag is then added, which marks the pages it indexes
as being absent from the direct map. This is then used to implement the
new __GFP_UNMAPPED flag, which allocates pages from pageblocks that have
the new flag, or unmaps pages if no existing ones are already available.
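As a rough illustration of "a migratetype plus some flags", the sketch below wraps both in a struct. This is a hypothetical userspace sketch, not the definition from the series: every identifier with "sketch" in its name is made up here, and the real freetype_t may be laid out quite differently. The point of the struct wrapper is that a plain integer migratetype can no longer be passed where a freetype is expected without the compiler complaining.

```c
#include <stdbool.h>

/* Hypothetical sketch; the series' real definitions may differ. */
enum migratetype_sketch {
	MT_SKETCH_UNMOVABLE,
	MT_SKETCH_MOVABLE,
	MT_SKETCH_RECLAIMABLE,
	MT_SKETCH_COUNT,
};

/* Assumed flag: the indexed pages are absent from the direct map. */
#define FT_SKETCH_UNMAPPED (1u << 0)

/* A struct, not an int, so implicit conversions are compile errors. */
typedef struct {
	unsigned int migratetype;
	unsigned int flags;
} freetype_sketch_t;

static inline freetype_sketch_t freetype_sketch(enum migratetype_sketch mt,
						unsigned int flags)
{
	freetype_sketch_t ft = { .migratetype = mt, .flags = flags };

	return ft;
}

static inline bool freetype_sketch_unmapped(freetype_sketch_t ft)
{
	return (ft.flags & FT_SKETCH_UNMAPPED) != 0;
}

/* One free list per (flags, migratetype) pair, flattened into one index. */
static inline unsigned int freetype_sketch_index(freetype_sketch_t ft)
{
	return ft.flags * MT_SKETCH_COUNT + ft.migratetype;
}
```

With a single flag this doubles the number of free lists. Orthogonal flags multiply the list count, which is presumably why it matters which combinations actually get lists.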
.:::: Design: Introducing the "mermap"
Sharp readers might by now be asking how __GFP_UNMAPPED interacts with
__GFP_ZERO. If pages aren't in the direct map, how can the page
allocator zero them? The solution is the "mermap", short for "epheMERal
mapping". The mermap provides an efficient way to temporarily map pages
into the local address space, and the allocator uses these mappings to
zero pages.
Using the mermap securely requires some knowledge about the usage of the
pages. One slightly awkward part of this design is that the page
allocator's usage of the mermap then "leaks" out so that callers who
allocate with __GFP_UNMAPPED|__GFP_ZERO need to be aware of the mermap's
security implications. For the guest_memfd unmapping usecase, that means
when guest_memfd.c makes these special allocations, it is only safe
because the pages will belong to the current process. In other words,
the use of the mermap potentially allows that process to leak the pages
via CPU sidechannels (unless more holistic/expensive mitigations are
enabled).
Since this cover letter is already too long I won't describe most
details of the mermap here, please see the patch that introduces it.
But one key detail is that it requires a kernel-space but mm-local
virtual address region. So... this series adds that too (for x86). This
is called the mm-local region and is implemented by "just" extending and
generalising the LDT remap area.
.:: Outline of the patchset
- Patches 1 -> 2 introduce the mm-local region for x86
- Patches 3 -> 5 introduce the mermap
- Patches 6 -> 14 introduce freetypes
- Patch 8 in particular is the big annoying switch-over which changes
a whole bunch of code from "migratetype" to "freetype". In order to
try and have the compiler help out with catching bugs, this is done
with an annoying typedef. I'm sorry that this patch is so annoying,
but I think if we do want to extend the allocator along these lines
then a typedef + big annoying patch is probably the safest way.
- Patches 15 -> 20 introduce __GFP_UNMAPPED
.:: Why [RFC]?
I really wanted to stop sending RFC and start sending PATCHes but
getting this series out has taken months longer than I expected, so it's
time to get something on the list. The known issues here are:
1. __GFP_UNMAPPED isn't useful yet until guest_memfd unmapping support
[0] gets merged.
2. Apparently while implementing the mm-local region, I totally forgot
that KPTI existed on 32-bit systems. I expect the 0-day bot to fire a
failure on that patch.
There is also one really nasty hack in mermap.c, namely
set_unmapped_pte(). This is basically a symptom of the problem I
propose to discuss at LSF/MM/BPF [3], i.e. the fact that there are
lots of pagetable libraries yet none of them are flexible enough to do
anything new (in this case the "new thing" is pre-allocating pagetables
then subsequently populating them in a separate context). Whether this
particular hack should block merging the mermap is not clear to me, I'd
be interested to hear opinions.
.:: Performance
In [4] is a branch containing:
1. This series.
2. All the key kernel patches from the Firecracker team's "secret-free"
effort, which includes guest_memfd unmapping ([0]).
3. Some prototype patches to switch guest_memfd over from an ad-hoc
unmapping logic to use of __GFP_UNMAPPED (plus direct use of the
mermap to implement write()).
I benchmarked this using Firecracker's own performance tests [4], which
measure the time required to populate the VM guest's memory. This
population happens via write() so it exercises the mermap. I ran this on
a Sapphire Rapids machine [5]. The baseline here is just the secret-free
patches on their own. "gfp_unmapped" is the branch described above.
"skip-flush" provides a reference against an implementation that just
skips flushing the TLB when unmapping guest_memfd pages, which serves as
an upper-bound on performance.
metric: populate_latency (ms) | test: firecracker-perf-tests-wrapped
+---------------+---------+----------+----------+------------------------+----------+--------+
| nixos_variant | samples | mean | min | histogram | max | Δμ |
+---------------+---------+----------+----------+------------------------+----------+--------+
| | 30 | 1.04s | 1.02s | █ | 1.10s | |
| gfp_unmapped | 30 | 313.02ms | 299.48ms | █ | 343.25ms | -70.0% |
| skip-flush | 30 | 325.80ms | 307.91ms | █ | 333.30ms | -68.8% |
+---------------+---------+----------+----------+------------------------+----------+--------+
Conclusion: it's close to the best case performance for this particular
workload. (Note in the sample above the mean is actually faster - that's
noise, this isn't a consistent observation).
[0] [PATCH v10 00/15] Direct Map Removal Support for guest_memfd
https://lore.kernel.org/all/20260126164445.11867-1-kalyazin@amazon.com/
[1] https://linuxasi.dev/
[2] https://lpc.events/event/19/contributions/2095/
[3] https://lore.kernel.org/all/20260219175113.618562-1-jackmanb@google.com/
[4] https://github.com/bjackman/kernel-benchmarks-nix/blob/fd56c93344760927b71161368230a15741a5869f/packages/benchmarks/firecracker-perf-tests/firecracker-perf-tests.sh
[5] https://github.com/bjackman/aethelred/blob/eb0dd0e99ee08fa0534733113e93b89499affe91
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: x86@kernel.org
Cc: rppt@kernel.org
Cc: Sumit Garg <sumit.garg@oss.qualcomm.com>
To: Borislav Petkov <bp@alien8.de>
To: Dave Hansen <dave.hansen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>
To: Andrew Morton <akpm@linux-foundation.org>
To: David Hildenbrand <david@kernel.org>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Vlastimil Babka <vbabka@kernel.org>
To: Mike Rapoport <rppt@kernel.org>
To: Wei Xu <weixugc@google.com>
To: Johannes Weiner <hannes@cmpxchg.org>
To: Zi Yan <ziy@nvidia.com>
Cc: yosryahmed@google.com
Cc: derkling@google.com
Cc: reijiw@google.com
Cc: Will Deacon <will@kernel.org>
Cc: rientjes@google.com
Cc: "Kalyazin, Nikita" <kalyazin@amazon.co.uk>
Cc: patrick.roy@linux.dev
Cc: "Itazuri, Takahiro" <itazur@amazon.co.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: David Kaplan <david.kaplan@amd.com>
Cc: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
Brendan Jackman (19):
x86/mm: split out preallocate_sub_pgd()
x86/mm: Generalize LDT remap into "mm-local region"
x86/tlb: Expose some flush function declarations to modules
x86/mm: introduce the mermap
mm: KUnit tests for the mermap
mm: introduce for_each_free_list()
mm/page_alloc: don't overload migratetype in find_suitable_fallback()
mm: introduce freetype_t
mm: move migratetype definitions to freetype.h
mm: add definitions for allocating unmapped pages
mm: rejig pageblock mask definitions
mm: encode freetype flags in pageblock flags
mm/page_alloc: remove ifdefs from pindex helpers
mm/page_alloc: separate pcplists by freetype flags
mm/page_alloc: rename ALLOC_NON_BLOCK back to _HARDER
mm/page_alloc: introduce ALLOC_NOBLOCK
mm/page_alloc: implement __GFP_UNMAPPED allocations
mm/page_alloc: implement __GFP_UNMAPPED|__GFP_ZERO allocations
mm: Minimal KUnit tests for some new page_alloc logic
Documentation/arch/x86/x86_64/mm.rst | 4 +-
arch/x86/Kconfig | 3 +
arch/x86/include/asm/mermap.h | 23 +
arch/x86/include/asm/mmu_context.h | 71 ++-
arch/x86/include/asm/pgalloc.h | 33 ++
arch/x86/include/asm/pgtable_64_types.h | 19 +-
arch/x86/include/asm/pgtable_types.h | 2 +
arch/x86/include/asm/tlbflush.h | 43 +-
arch/x86/kernel/ldt.c | 137 ++----
arch/x86/mm/init_64.c | 44 +-
arch/x86/mm/pgtable.c | 3 +
include/linux/freetype.h | 147 ++++++
include/linux/gfp.h | 25 +-
include/linux/gfp_types.h | 26 ++
include/linux/mermap.h | 63 +++
include/linux/mermap_types.h | 43 ++
include/linux/mm.h | 13 +
include/linux/mm_types.h | 6 +
include/linux/mmzone.h | 84 ++--
include/linux/pageblock-flags.h | 16 +-
include/trace/events/mmflags.h | 9 +-
kernel/fork.c | 6 +
kernel/panic.c | 2 +
kernel/power/snapshot.c | 8 +-
mm/Kconfig | 41 ++
mm/Makefile | 3 +
mm/compaction.c | 36 +-
mm/init-mm.c | 3 +
mm/internal.h | 43 +-
mm/mermap.c | 323 +++++++++++++
mm/mm_init.c | 11 +-
mm/page_alloc.c | 782 +++++++++++++++++++++++---------
mm/page_isolation.c | 2 +-
mm/page_owner.c | 7 +-
mm/page_reporting.c | 4 +-
mm/pgalloc-track.h | 6 +
mm/show_mem.c | 4 +-
mm/tests/mermap_kunit.c | 231 ++++++++++
mm/tests/page_alloc_kunit.c | 250 ++++++++++
39 files changed, 2099 insertions(+), 477 deletions(-)
---
base-commit: 44982d352c33767cd8d19f8044e7e1161a587ff7
change-id: 20260112-page_alloc-unmapped-944fe5d7b55c
Best regards,
--
Brendan Jackman <jackmanb@google.com>
On 25/02/2026 17:34, Brendan Jackman wrote:
> .:::: Design: Introducing "freetypes"
>
> The biggest challenge for efficiently getting stuff out of the direct
> map is TLB flushing. Pushing this problem into the page allocator turns
> out to enable amortising that flush cost into almost nothing. The core
> idea is to have pools of already-unmapped pages. We'd like those pages
> to be physically contiguous so they don't unduly fragment the pagetables
> around them, and we'd like to be able to efficiently look up these
> already-unmapped pages during allocation. The page allocator already has
> deeply-ingrained functionality for physically grouping pages by a
> certain attribute, and then indexing free pages by that attribute, this
> mechanism is: migratetypes.
>
> So basically, this series extends the concepts of migratetypes in the
> allocator so that as well as just representing mobility, they can
> represent other properties of the page too. (Actually, migratetypes are
> already sort of overloaded, but the main extension is to be able to
> represent _orthogonal_ properties). In order to avoid further
> overloading the concept of a migratetype, this extension is done by
> adding a new concept on top of migratetype: the _freetype_. A freetype
> is basically just a migratetype plus some flags, and it replaces
> migratetypes wherever the latter is currently used to index free
> pages.
>
> The first freetype flag is then added, which marks the pages it indexes
> as being absent from the direct map. This is then used to implement the
> new __GFP_UNMAPPED flag, which allocates pages from pageblocks that have
> the new flag, or unmaps pages if no existing ones are already available.

This approach seems very interesting to me, and I wonder if it could be
applied to another use-case.

I am working on a security feature to protect page table pages (PTPs)
using pkeys [1]. This relies on all PTPs being mapped with a specific
pkey (in the direct map). That requires changing a mapping attribute
rather than making it invalid, but AFAICT this is essentially the same
problem as the one you're trying to solve.

There are however extra challenges with mapping PTPs with special
attributes. The main one, which you mention in patch 17, is that
splitting the direct map may require allocating PTPs, which may lead to
recursion.

[1] introduces a dedicated page table allocator on top of the buddy
allocator, which attempts to cache PMD-sized blocks if possible. It
ensures that no recursion occurs by using a special flag when allocating
PTPs while splitting the direct map, and keeping a reserve of pages
specifically for that situation (patch 15 and 24). There is also special
handling for early page tables (essentially keeping track of them and
setting their pkey once we can split the direct map).

Do you think that this freetype infrastructure could be used for that
purpose, instead of introducing a layer on top of the buddy allocator? I
expect that much of the special handling for allocating PTPs can be kept
separate. Ensuring that protected pages are always available to split
the direct map may be difficult though... This is deeply embedded in the
allocator I proposed.

- Kevin

[1] https://lore.kernel.org/linux-hardening/20260227175518.3728055-1-kevin.brodsky@arm.com/
On Thu Mar 5, 2026 at 2:51 PM UTC, Kevin Brodsky wrote:
> On 25/02/2026 17:34, Brendan Jackman wrote:
>> .:::: Design: Introducing "freetypes"
>>
>> The biggest challenge for efficiently getting stuff out of the direct
>> map is TLB flushing. Pushing this problem into the page allocator turns
>> out to enable amortising that flush cost into almost nothing. The core
>> idea is to have pools of already-unmapped pages. We'd like those pages
>> to be physically contiguous so they don't unduly fragment the pagetables
>> around them, and we'd like to be able to efficiently look up these
>> already-unmapped pages during allocation. The page allocator already has
>> deeply-ingrained functionality for physically grouping pages by a
>> certain attribute, and then indexing free pages by that attribute, this
>> mechanism is: migratetypes.
>>
>> So basically, this series extends the concepts of migratetypes in the
>> allocator so that as well as just representing mobility, they can
>> represent other properties of the page too. (Actually, migratetypes are
>> already sort of overloaded, but the main extension is to be able to
>> represent _orthogonal_ properties). In order to avoid further
>> overloading the concept of a migratetype, this extension is done by
>> adding a new concept on top of migratetype: the _freetype_. A freetype
>> is basically just a migratetype plus some flags, and it replaces
>> migratetypes wherever the latter is currently used to index free
>> pages.
>>
>> The first freetype flag is then added, which marks the pages it indexes
>> as being absent from the direct map. This is then used to implement the
>> new __GFP_UNMAPPED flag, which allocates pages from pageblocks that have
>> the new flag, or unmaps pages if no existing ones are already available.
>
> This approach seems very interesting to me, and I wonder if it could be
> applied to another use-case.
>
> I am working on a security feature to protect page table pages (PTPs)
> using pkeys [1]. This relies on all PTPs being mapped with a specific
> pkey (in the direct map). That requires changing a mapping attribute
> rather than making it invalid, but AFAICT this is essentially the same
> problem as the one you're trying to solve.

Yeah, I think so:

1. The fragmentation issues seem exactly the same.

2. The TLB flushing issues are probably also basically the same, I
   assume you need to flush the TLB when you convert a page to use for
   pagetables, and without allocator integration that can happen pretty
   often and in hot paths. Correct?

> There are however extra challenges with mapping PTPs with special
> attributes. The main one, which you mention in patch 17, is that
> splitting the direct map may require allocating PTPs, which may lead to
> recursion.
>
> [1] introduces a dedicated page table allocator on top of the buddy
> allocator, which attempts to cache PMD-sized blocks if possible. It
> ensures that no recursion occurs by using a special flag when allocating
> PTPs while splitting the direct map, and keeping a reserve of pages
> specifically for that situation (patch 15 and 24).

Right, and actually just today someone pointed out mm/execmem.c to me, I
think execmem_cache_populate() is basically doing the same thing
(although it's also creating a separate virtual mapping).

> There is also special
> handling for early page tables (essentially keeping track of them and
> setting their pkey once we can split the direct map).
>
> Do you think that this freetype infrastructure could be used for that
> purpose, instead of introducing a layer on top of the buddy allocator?

Yes!!! 100% definitely, my code certainly solves all your problems...

> I
> expect that much of the special handling for allocating PTPs can be kept
> separate. Ensuring that protected pages are always available to split
> the direct map may be difficult though... This is deeply embedded in the
> allocator I proposed.

...Oh, hm, well, um, good point. Thinking aloud a bit...

The way this series dodges the question is (copying from the code
comments in patch 17 for convenient reading):

 * 1) The direct map starts out fully mapped at boot. (This is not really
 *    an "assumption" as it's in direct control of page_alloc.c).
 *
 * 2) Once pages in the direct map are broken down, they are not
 *    re-aggregated into larger pages again.
 *
 * 3) Pagetables are never allocated with __GFP_UNMAPPED.
 *
 * Under these assumptions, a pagetable might need to be allocated while
 * _unmapping_ stuff from the direct map during a __GFP_UNMAPPED
 * allocation. But, the allocation of that pagetable never requires
 * allocating a further pagetable.

In other words, we might need to allocate while we allocate (which is
fine because I have to do locking shenanigans anyway due to x86 TLB
shootdown requirements), but there's no further recursion after that.

Can we come up with an analogue for protected PTPs? Point 3) is
the inflexible one, and we obviously can't say "PTPs are never allocated
as PTPs". But if we invert it and _also_ invert point 1) I think we get
something that works in principle:

1) The direct map starts out _fully protected_ (i.e. we treat everything
   as if it's a pagetable at first).

2) We assume the direct map doesn't get reaggregated once we've broken
   things down to serve PTP allocations

3) PTPs are always PTPs...

But... this is a bit silly, since what it means is we'll then go through
~all the pageblocks in the system (except the ones that _are_ actually
used for PTPs) and flip their pkey, breaking down the physmap to
pageblock granularity as we go. And... if we're gonna do that, we might
as well just say the physmap has to be at pageblock granularity to begin
with.

(Could we do that? Maybe - Mike Rapoport has previously argued that
physmap fragmentation is not a very big deal, so I guess the question
is whether we're ready to really lean into that analysis, it would be
quite painful if it turned out to be wrong).

Another potential "dodge": Is it really important that the PTPs are
always protected from the very moment they are created?
Coz this feature still seems pretty useful even if there's an awkward
fallback case where, under specific memory pressure patterns, we
temporarily use unprotected pagetables to set up protected pagetables.
That still makes exploiting a pagetable overwrite an order of magnitude
harder than before, right? Similar to how there's probably ways to
exploit bugs if you can get them to race with the intended pagetable
update paths that flip the pkey register, or if you can get a ROP chain
to flip that register for you or whatever.
On 05/03/2026 16:58, Brendan Jackman wrote:
> On Thu Mar 5, 2026 at 2:51 PM UTC, Kevin Brodsky wrote:
>> [...]
>> This approach seems very interesting to me, and I wonder if it could be
>> applied to another use-case.
>>
>> I am working on a security feature to protect page table pages (PTPs)
>> using pkeys [1]. This relies on all PTPs being mapped with a specific
>> pkey (in the direct map). That requires changing a mapping attribute
>> rather than making it invalid, but AFAICT this is essentially the same
>> problem as the one you're trying to solve.
> Yeah, I think so:
>
> 1. The fragmentation issues seem exactly the same.

I believe so.

> 2. The TLB flushing issues are probably also basically the same, I
>    assume you need to flush the TLB when you convert a page to use for
>    pagetables, and without allocator integration that can happen pretty
>    often and in hot paths. Correct?

Indeed. Up until v5 [2] no special allocator was used - the pkey was set
at the page level every time a PTP was allocated or freed. Clearly
suboptimal, and doesn't work at all if large mappings are used due to
the risk of recursion.

>> There are however extra challenges with mapping PTPs with special
>> attributes. The main one, which you mention in patch 17, is that
>> splitting the direct map may require allocating PTPs, which may lead to
>> recursion.
>>
>> [1] introduces a dedicated page table allocator on top of the buddy
>> allocator, which attempts to cache PMD-sized blocks if possible. It
>> ensures that no recursion occurs by using a special flag when allocating
>> PTPs while splitting the direct map, and keeping a reserve of pages
>> specifically for that situation (patch 15 and 24).
> Right, and actually just today someone pointed out mm/execmem.c to me, I
> think execmem_cache_populate() is basically doing the same thing
> (although it's also creating a separate virtual mapping).

Ah interesting I didn't know about that cache. It does have
similarities, and the motivation seems similar too.

>> There is also special
>> handling for early page tables (essentially keeping track of them and
>> setting their pkey once we can split the direct map).
>>
>> Do you think that this freetype infrastructure could be used for that
>> purpose, instead of introducing a layer on top of the buddy allocator?
> Yes!!! 100% definitely, my code certainly solves all your problems...

Almost ;)

>> I
>> expect that much of the special handling for allocating PTPs can be kept
>> separate. Ensuring that protected pages are always available to split
>> the direct map may be difficult though... This is deeply embedded in the
>> allocator I proposed.
> ...Oh, hm, well, um, good point. Thinking aloud a bit...
>
> The way this series dodges the question is (copying from the code
> comments in patch 17 for convenient reading):
>
>  * 1) The direct map starts out fully mapped at boot. (This is not really
>  *    an "assumption" as it's in direct control of page_alloc.c).
>  *
>  * 2) Once pages in the direct map are broken down, they are not
>  *    re-aggregated into larger pages again.
>  *
>  * 3) Pagetables are never allocated with __GFP_UNMAPPED.
>  *
>  * Under these assumptions, a pagetable might need to be allocated while
>  * _unmapping_ stuff from the direct map during a __GFP_UNMAPPED
>  * allocation. But, the allocation of that pagetable never requires
>  * allocating a further pagetable.
>
> In other words, we might need to allocate while we allocate (which is
> fine because I have to do locking shenanigans anyway due to x86 TLB
> shootdown requirements), but there's no further recursion after that.
>
> Can we come up with an analogue for protected PTPs? Point 3) is
> the inflexible one, and we obviously can't say "PTPs are never allocated
> as PTPs". But if we invert it and _also_ invert point 1) I think we get
> something that works in principle:
>
> 1) The direct map starts out _fully protected_ (i.e. we treat everything
>    as if it's a pagetable at first).
>
> 2) We assume the direct map doesn't get reaggregated once we've broken
>    things down to serve PTP allocations
>
> 3) PTPs are always PTPs...
>
> But... this is a bit silly, since what it means is we'll then go through
> ~all the pageblocks in the system (except the ones that _are_ actually
> used for PTPs) and flip their pkey, breaking down the physmap to
> pageblock granularity as we go. And... if we're gonna do that, we might
> as well just say the physmap has to be at pageblock granularity to begin
> with.

Having to change the pkey of every pageblock when allocating it for
anything but page tables seems rather unreasonable... And in case of
memory pressure, where fragmentation is high, we may not have any
protected pageblock left. The allocator I proposed falls back to order-2
allocations if necessary (which is sufficient to replenish the page
reserve even if PMD+PTE pages are allocated for splitting).

> (Could we do that? Maybe - Mike Rapoport has previously argued that
> physmap fragmentation is not a very big deal, so I guess the question
> is whether we're ready to really lean into that analysis, it would be
> quite painful if it turned out to be wrong).
>
> Another potential "dodge": Is it really important that the PTPs are
> always protected from the very moment they are created?
> Coz this feature still seems pretty useful even if there's an awkward
> fallback case where, under specific memory pressure patterns, we
> temporarily use unprotected pagetables to set up protected pagetables.
> That still makes exploiting a pagetable overwrite an order of magnitude
> harder than before, right? Similar to how there's probably ways to
> exploit bugs if you can get them to race with the intended pagetable
> update paths that flip the pkey register, or if you can get a ROP chain
> to flip that register for you or whatever.

I considered this - I agree that having page tables unprotected inside a
small window may be acceptable, considering that this is hardening and
not bullet-proof isolation. That said, I'm not sure it helps all that
much. You'd need a mechanism to defer setting the pkey for those PTPs.
Once you decide to set the pkey, you may very well end up splitting the
direct map again, deferring new PTPs... This could go on, and every time
fragmentation increases. I think it is really desirable to have that
reserve of pages so that splitting the direct map does not become
recursive (whether deferred or not).

- Kevin

[2] https://lore.kernel.org/linux-hardening/20250815085512.2182322-1-kevin.brodsky@arm.com/
On Fri, Mar 06, 2026 at 01:31:15PM +0100, Kevin Brodsky wrote:
> On 05/03/2026 16:58, Brendan Jackman wrote:
>
> > Right, and actually just today someone pointed out mm/execmem.c to me, I
> > think execmem_cache_populate() is basically doing the same thing
> > (although it's also creating a separate virtual mapping).
>
> Ah interesting I didn't know about that cache. It does have
> similarities, and the motivation seems similar too.

The motivation for execmem cache is slightly different. The goal there
was to ensure kernel's executable memory (modules, kprobes, ftrace and
potentially BPF) is mapped at PMD level at vmalloc address space. And
the removal of the direct map alias for execmem is rather a side effect :)

But sure, there are similarities.

--
Sincerely yours,
Mike.
On 2/25/26 17:34, Brendan Jackman wrote:
> .:: What? Why?
> .:: Why [RFC]?
>
> I really wanted to stop sending RFC and start sending PATCHes but
> getting this series out has taken months longer than I expected, so it's
> time to get something on the list. The known issues here are:
>
> 1. __GFP_UNMAPPED isn't useful yet until guest_memfd unmapping support
>    [0] gets merged.
>
> 2. Apparently while implementing the mm-local region, I totally forgot
>    that KPTI existed on 32-bit systems. I expect the 0-day bot to fire a
>    failure on that patch.

I don't think you mentioned (at least in the cover letter) the mm resistance
to add new gfp flags due to number of them being uncomfortably close to 32
already. But I see you've put the new one behind a config. Together with
point 2 I wonder if this is where we can start making some flags and
associated functionality 64-bit only and change gfp_t to unsigned long?
Hey Vlastimil, sorry for the delay I've been unexpectedly out of office.

On Mon Mar 2, 2026 at 3:36 PM UTC, Vlastimil Babka (SUSE) wrote:
> On 2/25/26 17:34, Brendan Jackman wrote:
>> .:: What? Why?
>> .:: Why [RFC]?
>>
>> I really wanted to stop sending RFC and start sending PATCHes but
>> getting this series out has taken months longer than I expected, so it's
>> time to get something on the list. The known issues here are:
>>
>> 1. __GFP_UNMAPPED isn't useful yet until guest_memfd unmapping support
>>    [0] gets merged.
>>
>> 2. Apparently while implementing the mm-local region, I totally forgot
>>    that KPTI existed on 32-bit systems. I expect the 0-day bot to fire a
>>    failure on that patch.
> I don't think you mentioned (at least in the cover letter) the mm resistance
> to add new gfp flags due to number of them being uncomfortably close to 32
> already. But I see you've put the new one behind a config. Together with
> point 2 I wonder if this is where we can start making some flags and
> associated functionality 64-bit only and change gfp_t to unsigned long?

Yeah, making __GFP_UNMAPPED 64bit-only would be fine with me.

Ultimately the fact that we have KPTI for 32-bit makes it sound like we
would also want ASI for 32-bit, so I guess I would still want to add a
GFP flag to support that on 32-bit. But that's a pretty futuristic
problem, I would say we should focus on __GFP_UNMAPPED in isolation
right now.

(Just to be clear regarding point 2 - that bug still matters, even if
__GFP_UNMAPPED itself is 64-bit only the mm-local region is separate and
needs to be correct on 32-bit).
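For readers following the gfp_t discussion, here is a minimal userspace sketch of what a "64-bit only" flag could look like if the flag word were widened to unsigned long. Everything in it is an assumption made up for illustration: the gfp_sketch_t type, the GFP_SKETCH_UNMAPPED name and the choice of bit 32 do not come from the series or from the kernel headers.

```c
#include <limits.h>

/* Hypothetical: a flag word as wide as unsigned long, as suggested above. */
typedef unsigned long gfp_sketch_t;

#if ULONG_MAX > 0xffffffffUL
/* 64-bit long: the flag can live above the crowded low 32 bits. */
#define GFP_SKETCH_UNMAPPED (1UL << 32)
#else
/* 32-bit long: the flag compiles away, so the feature is unavailable. */
#define GFP_SKETCH_UNMAPPED 0UL
#endif

/* True only where the flag exists and the caller asked for it. */
static inline int gfp_sketch_wants_unmapped(gfp_sketch_t gfp)
{
	return (gfp & GFP_SKETCH_UNMAPPED) != 0;
}
```

Defining the flag as zero on 32-bit lets callers keep passing it unconditionally while the check folds to a constant false. Whether that silent fallback is acceptable would depend on the caller.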
On Thu, Mar 05, 2026 at 11:16:07AM +0000, Brendan Jackman wrote:
> Ultimately the fact that we have KPTI for 32-bit makes it sound like we
> would also want ASI for 32-bit, so I guess I would still want to add a
> GFP flag to support that on 32-bit. But that's a pretty futuristic
> problem, I would say we should focus on __GFP_UNMAPPED in isolation
> right now.
I'd wait until someone really really presents a valid use case to not move to
64-bit. And then justify the effort for adding and supporting ASI on 32-bit.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Wed, 2026-02-25 at 16:34 +0000, Brendan Jackman wrote:
> __GFP_UNMAPPED

Haven't looked at this in detail, but there was some previous work that even
used the same flag name. In the end, the discussion leaned towards a dedicated
API instead of a flag. Not saying the flag approach is dead, but might be
useful to explain how it fits in with that discussion.

https://lore.kernel.org/lkml/20230308094106.227365-1-rppt@kernel.org/
Hi Rick, sorry I was just scanning over this thread again while prepping
the next version and noticed I never replied to this.

On Fri Mar 6, 2026 at 5:38 PM UTC, Rick P Edgecombe wrote:
> On Wed, 2026-02-25 at 16:34 +0000, Brendan Jackman wrote:
>> __GFP_UNMAPPED
>
> Haven't looked at this in detail, but there was some previous work that even
> used the same flag name. In the end, the discussion leaned towards a dedicated
> API instead of a flag. Not saying the flag approach is dead, but might be
> useful to explain how it fits in with that discussion.
>
> https://lore.kernel.org/lkml/20230308094106.227365-1-rppt@kernel.org/

I am not at all wed to using a GFP flag for this, but a key difference
between this and Mike's original __GFP_UNMAPPED is that this is
integrated directly into the page allocator itself, and that's a
load-bearing element.

Technically speaking, in this series __GFP_UNMAPPED is only supported
for unmovable allocations, but that's just to avoid bloating the data
structures (there isn't a user for that type of allocation yet, so
there's no point in creating freelists for it). But, in principle, the
goal here is to support all the fancy stuff that the mm does for this
memory. That's important because for the real usecases I have in mind
here, the vast majority of memory in the system should eventually be
relying on the page allocator to unmap it (either completely as in
__GFP_UNMAPPED, or just from the special ASI pagetables as in
__GFP_SENSITIVE, which will be added later).

So, yeah we can always have a special API but that would be a bit of a
roundabout way to just save a bit in an enum, it wouldn't actually
represent any simplification of the page allocator's API.

Anyway thanks for pointing this out, I will need to explain this in the
next version's cover letter, but in the meantime there's a quick
braindump of my thinking.