[v1] mm: Add __GFP_UNMAPPED

[PATCH RFC 00/19] mm: Add __GFP_UNMAPPED

Posted by Brendan Jackman 1 month, 2 weeks ago

.:: What? Why?

This series adds support for efficiently allocating pages that are not
present in the direct map. This is instrumental to two different
immediate goals:

1. This supports the effort to remove guest_memfd memory from the direct
   map [0]. One of the challenges faced in that effort has been
   efficiently eliminating TLB entries; this series offers a solution to
   that problem.

2. Address Space Isolation (ASI) [1] also needs an efficient way to
   allocate pages that are missing from the direct map. Although for ASI
   the needs are slightly different (in that case, the pages need only
   be removed from ASI's special pagetables), the most interesting mm
   challenges are basically the same.

   So, __GFP_UNMAPPED serves as a Trojan horse to get the page allocator
   into a state where adding ASI's features "Should Be Easy".

   This series _also_ serves as a Trojan horse for the "mermap" (details
   below) which is also a key building block for making ASI efficient.

Longer term, there are a wide range of security techniques unlocked by
being able to efficiently remove pages from the kernel's address space.

There may also be non-security usecases for this feature, for example
at LPC Sumit Garg presented an issue with memory-firewalled client
devices that could be remediated by __GFP_UNMAPPED [2].

.:: Design

The key design elements introduced here are just repurposed from
previous attempts to directly introduce ASI's needs to the page
allocator [3]. The only real difference is that now these support
totally unmapping stuff from the direct map, instead of only unmapping
it from ASI's special pagetables.

.:::: Design: Introducing "freetypes"

The biggest challenge for efficiently getting stuff out of the direct
map is TLB flushing. Pushing this problem into the page allocator turns
out to enable amortising that flush cost into almost nothing. The core
idea is to have pools of already-unmapped pages. We'd like those pages
to be physically contiguous so they don't unduly fragment the pagetables
around them, and we'd like to be able to efficiently look up these
already-unmapped pages during allocation. The page allocator already has
a deeply-ingrained mechanism for physically grouping pages by a certain
attribute, and then indexing free pages by that attribute:
migratetypes.
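
To make the amortisation argument concrete, here is a minimal user-space
sketch. It is not code from the series: the names and the 512-page
pageblock size are assumptions for illustration only. It models a pool
that unmaps a whole pageblock at once and pays for one flush per refill,
rather than one flush per allocated page.

```c
#include <stddef.h>

/* Assumption: a 2 MiB pageblock made of 4 KiB pages. */
#define SKETCH_PAGES_PER_BLOCK 512

/* Hypothetical model of a pool of already-unmapped pages. */
struct unmapped_pool {
	size_t free_pages;  /* already-unmapped pages ready to hand out */
	size_t tlb_flushes; /* flushes paid so far */
};

/* Take one unmapped page, refilling a whole pageblock when the pool is empty. */
static void pool_alloc_page(struct unmapped_pool *pool)
{
	if (pool->free_pages == 0) {
		/* Unmap a whole pageblock and pay for a single flush. */
		pool->free_pages = SKETCH_PAGES_PER_BLOCK;
		pool->tlb_flushes++;
	}
	pool->free_pages--;
}

/* Number of flushes needed to serve nr_pages allocations from an empty pool. */
static size_t flushes_for(size_t nr_pages)
{
	struct unmapped_pool pool = { 0, 0 };

	for (size_t i = 0; i < nr_pages; i++)
		pool_alloc_page(&pool);
	return pool.tlb_flushes;
}
```

In this model 512 allocations cost one flush, where unmapping on every
allocation would cost 512. Pages that are freed back into the pool while
still unmapped would make the picture better still, but that is left out
of the sketch.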

So basically, this series extends the concepts of migratetypes in the
allocator so that as well as just representing mobility, they can
represent other properties of the page too. (Actually, migratetypes are
already sort of overloaded, but the main extension is to be able to
represent _orthogonal_ properties). In order to avoid further
overloading the concept of a migratetype, this extension is done by
adding a new concept on top of migratetype: the _freetype_. A freetype
is basically just a migratetype plus some flags, and it replaces
migratetypes wherever the latter is currently used to index free
pages.

The first freetype flag is then added, which marks the pages it indexes
as being absent from the direct map. This is then used to implement the
new __GFP_UNMAPPED flag, which allocates pages from pageblocks that have
the new flag, or unmaps pages if no existing ones are already available.
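
As a rough illustration of "a migratetype plus some flags", the sketch
below packs both into a small struct and derives a free-list index from
the pair. Every name here (freetype_sketch_t, SKETCH_FREETYPE_UNMAPPED
and so on) is hypothetical, and the layout is an assumption; the real
definitions are in the patch that introduces freetype_t.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the allocator's migratetypes. */
enum migratetype_sketch {
	SKETCH_UNMOVABLE,
	SKETCH_MOVABLE,
	SKETCH_RECLAIMABLE,
	SKETCH_NR_MIGRATETYPES,
};

/* Hypothetical flag: the indexed pages are absent from the direct map. */
#define SKETCH_FREETYPE_UNMAPPED (1u << 0)

/* Wrapping in a struct stops a bare migratetype being passed by accident. */
typedef struct {
	unsigned int migratetype;
	unsigned int flags;
} freetype_sketch_t;

static inline freetype_sketch_t make_freetype(unsigned int mt, unsigned int flags)
{
	freetype_sketch_t ft = { .migratetype = mt, .flags = flags };

	return ft;
}

static inline bool freetype_unmapped(freetype_sketch_t ft)
{
	return (ft.flags & SKETCH_FREETYPE_UNMAPPED) != 0;
}

/* One free list per (flags, migratetype) combination. */
static inline unsigned int freetype_index(freetype_sketch_t ft)
{
	return ft.flags * SKETCH_NR_MIGRATETYPES + ft.migratetype;
}
```

The point of the distinct type is that mobility and "unmapped" stay
orthogonal: a movable mapped page and a movable unmapped page land on
different free lists without inventing a new migratetype for each
combination.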

.:::: Design: Introducing the "mermap"

Sharp readers might by now be asking how __GFP_UNMAPPED interacts with
__GFP_ZERO. If pages aren't in the direct map, how can the page
allocator zero them? The solution is the "mermap", short for "epheMERal
mapping". The mermap provides an efficient way to temporarily map pages
into the local address space, and the allocator uses these mappings to
zero pages.

Using the mermap securely requires some knowledge about the usage of the
pages. One slightly awkward part of this design is that the page
allocator's usage of the mermap then "leaks" out so that callers who
allocate with __GFP_UNMAPPED|__GFP_ZERO need to be aware of the mermap's
security implications. For the guest_memfd unmapping usecase, that means
when guest_memfd.c makes these special allocations, it is only safe
because the pages will belong to the current process. In other words,
the use of the mermap potentially allows that process to leak the pages
via CPU sidechannels (unless more holistic/expensive mitigations are
enabled).

Since this cover letter is already too long I won't describe most
details of the mermap here, please see the patch that introduces it.

But one key detail is that it requires a kernel-space but mm-local
virtual address region. So... this series adds that too (for x86). This
is called the mm-local region and is implemented by "just" extending and
generalising the LDT remap area.

.:: Outline of the patchset

- Patches  1 ->  2 introduce the mm-local region for x86

- Patches  3 ->  5 introduce the mermap

- Patches  6 -> 14 introduce freetypes

  - Patch 8 in particular is the big annoying switch-over which changes
    a whole bunch of code from "migratetype" to "freetype". In order to
    try and have the compiler help out with catching bugs, this is done
    with an annoying typedef. I'm sorry that this patch is so annoying,
    but I think if we do want to extend the allocator along these lines
    then a typedef + big annoying patch is probably the safest way.

- Patches 15 -> 19 introduce __GFP_UNMAPPED

.:: Why [RFC]?

I really wanted to stop sending RFC and start sending PATCHes but
getting this series out has taken months longer than I expected, so it's
time to get something on the list. The known issues here are:

1. __GFP_UNMAPPED isn't useful yet until guest_memfd unmapping support
   [0] gets merged.

2. Apparently while implementing the mm-local region, I totally forgot
   that KPTI existed on 32-bit systems. I expect the 0-day bot to fire a
   failure on that patch.

There is also one really nasty hack in mermap.c, namely
set_unmapped_pte(). This is basically a symptom of the problem I
propose to discuss at LSF/MM/BPF [3], i.e. the fact that there are
lots of pagetable libraries yet none of them are flexible enough to do
anything new (in this case the "new thing" is pre-allocating pagetables
then subsequently populating them in a separate context). Whether this
particular hack should block merging the mermap is not clear to me, I'd
be interested to hear opinions.

.:: Performance

In [4] is a branch containing: 

1. This series.

2. All the key kernel patches from the Firecracker team's "secret-free"
   effort, which includes guest_memfd unmapping ([0]).

3. Some prototype patches to switch guest_memfd over from an ad-hoc
   unmapping logic to use of __GFP_UNMAPPED (plus direct use of the
   mermap to implement write()).

I benchmarked this using Firecracker's own performance tests [4], which
measure the time required to populate the VM guest's memory. This
population happens via write() so it exercises the mermap. I ran this on
a Sapphire Rapids machine [5]. The baseline here is just the secret-free
patches on their own. "gfp_unmapped" is the branch described above.
"skip-flush" provides a reference against an implementation that just
skips flushing the TLB when unmapping guest_memfd pages, which serves as
an upper-bound on performance.

metric: populate_latency (ms)   |  test: firecracker-perf-tests-wrapped
+---------------+---------+----------+----------+------------------------+----------+--------+
| nixos_variant | samples |     mean |      min | histogram              |      max | Δμ     |
+---------------+---------+----------+----------+------------------------+----------+--------+
| (baseline)    |      30 |    1.04s |    1.02s |                     █  |    1.10s |        |
| gfp_unmapped  |      30 | 313.02ms | 299.48ms |       █                | 343.25ms | -70.0% |
| skip-flush    |      30 | 325.80ms | 307.91ms |       █                | 333.30ms | -68.8% |
+---------------+---------+----------+----------+------------------------+----------+--------+

Conclusion: gfp_unmapped gets close to the best-case performance for this
particular workload. (Note in the sample above the gfp_unmapped mean is
actually faster than skip-flush - that's noise, this isn't a consistent
observation).

[0] [PATCH v10 00/15] Direct Map Removal Support for guest_memfd
    https://lore.kernel.org/all/20260126164445.11867-1-kalyazin@amazon.com/

[1] https://linuxasi.dev/

[2] https://lpc.events/event/19/contributions/2095/

[3] https://lore.kernel.org/all/20260219175113.618562-1-jackmanb@google.com/

[4] https://github.com/bjackman/kernel-benchmarks-nix/blob/fd56c93344760927b71161368230a15741a5869f/packages/benchmarks/firecracker-perf-tests/firecracker-perf-tests.sh

[5] https://github.com/bjackman/aethelred/blob/eb0dd0e99ee08fa0534733113e93b89499affe91

Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: x86@kernel.org
Cc: rppt@kernel.org
Cc: Sumit Garg <sumit.garg@oss.qualcomm.com>
To: Borislav Petkov <bp@alien8.de>
To: Dave Hansen <dave.hansen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>
To: Andrew Morton <akpm@linux-foundation.org>
To: David Hildenbrand <david@kernel.org>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Vlastimil Babka <vbabka@kernel.org>
To: Mike Rapoport <rppt@kernel.org>
To: Wei Xu <weixugc@google.com>
To: Johannes Weiner <hannes@cmpxchg.org>
To: Zi Yan <ziy@nvidia.com>
Cc: yosryahmed@google.com
Cc: derkling@google.com
Cc: reijiw@google.com
Cc: Will Deacon <will@kernel.org>
Cc: rientjes@google.com
Cc: "Kalyazin, Nikita" <kalyazin@amazon.co.uk>
Cc: patrick.roy@linux.dev
Cc: "Itazuri, Takahiro" <itazur@amazon.co.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: David Kaplan <david.kaplan@amd.com>
Cc: Thomas Gleixner <tglx@kernel.org>

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
Brendan Jackman (19):
      x86/mm: split out preallocate_sub_pgd()
      x86/mm: Generalize LDT remap into "mm-local region"
      x86/tlb: Expose some flush function declarations to modules
      x86/mm: introduce the mermap
      mm: KUnit tests for the mermap
      mm: introduce for_each_free_list()
      mm/page_alloc: don't overload migratetype in find_suitable_fallback()
      mm: introduce freetype_t
      mm: move migratetype definitions to freetype.h
      mm: add definitions for allocating unmapped pages
      mm: rejig pageblock mask definitions
      mm: encode freetype flags in pageblock flags
      mm/page_alloc: remove ifdefs from pindex helpers
      mm/page_alloc: separate pcplists by freetype flags
      mm/page_alloc: rename ALLOC_NON_BLOCK back to _HARDER
      mm/page_alloc: introduce ALLOC_NOBLOCK
      mm/page_alloc: implement __GFP_UNMAPPED allocations
      mm/page_alloc: implement __GFP_UNMAPPED|__GFP_ZERO allocations
      mm: Minimal KUnit tests for some new page_alloc logic

 Documentation/arch/x86/x86_64/mm.rst    |   4 +-
 arch/x86/Kconfig                        |   3 +
 arch/x86/include/asm/mermap.h           |  23 +
 arch/x86/include/asm/mmu_context.h      |  71 ++-
 arch/x86/include/asm/pgalloc.h          |  33 ++
 arch/x86/include/asm/pgtable_64_types.h |  19 +-
 arch/x86/include/asm/pgtable_types.h    |   2 +
 arch/x86/include/asm/tlbflush.h         |  43 +-
 arch/x86/kernel/ldt.c                   | 137 ++----
 arch/x86/mm/init_64.c                   |  44 +-
 arch/x86/mm/pgtable.c                   |   3 +
 include/linux/freetype.h                | 147 ++++++
 include/linux/gfp.h                     |  25 +-
 include/linux/gfp_types.h               |  26 ++
 include/linux/mermap.h                  |  63 +++
 include/linux/mermap_types.h            |  43 ++
 include/linux/mm.h                      |  13 +
 include/linux/mm_types.h                |   6 +
 include/linux/mmzone.h                  |  84 ++--
 include/linux/pageblock-flags.h         |  16 +-
 include/trace/events/mmflags.h          |   9 +-
 kernel/fork.c                           |   6 +
 kernel/panic.c                          |   2 +
 kernel/power/snapshot.c                 |   8 +-
 mm/Kconfig                              |  41 ++
 mm/Makefile                             |   3 +
 mm/compaction.c                         |  36 +-
 mm/init-mm.c                            |   3 +
 mm/internal.h                           |  43 +-
 mm/mermap.c                             | 323 +++++++++++++
 mm/mm_init.c                            |  11 +-
 mm/page_alloc.c                         | 782 +++++++++++++++++++++++---------
 mm/page_isolation.c                     |   2 +-
 mm/page_owner.c                         |   7 +-
 mm/page_reporting.c                     |   4 +-
 mm/pgalloc-track.h                      |   6 +
 mm/show_mem.c                           |   4 +-
 mm/tests/mermap_kunit.c                 | 231 ++++++++++
 mm/tests/page_alloc_kunit.c             | 250 ++++++++++
 39 files changed, 2099 insertions(+), 477 deletions(-)
---
base-commit: 44982d352c33767cd8d19f8044e7e1161a587ff7
change-id: 20260112-page_alloc-unmapped-944fe5d7b55c

Best regards,
-- 
Brendan Jackman <jackmanb@google.com>

Re: [PATCH RFC 00/19] mm: Add __GFP_UNMAPPED

Posted by Kevin Brodsky 1 month, 1 week ago

On 25/02/2026 17:34, Brendan Jackman wrote:
> .:::: Design: Introducing "freetypes"
>
> The biggest challenge for efficiently getting stuff out of the direct
> map is TLB flushing. Pushing this problem into the page allocator turns
> out to enable amortising that flush cost into almost nothing. The core
> idea is to have pools of already-unmapped pages. We'd like those pages
> to be physically contiguous so they don't unduly fragment the pagetables
> around them, and we'd like to be able to efficiently look up these
> already-unmapped pages during allocation. The page allocator already has
> deeply-ingrained functionality for physically grouping pages by a
> certain attribute, and then indexing free pages by that attribute, this
> mechanism is: migratetypes.
>
> So basically, this series extends the concepts of migratetypes in the
> allocator so that as well as just representing mobility, they can
> represent other properties of the page too. (Actually, migratetypes are
> already sort of overloaded, but the main extension is to be able to
> represent _orthogonal_ properties). In order to avoid further
> overloading the concept of a migratetype, this extension is done by
> adding a new concept on top of migratetype: the _freetype_. A freetype
> is basically just a migratetype plus some flags, and it replaces
> migratetypes wherever the latter is currently used as to index free
> pages.
>
> The first freetype flag is then added, which marks the pages it indexes
> as being absent from the direct map. This is then used to implement the
> new __GFP_UNMAPPED flag, which allocates pages from pageblocks that have
> the new flag, or unmaps pages if no existing ones are already available.

This approach seems very interesting to me, and I wonder if it could be
applied to another use-case.

I am working on a security feature to protect page table pages (PTPs)
using pkeys [1]. This relies on all PTPs being mapped with a specific
pkey (in the direct map). That requires changing a mapping attribute
rather than making it invalid, but AFAICT this is essentially the same
problem as the one you're trying to solve.

There are however extra challenges with mapping PTPs with special
attributes. The main one, which you mention in patch 17, is that
splitting the direct map may require allocating PTPs, which may lead to
recursion.

[1] introduces a dedicated page table allocator on top of the buddy
allocator, which attempts to cache PMD-sized blocks if possible. It
ensures that no recursion occurs by using a special flag when allocating
PTPs while splitting the direct map, and keeping a reserve of pages
specifically for that situation (patches 15 and 24). There is also special
handling for early page tables (essentially keeping track of them and
setting their pkey once we can split the direct map).

Do you think that this freetype infrastructure could be used for that
purpose, instead of introducing a layer on top of the buddy allocator? I
expect that much of the special handling for allocating PTPs can be kept
separate. Ensuring that protected pages are always available to split
the direct map may be difficult though... This is deeply embedded in the
allocator I proposed.

- Kevin

[1]
https://lore.kernel.org/linux-hardening/20260227175518.3728055-1-kevin.brodsky@arm.com/

Re: [PATCH RFC 00/19] mm: Add __GFP_UNMAPPED

Posted by Brendan Jackman 1 month, 1 week ago

On Thu Mar 5, 2026 at 2:51 PM UTC, Kevin Brodsky wrote:
> On 25/02/2026 17:34, Brendan Jackman wrote:
>> .:::: Design: Introducing "freetypes"
>>
>> The biggest challenge for efficiently getting stuff out of the direct
>> map is TLB flushing. Pushing this problem into the page allocator turns
>> out to enable amortising that flush cost into almost nothing. The core
>> idea is to have pools of already-unmapped pages. We'd like those pages
>> to be physically contiguous so they don't unduly fragment the pagetables
>> around them, and we'd like to be able to efficiently look up these
>> already-unmapped pages during allocation. The page allocator already has
>> deeply-ingrained functionality for physically grouping pages by a
>> certain attribute, and then indexing free pages by that attribute, this
>> mechanism is: migratetypes.
>>
>> So basically, this series extends the concepts of migratetypes in the
>> allocator so that as well as just representing mobility, they can
>> represent other properties of the page too. (Actually, migratetypes are
>> already sort of overloaded, but the main extension is to be able to
>> represent _orthogonal_ properties). In order to avoid further
>> overloading the concept of a migratetype, this extension is done by
>> adding a new concept on top of migratetype: the _freetype_. A freetype
>> is basically just a migratetype plus some flags, and it replaces
>> migratetypes wherever the latter is currently used as to index free
>> pages.
>>
>> The first freetype flag is then added, which marks the pages it indexes
>> as being absent from the direct map. This is then used to implement the
>> new __GFP_UNMAPPED flag, which allocates pages from pageblocks that have
>> the new flag, or unmaps pages if no existing ones are already available.
>
> This approach seems very interesting to me, and I wonder if it could be
> applied to another use-case.
>
> I am working on a security feature to protect page table pages (PTPs)
> using pkeys [1]. This relies on all PTPs being mapped with a specific
> pkey (in the direct map). That requires changing a mapping attribute
> rather than making it invalid, but AFAICT this is essentially the same
> problem as the one you're trying to solve.

Yeah, I think so:

1. The fragmentation issues seem exactly the same.

2. The TLB flushing issues are probably also basically the same. I
assume you need to flush the TLB when you convert a page to use for
pagetables, and without allocator integration that can happen pretty
often and in hot paths. Correct?

> There are however extra challenges with mapping PTPs with special
> attributes. The main one, which you mention in patch 17, is that
> splitting the direct map may require allocating PTPs, which may lead to
> recursion.
>
> [1] introduces a dedicated page table allocator on top of the buddy
> allocator, which attempts to cache PMD-sized blocks if possible. It
> ensures that no recursion occurs by using a special flag when allocating
> PTPs while splitting the direct map, and keeping a reserve of pages
> specifically for that situation (patch 15 and 24). 

Right, and actually just today someone pointed out mm/execmem.c to me, I
think execmem_cache_populate() is basically doing the same thing
(although it's also creating a separate virtual mapping).

> There is also special
> handling for early page tables (essentially keeping track of them and
> setting their pkey once we can split the direct map).
>
> Do you think that this freetype infrastructure could be used for that
> purpose, instead of introducing a layer on top of the buddy allocator? 

Yes!!! 100% definitely, my code certainly solves all your problems...

> I
> expect that much of the special handling for allocating PTPs can be kept
> separate. Ensuring that protected pages are always available to split
> the direct map may be difficult though... This is deeply embedded in the
> allocator I proposed.

...Oh, hm, well, um, good point. Thinking aloud a bit...

The way this series dodges the question is (copying from the code
comments in patch 17 for convenient reading):

1) The direct map starts out fully mapped at boot. (This is not really
   an "assumption" as it's in direct control of page_alloc.c).

2) Once pages in the direct map are broken down, they are not
   re-aggregated into larger pages again.

3) Pagetables are never allocated with __GFP_UNMAPPED.

Under these assumptions, a pagetable might need to be allocated while
_unmapping_ stuff from the direct map during a __GFP_UNMAPPED
allocation. But, the allocation of that pagetable never requires
allocating a further pagetable.

In other words, we might need to allocate while we allocate (which is
fine because I have to do locking shenanigans anyway due to x86 TLB
shootdown requirements), but there's no further recursion after that.
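
The bounded nesting can be sketched as a toy model. This is not the
patch 17 code: sketch_alloc(), sketch_split_direct_map() and the flag
value are all hypothetical, and the model pessimistically assumes every
unmapped allocation needs a split.

```c
#include <stdbool.h>

#define SKETCH_GFP_UNMAPPED 0x1u /* hypothetical flag value */

static int max_depth; /* deepest allocation nesting seen so far */

static void sketch_alloc(unsigned int gfp, bool need_split, int depth);

/* Splitting a large direct-map mapping needs one new pagetable page. */
static void sketch_split_direct_map(int depth)
{
	/* Assumption 3: pagetables are never allocated with __GFP_UNMAPPED. */
	sketch_alloc(0, true, depth + 1);
}

static void sketch_alloc(unsigned int gfp, bool need_split, int depth)
{
	if (depth > max_depth)
		max_depth = depth;
	/* Only unmapped allocations touch the direct map, so only they can split. */
	if ((gfp & SKETCH_GFP_UNMAPPED) && need_split)
		sketch_split_direct_map(depth);
}

/* Deepest nesting reached by a single top-level allocation. */
static int sketch_depth_for(unsigned int gfp)
{
	max_depth = 0;
	sketch_alloc(gfp, true, 1);
	return max_depth;
}
```

In this model a __GFP_UNMAPPED allocation nests exactly one level (the
pagetable allocation), and that inner allocation can never trigger
another split because it does not carry the flag.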

Can we come up with an analogue for protected PTPs? Point 3) is
the inflexible one, and we obviously can't say "PTPs are never allocated
as PTPs". But if we invert it and _also_ invert point 1) I think we get
something that works in principle:

1) The direct map starts out _fully protected_ (i.e. we treat everything
   as if it's a pagetable at first).

2) We assume the direct map doesn't get reaggregated once we've broken
   things down to serve PTP allocations

3) PTPs are always PTPs...

But... this is a bit silly, since what it means is we'll then go through
~all the pageblocks in the system (except the ones that _are_ actually
used for PTPs) and flip their pkey, breaking down the physmap to
pageblock granularity as we go. And... if we're gonna do that, we might
as well just say the physmap has to be at pageblock granularity to begin
with.

(Could we do that? Maybe - Mike Rapoport has previously argued that
physmap fragmentation is not a very big deal, so I guess the question
is whether we're ready to really lean into that analysis, it would be
quite painful if it turned out to be wrong).

Another potential "dodge": Is it really important that the PTPs are
always protected from the very moment they are created?
Coz this feature still seems pretty useful even if there's an awkward
fallback case where, under specific memory pressure patterns, we
temporarily use unprotected pagetables to set up protected pagetables.
That still makes exploiting a pagetable overwrite an order of magnitude
harder than before, right? Similar to how there are probably ways to
exploit bugs if you can get them to race with the intended pagetable
update paths that flip the pkey register, or if you can get a ROP chain
to flip that register for you or whatever.

Re: [PATCH RFC 00/19] mm: Add __GFP_UNMAPPED

Posted by Kevin Brodsky 1 month, 1 week ago

On 05/03/2026 16:58, Brendan Jackman wrote:
> On Thu Mar 5, 2026 at 2:51 PM UTC, Kevin Brodsky wrote:
>> [...]
>> This approach seems very interesting to me, and I wonder if it could be
>> applied to another use-case.
>>
>> I am working on a security feature to protect page table pages (PTPs)
>> using pkeys [1]. This relies on all PTPs being mapped with a specific
>> pkey (in the direct map). That requires changing a mapping attribute
>> rather than making it invalid, but AFAICT this is essentially the same
>> problem as the one you're trying to solve.
> Yeah, I think so:
>
> 1. The fragmentation issues seem exactly the same.

I believe so.

> 2. The TLB flushing issues are probably also basically the same, I
> assume you need to flush the TLB when you convert a page to use for
> pagetables, and without allocator integration that can happen pretty
> often and in hot paths. Correct?

Indeed. Up until v5 [2] no special allocator was used - the pkey was set
at the page level every time a PTP was allocated or freed. Clearly
suboptimal, and doesn't work at all if large mappings are used due to
the risk of recursion.

>> There are however extra challenges with mapping PTPs with special
>> attributes. The main one, which you mention in patch 17, is that
>> splitting the direct map may require allocating PTPs, which may lead to
>> recursion.
>>
>> [1] introduces a dedicated page table allocator on top of the buddy
>> allocator, which attempts to cache PMD-sized blocks if possible. It
>> ensures that no recursion occurs by using a special flag when allocating
>> PTPs while splitting the direct map, and keeping a reserve of pages
>> specifically for that situation (patch 15 and 24). 
> Right, and actually just today someone pointed out mm/execmem.c to me, I
> think execmem_cache_populate() is basically doing the same thing
> (although it's also creating a separate virtual mapping).

Ah, interesting, I didn't know about that cache. It does have
similarities, and the motivation seems similar too.

>> There is also special
>> handling for early page tables (essentially keeping track of them and
>> setting their pkey once we can split the direct map).
>>
>> Do you think that this freetype infrastructure could be used for that
>> purpose, instead of introducing a layer on top of the buddy allocator? 
> Yes!!! 100% definitely, my code certainly solves all your problems...

Almost ;)

>> I
>> expect that much of the special handling for allocating PTPs can be kept
>> separate. Ensuring that protected pages are always available to split
>> the direct map may be difficult though... This is deeply embedded in the
>> allocator I proposed.
> ...Oh, hm, well, um, good point. Thinking aloud a bit...
>
> The way this series dodges the question is (copying from the code
> comments in patch 17 for convenient reading):
>
> 1) - The direct map starts out fully mapped at boot. (This is not really
>  *   an assumption" as its in direct control of page_alloc.c).
>  *
> 2) - Once pages in the direct map are broken down, they are not
>  *   re-aggregated into larger pages again.
>  *
> 3) - Pagetables are never allocated with __GFP_UNMAPPED.
>  *
>  * Under these assumptions, a pagetable might need to be allocated while
>  * _unmapping_ stuff from the direct map during a __GFP_UNMAPPED
>  * allocation. But, the allocation of that pagetable never requires
>  * allocating a further pagetable.
>
> In other words, we might need to allocate while we allocate (which is
> fine because I have to do locking shenanigans anyway due to x86 TLB
> shootdown requirements), but there's no further recursion after that.
>
> Can we come up with an analogue for protected PTPs? Point 3) is
> the inflexible one, and we obviously can't say "PTPs are never allocated
> as PTPs". But if we invert it and _also_ invert point 1) I think we get
> something that works in principle:
>
> 1) The direct map starts out _fully protected_ (i.e. we treat everything
>    as if it's a pagetable at first).
>
> 2) We assume the direct map doesn't get reaggregated once we've broken
>    things down to serve PTP allocations
>
> 3) PTPs are always PTPs...
>
> But... this is a bit silly, since what it means is we'll then go through
> ~all the pagetblocks in the system (except the ones that _are_ actually
> used for PTPs) and flip their pkey, breaking down the physmap to
> pageblock granularity as we go. And... if we're gonna do that, we might
> as well just say the physmap has to be at pageblock granularity to begin
> with.

Having to change the pkey of every pageblock when allocating it for
anything but page tables seems rather unreasonable... And in case of
memory pressure, where fragmentation is high, we may not have any
protected pageblock left. The allocator I proposed falls back to order-2
allocations if necessary (which is sufficient to replenish the page
reserve even if PMD+PTE pages are allocated for splitting).

> (Could we do that? Maybe - Mike Rapoport has previously argued that
> physmap fragmentation is not a very big deal, so I guess the question
> is whether we're ready to really lean into that analysis, it would be
> quite painful if it turned out to be wrong).
>
> Another potential "dodge": Is it really important that the PTPs are
> always protected from the very moment they are created?
> Coz this feature still seems pretty useful even if there's an awkward
> fallback case where, under specific memory pressure patterns, we
> temporarily use unprotected pagetables to set up protected pagetables.
> That still makes exploiting a pagetable overwrite an order of magnitude
> harder than before, right? Similar to how there's probably ways to
> exploit bugs if you can get them to race with the intended pagetable
> update paths that flip the pkey register, or if you can get a ROP chain
> to flip that register for you or whatever.

I considered this - I agree that having page tables unprotected inside a
small window may be acceptable, considering that this is hardening and
not bullet-proof isolation. That said, I'm not sure it helps all that
much. You'd need a mechanism to defer setting the pkey for those PTPs.
Once you decide to set the pkey, you may very well end up splitting the
direct map again, deferring new PTPs... This could go on, and every time
fragmentation increases. I think it is really desirable to have that
reserve of pages so that splitting the direct map does not become
recursive (whether deferred or not).

- Kevin

[2]
https://lore.kernel.org/linux-hardening/20250815085512.2182322-1-kevin.brodsky@arm.com/

Re: [PATCH RFC 00/19] mm: Add __GFP_UNMAPPED

Posted by Mike Rapoport 1 month, 1 week ago

On Fri, Mar 06, 2026 at 01:31:15PM +0100, Kevin Brodsky wrote:
> On 05/03/2026 16:58, Brendan Jackman wrote:
>
> > Right, and actually just today someone pointed out mm/execmem.c to me, I
> > think execmem_cache_populate() is basically doing the same thing
> > (although it's also creating a separate virtual mapping).
> 
> Ah interesting I didn't know about that cache. It does have
> similarities, and the motivation seems similar too.

The motivation for the execmem cache is slightly different. The goal there
was to ensure the kernel's executable memory (modules, kprobes, ftrace and
potentially BPF) is mapped at PMD level in the vmalloc address space.
And the removal of the direct map alias for execmem is rather a side effect :)

But sure, there are similarities.

-- 
Sincerely yours,
Mike.

Re: [PATCH RFC 00/19] mm: Add __GFP_UNMAPPED

Posted by Vlastimil Babka (SUSE) 1 month, 2 weeks ago

On 2/25/26 17:34, Brendan Jackman wrote:
> .:: What? Why?
> .:: Why [RFC]?
> 
> I really wanted to stop sending RFC and start sending PATCHes but
> getting this series out has taken months longer than I expected, so it's
> time to get something on the list. The known issues here are:
> 
> 1. __GFP_UNMAPPED isn't useful yet until guest_memfd unmapping support
>    [0] gets merged.
> 
> 2. Apparently while implementing the mm-local region, I totally forgot
>    that KPTI existed on 32-bit systems. I expect the 0-day bot to fire a
>    failure on that patch.

I don't think you mentioned (at least in the cover letter) the mm resistance
to adding new gfp flags due to the number of them being uncomfortably close
to 32 already. But I see you've put the new one behind a config. Together with
point 2 I wonder if this is where we can start making some flags and
associated functionality 64-bit only and change gfp_t to unsigned long?
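
A minimal sketch of what "64-bit only" flags could look like, assuming a
flags type widened to unsigned long. The names gfp_sketch_t and
SKETCH_GFP_UNMAPPED and the bit position 32 are hypothetical and are not
taken from the series or from gfp_types.h. Where the type is too narrow,
the flag compiles away to zero.

```c
#include <limits.h>

/* Hypothetical: a flags type that is 64 bits wide on 64-bit builds. */
typedef unsigned long gfp_sketch_t;

#define SKETCH_GFP_BITS ((unsigned int)(sizeof(gfp_sketch_t) * CHAR_BIT))

/* Hypothetical bit position just past the 32 bits available today. */
#define SKETCH_GFP_UNMAPPED_BIT 32

#if ULONG_MAX > 0xffffffffUL
/* Wide enough: the flag gets a real bit. */
#define SKETCH_GFP_UNMAPPED (1UL << SKETCH_GFP_UNMAPPED_BIT)
#else
/* Too narrow: the flag compiles away to zero and callers see no effect. */
#define SKETCH_GFP_UNMAPPED 0UL
#endif

/* True only when the caller asked for unmapped pages and the build supports it. */
static inline int sketch_wants_unmapped(gfp_sketch_t gfp)
{
	return (gfp & SKETCH_GFP_UNMAPPED) != 0;
}
```

With this shape, code that tests the flag still builds on 32-bit; the
test simply never fires there.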

Re: [PATCH RFC 00/19] mm: Add __GFP_UNMAPPED

Posted by Brendan Jackman 1 month, 1 week ago

Hey Vlastimil, sorry for the delay, I've been unexpectedly out of office.

On Mon Mar 2, 2026 at 3:36 PM UTC, Vlastimil Babka (SUSE) wrote:
> On 2/25/26 17:34, Brendan Jackman wrote:
>> .:: What? Why?
>> .:: Why [RFC]?
>> 
>> I really wanted to stop sending RFC and start sending PATCHes but
>> getting this series out has taken months longer than I expected, so it's
>> time to get something on the list. The known issues here are:
>> 
>> 1. __GFP_UNMAPPED isn't useful yet until guest_memfd unmapping support
>>    [0] gets merged.
>> 
>> 2. Apparently while implementing the mm-local region, I totally forgot
>>    that KPTI existed on 32-bit systems. I expect the 0-day bot to fire a
>>    failure on that patch.

> I don't think you mentioned (at least in the cover letter) the mm resistance
> to adding new gfp flags due to the number of them being uncomfortably close to 32
> already. But I see you've put the new one behind a config. Together with
> point 2 I wonder if this is where we can start making some flags and
> associated functionality 64-bit only and change gfp_t to unsigned long?

Yeah, making __GFP_UNMAPPED 64-bit only would be fine with me.

Ultimately the fact that we have KPTI for 32-bit makes it sound like we
would also want ASI for 32-bit, so I guess I would still want to add a
GFP flag to support that on 32-bit. But that's a pretty futuristic
problem, I would say we should focus on __GFP_UNMAPPED in isolation
right now.

(Just to be clear regarding point 2 - that bug still matters: even if
__GFP_UNMAPPED itself is 64-bit only, the mm-local region is separate
and needs to be correct on 32-bit).
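
A minimal sketch of the "64-bit only flag" idea discussed above, written as
plain userspace C rather than kernel code. Every name here (sketch_gfp_t,
SKETCH_GFP_KERNEL, SKETCH_GFP_UNMAPPED, sketch_wants_unmapped) is hypothetical
and the bit positions are made up for illustration; the only assumption taken
from the thread is that the flags type becomes unsigned long, so that bits
above 31 exist only where unsigned long is 64 bits wide.

```c
#include <limits.h>

/* Hypothetical stand-in for a gfp_t widened to unsigned long. */
typedef unsigned long sketch_gfp_t;

/* Illustrative low bit; not the kernel's real GFP_KERNEL value. */
#define SKETCH_GFP_KERNEL (1UL << 0)

#if ULONG_MAX > 0xffffffffUL
/* unsigned long has more than 32 bits, so a bit above 31 is available. */
#define SKETCH_GFP_UNMAPPED (1UL << 32)
#else
/*
 * On 32-bit the flag collapses to 0: callers can still name it, but any
 * test for it is constant-false and the feature is effectively absent.
 */
#define SKETCH_GFP_UNMAPPED 0UL
#endif

/* Returns nonzero if the caller asked for unmapped pages. */
static inline int sketch_wants_unmapped(sketch_gfp_t gfp)
{
	return (gfp & SKETCH_GFP_UNMAPPED) != 0;
}
```

With this shape, callers do not need their own #ifdefs: on a 32-bit build
sketch_wants_unmapped() always returns 0, while 64-bit builds get a real bit
without consuming one of the low 32.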

Re: [PATCH RFC 00/19] mm: Add __GFP_UNMAPPED

Posted by Borislav Petkov 1 month, 1 week ago

On Thu, Mar 05, 2026 at 11:16:07AM +0000, Brendan Jackman wrote:
> Ultimately the fact that we have KPTI for 32-bit makes it sound like we
> would also want ASI for 32-bit, so I guess I would still want to add a
> GFP flag to support that on 32-bit. But that's a pretty futuristic
> problem, I would say we should focus on __GFP_UNMAPPED in isolation
> right now.

I'd wait until someone really really presents a valid use case to not move to
64-bit. And then justify the effort for adding and supporting ASI on 32-bit.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH RFC 00/19] mm: Add __GFP_UNMAPPED

Posted by Edgecombe, Rick P 1 month, 1 week ago

On Wed, 2026-02-25 at 16:34 +0000, Brendan Jackman wrote:
> __GFP_UNMAPPED

Haven't looked at this in detail, but there was some previous work that even
used the same flag name. In the end, the discussion leaned towards a dedicated
API instead of a flag. Not saying the flag approach is dead, but might be useful to
explain how it fits in with that discussion.

https://lore.kernel.org/lkml/20230308094106.227365-1-rppt@kernel.org/

Re: [PATCH RFC 00/19] mm: Add __GFP_UNMAPPED

Posted by Brendan Jackman 1 month ago

Hi Rick, sorry I was just scanning over this thread again while prepping
the next version and noticed I never replied to this.

On Fri Mar 6, 2026 at 5:38 PM UTC, Rick P Edgecombe wrote:
> On Wed, 2026-02-25 at 16:34 +0000, Brendan Jackman wrote:
>> __GFP_UNMAPPED
>
> Haven't looked at this in detail, but there was some previous work that even
> used the same flag name. In the end, the discussion leaned towards a dedicated
> API instead of a flag. Not saying the flag approach is dead, but might be useful to
> explain how it fits in with that discussion.
>
> https://lore.kernel.org/lkml/20230308094106.227365-1-rppt@kernel.org/

I am not at all wed to using a GFP flag for this, but a key difference
between this and Mike's original __GFP_UNMAPPED is that this is
integrated directly into the page allocator itself, and that's a
load-bearing element.

Technically speaking, in this series __GFP_UNMAPPED is only supported
for unmovable allocations, but that's just to avoid bloating the data
structures (there isn't a user for that type of allocation yet, so
there's no point in creating freelists for it). But, in principle, the
goal here is to support all the fancy stuff that the mm does for this
memory. That's important because for the real usecases I have in mind
here, the vast majority of memory in the system should eventually be
relying on the page allocator to unmap it (either completely as in
__GFP_UNMAPPED, or just from the special ASI pagetables as in
__GFP_SENSITIVE, which will be added later).

So, yeah we can always have a special API but that would be a bit of a
roundabout way to just save a bit in an enum, it wouldn't actually
represent any simplification of the page allocator's API.

Anyway thanks for pointing this out, I will need to explain this in the
next version's cover letter, but in the meantime here's a quick
braindump of my thinking.