[PATCH RFC 00/11] mm: ASI integration for the page allocator

Brendan Jackman posted 11 patches 9 months, 1 week ago
Posted by Brendan Jackman 9 months, 1 week ago
.:: Intro

This code illustrates the idea I'm proposing at LSF/MM/BPF [0].
Sorry it's so close to the conference, I was initially quite ambitious
in what I wanted to show here and tried to implement a more complete
patch series. Now I've run out of time and I've had to reduce the scope
and just hack some minimal stuff together. Now, this series is _only_
supposed to be about page_alloc.c, everything else is just there as
scaffolding so that allocator code can be discussed.

I've marked the most incomplete patches with [HACKS] in the title to
illustrate what aspects are less worthy of attention.

See [0] and also [1] for broader context on the ASI/page_alloc topic.
See [2] for context about ASI itself. For this RFC the most important
fact is: ASI requires creating another kernel address space (the
"restricted address space") that is a subset of the normal one (i.e.
the "unrestricted address space"). That is, an address space just like
the normal one, but with holes in it. Pages that are unmapped from the
restricted address space are called "sensitive".

.:: The Idea

What is sensitive (i.e. where the holes are) is decided at allocation
time. This series illustrates an initial implementation of that
capability for the direct map. The basic idea of this implementation is
to operate at pageblock granularity, and use migratetypes to track
sensitivity. The key advantages of this approach are:

- Migratetypes exist to avoid fragmentation. Using them to index pages
  by sensitivity takes advantage of this, so that the physmap doesn't
  get fragmented with respect to sensitivity. This means we can use
  large TLB entries for the restricted physmap.

- Since pageblocks are never smaller than a PMD mapping, if the
  restricted physmap is always made of PMDs, we never have to break down
  mappings while changing sensitivity. This means we don't have
  difficulties with needing to allocate pagetables in the middle of the
  allocator.

- Migratetypes already offer indexing capability - that is, there are
  separate freelists for each migratetype. This means when the user
  allocates a page with a given sensitivity, all the infrastructure is
  already in place to look up a page that is already mapped/unmapped as
  needed (if it exists). This minimizes unnecessary TLB flushes.
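
To make the freelist-indexing point above concrete, here is a toy
userspace model. It is only a sketch under loud assumptions: every name
in it (toy_zone, toy_alloc(), TOY_SENSITIVE, ...) is hypothetical and
does not appear in the series, a whole pageblock is treated as the unit
of allocation, and the "conversions" counter merely stands in for the
asi_map()/asi_unmap() work that a real sensitivity change would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy types: NOT the kernel's migratetypes or zones. */
enum toy_migratetype { TOY_SENSITIVE, TOY_NONSENSITIVE };

#define TOY_NR_BLOCKS 8

struct toy_zone {
	enum toy_migratetype type[TOY_NR_BLOCKS]; /* per-pageblock type */
	bool free[TOY_NR_BLOCKS];                 /* is the block free? */
	unsigned int conversions;                 /* mapping changes so far */
};

static void toy_zone_init(struct toy_zone *z)
{
	/* Everything starts out sensitive, i.e. absent from the restricted map. */
	for (size_t i = 0; i < TOY_NR_BLOCKS; i++) {
		z->type[i] = TOY_SENSITIVE;
		z->free[i] = true;
	}
	z->conversions = 0;
}

/* Stand-in for a per-migratetype freelist lookup. */
static int toy_find_free(const struct toy_zone *z, enum toy_migratetype mt)
{
	for (size_t i = 0; i < TOY_NR_BLOCKS; i++)
		if (z->free[i] && z->type[i] == mt)
			return (int)i;
	return -1;
}

/* Returns a block index, or -1 if nothing is free. */
static int toy_alloc(struct toy_zone *z, bool sensitive)
{
	enum toy_migratetype want = sensitive ? TOY_SENSITIVE : TOY_NONSENSITIVE;
	enum toy_migratetype other = sensitive ? TOY_NONSENSITIVE : TOY_SENSITIVE;
	int idx = toy_find_free(z, want);

	if (idx < 0) {
		/* Slow path: steal a block of the other type and convert it. */
		idx = toy_find_free(z, other);
		if (idx < 0)
			return -1;
		z->type[idx] = want;
		z->conversions++;
	}
	z->free[idx] = false;
	return idx;
}

static void toy_free(struct toy_zone *z, int idx)
{
	/* The block keeps its type, so a later same-type request reuses it. */
	z->free[idx] = true;
}
```

In this model, allocating nonsensitive, freeing, and allocating
nonsensitive again costs one conversion in total: the second request is
served from the block that is already mapped as needed.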

This differs from Mike Rapoport's work on __GFP_UNMAPPED [3] in that,
instead of having a totally separate free area for the pages that are
unmapped, it aims to pervade the allocator. If it turns out that
nonsensitive pages (or sensitive pages, which seems highly unlikely) do
not need access to the full feature set of the page allocator for a
performant system, we could certainly do something like Mike's patchset.
But we don't have any reason to expect a correlation between
sensitivity and performance needs.

.:: Patchset overview

- Patch 1 adds a minimal subset of the base ASI framework that was
  introduced by the RFCv2 [2].

- Patches 2-5 add the necessary framework for creating and manipulating
  the ASI physmap. This is the area where I have had to reduce the scope
  of this series. I had hoped to present a proper integration here, but
  instead I've had to just hack something together that kinda works.
  You can probably skip over this section.

- Patches 6-8 are preparatory hacks and changes to the generic mm code.

- Patches 9-11 are the important bit. The new migratetypes are created.
  Then logic is added to create nonsensitive pageblocks when needed.
  Then logic is added to change them back to sensitive pageblocks when
  needed.
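
As a rough illustration of what "splitting MIGRATE_UNMOVABLE by
sensitivity" could mean for the gfp-to-migratetype mapping, here is a
standalone sketch. It is not the code from the patches: the TOY_* flag
values and enum names are hypothetical, and it assumes (going only by
the patch titles) that movable and reclaimable allocations are always
treated as sensitive, so only unmovable requests consult the
sensitivity flag.

```c
#include <assert.h>

/* Hypothetical flag bits; the real __GFP_* values differ. */
typedef unsigned int toy_gfp_t;
#define TOY_GFP_MOVABLE     (1u << 0)
#define TOY_GFP_RECLAIMABLE (1u << 1)
#define TOY_GFP_SENSITIVE   (1u << 2)

/* Hypothetical migratetypes with the unmovable type split in two. */
enum toy_migratetype {
	TOY_MIGRATE_UNMOVABLE_SENSITIVE,
	TOY_MIGRATE_UNMOVABLE_NONSENSITIVE,
	TOY_MIGRATE_MOVABLE,
	TOY_MIGRATE_RECLAIMABLE,
};

static enum toy_migratetype toy_gfp_migratetype(toy_gfp_t gfp)
{
	/* Assumption: mobility wins; these types are never nonsensitive. */
	if (gfp & TOY_GFP_MOVABLE)
		return TOY_MIGRATE_MOVABLE;
	if (gfp & TOY_GFP_RECLAIMABLE)
		return TOY_MIGRATE_RECLAIMABLE;

	/* Only unmovable allocations are indexed by sensitivity. */
	return (gfp & TOY_GFP_SENSITIVE) ? TOY_MIGRATE_UNMOVABLE_SENSITIVE
					 : TOY_MIGRATE_UNMOVABLE_NONSENSITIVE;
}
```

With a mapping of this shape, the existing per-migratetype freelists
do the sensitivity indexing and no separate lookup structure is needed.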

.:: TODOs

 - This doesn't let you allocate from MIGRATE_HIGHATOMIC pageblocks
   unless you have __GFP_SENSITIVE. We probably need to make the
   pageblock type and per-freelist logic more advanced to be able to
   account for this.

 - When pages transition from sensitive to nonsensitive, they need to be
   zeroed to prevent any leftover data being leaked. This series doesn't
   address that requirement at all.

 - Although I think the abstract design is OK, the actual implementation
   of calling asi_map()/asi_unmap() from page_alloc.c is pretty
   confusing: asi_map() is implicit when calling
   set_pageblock_migratetype() but asi_unmap() is up to the caller. This
   requires some refactoring.

 - Changes to the unrestricted physmap (page protection changes, memory
   hotplug) are not properly mirrored into the restricted physmap.

 - There's no integration with CMA. The branch at [4] has some minimal
   integration into alloc_contig_range().
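
For the zeroing TODO above, the ordering that matters can be sketched
in a few lines of userspace C. This is an illustration only: toy_page
and toy_make_nonsensitive() are hypothetical names, the boolean stands
in for "present in the restricted physmap", and nothing here reflects
how the series (which does not address zeroing at all) would actually
do it.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical stand-in for a page plus its restricted-map state. */
struct toy_page {
	unsigned char data[TOY_PAGE_SIZE];
	bool mapped_in_restricted; /* true means nonsensitive */
};

/*
 * Transition sensitive -> nonsensitive. Stale contents are cleared
 * BEFORE the page becomes visible in the restricted address space, so
 * leftover sensitive data is never reachable from there.
 */
static void toy_make_nonsensitive(struct toy_page *page)
{
	if (page->mapped_in_restricted)
		return; /* already nonsensitive, nothing to scrub */

	memset(page->data, 0, sizeof(page->data));
	page->mapped_in_restricted = true;
}
```

The point of the sketch is just the ordering: zero first, map second.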

.:: References

[0] https://lore.kernel.org/linux-mm/CA+i-1C169s8pyqZDx+iSnFmftmGfssdQA29+pYm-gqySAYWgpg@mail.gmail.com/
[1] Some slides I presented in an earlier discussion of this topic:
    https://docs.google.com/presentation/d/1Ozuan7E4z2YWm4V6uE_fe7YoF2BdS3m5jXjDKO7DVy0/edit#slide=id.g32d28ea451a_0_43
[2] https://lore.kernel.org/linux-mm/20250110-asi-rfc-v2-v2-0-8419288bc805@google.com/
[3] https://lore.kernel.org/all/20230308094106.227365-1-rppt@kernel.org/
[5] https://lore.kernel.org/linux-mm/20250129144320.2675822-1-jackmanb@google.com/

This series is available as a branch with some additional testing here:

[4] https://github.com/bjackman/linux/tree/asi/page-alloc-lsfmmbpf25

This applies to mm-unstable.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
Brendan Jackman (11):
      x86/mm: Bare minimum ASI API for page_alloc integration
      x86/mm: Factor out phys_pgd_init()
      x86/mm: Add lookup_pgtable_in_pgd()
      x86/mm/asi: Sync physmap into ASI_GLOBAL_NONSENSITIVE
      [RFC HACKS] Add asi_map() and asi_unmap()
      mm/page_alloc: Add __GFP_SENSITIVE and always set it
      [RFC HACKS] mm/slub: Set __GFP_SENSITIVE for reclaimable slabs
      [RFC HACKS] mm/page_alloc: Simplify gfp_migratetype()
      mm/page_alloc: Split MIGRATE_UNMOVABLE by sensitivity
      mm/page_alloc: Add support for nonsensitive allocations
      mm/page_alloc: Add support for ASI-unmapping pages

 arch/Kconfig                         |  14 ++++
 arch/x86/Kconfig                     |   1 +
 arch/x86/include/asm/asi.h           |  36 ++++++++
 arch/x86/include/asm/pgtable_types.h |   2 +
 arch/x86/mm/Makefile                 |   1 +
 arch/x86/mm/asi.c                    |  85 +++++++++++++++++++
 arch/x86/mm/init.c                   |   3 +-
 arch/x86/mm/init_64.c                |  53 ++++++++++--
 arch/x86/mm/pat/set_memory.c         |  34 ++++++++
 include/linux/asi.h                  |  20 +++++
 include/linux/gfp.h                  |  30 ++++---
 include/linux/gfp_types.h            |  15 +++-
 include/linux/mmzone.h               |  19 ++++-
 include/linux/vmalloc.h              |   4 +
 mm/internal.h                        |   5 ++
 mm/memory_hotplug.c                  |   2 +-
 mm/page_alloc.c                      | 158 +++++++++++++++++++++++++++++++----
 mm/show_mem.c                        |  13 +--
 mm/slub.c                            |   6 +-
 mm/vmalloc.c                         |  32 ++++---
 20 files changed, 475 insertions(+), 58 deletions(-)
---
base-commit: 5ee93e1a769230377c3d44edd4917e8df77be566
change-id: 20250310-asi-page-alloc-80ea1f8307d0

Best regards,
-- 
Brendan Jackman <jackmanb@google.com>
Re: [PATCH RFC 00/11] mm: ASI integration for the page allocator
Posted by Brendan Jackman 6 months, 1 week ago
On Thu Mar 13, 2025 at 6:11 PM UTC, Brendan Jackman wrote:
> .:: Patchset overview

Hey all, I have been down the pagetable mines lately trying to figure
out a solution to the page cache issue (the 70% FIO degradation [0]).
I've got a prototype based on the idea I discussed at LSF/MM/BPF
that's slowly coming together. My hope is that as soon as I can
convincingly claim with a straight face that I know how to solve that
problem, I can transition from <post an RFC every N months then
disappear> mode into being a bit more visible with development
iterations...

[0] https://lore.kernel.org/linux-mm/20250129144320.2675822-1-jackmanb@google.com/

In the meantime, I am still provisionally planning to make the topic
of this RFC the first [PATCH] series for ASI. Obviously before I can
seriously ask Andrew to merge I'll also need to establish some
consensus on the x86 side, but in the meantime I think we're getting
close enough to start discussing the mm code.

So.. does anyone have a bit of time to look over this and see if the
implementation makes sense? Is the basic idea on the right lines?
Also if there's anything I can do to make that easier (is it worth
rebasing?) let me know.

I guess I should also note my aspirational plan for the next few
months. It goes...

1. Get a convincing PoC working that improves the FIO degradation.

2. Gather it into a fairly messy but at least surveyable branch and push
   that to Github or whatever.

3. Show that to x86 folks and hopefully (!!) get some maintainers to
   give a nod like "yep we want ASI and we're more or less sold that
   the developers know how to make it performant".

4. Turn this [RFC] into a [PATCH]. So start by trying to merge the stuff
   that manages the restricted address space, leaving the logic of actually
   _using_ it for a later series.

5. [Maybe this can be partially parallelised with 4] start a new [PATCH]
   series that starts adding in the x86 stuff to actually switch address
   spaces etc. Basically this means respinning the patches that Boris
   has reviewed in [1]. Since we already have the page_alloc stuff, it
   should be possible to start testing this code end-to-end quickly.

[1] https://lore.kernel.org/linux-mm/20250110-asi-rfc-v2-v2-0-8419288bc805@google.com/

Anyone have any thoughts on that overall strategy?

Cheers,
Brendan