Topic type: MM
Presenter: Gregory Price <gourry@gourry.net>
This series introduces N_MEMORY_PRIVATE, a NUMA node state for memory
managed by the buddy allocator but excluded from normal allocations.
I present it with an end-to-end Compressed RAM service (mm/cram.c)
that would otherwise not be possible (or would be considerably more
difficult, be device-specific, and add to the ZONE_DEVICE boondoggle).
TL;DR
===
N_MEMORY_PRIVATE is all about isolating NUMA nodes and then punching
explicit holes in that isolation to do useful things we couldn't do
before without re-implementing entire portions of mm/ in a driver.
/* This is my memory. There are many like it, but this one is mine. */
rc = add_private_memory_driver_managed(nid, start, size, name, flags,
online_type, private_context);
page = alloc_pages_node(nid, __GFP_PRIVATE, 0);
/* Ok but I want to do something useful with it */
static const struct node_private_ops ops = {
	.migrate_to = my_migrate_to,
	.folio_migrate = my_folio_migrate,
	.flags = NP_OPS_MIGRATION | NP_OPS_MEMPOLICY,
};
node_private_set_ops(nid, &ops);
/* And now I can use mempolicy with my memory */
buf = mmap(...);
mbind(buf, len, mode, private_node, ...);
buf[0] = 0xdeadbeef; /* Faults onto private node */
/* And to be clear, no one else gets my memory */
buf2 = malloc(4096); /* Standard allocation */
buf2[0] = 0xdeadbeef; /* Can never land on private node */
/* But I can choose to migrate it to the private node */
move_pages(0, 1, &buf, &private_node, NULL, ...);
/* And more fun things like this */
Patchwork
===
A fully working branch based on cxl/next can be found here:
https://github.com/gourryinverse/linux/tree/private_compression
A QEMU device which can inject high/low interrupts can be found here:
https://github.com/gourryinverse/qemu/tree/compressed_cxl_clean
The additional patches on these branches are CXL and DAX driver
housecleaning only tangentially relevant to this RFC, so I've
omitted them here for the sake of keeping this set somewhat
clean. Those patches should (hopefully) be going upstream anyway.
Patches 1-22: Core Private Node Infrastructure
Patch 1: Introduce N_MEMORY_PRIVATE scaffolding
Patch 2: Introduce __GFP_PRIVATE
Patch 3: Apply allocation isolation mechanisms
Patch 4: Add N_MEMORY nodes to private fallback lists
Patches 5-9: Filter operations not yet supported
Patch 10: free_folio callback
Patch 11: split_folio callback
Patches 12-20: mm/ service opt-ins:
Migration, Mempolicy, Demotion, Write Protect,
Reclaim, OOM, NUMA Balancing, Compaction,
LongTerm Pinning
Patch 21: memory_failure callback
Patch 22: Memory hotplug plumbing for private nodes
Patch 23: mm/cram -- Compressed RAM Management
Patches 24-27: CXL Driver examples
Sysram Regions with Private node support
Basic Driver Example: (MIGRATION | MEMPOLICY)
Compression Driver Example (Generic)
Background
===
Today, drivers that want mm-like services on non-general-purpose
memory either use ZONE_DEVICE (self-managed memory) or hotplug into
N_MEMORY and accept the risk of uncontrolled allocation.
Neither option provides what we really want - the ability to:
1) selectively participate in mm/ subsystems, while
2) isolating that memory from general purpose use.
Some device-attached memory cannot be managed as fully general-purpose
system RAM. CXL devices with inline compression, for example, may
corrupt data or crash the machine if the compression ratio drops
below a threshold -- we simply run out of physical memory.
This is a hard problem to solve: how does an operating system deal
with a device that basically lies about how much capacity it has?
(We'll discuss that in the CRAM section)
Core Proposal: N_MEMORY_PRIVATE
===
Introduce N_MEMORY_PRIVATE, a NUMA node state for memory managed by
the buddy allocator, but excluded from normal allocation paths.
Private nodes:
- Are filtered from zonelist fallback: all existing callers to
get_page_from_freelist cannot reach these nodes through any
normal fallback mechanism.
- Filter allocation requests on __GFP_PRIVATE
numa_zone_allowed() excludes them otherwise.
Applies to systems with and without cpusets.
GFP_PRIVATE is (__GFP_PRIVATE | __GFP_THISNODE).
Services use it when they need to allocate specifically from
a private node (e.g., CRAM allocating a destination folio).
No existing allocator path sets __GFP_PRIVATE, so private nodes
are unreachable by default.
- Use standard struct page / folio. No ZONE_DEVICE, no pgmap,
no struct page metadata limitations.
- Use a node-scoped metadata structure to accomplish filtering
and callback support.
- May participate in the buddy allocator, reclaim, compaction,
and LRU like normal memory, gated by an opt-in set of flags.
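The __GFP_PRIVATE gating above can be modeled in a few lines of
plain userspace C. This is a toy sketch, not the series' actual
implementation - the real filter lives in the zonelist iteration
feeding get_page_from_freelist(), and the array/flag names here
are stand-ins:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES     8
#define __GFP_PRIVATE (1u << 0)   /* stand-in for the new gfp bit */

static bool node_is_private[MAX_NODES]; /* N_MEMORY_PRIVATE stand-in */

/* A private node is eligible only when the caller passes __GFP_PRIVATE. */
static bool numa_zone_allowed(int nid, unsigned int gfp_flags)
{
	if (node_is_private[nid])
		return gfp_flags & __GFP_PRIVATE;
	return true; /* normal N_MEMORY nodes are always eligible */
}
```

Since no existing allocator path sets the bit, every legacy caller
takes the "reject private node" branch by construction.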
The key abstraction is node_private_ops: a per-node callback table
registered by a driver or service.
Each callback is individually gated by an NP_OPS_* capability flag.
A driver opts in only to the mm/ operations it needs.
It is similar to ZONE_DEVICE's pgmap at a node granularity.
In fact...
Re-use of ZONE_DEVICE Hooks
===
The callback insertion points deliberately mirror existing ZONE_DEVICE
hooks to minimize the surface area of the mechanism.
I believe this could subsume most DEVICE_COHERENT users, and greatly
simplify the device-managed memory development process (no more
per-driver allocator and migration code).
(Also it's just "So Fresh, So Clean").
The base set of callbacks introduced include:
free_folio - mirrors ZONE_DEVICE's
free_zone_device_page() hook in
__folio_put() / folios_put_refs()
folio_split - mirrors ZONE_DEVICE's split notification;
called when a huge page is split up
migrate_to - demote_folio_list() custom demotion (same
site as ZONE_DEVICE demotion rejection)
folio_migrate - called when private node folio is moved to
another location (e.g. compaction)
handle_fault - mirrors the ZONE_DEVICE fault dispatch in
handle_pte_fault() (do_wp_page path)
reclaim_policy - called by reclaim to let a driver own the
boost lifecycle (driver can drive node reclaim)
memory_failure - parallels memory_failure_dev_pagemap(),
but for online pages that enter the normal
hwpoison path
At skip sites (mlock, madvise, KSM, user migration), a unified
folio_is_private_managed() predicate covers both ZONE_DEVICE and
N_MEMORY_PRIVATE folios, consolidating existing zone_device checks
with private node checks rather than adding new ones.
static inline bool folio_is_private_managed(struct folio *folio)
{
	return folio_is_zone_device(folio) ||
	       folio_is_private_node(folio);
}
Most integration points become a one-line swap:
- if (folio_is_zone_device(folio))
+ if (unlikely(folio_is_private_managed(folio)))
Where a one-line integration is insufficient, the integration is
kept as clean as possible with zone_device, rather than simply
adding more call-sites on top of it:
static inline bool folio_managed_handle_fault(struct folio *folio,
		struct vm_fault *vmf, vm_fault_t *ret)
{
	/* Zone device pages use swap entries; handled in do_swap_page */
	if (folio_is_zone_device(folio))
		return false;

	if (folio_is_private_node(folio)) {
		const struct node_private_ops *ops = folio_node_private_ops(folio);

		if (ops && ops->handle_fault) {
			*ret = ops->handle_fault(vmf);
			return true;
		}
	}
	return false;
}
Flag-gated behavior (NP_OPS_*) controls:
===
We use OPS flags to denote what mm/ services we want to allow on our
private node. I've plumbed these through so far:
NP_OPS_MIGRATION - Node supports migration
NP_OPS_MEMPOLICY - Node supports mempolicy actions
NP_OPS_DEMOTION - Node appears in demotion target lists
NP_OPS_PROTECT_WRITE - Node memory is read-only (wrprotect)
NP_OPS_RECLAIM - Node supports reclaim
NP_OPS_NUMA_BALANCING - Node supports numa balancing
NP_OPS_COMPACTION - Node supports compaction
NP_OPS_LONGTERM_PIN - Node supports longterm pinning
NP_OPS_OOM_ELIGIBLE - (MIGRATION | DEMOTION), node is reachable
as normal system ram storage, so it should
be considered in OOM pressure calculations.
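The flag-gating pattern can be sketched in userspace C. The
node_private_ops shape follows the TL;DR example; the node_ops
array and node_supports() helper are illustrative stand-ins for
the per-node lookup the series does through pgdat:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NP_OPS_MIGRATION (1u << 0)
#define NP_OPS_MEMPOLICY (1u << 1)
#define NP_OPS_DEMOTION  (1u << 2)

struct node_private_ops {
	int (*migrate_to)(void *folio);
	unsigned int flags;
};

static const struct node_private_ops *node_ops[8]; /* per-node table */

/* mm/ code checks the capability flag before invoking any service. */
static bool node_supports(int nid, unsigned int cap)
{
	const struct node_private_ops *ops = node_ops[nid];

	return ops && (ops->flags & cap);
}

/* example registration, analogous to node_private_set_ops(nid, &ops) */
static int noop_migrate(void *folio) { (void)folio; return 0; }
static const struct node_private_ops example_ops = {
	.migrate_to = noop_migrate,
	.flags = NP_OPS_MIGRATION | NP_OPS_MEMPOLICY,
};
```

A node that never sets NP_OPS_DEMOTION simply never appears in
demotion target lists; no per-service driver code is needed to
opt out.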
I wasn't quite sure how to classify ksm, khugepaged, madvise, and
mlock - so I have omitted those for now.
Most hooks are straightforward.
Including a node as a demotion-eligible target was as simple as:
static void establish_demotion_targets(void)
{
	..... snip .....
	/*
	 * Include private nodes that have opted in to demotion
	 * via NP_OPS_DEMOTION. A node might have custom migrate
	 */
	all_memory = node_states[N_MEMORY];
	for_each_node_state(node, N_MEMORY_PRIVATE) {
		if (node_private_has_flag(node, NP_OPS_DEMOTION))
			node_set(node, all_memory);
	}
	..... snip .....
}
The Migration and Mempolicy support are the two most complex pieces,
and most useful things are built on top of Migration (meaning the
remaining implementations are usually simple).
Private Node Hotplug Lifecycle
===
Registration follows a strict order enforced by
add_private_memory_driver_managed():
1. Driver calls add_private_memory_driver_managed(nid, start,
size, resource_name, mhp_flags, online_type, &np).
2. node_private_register(nid, &np) stores the driver's
node_private in pgdat and sets pgdat->private. N_MEMORY and
N_MEMORY_PRIVATE are mutually exclusive -- registration fails
with -EBUSY if the node already has N_MEMORY set.
Only one driver may register per private node.
3. Memory is hotplugged via __add_memory_driver_managed().
When online_pages() runs, it checks pgdat->private and sets
N_MEMORY_PRIVATE instead of N_MEMORY.
Zonelist construction gives private nodes a self-only NOFALLBACK
list and an N_MEMORY fallback list (so kernel/slab allocations on
behalf of private node work can fall back to DRAM).
4. kswapd and kcompactd are NOT started for private nodes. The
owning service is responsible for driving reclaim if needed
(e.g., CRAM uses watermark_boost to wake kswapd on demand).
Teardown is the reverse:
1. Driver calls offline_and_remove_private_memory(nid, start,
size).
2. offline_pages() offlines the memory. When the last block is
offlined, N_MEMORY_PRIVATE is cleared automatically.
3. node_private_unregister() clears pgdat->node_private and
drops the refcount. It refuses to unregister (-EBUSY) if
N_MEMORY_PRIVATE is still set (other memory ranges remain).
The driver is responsible for ensuring memory is hot-unpluggable
before teardown. The service must ensure all memory is cleaned
up before hot-unplug - or the service must support migration (so
memory_hotplug.c can evacuate the memory itself).
In the CRAM example, the service supports migration, so memory
hot-unplug can remove memory without any special infrastructure.
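The ordering constraints above (mutual exclusion with N_MEMORY,
one driver per node, no unregister while memory remains online)
can be modeled as a small state machine. This is a userspace
sketch with assumed field names - in the real series the
N_MEMORY_PRIVATE bit is set at online_pages() time, not at
registration:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct pgdat_model {
	bool n_memory;         /* node has N_MEMORY set */
	bool n_memory_private; /* node has N_MEMORY_PRIVATE set */
	void *node_private;    /* driver's registered context */
};

static int node_private_register(struct pgdat_model *p, void *np)
{
	if (p->n_memory || p->node_private)
		return -EBUSY; /* general-purpose node, or already owned */
	p->node_private = np;
	p->n_memory_private = true; /* really set when memory onlines */
	return 0;
}

static int node_private_unregister(struct pgdat_model *p)
{
	if (p->n_memory_private)
		return -EBUSY; /* memory ranges still online */
	p->node_private = NULL;
	return 0;
}
```

The -EBUSY on both ends is what enforces the strict
register -> online -> offline -> unregister lifecycle.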
Application: Compressed RAM (mm/cram)
===
Compressed RAM has a serious design issue: its capacity is a lie.
A compression device reports more capacity than it physically has.
If workloads write faster than the OS can reclaim from the device,
we run out of real backing store and corrupt data or crash.
I call this problem: "Trying to Out Run A Bear"
I.e., this is only stable as long as we stay ahead of the pressure.
We don't want to design a system where stability depends on outrunning
a bear - I am slow and do not know where to acquire bear spray.
Fun fact: Grizzly bears have a top speed of 56-64 km/h.
Unfun fact: Humans typically top out at ~24 km/h.
This MVP takes a conservative position:
all compressed memory is mapped read-only.
- Folios reach the private node only via reclaim (demotion)
- migrate_to implements custom demotion with backpressure.
- fixup_migration_pte write-protects PTEs on arrival.
- wrprotect hooks prevent silent upgrades
- handle_fault promotes folios back to DRAM on write.
- free_folio scrubs stale data before buddy free.
Because pages are read-only, writes can never cause runaway
compression ratio loss behind the allocator's back. Every write
goes through handle_fault, which promotes the folio to DRAM first.
The device only ever sees net compression (demotion in) and explicit
decompression (promotion out via fault or reclaim), and has a much
wider timeframe to respond to poor compression scenarios.
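The read-only invariant can be captured in a tiny model: demotion
always lands write-protected, and any write dispatches through the
fault path, which promotes to DRAM first. A sketch with assumed
names (the device_writes counter just makes the invariant
checkable):

```c
#include <assert.h>
#include <stdbool.h>

enum node_kind { NODE_DRAM, NODE_CRAM };

struct page_model {
	enum node_kind node;
	bool writable;
	int device_writes; /* writes the device would observe */
};

/* demotion: arrives on the compressed node write-protected
 * (the fixup_migration_pte wrprotect step) */
static void demote(struct page_model *pg)
{
	pg->node = NODE_CRAM;
	pg->writable = false;
}

/* handle_fault analogue: promote to DRAM before allowing the write */
static void write_page(struct page_model *pg)
{
	if (!pg->writable) {
		pg->node = NODE_DRAM; /* promotion (decompression) */
		pg->writable = true;
	}
	if (pg->node == NODE_CRAM)
		pg->device_writes++; /* unreachable under this policy */
}
```

In this model the device's observed write count stays at zero no
matter what the workload does, which is the whole point: the
compression ratio can only change at migration boundaries the OS
controls.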
That means there's no bear to outrun. The bears are safely asleep in
their bear den, and even if they show up we have a bear-proof cage.
The backpressure system is our bear-proof cage: the driver reports real
device utilization (generalized via watermark_boost on the private
node's zone), and CRAM throttles demotion when capacity is tight.
If compression ratios are bad, we stop demoting pages and start
evicting pages aggressively.
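A minimal sketch of that policy decision, assuming a device that
reports real utilization as a percentage. The thresholds and the
three-way split are made up for illustration - the series
generalizes the signal through watermark_boost rather than a
direct percentage:

```c
#include <assert.h>

enum cram_policy { CRAM_DEMOTE, CRAM_THROTTLE, CRAM_EVICT };

/* utilization_pct: real backing-store usage reported by the device */
static enum cram_policy cram_demotion_policy(unsigned int utilization_pct)
{
	if (utilization_pct >= 90)
		return CRAM_EVICT;    /* stop demoting, reclaim aggressively */
	if (utilization_pct >= 75)
		return CRAM_THROTTLE; /* rate-limit new demotions */
	return CRAM_DEMOTE;           /* capacity is healthy */
}
```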
The service as designed is ~350 functional lines of code because it
re-uses mm/ services:
- Existing reclaim/vmscan code handles demotion.
- Existing migration code handles migration to/from.
- Existing page fault handling dispatches faults.
The driver contains all the CXL nastiness core developers don't want
anything to do with - No vendor logic touches mm/ internals.
Future CRAM : Loosening the read-only constraint
===
The read-only model is safe but conservative. For workloads where
compressed pages are occasionally written, the promotion fault adds
latency. A future optimization could allow a tunable fraction of
compressed pages to be mapped writable, accepting some risk of
write-driven decompression in exchange for lower overhead.
The private node ops make this straightforward:
- Adjust fixup_migration_pte to selectively skip
write-protection.
- Use the backpressure system to either revoke writable mappings,
deny additional demotions, or evict when device pressure rises.
This comes at a mild memory overhead: 32MB of DRAM per 1TB of CRAM.
(1 bit per 4KB page).
This is not proposed here, but it should be somewhat trivial.
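The overhead arithmetic checks out: 1 TiB at 4 KiB per page is
2^28 pages, and one bit per page packs 8 pages per byte, giving
2^25 bytes = 32 MiB. As a quick sanity check:

```c
#include <assert.h>
#include <stdint.h>

/* 1 bit of DRAM-side tracking per 4KiB page of CRAM */
static uint64_t bitmap_bytes_for(uint64_t cram_bytes)
{
	uint64_t pages = cram_bytes / 4096; /* 4KiB pages */

	return pages / 8;                   /* 8 pages per byte */
}
```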
Discussion Topics
===
0. Obviously I've included the set as an RFC - please rip it apart.
1. Is N_MEMORY_PRIVATE the right isolation abstraction, or should
this extend ZONE_DEVICE? Prior feedback pushed away from new
ZONE logic, but this will likely be debated further.
My comments on this:
ZONE_DEVICE requires re-implementing every service you want to
provide to your device memory, including basic allocation.
Private nodes use real struct pages with no metadata
limitations, participate in the buddy allocator, and get NUMA
topology for free.
2. Can this subsume ZONE_DEVICE COHERENT users? The architecture
was designed with this in mind, but it is only a thought experiment.
3. Is a dedicated mm/ service (cram) the right place for compressed
memory management, or should this be purely driver-side until
more devices exist?
I wrote it this way because I foresee more "innovation" in the
compressed RAM space given current... uh... "Market Conditions".
I don't see CRAM being CXL-specific, though the only solutions I've
seen have been CXL. Nothing is stopping someone from soldering such
memory directly to a PCB.
4. Where is your hardware-backed data that shows this works?
I should have some by conference time.
Thanks for reading
Gregory (Gourry)
Gregory Price (27):
numa: introduce N_MEMORY_PRIVATE node state
mm,cpuset: gate allocations from N_MEMORY_PRIVATE behind __GFP_PRIVATE
mm/page_alloc: add numa_zone_allowed() and wire it up
mm/page_alloc: Add private node handling to build_zonelists
mm: introduce folio_is_private_managed() unified predicate
mm/mlock: skip mlock for managed-memory folios
mm/madvise: skip madvise for managed-memory folios
mm/ksm: skip KSM for managed-memory folios
mm/khugepaged: skip private node folios when trying to collapse.
mm/swap: add free_folio callback for folio release cleanup
mm/huge_memory.c: add private node folio split notification callback
mm/migrate: NP_OPS_MIGRATION - support private node user migration
mm/mempolicy: NP_OPS_MEMPOLICY - support private node mempolicy
mm/memory-tiers: NP_OPS_DEMOTION - support private node demotion
mm/mprotect: NP_OPS_PROTECT_WRITE - gate PTE/PMD write-upgrades
mm: NP_OPS_RECLAIM - private node reclaim participation
mm/oom: NP_OPS_OOM_ELIGIBLE - private node OOM participation
mm/memory: NP_OPS_NUMA_BALANCING - private node NUMA balancing
mm/compaction: NP_OPS_COMPACTION - private node compaction support
mm/gup: NP_OPS_LONGTERM_PIN - private node longterm pin support
mm/memory-failure: add memory_failure callback to node_private_ops
mm/memory_hotplug: add add_private_memory_driver_managed()
mm/cram: add compressed ram memory management subsystem
cxl/core: Add cxl_sysram region type
cxl/core: Add private node support to cxl_sysram
cxl: add cxl_mempolicy sample PCI driver
cxl: add cxl_compression PCI driver
drivers/base/node.c | 250 +++-
drivers/cxl/Kconfig | 2 +
drivers/cxl/Makefile | 2 +
drivers/cxl/core/Makefile | 1 +
drivers/cxl/core/core.h | 4 +
drivers/cxl/core/port.c | 2 +
drivers/cxl/core/region_sysram.c | 381 ++++++
drivers/cxl/cxl.h | 53 +
drivers/cxl/type3_drivers/Kconfig | 3 +
drivers/cxl/type3_drivers/Makefile | 3 +
.../cxl/type3_drivers/cxl_compression/Kconfig | 20 +
.../type3_drivers/cxl_compression/Makefile | 4 +
.../cxl_compression/compression.c | 1025 +++++++++++++++++
.../cxl/type3_drivers/cxl_mempolicy/Kconfig | 16 +
.../cxl/type3_drivers/cxl_mempolicy/Makefile | 4 +
.../type3_drivers/cxl_mempolicy/mempolicy.c | 297 +++++
include/linux/cpuset.h | 9 -
include/linux/cram.h | 66 ++
include/linux/gfp_types.h | 15 +-
include/linux/memory-tiers.h | 9 +
include/linux/memory_hotplug.h | 11 +
include/linux/migrate.h | 17 +-
include/linux/mm.h | 22 +
include/linux/mmzone.h | 16 +
include/linux/node_private.h | 532 +++++++++
include/linux/nodemask.h | 1 +
include/trace/events/mmflags.h | 4 +-
include/uapi/linux/mempolicy.h | 1 +
kernel/cgroup/cpuset.c | 49 +-
mm/Kconfig | 10 +
mm/Makefile | 1 +
mm/compaction.c | 32 +-
mm/cram.c | 508 ++++++++
mm/damon/paddr.c | 3 +
mm/huge_memory.c | 23 +-
mm/hugetlb.c | 2 +-
mm/internal.h | 226 +++-
mm/khugepaged.c | 7 +-
mm/ksm.c | 9 +-
mm/madvise.c | 5 +-
mm/memory-failure.c | 15 +
mm/memory-tiers.c | 46 +-
mm/memory.c | 26 +
mm/memory_hotplug.c | 122 +-
mm/mempolicy.c | 69 +-
mm/migrate.c | 63 +-
mm/mlock.c | 5 +-
mm/mprotect.c | 4 +-
mm/oom_kill.c | 52 +-
mm/page_alloc.c | 79 +-
mm/rmap.c | 4 +-
mm/slub.c | 3 +-
mm/swap.c | 21 +-
mm/vmscan.c | 55 +-
54 files changed, 4057 insertions(+), 152 deletions(-)
create mode 100644 drivers/cxl/core/region_sysram.c
create mode 100644 drivers/cxl/type3_drivers/Kconfig
create mode 100644 drivers/cxl/type3_drivers/Makefile
create mode 100644 drivers/cxl/type3_drivers/cxl_compression/Kconfig
create mode 100644 drivers/cxl/type3_drivers/cxl_compression/Makefile
create mode 100644 drivers/cxl/type3_drivers/cxl_compression/compression.c
create mode 100644 drivers/cxl/type3_drivers/cxl_mempolicy/Kconfig
create mode 100644 drivers/cxl/type3_drivers/cxl_mempolicy/Makefile
create mode 100644 drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c
create mode 100644 include/linux/cram.h
create mode 100644 include/linux/node_private.h
create mode 100644 mm/cram.c
--
2.53.0
On 2/22/26 08:48, Gregory Price wrote:
> Topic type: MM
>
> Presenter: Gregory Price <gourry@gourry.net>
>
> This series introduces N_MEMORY_PRIVATE, a NUMA node state for memory
> managed by the buddy allocator but excluded from normal allocations.
>
> I present it with an end-to-end Compressed RAM service (mm/cram.c)
> that would otherwise not be possible (or would be considerably more
> difficult, be device-specific, and add to the ZONE_DEVICE boondoggle).
>
>
> TL;DR
> ===
>
> N_MEMORY_PRIVATE is all about isolating NUMA nodes and then punching
> explicit holes in that isolation to do useful things we couldn't do
> before without re-implementing entire portions of mm/ in a driver.
>
>
> /* This is my memory. There are many like it, but this one is mine. */
> rc = add_private_memory_driver_managed(nid, start, size, name, flags,
> online_type, private_context);
>
> page = alloc_pages_node(nid, __GFP_PRIVATE, 0);
Hi Gregory,
I can see the nid param is just a "preferred nid" with alloc_pages.
Using __GFP_PRIVATE will restrict the allocation to private nodes, but I
think the idea here is:
1) I own this node.
2) Do not give me memory from another private node, only from mine.
Shouldn't this be enforced somehow?
> /* Ok but I want to do something useful with it */
> static const struct node_private_ops ops = {
> .migrate_to = my_migrate_to,
> .folio_migrate = my_folio_migrate,
> .flags = NP_OPS_MIGRATION | NP_OPS_MEMPOLICY,
> };
> node_private_set_ops(nid, &ops);
>
> /* And now I can use mempolicy with my memory */
> buf = mmap(...);
> mbind(buf, len, mode, private_node, ...);
> buf[0] = 0xdeadbeef; /* Faults onto private node */
>
> /* And to be clear, no one else gets my memory */
> buf2 = malloc(4096); /* Standard allocation */
> buf2[0] = 0xdeadbeef; /* Can never land on private node */
>
> /* But i can choose to migrate it to the private node */
> move_pages(0, 1, &buf, &private_node, NULL, ...);
>
> /* And more fun things like this */
>
>
> Patchwork
> ===
> A fully working branch based on cxl/next can be found here:
> https://github.com/gourryinverse/linux/tree/private_compression
>
> A QEMU device which can inject high/low interrupts can be found here:
> https://github.com/gourryinverse/qemu/tree/compressed_cxl_clean
>
> The additional patches on these branches are CXL and DAX driver
> housecleaning only tangentially relevant to this RFC, so i've
> omitted them for the sake of trying to keep it somewhat clean
> here. Those patches should (hopefully) be going upstream anyway.
>
> Patches 1-22: Core Private Node Infrastructure
>
> Patch 1: Introduce N_MEMORY_PRIVATE scaffolding
> Patch 2: Introduce __GFP_PRIVATE
> Patch 3: Apply allocation isolation mechanisms
> Patch 4: Add N_MEMORY nodes to private fallback lists
> Patches 5-9: Filter operations not yet supported
> Patch 10: free_folio callback
> Patch 11: split_folio callback
> Patches 12-20: mm/ service opt-ins:
> Migration, Mempolicy, Demotion, Write Protect,
> Reclaim, OOM, NUMA Balancing, Compaction,
> LongTerm Pinning
> Patch 21: memory_failure callback
> Patch 22: Memory hotplug plumbing for private nodes
>
> Patch 23: mm/cram -- Compressed RAM Management
>
> Patches 24-27: CXL Driver examples
> Sysram Regions with Private node support
> Basic Driver Example: (MIGRATION | MEMPOLICY)
> Compression Driver Example (Generic)
>
>
> Background
> ===
>
> Today, drivers that want mm-like services on non-general-purpose
> memory either use ZONE_DEVICE (self-managed memory) or hotplug into
> N_MEMORY and accept the risk of uncontrolled allocation.
>
> Neither option provides what we really want - the ability to:
> 1) selectively participate in mm/ subsystems, while
> 2) isolating that memory from general purpose use.
>
> Some device-attached memory cannot be managed as fully general-purpose
> system RAM. CXL devices with inline compression, for example, may
> corrupt data or crash the machine if the compression ratio drops
> below a threshold -- we simply run out of physical memory.
>
> This is a hard problem to solve: how does an operating system deal
> with a device that basically lies about how much capacity it has?
>
> (We'll discuss that in the CRAM section)
>
>
> Core Proposal: N_MEMORY_PRIVATE
> ===
>
> Introduce N_MEMORY_PRIVATE, a NUMA node state for memory managed by
> the buddy allocator, but excluded from normal allocation paths.
>
> Private nodes:
>
> - Are filtered from zonelist fallback: all existing callers to
> get_page_from_freelist cannot reach these nodes through any
> normal fallback mechanism.
>
> - Filter allocation requests on __GFP_PRIVATE
> numa_zone_allowed() excludes them otherwise.
>
> Applies to systems with and without cpusets.
>
> GFP_PRIVATE is (__GFP_PRIVATE | __GFP_THISNODE).
>
> Services use it when they need to allocate specifically from
> a private node (e.g., CRAM allocating a destination folio).
>
> No existing allocator path sets __GFP_PRIVATE, so private nodes
> are unreachable by default.
>
> - Use standard struct page / folio. No ZONE_DEVICE, no pgmap,
> no struct page metadata limitations.
>
> - Use a node-scoped metadata structure to accomplish filtering
> and callback support.
>
> - May participate in the buddy allocator, reclaim, compaction,
> and LRU like normal memory, gated by an opt-in set of flags.
>
> The key abstraction is node_private_ops: a per-node callback table
> registered by a driver or service.
>
> Each callback is individually gated by an NP_OPS_* capability flag.
>
> A driver opts in only to the mm/ operations it needs.
>
> It is similar to ZONE_DEVICE's pgmap at a node granularity.
>
> In fact...
>
>
> Re-use of ZONE_DEVICE Hooks
> ===
>
> The callback insertion points deliberately mirror existing ZONE_DEVICE
> hooks to minimize the surface area of the mechanism.
>
> I believe this could subsume most DEVICE_COHERENT users, and greatly
> simplify the device-managed memory development process (no more
> per-driver allocator and migration code).
>
> (Also it's just "So Fresh, So Clean").
>
> The base set of callbacks introduced include:
>
> free_folio - mirrors ZONE_DEVICE's
> free_zone_device_page() hook in
> __folio_put() / folios_put_refs()
>
> folio_split - mirrors ZONE_DEVICE's
> called when a huge page is split up
>
> migrate_to - demote_folio_list() custom demotion (same
> site as ZONE_DEVICE demotion rejection)
>
> folio_migrate - called when private node folio is moved to
> another location (e.g. compaction)
>
> handle_fault - mirrors the ZONE_DEVICE fault dispatch in
> handle_pte_fault() (do_wp_page path)
>
> reclaim_policy - called by reclaim to let a driver own the
> boost lifecycle (driver can driver node reclaim)
>
> memory_failure - parallels memory_failure_dev_pagemap(),
> but for online pages that enter the normal
> hwpoison path
>
> At skip sites (mlock, madvise, KSM, user migration), a unified
> folio_is_private_managed() predicate covers both ZONE_DEVICE and
> N_MEMORY_PRIVATE folios, consolidating existing zone_device checks
> with private node checks rather than adding new ones.
>
> static inline bool folio_is_private_managed(struct folio *folio)
> {
> return folio_is_zone_device(folio) ||
> folio_is_private_node(folio);
> }
>
> Most integration points become a one-line swap:
>
> - if (folio_is_zone_device(folio))
> + if (unlikely(folio_is_private_managed(folio)))
>
>
> Where a one-line integration is insufficient, the integration is
> kept as clean as possible with zone_device, rather than simply
> adding more call-sites on top of it:
>
> static inline bool folio_managed_handle_fault(struct folio *folio,
> struct vm_fault *vmf, vm_fault_t *ret)
> {
> /* Zone device pages use swap entries; handled in do_swap_page */
> if (folio_is_zone_device(folio))
> return false;
>
> if (folio_is_private_node(folio)) {
> const struct node_private_ops *ops = folio_node_private_ops(folio);
>
> if (ops && ops->handle_fault) {
> *ret = ops->handle_fault(vmf);
> return true;
> }
> }
> return false;
> }
>
>
>
> Flag-gated behavior (NP_OPS_*) controls:
> ===
>
> We use OPS flags to denote what mm/ services we want to allow on our
> private node. I've plumbed these through so far:
>
> NP_OPS_MIGRATION - Node supports migration
> NP_OPS_MEMPOLICY - Node supports mempolicy actions
> NP_OPS_DEMOTION - Node appears in demotion target lists
> NP_OPS_PROTECT_WRITE - Node memory is read-only (wrprotect)
> NP_OPS_RECLAIM - Node supports reclaim
> NP_OPS_NUMA_BALANCING - Node supports numa balancing
> NP_OPS_COMPACTION - Node supports compaction
> NP_OPS_LONGTERM_PIN - Node supports longterm pinning
> NP_OPS_OOM_ELIGIBLE - (MIGRATION | DEMOTION), node is reachable
> as normal system ram storage, so it should
> be considered in OOM pressure calculations.
>
> I wasn't quite sure how to classify ksm, khugepaged, madvise, and
> mlock - so i have omitted those for now.
>
> Most hooks are straightforward.
>
> Including a node as a demotion-eligible target was as simple as:
>
> static void establish_demotion_targets(void)
> {
> ..... snip .....
> /*
> * Include private nodes that have opted in to demotion
> * via NP_OPS_DEMOTION. A node might have custom migrate
> */
> all_memory = node_states[N_MEMORY];
> for_each_node_state(node, N_MEMORY_PRIVATE) {
> if (node_private_has_flag(node, NP_OPS_DEMOTION))
> node_set(node, all_memory);
> }
> ..... snip .....
> }
>
> The Migration and Mempolicy support are the two most complex pieces,
> and most useful things are built on top of Migration (meaning the
> remaining implementations are usually simple).
>
>
> Private Node Hotplug Lifecycle
> ===
>
> Registration follows a strict order enforced by
> add_private_memory_driver_managed():
>
> 1. Driver calls add_private_memory_driver_managed(nid, start,
> size, resource_name, mhp_flags, online_type, &np).
>
> 2. node_private_register(nid, &np) stores the driver's
> node_private in pgdat and sets pgdat->private. N_MEMORY and
> N_MEMORY_PRIVATE are mutually exclusive -- registration fails
> with -EBUSY if the node already has N_MEMORY set.
>
> Only one driver may register per private node.
>
> 3. Memory is hotplugged via __add_memory_driver_managed().
>
> When online_pages() runs, it checks pgdat->private and sets
> N_MEMORY_PRIVATE instead of N_MEMORY.
>
> Zonelist construction gives private nodes a self-only NOFALLBACK
> list and an N_MEMORY fallback list (so kernel/slab allocations on
> behalf of private node work can fall back to DRAM).
>
> 4. kswapd and kcompactd are NOT started for private nodes. The
> owning service is responsible for driving reclaim if needed
> (e.g., CRAM uses watermark_boost to wake kswapd on demand).
>
> Teardown is the reverse:
>
> 1. Driver calls offline_and_remove_private_memory(nid, start,
> size).
>
> 2. offline_pages() offlines the memory. When the last block is
> offlined, N_MEMORY_PRIVATE is cleared automatically.
>
> 3. node_private_unregister() clears pgdat->node_private and
> drops the refcount. It refuses to unregister (-EBUSY) if
> N_MEMORY_PRIVATE is still set (other memory ranges remain).
>
> The driver is responsible for ensuring memory is hot-unpluggable
> before teardown. The service must ensure all memory is cleaned
> up before hot-unplug - or the service must support migration (so
> memory_hotplug.c can evacuate the memory itself).
>
> In the CRAM example, the service supports migration, so memory
> hot-unplug can remove memory without any special infrastructure.
>
>
> Application: Compressed RAM (mm/cram)
> ===
>
> Compressed RAM has a serious design issue: Its capacity is a lie.
>
> A compression device reports more capacity than it physically has.
> If workloads write faster than the OS can reclaim from the device,
> we run out of real backing store and corrupt data or crash.
>
> I call this problem: "Trying to Out Run A Bear"
>
> I.e. This is only stable as long as we stay ahead of the pressure.
>
> We don't want to design a system where stability depends on outrunning
> a bear - I am slow and do not know where to acquire bear spray.
>
> Fun fact: Grizzly bears have a top-speed of 56-64 km/h.
> Unfun Fact: Humans typically top out at ~24 km/h.
>
> This MVP takes a conservative position:
>
> all compressed memory is mapped read-only.
>
> - Folios reach the private node only via reclaim (demotion)
> - migrate_to implements custom demotion with backpressure.
> - fixup_migration_pte write-protects PTEs on arrival.
> - wrprotect hooks prevent silent upgrades
> - handle_fault promotes folios back to DRAM on write.
> - free_folio scrubs stale data before buddy free.
>
> Because pages are read-only, writes can never cause runaway
> compression ratio loss behind the allocator's back. Every write
> goes through handle_fault, which promotes the folio to DRAM first.
>
> The device only ever sees net compression (demotion in) and explicit
> decompression (promotion out via fault or reclaim), and has a much
> wider timeframe to respond to poor compression scenarios.
>
> That means there's no bear to outrun. The bears are safely asleep in
> their bear den, and even if they show up we have a bear-proof cage.
>
> The backpressure system is our bear-proof cage: the driver reports real
> device utilization (generalized via watermark_boost on the private
> node's zone), and CRAM throttles demotion when capacity is tight.
>
> If compression ratios are bad, we stop demoting pages and start
> evicting pages aggressively.
>
> The service as designed is ~350 functional lines of code because it
> re-uses mm/ services:
>
> - Existing reclaim/vmscan code handles demotion.
> - Existing migration code handles migration to/from.
> - Existing page fault handling dispatches faults.
>
> The driver contains all the CXL nastiness core developers don't want
> anything to do with - No vendor logic touches mm/ internals.
>
>
>
> Future CRAM : Loosening the read-only constraint
> ===
>
> The read-only model is safe but conservative. For workloads where
> compressed pages are occasionally written, the promotion fault adds
> latency. A future optimization could allow a tunable fraction of
> compressed pages to be mapped writable, accepting some risk of
> write-driven decompression in exchange for lower overhead.
>
> The private node ops make this straightforward:
>
> - Adjust fixup_migration_pte to selectively skip
> write-protection.
> - Use the backpressure system to either revoke writable mappings,
> deny additional demotions, or evict when device pressure rises.
>
> This comes at a mild memory overhead: 32MB of DRAM per 1TB of CRAM.
> (1 bit per 4KB page).
>
> This is not proposed here, but it should be somewhat trivial.
>
>
> Discussion Topics
> ===
> 0. Obviously I've included the set as an RFC, please rip it apart.
>
> 1. Is N_MEMORY_PRIVATE the right isolation abstraction, or should
> this extend ZONE_DEVICE? Prior feedback pushed away from new
> ZONE logic, but this will likely be debated further.
>
> My comments on this:
>
> ZONE_DEVICE requires re-implementing every service you want to
> provide to your device memory, including basic allocation.
>
> Private nodes use real struct pages with no metadata
> limitations, participate in the buddy allocator, and get NUMA
> topology for free.
>
> 2. Can this subsume ZONE_DEVICE COHERENT users? The architecture
> was designed with this in mind, but it is only a thought experiment.
>
> 3. Is a dedicated mm/ service (cram) the right place for compressed
> memory management, or should this be purely driver-side until
> more devices exist?
>
> I wrote it this way because I foresee more "innovation" in the
> compressed RAM space given current... uh... "Market Conditions".
>
> I don't see CRAM being CXL-specific, though the only solutions I've
> seen have been CXL. Nothing is stopping someone from soldering such
> memory directly to a PCB.
>
> 4. Where is your hardware-backed data that shows this works?
>
> I should have some by conference time.
>
> Thanks for reading
> Gregory (Gourry)
>
>
> Gregory Price (27):
> numa: introduce N_MEMORY_PRIVATE node state
> mm,cpuset: gate allocations from N_MEMORY_PRIVATE behind __GFP_PRIVATE
> mm/page_alloc: add numa_zone_allowed() and wire it up
> mm/page_alloc: Add private node handling to build_zonelists
> mm: introduce folio_is_private_managed() unified predicate
> mm/mlock: skip mlock for managed-memory folios
> mm/madvise: skip madvise for managed-memory folios
> mm/ksm: skip KSM for managed-memory folios
> mm/khugepaged: skip private node folios when trying to collapse.
> mm/swap: add free_folio callback for folio release cleanup
> mm/huge_memory.c: add private node folio split notification callback
> mm/migrate: NP_OPS_MIGRATION - support private node user migration
> mm/mempolicy: NP_OPS_MEMPOLICY - support private node mempolicy
> mm/memory-tiers: NP_OPS_DEMOTION - support private node demotion
> mm/mprotect: NP_OPS_PROTECT_WRITE - gate PTE/PMD write-upgrades
> mm: NP_OPS_RECLAIM - private node reclaim participation
> mm/oom: NP_OPS_OOM_ELIGIBLE - private node OOM participation
> mm/memory: NP_OPS_NUMA_BALANCING - private node NUMA balancing
> mm/compaction: NP_OPS_COMPACTION - private node compaction support
> mm/gup: NP_OPS_LONGTERM_PIN - private node longterm pin support
> mm/memory-failure: add memory_failure callback to node_private_ops
> mm/memory_hotplug: add add_private_memory_driver_managed()
> mm/cram: add compressed ram memory management subsystem
> cxl/core: Add cxl_sysram region type
> cxl/core: Add private node support to cxl_sysram
> cxl: add cxl_mempolicy sample PCI driver
> cxl: add cxl_compression PCI driver
>
> drivers/base/node.c | 250 +++-
> drivers/cxl/Kconfig | 2 +
> drivers/cxl/Makefile | 2 +
> drivers/cxl/core/Makefile | 1 +
> drivers/cxl/core/core.h | 4 +
> drivers/cxl/core/port.c | 2 +
> drivers/cxl/core/region_sysram.c | 381 ++++++
> drivers/cxl/cxl.h | 53 +
> drivers/cxl/type3_drivers/Kconfig | 3 +
> drivers/cxl/type3_drivers/Makefile | 3 +
> .../cxl/type3_drivers/cxl_compression/Kconfig | 20 +
> .../type3_drivers/cxl_compression/Makefile | 4 +
> .../cxl_compression/compression.c | 1025 +++++++++++++++++
> .../cxl/type3_drivers/cxl_mempolicy/Kconfig | 16 +
> .../cxl/type3_drivers/cxl_mempolicy/Makefile | 4 +
> .../type3_drivers/cxl_mempolicy/mempolicy.c | 297 +++++
> include/linux/cpuset.h | 9 -
> include/linux/cram.h | 66 ++
> include/linux/gfp_types.h | 15 +-
> include/linux/memory-tiers.h | 9 +
> include/linux/memory_hotplug.h | 11 +
> include/linux/migrate.h | 17 +-
> include/linux/mm.h | 22 +
> include/linux/mmzone.h | 16 +
> include/linux/node_private.h | 532 +++++++++
> include/linux/nodemask.h | 1 +
> include/trace/events/mmflags.h | 4 +-
> include/uapi/linux/mempolicy.h | 1 +
> kernel/cgroup/cpuset.c | 49 +-
> mm/Kconfig | 10 +
> mm/Makefile | 1 +
> mm/compaction.c | 32 +-
> mm/cram.c | 508 ++++++++
> mm/damon/paddr.c | 3 +
> mm/huge_memory.c | 23 +-
> mm/hugetlb.c | 2 +-
> mm/internal.h | 226 +++-
> mm/khugepaged.c | 7 +-
> mm/ksm.c | 9 +-
> mm/madvise.c | 5 +-
> mm/memory-failure.c | 15 +
> mm/memory-tiers.c | 46 +-
> mm/memory.c | 26 +
> mm/memory_hotplug.c | 122 +-
> mm/mempolicy.c | 69 +-
> mm/migrate.c | 63 +-
> mm/mlock.c | 5 +-
> mm/mprotect.c | 4 +-
> mm/oom_kill.c | 52 +-
> mm/page_alloc.c | 79 +-
> mm/rmap.c | 4 +-
> mm/slub.c | 3 +-
> mm/swap.c | 21 +-
> mm/vmscan.c | 55 +-
> 54 files changed, 4057 insertions(+), 152 deletions(-)
> create mode 100644 drivers/cxl/core/region_sysram.c
> create mode 100644 drivers/cxl/type3_drivers/Kconfig
> create mode 100644 drivers/cxl/type3_drivers/Makefile
> create mode 100644 drivers/cxl/type3_drivers/cxl_compression/Kconfig
> create mode 100644 drivers/cxl/type3_drivers/cxl_compression/Makefile
> create mode 100644 drivers/cxl/type3_drivers/cxl_compression/compression.c
> create mode 100644 drivers/cxl/type3_drivers/cxl_mempolicy/Kconfig
> create mode 100644 drivers/cxl/type3_drivers/cxl_mempolicy/Makefile
> create mode 100644 drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c
> create mode 100644 include/linux/cram.h
> create mode 100644 include/linux/node_private.h
> create mode 100644 mm/cram.c
>
On Wed, Feb 25, 2026 at 12:40:09PM +0000, Alejandro Lucero Palau wrote:
> > /* This is my memory. There are many like it, but this one is mine. */
> > rc = add_private_memory_driver_managed(nid, start, size, name, flags,
> >                                        online_type, private_context);
> >
> > page = alloc_pages_node(nid, __GFP_PRIVATE, 0);
>
> Hi Gregory,
>
> I can see the nid param is just a "preferred nid" with alloc pages. Using
> __GFP_PRIVATE will restrict the allocation to private nodes but I think
> the idea here is:
>
> 1) I own this node
>
> 2) Do not give me memory from another private node but from mine.
>
> Should this not be ensured somehow?
>

Ah right, I set up GFP_PRIVATE for this:

    #define GFP_PRIVATE (__GFP_PRIVATE | __GFP_THISNODE)

If your service hides the interface to get to this node behind something
it controls, and it doesn't enable things like reclaim/compaction etc.,
then it's responsible for dealing with an out-of-memory situation.

__GFP_THISNODE was insufficient alone to allow isolation since it is used
in a variety of scenarios around the kernel. v2 or v3 explored that.

~Gregory
On 2026-02-22 at 19:48 +1100, Gregory Price <gourry@gourry.net> wrote...
> Topic type: MM
>
> Presenter: Gregory Price <gourry@gourry.net>
>
> This series introduces N_MEMORY_PRIVATE, a NUMA node state for memory
> managed by the buddy allocator but excluded from normal allocations.
>
> I present it with an end-to-end Compressed RAM service (mm/cram.c)
> that would otherwise not be possible (or would be considerably more
> difficult, be device-specific, and add to the ZONE_DEVICE boondoggle).
>
>
> TL;DR
> ===
>
> N_MEMORY_PRIVATE is all about isolating NUMA nodes and then punching
> explicit holes in that isolation to do useful things we couldn't do
> before without re-implementing entire portions of mm/ in a driver.
Having had to re-implement entire portions of mm/ in a driver I agree this isn't
something anyone sane should do :-) However aspects of ZONE_DEVICE were added
precisely to help with that so I'm not sure N_MEMORY_PRIVATE is the only or best
way to do that.
Based on our discussion at LPC I believe one of the primary motivators here was
to re-use the existing mm buddy allocator rather than writing your own. I remain
to be convinced that alone is justification enough for doing all this - DRM for
example already has quite a nice standalone buddy allocator (drm_buddy.c) that
could presumably be used, or adapted for use, by any device driver.
The interesting part of this series (which I have skimmed but not read in
detail) is how device memory gets exposed to userspace - this is something that
existing ZONE_DEVICE implementations don't address, instead leaving it up to
drivers and associated userspace stacks to deal with allocation, migration, etc.
>
>
> /* This is my memory. There are many like it, but this one is mine. */
> rc = add_private_memory_driver_managed(nid, start, size, name, flags,
> online_type, private_context);
>
> page = alloc_pages_node(nid, __GFP_PRIVATE, 0);
>
> /* Ok but I want to do something useful with it */
> static const struct node_private_ops ops = {
> .migrate_to = my_migrate_to,
> .folio_migrate = my_folio_migrate,
> .flags = NP_OPS_MIGRATION | NP_OPS_MEMPOLICY,
> };
> node_private_set_ops(nid, &ops);
>
> /* And now I can use mempolicy with my memory */
> buf = mmap(...);
> mbind(buf, len, mode, private_node, ...);
> buf[0] = 0xdeadbeef; /* Faults onto private node */
>
> /* And to be clear, no one else gets my memory */
> buf2 = malloc(4096); /* Standard allocation */
> buf2[0] = 0xdeadbeef; /* Can never land on private node */
>
> /* But I can choose to migrate it to the private node */
> move_pages(0, 1, &buf, &private_node, NULL, ...);
>
> /* And more fun things like this */
This, I think, is one of the key things that should be enabled - providing a
standard interface to userspace for managing device memory. The existing NUMA
APIs do seem like a reasonable way to do this.
> Patchwork
> ===
> A fully working branch based on cxl/next can be found here:
> https://github.com/gourryinverse/linux/tree/private_compression
>
> A QEMU device which can inject high/low interrupts can be found here:
> https://github.com/gourryinverse/qemu/tree/compressed_cxl_clean
>
> The additional patches on these branches are CXL and DAX driver
> housecleaning only tangentially relevant to this RFC, so I've
> omitted them for the sake of trying to keep it somewhat clean
> here. Those patches should (hopefully) be going upstream anyway.
>
> Patches 1-22: Core Private Node Infrastructure
>
> Patch 1: Introduce N_MEMORY_PRIVATE scaffolding
> Patch 2: Introduce __GFP_PRIVATE
> Patch 3: Apply allocation isolation mechanisms
> Patch 4: Add N_MEMORY nodes to private fallback lists
> Patches 5-9: Filter operations not yet supported
> Patch 10: free_folio callback
> Patch 11: split_folio callback
> Patches 12-20: mm/ service opt-ins:
> Migration, Mempolicy, Demotion, Write Protect,
> Reclaim, OOM, NUMA Balancing, Compaction,
> LongTerm Pinning
> Patch 21: memory_failure callback
> Patch 22: Memory hotplug plumbing for private nodes
>
> Patch 23: mm/cram -- Compressed RAM Management
>
> Patches 24-27: CXL Driver examples
> Sysram Regions with Private node support
> Basic Driver Example: (MIGRATION | MEMPOLICY)
> Compression Driver Example (Generic)
>
>
> Background
> ===
>
> Today, drivers that want mm-like services on non-general-purpose
> memory either use ZONE_DEVICE (self-managed memory) or hotplug into
> N_MEMORY and accept the risk of uncontrolled allocation.
>
> Neither option provides what we really want - the ability to:
> 1) selectively participate in mm/ subsystems, while
> 2) isolating that memory from general purpose use.
>
> Some device-attached memory cannot be managed as fully general-purpose
> system RAM. CXL devices with inline compression, for example, may
> corrupt data or crash the machine if the compression ratio drops
> below a threshold -- we simply run out of physical memory.
>
> This is a hard problem to solve: how does an operating system deal
> with a device that basically lies about how much capacity it has?
>
> (We'll discuss that in the CRAM section)
>
>
> Core Proposal: N_MEMORY_PRIVATE
> ===
>
> Introduce N_MEMORY_PRIVATE, a NUMA node state for memory managed by
> the buddy allocator, but excluded from normal allocation paths.
>
> Private nodes:
>
> - Are filtered from zonelist fallback: all existing callers to
> get_page_from_freelist cannot reach these nodes through any
> normal fallback mechanism.
>
> - Filter allocation requests on __GFP_PRIVATE
> numa_zone_allowed() excludes them otherwise.
>
> Applies to systems with and without cpusets.
>
> GFP_PRIVATE is (__GFP_PRIVATE | __GFP_THISNODE).
>
> Services use it when they need to allocate specifically from
> a private node (e.g., CRAM allocating a destination folio).
>
> No existing allocator path sets __GFP_PRIVATE, so private nodes
> are unreachable by default.
>
> - Use standard struct page / folio. No ZONE_DEVICE, no pgmap,
> no struct page metadata limitations.
>
> - Use a node-scoped metadata structure to accomplish filtering
> and callback support.
>
> - May participate in the buddy allocator, reclaim, compaction,
> and LRU like normal memory, gated by an opt-in set of flags.
>
> The key abstraction is node_private_ops: a per-node callback table
> registered by a driver or service.
>
> Each callback is individually gated by an NP_OPS_* capability flag.
>
> A driver opts in only to the mm/ operations it needs.
>
> It is similar to ZONE_DEVICE's pgmap at a node granularity.
>
> In fact...
>
>
> Re-use of ZONE_DEVICE Hooks
> ===
>
> The callback insertion points deliberately mirror existing ZONE_DEVICE
> hooks to minimize the surface area of the mechanism.
>
> I believe this could subsume most DEVICE_COHERENT users, and greatly
> simplify the device-managed memory development process (no more
> per-driver allocator and migration code).
>
> (Also it's just "So Fresh, So Clean").
>
> The base set of callbacks introduced include:
>
> free_folio - mirrors ZONE_DEVICE's
> free_zone_device_page() hook in
> __folio_put() / folios_put_refs()
>
> folio_split - mirrors ZONE_DEVICE's split handling;
> called when a huge page is split up
>
> migrate_to - demote_folio_list() custom demotion (same
> site as ZONE_DEVICE demotion rejection)
>
> folio_migrate - called when private node folio is moved to
> another location (e.g. compaction)
>
> handle_fault - mirrors the ZONE_DEVICE fault dispatch in
> handle_pte_fault() (do_wp_page path)
>
> reclaim_policy - called by reclaim to let a driver own the
> boost lifecycle (driver can drive node reclaim)
>
> memory_failure - parallels memory_failure_dev_pagemap(),
> but for online pages that enter the normal
> hwpoison path
One does not have to squint too hard to see that the above is not so different
from what ZONE_DEVICE provides today via dev_pagemap_ops(). So I think
it would be worth outlining why the existing ZONE_DEVICE mechanism can't be
extended to provide these kind of services.
This seems to add a bunch of code just to use NODE_DATA instead of page->pgmap,
without really explaining why just extending dev_pagemap_ops wouldn't work. The
obvious reason is that if you want to support things like reclaim, compaction,
etc. these pages need to be on the LRU, which is a little bit hard when that
field is also used by the pgmap pointer for ZONE_DEVICE pages.
But it might be good to explore other options for storing the pgmap - for
example page_ext could be used. Or I hear struct page may go away in place of
folios any day now, so maybe that gives us space for both :-)
> At skip sites (mlock, madvise, KSM, user migration), a unified
> folio_is_private_managed() predicate covers both ZONE_DEVICE and
> N_MEMORY_PRIVATE folios, consolidating existing zone_device checks
> with private node checks rather than adding new ones.
>
> static inline bool folio_is_private_managed(struct folio *folio)
> {
> return folio_is_zone_device(folio) ||
> folio_is_private_node(folio);
> }
>
> Most integration points become a one-line swap:
>
> - if (folio_is_zone_device(folio))
> + if (unlikely(folio_is_private_managed(folio)))
>
>
> Where a one-line integration is insufficient, the integration is
> kept as clean as possible with zone_device, rather than simply
> adding more call-sites on top of it:
>
> static inline bool folio_managed_handle_fault(struct folio *folio,
> struct vm_fault *vmf, vm_fault_t *ret)
> {
> /* Zone device pages use swap entries; handled in do_swap_page */
> if (folio_is_zone_device(folio))
> return false;
>
> if (folio_is_private_node(folio)) {
> const struct node_private_ops *ops = folio_node_private_ops(folio);
>
> if (ops && ops->handle_fault) {
> *ret = ops->handle_fault(vmf);
> return true;
> }
> }
> return false;
> }
>
>
>
> Flag-gated behavior (NP_OPS_*) controls:
> ===
>
> We use OPS flags to denote what mm/ services we want to allow on our
> private node. I've plumbed these through so far:
>
> NP_OPS_MIGRATION - Node supports migration
> NP_OPS_MEMPOLICY - Node supports mempolicy actions
> NP_OPS_DEMOTION - Node appears in demotion target lists
> NP_OPS_PROTECT_WRITE - Node memory is read-only (wrprotect)
> NP_OPS_RECLAIM - Node supports reclaim
> NP_OPS_NUMA_BALANCING - Node supports numa balancing
> NP_OPS_COMPACTION - Node supports compaction
> NP_OPS_LONGTERM_PIN - Node supports longterm pinning
> NP_OPS_OOM_ELIGIBLE - (MIGRATION | DEMOTION), node is reachable
> as normal system ram storage, so it should
> be considered in OOM pressure calculations.
>
> I wasn't quite sure how to classify ksm, khugepaged, madvise, and
> mlock - so I have omitted those for now.
>
> Most hooks are straightforward.
>
> Including a node as a demotion-eligible target was as simple as:
>
> static void establish_demotion_targets(void)
> {
> ..... snip .....
> /*
> * Include private nodes that have opted in to demotion
> * via NP_OPS_DEMOTION. A node might have custom migrate
> */
> all_memory = node_states[N_MEMORY];
> for_each_node_state(node, N_MEMORY_PRIVATE) {
> if (node_private_has_flag(node, NP_OPS_DEMOTION))
> node_set(node, all_memory);
> }
> ..... snip .....
> }
>
> The Migration and Mempolicy support are the two most complex pieces,
> and most useful things are built on top of Migration (meaning the
> remaining implementations are usually simple).
>
>
> Private Node Hotplug Lifecycle
> ===
>
> Registration follows a strict order enforced by
> add_private_memory_driver_managed():
>
> 1. Driver calls add_private_memory_driver_managed(nid, start,
> size, resource_name, mhp_flags, online_type, &np).
>
> 2. node_private_register(nid, &np) stores the driver's
> node_private in pgdat and sets pgdat->private. N_MEMORY and
> N_MEMORY_PRIVATE are mutually exclusive -- registration fails
> with -EBUSY if the node already has N_MEMORY set.
>
> Only one driver may register per private node.
>
> 3. Memory is hotplugged via __add_memory_driver_managed().
>
> When online_pages() runs, it checks pgdat->private and sets
> N_MEMORY_PRIVATE instead of N_MEMORY.
>
> Zonelist construction gives private nodes a self-only NOFALLBACK
> list and an N_MEMORY fallback list (so kernel/slab allocations on
> behalf of private node work can fall back to DRAM).
>
> 4. kswapd and kcompactd are NOT started for private nodes. The
> owning service is responsible for driving reclaim if needed
> (e.g., CRAM uses watermark_boost to wake kswapd on demand).
>
> Teardown is the reverse:
>
> 1. Driver calls offline_and_remove_private_memory(nid, start,
> size).
>
> 2. offline_pages() offlines the memory. When the last block is
> offlined, N_MEMORY_PRIVATE is cleared automatically.
>
> 3. node_private_unregister() clears pgdat->node_private and
> drops the refcount. It refuses to unregister (-EBUSY) if
> N_MEMORY_PRIVATE is still set (other memory ranges remain).
>
> The driver is responsible for ensuring memory is hot-unpluggable
> before teardown. The service must ensure all memory is cleaned
> up before hot-unplug - or the service must support migration (so
> memory_hotplug.c can evacuate the memory itself).
>
> In the CRAM example, the service supports migration, so memory
> hot-unplug can remove memory without any special infrastructure.
The above also looks pretty similar to the existing ZONE_DEVICE methods for
doing this which is another reason to argue for just building up the feature set
of the existing boondoggle rather than adding another thingymebob.
It seems the key thing we are looking for is:
1) A userspace API to allocate/manage device memory (ie. move_pages(), mbind(),
etc.)
2) Allowing reclaim/LRU list processing of device memory.
From my perspective both of these are interesting and I look forward to the
discussion (hopefully I can make it to LSFMM). Mostly I'm interested in the
implementation as this does on the surface seem to sprinkle around and duplicate
a lot of hooks similar to what ZONE_DEVICE already provides.
> Application: Compressed RAM (mm/cram)
> ===
>
> Compressed RAM has a serious design issue: Its capacity is a lie.
>
> A compression device reports more capacity than it physically has.
> If workloads write faster than the OS can reclaim from the device,
> we run out of real backing store and corrupt data or crash.
>
> I call this problem: "Trying to Out Run A Bear"
>
> I.e. This is only stable as long as we stay ahead of the pressure.
>
> We don't want to design a system where stability depends on outrunning
> a bear - I am slow and do not know where to acquire bear spray.
>
> Fun fact: Grizzly bears have a top-speed of 56-64 km/h.
> Unfun Fact: Humans typically top out at ~24 km/h.
>
> This MVP takes a conservative position:
>
> all compressed memory is mapped read-only.
>
> - Folios reach the private node only via reclaim (demotion)
> - migrate_to implements custom demotion with backpressure.
> - fixup_migration_pte write-protects PTEs on arrival.
> - wrprotect hooks prevent silent upgrades
> - handle_fault promotes folios back to DRAM on write.
> - free_folio scrubs stale data before buddy free.
>
> Because pages are read-only, writes can never cause runaway
> compression ratio loss behind the allocator's back. Every write
> goes through handle_fault, which promotes the folio to DRAM first.
>
> The device only ever sees net compression (demotion in) and explicit
> decompression (promotion out via fault or reclaim), and has a much
> wider timeframe to respond to poor compression scenarios.
>
> That means there's no bear to outrun. The bears are safely asleep in
> their bear den, and even if they show up we have a bear-proof cage.
>
> The backpressure system is our bear-proof cage: the driver reports real
> device utilization (generalized via watermark_boost on the private
> node's zone), and CRAM throttles demotion when capacity is tight.
>
> If compression ratios are bad, we stop demoting pages and start
> evicting pages aggressively.
>
> The service as designed is ~350 functional lines of code because it
> re-uses mm/ services:
>
> - Existing reclaim/vmscan code handles demotion.
> - Existing migration code handles migration to/from.
> - Existing page fault handling dispatches faults.
>
> The driver contains all the CXL nastiness core developers don't want
> anything to do with - No vendor logic touches mm/ internals.
>
>
>
> Future CRAM : Loosening the read-only constraint
> ===
>
> The read-only model is safe but conservative. For workloads where
> compressed pages are occasionally written, the promotion fault adds
> latency. A future optimization could allow a tunable fraction of
> compressed pages to be mapped writable, accepting some risk of
> write-driven decompression in exchange for lower overhead.
>
> The private node ops make this straightforward:
>
> - Adjust fixup_migration_pte to selectively skip
> write-protection.
> - Use the backpressure system to either revoke writable mappings,
> deny additional demotions, or evict when device pressure rises.
>
> This comes at a mild memory overhead: 32MB of DRAM per 1TB of CRAM.
> (1 bit per 4KB page).
>
> This is not proposed here, but it should be somewhat trivial.
>
>
> Discussion Topics
> ===
> 0. Obviously I've included the set as an RFC, please rip it apart.
>
> 1. Is N_MEMORY_PRIVATE the right isolation abstraction, or should
> this extend ZONE_DEVICE? Prior feedback pushed away from new
> ZONE logic, but this will likely be debated further.
>
> My comments on this:
>
> ZONE_DEVICE requires re-implementing every service you want to
> provide to your device memory, including basic allocation.
For basic allocation I agree this is the case. But there's no reason some device
allocator library couldn't be written. Or in fact as pointed out above reuse the
already existing one in drm_buddy.c. So would be interested to hear arguments
for why allocation has to be done by the mm allocator and/or why an allocation
library wouldn't work here given DRM already has them.
> Private nodes use real struct pages with no metadata
> limitations, participate in the buddy allocator, and get NUMA
> topology for free.
ZONE_DEVICE pages are in fact real struct pages, but I will concede that
perspective probably depends on which bits of the mm you play in. The real
limitations you seem to be addressing is more around how we get these pages in
an LRU, or are there other limitations?
> 2. Can this subsume ZONE_DEVICE COHERENT users? The architecture
> was designed with this in mind, but it is only a thought experiment.
What I'd like to explore is why ZONE_DEVICE_COHERENT couldn't just be extended
to support your usecase? It seems a couple of extra dev_pagemap_ops and being
able to go on the LRU would get you there.
- Alistair
> 3. Is a dedicated mm/ service (cram) the right place for compressed
> memory management, or should this be purely driver-side until
> more devices exist?
>
> I wrote it this way because I foresee more "innovation" in the
> compressed RAM space given current... uh... "Market Conditions".
>
> I don't see CRAM being CXL-specific, though the only solutions I've
> seen have been CXL. Nothing is stopping someone from soldering such
> memory directly to a PCB.
>
> 4. Where is your hardware-backed data that shows this works?
>
> I should have some by conference time.
>
> Thanks for reading
> Gregory (Gourry)
>
>
> [snip: patch list and diffstat, identical to the copy earlier in the thread]
>
> --
> 2.53.0
>
On Tue, Feb 24, 2026 at 05:19:11PM +1100, Alistair Popple wrote:
> On 2026-02-22 at 19:48 +1100, Gregory Price <gourry@gourry.net> wrote...
>
> Based on our discussion at LPC I believe one of the primary motivators here was
> to re-use the existing mm buddy allocator rather than writing your own. I remain
> to be convinced that alone is justification enough for doing all this - DRM for
> example already has quite a nice standalone buddy allocator (drm_buddy.c) that
> could presumably be used, or adapted for use, by any device driver.
>
> The interesting part of this series (which I have skimmed but not read in
> detail) is how device memory gets exposed to userspace - this is something that
> existing ZONE_DEVICE implementations don't address, instead leaving it up to
> drivers and associated userspace stacks to deal with allocation, migration, etc.
>
I agree that buddy-access alone is insufficient justification - it
started off that way - but if you want mempolicy/NUMA UAPI access,
it turns into "Re-use all of MM" - and that means using the buddy.

I also expected ZONE_DEVICE vs NODE_DATA to be the primary discussion.

I raised replacing it as a thought experiment, not as the proposal.

The idea that drm/ is going to switch to private nodes is outside the
realm of reality, but part of that is because of years of infrastructure
built on the assumption that re-using mm/ is infeasible.
But, let's talk about DEVICE_COHERENT

---

DEVICE_COHERENT is the odd-man out among ZONE_DEVICE modes. The others
use softleaf entries and don't allow direct mappings.

(DEVICE_PRIVATE sort of does if you squint, but you can also view that
a bit like PROT_NONE or read-only controls to force migrations.)
If you take DEVICE_COHERENT and:

- Move pgmap out of the struct page (page_ext, NODE_DATA, etc) to free
  the LRU list_head
- Put pages in the buddy (free lists, watermarks, managed_pages) or add
  pgmap->device_alloc() at every allocation callsite / buddy hook
- Add LRU support (aging, reclaim, compaction)
- Add isolated gating (new GFP flag and adjusted zonelist filtering)
- Add new dev_pagemap_ops callbacks for the various mm/ features
- Audit every folio_is_zone_device() to distinguish zone device modes

... you've built N_MEMORY_PRIVATE inside ZONE_DEVICE. Except now
page_zone(page) returns ZONE_DEVICE - so you inherit the wrong
defaults at every existing ZONE_DEVICE check.

Skip-sites become things to opt-out of instead of opting into.
You just end up with

	if (folio_is_zone_device(folio))
		if (folio_is_my_special_zone_device())
			...
		else
			...

and this just generalizes to

	if (folio_is_private_managed(folio))
		folio_managed_my_hooked_operation()

So you get the same code, but have added more complexity to ZONE_DEVICE.

I don't think that's needed if we just recognize ZONE is the wrong
abstraction to be operating on.
Honestly, even ZONE_MOVABLE becomes pointless with N_MEMORY_PRIVATE
if you disallow longterm pinning - because the managing service handles
allocations (it has to inject GFP_PRIVATE to get access) or selectively
enables the mm/ services it knows are safe (mempolicy).

Even if you allow longterm pinning, if your service controls what does
the pinning it can still be reclaimable - just manually (killing
processes) instead of letting hotplug do it via migration.

If your service only allocates movable pages - your ZONE_NORMAL is
effectively ZONE_MOVABLE.

In some cases we use ZONE_MOVABLE to prevent the kernel from allocating
memory onto devices (like CXL). This means struct page is forced to
take up DRAM or use memmap_on_memory - meaning you lose high-value
capacity or sacrifice contiguity (less huge page support).

This entire problem can evaporate if you can just use ZONE_NORMAL.

There are a lot of benefits to just re-using the buddy like this.

Zones are the wrong abstraction and cause more problems.
> > free_folio - mirrors ZONE_DEVICE's
> > folio_split - mirrors ZONE_DEVICE's
> > migrate_to - ... same as ZONE_DEVICE
> > handle_fault - mirrors the ZONE_DEVICE ...
> > memory_failure - parallels memory_failure_dev_pagemap(),
>
> One does not have to squint too hard to see that the above is not so different
> from what ZONE_DEVICE provides today via dev_pagemap_ops(). So I think I think
> it would be worth outlining why the existing ZONE_DEVICE mechanism can't be
> extended to provide these kind of services.
>
> This seems to add a bunch of code just to use NODE_DATA instead of page->pgmap,
> without really explaining why just extending dev_pagemap_ops wouldn't work. The
> obvious reason is that if you want to support things like reclaim, compaction,
> etc. these pages need to be on the LRU, which is a little bit hard when that
> field is also used by the pgmap pointer for ZONE_DEVICE pages.
>
You don't have to squint because it was deliberate :]

The callback similarity is the feature - they're the same logical
operations. The difference is the direction of the defaults.

Extending ZONE_DEVICE into these areas requires the same set of hooks,
plus distinguishing "old ZONE_DEVICE" from "new ZONE_DEVICE".

Where there are new injection sites, it's because ZONE_DEVICE opts
out of ever touching that code in some other silently implied way.

For example, reclaim/compaction doesn't run because ZONE_DEVICE doesn't
add to managed_pages (among other reasons).

You'd have to go figure out how to hack those things into ZONE_DEVICE
*and then* opt every *other* ZONE_DEVICE mode *back out*.

So you still end up with something like this anyway:
	static inline bool folio_managed_handle_fault(struct folio *folio,
						      struct vm_fault *vmf,
						      enum pgtable_level level,
						      vm_fault_t *ret)
	{
		/* Zone device pages use swap entries; handled in do_swap_page */
		if (folio_is_zone_device(folio))
			return false;

		if (folio_is_private_node(folio))
			...
		return false;
	}
> example page_ext could be used. Or I hear struct page may go away in place of
> folios any day now, so maybe that gives us space for both :-)
>
If NUMA is the interface we want, then NODE_DATA is the right direction
regardless of struct page's future or what zone it lives in.

There's no reason to keep per-page pgmap w/ device-to-node mappings.

You can have one driver manage multiple devices with the same numa node
if it uses the same owner context (PFN already differentiates devices).
The existing code allows for this.
> The above also looks pretty similar to the existing ZONE_DEVICE methods for
> doing this which is another reason to argue for just building up the feature set
> of the existing boondoggle rather than adding another thingymebob.
>
> It seems the key thing we are looking for is:
>
> 1) A userspace API to allocate/manage device memory (ie. move_pages(), mbind(),
> etc.)
>
> 2) Allowing reclaim/LRU list processing of device memory.
>
> From my perspective both of these are interesting and I look forward to the
> discussion (hopefully I can make it to LSFMM). Mostly I'm interested in the
> implementation as this does on the surface seem to sprinkle around and duplicate
> a lot of hooks similar to what ZONE_DEVICE already provides.
>
On (1): ZONE_DEVICE NUMA UAPI is harder than it looks from the surface

Much of the kernel mm/ infrastructure is written on top of the buddy and
expects N_MEMORY to be the sole arbiter of "Where to Acquire Pages".

Mempolicy depends on:

- Buddy support or a new alloc hook around the buddy

- Migration support (mbind() after allocation migrates)
  - Migration also deeply assumes buddy and LRU support

- Changing validations on node states
  - mempolicy checks N_MEMORY membership, so you have to hack
    N_MEMORY onto ZONE_DEVICE
    (or teach it about a new node state... N_MEMORY_PRIVATE)
Getting mempolicy to work with N_MEMORY_PRIVATE amounts to adding 2
lines of code in vma_alloc_folio_noprof:

	struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
					     struct vm_area_struct *vma,
					     unsigned long addr)
	{
		if (pol->flags & MPOL_F_PRIVATE)
			gfp |= __GFP_PRIVATE;

		folio = folio_alloc_mpol_noprof(gfp, order, pol, ilx, numa_node_id());
		/* Woo! I faulted a DEVICE PAGE! */
	}

But this requires the pages to be managed by the buddy.

The rest of the mempolicy support is around keeping sane nodemasks when
things like cpuset.mems rebinds occur and validating you don't end up
with private nodes that don't support mempolicy in your nodemask.

You have to do all of this anyway, but with the added bonus of fighting
with the overloaded nature of ZONE_DEVICE at every step.
==========

On (2): Assume you solve LRU.

Zone Device has no free lists, managed_pages, or watermarks.

kswapd can't run, compaction has no targets, vmscan's pressure model
doesn't function. These all come for free when the pages are
buddy-managed on a real zone. Why re-invent the wheel?

==========

So you really have two options here:

a) Put pages in the buddy, or

b) Add pgmap->device_alloc() callbacks at every allocation site that
   could target a node:
   - vma_alloc_folio
   - alloc_migration_target
   - alloc_demote_folio
   - alloc_pages_node
   - alloc_contig_pages
   - the list goes on

   Or more likely - hooking get_page_from_freelist. Which at that
   point... just use the buddy? You're already deep in the hot path.
>
> For basic allocation I agree this is the case. But there's no reason some device
> allocator library couldn't be written. Or in fact as pointed out above reuse the
> already existing one in drm_buddy.c. So would be interested to hear arguments
> for why allocation has to be done by the mm allocator and/or why an allocation
> library wouldn't work here given DRM already has them.
>
Using the buddy underpins the rest of mm/ services we want to re-use.

That's basically it. Otherwise you have to inject hooks into every
surface that touches the buddy...

... or in the buddy (get_page_from_freelist), at which point why not
just use the buddy?

~Gregory
On 2026-02-25 at 02:17 +1100, Gregory Price <gourry@gourry.net> wrote...
> On Tue, Feb 24, 2026 at 05:19:11PM +1100, Alistair Popple wrote:
> > On 2026-02-22 at 19:48 +1100, Gregory Price <gourry@gourry.net> wrote...
> >
> > Based on our discussion at LPC I believe one of the primary motivators here was
> > to re-use the existing mm buddy allocator rather than writing your own. I remain
> > to be convinced that alone is justification enough for doing all this - DRM for
> > example already has quite a nice standalone buddy allocator (drm_buddy.c) that
> > could presumably be used, or adapted for use, by any device driver.
> >
> > The interesting part of this series (which I have skimmed but not read in
> > detail) is how device memory gets exposed to userspace - this is something that
> > existing ZONE_DEVICE implementations don't address, instead leaving it up to
> > drivers and associated userspace stacks to deal with allocation, migration, etc.
> >
>
> I agree that buddy-access alone is insufficient justification, it
> started off that way - but if you want mempolicy/NUMA UAPI access,
> it turns into "Re-use all of MM" - and that means using the buddy.
>
> I also expected ZONE_DEVICE vs NODE_DATA to be the primary discussion,
>
> I raise replacing it as a thought experiment, but not the proposal.
>
> The idea that drm/ is going to switch to private nodes is outside the
> realm of reality, but part of that is because of years of infrastructure
> built on the assumption that re-using mm/ is infeasible.
>
> But, let's talk about DEVICE_COHERENT
>
> ---
>
> DEVICE_COHERENT is the odd-man out among ZONE_DEVICE modes. The others
> use softleaf entries and don't allow direct mappings.
I think you have this around the wrong way - DEVICE_PRIVATE is the odd one out as
it is the one ZONE_DEVICE page type that uses softleaf entries and doesn't
allow direct mappings. Every other type of ZONE_DEVICE page allows for direct
mappings.
> (DEVICE_PRIVATE sort of does if you squint, but you can also view that
> a bit like PROT_NONE or read-only controls to force migrations).
>
> If you take DEVICE_COHERENT and:
>
> - Move pgmap out of the struct page (page_ext, NODE_DATA, etc) to free
> the LRU list_head
> - Put pages in the buddy (free lists, watermarks, managed_pages) or add
> pgmap->device_alloc() at every allocation callsite / buddy hook
> - Add LRU support (aging, reclaim, compaction)
> - Add isolated gating (new GFP flag and adjusted zonelist filtering)
> - Add new dev_pagemap_ops callbacks for the various mm/ features
> - Audit every folio_is_zone_device() to distinguish zone device modes
>
> ... you've built N_MEMORY_PRIVATE inside ZONE_DEVICE. Except now
> page_zone(page) returns ZONE_DEVICE - so you inherit the wrong
> defaults at every existing ZONE_DEVICE check.
>
> Skip-sites become things to opt-out of instead of opting into.
>
> You just end up with
>
> if (folio_is_zone_device(folio))
> if (folio_is_my_special_zone_device())
> else ....
>
> and this just generalizes to
>
> if (folio_is_private_managed(folio))
> folio_managed_my_hooked_operation()
I don't quite get this - couldn't you just as easily do:

	if (folio_is_zone_device(folio))
		folio_device_my_hooked_operation()

Where folio_device_my_hooked_operation() is just:

	if (pgmap->ops->my_hooked_operation)
		pgmap->ops->my_hooked_operation();
> So you get the same code, but have added more complexity to ZONE_DEVICE.
Don't you still have to add code to hook every operation you care about for your
private managed nodes?
> I don't think that's needed if we just recognize ZONE is the wrong
> abstraction to be operating on.
>
> Honestly, even ZONE_MOVABLE becomes pointless with N_MEMORY_PRIVATE
> if you disallow longterm pinning - because the managing service handles
> allocations (it has to inject GFP_PRIVATE to get access) or selectively
> enables the mm/ services it knows are safe (mempolicy).
>
> Even if you allow longterm pinning, if your service controls what does
> the pinning it can still be reclaimable - just manually (killing
> processes) instead of letting hotplug do it via migration.
>
> If your service only allocates movable pages - your ZONE_NORMAL is
> effectively ZONE_MOVABLE.
This is interesting - it sounds like the conclusion of this is that ZONE_* is
just a bad abstraction and should be replaced with something else, maybe
something like this?

And FWIW I'm not tied to ZONE_DEVICE as being a good abstraction, it's just
what we seem to have today for determining page types. It almost sounds like
what we want is just a bunch of hooks that can be associated with a range of
pages, and then you just get rid of ZONE_DEVICE and instead install hooks
appropriate for each page a driver manages. I have to think more about that
though, this is just what popped into my head when you started saying
ZONE_MOVABLE could also disappear :-)
> In some cases we use ZONE_MOVABLE to prevent the kernel from allocating
> memory onto devices (like CXL). This means struct page is forced to
> take up DRAM or use memmap_on_memory - meaning you lose high-value
> capacity or sacrifice contiguity (less huge page support).
One of the other reasons is to prevent long term pinning. But I think that's a
conversation that warrants a whole separate thread.
> This entire problem can evaporate if you can just use ZONE_NORMAL.
>
> There are a lot of benefits to just re-using the buddy like this.
>
> Zones are the wrong abstraction and cause more problems.
>
> > > free_folio - mirrors ZONE_DEVICE's
> > > folio_split - mirrors ZONE_DEVICE's
> > > migrate_to - ... same as ZONE_DEVICE
> > > handle_fault - mirrors the ZONE_DEVICE ...
> > > memory_failure - parallels memory_failure_dev_pagemap(),
> >
> > One does not have to squint too hard to see that the above is not so different
> > from what ZONE_DEVICE provides today via dev_pagemap_ops(). So I think I think
> > it would be worth outlining why the existing ZONE_DEVICE mechanism can't be
> > extended to provide these kind of services.
> >
> > This seems to add a bunch of code just to use NODE_DATA instead of page->pgmap,
> > without really explaining why just extending dev_pagemap_ops wouldn't work. The
> > obvious reason is that if you want to support things like reclaim, compaction,
> > etc. these pages need to be on the LRU, which is a little bit hard when that
> > field is also used by the pgmap pointer for ZONE_DEVICE pages.
> >
>
> You don't have to squint because it was deliberate :]
Nice.
> The callback similarity is the feature - they're the same logical
> operations. The difference is the direction of the defaults.
>
> Extending ZONE_DEVICE into these areas requires the same set of hooks,
> plus distinguishing "old ZONE_DEVICE" from "new ZONE_DEVICE".
>
> Where there are new injection sites, it's because ZONE_DEVICE opts
> out of ever touching that code in some other silently implied way.
Yeah, I hate that aspect of ZONE_DEVICE. There are far too many places where we
"prove" you can't have a ZONE_DEVICE page because of ad-hoc "reasons". Usually
they take the form of it's not on the LRU, or it's not an anonymous page and
this isn't DAX, etc.
> For example, reclaim/compaction doesn't run because ZONE_DEVICE doesn't
> add to managed_pages (among other reasons).
And people can't even agree on the reasons. I would argue the primary reason is
reclaim/compaction doesn't run because it can't even find the pages due to them
not being on the LRU. But everyone is equally correct.
> You'd have to go figure out how to hack those things into ZONE_DEVICE
> *and then* opt every *other* ZONE_DEVICE mode *back out*.
>
> So you still end up with something like this anyway:
>
> static inline bool folio_managed_handle_fault(struct folio *folio,
> struct vm_fault *vmf,
> enum pgtable_level level,
> vm_fault_t *ret)
> {
> /* Zone device pages use swap entries; handled in do_swap_page */
> if (folio_is_zone_device(folio))
> return false;
>
> if (folio_is_private_node(folio))
> ...
> return false;
> }
>
>
> > example page_ext could be used. Or I hear struct page may go away in place of
> > folios any day now, so maybe that gives us space for both :-)
> >
>
> If NUMA is the interface we want, then NODE_DATA is the right direction
> regardless of struct page's future or what zone it lives in.
>
> There's no reason to keep per-page pgmap w/ device-to-node mappings.
In reality I suspect that's already the case today. I'm not sure we need
per-page pgmap.
> You can have one driver manage multiple devices with the same numa node
> if it uses the same owner context (PFN already differentiates devices).
>
> The existing code allows for this.
>
> > The above also looks pretty similar to the existing ZONE_DEVICE methods for
> > doing this which is another reason to argue for just building up the feature set
> > of the existing boondoggle rather than adding another thingymebob.
> >
> > It seems the key thing we are looking for is:
> >
> > 1) A userspace API to allocate/manage device memory (ie. move_pages(), mbind(),
> > etc.)
> >
> > 2) Allowing reclaim/LRU list processing of device memory.
> >
> > From my perspective both of these are interesting and I look forward to the
> > discussion (hopefully I can make it to LSFMM). Mostly I'm interested in the
> > implementation as this does on the surface seem to sprinkle around and duplicate
> > a lot of hooks similar to what ZONE_DEVICE already provides.
> >
>
> On (1): ZONE_DEVICE NUMA UAPI is harder than it looks from the surface
Ok, I will admit I've only been hovering on the surface so need to give this
some more thought. Everything you've written below makes sense and is definitely
food for thought. Thanks.
- Alistair
> Much of the kernel mm/ infrastructure is written on top of the buddy and
> expects N_MEMORY to be the sole arbiter of "Where to Acquire Pages".
>
> Mempolicy depends on:
> - Buddy support or a new alloc hook around the buddy
>
> - Migration support (mbind() after allocation migrates)
> - Migration also deeply assumes buddy and LRU support
>
> - Changing validations on node states
> - mempolicy checks N_MEMORY membership, so you have to hack
> N_MEMORY onto ZONE_DEVICE
> (or teach it about a new node state... N_MEMORY_PRIVATE)
>
>
> Getting mempolicy to work with N_MEMORY_PRIVATE amounts to adding 2
> lines of code in vma_alloc_folio_noprof:
>
> struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
> struct vm_area_struct *vma,
> unsigned long addr)
> {
> if (pol->flags & MPOL_F_PRIVATE)
> gfp |= __GFP_PRIVATE;
>
> folio = folio_alloc_mpol_noprof(gfp, order, pol, ilx, numa_node_id());
> /* Woo! I faulted a DEVICE PAGE! */
> }
>
> But this requires the pages to be managed by the buddy.
>
> The rest of the mempolicy support is around keeping sane nodemasks when
> things like cpuset.mems rebinds occur and validating you don't end up
> with private nodes that don't support mempolicy in your nodemask.
>
> You have to do all of this anyway, but with the added bonus of fighting
> with the overloaded nature of ZONE_DEVICE at every step.
>
> ==========
>
> On (2): Assume you solve LRU.
>
> Zone Device has no free lists, managed_pages, or watermarks.
>
> kswapd can't run, compaction has no targets, vmscan's pressure model
> doesn't function. These all come for free when the pages are
> buddy-managed on a real zone. Why re-invent the wheel?
>
> ==========
>
> So you really have two options here:
>
> a) Put pages in the buddy, or
>
> b) Add pgmap->device_alloc() callbacks at every allocation site that
> could target a node:
> - vma_alloc_folio
> - alloc_migration_target
> - alloc_demote_folio
> - alloc_pages_node
> - alloc_contig_pages
> - list goes on
>
> Or more likely - hooking get_page_from_freelist. Which at that
> point... just use the buddy? You're already deep in the hot path.
>
> >
> > For basic allocation I agree this is the case. But there's no reason some device
> > allocator library couldn't be written. Or in fact as pointed out above reuse the
> > already existing one in drm_buddy.c. So would be interested to hear arguments
> > for why allocation has to be done by the mm allocator and/or why an allocation
> > library wouldn't work here given DRM already has them.
> >
>
> Using the buddy underpins the rest of mm/ services we want to re-use.
>
> That's basically it. Otherwise you have to inject hooks into every
> surface that touches the buddy...
>
> ... or in the buddy (get_page_from_freelist), at which point why not
> just use the buddy?
>
> ~Gregory
On Thu, Feb 26, 2026 at 02:27:24PM +1100, Alistair Popple wrote:
> On 2026-02-25 at 02:17 +1100, Gregory Price <gourry@gourry.net> wrote...
> >
> > If your service only allocates movable pages - your ZONE_NORMAL is
> > effectively ZONE_MOVABLE.
>
> This is interesting - it sounds like the conclusion of this is ZONE_* is just a
> bad abstraction and should be replaced with something else maybe some like this?
>
> And FWIW I'm not tied to the ZONE_DEVICE as being a good abstraction, it's just
> what we seem to have today for determing page types. It almost sounds like what
> we want is just a bunch of hooks that can be associated with a range of pages,
> and then you just get rid of ZONE_DEVICE and instead install hooks appropriate
> for each page a driver manages. I have to think more about that though, this
> is just what popped into my head when you start saying ZONE_MOVABLE could also
> disappear :-)
>
... snip ...
> >
> > You don't have to squint because it was deliberate :]
>
> Nice.
>
I've had some time to chew on this a bit more.

Adding a node-scope `struct dev_pagemap` produces some interesting
(arguably useful / valuable) effects.

The invariant would be clamping the entire node to ZONE_DEVICE
(more on this below).

So if we think about it this way - we could just view this whole thing
as another variant of ZONE_DEVICE - but without needing the memremap
infrastructure (you can use normal hotplug to achieve it).
0. pgdat->private becomes pgdat->dev_pagemap

   N_MEMORY_PRIVATE -> N_MEMORY_DEVICE ?

   As a start, do a direct conversion and use the existing
   infrastructure. Then expand hooks as needed (and as is reasonable).

   Some of the `struct dev_pagemap {}` fields become dead at the node
   scope, but this is a plumbing issue.

   There's already a similar split between the dev_pagemap and the ops
   structure, so it might map very cleanly.
1. "Clamping the entire node to ZONE_DEVICE"

   When we do this, the *actual* ZONE becomes completely irrelevant.

   The allocation path is entirely controlled, so you might actually end
   up freeing up the folio flags that track the zone:

	static inline enum zone_type memdesc_zonenum(memdesc_flags_t flags)
	{
		ASSERT_EXCLUSIVE_BITS(flags.f, ZONES_MASK << ZONES_PGSHIFT);
		return (flags.f >> ZONES_PGSHIFT) & ZONES_MASK;
	}

   becomes:

	folio_is_zone_device(folio) {
		return node_is_device_node(folio_nid(folio)) ||
		       memdesc_is_zone_device(folio->flags);
	}

   Kind of interesting. You still need these flags for traditional
   ZONE_DEVICE, so you can't evict it completely, but you can start to
   see a path here.
2. One dev_pagemap per node, or multiple w/ pagemap range searching

   Checking membership is always cheap:

	node_is_device_node()

   Getting ops can be cheap if a 1:1 mapping exists:

	pgdat->device_ops->callback()

   Or may be expensive if range-based matching is required:

	node_device_op(folio, ...) {
		ops = node_ops_lookup(folio); /* pfn-range binary search */
		ops->callback(folio, ...)
	}

   pgmap already has an embedded range:

	struct dev_pagemap {
		...
		int nr_range;
		union {
			struct range range;
			DECLARE_FLEX_ARRAY(struct range, ranges);
		};
	};
   Example: Nouveau registers hundreds of pgmap instances that it
   uses to recover driver context for a specific folio. That would
   not scale well. But most other drivers register between 1 and 8,
   which might.

   That means this might actually be an effective way to evict pgmap
   from struct folio / struct page. (Not making this a requirement or
   saying it's reasonable, just an interesting observation.)
3. Some existing drivers with 1 pgmap per driver instance instantly get
   the folio->lru field back - even if they continue to use ZONE_DEVICE.

   At least 3 drivers use page->zone_device_data as a page freelist
   rather than actual per-page data. Those drivers could just start
   using folio/page->lru instead.

   Some store actual per-page zone_device_data that would prevent this,
   but from poking around it seems like it might be feasible.

   Some use the pgmap as a container_of() argument to get driver
   context; that may or may not be supportable out of the box, but it
   seemed like mild refactoring might get them back the use of
   folio->lru.

   None of this is required - the goal is explicitly not disrupting any
   current users of ZONE_DEVICE.
Just some additional food for thought.

As designed now, this would only apply to NUMA systems, meaning you
can't fully evict pgmap from struct page/folio --- but you could
imagine a world where, even in non-NUMA mode, we register a separate
pglist_data specifically for device memory.

~Gregory
On Thu, Feb 26, 2026 at 02:27:24PM +1100, Alistair Popple wrote:
> On 2026-02-25 at 02:17 +1100, Gregory Price <gourry@gourry.net> wrote...
> >
> > DEVICE_COHERENT is the odd-man out among ZONE_DEVICE modes. The others
> > use softleaf entries and don't allow direct mappings.
>
> I think you have this around the wrong way - DEVICE_PRIVATE is the odd one out as
> it is the one ZONE_DEVICE page type that uses softleaf entries and doesn't
> allow direct mappings. Every other type of ZONE_DEVICE page allows for direct
> mappings.
>

Sorry, you are correct. I have trouble keeping the ZONE_DEVICE modes
straight sometimes - all the hook sites have different reasons for
treating the different ZONE_DEVICE modes differently, and it mucks with
my head :[

Device coherent is the one that doesn't allow pinning, but still comes
with all the baggage of not being on the LRU.

Spoke a bit too bluntly here, apologies.

> Don't you still have to add code to hook every operation you care about for your
> private managed nodes?
>

... snip ... below

> > I don't think that's needed if we just recognize ZONE is the wrong
> > abstraction to be operating on.
> >

... snip ... below

> > If your service only allocates movable pages - your ZONE_NORMAL is
> > effectively ZONE_MOVABLE.
>
> This is interesting - it sounds like the conclusion of this is ZONE_* is just a
> bad abstraction and should be replaced with something else maybe some like this?
>

Yeah, I'm not totally married to this being a node, but it makes far
more sense to me than a zone.

ZONE_DEVICE sorta kinda really *wants* to be its own node, but it seems
that "what constitutes a node" was largely informed by ACPI Proximity
Domains. Nothing in the rules says that has to remain the case.

To answer your question above - yes, you still need to add code to hook
the operations - but this is essentially already true of ZONE_DEVICE
(except you also have to contend with the other weird ZONE_DEVICE
situations).
Some of the hooks here are an experiment in what's possible, not what I
think is reasonable (mempolicy is a good example - I don't think
userland should really be doing this anyway... but neat, it works :P)

> And FWIW I'm not tied to the ZONE_DEVICE as being a good abstraction, it's just
> what we seem to have today for determing page types. It almost sounds like what
> we want is just a bunch of hooks that can be associated with a range of pages,
> and then you just get rid of ZONE_DEVICE and instead install hooks appropriate
> for each page a driver manages. I have to think more about that though, this
> is just what popped into my head when you start saying ZONE_MOVABLE could also
> disappear :-)

Yup! Basically ZONE_MOVABLE and CMA and ZONE_DEVICE/COHERENT all try to
do similar things for different reasons. Zones manage to somehow be both
too-broad AND too-narrow.

In my head, we should just be able to plop these things "into the buddy"
and provide hooks that say what's allowed "for those pages". That sounds
like Non-Uniform Memory Access *cough* :P

Heck, I was even playing with adding these nodes *back into* the
fallback lists for some situations. NP_OPS_DIRECT / NP_OPS_FALLBACK
don't require __GFP_PRIVATE, but give me the hooks I want :V

> > Where there are new injection sites, it's because ZONE_DEVICE opts
> > out of ever touching that code in some other silently implied way.
>
> Yeah, I hate that aspect of ZONE_DEVICE. There are far too many places where we
> "prove" you can't have a ZONE_DEVICE page because of ad-hoc "reasons". Usually
> they take the form of it's not on the LRU, or it's not an anonymous page and
> this isn't DAX, etc.
>

It's kinda the opposite of how operating systems do everything else.
Generally we start from a basis of isolation and then poke deliberate
holes, as opposed to trying to patch things up after the fact.
> > If NUMA is the interface we want, then NODE_DATA is the right direction
> > regardless of struct page's future or what zone it lives in.
> >
> > There's no reason to keep per-page pgmap w/ device-to-node mappings.
>
> In reality I suspect that's already the case today. I'm not sure we need
> per-page pgmap.
>

Probably, and maybe there's a good argument for stealing 80-90% of the
common surface here, shunting ZONE_DEVICE to use this instead of pgmap
before we go all the way to private nodes.

cough cough maybe I'll have looked into this by LSFMM cough cough

> > On (1): ZONE_DEVICE NUMA UAPI is harder than it looks from the surface
>
> Ok, I will admit I've only been hovering on the surface so need to give
> this some more thought. Everything you've written below makes sense and
> is definitely food for thought. Thanks.
>

Cheers! Thanks for reading :)

~Gregory
On Thu, Feb 26, 2026 at 12:54:08AM -0500, Gregory Price wrote:
> On Thu, Feb 26, 2026 at 02:27:24PM +1100, Alistair Popple wrote:
> > > If NUMA is the interface we want, then NODE_DATA is the right direction
> > > regardless of struct page's future or what zone it lives in.
> > >
> > > There's no reason to keep per-page pgmap w/ device-to-node mappings.
> >
> > In reality I suspect that's already the case today. I'm not sure we need
> > per-page pgmap.
> >
>
> Probably, and maybe there's a good argument for stealing 80-90% of the
> common surface here, shunting ZONE_DEVICE to use this instead of pgmap
> before we go all the way to private nodes.
>

Out of curiosity I went digging through existing users, and it seems
like the average driver has 1-8 discrete pgmaps, with Nouveau being an
outlier that does ad-hoc registering in 256MB chunks, with the relevant
annoyance being the percpu_ref it uses to track the lifetime of the
pgmap, and the fact that they can be non-contiguous.

tl;dr here: a 1-to-1 mapping of node-to-pgmap isn't realistic for most
existing ZONE_DEVICE users, meaning a 1-op lookup (page->pgmap) turns
into a multi-op pointer chase and range comparison. Not sure that turns
out well for anyone (though only for ZONE_DEVICE / managed node users;
all traditional nodes still have a simple pgdat or page->flag lookup
to check membership).

There's an argument for trying to do this just for the sake of getting
pgmap out of struct page/folio, but this only deals with the problem
on NUMA systems. For non-NUMA systems the pgmap still probably ends up
in folio_ext (assuming we get there), but even that might not be
sufficient to get LRU back. Might need Willy's opinion here.

~Gregory
On Tue, Feb 24, 2026 at 10:17:38AM -0500, Gregory Price wrote:
> On Tue, Feb 24, 2026 at 05:19:11PM +1100, Alistair Popple wrote:
> > On 2026-02-22 at 19:48 +1100, Gregory Price <gourry@gourry.net> wrote...
> >
> > Based on our discussion at LPC I believe one of the primary motivators here was
> > to re-use the existing mm buddy allocator rather than writing your own. I remain
> > to be convinced that alone is justification enough for doing all this - DRM for
> > example already has quite a nice standalone buddy allocator (drm_buddy.c) that
> > could presumably be used, or adapted for use, by any device driver.
> >
> > The interesting part of this series (which I have skimmed but not read in
> > detail) is how device memory gets exposed to userspace - this is something that
> > existing ZONE_DEVICE implementations don't address, instead leaving it up to
> > drivers and associated userspace stacks to deal with allocation, migration, etc.
> >
>
> I agree that buddy-access alone is insufficient justification, it
> started off that way - but if you want mempolicy/NUMA UAPI access,
> it turns into "Re-use all of MM" - and that means using the buddy.
>
> I also expected ZONE_DEVICE vs NODE_DATA to be the primary discussion,
>
> I raise replacing it as a thought experiment, but not the proposal.
>
> The idea that drm/ is going to switch to private nodes is outside the
> realm of reality, but part of that is because of years of infrastructure
> built on the assumption that re-using mm/ is infeasible.
I was about to chime in with essentially the same comment about DRM.
Switching over to core-managed MM is a massive shift and is likely
infeasible, or so extreme that we’d end up throwing away the
existing driver and starting from scratch. At least for Xe, our MM code
is baked into all meaningful components of the driver. It’s also a
unified driver that has to work on iGPU, dGPU over PCIe, dGPU over a
coherent bus once we get there, devices with GPU pagefaults, and devices
without GPU pagefaults. It also has to support both 3D and compute
user-space stacks, etc. So the requirements of what it needs to support
are quite large.
IIRC, Christian once mentioned that AMD was exploring using NUMA and
udma-buf rather than DRM GEMs for MM on coherent-bus devices. I would
think AMDGPU has nearly all the same requirements as Xe, aside from
supporting both 3D and compute stacks, since AMDKFD currently handles
compute. It might be worth getting Christian’s input on this RFC as he
likely has better insight than myself on DRM's future here.
Matt
>
> But, lets talk about DEVICE_COHERENT
>
> ---
>
> DEVICE_COHERENT is the odd-man out among ZONE_DEVICE modes. The others
> use softleaf entries and don't allow direct mappings.
>
> (DEVICE_PRIVATE sort of does if you squint, but you can also view that
> a bit like PROT_NONE or read-only controls to force migrations).
>
> If you take DEVICE_COHERENT and:
>
> - Move pgmap out of the struct page (page_ext, NODE_DATA, etc) to free
> the LRU list_head
> - Put pages in the buddy (free lists, watermarks, managed_pages) or add
> pgmap->device_alloc() at every allocation callsite / buddy hook
> - Add LRU support (aging, reclaim, compaction)
> - Add isolated gating (new GFP flag and adjusted zonelist filtering)
> - Add new dev_pagemap_ops callbacks for the various mm/ features
> - Audit every folio_is_zone_device() to distinguish zone device modes
>
> ... you've built N_MEMORY_PRIVATE inside ZONE_DEVICE. Except now
> page_zone(page) returns ZONE_DEVICE - so you inherit the wrong
> defaults at every existing ZONE_DEVICE check.
>
> Skip-sites become things to opt-out of instead of opting into.
>
> You just end up with
>
> if (folio_is_zone_device(folio))
> if (folio_is_my_special_zone_device())
> else ....
>
> and this just generalizes to
>
> if (folio_is_private_managed(folio))
> folio_managed_my_hooked_operation()
>
> So you get the same code, but have added more complexity to ZONE_DEVICE.
>
> I don't think that's needed if we just recognize ZONE is the wrong
> abstraction to be operating on.
>
> Honestly, even ZONE_MOVABLE becomes pointless with N_MEMORY_PRIVATE
> if you disallow longterm pinning - because the managing service handles
> allocations (it has to inject GFP_PRIVATE to get access) or selectively
> enables the mm/ services it knows are safe (mempolicy).
>
> Even if you allow longterm pinning, if your service controls what does
> the pinning it can still be reclaimable - just manually (killing
> processes) instead of letting hotplug do it via migration.
>
> If your service only allocates movable pages - your ZONE_NORMAL is
> effectively ZONE_MOVABLE.
>
> In some cases we use ZONE_MOVABLE to prevent the kernel from allocating
> memory onto devices (like CXL). This means struct page is forced to
> take up DRAM or use memmap_on_memory - meaning you lose high-value
> capacity or sacrifice contiguity (less huge page support).
>
> This entire problem can evaporate if you can just use ZONE_NORMAL.
>
> There are a lot of benefits to just re-using the buddy like this.
>
> Zones are the wrong abstraction and cause more problems.
>
> > > free_folio - mirrors ZONE_DEVICE's
> > > folio_split - mirrors ZONE_DEVICE's
> > > migrate_to - ... same as ZONE_DEVICE
> > > handle_fault - mirrors the ZONE_DEVICE ...
> > > memory_failure - parallels memory_failure_dev_pagemap(),
> >
> > One does not have to squint too hard to see that the above is not so different
> > from what ZONE_DEVICE provides today via dev_pagemap_ops(). So I think I think
> > it would be worth outlining why the existing ZONE_DEVICE mechanism can't be
> > extended to provide these kind of services.
> >
> > This seems to add a bunch of code just to use NODE_DATA instead of page->pgmap,
> > without really explaining why just extending dev_pagemap_ops wouldn't work. The
> > obvious reason is that if you want to support things like reclaim, compaction,
> > etc. these pages need to be on the LRU, which is a little bit hard when that
> > field is also used by the pgmap pointer for ZONE_DEVICE pages.
> >
>
> You don't have to squint because it was deliberate :]
>
> The callback similarity is the feature - they're the same logical
> operations. The difference is the direction of the defaults.
>
> Extending ZONE_DEVICE into these areas requires the same set of hooks,
> plus distinguishing "old ZONE_DEVICE" from "new ZONE_DEVICE".
>
> Where there are new injection sites, it's because ZONE_DEVICE opts
> out of ever touching that code in some other silently implied way.
>
> For example, reclaim/compaction doesn't run because ZONE_DEVICE doesn't
> add to managed_pages (among other reasons).
>
> You'd have to go figure out how to hack those things into ZONE_DEVICE
> *and then* opt every *other* ZONE_DEVICE mode *back out*.
>
> So you still end up with something like this anyway:
>
> static inline bool folio_managed_handle_fault(struct folio *folio,
> struct vm_fault *vmf,
> enum pgtable_level level,
> vm_fault_t *ret)
> {
> /* Zone device pages use swap entries; handled in do_swap_page */
> if (folio_is_zone_device(folio))
> return false;
>
> if (folio_is_private_node(folio))
> ...
> return false;
> }
>
>
> > example page_ext could be used. Or I hear struct page may go away in place of
> > folios any day now, so maybe that gives us space for both :-)
> >
>
> If NUMA is the interface we want, then NODE_DATA is the right direction
> regardless of struct page's future or what zone it lives in.
>
> There's no reason to keep per-page pgmap w/ device-to-node mappings.
>
> You can have one driver manage multiple devices with the same numa node
> if it uses the same owner context (PFN already differentiates devices).
>
> The existing code allows for this.
>
> > The above also looks pretty similar to the existing ZONE_DEVICE methods for
> > doing this which is another reason to argue for just building up the feature set
> > of the existing boondoggle rather than adding another thingymebob.
> >
> > It seems the key thing we are looking for is:
> >
> > 1) A userspace API to allocate/manage device memory (ie. move_pages(), mbind(),
> > etc.)
> >
> > 2) Allowing reclaim/LRU list processing of device memory.
> >
> > From my perspective both of these are interesting and I look forward to the
> > discussion (hopefully I can make it to LSFMM). Mostly I'm interested in the
> > implementation as this does on the surface seem to sprinkle around and duplicate
> > a lot of hooks similar to what ZONE_DEVICE already provides.
> >
>
> On (1): ZONE_DEVICE NUMA UAPI is harder than it looks from the surface
>
> Much of the kernel mm/ infrastructure is written on top of the buddy and
> expects N_MEMORY to be the sole arbiter of "Where to Acquire Pages".
>
> Mempolicy depends on:
> - Buddy support or a new alloc hook around the buddy
>
> - Migration support (mbind() after allocation migrates)
> - Migration also deeply assumes buddy and LRU support
>
> - Changing validations on node states
> - mempolicy checks N_MEMORY membership, so you have to hack
> N_MEMORY onto ZONE_DEVICE
> (or teach it about a new node state... N_MEMORY_PRIVATE)
>
>
> Getting mempolicy to work with N_MEMORY_PRIVATE amounts to adding 2
> lines of code in vma_alloc_folio_noprof:
>
> struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
> struct vm_area_struct *vma,
> unsigned long addr)
> {
> if (pol->flags & MPOL_F_PRIVATE)
> gfp |= __GFP_PRIVATE;
>
> folio = folio_alloc_mpol_noprof(gfp, order, pol, ilx, numa_node_id());
> /* Woo! I faulted a DEVICE PAGE! */
> }
>
> But this requires the pages to be managed by the buddy.
>
> The rest of the mempolicy support is around keeping sane nodemasks when
> things like cpuset.mems rebinds occur and validating you don't end up
> with private nodes that don't support mempolicy in your nodemask.
>
> You have to do all of this anyway, but with the added bonus of fighting
> with the overloaded nature of ZONE_DEVICE at every step.
>
> ==========
>
> On (2): Assume you solve LRU.
>
> Zone Device has no free lists, managed_pages, or watermarks.
>
> kswapd can't run, compaction has no targets, vmscan's pressure model
> doesn't function. These all come for free when the pages are
> buddy-managed on a real zone. Why re-invent the wheel?
>
> ==========
>
> So you really have two options here:
>
> a) Put pages in the buddy, or
>
> b) Add pgmap->device_alloc() callbacks at every allocation site that
> could target a node:
> - vma_alloc_folio
> - alloc_migration_target
> - alloc_demote_folio
> - alloc_pages_node
> - alloc_contig_pages
> - list goes on
>
> Or more likely - hooking get_page_from_freelist. Which at that
> point... just use the buddy? You're already deep in the hot path.
>
> >
> > For basic allocation I agree this is the case. But there's no reason some device
> > allocator library couldn't be written. Or in fact as pointed out above reuse the
> > already existing one in drm_buddy.c. So would be interested to hear arguments
> > for why allocation has to be done by the mm allocator and/or why an allocation
> > library wouldn't work here given DRM already has them.
> >
>
> Using the buddy underpins the rest of mm/ services we want to re-use.
>
> That's basically it. Otherwise you have to inject hooks into every
> surface that touches the buddy...
>
> ... or in the buddy (get_page_from_freelist), at which point why not
> just use the buddy?
>
> ~Gregory
On Wed, Feb 25, 2026 at 02:21:54PM -0800, Matthew Brost wrote:
> On Tue, Feb 24, 2026 at 10:17:38AM -0500, Gregory Price wrote:
> >
> > The idea that drm/ is going to switch to private nodes is outside the
> > realm of reality, but part of that is because of years of infrastructure
> > built on the assumption that re-using mm/ is infeasible.
>
> I was about to chime in with essentially the same comment about DRM.
> Switching over to core-managed MM is a massive shift and is likely
> infeasible, or so extreme that we’d end up throwing away the
> existing driver and starting from scratch. At least for Xe, our MM code
> is baked into all meaningful components of the driver. It’s also a
> unified driver that has to work on iGPU, dGPU over PCIe, dGPU over a
> coherent bus once we get there, devices with GPU pagefaults, and devices
> without GPU pagefaults. It also has to support both 3D and compute
> user-space stacks, etc. So the requirements of what it needs to support
> are quite large.
>
> IIRC, Christian once mentioned that AMD was exploring using NUMA and
> udma-buf rather than DRM GEMs for MM on coherent-bus devices. I would
> think AMDGPU has nearly all the same requirements as Xe, aside from
> supporting both 3D and compute stacks, since AMDKFD currently handles
> compute. It might be worth getting Christian’s input on this RFC as he
> likely has better insight than myself on DRM's future here.
>

I also think the usage patterns don't quite match (today). GPUs seem
to care very much about specific size allocations, contiguity, how
users get swapped in/out, how reclaim occurs, specific shutdown
procedures - etc.

A private node service just wants to be the arbiter of who can access
the memory, but it may not really care to have extremely deep control
over the actual management of said memory.

Maybe there is a world where GPUs trend in that direction, but it's
certainly not where they are today. But trying to generalize DRM's
infrastructure seems bad.
At best we end up with two mm/ implementations - not good at all.

I do think this fundamentally changes how NUMA gets used by userspace,
but I think userspace should stop reasoning about nodes for memory
placement beyond simple cpu-socket-dram mappings </opinion>.

(using mm/mempolicy.c just makes your code less portable by design)

---

As a side note, this infrastructure is not limited to devices, and I
probably should have pointed this out in the cover. We could create
service-dedicated memory pools directly from DRAM.

Something I was exploring this week: Private-CMA

Hack off a chunk of DRAM at boot, hand it to a driver to hotplug as a
private node in ZONE_NORMAL with MIGRATE_CMA, and add that node as a
valid demotion target. You get:

1) A node of general purpose memory full of (reasonably) cold data
2) Tracked by CMA
3) The CMA is dedicated to a single service
4) And the memory can be pinned for DMA

Right now CMA is somewhat of a free-for-all, and if you have multiple
CMA users you can end up in situations where even CMA fragments.
Splitting up users might be nice - but you need some kind of
delimiting mechanism for that. A node seems just about right.

~Gregory
On Tue, Feb 24, 2026 at 10:17:38AM -0500, Gregory Price wrote:
> - Changing validations on node states
>   - mempolicy checks N_MEMORY membership, so you have to hack
>     N_MEMORY onto ZONE_DEVICE
>     (or teach it about a new node state... N_MEMORY_PRIVATE)
>

This gave me something to chew on.

I think this can be done without introducing N_MEMORY_PRIVATE and just
checking:

    NODE_DATA(target_nid)->private

meaning these nodes can just be N_MEMORY with the same isolations.

I'll look at this a bit more.

~Gregory
On 2/22/26 09:48, Gregory Price wrote:
> Topic type: MM
>
> Presenter: Gregory Price <gourry@gourry.net>
>
> This series introduces N_MEMORY_PRIVATE, a NUMA node state for memory
> managed by the buddy allocator but excluded from normal allocations.
>
> I present it with an end-to-end Compressed RAM service (mm/cram.c)
> that would otherwise not be possible (or would be considerably more
> difficult, be device-specific, and add to the ZONE_DEVICE boondoggle).
>
>
> TL;DR
> ===
>
> N_MEMORY_PRIVATE is all about isolating NUMA nodes and then punching
> explicit holes in that isolation to do useful things we couldn't do
> before without re-implementing entire portions of mm/ in a driver.
>
>
> /* This is my memory. There are many like it, but this one is mine. */
> rc = add_private_memory_driver_managed(nid, start, size, name, flags,
> online_type, private_context);
>
> page = alloc_pages_node(nid, __GFP_PRIVATE, 0);
>
> /* Ok but I want to do something useful with it */
> static const struct node_private_ops ops = {
> .migrate_to = my_migrate_to,
> .folio_migrate = my_folio_migrate,
> .flags = NP_OPS_MIGRATION | NP_OPS_MEMPOLICY,
> };
> node_private_set_ops(nid, &ops);
>
> /* And now I can use mempolicy with my memory */
> buf = mmap(...);
> mbind(buf, len, mode, private_node, ...);
> buf[0] = 0xdeadbeef; /* Faults onto private node */
>
> /* And to be clear, no one else gets my memory */
> buf2 = malloc(4096); /* Standard allocation */
> buf2[0] = 0xdeadbeef; /* Can never land on private node */
>
> /* But i can choose to migrate it to the private node */
> move_pages(0, 1, &buf, &private_node, NULL, ...);
>
> /* And more fun things like this */
>
>
> Patchwork
> ===
> A fully working branch based on cxl/next can be found here:
> https://github.com/gourryinverse/linux/tree/private_compression
>
> A QEMU device which can inject high/low interrupts can be found here:
> https://github.com/gourryinverse/qemu/tree/compressed_cxl_clean
>
> The additional patches on these branches are CXL and DAX driver
> housecleaning only tangentially relevant to this RFC, so i've
> omitted them for the sake of trying to keep it somewhat clean
> here. Those patches should (hopefully) be going upstream anyway.
>
> Patches 1-22: Core Private Node Infrastructure
>
> Patch 1: Introduce N_MEMORY_PRIVATE scaffolding
> Patch 2: Introduce __GFP_PRIVATE
> Patch 3: Apply allocation isolation mechanisms
> Patch 4: Add N_MEMORY nodes to private fallback lists
> Patches 5-9: Filter operations not yet supported
> Patch 10: free_folio callback
> Patch 11: split_folio callback
> Patches 12-20: mm/ service opt-ins:
> Migration, Mempolicy, Demotion, Write Protect,
> Reclaim, OOM, NUMA Balancing, Compaction,
> LongTerm Pinning
> Patch 21: memory_failure callback
> Patch 22: Memory hotplug plumbing for private nodes
>
> Patch 23: mm/cram -- Compressed RAM Management
>
> Patches 24-27: CXL Driver examples
> Sysram Regions with Private node support
> Basic Driver Example: (MIGRATION | MEMPOLICY)
> Compression Driver Example (Generic)
>
>
> Background
> ===
>
> Today, drivers that want mm-like services on non-general-purpose
> memory either use ZONE_DEVICE (self-managed memory) or hotplug into
> N_MEMORY and accept the risk of uncontrolled allocation.
>
> Neither option provides what we really want - the ability to:
> 1) selectively participate in mm/ subsystems, while
> 2) isolating that memory from general purpose use.
>
> Some device-attached memory cannot be managed as fully general-purpose
> system RAM. CXL devices with inline compression, for example, may
> corrupt data or crash the machine if the compression ratio drops
> below a threshold -- we simply run out of physical memory.
>
> This is a hard problem to solve: how does an operating system deal
> with a device that basically lies about how much capacity it has?
>
> (We'll discuss that in the CRAM section)
>
>
> Core Proposal: N_MEMORY_PRIVATE
> ===
>
> Introduce N_MEMORY_PRIVATE, a NUMA node state for memory managed by
> the buddy allocator, but excluded from normal allocation paths.
>
> Private nodes:
>
> - Are filtered from zonelist fallback: all existing callers to
> get_page_from_freelist cannot reach these nodes through any
> normal fallback mechanism.
>
> - Filter allocation requests on __GFP_PRIVATE
> numa_zone_allowed() excludes them otherwise.
>
> Applies to systems with and without cpusets.
>
> GFP_PRIVATE is (__GFP_PRIVATE | __GFP_THISNODE).
>
> Services use it when they need to allocate specifically from
> a private node (e.g., CRAM allocating a destination folio).
>
> No existing allocator path sets __GFP_PRIVATE, so private nodes
> are unreachable by default.
>
> - Use standard struct page / folio. No ZONE_DEVICE, no pgmap,
> no struct page metadata limitations.
>
> - Use a node-scoped metadata structure to accomplish filtering
> and callback support.
>
> - May participate in the buddy allocator, reclaim, compaction,
> and LRU like normal memory, gated by an opt-in set of flags.
>
> The key abstraction is node_private_ops: a per-node callback table
> registered by a driver or service.
>
> Each callback is individually gated by an NP_OPS_* capability flag.
>
> A driver opts in only to the mm/ operations it needs.
>
> It is similar to ZONE_DEVICE's pgmap at a node granularity.
>
> In fact...
>
>
> Re-use of ZONE_DEVICE Hooks
> ===
>
> The callback insertion points deliberately mirror existing ZONE_DEVICE
> hooks to minimize the surface area of the mechanism.
>
> I believe this could subsume most DEVICE_COHERENT users, and greatly
> simplify the device-managed memory development process (no more
> per-driver allocator and migration code).
>
> (Also it's just "So Fresh, So Clean").
>
> The base set of callbacks introduced include:
>
> free_folio - mirrors ZONE_DEVICE's
> free_zone_device_page() hook in
> __folio_put() / folios_put_refs()
>
> folio_split - mirrors ZONE_DEVICE's
> called when a huge page is split up
>
> migrate_to - demote_folio_list() custom demotion (same
> site as ZONE_DEVICE demotion rejection)
>
> folio_migrate - called when private node folio is moved to
> another location (e.g. compaction)
>
> handle_fault - mirrors the ZONE_DEVICE fault dispatch in
> handle_pte_fault() (do_wp_page path)
>
> reclaim_policy - called by reclaim to let a driver own the
> boost lifecycle (driver can drive node reclaim)
>
> memory_failure - parallels memory_failure_dev_pagemap(),
> but for online pages that enter the normal
> hwpoison path
>
> At skip sites (mlock, madvise, KSM, user migration), a unified
> folio_is_private_managed() predicate covers both ZONE_DEVICE and
> N_MEMORY_PRIVATE folios, consolidating existing zone_device checks
> with private node checks rather than adding new ones.
>
> static inline bool folio_is_private_managed(struct folio *folio)
> {
> return folio_is_zone_device(folio) ||
> folio_is_private_node(folio);
> }
>
> Most integration points become a one-line swap:
>
> - if (folio_is_zone_device(folio))
> + if (unlikely(folio_is_private_managed(folio)))
>
>
> Where a one-line integration is insufficient, the integration is
> kept as clean as possible with zone_device, rather than simply
> adding more call-sites on top of it:
>
> static inline bool folio_managed_handle_fault(struct folio *folio,
> struct vm_fault *vmf, vm_fault_t *ret)
> {
> /* Zone device pages use swap entries; handled in do_swap_page */
> if (folio_is_zone_device(folio))
> return false;
>
> if (folio_is_private_node(folio)) {
> const struct node_private_ops *ops = folio_node_private_ops(folio);
>
> if (ops && ops->handle_fault) {
> *ret = ops->handle_fault(vmf);
> return true;
> }
> }
> return false;
> }
>
>
>
> Flag-gated behavior (NP_OPS_*) controls:
> ===
>
> We use OPS flags to denote what mm/ services we want to allow on our
> private node. I've plumbed these through so far:
>
> NP_OPS_MIGRATION - Node supports migration
> NP_OPS_MEMPOLICY - Node supports mempolicy actions
> NP_OPS_DEMOTION - Node appears in demotion target lists
> NP_OPS_PROTECT_WRITE - Node memory is read-only (wrprotect)
> NP_OPS_RECLAIM - Node supports reclaim
> NP_OPS_NUMA_BALANCING - Node supports numa balancing
> NP_OPS_COMPACTION - Node supports compaction
> NP_OPS_LONGTERM_PIN - Node supports longterm pinning
> NP_OPS_OOM_ELIGIBLE - (MIGRATION | DEMOTION), node is reachable
> as normal system ram storage, so it should
> be considered in OOM pressure calculations.
>
> I wasn't quite sure how to classify ksm, khugepaged, madvise, and
> mlock - so i have omitted those for now.
>
> Most hooks are straightforward.
>
> Including a node as a demotion-eligible target was as simple as:
>
> static void establish_demotion_targets(void)
> {
> ..... snip .....
> /*
> * Include private nodes that have opted in to demotion
> * via NP_OPS_DEMOTION. A node might have custom migrate
> */
> all_memory = node_states[N_MEMORY];
> for_each_node_state(node, N_MEMORY_PRIVATE) {
> if (node_private_has_flag(node, NP_OPS_DEMOTION))
> node_set(node, all_memory);
> }
> ..... snip .....
> }
>
> The Migration and Mempolicy support are the two most complex pieces,
> and most useful things are built on top of Migration (meaning the
> remaining implementations are usually simple).
>
>
> Private Node Hotplug Lifecycle
> ===
>
> Registration follows a strict order enforced by
> add_private_memory_driver_managed():
>
> 1. Driver calls add_private_memory_driver_managed(nid, start,
> size, resource_name, mhp_flags, online_type, &np).
>
> 2. node_private_register(nid, &np) stores the driver's
> node_private in pgdat and sets pgdat->private. N_MEMORY and
> N_MEMORY_PRIVATE are mutually exclusive -- registration fails
> with -EBUSY if the node already has N_MEMORY set.
>
> Only one driver may register per private node.
>
> 3. Memory is hotplugged via __add_memory_driver_managed().
>
> When online_pages() runs, it checks pgdat->private and sets
> N_MEMORY_PRIVATE instead of N_MEMORY.
>
> Zonelist construction gives private nodes a self-only NOFALLBACK
> list and an N_MEMORY fallback list (so kernel/slab allocations on
> behalf of private node work can fall back to DRAM).
>
> 4. kswapd and kcompactd are NOT started for private nodes. The
> owning service is responsible for driving reclaim if needed
> (e.g., CRAM uses watermark_boost to wake kswapd on demand).
>
> Teardown is the reverse:
>
> 1. Driver calls offline_and_remove_private_memory(nid, start,
> size).
>
> 2. offline_pages() offlines the memory. When the last block is
> offlined, N_MEMORY_PRIVATE is cleared automatically.
>
> 3. node_private_unregister() clears pgdat->node_private and
> drops the refcount. It refuses to unregister (-EBUSY) if
> N_MEMORY_PRIVATE is still set (other memory ranges remain).
>
> The driver is responsible for ensuring memory is hot-unpluggable
> before teardown. The service must ensure all memory is cleaned
> up before hot-unplug - or the service must support migration (so
> memory_hotplug.c can evacuate the memory itself).
>
> In the CRAM example, the service supports migration, so memory
> hot-unplug can remove memory without any special infrastructure.
>
>
> Application: Compressed RAM (mm/cram)
> ===
>
> Compressed RAM has a serious design issue: its capacity is a lie.
>
> A compression device reports more capacity than it physically has.
> If workloads write faster than the OS can reclaim from the device,
> we run out of real backing store and corrupt data or crash.
>
> I call this problem: "Trying to Out Run A Bear"
>
> I.e. This is only stable as long as we stay ahead of the pressure.
>
> We don't want to design a system where stability depends on outrunning
> a bear - I am slow and do not know where to acquire bear spray.
>
> Fun fact: Grizzly bears have a top-speed of 56-64 km/h.
> Unfun Fact: Humans typically top out at ~24 km/h.
>
> This MVP takes a conservative position:
>
> all compressed memory is mapped read-only.
>
> - Folios reach the private node only via reclaim (demotion)
> - migrate_to implements custom demotion with backpressure.
> - fixup_migration_pte write-protects PTEs on arrival.
> - wrprotect hooks prevent silent upgrades
> - handle_fault promotes folios back to DRAM on write.
> - free_folio scrubs stale data before buddy free.
>
> Because pages are read-only, writes can never cause runaway
> compression ratio loss behind the allocator's back. Every write
> goes through handle_fault, which promotes the folio to DRAM first.
>
> The device only ever sees net compression (demotion in) and explicit
> decompression (promotion out via fault or reclaim), and has a much
> wider timeframe to respond to poor compression scenarios.
>
> That means there's no bear to outrun. The bears are safely asleep in
> their bear den, and even if they show up we have a bear-proof cage.
>
> The backpressure system is our bear-proof cage: the driver reports real
> device utilization (generalized via watermark_boost on the private
> node's zone), and CRAM throttles demotion when capacity is tight.
>
> If compression ratios are bad, we stop demoting pages and start
> evicting pages aggressively.
>
> The service as designed is ~350 functional lines of code because it
> re-uses mm/ services:
>
> - Existing reclaim/vmscan code handles demotion.
> - Existing migration code handles migration to/from.
> - Existing page fault handling dispatches faults.
>
> The driver contains all the CXL nastiness core developers don't want
> anything to do with - No vendor logic touches mm/ internals.
>
>
>
> Future CRAM : Loosening the read-only constraint
> ===
>
> The read-only model is safe but conservative. For workloads where
> compressed pages are occasionally written, the promotion fault adds
> latency. A future optimization could allow a tunable fraction of
> compressed pages to be mapped writable, accepting some risk of
> write-driven decompression in exchange for lower overhead.
>
> The private node ops make this straightforward:
>
> - Adjust fixup_migration_pte to selectively skip
> write-protection.
> - Use the backpressure system to either revoke writable mappings,
> deny additional demotions, or evict when device pressure rises.
>
> This comes at a mild memory overhead: 32MB of DRAM per 1TB of CRAM.
> (1 bit per 4KB page).
>
> This is not proposed here, but it should be somewhat trivial.
>
>
> Discussion Topics
> ===
> 0. Obviously I've included the set as an RFC, please rip it apart.
>
> 1. Is N_MEMORY_PRIVATE the right isolation abstraction, or should
> this extend ZONE_DEVICE? Prior feedback pushed away from new
> ZONE logic, but this will likely be debated further.
>
> My comments on this:
>
> ZONE_DEVICE requires re-implementing every service you want to
> provide to your device memory, including basic allocation.
>
> Private nodes use real struct pages with no metadata
> limitations, participate in the buddy allocator, and get NUMA
> topology for free.
>
> 2. Can this subsume ZONE_DEVICE COHERENT users? The architecture
> was designed with this in mind, but it is only a thought experiment.
>
> 3. Is a dedicated mm/ service (cram) the right place for compressed
> memory management, or should this be purely driver-side until
> more devices exist?
>
> I wrote it this way because I foresee more "innovation" in the
> compressed RAM space given current... uh... "Market Conditions".
>
> I don't see CRAM being CXL-specific, though the only solutions I've
> seen have been CXL. Nothing is stopping someone from soldering such
> memory directly to a PCB.
>
> 4. Where is your hardware-backed data that shows this works?
>
> I should have some by conference time.
>
> Thanks for reading
> Gregory (Gourry)
>
>
> Gregory Price (27):
> numa: introduce N_MEMORY_PRIVATE node state
> mm,cpuset: gate allocations from N_MEMORY_PRIVATE behind __GFP_PRIVATE
> mm/page_alloc: add numa_zone_allowed() and wire it up
> mm/page_alloc: Add private node handling to build_zonelists
> mm: introduce folio_is_private_managed() unified predicate
> mm/mlock: skip mlock for managed-memory folios
> mm/madvise: skip madvise for managed-memory folios
> mm/ksm: skip KSM for managed-memory folios
> mm/khugepaged: skip private node folios when trying to collapse.
> mm/swap: add free_folio callback for folio release cleanup
> mm/huge_memory.c: add private node folio split notification callback
> mm/migrate: NP_OPS_MIGRATION - support private node user migration
> mm/mempolicy: NP_OPS_MEMPOLICY - support private node mempolicy
> mm/memory-tiers: NP_OPS_DEMOTION - support private node demotion
> mm/mprotect: NP_OPS_PROTECT_WRITE - gate PTE/PMD write-upgrades
I'm concerned about adding more special-casing (similar to what we
already added for ZONE_DEVICE) all over the place.
Like the whole folio_managed_() stuff in mprotect.c
Having that said, sounds like a reasonable topic to discuss.
--
Cheers,
David
On Mon, Feb 23, 2026 at 02:07:15PM +0100, David Hildenbrand (Arm) wrote:
> >
> > Gregory Price (27):
> > numa: introduce N_MEMORY_PRIVATE node state
> > mm,cpuset: gate allocations from N_MEMORY_PRIVATE behind __GFP_PRIVATE
> > mm/page_alloc: add numa_zone_allowed() and wire it up
> > mm/page_alloc: Add private node handling to build_zonelists
> > mm: introduce folio_is_private_managed() unified predicate
> > mm/mlock: skip mlock for managed-memory folios
> > mm/madvise: skip madvise for managed-memory folios
> > mm/ksm: skip KSM for managed-memory folios
> > mm/khugepaged: skip private node folios when trying to collapse.
> > mm/swap: add free_folio callback for folio release cleanup
> > mm/huge_memory.c: add private node folio split notification callback
> > mm/migrate: NP_OPS_MIGRATION - support private node user migration
> > mm/mempolicy: NP_OPS_MEMPOLICY - support private node mempolicy
> > mm/memory-tiers: NP_OPS_DEMOTION - support private node demotion
> > mm/mprotect: NP_OPS_PROTECT_WRITE - gate PTE/PMD write-upgrades
>
> I'm concerned about adding more special-casing (similar to what we already
> added for ZONE_DEVICE) all over the place.
>
> Like the whole folio_managed_() stuff in mprotect.c
>
> Having that said, sounds like a reasonable topic to discuss.
>
It's a valid concern - and is why I tried to re-use as many of the
zone_device hooks as possible. It does not seem that zone_device has
quite the same semantics for a case like this, so I had to make
something new.
DEVICE_COHERENT injects a temporary swap entry to allow the device to do
a large atomic operation - then the page table is restored and the CPU
is free to change entries as it pleases.
Another option would be to add the hook to vma_wants_writenotify()
instead of the page table code - and mask MM_CP_TRY_CHANGE_WRITABLE.
This would require adding a vma flag - or maybe a count of protected /
device pages.
int mprotect_fixup() {
	...
	if (vma_wants_manual_pte_write_upgrade(vma))
		mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
}

bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
{
	if (vma->managed_wrprotect)
		return true;
	...
}
That would localize the change in folio_managed_fixup_migration_pte() :
static inline pte_t folio_managed_fixup_migration_pte(struct page *new,
						      pte_t pte,
						      pte_t old_pte,
						      struct vm_area_struct *vma)
{
	...
	} else if (folio_managed_wrprotect(page_folio(new))) {
		pte = pte_wrprotect(pte);
+		atomic_inc(&vma->managed_wrprotect);
	}
	return pte;
}
This would cover both the huge_memory.c and mprotect, and maybe that's
just generally cleaner? I can try that to see if it actually works.
~Gregory
On Mon, Feb 23, 2026 at 09:54:55AM -0500, Gregory Price wrote:
> On Mon, Feb 23, 2026 at 02:07:15PM +0100, David Hildenbrand (Arm) wrote:
> >
> > I'm concerned about adding more special-casing (similar to what we already
> > added for ZONE_DEVICE) all over the place.
> >
> > Like the whole folio_managed_() stuff in mprotect.c
> >
> > Having that said, sounds like a reasonable topic to discuss.
> >
>
> Another option would be to add the hook to vma_wants_writenotify()
> instead of the page table code - and mask MM_CP_TRY_CHANGE_WRITABLE.
>

scratch all this - existing hooks exist for exactly this purpose:

can_change_[pte|pmd]_writable()

Surprised I missed this. I can clean this up to remove it from the
page table walks.

Still valid to question whether we want this, but at least the hook
lives with other write-protect hooks now.

~Gregory
On 2/23/26 17:08, Gregory Price wrote:
> On Mon, Feb 23, 2026 at 09:54:55AM -0500, Gregory Price wrote:
>> On Mon, Feb 23, 2026 at 02:07:15PM +0100, David Hildenbrand (Arm) wrote:
>>>
>>> I'm concerned about adding more special-casing (similar to what we already
>>> added for ZONE_DEVICE) all over the place.
>>>
>>> Like the whole folio_managed_() stuff in mprotect.c
>>>
>>> Having that said, sounds like a reasonable topic to discuss.
>>>
>>
>> Another option would be to add the hook to vma_wants_writenotify()
>> instead of the page table code - and mask MM_CP_TRY_CHANGE_WRITABLE.
>>
>
> scratch all this - existing hooks exist for exactly this purpose:
>
> can_change_[pte|pmd]_writable()
>
> Surprised I missed this.
>
> I can clean this up to remove it from the page table walks.

Sorry for the late reply -- sounds like we can handle this cleaner.

But I am wondering: why is this even required?

Is it just for "Services that intercept write faults (e.g., for
promotion tracking) need PTEs to stay read-only"

But that promotion tracking sounds like some orthogonal work to me. What
am I missing that this is required in this patch set? (is it just for
the special compressed RAM bits?)

--
Cheers,
David
On Tue, Mar 17, 2026 at 02:05:53PM +0100, David Hildenbrand (Arm) wrote:
> On 2/23/26 17:08, Gregory Price wrote:
> > On Mon, Feb 23, 2026 at 09:54:55AM -0500, Gregory Price wrote:
> >> On Mon, Feb 23, 2026 at 02:07:15PM +0100, David Hildenbrand (Arm) wrote:
> >>>
> >>> I'm concerned about adding more special-casing (similar to what we already
> >>> added for ZONE_DEVICE) all over the place.
> >>>
> >>> Like the whole folio_managed_() stuff in mprotect.c
> >>>
> >>> Having that said, sounds like a reasonable topic to discuss.
> >>>
> >>
> >> Another option would be to add the hook to vma_wants_writenotify()
> >> instead of the page table code - and mask MM_CP_TRY_CHANGE_WRITABLE.
> >>
> >
> > scratch all this - existing hooks exist for exactly this purpose:
> >
> > can_change_[pte|pmd]_writable()
> >
> > Surprised I missed this.
> >
> > I can clean this up to remove it from the page table walks.
>
> Sorry for the late reply -- sounds like we can handle this cleaner.
>
> But I am wondering: why is this even required?
>
> Is it just for "Services that intercept write faults (e.g., for
> promotion tracking) need PTEs to stay read-only"
>
> But that promotion tracking sounds like some orthogonal work to me. What
> am I missing that this is required in this patch set? (is it just for
> the special compressed RAM bits?)
>

Yes, this was specific to the compressed ram bits - it allows for a
service to control where/when writes to the device can happen.

In this case, I've limited writes to just the demotion step. (Although
I have since realized I need to not allow file-backed memory to be
demoted.)

There may be a better way to do this, but it may also very well be the
case that such a hook is just a bridge too far and isn't wanted. I
think this debate is warranted.

~Gregory
On 2/22/26 09:48, Gregory Price wrote:
> Topic type: MM
Hi Gregory,
stumbling over this again, some questions whereby I'll just ignore the
compressed RAM bits for now and focus on use cases where promotion etc
are not relevant :)
[...]
>
> TL;DR
> ===
>
> N_MEMORY_PRIVATE is all about isolating NUMA nodes and then punching
> explicit holes in that isolation to do useful things we couldn't do
> before without re-implementing entire portions of mm/ in a driver.
Just to clarify: we don't currently have any mechanism to expose, say,
SPM/PMEM/whatsoever to the buddy allocator through the dax/kmem driver
and *not* have random allocations end up on it, correct?
Assume we online the memory to ZONE_MOVABLE, still other (fallback)
allocations might end up on that memory.
How would we currently handle something like that? (do we have drivers
for that? I'd assume that drivers would only migrate some user memory to
ZONE_DEVICE memory.)
Assuming we don't have such a mechanism, I assume that part of your
proposal would be very interesting: online the memory to a
"special"/"restricted" (you call it private) NUMA node, whereby all
memory of that NUMA node will only be consumable through
mbind() and friends.
Any other allocations (including automatic page migration etc) would not
end up on that memory.
Thinking of some "terribly slow" or "terribly fast" memory that we don't
want to involve in automatic memory tiering, being able to just let
selected workloads consume that memory sounds very helpful.
(wondering if there could be some way allocations might get migrated out
of the node, for example, during memory offlining etc, which might also
not be desirable)
I am not sure if __GFP_PRIVATE etc is really required for that. But some
mechanism to make that work seems extremely helpful.
Because ...
>
>
> /* This is my memory. There are many like it, but this one is mine. */
> rc = add_private_memory_driver_managed(nid, start, size, name, flags,
> online_type, private_context);
>
> page = alloc_pages_node(nid, __GFP_PRIVATE, 0);
>
> /* Ok but I want to do something useful with it */
> static const struct node_private_ops ops = {
> .migrate_to = my_migrate_to,
> .folio_migrate = my_folio_migrate,
> .flags = NP_OPS_MIGRATION | NP_OPS_MEMPOLICY,
> };
> node_private_set_ops(nid, &ops);
>
> /* And now I can use mempolicy with my memory */
> buf = mmap(...);
> mbind(buf, len, mode, private_node, ...);
> buf[0] = 0xdeadbeef; /* Faults onto private node */
... just being able to consume that memory through mbind() and having
guarantees sounds extremely helpful.
[...]
>
>
> Background
> ===
>
> Today, drivers that want mm-like services on non-general-purpose
> memory either use ZONE_DEVICE (self-managed memory) or hotplug into
> N_MEMORY and accept the risk of uncontrolled allocation.
>
> Neither option provides what we really want - the ability to:
> 1) selectively participate in mm/ subsystems, while
> 2) isolating that memory from general purpose use.
>
> Some device-attached memory cannot be managed as fully general-purpose
> system RAM. CXL devices with inline compression, for example, may
> corrupt data or crash the machine if the compression ratio drops
> below a threshold -- we simply run out of physical memory.
>
> This is a hard problem to solve: how does an operating system deal
> with a device that basically lies about how much capacity it has?
>
> (We'll discuss that in the CRAM section)
>
>
> Core Proposal: N_MEMORY_PRIVATE
> ===
>
> Introduce N_MEMORY_PRIVATE, a NUMA node state for memory managed by
> the buddy allocator, but excluded from normal allocation paths.
>
> Private nodes:
>
> - Are filtered from zonelist fallback: all existing callers to
> get_page_from_freelist cannot reach these nodes through any
> normal fallback mechanism.
Good.
>
> - Filter allocation requests on __GFP_PRIVATE
> numa_zone_allowed() excludes them otherwise.
I think we discussed that in the past, but why can't we find a way that
only people requesting __GFP_THISNODE could allocate that memory, for
example? I guess we'd have to remove it from all "default NUMA bitmaps"
somehow.
>
> Applies to systems with and without cpusets.
>
> GFP_PRIVATE is (__GFP_PRIVATE | __GFP_THISNODE).
>
> Services use it when they need to allocate specifically from
> a private node (e.g., CRAM allocating a destination folio).
>
> No existing allocator path sets __GFP_PRIVATE, so private nodes
> are unreachable by default.
>
> - Use standard struct page / folio. No ZONE_DEVICE, no pgmap,
> no struct page metadata limitations.
Good.
>
> - Use a node-scoped metadata structure to accomplish filtering
> and callback support.
>
> - May participate in the buddy allocator, reclaim, compaction,
> and LRU like normal memory, gated by an opt-in set of flags.
>
> The key abstraction is node_private_ops: a per-node callback table
> registered by a driver or service.
>
> Each callback is individually gated by an NP_OPS_* capability flag.
>
> A driver opts in only to the mm/ operations it needs.
>
> It is similar to ZONE_DEVICE's pgmap at a node granularity.
>
> In fact...
>
>
> Re-use of ZONE_DEVICE Hooks
> ===
I think all of that might not be required for the simplistic use case I
mentioned above (fast/slow memory only to be consumed by selected user
space that opts in through mbind() and friends).
Or are there other use cases for these callbacks
[...]
>
>
> Flag-gated behavior (NP_OPS_*) controls:
> ===
>
> We use OPS flags to denote what mm/ services we want to allow on our
> private node. I've plumbed these through so far:
>
> NP_OPS_MIGRATION - Node supports migration
> NP_OPS_MEMPOLICY - Node supports mempolicy actions
> NP_OPS_DEMOTION - Node appears in demotion target lists
> NP_OPS_PROTECT_WRITE - Node memory is read-only (wrprotect)
> NP_OPS_RECLAIM - Node supports reclaim
> NP_OPS_NUMA_BALANCING - Node supports numa balancing
> NP_OPS_COMPACTION - Node supports compaction
> NP_OPS_LONGTERM_PIN - Node supports longterm pinning
> NP_OPS_OOM_ELIGIBLE - (MIGRATION | DEMOTION), node is reachable
> as normal system ram storage, so it should
> be considered in OOM pressure calculations.
I have to think about all that, and whether that would be required as a
first step. I'd assume in a simplistic use case mentioned above we might
only forbid the memory to be used as a fallback for any oom etc.
Whether reclaim (e.g., swapout) makes sense is a good question.
--
Cheers,
David
On Tue, Mar 17, 2026 at 02:25:29PM +0100, David Hildenbrand (Arm) wrote:
> On 2/22/26 09:48, Gregory Price wrote:
> > Topic type: MM
>
> Hi Gregory,
>
> stumbling over this again, some questions whereby I'll just ignore the
> compressed RAM bits for now and focus on use cases where promotion etc
> are not relevant :)
A more concrete example up your alley:
I've since been playing with a virtio-net private node.
Normally cloud-hypervisor VMs with virtio-net can't be subject to KSM
because the entire boot region gets marked shared. If virtio-net has
its own private node / region separate from the boot region, the boot
region is now free to be subject to KSM.
I may have that up as an example sometime before LSF, but I need to
clean up some networking stack hacks I've made to make it work.
> >
> > N_MEMORY_PRIVATE is all about isolating NUMA nodes and then punching
> > explicit holes in that isolation to do useful things we couldn't do
> > before without re-implementing entire portions of mm/ in a driver.
>
> Just to clarify: we don't currently have any mechanism to expose, say,
> SPM/PMEM/whatsoever to the buddy allocator through the dax/kmem driver
> and *not* have random allocations end up on it, correct?
>
> Assume we online the memory to ZONE_MOVABLE, still other (fallback)
> allocations might end up on that memory.
>
Correct, when you hotplug memory into a node, it's a free for all.
Fallbacks are going to happen.
I see you saw below that one of the extensions is removing the nodes
from the fallback list. That is part one, but it's insufficient to
prevent complete leakage (someone might iterate over the nodes-possible
list and try migrating memory).
> How would we currently handle something like that? (do we have drivers
> for that? I'd assume that drivers would only migrate some user memory to
> ZONE_DEVICE memory.)
>
> Assuming we don't have such a mechanism, I assume that part of your
> proposal would be very interesting: online the memory to a
> "special"/"restricted" (you call it private) NUMA node, whereby all
> memory of that NUMA node will only be consumable through
> mbind() and friends.
>
Basically the only isolation mechanism we have today is ZONE_DEVICE.
Either via mbind and friends, or even just the driver itself managing it
directly via alloc_pages_node() and exposing some userland interface.
You can imagine a network driver providing an ioctl for a shared buffer
or a driver exposing a mmap'able file descriptor as the trivial case.
> Any other allocations (including automatic page migration etc) would not
> end up on that memory.
One of the complications of exposing this memory via mbind is that
mempolicy.c has a lot of migration mechanics, just to name two:
- migrate on mbind
- cpuset rebinds
So for a complete solution you need to support migration if you
support mempolicy. But with the callbacks, you can control how/when
migration occurs.
tl;dr: many of mm/'s services are actually predicated on migration
support, so you have to manage that somehow.
>
> Thinking of some "terribly slow" or "terribly fast" memory that we don't
> want to involve in automatic memory tiering, being able to just let
> selected workloads consume that memory sounds very helpful.
>
>
> (wondering if there could be some way allocations might get migrated out
> of the node, for example, during memory offlining etc, which might also
> not be desirable)
>
in the NP_OPS_MIGRATION patch, this gets covered.
I'm not sure the NP_OPS_* pattern is what we actually want, it's just
what i came up with to make it clear what's being enabled.
Basically without NP_OPS_MIGRATION, this memory is completely
non-migratable. The driver managing it therefore needs to control the
lifetime, and if hotplug is requested - kill anyone using it (which by
definition should not be the kernel) and either release the pages or take
them so they can be released while hotplug is spinning.
> I am not sure if __GFP_PRIVATE etc is really required for that. But some
> mechanism to make that work seems extremely helpful.
>
> Because ...
>
> > /* And now I can use mempolicy with my memory */
> > buf = mmap(...);
> > mbind(buf, len, mode, private_node, ...);
> > buf[0] = 0xdeadbeef; /* Faults onto private node */
>
> ... just being able to consume that memory through mbind() and having
> guarantees sounds extremely helpful.
>
Yes! :]
> >
> > - Filter allocation requests on __GFP_PRIVATE
> > numa_zone_allowed() excludes them otherwise.
>
> I think we discussed that in the past, but why can't we find a way that
> only people requesting __GFP_THISNODE could allocate that memory, for
> example? I guess we'd have to remove it from all "default NUMA bitmaps"
> somehow.
>
I experimented with this. There were two concerns:
1) as you note, removing it from the default bitmaps, which is actually
hard. You can't remove it from the possible-node bitmap, so that
just seemed non-tractable.
2) __GFP_THISNODE actually means (among other things) "don't fallback".
And, in fact, there are some hotplug-time allocations that occur in
SLAB (pglist_data) that target the private node that *must* fallback
to successfully allocate for successful kernel operation.
So separating PRIVATE from THISNODE and allowing some use of fallback
mechanics resolves some problems here.
I think #2 is a solvable problem, but #1 i don't think can be addressed.
I need to investigate the slab interactions a little more.
> > - Use standard struct page / folio. No ZONE_DEVICE, no pgmap,
> > no struct page metadata limitations.
>
> Good.
Note: I've actually since explored merging this with pgmap, and
rebranding it as node-scope pgmap.
In that sense, you could think of this as NODE_DEVICE instead of
NODE_PRIVATE - but maybe I'm inviting too much baggage :]
> >
> > Re-use of ZONE_DEVICE Hooks
> > ===
>
> I think all of that might not be required for the simplistic use case I
> mentioned above (fast/slow memory only to be consumed by selected user
> space that opts in through mbind() and friends).
>
> Or are there other use cases for these callbacks
>
Many `folio_is_zone_device()` hooks result in the operations being
a no-op / failing. We need all those same hooks.
Some hooks I added - such as the migration hooks - are combined with
the zone_device hooks via a helper, to demonstrate the pattern is the
same when the memory is opted into migration.
I do not think all of these hooks are required; I would think of this
more as an exploration of the whole space, and then we can throw out
what does not have an active use case.
For the compressed ram component I've been designing, the needs are:
- Migration
- Reclaim
- Demotion
- Write Protect (maybe, possibly optional)
But you could argue another user might want the same device to have:
- Migration
- Mempolicy
Where they manage things from userland, rather than via reclaim.
The flexibility is kind of the point :]
> [...]
> >
> >
> > Flag-gated behavior (NP_OPS_*) controls:
> > ===
> >
> > We use OPS flags to denote what mm/ services we want to allow on our
> > private node. I've plumbed these through so far:
> >
> > NP_OPS_MIGRATION - Node supports migration
> > NP_OPS_MEMPOLICY - Node supports mempolicy actions
> > NP_OPS_DEMOTION - Node appears in demotion target lists
> > NP_OPS_PROTECT_WRITE - Node memory is read-only (wrprotect)
> > NP_OPS_RECLAIM - Node supports reclaim
> > NP_OPS_NUMA_BALANCING - Node supports numa balancing
> > NP_OPS_COMPACTION - Node supports compaction
> > NP_OPS_LONGTERM_PIN - Node supports longterm pinning
> > NP_OPS_OOM_ELIGIBLE - (MIGRATION | DEMOTION), node is reachable
> > as normal system ram storage, so it should
> > be considered in OOM pressure calculations.
>
> I have to think about all that, and whether that would be required as a
> first step. I'd assume in a simplistic use case mentioned above we might
> only forbid the memory to be used as a fallback for any oom etc.
>
> Whether reclaim (e.g., swapout) makes sense is a good question.
>
I would simply state: "That depends on the memory device"
Which is kind of the point. The ability to isolate and poke holes in
that isolation explicitly, while using the same mm/ code, creates a new
design space we haven't had before.
---
I think it would be fair to say all of these would not be required for
an MVP interface, and should require a use case to merge. But the code
is here because I wanted to explore just how far it can go.
In fact, I believe I have gotten to the point where I could add:
NP_OPS_FALLBACK_NODE - re-add the node to the fallback list
do not require __GFP_PRIVATE for allocation
Which would require all of the other bits to be turned on.
The result of this is essentially a numa node with otherwise normal
memory, but for which a driver gets callbacks on certain operations
(migration, free, etc). That ALSO seems useful.
It's... an interesting result of the whole exploration.
~Gregory
On 3/19/26 16:09, Gregory Price wrote:
> On Tue, Mar 17, 2026 at 02:25:29PM +0100, David Hildenbrand (Arm) wrote:
>> On 2/22/26 09:48, Gregory Price wrote:
>>> Topic type: MM
>>
>> Hi Gregory,
>>
>> stumbling over this again, some questions whereby I'll just ignore the
>> compressed RAM bits for now and focus on use cases where promotion etc
>> are not relevant :)
>
> A more concrete example up your alley:
>
> I've since been playing with a virtio-net private node.
>
> Normally cloud-hypervisor VMs with virtio-net can't be subject to KSM
> because the entire boot region gets marked shared.

What exactly do you mean with "mark shared". Do you mean, that "shared
memory" is used in the hypervisor for all boot memory?

> If virtio-net has
> its own private node / region separate from the boot region, the boot
> region is now free to be subject to KSM.

You mean, in the VM, memory usable by virtio-net can only be consumed
from a dedicated physical memory region, and that region would be a
separate node?

>
> I may have that up as an example sometime before LSF, but i need to
> clean up some networking stack hacks i've made to make it work.
>
>>>
>>> N_MEMORY_PRIVATE is all about isolating NUMA nodes and then punching
>>> explicit holes in that isolation to do useful things we couldn't do
>>> before without re-implementing entire portions of mm/ in a driver.
>>
>> Just to clarify: we don't currently have any mechanism to expose, say,
>> SPM/PMEM/whatsoever to the buddy allocator through the dax/kmem driver
>> and *not* have random allocations end up on it, correct?
>>
>> Assume we online the memory to ZONE_MOVABLE, still other (fallback)
>> allocations might end up on that memory.
>>
>
> Correct, when you hotplug memory into a node, it's a free for all.
> Fallbacks are going to happen.

Right, and I agree that having a mechanism to prevent that is
reasonable.

>
> I see you saw below that one of the extensions is removing the nodes
> from the fallback list. That is part one, but it's insufficient to
> prevent complete leakage (someone might iterate over the nodes-possible
> list and try migrating memory).

Which code would do that?

>
>> How would we currently handle something like that? (do we have drivers
>> for that? I'd assume that drivers would only migrate some user memory to
>> ZONE_DEVICE memory.)
>>
>> Assuming we don't have such a mechanism, I assume that part of your
>> proposal would be very interesting: online the memory to a
>> "special"/"restricted" (you call it private) NUMA node, whereby all
>> memory of that NUMA node will only be consumable through
>> mbind() and friends.
>>
>
> Basically the only isolation mechanism we have today is ZONE_DEVICE.
>
> Either via mbind and friends, or even just the driver itself managing it
> directly via alloc_pages_node() and exposing some userland interface.

Would mbind() work here? I thought mbind() would not suddenly give
access to some ZONE_DEVICE memory.

>
> You can imagine a network driver providing an ioctl for a shared buffer
> or a driver exposing a mmap'able file descriptor as the trivial case.

Right.

>
>> Any other allocations (including automatic page migration etc) would not
>> end up on that memory.
>
> One of the complications of exposing this memory via mbind is that
> mempolicy.c has a lot of migration mechanics, just to name two:
>
> - migrate on mbind
> - cpuset rebinds
>
> So for a complete solution you need to support migration if you
> support mempolicy. But with the callbacks, you can control how/when
> migration occurs.
>
> tl;dr: many of mm/'s services are actually predicated on migration
> support, so you have to manage that somehow.

Agreed.

>
>>
>> Thinking of some "terribly slow" or "terribly fast" memory that we don't
>> want to involve in automatic memory tiering, being able to just let
>> selected workloads consume that memory sounds very helpful.
>>
>>
>> (wondering if there could be some way allocations might get migrated out
>> of the node, for example, during memory offlining etc, which might also
>> not be desirable)
>>
>
> in the NP_OPS_MIGRATION patch, this gets covered.

Right, but I am not sure if NP_OPS_MIGRATION is really the right
approach for that. Have to think about that.

>
> I'm not sure the NP_OPS_* pattern is what we actually want, it's just
> what i came up with to make it clear what's being enabled.
>
> Basically without NP_OPS_MIGRATION, this memory is completely
> non-migratable. The driver managing it therefore needs to control the
> lifetime, and if hotplug is requested - kill anyone using it (which by
> definition should not be the kernel) and either release the pages or
> take them so they can be released while hotplug is spinning.
>
>> I am not sure if __GFP_PRIVATE etc is really required for that. But some
>> mechanism to make that work seems extremely helpful.
>>
>> Because ...
>>
>>> /* And now I can use mempolicy with my memory */
>>> buf = mmap(...);
>>> mbind(buf, len, mode, private_node, ...);
>>> buf[0] = 0xdeadbeef; /* Faults onto private node */
>>
>> ... just being able to consume that memory through mbind() and having
>> guarantees sounds extremely helpful.
>>
>
> Yes! :]
>
>>>
>>> - Filter allocation requests on __GFP_PRIVATE
>>> numa_zone_allowed() excludes them otherwise.
>>
>> I think we discussed that in the past, but why can't we find a way that
>> only people requesting __GFP_THISNODE could allocate that memory, for
>> example? I guess we'd have to remove it from all "default NUMA bitmaps"
>> somehow.
>>
>
> I experimented with this. There were two concerns:
>
> 1) as you note, removing it from the default bitmaps, which is actually
> hard. You can't remove it from the possible-node bitmap, so that
> just seemed non-tractable.

What about making people use a different set of bitmaps here? Quite
some work, but maybe that's the right direction given that we'll now
treat some nodes differently.

>
> 2) __GFP_THISNODE actually means (among other things) "don't fallback".
> And, in fact, there are some hotplug-time allocations that occur in
> SLAB (pglist_data) that target the private node that *must* fallback
> to successfully allocate for successful kernel operation.

Can you point me at the code?

>
> So separating PRIVATE from THISNODE and allowing some use of fallback
> mechanics resolves some problems here.
>
> I think #2 is a solvable problem, but #1 i don't think can be addressed.
> I need to investigate the slab interactions a little more.

I'll also have to think about this some more.

>
>>> - Use standard struct page / folio. No ZONE_DEVICE, no pgmap,
>>> no struct page metadata limitations.
>>
>> Good.
>
> Note: I've actually since explored merging this with pgmap, and
> rebranding it as node-scope pgmap.
>
> In that sense, you could think of this as NODE_DEVICE instead of
> NODE_PRIVATE - but maybe I'm inviting too much baggage :]

:) NODE_DEVICE sounds interesting though.

>
>>>
>>> Re-use of ZONE_DEVICE Hooks
>>> ===
>>
>> I think all of that might not be required for the simplistic use case I
>> mentioned above (fast/slow memory only to be consumed by selected user
>> space that opts in through mbind() and friends).
>>
>> Or are there other use cases for these callbacks
>>
>
> Many `folio_is_zone_device()` hooks result in the operations being
> a no-op / failing. We need all those same hooks.
>
> Some hooks I added - such as migration hooks, are combined with the
> zone_device hooks via a helper to demonstrate the pattern is the same
> when the memory is opted into migration.
>
> I do not think all of these hooks are required, I would think of this
> more as an exploration of the whole space, and then we can throw out
> what does not have an active use case.
>
> For the compressed ram component I've been designing, the needs are:
>
> - Migration
> - Reclaim
> - Demotion
> - Write Protect (maybe, possibly optional)
>
> But you could argue another user might want the same device to have:
> - Migration
> - Mempolicy
>
> Where they manage things from userland, rather than via reclaim.
>
> The flexibility is kind of the point :]

Yeah, but it would be interesting which minimal support we would need to
just let some special memory be managed by the kernel, allowing mbind()
users to use it, but not have any other fallback allocations end up on
it.

Something very basic, on which we could build additional functionality.

>
>> [...]
>>>
>>>
>>> Flag-gated behavior (NP_OPS_*) controls:
>>> ===
>>>
>>> We use OPS flags to denote what mm/ services we want to allow on our
>>> private node. I've plumbed these through so far:
>>>
>>> NP_OPS_MIGRATION - Node supports migration
>>> NP_OPS_MEMPOLICY - Node supports mempolicy actions
>>> NP_OPS_DEMOTION - Node appears in demotion target lists
>>> NP_OPS_PROTECT_WRITE - Node memory is read-only (wrprotect)
>>> NP_OPS_RECLAIM - Node supports reclaim
>>> NP_OPS_NUMA_BALANCING - Node supports numa balancing
>>> NP_OPS_COMPACTION - Node supports compaction
>>> NP_OPS_LONGTERM_PIN - Node supports longterm pinning
>>> NP_OPS_OOM_ELIGIBLE - (MIGRATION | DEMOTION), node is reachable
>>> as normal system ram storage, so it should
>>> be considered in OOM pressure calculations.
>>
>> I have to think about all that, and whether that would be required as a
>> first step. I'd assume in a simplistic use case mentioned above we might
>> only forbid the memory to be used as a fallback for any oom etc.
>>
>> Whether reclaim (e.g., swapout) makes sense is a good question.
>>
>
> I would simply state: "That depends on the memory device"

Let's keep it very simple: just some memory that you mbind(), and you
only want the mbind() user to make use of that memory.

What would be the minimal set of hooks to guarantee that.

For example, I assume compaction could just be supported for such
memory? Similarly, longterm-pinning.

For some of the other hooks it's rather unclear how they would affect
the very simple mbind() rule. What is the effect of demotion or NUMA
balancing?

I'm afraid we're making things too complicated here or it might be the
wrong abstraction, if I cannot even figure out how to make the simplest
use case work. Maybe I'm wrong :)

>
> Which is kind of the point. The ability to isolate and poke holes in
> that isolation explicitly, while using the same mm/ code, creates a new
> design space we haven't had before.
>
> ---
>
> I think it would be fair to say all of these would not be required for
> an MVP interface, and should require a use case to merge. But the code
> is here because I wanted to explore just how far it can go.

That's absolutely fair. :)

--
Cheers,
David
On Mon, Apr 13, 2026 at 03:11:12PM +0200, David Hildenbrand (Arm) wrote:
> > Normally cloud-hypervisor VMs with virtio-net can't be subject to KSM
> > because the entire boot region gets marked shared.
>
> What exactly do you mean with "mark shared". Do you mean, that "shared
> memory" is used in the hypervisor for all boot memory?
>
Sorry, meant MAP_SHARED. But yes, in some setups the hypervisor simply
makes a memfd with the entire main memory region MAP_SHARED.
This is because the virtio-net device / network stack does GFP_KERNEL
allocations and then pins them on the host to allow zero-copy - so all
of ZONE_NORMAL is a valid target.
(At least that's my best understanding of the entire setup).
>
> You mean, in the VM, memory usable by virtio-net can only be consumed
> from a dedicated physical memory region, and that region would be a
> separate node?
>
Correct - it does require teaching the network stack NUMA awareness.
I was surprised by how little code this required, though I can't be
100% sure of its correctness since networking isn't my normal space.
Alternatively you could imagine this as a real device bringing its own
dedicated networking memory for network buffers, and then telling the
network stack "Hey, prefer this node over normal kernel allocations".
What I'd been hacking on was cobbled together with memfd + SRAT bits to
bring up a private node statically and then have the device claim it -
but this is just a proof of concept. A proper implementation would be
extending virtio-net to report a dedicated EFI_RESERVED region.
> >
> > I see you saw below that one of the extensions is removing the nodes
> > from the fallback list. That is part one, but it's insufficient to
> > prevent complete leakage (someone might iterate over the nodes-possible
> > list and try migrating memory).
>
> Which code would do that?
>
There are many callers of for_each_node() throughout the system.
but one concrete example:
int alloc_shrinker_info(struct mem_cgroup *memcg)
{
	... snip ...
	for_each_node(nid) {
		struct shrinker_info *info = kvzalloc_node(sizeof(*info) + array_size,
							   GFP_KERNEL, nid);
	... snip ...
}
If you disallow fallbacks in this scenario, this allocation always fails.
This partially answers your question about slub fallback allocations,
there are slab allocations like this that depend on fallbacks (more
below on this explicitly).
> > Basically the only isolation mechanism we have today is ZONE_DEVICE.
> >
> > Either via mbind and friends, or even just the driver itself managing it
> > directly via alloc_pages_node() and exposing some userland interface.
>
> Would mbind() work here? I thought mbind() would not suddenly give
> access to some ZONE_DEVICE memory.
>
Sorry these were orthogonal thoughts.
1) We don't have such a mechanism. ZONE_DEVICE's preferred mechanism is
setting up explicit migrations via migrate_device.c
2) mbind / alloc_pages_node would only work for private nodes.
Extending ZONE_DEVICE to enable mbind() would be an extreme lift,
as the kernel makes a lot of assumptions about folio->lru.
This is why I went the node route in the first place.
> >
> > in the NP_OPS_MIGRATION patch, this gets covered.
>
> Right, but I am not sure if NP_OPS_MIGRATION is really the right
> approach for that. Have to think about that.
>
So, OPS is a bit misleading, but it's the closest I came to some
existing pattern. OPS does not necessarily need to imply callbacks.
I've been trying to minimize the patch set and I'm starting to think
the MVP may actually be able to do away with the private_ops structure
for a basic migration+mempolicy example by simply teaching some services
(migrate.c, mempolicy.c) how/when to inject __GFP_PRIVATE.
The mempolicy.c patch already does this, but not migrate.c - I haven't
figured out the right pattern for that yet.
> > 1) as you note, removing it from the default bitmaps, which is actually
> > hard. You can't remove it from the possible-node bitmap, so that
> > just seemed non-tractable.
>
> What about making people use a different set of bitmaps here? Quite some
> work, but maybe that's the right direction given that we'll now treat
> some nodes differently.
>
It's an option, although it is fragile. That means having to police all
future users of possible-nodes, for_each_node, etc.
I've been erring on the side of "not fragile", but I'm open to rework.
> >
> > 2) __GFP_THISNODE actually means (among other things) "don't fallback".
> > And, in fact, there are some hotplug-time allocations that occur in
> > SLAB (pglist_data) that target the private node that *must* fallback
> > to successfully allocate for successful kernel operation.
>
>
> Can you point me at the code?
>
There is actually a comment in slub.c that addresses this directly:
static int slab_mem_going_online_callback(int nid)
{
	... snip ...
	/*
	 * XXX: kmem_cache_alloc_node will fallback to other nodes
	 * since memory is not yet available from the node that
	 * is brought up.
	 */
	n = kmem_cache_alloc(kmem_cache_node, GFP_KERNEL);
	... snip ...
}
Slab basically acknowledges the behavior is required on existing nodes
and just falls back immediately for the "going online" path.
Other specific calls in the hotplug path:
mm/sparse.c: kzalloc_node(size, GFP_KERNEL, nid)
mm/sparse-vmemmap.c: alloc_pages_node(nid, GFP_KERNEL|...)
mm/slub.c: kmalloc_node(sizeof(*barn), GFP_KERNEL, nid)
There are quite a number of callers to kmem_cache_alloc_node() that
would have to be individually audited.
And some non-slab interfaces examples as well:
alloc_shrinker_info
alloc_node_nr_active
I've been looking at this for a while, but I'm starting to think trying
to touch all this surface area is simply too fragile compared to just
letting normal memory be a fallback for private nodes and adding:
__GFP_PRIVATE - unlocks the private node, but allows fallback
#define GFP_PRIVATE (__GFP_PRIVATE | __GFP_THISNODE) - only this node
__GFP_PRIVATE vs GFP_PRIVATE then is just a matter of use case.
For mbind() it probably makes sense we'd use GFP_PRIVATE - either it
succeeds or it OOMs.
> > The flexibility is kind of the point :]
>
> Yeah, but it would be interesting which minimal support we would need to
> just let some special memory be managed by the kernel, allowing mbind()
> users to use it, but not have any other fallback allocations end up on it.
>
> Something very basic, on which we could build additional functionality.
>
I actually have a simplistic CXL driver that does exactly this:
https://github.com/gourryinverse/linux/blob/072ecf7cbebd9871e76c0b52fd99aa1321405a59/drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c#L65
We have to support migration because mbind can migrate on bind if the
VMA already has memory - but all this means is the migrate interfaces
are live - not that the kernel actually uses them.
so mbind requires (OPS_MIGRATE | OPS_MEMPOLICY)
All these flags say is:
- move_pages() syscalls can accept these nodes
- migrate_pages() function calls can accept these nodes
- mempolicy.c nodemasks allow the nodes (should restrict to mbind)
- vma's with these nodes now inject __GFP_PRIVATE on fault
All other services (reclaim, compaction, khugepaged, etc) do not scan
these nodes and do not know about __GFP_PRIVATE, so they never see
private node folios and can't allocate from the node.
In this example, all migrate_to() really does is inject __GFP_THISNODE,
but I've been thinking about whether we can just do this in migrate.c
and leave implementing the .ops to a user that requires it.
But otherwise "it just works".
One note here though - OOM conditions and allocation failures are not
intuitive, especially when THP/non-order-0 allocations are involved.
But that might just mean this minimal setup should only allow order-0
allocations - which is fiiiiiiiiiiiiiine :P.
-----------------
For basic examples, I've implemented four to consider building on:
1) CXL mempolicy driver:
https://github.com/gourryinverse/linux/blob/072ecf7cbebd9871e76c0b52fd99aa1321405a59/drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c#L65
As described above
2) Virtio-net / CXL.mem Network Card
(Not published yet)
This doesn't require any ops at all - the plumbing happens entirely
inside the kernel. I onlined the node with an SRAT hack and no ops
structure at all associated with the device (just set node affinity
to the pcie_dev and plumbed it through the network stack).
A proper implementation would have virtio-net register its own
reserved memory region and online it during probe.
3) Accelerator
(Not published yet)
I have converted an open source but out of tree GPU driver which
uses NUMA nodes to use private nodes. This required:
NP_OPS_MIGRATION
NP_OPS_MEMPOLICY
The pattern is very similar to the CXL mempolicy driver, except
that the driver had alloc_pages_node() calls that needed to have
__GFP_PRIVATE added to ensure allocations landed on the device.
4) CXL Compressed RAM driver:
https://github.com/gourryinverse/linux/blob/55c06eb6bced58132d9001e318f2958e8ac80614/mm/cram.c#L340
needs pretty much everything - it's "normal memory" with access
rules, so the driver isn't really in the management lifecycle.
In this example - the only way to allocate memory on the node is
via demotion. This allows us to close off the device to new
allocations if the hardware reports low memory but the OS perceives
the device to still have free memory.
Which is a cool example: The driver just sets up the node with
certain attributes and then lets the kernel deal with it.
I have started compacting the _OPS_* flags related to reclaim into a
single NP_OPS_RECLAIM flag while testing with this. Really I've come
around to thinking many mm/ services need to be taken as a package,
not fully piecemeal.
The tl;dr: Once you cede some control over to the kernel, you're
very close to ceding ALL control, but you still get some control
over how/when allocations on the node can be made.
It is important to note that even if we don't expose callbacks, we do
still need a modicum of node filtering in some places that still use
for_each_node() (vmscan.c, compaction.c, oom_kill.c, etc).
These are basically all the places ZONE_DEVICE *implicitly* opts itself
out of by having managed_pages=0. We have to make those situations
explicit - but that doesn't mean we need callbacks.
> >
> > I would simply state: "That depends on the memory device"
>
> Let's keep it very simple: just some memory that you mbind(), and you
> only want the mbind() user to make use of that memory.
>
> What would be the minimal set of hooks to guarantee that.
>
If you want the mbind contract to stay intact:
NP_OPS_MIGRATION (mbind can generate migrations)
NP_OPS_MEMPOLICY (this just tells mempolicy.c to allow the node)
The set of callbacks required should be exactly 0 (assuming we teach
migrate.c to inject __GFP_PRIVATE like we have taught mempolicy.c).
If your device requires some special notification on allocation, free
or migration to/from you need:
ops.free_folio(folio)
ops.migrate_to(folios, nid, mode, reason, nr_success)
ops.migrate_folio(src_folio, dst_folio)
The free path is the tricky one to get right. You can imagine:
	buf = malloc(...);
	mbind(buf, private_node);
	memset(buf, 0x42, ...);
	ioctl(driver, CHECK_OUT_THIS_DATA, buf);
	exit(0);
The task dies and frees the pages back to the buddy - the question is
whether the 4-5 free_folio paths (put_folio, put_unref_folios, etc) can
all eat an ops.free_folio() callback to inform the driver the memory has
been freed.
In practice - this worked on my accelerator and compressed examples, but
I can't say it's 100% safe in all contexts. The free path needs more
scrutiny.
> For example, I assume compaction could just be supported for such
> memory? Similarly, longterm-pinning.
>
> For some of the other hooks it's rather unclear how they would affect
> the very simple mbind() rule. What is the effect of demotion or NUMA
> balancing?
>
> I'm afraid we're making things too complicated here or it might be the
> wrong abstraction, if i cannot even figure out how to make the simplest
> use case work.
>
> Maybe I'm wrong :)
>
Actually, quite the opposite: None of that should be engaged by
default. In our above example:
OPS_MIGRATION | OPS_MEMPOLICY
All this should say is that migration and mempolicy are supported - not
that anything in the kernel that uses migration will suddenly operate on
that memory.
So: Compaction, Longterm Pin, NUMA balancing, Demotion - etc - all of
these do not ever operate on this memory by default. Your device driver
or service would have to specifically opt-in to those services and must
be capable of dealing with the implications of that.
---
kind of neat aside:
You can hotplug private ZONE_NORMAL without NP_OPS_LONGTERM_PIN and as
long as the driver/service controls the type/lifetime of allocations,
the node can remain hot-unpluggable in the future.
e.g. if the service only ever allocates movable allocations, the lack
of NP_OPS_LONGTERM_PIN prevents those pages from being pinned. If you
add NP_OPS_MIGRATION - the attempt to pin will cause migration :]
~Gregory
On 4/13/26 19:05, Gregory Price wrote:
> On Mon, Apr 13, 2026 at 03:11:12PM +0200, David Hildenbrand (Arm) wrote:
>>> Normally cloud-hypervisor VMs with virtio-net can't be subject to KSM
>>> because the entire boot region gets marked shared.
>>
>> What exactly do you mean with "mark shared". Do you mean, that "shared
>> memory" is used in the hypervisor for all boot memory?
>>
>
> Sorry, meant MAP_SHARED. But yes, in some setups the hypervisor simply
> makes a memfd with the entire main memory region MAP_SHARED.
>
> This is because the virtio-net device / network stack does GFP_KERNEL
> allocations and then pins them on the host to allow zero-copy - so all
> of ZONE_NORMAL is a valid target.
>
> (At least that's my best understanding of the entire setup).
I think with vhost-kernel virtio-net just supports MAP_PRIVATE, KSM and
all of that.
The problem is vhost-user, where the other process needs to access all
of VM's memory. That's not only a problem for virtio-net, but also
virtio-fs and all the other stuff that uses vhost-user.
One idea discussed in the past was to let vhost-user access selected
guest memory through QEMU, so there would be no need to even map all of
guest memory into the other processes.
That in turn would stop requiring MAP_SHARED for most guest RAM, only
focusing it on some key parts. Not sure what happened with that idea.
A related series proposed some MEM_READ/WRITE backend requests [1]
[1] https://lists.nongnu.org/archive/html/qemu-devel/2024-09/msg02693.html
Something else people were discussing in the past was to physically
limit the area where virtio queues could be placed.
>
>>
>> You mean, in the VM, memory usable by virtio-net can only be consumed
>> from a dedicated physical memory region, and that region would be a
>> separate node?
>>
>
> Correct - it does require teaching the network stack NUMA awareness.
>
> I was surprised by how little code this required, though I can't be
> 100% sure of its correctness since networking isn't my normal space.
One problem might be that VMs with NUMA disabled or reconfigured would
just break. So you cannot run arbitrary guests in there. That was also
one of the problems of "physically limit the area where virtio queues
could be placed", if you have to be prepared to run arbitrary OSes in
your VM (Windows says hi).
>
> Alternatively you could imagine this as a real device bringing its own
> dedicated networking memory for network buffers, and then telling the
> network stack "Hey, prefer this node over normal kernel allocations".
>
> What I'd been hacking on was cobbled together with memfd + SRAT bits to
> bring up a private node statically and then have the device claim it -
> but this is just a proof of concept. A proper implementation would be
> extending virtio-net to report a dedicated EFI_RESERVED region.
>
>>>
>>> I see you saw below that one of the extensions is removing the nodes
>>> from the fallback list. That is part one, but it's insufficient to
>>> prevent complete leakage (someone might iterate over the nodes-possible
>>> list and try migrating memory).
>>
>> Which code would do that?
>>
>
> There are many callers of for_each_node() throughout the system.
>
> but one concrete example:
>
> int alloc_shrinker_info(struct mem_cgroup *memcg)
> {
> 	... snip ...
> 	for_each_node(nid) {
> 		struct shrinker_info *info = kvzalloc_node(sizeof(*info) + array_size,
> 							   GFP_KERNEL, nid);
> 	... snip ...
> }
>
> If you disallow fallbacks in this scenario, this allocation always fails.
>
> This partially answers your question about slub fallback allocations,
> there are slab allocations like this that depend on fallbacks (more
> below on this explicitly).
But that's a different "fallback" problem, no?
You want allocations that target the "special node" to fallback to
*other* nodes, but not other allocations to fallback to *this special* node.
>
>>> Basically the only isolation mechanism we have today is ZONE_DEVICE.
>>>
>>> Either via mbind and friends, or even just the driver itself managing it
>>> directly via alloc_pages_node() and exposing some userland interface.
>>
>> Would mbind() work here? I thought mbind() would not suddenly give
>> access to some ZONE_DEVICE memory.
>>
>
> Sorry these were orthogonal thoughts.
>
> 1) We don't have such a mechanism. ZONE_DEVICE's preferred mechanism is
> setting up explicit migrations via migrate_device.c
Makes sense.
>
> 2) mbind / alloc_pages_node would only work for private nodes.
>
> Extending ZONE_DEVICE to enable mbind() would be an extreme lift,
> as the kernel makes a lot of assumptions about folio->lru.
>
> This is why i went the node route in the first place.
Agreed.
>
>>>
>>> in the NP_OPS_MIGRATION patch, this gets covered.
>>
>> Right, but I am not sure if NP_OPS_MIGRATION is really the right
>> approach for that. Have to think about that.
>>
>
> So, OPS is a bit misleading, but it's the closest i came to some
> existing pattern. OPS does not necessarily need to imply callbacks.
>
> I've been trying to minimize the patch set and I'm starting to think
> the MVP may actually be able to do away with the private_ops structure
> for a basic migration+mempolicy example by simply teaching some services
> (migrate.c, mempolicy.c) how/when to inject __GFP_PRIVATE.
>
> the mempolicy.c patch already does this, but not migrate.c - i haven't
> figured out the right pattern for that yet.
I assume you will be at LSF/MM? Would be good to discuss some of that in
person.
>
>>> 1) as you note, removing it from the default bitmaps, which is actually
>>> hard. You can't remove it from the possible-node bitmap, so that
>>> just seemed non-tractable.
>>
>> What about making people use a different set of bitmaps here? Quite some
>> work, but maybe that's the right direction given that we'll now treat
>> some nodes differently.
>>
>
> It's an option, although it is fragile. That means having to police all
> future users of possible-nodes, for_each_node, etc.
>
> I've been erring on the side of "not fragile", but I'm open to rework.
>
>>>
>>> 2) __GFP_THISNODE actually means (among other things) "don't fallback".
>>> And, in fact, there are some hotplug-time allocations that occur in
>>> SLAB (pglist_data) that target the private node that *must* fallback
>>> to successfully allocate for successful kernel operation.
>>
>>
>> Can you point me at the code?
>>
>
> There is actually a comment in slub.c that addresses this directly:
>
> static int slab_mem_going_online_callback(int nid)
> {
> 	... snip ...
> 	/*
> 	 * XXX: kmem_cache_alloc_node will fallback to other nodes
> 	 * since memory is not yet available from the node that
> 	 * is brought up.
> 	 */
> 	n = kmem_cache_alloc(kmem_cache_node, GFP_KERNEL);
> 	... snip ...
> }
>
> Slab basically acknowledges the behavior is required on existing nodes
> and just falls back immediately for the "going online" path.
>
> Other specific calls in the hotplug path:
>
> mm/sparse.c: kzalloc_node(size, GFP_KERNEL, nid)
> mm/sparse-vmemmap.c: alloc_pages_node(nid, GFP_KERNEL|...)
> mm/slub.c: kmalloc_node(sizeof(*barn), GFP_KERNEL, nid)
>
> There are quite a number of callers to kmem_cache_alloc_node() that
> would have to be individually audited.
>
> And some non-slab interfaces examples as well:
> alloc_shrinker_info
> alloc_node_nr_active
>
> I've been looking at this for a while, but I'm starting to think trying
> to touch all this surface area is simply too fragile compared to just
> letting normal memory be a fallback for private nodes and adding:
>
> __GFP_PRIVATE - unlocks the private node, but allows fallback
> #define GFP_PRIVATE (__GFP_PRIVATE | __GFP_THISNODE) - only this node
>
> __GFP_PRIVATE vs GFP_PRIVATE then is just a matter of use case.
>
> For mbind() it probably makes sense we'd use GFP_PRIVATE - either it
> succeeds or it OOMs.
Needs a second thought regarding fallback logic I raised above.
What I think would have to be audited is the usage of __GFP_THISNODE by
kernel allocations, where we would not actually want to allocate from
this private node.
Maybe we could just outright refuse *any* non-user (movable) allocations
that target the node, even with __GFP_THISNODE.
Because, why would we want kernel allocations to even end up on a
private node that is supposed to only be consumed by user space? Or
which use cases are there where we would want to place kernel
allocations on there?
Needs a second thought, hoping we can discuss that in person.
>
>>> The flexibility is kind of the point :]
>>
>> Yeah, but it would be interesting which minimal support we would need to
>> just let some special memory be managed by the kernel, allowing mbind()
>> users to use it, but not have any other fallback allocations end up on it.
>>
>> Something very basic, on which we could build additional functionality.
>>
>
> I actually have a simplistic CXL driver that does exactly this:
> https://github.com/gourryinverse/linux/blob/072ecf7cbebd9871e76c0b52fd99aa1321405a59/drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c#L65
>
Great.
> We have to support migration because mbind can migrate on bind if the
> VMA already has memory - but all this means is the migrate interfaces
> are live - not that the kernel actually uses them.
>
> so mbind requires (OPS_MIGRATE | OPS_MEMPOLICY)
>
> All these flags say is:
> - move_pages() syscalls can accept these nodes
> - migrate_pages() function calls can accept these nodes
> - mempolicy.c nodemasks allow the nodes (should restrict to mbind)
> - vma's with these nodes now inject __GFP_PRIVATE on fault
>
> All other services (reclaim, compaction, khugepaged, etc) do not scan
> these nodes and do not know about __GFP_PRIVATE, so they never see
> private node folios and can't allocate from the node.
>
> In this example, all migrate_to() really does is inject __GFP_THISNODE,
> but I've been thinking about whether we can just do this in migrate.c
> and leave implementing the .ops to a user that requires it.
>
> But otherwise "it just works".
>
> One note here though - OOM conditions and allocation failures are not
> intuitive, especially when THP/non-order-0 allocations are involved.
>
> But that might just mean this minimal setup should only allow order-0
> allocations - which is fiiiiiiiiiiiiiine :P.
Again, I am not sure about compaction and khugepaged. All we want to
guarantee is that our memory does not leave the private node.
That doesn't require any __GFP_PRIVATE magic, just enlightening these
subsystems that private nodes must use __GFP_THISNODE and must not leak
to other nodes.
>
> -----------------
>
> For basic examples
>
> I've implemented 4 examples to consider building on:
>
> 1) CXL mempolicy driver:
> https://github.com/gourryinverse/linux/blob/072ecf7cbebd9871e76c0b52fd99aa1321405a59/drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c#L65
>
> As described above
>
> 2) Virtio-net / CXL.mem Network Card
> (Not published yet)
>
> This doesn't require any ops at all - the plumbing happens entirely
> inside the kernel. I onlined the node with an SRAT hack and no ops
> structure at all associated with the device (just set node affinity
> to the pcie_dev and plumbed it through the network stack).
>
> A proper implementation would have virtio-net register its own
> reserved memory region and online it during probe.
>
> 3) Accelerator
> (Not published yet)
>
> I have converted an open source but out of tree GPU driver which
> uses NUMA nodes to use private nodes. This required:
> NP_OPS_MIGRATION
> NP_OPS_MEMPOLICY
>
> The pattern is very similar to the CXL mempolicy driver, except
> that the driver had alloc_pages_node() calls that needed to have
> __GFP_PRIVATE added to ensure allocations landed on the device.
>
>
> 4) CXL Compressed RAM driver:
> https://github.com/gourryinverse/linux/blob/55c06eb6bced58132d9001e318f2958e8ac80614/mm/cram.c#L340
> needs pretty much everything - it's "normal memory" with access
> rules, so the driver isn't really in the management lifecycle.
>
> In this example - the only way to allocate memory on the node is
> via demotion. This allows us to close off the device to new
> allocations if the hardware reports low memory but the OS perceives
> the device to still have free memory.
>
> Which is a cool example: The driver just sets up the node with
> certain attributes and then lets the kernel deal with it.
>
>
> I have started compacting the _OPS_* flags related to reclaim into a
> single NP_OPS_RECLAIM flag while testing with this. Really I've come
> around to thinking many mm/ services need to be taken as a package,
> not fully piecemeal.
>
> The tl;dr: Once you cede some control over to the kernel, you're
> very close to ceding ALL control, but you still get some control
> over how/when allocations on the node can be made.
>
>
> It is important to note that even if we don't expose callbacks, we do
> still need a modicum of node filtering in some places that still use
> for_each_node() (vmscan.c, compaction.c, oom_kill.c, etc).
>
> These are basically all the places ZONE_DEVICE *implicitly* opts itself
> out of by having managed_pages=0. We have to make those situations
> explicit - but that doesn't mean we need callbacks.
>
>>>
>>> I would simply state: "That depends on the memory device"
>>
>> Let's keep it very simple: just some memory that you mbind(), and you
>> only want the mbind() user to make use of that memory.
>>
>> What would be the minimal set of hooks to guarantee that.
>>
>
> If you want the mbind contract to stay intact:
>
> NP_OPS_MIGRATION (mbind can generate migrations)
> NP_OPS_MEMPOLICY (this just tells mempolicy.c to allow the node)
I'm missing why these are even opt-in. What's the problem with allowing
mbind and mempolicy to use these nodes in some of your drivers?
I also have some questions about longterm pinnings, but that's better
discussed in person :)
>
> The set of callbacks required should be exactly 0 (assuming we teach
> migrate.c to inject __GFP_PRIVATE like we have taught mempolicy.c).
>
> If your device requires some special notification on allocation, free
> or migration to/from you need:
>
> ops.free_folio(folio)
> ops.migrate_to(folios, nid, mode, reason, nr_success)
> ops.migrate_folio(src_folio, dst_folio)
>
> The free path is the tricky one to get right. You can imagine:
>
> buf = malloc(...);
> mbind(buf, private_node);
> memset(buf, 0x42, ...);
> ioctl(driver, CHECK_OUT_THIS_DATA, buf);
> exit(0);
>
> The task dies and frees the pages back to the buddy - the question is
> whether the 4-5 free_folio paths (put_folio, put_unref_folios, etc) can
> all eat an ops.free_folio() callback to inform the driver the memory has
> been freed.
Right, that's rather invasive.
--
Cheers,
David
On Wed, Apr 15, 2026 at 11:49:59AM +0200, David Hildenbrand (Arm) wrote:
> On 4/13/26 19:05, Gregory Price wrote:
As a preface - the current RFC was informed by ZONE_DEVICE patterns.
I think that was useful as a way to find existing friction points - but
ultimately wrong for this new interface.
I don't think an ops struct here is the right design, and I think there
are only a few patterns that actually make sense for device memory using
nodes this way.
So there's going to be a *major* contraction in the complexity of this
patch series (hopefully I'll have something next week), and much of what
you point out below is already in-flight.
> > On Mon, Apr 13, 2026 at 03:11:12PM +0200, David Hildenbrand (Arm) wrote:
> >
> > This is because the virtio-net device / network stack does GFP_KERNEL
> > allocations and then pins them on the host to allow zero-copy - so all
> > of ZONE_NORMAL is a valid target.
> >
> > (At least that's my best understanding of the entire setup).
>
... snip ...
>
> A related series proposed some MEM_READ/WRITE backend requests [1]
>
> [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-09/msg02693.html
>
Oh interesting, thank you for the reference here.
>
> Something else people were discussing in the past was to physically
> limit the area where virtio queues could be placed.
>
That is functionally what I did - the idea was pretty simple, just have
a separate memfd/node dedicated for the queues:
	guest_memory = memfd(MAP_PRIVATE)
	net_memory   = memfd(MAP_SHARED)
And boom, you get what you want.
So yeah "It works" - but there are likely other ways to do this too, and
as you note re: compatibility, I'm not sure virtio actually wants this,
but it's a nice proof-of-concept for a network device on the host that
carries its own memory.
I'll try to post my hack as an example with the next RFC version, as I
think it's informative.
> >
> > This partially answers your question about slub fallback allocations,
> > there are slab allocations like this that depend on fallbacks (more
> > below on this explicitly).
>
> But that's a different "fallback" problem, no?
>
> You want allocations that target the "special node" to fallback to
> *other* nodes, but not other allocations to fallback to *this special* node.
>
... snip - slight reordering to put thoughts together ...
> >
> > __GFP_PRIVATE vs GFP_PRIVATE then is just a matter of use case.
> >
> > For mbind() it probably makes sense we'd use GFP_PRIVATE - either it
> > succeeds or it OOMs.
>
> Needs a second thought regarding fallback logic I raised above.
>
> What I think would have to be audited is the usage of __GFP_THISNODE by
> kernel allocations, where we would not actually want to allocate from
> this private node.
>
This is fair, and a re-visit is absolutely warranted.
Re-examining the quick audit from my last response suggests I should
never have seen leakage in those cases, but the fallbacks are needed.
So yes, this all requires a second look (and a third, and a ninth).
I'm not married to __GFP_PRIVATE, but it has been reliable for me.
> Maybe we could just outright refuse *any* non-user (movable) allocations
> that target the node, even with __GFP_THISNODE.
>
> Because, why would we want kernel allocations to even end up on a
> private node that is supposed to only be consumed by user space? Or
> which use cases are there where we would want to place kernel
> allocations on there?
>
As a start, maybe? But as a permanent invariant? I would wonder whether
the decision here would lock us into a design.
But then - this is all kernel internal, so I think it would be feasible
to change this out from under users without backward compatibility pain.
So far I have done my best to avoid changing any userland interfaces in
a way that would fundamentally change the contracts. If anything
private-node other than just the node's `has_memory_private` attribute
leaks into userland, someone messed up.
So... I think that's reasonable.
>
> I assume you will be at LSF/MM? Would be good to discuss some of that in
> person.
>
Yes, looking forward to it :]
> > One note here though - OOM conditions and allocation failures are not
> > intuitive, especially when THP/non-order-0 allocations are involved.
> >
> > But that might just mean this minimal setup should only allow order-0
> > allocations - which is fiiiiiiiiiiiiiine :P.
>
>
> Again, I am not sure about compaction and khugepaged. All we want to
> guarantee is that our memory does not leave the private node.
>
> That doesn't require any __GFP_PRIVATE magic, just enlightening these
> subsystems that private nodes must use __GFP_THISNODE and must not leak
> to other nodes.
This is where specific use-cases matter.
In the compressed memory example - the device doesn't care about memory
leaving - but it cares about memory arriving *and being modified*.
(more on this in your next question)
So I'm not convinced *all possible devices* would always want to support
move_pages(), mbind(), and set_mempolicy().
But, I do want to give this serious thought, and I agree the absolute
minimal patch set could just be the fallback control mechanism and
mm/ component filters/audit on __GFP_*.
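To make that minimal fallback-control idea concrete, here is a tiny
userspace model of the asymmetry David raised - requests targeting a
private node may fall back to normal nodes, but ordinary requests never
land on a private node. This is an illustrative sketch only; the
GFP_PRIVATE bit, node array, and helper names are stand-ins, not the
RFC's actual symbols:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins - not the RFC's actual symbols. */
#define GFP_PRIVATE	0x1
#define MAX_NODES	4

/* Node 2 models the device-backed private node. */
static const bool node_is_private[MAX_NODES] = { false, false, true, false };

/* A private node only satisfies allocations that explicitly opt in. */
static bool node_eligible(int nid, unsigned int gfp)
{
	if (!node_is_private[nid])
		return true;		/* normal nodes: always eligible */
	return gfp & GFP_PRIVATE;	/* private nodes: opt-in only */
}

/* Walk a fallback order and return the first eligible node (-1 if none). */
static int alloc_node(const int *order, int n, unsigned int gfp)
{
	for (int i = 0; i < n; i++)
		if (node_eligible(order[i], gfp))
			return order[i];
	return -1;
}
```

With a fallback order of {2, 0, 1}, a plain request skips node 2 and
lands on node 0, while a GFP_PRIVATE request is served from node 2
first - and could still fall back to nodes 0 and 1 if node 2 were
exhausted.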
> > If you want the mbind contract to stay intact:
> >
> > NP_OPS_MIGRATION (mbind can generate migrations)
> > NP_OPS_MEMPOLICY (this just tells mempolicy.c to allow the node)
>
> I'm missing why these are even opt-in. What's the problem with allowing
> mbind and mempolicy to use these nodes in some of your drivers?
>
First:
In my latest working branch these two flags have been folded into just
_OPS_MEMPOLICY and any other migration interaction is just handled by
filtering with the GFP flag.
on always allowing mbind and mempolicy vs opt-in
---
A proper compressed memory solution should not allow mbind/mempolicy.
Compressed memory is different from normal memory - the kernel can
perceive free memory (many unused struct pages in the buddy) while the
device knows there's none left (the physical capacity is actually full).
Any form of write to a compressed memory device is essentially a
dangerous condition (OOMs = poison, not oom_kill()).
So you need two controls - allocation and (userland) write protection -
which I implemented via:
- Demotion-only (allocations only happen in the reclaim path)
- Write-protecting the entire node
(I fully accept that a write-protection extension here might be a bridge
too far, but please stick with me for the sake of exploration).
There's a serious argument to limit these devices to using an mbind
pattern, but I wanted to make a full-on attempt to integrate this device
into the demotion path as a transparent tier (kinda like zswap).
I could not square write-protection with mempolicy, so I had to make
them both optional and mutually exclusive.
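That "optional and mutually exclusive" constraint could be expressed as
a simple validation at node-registration time. A sketch - NP_OPS_MEMPOLICY
comes from earlier in this thread, but the other two flag names and the
helper are invented here for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* NP_OPS_MEMPOLICY is discussed in this thread; the other two names
 * are illustrative placeholders for the compression driver's controls. */
#define NP_OPS_MEMPOLICY	0x1	/* mbind()/set_mempolicy() allowed */
#define NP_OPS_DEMOTION_ONLY	0x2	/* allocations only via reclaim/demotion */
#define NP_OPS_WRITE_PROTECT	0x4	/* node-wide userland write protection */

static bool node_ops_valid(unsigned int ops)
{
	/* Write-protection cannot coexist with user-driven placement:
	 * mbind() implies userland can fault-in and write the pages. */
	if ((ops & NP_OPS_WRITE_PROTECT) && (ops & NP_OPS_MEMPOLICY))
		return false;
	return true;
}
```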
If you limit the device to mbind interactions, you do limit what can
crash - but this forces userland software to be less portable by design:
- am I running on a system where this device is present?
- is that device exposing its memory on a node?
- which node?
- what memory can I put on that node? (can you prevent a process from
  putting libc on that node?)
- how much compression ratio is left on the device?
- can I safely write to this virtual address?
- should I write-protect compressed VMAs? Can I handle those faults?
- many more
That sounds a lot like re-implementing a bunch of mm/ in userland, and
that's exactly where we were at with DAX. We know this pattern failed.
I'm very much trying to avoid repeating these mistakes, and to find a
good path forward here that results in transparent usage of this
memory.
> I also have some questions about longterm pinnings, but that's better
> discussed in person :)
>
The longterm pin extension came from auditing existing zone_device
filters.
tl;dr: informative mechanism - but it probably should be dropped,
it makes no sense (it's device memory, pinnings mean nothing?).
> >
> > The task dies and frees the pages back to the buddy - the question is
> > whether the 4-5 free_folio paths (put_folio, put_unref_folios, etc) can
> > all eat an ops.free_folio() callback to inform the driver the memory has
> > been freed.
>
> Right, that's rather invasive.
>
Yeah, I'm trying to avoid it, and the answer may actually just exist in
the task-death and VMA cleanup path rather than the folio-free path.
From what I've seen of accelerator drivers that implement this, when you
inform the driver of a memory region associated with a task, the driver
should have a mechanism to take references on that VMA (or something
like this) - so that when the task dies the driver has a way to be
notified of the VMA being cleaned up.
This probably exists - I just haven't gotten there yet.
~Gregory
On Wed, Apr 15, 2026 at 8:18 AM Gregory Price <gourry@gourry.net> wrote:
>
> On Wed, Apr 15, 2026 at 11:49:59AM +0200, David Hildenbrand (Arm) wrote:
> > On 4/13/26 19:05, Gregory Price wrote:
>
> ... snip (Gregory's reply, quoted in full above) ...

This has been a really great discussion. I just wanted to add a few
points that I think I have mentioned in other forums, but not here.

In essence, this is a discussion about memory properties and the level
at which they should be dealt with. Right now there are basically 3
levels: pageblocks, zones and nodes. While these levels exist for good
reasons, they also sometimes lead to issues. There's duplication of
functionality. MIGRATE_CMA and ZONE_MOVABLE both implement the same
basic property, but at different levels (attempts have been made to
merge them, but it didn't work out).

There's also memory with clashing properties inhabiting the same data
structure: LRUs. Having strictly movable memory on the same LRU as
unmovable memory is a mismatch. It leads to the well known problem of
reclaim done in the name of an unmovable allocation attempt can be
entirely pointless in the face of large amounts of ZONE_MOVABLE or
MIGRATE_CMA memory: the anon LRU will be chock full of movable-only
pages. Reclaiming them is useless for your allocation, and skipping
them leads to locking up the system because you're holding on to the
LRU lock a long time.

So, looking at having some properties set at the node level makes
sense to me even in the non-device case. But perhaps that is out of
scope for the initial discussion.

One use case that seems like a good match for private nodes is guest
memory. Guest memory is special enough to want to allocate / maintain
it separately, which is acknowledged by the introduction of
guest_memfd.

I'm interested in enabling guest_memfd allocation from private nodes.
I've been playing around with setting aside memory at boot, and
assigning it to private nodes (one private node per physical NUMA
node), and making it available to guest_memfd only. There are issues
to be solved there, but the private node abstraction seems to fit
well, and provides for useful hooks to manage guest memory.

Some properties that I'm interested in for this use case:

1) is the memory in the direct map or not? Should that be configurable
for a private node? I know there are patches right now to remove
memory from the direct map for guest_memfd, but what if there was a
private node whose memory is not in the direct map by default?

2) Default page size. devdax, a ZONE_DEVICE user, allows for memory
setup on hotplug that initializes things with HVO-ed large pages.
Could the page size be a property of the node? That would make it easy
to hand out larger pages to guests. Of course, if you use anything
but 4k, the argument of 'we can use the general buddy allocator' goes
out the window, unless it's made to deal with a per-node base page
size.

- Frank
On Wed, Apr 15, 2026 at 12:47:50PM -0700, Frank van der Linden wrote:
> > > I also have some questions about longterm pinnings, but that's better
> > > discussed in person :)
> >
> > The longterm pin extension came from auditing existing zone_device
> > filters.
> >
> > tl;dr: informative mechanism - but it probably should be dropped,
> > it makes no sense (it's device memory, pinnings mean nothing?).
> >
> ... snip ... stitching together some context here
>
> So, looking at having some properties set at the node level makes
> sense to me even in the non-device case. But perhaps that is out of
> scope for the initial discussion.

I think there's an argument buried in this observation (useful for the
non-device case) that suggests there could be a world where longterm
pinning on this memory makes sense.

But it doesn't need to be introduced from the start, and it's a 5-10
line change to add it in later, so I think it will get trimmed unless
there's a user out there actively experimenting with it.

~Gregory
On Wed, Apr 15, 2026 at 12:47:50PM -0700, Frank van der Linden wrote:
>
> This has been a really great discussion. I just wanted to add a few
> points that I think I have mentioned in other forums, but not here.
>
> In essence, this is a discussion about memory properties and the level
> at which they should be dealt with. Right now there are basically 3
> levels: pageblocks, zones and nodes. While these levels exist for good
> reasons, they also sometimes lead to issues. There's duplication of
> functionality. MIGRATE_CMA and ZONE_MOVABLE both implement the same
> basic property, but at different levels (attempts have been made to
> merge them, but it didn't work out).
I have made this observation as well. ZONEs in particular are a bit
odd because they're somehow simultaneously too broad and too narrow in
terms of what they control and what they're used for.
1GB ZONE_MOVABLE HugeTLBFS pages are an example of a weird carve-out:
the memory is in ZONE_MOVABLE to help make 1GB allocations more
reliable, but 1GB movable pages were removed from the kernel because
they're not easily migrated (and therefore may block hot-unplug).
(Thankfully they're back now, so VMs can live on this memory :P)
So you have competing requirements, which suggests zone is the wrong
abstraction at some level - but it's what we've got.
> There's also memory with clashing
> properties inhabiting the same data structure: LRUs. Having strictly
> movable memory on the same LRU as unmovable memory is a mismatch. It
> leads to the well known problem of reclaim done in the name of an
> unmovable allocation attempt can be entirely pointless in the face of
> large amounts of ZONE_MOVABLE or MIGRATE_CMA memory: the anon LRU will
> be chock full of movable-only pages. Reclaiming them is useless for
> your allocation, and skipping them leads to locking up the system
> because you're holding on to the LRU lock a long time.
>
This is an interesting observation that should be solvable.
For example - I'm pretty sure mlock'd pages are on an unevictable LRU
for exactly this reason (to just skip scanning them during reclaim).
Which is a different pain point I have - since they're still migratable,
they could be demoted to make room for local hot pages.
> So, looking at having some properties set at the node level makes
> sense to me even in the non-device case. But perhaps that is out of
> scope for the initial discussion.
>
> One use case that seems like a good match for private nodes is guest
> memory. Guest memory is special enough to want to allocate / maintain
> it separately, which is acknowledged by the introduction of
> guest_memfd.
>
> I'm interested in enabling guest_memfd allocation from private nodes.
> I've been playing around with setting aside memory at boot, and
> assigning it to private nodes (one private node per physical NUMA
> node), and making it available to guest_memfd only. There are issues
> to be solved there, but the private node abstraction seems to fit
> well, and provides for useful hooks to manage guest memory.
>
I have wondered about this use case, but I haven't really played with
guest_memfd to know what the implications are here, so it's nice to hear
someone is looking at this. It will be nice to hear your input on where
the abstraction could be better.
> Some properties that I'm interested in for this use case:
>
> 1) is the memory in the direct map or not? Should that be configurable
> for a private node? I know there are patches right now to remove
> memory from the direct map for guest_memfd, but what if there was a
> private node whose memory is not in the direct map by default?
Presuming a page was not in the direct map and it was in the buddy
(strong assumption here), there's a handful of things that would
straight up break:
- init_on_alloc (post_alloc_hook) / __GFP_ZERO (clear_highpage)
- init_on_free (free_pages_prepare)
- kernel_poison_pages (accesses the page contents)
- CONFIG_DEBUG_PAGEALLOC
But... these things seem eminently skippable based on a node attribute.
I think this could be done, but there is added concern about spewing
an ever-increasing number of hooks throughout mm/ as the number of
attributes increases.
But in this case I think the contract would require NP_OPS_NOMAP to be
mutually exclusive with all other node attributes (too many places
touch the mapping; it would be too fragile).
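As a sketch of what "skippable based on a node attribute" might look
like, here is a userspace model of the content-touching paths listed
above gated behind one check. The node_nomap array and function name
are invented for illustration; the real checks would sit in
post_alloc_hook() and free_pages_prepare():

```c
#include <assert.h>
#include <stdbool.h>

/* Node 2 models a hypothetical no-direct-map private node. */
static const bool node_nomap[4] = { false, false, true, false };

static int pages_cleared;	/* counts simulated clear_highpage() calls */

/* Model of post_alloc_hook(): any path that touches page contents
 * (init_on_alloc, __GFP_ZERO, poisoning, DEBUG_PAGEALLOC) must be
 * skipped when the backing node has no kernel direct mapping. */
static void post_alloc_hook_model(int nid, bool want_zero)
{
	if (node_nomap[nid])
		return;		/* no kernel mapping: cannot touch contents */
	if (want_zero)
		pages_cleared++;
}
```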
There are a few catches here though:
1) You lose the ability to zero out the page after allocation, so
   whatever is in the memory already is going into the guest.
   That seems problematic for a variety of reasons.
   I guess you can use kmap_local_page? But then why not just unmap
   after allocation?
   If never mapping is a hard requirement, and that memory lives on
   a device with a sanitize function, you could maybe massage kernel
   free-page-reporting to offload the zeroing without having the
   kernel map it - as long as you can take a delay after free before
   the page becomes available again.
2) The current mempolicy guest_memfd patches would not apply, because
   I can't see how OPS_MEMPOLICY and OPS_NOMAP could co-exist. A user
   program could call mbind(nomap_node) on a random VMA - and there
   would be kernel OOPSes everywhere.
   That would just mean pre-setting the node backing for all
   guest_memfd VMAs, rather than using mbind().
Something like (cribbing from the memfd code with absolutely no
context, so there's a pile of assumptions being made here):

    struct kvm_create_guest_memfd {
            __u64 size;
            __u64 flags;
            __s32 numa_node;        /* Set at creation */
            __u32 pad;
            __u64 reserved[5];
    };

    #define GUEST_MEMFD_FLAG_NUMA_NODE (1ULL << 2)

    if (gmem->flags & GUEST_MEMFD_FLAG_NUMA_NODE)
            folio = __folio_alloc(gfp | __GFP_PRIVATE, order,
                                  gmem->numa_node, NULL);
    else
            /* existing mempolicy / default path */
            folio = __filemap_get_folio_mpol(...);
Which may even be preferable to the recently upstreamed pattern.
> 2) Default page size. devdax, a ZONE_DEVICE user, allows for memory
> setup on hotplug that initializes things with HVO-ed large pages.
> Could the page size be a property of the node? That would make it easy
> to hand out larger pages to guests. Of course, if you use anything
> but 4k, the argument of 'we can use the general buddy allocator' goes
> out the window, unless it's made to deal with a per-node base page
> size.
>
Per-node page sizes are probably a bridge too far; that seems like
a change that would echo through most of the buddy infrastructure, not
just a few hooks to prevent certain interactions.
However, I also don't think this is a requirement.
I know there is some work to try to raise the max page order to allow
THP to support 1GB huge pages - if max size is a concern, there's hope.
On fragmentation though...
If the consumer of a private node only ever allocates a specific order
(order-9) - the buddy never fragments smaller than that (it maybe
spends time coalescing for no value, but it'll never fragment smaller).
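That claim can be demonstrated with a toy free-list model of the buddy
(a userspace sketch with no relation to the real mm/page_alloc.c
structures - addresses and coalescing are omitted since they don't
affect the claim): if every request is order-9, splits only ever
produce buddies of order 9 and above, so nothing smaller can appear on
the free lists:

```c
#include <assert.h>

#define MAX_ORDER 12

/* Toy buddy: just count free blocks per order. */
static int free_count[MAX_ORDER + 1];

/* Allocate one block of @order: take the smallest free block that is
 * large enough and split it down, leaving one free buddy per step. */
static int buddy_alloc(int order)
{
	int o = order;

	while (o <= MAX_ORDER && free_count[o] == 0)
		o++;
	if (o > MAX_ORDER)
		return -1;	/* out of memory */
	free_count[o]--;
	while (o > order)
		free_count[--o]++;	/* buddies land at orders >= @order */
	return 0;
}

static void buddy_free(int order)
{
	free_count[order]++;
}
```

Seeding the model with eight order-12 blocks and doing a mix of order-9
allocations and frees leaves every free list below order 9 empty - the
node never fragments below the allocation granularity.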
So is the concern here that you want to guarantee a minimum page size
to deal with the fragmentation problem on normal general-purpose nodes,
or do you want to guarantee a minimum page size because you can't limit
the allocations to be of a base order?
i.e.: is limiting guest_memfd allocations on a private node to a single
order (or a minimum order? 2MB?) a feasible option? (Pretend I know
very little about the guest_memfd-specific memory management code.)
~Gregory