[RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory
Posted by Gregory Price 1 month ago
This series introduces N_PRIVATE, a new node state for memory nodes 
whose memory is not intended for general system consumption.  Today,
device drivers (CXL, accelerators, etc.) hotplug their memory to access
mm/ services like page allocation and reclaim, but this exposes general
workloads to memory with different characteristics and reliability
guarantees than system RAM.

N_PRIVATE provides isolation by default while enabling explicit access
via __GFP_THISNODE for subsystems that understand how to manage these
specialized memory regions.

Motivation
==========

Several emerging memory technologies require kernel memory management
services but should not be used for general allocations:

  - CXL Compressed RAM (CRAM): Hardware-compressed memory where the
    effective capacity depends on data compressibility.  Uncontrolled
    use risks capacity exhaustion when compression ratios degrade.

  - Accelerator Memory: GPU/TPU-attached memory optimized for specific
    access patterns that are not intended for general allocation.

  - Tiered Memory: Memory intended only as a demotion target, not for
    initial allocations.

Currently, these devices either avoid hotplugging entirely (losing mm/
services) or hotplug as regular N_MEMORY (risking reliability issues).
N_PRIVATE solves this by creating an isolated node class.

Design
======

The series introduces:

  1. N_PRIVATE node state (mutually exclusive with N_MEMORY)
  2. private_memtype enum for policy-based access control
  3. cpuset.mems.sysram for user-visible isolation
  4. Integration points for subsystems (zswap demonstrated)
  5. A cxl private_region example to demonstrate full plumbing

Private Memory Types (private_memtype)
======================================

The private_memtype enum defines policy bits that control how different
kernel subsystems may access private nodes:

  enum private_memtype {
      NODE_MEM_NOTYPE,      /* No type assigned (invalid state) */
      NODE_MEM_ZSWAP,       /* Swap compression target */
      NODE_MEM_COMPRESSED,  /* General compressed RAM */
      NODE_MEM_ACCELERATOR, /* Accelerator-attached memory */
      NODE_MEM_DEMOTE_ONLY, /* Memory-tier demotion target only */
      NODE_MAX_MEMTYPE,
  };

These types serve as policy hints for subsystems:

NODE_MEM_ZSWAP
--------------
Nodes with this type are registered as zswap compression targets.  When
zswap compresses a page, it can allocate directly from ZSWAP-typed nodes
using __GFP_THISNODE, bypassing software compression if the device
provides hardware compression.

Example flow:
  1. CXL device creates private_region with type=zswap
  2. Driver calls node_register_private() with NODE_MEM_ZSWAP
  3. zswap_add_direct_node() registers the node as a compression target
  4. On swap-out, zswap allocates from the private node
  5. page_allocated() callback validates compression ratio headroom
  6. page_freed() callback zeros pages to improve device compression

Prototype Note:
  This patch set does not actually do compression ratio validation, as
  this requires an actual device to provide some kind of counter and/or
  interrupt to denote when allocations are safe.  The callbacks are
  left as stubs with TODOs for device vendors to pick up the next step
  (we'll continue with a QEMU example if reception is positive).

  For now, this always succeeds because compressed=real capacity.
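
A minimal sketch of what those stubs could look like (the "cram_zswap_"
names are invented for this example; the prototypes follow the
private_node_ops callbacks shown later in this letter):

  static int cram_zswap_page_allocated(struct page *page, void *data)
  {
      /*
       * TODO (device vendors): check compression-ratio headroom via a
       * device counter or interrupt.  While compressed capacity equals
       * real capacity, this always succeeds.
       */
      return 0;
  }

  static void cram_zswap_page_freed(struct page *page, void *data)
  {
      /* Zero the page so the device can recover compression headroom. */
      clear_highpage(page);
  }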

NODE_MEM_COMPRESSED (CRAM)
--------------------------
For general compressed RAM devices.  Unlike ZSWAP nodes, CRAM nodes
could be exposed to subsystems that understand compression semantics:

  - vmscan: Could prefer demoting pages to CRAM nodes before swap
  - memory-tiering: Could place CRAM between DRAM and persistent memory
  - zram: Could use as backing store instead of or alongside zswap

Such a component (mm/cram.c) would differ from zswap or zram by allowing
the compressed pages to remain mapped Read-Only in the page table.

NODE_MEM_ACCELERATOR
--------------------
For GPU/TPU/accelerator-attached memory.  Policy implications:

  - Default allocations: Never (isolated from general page_alloc)
  - GPU drivers: Explicit allocation via __GFP_THISNODE
  - NUMA balancing: Excluded from automatic migration
  - Memory tiering: Not a demotion target

Some GPU vendors want management of their memory via NUMA nodes, but
don't want fallback or migration allocations to occur.  This enables
that pattern.

mm/mempolicy.c could be extended to allow N_PRIVATE nodes of this type
if the intent is per-vma access to accelerator memory (e.g. via mbind),
but this is omitted from this series for now to limit userland
exposure until first-class examples are provided.

NODE_MEM_DEMOTE_ONLY
--------------------
For memory intended exclusively as a demotion target in memory tiering:

  - page_alloc: Never allocates initially (slab, page faults, etc.)
  - vmscan/reclaim: Valid demotion target during memory pressure
  - memory-tiering: Allow hotness monitoring/promotion for this region

This enables "cold storage" tiers using slower/cheaper memory (CXL-
attached DRAM, persistent memory in volatile mode) without the memory
appearing in allocation fast paths.

This has the additional benefit of enforcing that memory placed on
these nodes comes from movable allocations only (with all the normal
caveats around page pinning).

Subsystem Integration Points
============================

The private_node_ops structure provides callbacks for integration:

  struct private_node_ops {
      struct list_head list;
      resource_size_t res_start;
      resource_size_t res_end;
      enum private_memtype memtype;
      int (*page_allocated)(struct page *page, void *data);
      void (*page_freed)(struct page *page, void *data);
      void *data;
  };

page_allocated(): Called after allocation, returns 0 to accept or
-ENOSPC/-ENODEV to reject (caller retries elsewhere).  Enables:
  - Compression ratio enforcement for CRAM/zswap
  - Capacity tracking for accelerator memory
  - Rate limiting for demotion targets

page_freed(): Called on free, enables:
  - Zeroing for compression ratio recovery
  - Capacity accounting updates
  - Device-specific cleanup
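
As a sketch of the driver side (assuming node_register_private() takes a
node id and an ops pointer; the exact signature and error handling in
patch 1 may differ), a ZSWAP-typed region could be wired up roughly like
this, reusing the stub callbacks sketched above:

  static int cram_register_private_region(int nid, resource_size_t start,
                                          resource_size_t end, void *priv)
  {
      static struct private_node_ops cram_zswap_ops;

      cram_zswap_ops.res_start      = start;
      cram_zswap_ops.res_end        = end;
      cram_zswap_ops.memtype        = NODE_MEM_ZSWAP;
      cram_zswap_ops.page_allocated = cram_zswap_page_allocated;
      cram_zswap_ops.page_freed     = cram_zswap_page_freed;
      cram_zswap_ops.data           = priv;

      /* Assumed entry point: marks the node N_PRIVATE and attaches ops. */
      return node_register_private(nid, &cram_zswap_ops);
  }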

Isolation Enforcement
=====================

The series modifies core allocators to respect N_PRIVATE isolation:

  - page_alloc: Constrains zone iteration to cpuset.mems.sysram
  - slub: Allocates only from N_MEMORY nodes
  - compaction: Skips N_PRIVATE nodes
  - mempolicy: Uses sysram_nodes for policy evaluation

__GFP_THISNODE bypasses isolation, enabling explicit access:

  page = alloc_pages_node(private_nid, GFP_KERNEL | __GFP_THISNODE, 0);

This pattern is used by zswap, and would be used by other subsystems
that explicitly opt into private node access.
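
From the consuming side, the opt-in path might look roughly like the
sketch below.  The helper name is invented for illustration; the gfp
details beyond __GFP_THISNODE (e.g. __GFP_NOWARN, since failure is
expected to be handled by falling back to the normal path) are
illustrative as well.

  static struct page *zswap_try_private_alloc(int private_nid)
  {
      /* Only nodes explicitly marked private are eligible here. */
      if (!node_state(private_nid, N_PRIVATE))
          return NULL;

      /* __GFP_THISNODE: never fall back to other nodes implicitly. */
      return alloc_pages_node(private_nid,
                              GFP_KERNEL | __GFP_THISNODE | __GFP_NOWARN, 0);
  }

A NULL return sends the caller back to the regular software-compression
path.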

User-Visible Changes
====================

cpuset gains cpuset.mems.sysram (read-only), which shows the N_MEMORY
(sysram) nodes available to the cpuset.

ABI: /proc/<pid>/status Mems_allowed shows sysram nodes only.

Drivers create private regions via sysfs:
  echo region0 > /sys/bus/cxl/.../create_private_region
  echo zswap > /sys/bus/cxl/.../region0/private_type
  echo 1 > /sys/bus/cxl/.../region0/commit

Series Organization
===================

Patch 1: numa,memory_hotplug: create N_PRIVATE (Private Nodes)
  Core infrastructure: N_PRIVATE node state, node_mark_private(),
  private_memtype enum, and private_node_ops registration.

Patch 2: mm: constify oom_control, scan_control, and alloc_context 
nodemask
  Preparatory cleanup for enforcing that nodemasks don't change.

Patch 3: mm: restrict slub, compaction, and page_alloc to sysram
  Enforce N_MEMORY-only allocation for general paths.

Patch 4: cpuset: introduce cpuset.mems.sysram
  User-visible isolation via cpuset interface.

Patch 5: Documentation/admin-guide/cgroups: update docs for mems_allowed
  Document the new behavior and sysram_nodes.

Patch 6: drivers/cxl/core/region: add private_region
  CXL infrastructure for private regions.

Patch 7: mm/zswap: compressed ram direct integration
  Zswap integration demonstrating direct hardware compression.

Patch 8: drivers/cxl: add zswap private_region type
  Complete example: CXL region as zswap compression target.

Future Work
===========

This series provides the foundation.  Planned follow-ups include:

  - CRAM integration with vmscan for smart demotion
  - ACCELERATOR type for GPU memory management
  - Memory-tiering integration with DEMOTE_ONLY nodes

Testing
=======

All patches build cleanly.  Tested with:
  - CXL QEMU emulation with private regions
  - Zswap stress tests with private compression targets
  - Cpuset verification of mems.sysram isolation


Gregory Price (8):
  numa,memory_hotplug: create N_PRIVATE (Private Nodes)
  mm: constify oom_control, scan_control, and alloc_context nodemask
  mm: restrict slub, compaction, and page_alloc to sysram
  cpuset: introduce cpuset.mems.sysram
  Documentation/admin-guide/cgroups: update docs for mems_allowed
  drivers/cxl/core/region: add private_region
  mm/zswap: compressed ram direct integration
  drivers/cxl: add zswap private_region type

 .../admin-guide/cgroup-v1/cpusets.rst         |  19 +-
 Documentation/admin-guide/cgroup-v2.rst       |  26 ++-
 Documentation/filesystems/proc.rst            |   2 +-
 drivers/base/node.c                           | 199 ++++++++++++++++++
 drivers/cxl/core/Makefile                     |   1 +
 drivers/cxl/core/core.h                       |   4 +
 drivers/cxl/core/port.c                       |   4 +
 drivers/cxl/core/private_region/Makefile      |  12 ++
 .../cxl/core/private_region/private_region.c  | 129 ++++++++++++
 .../cxl/core/private_region/private_region.h  |  14 ++
 drivers/cxl/core/private_region/zswap.c       | 127 +++++++++++
 drivers/cxl/core/region.c                     |  63 +++++-
 drivers/cxl/cxl.h                             |  22 ++
 include/linux/cpuset.h                        |  24 ++-
 include/linux/gfp.h                           |   6 +
 include/linux/mm.h                            |   4 +-
 include/linux/mmzone.h                        |   6 +-
 include/linux/node.h                          |  60 ++++++
 include/linux/nodemask.h                      |   1 +
 include/linux/oom.h                           |   2 +-
 include/linux/swap.h                          |   2 +-
 include/linux/zswap.h                         |   5 +
 kernel/cgroup/cpuset-internal.h               |   8 +
 kernel/cgroup/cpuset-v1.c                     |   8 +
 kernel/cgroup/cpuset.c                        |  98 ++++++---
 mm/compaction.c                               |   6 +-
 mm/internal.h                                 |   2 +-
 mm/memcontrol.c                               |   2 +-
 mm/memory_hotplug.c                           |   2 +-
 mm/mempolicy.c                                |   6 +-
 mm/migrate.c                                  |   4 +-
 mm/mmzone.c                                   |   5 +-
 mm/page_alloc.c                               |  31 +--
 mm/show_mem.c                                 |   9 +-
 mm/slub.c                                     |   8 +-
 mm/vmscan.c                                   |   6 +-
 mm/zswap.c                                    | 106 +++++++++-
 37 files changed, 942 insertions(+), 91 deletions(-)
 create mode 100644 drivers/cxl/core/private_region/Makefile
 create mode 100644 drivers/cxl/core/private_region/private_region.c
 create mode 100644 drivers/cxl/core/private_region/private_region.h
 create mode 100644 drivers/cxl/core/private_region/zswap.c
---
base-commit: 803dd4b1159cf9864be17aab8a17653e6ecbbbb6

-- 
2.52.0
Re: [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory
Posted by Balbir Singh 3 weeks, 5 days ago
On 1/9/26 06:37, Gregory Price wrote:
> This series introduces N_PRIVATE, a new node state for memory nodes 
> whose memory is not intended for general system consumption.  Today,
> device drivers (CXL, accelerators, etc.) hotplug their memory to access
> mm/ services like page allocation and reclaim, but this exposes general
> workloads to memory with different characteristics and reliability
> guarantees than system RAM.
> 
> N_PRIVATE provides isolation by default while enabling explicit access
> via __GFP_THISNODE for subsystems that understand how to manage these
> specialized memory regions.
> 

I assume each class of N_PRIVATE is a separate set of NUMA nodes, these
could be real or virtual memory nodes?

> Motivation
> ==========
> 
> Several emerging memory technologies require kernel memory management
> services but should not be used for general allocations:
> 
>   - CXL Compressed RAM (CRAM): Hardware-compressed memory where the
>     effective capacity depends on data compressibility.  Uncontrolled
>     use risks capacity exhaustion when compression ratios degrade.
> 
>   - Accelerator Memory: GPU/TPU-attached memory optimized for specific
>     access patterns that are not intended for general allocation.
> 
>   - Tiered Memory: Memory intended only as a demotion target, not for
>     initial allocations.
> 
> Currently, these devices either avoid hotplugging entirely (losing mm/
> services) or hotplug as regular N_MEMORY (risking reliability issues).
> N_PRIVATE solves this by creating an isolated node class.
> 
> Design
> ======
> 
> The series introduces:
> 
>   1. N_PRIVATE node state (mutually exclusive with N_MEMORY)

We should call it N_PRIVATE_MEMORY

>   2. private_memtype enum for policy-based access control
>   3. cpuset.mems.sysram for user-visible isolation
>   4. Integration points for subsystems (zswap demonstrated)
>   5. A cxl private_region example to demonstrate full plumbing
> 
> Private Memory Types (private_memtype)
> ======================================
> 
> The private_memtype enum defines policy bits that control how different
> kernel subsystems may access private nodes:
> 
>   enum private_memtype {
>       NODE_MEM_NOTYPE,      /* No type assigned (invalid state) */
>       NODE_MEM_ZSWAP,       /* Swap compression target */
>       NODE_MEM_COMPRESSED,  /* General compressed RAM */
>       NODE_MEM_ACCELERATOR, /* Accelerator-attached memory */
>       NODE_MEM_DEMOTE_ONLY, /* Memory-tier demotion target only */
>       NODE_MAX_MEMTYPE,
>   };
> 
> These types serve as policy hints for subsystems:
> 

Do these nodes have fallback(s)? Are these nodes prone to OOM when memory is exhausted
in one class of N_PRIVATE node(s)?


What about page cache allocation from these nodes? Since default allocations
never use them, a file system would need to do additional work to allocate
on them, if there was ever a desire to use them. Would memory
migration work between N_PRIVATE and N_MEMORY using move_pages()?


> NODE_MEM_ZSWAP
> --------------
> Nodes with this type are registered as zswap compression targets.  When
> zswap compresses a page, it can allocate directly from ZSWAP-typed nodes
> using __GFP_THISNODE, bypassing software compression if the device
> provides hardware compression.
> 
> Example flow:
>   1. CXL device creates private_region with type=zswap
>   2. Driver calls node_register_private() with NODE_MEM_ZSWAP
>   3. zswap_add_direct_node() registers the node as a compression target
>   4. On swap-out, zswap allocates from the private node
>   5. page_allocated() callback validates compression ratio headroom
>   6. page_freed() callback zeros pages to improve device compression
> 
> Prototype Note:
>   This patch set does not actually do compression ratio validation, as
>   this requires an actual device to provide some kind of counter and/or
>   interrupt to denote when allocations are safe.  The callbacks are
>   left as stubs with TODOs for device vendors to pick up the next step
>   (we'll continue with a QEMU example if reception is positive).
> 
>   For now, this always succeeds because compressed=real capacity.
> 
> NODE_MEM_COMPRESSED (CRAM)
> --------------------------
> For general compressed RAM devices.  Unlike ZSWAP nodes, CRAM nodes
> could be exposed to subsystems that understand compression semantics:
> 
>   - vmscan: Could prefer demoting pages to CRAM nodes before swap
>   - memory-tiering: Could place CRAM between DRAM and persistent memory
>   - zram: Could use as backing store instead of or alongside zswap
> 
> Such a component (mm/cram.c) would differ from zswap or zram by allowing
> the compressed pages to remain mapped Read-Only in the page table.
> 
> NODE_MEM_ACCELERATOR
> --------------------
> For GPU/TPU/accelerator-attached memory.  Policy implications:
> 
>   - Default allocations: Never (isolated from general page_alloc)
>   - GPU drivers: Explicit allocation via __GFP_THISNODE
>   - NUMA balancing: Excluded from automatic migration
>   - Memory tiering: Not a demotion target
> 
> Some GPU vendors want management of their memory via NUMA nodes, but
> don't want fallback or migration allocations to occur.  This enables
> that pattern.
> 
> mm/mempolicy.c could be used to allow for N_PRIVATE nodes of this type
> if the intent is per-vma access to accelerator memory (e.g. via mbind)
> but this is omitted from this series from now to limit userland
> exposure until first class examples are provided.
> 
> NODE_MEM_DEMOTE_ONLY
> --------------------
> For memory intended exclusively as a demotion target in memory tiering:
> 
>   - page_alloc: Never allocates initially (slab, page faults, etc.)
>   - vmscan/reclaim: Valid demotion target during memory pressure
>   - memory-tiering: Allow hotness monitoring/promotion for this region
> 
> This enables "cold storage" tiers using slower/cheaper memory (CXL-
> attached DRAM, persistent memory in volatile mode) without the memory
> appearing in allocation fast paths.
> 
> This also adds some additional bonus of enforcing memory placement on
> these nodes to be movable allocations only (with all the normal caveats
> around page pinning).
> 
> Subsystem Integration Points
> ============================
> 
> The private_node_ops structure provides callbacks for integration:
> 
>   struct private_node_ops {
>       struct list_head list;
>       resource_size_t res_start;
>       resource_size_t res_end;
>       enum private_memtype memtype;
>       int (*page_allocated)(struct page *page, void *data);
>       void (*page_freed)(struct page *page, void *data);
>       void *data;
>   };
> 
> page_allocated(): Called after allocation, returns 0 to accept or
> -ENOSPC/-ENODEV to reject (caller retries elsewhere).  Enables:
>   - Compression ratio enforcement for CRAM/zswap
>   - Capacity tracking for accelerator memory
>   - Rate limiting for demotion targets
> 
> page_freed(): Called on free, enables:
>   - Zeroing for compression ratio recovery
>   - Capacity accounting updates
>   - Device-specific cleanup
> 
> Isolation Enforcement
> =====================
> 
> The series modifies core allocators to respect N_PRIVATE isolation:
> 
>   - page_alloc: Constrains zone iteration to cpuset.mems.sysram
>   - slub: Allocates only from N_MEMORY nodes
>   - compaction: Skips N_PRIVATE nodes
>   - mempolicy: Uses sysram_nodes for policy evaluation
> 
> __GFP_THISNODE bypasses isolation, enabling explicit access:
> 
>   page = alloc_pages_node(private_nid, GFP_KERNEL | __GFP_THISNODE, 0);
> 
> This pattern is used by zswap, and would be used by other subsystems
> that explicitly opt into private node access.
> 
> User-Visible Changes
> ====================
> 
> cpuset gains cpuset.mems.sysram (read-only), shows N_MEMORY nodes.
> 
> ABI: /proc/<pid>/status Mems_allowed shows sysram nodes only.
> 
> Drivers create private regions via sysfs:
>   echo region0 > /sys/bus/cxl/.../create_private_region
>   echo zswap > /sys/bus/cxl/.../region0/private_type
>   echo 1 > /sys/bus/cxl/.../region0/commit
> 
> Series Organization
> ===================
> 
> Patch 1: numa,memory_hotplug: create N_PRIVATE (Private Nodes)
>   Core infrastructure: N_PRIVATE node state, node_mark_private(),
>   private_memtype enum, and private_node_ops registration.
> 
> Patch 2: mm: constify oom_control, scan_control, and alloc_context 
> nodemask
>   Preparatory cleanup for enforcing that nodemasks don't change.
> 
> Patch 3: mm: restrict slub, compaction, and page_alloc to sysram
>   Enforce N_MEMORY-only allocation for general paths.
> 
> Patch 4: cpuset: introduce cpuset.mems.sysram
>   User-visible isolation via cpuset interface.
> 
> Patch 5: Documentation/admin-guide/cgroups: update docs for mems_allowed
>   Document the new behavior and sysram_nodes.
> 
> Patch 6: drivers/cxl/core/region: add private_region
>   CXL infrastructure for private regions.
> 
> Patch 7: mm/zswap: compressed ram direct integration
>   Zswap integration demonstrating direct hardware compression.
> 
> Patch 8: drivers/cxl: add zswap private_region type
>   Complete example: CXL region as zswap compression target.
> 
> Future Work
> ===========
> 
> This series provides the foundation.  Planned follow-ups include:
> 
>   - CRAM integration with vmscan for smart demotion
>   - ACCELERATOR type for GPU memory management
>   - Memory-tiering integration with DEMOTE_ONLY nodes
> 
> Testing
> =======
> 
> All patches build cleanly.  Tested with:
>   - CXL QEMU emulation with private regions
>   - Zswap stress tests with private compression targets
>   - Cpuset verification of mems.sysram isolation
> 
> 
> Gregory Price (8):
>   numa,memory_hotplug: create N_PRIVATE (Private Nodes)
>   mm: constify oom_control, scan_control, and alloc_context nodemask
>   mm: restrict slub, compaction, and page_alloc to sysram
>   cpuset: introduce cpuset.mems.sysram
>   Documentation/admin-guide/cgroups: update docs for mems_allowed
>   drivers/cxl/core/region: add private_region
>   mm/zswap: compressed ram direct integration
>   drivers/cxl: add zswap private_region type
> 
>  .../admin-guide/cgroup-v1/cpusets.rst         |  19 +-
>  Documentation/admin-guide/cgroup-v2.rst       |  26 ++-
>  Documentation/filesystems/proc.rst            |   2 +-
>  drivers/base/node.c                           | 199 ++++++++++++++++++
>  drivers/cxl/core/Makefile                     |   1 +
>  drivers/cxl/core/core.h                       |   4 +
>  drivers/cxl/core/port.c                       |   4 +
>  drivers/cxl/core/private_region/Makefile      |  12 ++
>  .../cxl/core/private_region/private_region.c  | 129 ++++++++++++
>  .../cxl/core/private_region/private_region.h  |  14 ++
>  drivers/cxl/core/private_region/zswap.c       | 127 +++++++++++
>  drivers/cxl/core/region.c                     |  63 +++++-
>  drivers/cxl/cxl.h                             |  22 ++
>  include/linux/cpuset.h                        |  24 ++-
>  include/linux/gfp.h                           |   6 +
>  include/linux/mm.h                            |   4 +-
>  include/linux/mmzone.h                        |   6 +-
>  include/linux/node.h                          |  60 ++++++
>  include/linux/nodemask.h                      |   1 +
>  include/linux/oom.h                           |   2 +-
>  include/linux/swap.h                          |   2 +-
>  include/linux/zswap.h                         |   5 +
>  kernel/cgroup/cpuset-internal.h               |   8 +
>  kernel/cgroup/cpuset-v1.c                     |   8 +
>  kernel/cgroup/cpuset.c                        |  98 ++++++---
>  mm/compaction.c                               |   6 +-
>  mm/internal.h                                 |   2 +-
>  mm/memcontrol.c                               |   2 +-
>  mm/memory_hotplug.c                           |   2 +-
>  mm/mempolicy.c                                |   6 +-
>  mm/migrate.c                                  |   4 +-
>  mm/mmzone.c                                   |   5 +-
>  mm/page_alloc.c                               |  31 +--
>  mm/show_mem.c                                 |   9 +-
>  mm/slub.c                                     |   8 +-
>  mm/vmscan.c                                   |   6 +-
>  mm/zswap.c                                    | 106 +++++++++-
>  37 files changed, 942 insertions(+), 91 deletions(-)
>  create mode 100644 drivers/cxl/core/private_region/Makefile
>  create mode 100644 drivers/cxl/core/private_region/private_region.c
>  create mode 100644 drivers/cxl/core/private_region/private_region.h
>  create mode 100644 drivers/cxl/core/private_region/zswap.c
> ---
> base-commit: 803dd4b1159cf9864be17aab8a17653e6ecbbbb6
> 

Thanks,
Balbir
Re: [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory
Posted by Gregory Price 3 weeks, 5 days ago
On Mon, Jan 12, 2026 at 10:12:23PM +1100, Balbir Singh wrote:
> On 1/9/26 06:37, Gregory Price wrote:
> > This series introduces N_PRIVATE, a new node state for memory nodes 
> > whose memory is not intended for general system consumption.  Today,
> > device drivers (CXL, accelerators, etc.) hotplug their memory to access
> > mm/ services like page allocation and reclaim, but this exposes general
> > workloads to memory with different characteristics and reliability
> > guarantees than system RAM.
> > 
> > N_PRIVATE provides isolation by default while enabling explicit access
> > via __GFP_THISNODE for subsystems that understand how to manage these
> > specialized memory regions.
> > 
> 
> I assume each class of N_PRIVATE is a separate set of NUMA nodes, these
> could be real or virtual memory nodes?
>

This has been the topic of a long, long discussion on the CXL discord -
how do we get extra nodes if we intend to make HPA space flexibly
configurable by "intended use".

tl;dr:  open to discussion.  As of right now, there's no way (that I
know of) to allocate additional NUMA nodes at boot without having some
indication that one is needed in the ACPI table (srat touches a PXM, or
CEDT defines a region not present in SRAT).

Best idea we have right now is to have a build config that reserves some
extra nodes which can be used later (they're in N_POSSIBLE but otherwise
not used by anything).

> > Design
> > ======
> > 
> > The series introduces:
> > 
> >   1. N_PRIVATE node state (mutually exclusive with N_MEMORY)
> 
> We should call it N_PRIVATE_MEMORY
>

Dan Williams convinced me to go with N_PRIVATE, but this is really a
bikeshed topic - we could call it N_BOBERT until we find consensus.

> > 
> >   enum private_memtype {
> >       NODE_MEM_NOTYPE,      /* No type assigned (invalid state) */
> >       NODE_MEM_ZSWAP,       /* Swap compression target */
> >       NODE_MEM_COMPRESSED,  /* General compressed RAM */
> >       NODE_MEM_ACCELERATOR, /* Accelerator-attached memory */
> >       NODE_MEM_DEMOTE_ONLY, /* Memory-tier demotion target only */
> >       NODE_MAX_MEMTYPE,
> >   };
> > 
> > These types serve as policy hints for subsystems:
> > 
> 
> Do these nodes have fallback(s)? Are these nodes prone to OOM when memory is exhausted
> in one class of N_PRIVATE node(s)?
> 

Right now, these nodes do not have fallbacks, and even if they did the
use of __GFP_THISNODE would prevent this.  That's intended.

In theory you could have nodes of similar types fall back to each other,
but that feels like increased complexity for questionable value.  The
service requesting __GFP_THISNODE should be aware that it needs to manage
fallback.

> 
> What about page cache allocation form these nodes? Since default allocations
> never use them, a file system would need to do additional work to allocate
> on them, if there was ever a desire to use them. 

Yes, in fact that is the intent.  Anything requesting memory from these
nodes would need to be aware of how to manage them.

Similar to ZONE_DEVICE memory - which is wholly unmanaged by the page
allocator.  There's potential for re-using some of the ZONE_DEVICE or
HMM callback infrastructure to implement the callbacks for N_PRIVATE
instead of re-inventing it.

> Would memory
> migration would work between N_PRIVATE and N_MEMORY using move_pages()?
> 

N_PRIVATE -> N_MEMORY would probably be easy and trivial, but could also
be a controllable bit.

A side-discussion not present in these notes has been whether memtype
should be an enum or a bitfield.

N_MEMORY -> N_PRIVATE via migrate.c would probably require some changes
to migration_target_control and the alloc callback (in vmscan.c, see
alloc_migrate_folio) would need to be N_PRIVATE aware.


Thanks for taking a look,
~Gregory
Re: [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory
Posted by Yury Norov 3 weeks, 5 days ago
On Mon, Jan 12, 2026 at 09:36:49AM -0500, Gregory Price wrote:
> On Mon, Jan 12, 2026 at 10:12:23PM +1100, Balbir Singh wrote:
> > On 1/9/26 06:37, Gregory Price wrote:
> > > This series introduces N_PRIVATE, a new node state for memory nodes 
> > > whose memory is not intended for general system consumption.  Today,
> > > device drivers (CXL, accelerators, etc.) hotplug their memory to access
> > > mm/ services like page allocation and reclaim, but this exposes general
> > > workloads to memory with different characteristics and reliability
> > > guarantees than system RAM.
> > > 
> > > N_PRIVATE provides isolation by default while enabling explicit access
> > > via __GFP_THISNODE for subsystems that understand how to manage these
> > > specialized memory regions.
> > > 
> > 
> > I assume each class of N_PRIVATE is a separate set of NUMA nodes, these
> > could be real or virtual memory nodes?
> >
> 
> This has the the topic of a long, long discussion on the CXL discord -
> how do we get extra nodes if we intend to make HPA space flexibly
> configurable by "intended use".
> 
> tl;dr:  open to discussion.  As of right now, there's no way (that I
> know of) to allocate additional NUMA nodes at boot without having some
> indication that one is needed in the ACPI table (srat touches a PXM, or
> CEDT defines a region not present in SRAT).
> 
> Best idea we have right now is to have a build config that reserves some
> extra nodes which can be used later (they're in N_POSSIBLE but otherwise
> not used by anything).
> 
> > > Design
> > > ======
> > > 
> > > The series introduces:
> > > 
> > >   1. N_PRIVATE node state (mutually exclusive with N_MEMORY)
> > 
> > We should call it N_PRIVATE_MEMORY
> >
> 
> Dan Williams convinced me to go with N_PRIVATE, but this is really a
> bikeshed topic

No it's not. To me (OK, an almost random reader in this discussion),
N_PRIVATE is a pretty confusing name. It doesn't answer the question:
private what? N_PRIVATE_MEMORY is better in that department, isn't it?

But taking into account isolcpus, maybe N_ISOLMEM?

> - we could call it N_BOBERT until we find consensus.

Please give it the right name well describing the scope and purpose of
the new restriction policy before moving forward.
 
> > >   enum private_memtype {
> > >       NODE_MEM_NOTYPE,      /* No type assigned (invalid state) */
> > >       NODE_MEM_ZSWAP,       /* Swap compression target */
> > >       NODE_MEM_COMPRESSED,  /* General compressed RAM */
> > >       NODE_MEM_ACCELERATOR, /* Accelerator-attached memory */
> > >       NODE_MEM_DEMOTE_ONLY, /* Memory-tier demotion target only */
> > >       NODE_MAX_MEMTYPE,
> > >   };
> > > 
> > > These types serve as policy hints for subsystems:
> > > 
> > 
> > Do these nodes have fallback(s)? Are these nodes prone to OOM when memory is exhausted
> > in one class of N_PRIVATE node(s)?
> > 
> 
> Right now, these nodes do not have fallbacks, and even if they did the
> use of __GFP_THISNODE would prevent this.  That's intended.
> 
> In theory you could have nodes of similar types fall back to each other,
> but that feels like increased complexity for questionable value.  The
> service requested __GFP_THISNODE should be aware that it needs to manage
> fallback.

Yeah, and most GFP_THISNODE users also pass GFP_NOWARN, which makes it
look more like an emergency feature. Maybe add a symmetric GFP_PRIVATE
flag that would allow for more flexibility, and highlight the intention
better?

> > What about page cache allocation form these nodes? Since default allocations
> > never use them, a file system would need to do additional work to allocate
> > on them, if there was ever a desire to use them. 
> 
> Yes, in-fact that is the intent.  Anything requesting memory from these
> nodes would need to be aware of how to manage them.
> 
> Similar to ZONE_DEVICE memory - which is wholly unmanaged by the page

This is quite the opposite of what you are saying in the motivation
section:

  Several emerging memory technologies require kernel memory management
  services but should not be used for general allocations

So, is it completely unmanaged node, or only general allocation isolated?

Thanks,
Yury

> allocator.  There's potential for re-using some of the ZONE_DEVICE or
> HMM callback infrastructure to implement the callbacks for N_PRIVATE
> instead of re-inventing it.
> 
> > Would memory
> > migration would work between N_PRIVATE and N_MEMORY using move_pages()?
> > 
> 
> N_PRIVATE -> N_MEMORY would probably be easy and trivial, but could also
> be a controllable bit.
> 
> A side-discussion not present in these notes has been whether memtype
> should be an enum or a bitfield.
> 
> N_MEMORY -> N_PRIVATE via migrate.c would probably require some changes
> to migration_target_control and the alloc callback (in vmscan.c, see
> alloc_migrate_folio) would need to be N_PRIVATE aware.
> 
> 
> Thanks for taking a look,
> ~Gregory
Re: [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory
Posted by dan.j.williams@intel.com 3 weeks, 5 days ago
Yury Norov wrote:
[..]
> > Dan Williams convinced me to go with N_PRIVATE, but this is really a
> > bikeshed topic
> 
> No it's not. To me (OK, an almost random reader in this discussion),
> N_PRIVATE is a pretty confusing name. It doesn't answer the question:
> private what? N_PRIVATE_MEMORY is better in that department, isn't?
> 
> But taking into account isolcpus, maybe N_ISOLMEM?
> 
> > - we could call it N_BOBERT until we find consensus.
> 
> Please give it the right name well describing the scope and purpose of
> the new restriction policy before moving forward.

...this is the definition of a bikeshed discussion, and bikesheds are
important for building consensus. The argument for N_PRIVATE comes from
looking at this from the perspective of the other node_states that do
not have the _MEMORY designation, particularly _ONLINE, and from the
fact that the existing _MEMORY states carry zone implications whereas
N_PRIVATE can span zones.

I agree with Gregory the name does not matter as much as the
documentation explaining what the name means. I am ok if others do not
sign onto the rationale for why not include _MEMORY, but lets capture
something that tries to clarify that this is a unique node state that
can have "all of the above" memory types relative to the existing
_MEMORY states.
Re: [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory
Posted by Balbir Singh 3 weeks, 5 days ago
On 1/13/26 07:24, dan.j.williams@intel.com wrote:
> Yury Norov wrote:
> [..]
>>> Dan Williams convinced me to go with N_PRIVATE, but this is really a
>>> bikeshed topic
>>
>> No it's not. To me (OK, an almost random reader in this discussion),
>> N_PRIVATE is a pretty confusing name. It doesn't answer the question:
>> private what? N_PRIVATE_MEMORY is better in that department, isn't?
>>
>> But taking into account isolcpus, maybe N_ISOLMEM?
>>
>>> - we could call it N_BOBERT until we find consensus.
>>
>> Please give it the right name well describing the scope and purpose of
>> the new restriction policy before moving forward.
> 
> ...this is the definition of a bikeshed discussion, and bikeshed's are
> important for building consensus. The argument for N_PRIVATE is with
> respect to looking at this from the perspective of the other node_states
> that do not have the _MEMORY designation particularly _ONLINE and the
> fact that the other _MEMORY states implied zone implications whereas
> N_PRIVATE can span zones.
> 
> I agree with Gregory the name does not matter as much as the
> documentation explaining what the name means. I am ok if others do not
> sign onto the rationale for why not include _MEMORY, but lets capture
> something that tries to clarify that this is a unique node state that
> can have "all of the above" memory types relative to the existing
> _MEMORY states.
> 

To me, N_ is a common prefix, we do have N_HIGH_MEMORY, N_NORMAL_MEMORY.
N_PRIVATE does not tell me if it's CPU or memory related.

Balbir
Re: [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory
Posted by dan.j.williams@intel.com 3 weeks, 5 days ago
Balbir Singh wrote:
[..]
> > I agree with Gregory the name does not matter as much as the
> > documentation explaining what the name means. I am ok if others do not
> > sign onto the rationale for why not include _MEMORY, but lets capture
> > something that tries to clarify that this is a unique node state that
> > can have "all of the above" memory types relative to the existing
> > _MEMORY states.
> > 
> 
> To me, N_ is a common prefix, we do have N_HIGH_MEMORY, N_NORMAL_MEMORY.
> N_PRIVATE does not tell me if it's CPU or memory related.

True that confusion about whether N_PRIVATE can apply to CPUs is there.
How about split the difference and call this:

    N_MEM_PRIVATE

To make it both distinct from _MEMORY and _HIGH_MEMORY which describe
ZONE limitations and distinct from N_CPU.
Re: [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory
Posted by Balbir Singh 3 weeks, 5 days ago
On 1/13/26 08:10, dan.j.williams@intel.com wrote:
> Balbir Singh wrote:
> [..]
>>> I agree with Gregory the name does not matter as much as the
>>> documentation explaining what the name means. I am ok if others do not
>>> sign onto the rationale for why not include _MEMORY, but lets capture
>>> something that tries to clarify that this is a unique node state that
>>> can have "all of the above" memory types relative to the existing
>>> _MEMORY states.
>>>
>>
>> To me, N_ is a common prefix, we do have N_HIGH_MEMORY, N_NORMAL_MEMORY.
>> N_PRIVATE does not tell me if it's CPU or memory related.
> 
> True that confusion about whether N_PRIVATE can apply to CPUs is there.
> How about split the difference and call this:
> 
>     N_MEM_PRIVATE
> 
> To make it both distinct from _MEMORY and _HIGH_MEMORY which describe
> ZONE limitations and distinct from N_CPU.

I'd be open to that name, how about N_MEMORY_PRIVATE? So then N_MEMORY
becomes (N_MEMORY_PUBLIC by default)

Balbir
Re: [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory
Posted by Gregory Price 3 weeks, 5 days ago
On Tue, Jan 13, 2026 at 09:54:32AM +1100, Balbir Singh wrote:
> On 1/13/26 08:10, dan.j.williams@intel.com wrote:
> > Balbir Singh wrote:
> > [..]
> >>> I agree with Gregory the name does not matter as much as the
> >>> documentation explaining what the name means. I am ok if others do not
> >>> sign onto the rationale for why not include _MEMORY, but lets capture
> >>> something that tries to clarify that this is a unique node state that
> >>> can have "all of the above" memory types relative to the existing
> >>> _MEMORY states.
> >>>
> >>
> >> To me, N_ is a common prefix, we do have N_HIGH_MEMORY, N_NORMAL_MEMORY.
> >> N_PRIVATE does not tell me if it's CPU or memory related.
> > 
> > True that confusion about whether N_PRIVATE can apply to CPUs is there.
> > How about split the difference and call this:
> > 
> >     N_MEM_PRIVATE
> > 
> > To make it both distinct from _MEMORY and _HIGH_MEMORY which describe
> > ZONE limitations and distinct from N_CPU.
> 
> I'd be open to that name, how about N_MEMORY_PRIVATE? So then N_MEMORY
> becomes (N_MEMORY_PUBLIC by default)
>

N_MEMORY_PUBLIC is forcing everyone else to change for the sake of a new
feature; better to keep it N_MEM[ORY]_PRIVATE if anything

~Gregory
Re: [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory
Posted by dan.j.williams@intel.com 3 weeks, 5 days ago
Gregory Price wrote:
> On Tue, Jan 13, 2026 at 09:54:32AM +1100, Balbir Singh wrote:
> > On 1/13/26 08:10, dan.j.williams@intel.com wrote:
> > > Balbir Singh wrote:
> > > [..]
> > >>> I agree with Gregory the name does not matter as much as the
> > >>> documentation explaining what the name means. I am ok if others do not
> > >>> sign onto the rationale for why not include _MEMORY, but lets capture
> > >>> something that tries to clarify that this is a unique node state that
> > >>> can have "all of the above" memory types relative to the existing
> > >>> _MEMORY states.
> > >>>
> > >>
> > >> To me, N_ is a common prefix, we do have N_HIGH_MEMORY, N_NORMAL_MEMORY.
> > >> N_PRIVATE does not tell me if it's CPU or memory related.
> > > 
> > > True that confusion about whether N_PRIVATE can apply to CPUs is there.
> > > How about split the difference and call this:
> > > 
> > >     N_MEM_PRIVATE
> > > 
> > > To make it both distinct from _MEMORY and _HIGH_MEMORY which describe
> > > ZONE limitations and distinct from N_CPU.
> > 
> > I'd be open to that name, how about N_MEMORY_PRIVATE? So then N_MEMORY
> > becomes (N_MEMORY_PUBLIC by default)
> >
> 
> N_MEMORY_PUBLIC is forcing everyone else to change for the sake a new
> feature, better to keep it N_MEM[ORY]_PRIVATE if anything

I think what Balbir is saying is that the _PUBLIC is implied and can be
omitted. It is true that N_MEMORY[_PUBLIC] already indicates multi-zone
support. So N_MEMORY_PRIVATE makes sense to me as something that is
distinct from N_{HIGH,NORMAL}_MEMORY which are subsets of N_MEMORY.
Distinct to prompt "go read the documentation to figure out why this
thing looks not like the others".
Re: [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory
Posted by Gregory Price 3 weeks, 4 days ago
On Mon, Jan 12, 2026 at 05:17:53PM -0800, dan.j.williams@intel.com wrote:
> 
> I think what Balbir is saying is that the _PUBLIC is implied and can be
> omitted. It is true that N_MEMORY[_PUBLIC] already indicates multi-zone
> support. So N_MEMORY_PRIVATE makes sense to me as something that it is
> distinct from N_{HIGH,NORMAL}_MEMORY which are subsets of N_MEMORY.
> Distinct to prompt "go read the documentation to figure out why this
> thing looks not like the others".

Ah, ack.  Will update for v4 once I give some thought to the compression
stuff and the cgroups notes.

I would love if the ZONE_DEVICE folks could also chime in on whether the
callback structures for pgmap and hmm might be re-usable here, but might
take a few more versions to get the attention of everyone.

~Gregory
Re: [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory
Posted by Balbir Singh 3 weeks, 4 days ago
On 1/13/26 12:30, Gregory Price wrote:
> On Mon, Jan 12, 2026 at 05:17:53PM -0800, dan.j.williams@intel.com wrote:
>>
>> I think what Balbir is saying is that the _PUBLIC is implied and can be
>> omitted. It is true that N_MEMORY[_PUBLIC] already indicates multi-zone
>> support. So N_MEMORY_PRIVATE makes sense to me as something that it is
>> distinct from N_{HIGH,NORMAL}_MEMORY which are subsets of N_MEMORY.
>> Distinct to prompt "go read the documentation to figure out why this
>> thing looks not like the others".
> 
> Ah, ack.  Will update for v4 once i give some thought to the compression
> stuff and the cgroups notes.
> 
> I would love if the ZONE_DEVICE folks could also chime in on whether the
> callback structures for pgmap and hmm might be re-usable here, but might
> take a few more versions to get the attention of everyone.
> 

I see ZONE_DEVICE as a parallel construct to N_MEMORY_PRIVATE. ZONE_DEVICE
is memory managed by devices and already isolated from the allocator. Do you
see a need for both? I do see the need for migration between the two, but
I suspect you want to have ZONE_DEVICE as a valid zone inside of N_MEMORY_PRIVATE?

Balbir
Re: [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory
Posted by Gregory Price 3 weeks, 4 days ago
On Tue, Jan 13, 2026 at 02:24:40PM +1100, Balbir Singh wrote:
> On 1/13/26 12:30, Gregory Price wrote:
> > On Mon, Jan 12, 2026 at 05:17:53PM -0800, dan.j.williams@intel.com wrote:
> >>
> >> I think what Balbir is saying is that the _PUBLIC is implied and can be
> >> omitted. It is true that N_MEMORY[_PUBLIC] already indicates multi-zone
> >> support. So N_MEMORY_PRIVATE makes sense to me as something that it is
> >> distinct from N_{HIGH,NORMAL}_MEMORY which are subsets of N_MEMORY.
> >> Distinct to prompt "go read the documentation to figure out why this
> >> thing looks not like the others".
> > 
> > Ah, ack.  Will update for v4 once i give some thought to the compression
> > stuff and the cgroups notes.
> > 
> > I would love if the ZONE_DEVICE folks could also chime in on whether the
> > callback structures for pgmap and hmm might be re-usable here, but might
> > take a few more versions to get the attention of everyone.
> > 
> 
> I see ZONE_DEVICE as a parallel construct to N_MEMORY_PRIVATE. ZONE_DEVICE
> is memory managed by devices and already isolated from the allocator. Do you
> see a need for both? I do see the need for migration between the two, but
> I suspect you want to have ZONE_DEVICE as a valid zone inside of N_MEMORY_PRIVATE?
> 

I see N_MEMORY_PRIVATE replacing some ZONE_DEVICE patterns.

N_MEMORY_PRIVATE essentially means some driver controls how allocation
occurs, and some components of mm/ can be enlightened to allow certain
types of N_MEMORY_PRIVATE nodes to be used directly (e.g. DEMOTE_ONLY
nodes could be used by vmscan.c but not by page_alloc.c as a fallback
node).

But you could totally have a driver hotplug an N_PRIVATE node and not
register the NID anywhere.  In that case the driver would allow
allocation either via something like

fd = open("/dev/my_driver_file", ...);
buf = mmap(fd, ...);
buf[0] = 0x0;
/* Fault into driver, driver allocates w/ __GFP flag for private node */

or just some ioctl()

ioctl(fd, ALLOC_SOME_MEMORY, ...);

The driver wouldn't have to reimplement allocator logic, and could
register its own set of callbacks to manage how the memory is allowed to
be mapped into page tables and such (my understanding is hmm.c already
has some of this control, that could be re-used - and pgmap exists for
ZONE_DEVICE, this could be re-used in some way).

~Gregory
Re: [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory
Posted by dan.j.williams@intel.com 3 weeks, 4 days ago
Gregory Price wrote:
> On Mon, Jan 12, 2026 at 05:17:53PM -0800, dan.j.williams@intel.com wrote:
> > 
> > I think what Balbir is saying is that the _PUBLIC is implied and can be
> > omitted. It is true that N_MEMORY[_PUBLIC] already indicates multi-zone
> > support. So N_MEMORY_PRIVATE makes sense to me as something that it is
> > distinct from N_{HIGH,NORMAL}_MEMORY which are subsets of N_MEMORY.
> > Distinct to prompt "go read the documentation to figure out why this
> > thing looks not like the others".
> 
> Ah, ack.  Will update for v4 once i give some thought to the compression
> stuff and the cgroups notes.
> 
> I would love if the ZONE_DEVICE folks could also chime in on whether the
> callback structures for pgmap and hmm might be re-usable here, but might
> take a few more versions to get the attention of everyone.

page->pgmap clobbers page->lru, i.e. they share the same union, so you
could not directly use the current ZONE_DEVICE scheme. That is because
current ZONE_DEVICE scheme needs to support ZONE_DEVICE mixed with
ZONE_NORMAL + ZONE_MOVABLE in the same node.

However, with N_MEMORY_PRIVATE effectively enabling a "node per device"
construct, you could move 'struct dev_pagemap' to node scope. I.e.
rather than annotate each page with which device it belongs teach
pgmap->ops callers to consider that the dev_pagemap instance may come
from the node instead.
Re: [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory
Posted by Gregory Price 3 weeks, 4 days ago
On Mon, Jan 12, 2026 at 07:12:10PM -0800, dan.j.williams@intel.com wrote:
> Gregory Price wrote:
> > On Mon, Jan 12, 2026 at 05:17:53PM -0800, dan.j.williams@intel.com wrote:
> > > 
> > > I think what Balbir is saying is that the _PUBLIC is implied and can be
> > > omitted. It is true that N_MEMORY[_PUBLIC] already indicates multi-zone
> > > support. So N_MEMORY_PRIVATE makes sense to me as something that it is
> > > distinct from N_{HIGH,NORMAL}_MEMORY which are subsets of N_MEMORY.
> > > Distinct to prompt "go read the documentation to figure out why this
> > > thing looks not like the others".
> > 
> > Ah, ack.  Will update for v4 once i give some thought to the compression
> > stuff and the cgroups notes.
> > 
> > I would love if the ZONE_DEVICE folks could also chime in on whether the
> > callback structures for pgmap and hmm might be re-usable here, but might
> > take a few more versions to get the attention of everyone.
> 
> page->pgmap clobbers page->lru, i.e. they share the same union, so you
> could not directly use the current ZONE_DEVICE scheme. That is because
> current ZONE_DEVICE scheme needs to support ZONE_DEVICE mixed with
> ZONE_NORMAL + ZONE_MOVABLE in the same node.
> 
> However, with N_MEMORY_PRIVATE effectively enabling a "node per device"
> construct, you could move 'struct dev_pagemap' to node scope. I.e.
> rather than annotate each page with which device it belongs teach
> pgmap->ops callers to consider that the dev_pagemap instance may come
> from the node instead.

Hmmmmmmm... this is interesting.

should be able to do that cleanly with page_pgmap() and/or folio_pgmap()
and update direct accessors.

probably we'd want a mildly different pattern for N_PRIVATE that does
something like

if (is_private_page(page)) {
	... send to private router ...
}

bool is_private_page(struct page *page)
{
	pg_data_t *pgdat = NODE_DATA(page_to_nid(page));

	return pgdat && pgdat->pgmap;

	/* or this, but seems less efficient */
	return node_state(page_to_nid(page), N_PRIVATE);
}

Then we can add all the callbacks to pgmap instead of dumping them in
node.c.  Shouldn't affect any existing users, since this doesn't
intersect with ZONE_DEVICE.

Technically you COULD have ZONE_DEVICE in N_PRIVATE, but that would be
per-page pgmap, and probably you'd have to have the private router
handle the is_device_page() pattern like everyone else does.

(Seems pointless though, feels like N_PRIVATE replaces ZONE_DEVICE for
 some use cases)

~Gregory
Re: [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory
Posted by Balbir Singh 3 weeks, 5 days ago
On 1/13/26 09:40, Gregory Price wrote:
> On Tue, Jan 13, 2026 at 09:54:32AM +1100, Balbir Singh wrote:
>> On 1/13/26 08:10, dan.j.williams@intel.com wrote:
>>> Balbir Singh wrote:
>>> [..]
>>>>> I agree with Gregory the name does not matter as much as the
>>>>> documentation explaining what the name means. I am ok if others do not
>>>>> sign onto the rationale for why not include _MEMORY, but lets capture
>>>>> something that tries to clarify that this is a unique node state that
>>>>> can have "all of the above" memory types relative to the existing
>>>>> _MEMORY states.
>>>>>
>>>>
>>>> To me, N_ is a common prefix, we do have N_HIGH_MEMORY, N_NORMAL_MEMORY.
>>>> N_PRIVATE does not tell me if it's CPU or memory related.
>>>
>>> True that confusion about whether N_PRIVATE can apply to CPUs is there.
>>> How about split the difference and call this:
>>>
>>>     N_MEM_PRIVATE
>>>
>>> To make it both distinct from _MEMORY and _HIGH_MEMORY which describe
>>> ZONE limitations and distinct from N_CPU.
>>
>> I'd be open to that name, how about N_MEMORY_PRIVATE? So then N_MEMORY
>> becomes (N_MEMORY_PUBLIC by default)
>>
> 
> N_MEMORY_PUBLIC is forcing everyone else to change for the sake a new
> feature, better to keep it N_MEM[ORY]_PRIVATE if anything
> 

No name change needed, I meant to say N_MEMORY implies N_MEMORY_PUBLIC or
is interpreted as such. Consistency tells me PRIVATE and MEMORY are
required in the names (_MEM is inconsistent with N_HIGH_MEMORY and N_NORMAL_MEMORY);
the order can be a choice, and I am OK either way, but I prefer N_PRIVATE_MEMORY.

Balbir
Re: [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for device-managed memory
Posted by Gregory Price 3 weeks, 5 days ago
On Mon, Jan 12, 2026 at 12:18:40PM -0500, Yury Norov wrote:
> On Mon, Jan 12, 2026 at 09:36:49AM -0500, Gregory Price wrote:
> > 
> > Dan Williams convinced me to go with N_PRIVATE, but this is really a
> > bikeshed topic
> 
> No it's not. To me (OK, an almost random reader in this discussion),
> N_PRIVATE is a pretty confusing name. It doesn't answer the question:
> private what? N_PRIVATE_MEMORY is better in that department, isn't?
> 
> But taking into account isolcpus, maybe N_ISOLMEM?
>
> > - we could call it N_BOBERT until we find consensus.
> 
> Please give it the right name well describing the scope and purpose of
> the new restriction policy before moving forward.
>  

"The right name" is a matter of opinion, of which there will be many.

It's been through 3 naming cycles already:

Protected -> SPM -> Private

It'll probably go through 3 more.

I originally named v3 N_PRIVATE_MEMORY, but Dan convinced me to drop to
N_PRIVATE.  We can always %s/N_PRIVATE/N_PRIVATE_MEMORY.

> > > >   enum private_memtype {
> > > >       NODE_MEM_NOTYPE,      /* No type assigned (invalid state) */
> > > >       NODE_MEM_ZSWAP,       /* Swap compression target */
> > > >       NODE_MEM_COMPRESSED,  /* General compressed RAM */
> > > >       NODE_MEM_ACCELERATOR, /* Accelerator-attached memory */
> > > >       NODE_MEM_DEMOTE_ONLY, /* Memory-tier demotion target only */
> > > >       NODE_MAX_MEMTYPE,
> > > >   };
> > > > 
> > > > These types serve as policy hints for subsystems:
> > > > 
> > > 
> > > Do these nodes have fallback(s)? Are these nodes prone to OOM when memory is exhausted
> > > in one class of N_PRIVATE node(s)?
> > > 
> > 
> > Right now, these nodes do not have fallbacks, and even if they did the
> > use of __GFP_THISNODE would prevent this.  That's intended.
> > 
> > In theory you could have nodes of similar types fall back to each other,
> > but that feels like increased complexity for questionable value.  The
> > service requested __GFP_THISNODE should be aware that it needs to manage
> > fallback.
> 
> Yeah, and most GFP_THISNODE users also pass GFP_NOWARN, which makes it
> looking more like an emergency feature. Maybe add a symmetric GFP_PRIVATE
> flag that would allow for more flexibility, and highlight the intention
> better?
> 

I originally added __GFP_SPM_NODE (v2 - equivalent to your suggestion)
and it was requested I try to use __GFP_THISNODE at LPC 2025 in December.

v3 makes this attempt.

This is good feedback suggesting that maybe that's not the best approach,
and maybe we should keep __GFP_SPM_NODE (renamed __GFP_PRIVATE)

> > > What about page cache allocation form these nodes? Since default allocations
> > > never use them, a file system would need to do additional work to allocate
> > > on them, if there was ever a desire to use them. 
> > 
> > Yes, in-fact that is the intent.  Anything requesting memory from these
> > nodes would need to be aware of how to manage them.
> > 
> > Similar to ZONE_DEVICE memory - which is wholly unmanaged by the page
> 
> This is quite opposite to what you are saying in the motivation
> section:
> 
>   Several emerging memory technologies require kernel memory management
>   services but should not be used for general allocations
> 
> So, is it completely unmanaged node, or only general allocation isolated?
> 

Sorry, that wording is definitely confusing. I should have said "can
make use of kernel memory management services".

It's an unmanaged node from the perspective of any existing user (no
existing core service user is exposed to this memory).  But this really
means that it's general-allocation-isolated.

ZONE_DEVICE is an unmanaged zone on a node, while this memory would be
onlined in ZONE_MOVABLE or below (i.e. it otherwise looks like normal
memory, it just can't be allocated from by default).  In theory, we could re-use
ZONE_DEVICE for this, but that's probably a few more RFCs away.

I'm still trying to refine the language around this, thanks for pointing
this out.

~Gregory