This is a code RFC for discussion related to
"Mempolicy is dead, long live memory policy!"
https://lpc.events/event/19/contributions/2143/
base-commit: 24172e0d79900908cf5ebf366600616d29c9b417
(version notes at end)
At LSF 2026, I plan to discuss:
- Why? (In short: shunting to DAX is a failed pattern for users)
- Other designs I considered (mempolicy, cpusets, zone_device)
- Why mempolicy.c and cpusets as-is are insufficient
- SPM types seeking this form of interface (Accelerator, Compression)
- Platform extensions that would be nice to see (SPM-only Bits)
Open Questions
- Single SPM nodemask, or multiple based on features?
- Apply SPM/SysRAM bit on-boot only or at-hotplug?
- Allocate extra "possible" NUMA nodes for flexibility?
- Should SPM Nodes be zone-restricted? (MOVABLE only?)
- How to handle things like reclaim and compaction on these nodes.
With this set, we aim to enable allocation of "special purpose memory"
with the page allocator (mm/page_alloc.c) without exposing the same
memory as "System RAM". Unless a non-userland component explicitly
requests it with the GFP_SPM_NODE flag, memory on these nodes cannot
be allocated.
This isolation mechanism is a requirement for memory policies which
depend on certain sets of memory never being used outside special
interfaces (such as a specific mm/component or driver).
We present an example of using this mechanism within ZSWAP, as if
a "compressed memory node" were present. How to describe the features
of memory present on nodes is left open for comment here and at LPC '26.
Userspace-driven allocations are restricted by the sysram_nodes mask,
nothing in userspace can explicitly request memory from SPM nodes.
Instead, the intent is to create new components which understand memory
features and register those nodes with those components. This abstracts
the hardware complexity away from userland while also not requiring new
memory innovations to carry entirely new allocators.
The ZSwap example demonstrates this with the `mt_spm_nodemask`. This
hack treats all SPM nodes as if they were compressed memory nodes, and
we bypass the software compression logic in zswap in favor of simply
copying memory directly to the allocated page. In a real design
There are 4 major changes in this set:
1) Introducing mt_sysram_nodelist in mm/memory-tiers.c which denotes
the set of nodes which are eligible for use as normal system ram
Some existing users now pass mt_sysram_nodelist into the page
allocator instead of NULL, but a NULL pointer passed in will simply
be replaced by mt_sysram_nodelist anyway. Should a NULL pointer
still make it to the page allocator, SPM node zones will simply be
skipped unless GFP_SPM_NODE is set.
mt_sysram_nodelist is always guaranteed to contain the N_MEMORY nodes
present during __init, but if it is empty, mt_sysram_nodes() will
return NULL to preserve current behavior.
2) The addition of `cpuset.mems.sysram` which restricts allocations to
`mt_sysram_nodes` unless GFP_SPM_NODE is used.
SPM Nodes are still allowed in cpuset.mems.allowed and effective.
This is done to allow separate control over sysram and SPM node sets
by cgroups while maintaining the existing hierarchical rules.
current cpuset configuration
cpuset.mems_allowed
|.mems_effective < (mems_allowed ∩ parent.mems_effective)
|->tasks.mems_allowed < cpuset.mems_effective
new cpuset configuration
cpuset.mems_allowed
|.mems_effective < (mems_allowed ∩ parent.mems_effective)
|.sysram_nodes < (mems_effective ∩ default_sys_nodemask)
|->task.sysram_nodes < cpuset.sysram_nodes
This means mems_allowed still restricts all node usage in any given
task context, which is the existing behavior.
3) Addition of MHP_SPM_NODE flag to instruct memory_hotplug.c that the
capacity being added should mark the node as an SPM Node.
A node is either SysRAM or SPM - never both. Attempting to add
incompatible memory to a node results in hotplug failure.
DAX and CXL are made aware of the bit and have `spm_node` bits added
to their relevant subsystems.
4) Adding GFP_SPM_NODE - which allows page_alloc.c to request memory
from the provided node or nodemask. It changes the behavior of
the cpuset mems_allowed and mt_node_allowed() checks.
v1->v2:
- naming improvements
default_node -> sysram_node
protected -> spm (Specific Purpose Memory)
- add missing constify patch
- add patch to update callers of __cpuset_zone_allowed
- add additional logic to the mm sysram_nodes patch
- fix bot build issues (ifdef config builds)
- fix out-of-tree driver build issues (function renames)
- change compressed_nodelist to spm_nodelist
- add latch mechanism for sysram/spm nodes (Dan Williams)
this drops some extra memory-hotplug logic which is nice
v1: https://lore.kernel.org/linux-mm/20251107224956.477056-1-gourry@gourry.net/
Gregory Price (11):
mm: constify oom_control, scan_control, and alloc_context nodemask
mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed
gfp: Add GFP_SPM_NODE for Specific Purpose Memory (SPM) allocations
memory-tiers: Introduce SysRAM and Specific Purpose Memory Nodes
mm: restrict slub, oom, compaction, and page_alloc to sysram by
default
mm,cpusets: rename task->mems_allowed to task->sysram_nodes
cpuset: introduce cpuset.mems.sysram
mm/memory_hotplug: add MHP_SPM_NODE flag
drivers/dax: add spm_node bit to dev_dax
drivers/cxl: add spm_node bit to cxl region
[HACK] mm/zswap: compressed ram integration example
drivers/cxl/core/region.c | 30 ++++++
drivers/cxl/cxl.h | 2 +
drivers/dax/bus.c | 39 ++++++++
drivers/dax/bus.h | 1 +
drivers/dax/cxl.c | 1 +
drivers/dax/dax-private.h | 1 +
drivers/dax/kmem.c | 2 +
fs/proc/array.c | 2 +-
include/linux/cpuset.h | 62 +++++++------
include/linux/gfp_types.h | 5 +
include/linux/memory-tiers.h | 47 ++++++++++
include/linux/memory_hotplug.h | 10 ++
include/linux/mempolicy.h | 2 +-
include/linux/mm.h | 4 +-
include/linux/mmzone.h | 6 +-
include/linux/oom.h | 2 +-
include/linux/sched.h | 6 +-
include/linux/swap.h | 2 +-
init/init_task.c | 2 +-
kernel/cgroup/cpuset-internal.h | 8 ++
kernel/cgroup/cpuset-v1.c | 7 ++
kernel/cgroup/cpuset.c | 158 ++++++++++++++++++++------------
kernel/fork.c | 2 +-
kernel/sched/fair.c | 4 +-
mm/compaction.c | 10 +-
mm/hugetlb.c | 8 +-
mm/internal.h | 2 +-
mm/memcontrol.c | 3 +-
mm/memory-tiers.c | 66 ++++++++++++-
mm/memory_hotplug.c | 7 ++
mm/mempolicy.c | 34 +++----
mm/migrate.c | 4 +-
mm/mmzone.c | 5 +-
mm/oom_kill.c | 11 ++-
mm/page_alloc.c | 57 +++++++-----
mm/show_mem.c | 11 ++-
mm/slub.c | 15 ++-
mm/vmscan.c | 6 +-
mm/zswap.c | 66 ++++++++++++-
39 files changed, 532 insertions(+), 178 deletions(-)
--
2.51.1
On 11/13/25 06:29, Gregory Price wrote:
> This is a code RFC for discussion related to
>
> "Mempolicy is dead, long live memory policy!"
> https://lpc.events/event/19/contributions/2143/
>

:)

I am trying to read through your series, but in the past I tried
https://lwn.net/Articles/720380/

Balbir
On Wed, Nov 26, 2025 at 02:23:23PM +1100, Balbir Singh wrote:
> On 11/13/25 06:29, Gregory Price wrote:
> > This is a code RFC for discussion related to
> >
> > "Mempolicy is dead, long live memory policy!"
> > https://lpc.events/event/19/contributions/2143/
> >
>
> :)
>
> I am trying to read through your series, but in the past I tried
> https://lwn.net/Articles/720380/
>
This is very interesting. I gave the whole RFC a read, and it seems you
arrived at the same conclusion ~8 years ago - that NUMA just plainly
"Feels like the correct abstraction".
First, thank you, the read-through here filled in some holes regarding
HMM-CDM for me. If you have developed any other recent opinions on the
use of HMM-CDM vs NUMA-CDM, your experience is most welcome.
Some observations:
1) You implemented what amounts to N_SPM_NODES
- I find it funny we separately came to the same conclusion. I had
not seen your series while researching this, that should be an
instructive history lesson for readers.
- N_SPM_NODES probably dictates some kind of input from ACPI table
extension, drivers input (like my MHP flag), or kernel configs
(build/init) to make sense.
- I discussed in my note to David that this is probably the right
way to go about doing it. I think N_MEMORY can still be set, if
a new global-default-node policy is created.
- cpuset/global sysram_nodes masks in this set are that policy.
2) You bring up the concept of NUMA node attributes
- I have privately discussed this concept with MM folks, but had
not come around to formalize this. It seems a natural extension.
- I wasn't sure whether such a thing would end up in memory-tiers.c
or somehow abstracted otherwise. We definitely do not want node
attributes to imply infinite N_XXXXX masks.
3) You attacked the problem from the zone iteration mechanism as the
primary allocation filter - while I used cpusets and basically
implemented a new in-kernel policy (sysram_nodes)
- I chose not to take that route (omitting these nodes from N_MEMORY)
precisely because it would require making changes all over the
kernel for components that may want to use the memory which
leverage N_MEMORY for zone iteration.
- Instead, I can see either per-component policies (reclaim->nodes)
or a global policy that covers all of those components (similar to
my sysram_nodes). Drivers would then be responsible to register
their hotplugged memory nodes with those components accordingly.
- My mechanism requires a GFP flag to punch a hole in the isolation,
while yours depends on the fact that page_alloc uses N_MEMORY if
nodemask is not provided. I can see an argument for going that
route instead of the sysram_nodes policy, but I also understand
why removing them from N_MEMORY causes issues (how do you opt these
nodes into core services like kswapd and such).
Interesting discussions to be had.
4) Many commenters tried pushing mempolicy as the place to do this.
We both independently came to the conclusion that
- mempolicy is at best an insufficient mechanism for isolation due
to the way the rest of the system is designed (cpusets, zones)
- at worst, actually harmful because it leads kernel developers to
believe users view mempolicy APIs as reasonable. They don't.
In my experience it's viewed as:
- too complicated (SW doesn't want to know about HW)
- useless (it's not even respected by reclaim)
- actively harmful (it makes your code less portable)
- "The only thing we have"
Your RFC expresses the same concerns that I have seen over the past
few years in Device-Memory development groups... except that the
general consensus (in 2017) was that these devices were not commodity
hardware that the kernel needed a general abstraction (NUMA) to support.
"Push the complexity to userland" (mempolicy), and
"Make the driver manage it." (hmm/zone_device)
Have been the prevailing opinions as a result.
From where I sit, this depends on the assumption that anyone using such
systems is sophisticated and empowered enough to accept that
complexity. That is quite bluntly no longer the case.
GPUs, unified memory, and coherent interconnects have all become
commodity hardware in the data center, and the "users" here are
infrastructure-as-a-service folks that want these systems to be
some definition of fungible.
~Gregory
On 11/26/25 19:29, Gregory Price wrote:
> First, thank you, the read-through here filled in some holes regarding
> HMM-CDM for me. If you have developed any other recent opinions on the
> use of HMM-CDM vs NUMA-CDM, your experience is most welcome.

Sorry for the delay in responding, I've not yet read through your series.

> - I discussed in my note to David that this is probably the right
>   way to go about doing it. I think N_MEMORY can still be set, if
>   a new global-default-node policy is created.

I still think N_MEMORY as a flag should mean something different from
N_SPM_NODE_MEMORY because their characteristics are different.

> - I wasn't sure whether such a thing would end up in memory-tiers.c
>   or somehow abstracted otherwise. We definitely do not want node
>   attributes to imply infinite N_XXXXX masks.

I have to think about this some more.

> - Instead, I can see either per-component policies (reclaim->nodes)
>   or a global policy that covers all of those components (similar to
>   my sysram_nodes). Drivers would then be responsible to register
>   their hotplugged memory nodes with those components accordingly.

To me node zonelists provide the right abstraction of where to allocate
from and how to fallback as needed. I'll read your patches to figure
out how your approach is different. I wanted the isolation at
allocation time.

> Interesting discussions to be had.

Yes, we should look at the pros and cons. To be honest, I wouldn't be
opposed to having kswapd and reclaim look different for these nodes; it
would also mean that we'd need pagecache hooks if we want page cache on
these nodes. Everything else, including move_pages() should just work.

> "Push the complexity to userland" (mempolicy), and
> "Make the driver manage it." (hmm/zone_device)

Yep.

> GPUs, unified memory, and coherent interconnects have all become
> commodity hardware in the data center, and the "users" here are
> infrastructure-as-a-service folks that want these systems to be
> some definition of fungible.

I also think the absence of better integration makes memory management
harder.

Balbir
On Wed, Dec 03, 2025 at 03:36:33PM +1100, Balbir Singh wrote:
> > - I discussed in my note to David that this is probably the right
> > way to go about doing it. I think N_MEMORY can still be set, if
> > a new global-default-node policy is created.
> >
>
> I still think N_MEMORY as a flag should mean something different from
> N_SPM_NODE_MEMORY because their characteristics are different
>
... snip ... (I agree, see later)
> > - Instead, I can see either per-component policies (reclaim->nodes)
> > or a global policy that covers all of those components (similar to
> > my sysram_nodes). Drivers would then be responsible to register
> > their hotplugged memory nodes with those components accordingly.
> >
>
> To me node zonelists provide the right abstraction of where to allocate from
> and how to fallback as needed. I'll read your patches to figure out how your
> approach is different. I wanted the isolation at allocation time
>
... snip ... (I agree, see later)
>
> Yes, we should look at the pros and cons. To be honest, I'd wouldn't be
> opposed to having kswapd and reclaim look different for these nodes, it
> would also mean that we'd need pagecache hooks if we want page cache on
> these nodes. Everything else, including move_pages() should just work.
>
Basically my series does (roughly) the same as yours, but adds the
cpusets controls and a GFP flag. The MHP extension should ultimately
be converted to N_SPM_NODE_MEMORY (or whatever we decide to name it).
After some more time to think, I think we want all of it.
- N_SPM_NODE_MEMORY (or whatever we call it) handles filtering out
SPM at allocation time by default and protects all current users
of N_MEMORY from exposure to SPM.
- cpusets controls allow userland isolation control and a default sysram
mask (I think cpusets.sysram_nodes doesn't even need to be exposed via
sysfs to be honest). cpusets fix is needed due to task->mems_allowed
being used as a default nodemask on systems using cgroups/cpusets.
- GFP_SPM_NODE protects against someone doing something like:
      get_page_from_freelist(..., node_states[N_POSSIBLE])
  or
      numactl --interleave --all ./my_program
  while providing a way to punch an explicit hole in the isolation
  (GFP_SPM_NODE means "Use N_SPM_NODE_MEMORY instead of N_MEMORY").
  This could be argued against so long as we restrict mempolicy.c
  to N_MEMORY nodes (to avoid `--interleave --all` issues), but this
  limitation may not be preferable.
My concern is for breaking existing userland software that happens
to run on a system with SPM - but you can probably imagine many more
bad scenarios.
~Gregory
On Wed, Nov 12, 2025 at 02:29:16PM -0500, Gregory Price wrote:
> With this set, we aim to enable allocation of "special purpose memory"
> with the page allocator (mm/page_alloc.c) without exposing the same
> memory as "System RAM". Unless a non-userland component, and does so
> with the GFP_SPM_NODE flag, memory on these nodes cannot be allocated.

How special is "special purpose memory"? If the only difference is a
latency/bandwidth discrepancy compared to "System RAM", I don't believe
it deserves this designation.

I am not in favor of the new GFP flag approach. To me, this indicates
that our infrastructure surrounding nodemasks is lacking. I believe we
would benefit more by improving it rather than simply adding a GFP flag
on top.

While I am not an expert in NUMA, it appears that the approach with
default and opt-in NUMA nodes could be generally useful. Like,
introduce a system-wide default NUMA nodemask that is a subset of all
possible nodes. This way, users can request the "special" nodes by
using a wider mask than the default.

cpusets should allow setting both default and possible masks in a
hierarchical manner, where a child's default/possible mask cannot be
wider than the parent's possible mask, and the default is not wider
than its own possible mask.

> Userspace-driven allocations are restricted by the sysram_nodes mask,
> nothing in userspace can explicitly request memory from SPM nodes.
>
> Instead, the intent is to create new components which understand memory
> features and register those nodes with those components. This abstracts
> the hardware complexity away from userland while also not requiring new
> memory innovations to carry entirely new allocators.

I don't see how it is a positive. It seems to be a negative side-effect
of GFP being a leaky abstraction.

-- 
Kiryl Shutsemau / Kirill A. Shutemov
On Tue, Nov 25, 2025 at 02:09:39PM +0000, Kiryl Shutsemau wrote: > On Wed, Nov 12, 2025 at 02:29:16PM -0500, Gregory Price wrote: > > With this set, we aim to enable allocation of "special purpose memory" > > with the page allocator (mm/page_alloc.c) without exposing the same > > memory as "System RAM". Unless a non-userland component, and does so > > with the GFP_SPM_NODE flag, memory on these nodes cannot be allocated. > > How special is "special purpose memory"? If the only difference is a > latency/bandwidth discrepancy compared to "System RAM", I don't believe > it deserves this designation. > That is not the only discrepancy, but it can certainly be one of them. I do think, at a certain latency/bandwidth level, memory becomes "Specific Purpose" - because the performance implications become so dramatic that you cannot allow just anything to land there. In my head, I've been thinking about this list 1) Plain old memory (<100ns) 2) Kinda slower, but basically still memory (100-300ns) 3) Slow Memory (>300ns, up to 2-3us loaded latencies) 4) Types 1-3, but with a special feature (Such as compression) 5) Coherent Accelerator Memory (various interconnects now exist) 6) Non-coherent Shared Memory and PMEM (FAMFS, Optane, etc) Originally I was considering [3,4], but with Alistar's comments I am also thinking about [5] since apparently some accelerators already toss their memory into the page allocator for management. Re: Slow memory -- Think >500-700ns cache line fetches, or 1-2us loaded. It's still "Basically just memory", but the scenarios in which you can use it transparently shrink significantly. If you can control what and how things can land there with good policy, this can still be a boon compared to hitting I/O. But you still want things like reclaim and compaction to run on this memory, and you still want buddy-allocation of this memory. 
Re: Compression This is a class of memory device which presents "usable memory" but which carries stipulations around its use. The compressed case is the example I use in this set. There is an inline compression mechanism on the device. If the compression ratio drops to low, writes can get dropped resulting in memory poison. We could solve this kind of problem only allowing allocation via demotion and hack off the Write-bit in the PTE. This provides the interposition needed to fend-off compression ratio issues. But... it's basically still "just memory" - you can even leave it mapped in the CPU page tables and allow userland to read unimpeded. In fact, we even want things like compaction and reclaim to run here. This cannot be done *unless* this memory is in the page allocator, and basically necessitates reimplementing all the core services the kernel provides. Re: Accelerators Alistair has described accelerators onlining their memory as NUMA nodes being an existing pattern (apparently not in-tree as far as I can see, though). General consensus is "don't do this" - and it should be obvious why. Memory pressure can cause non-workload memory to spill to these NUMA nodes as fallback allocation targets. But if we had a strong isolation mechanism, this could be supported. I'm not convinced this kind of memory actually needs core services like reclaim, so I will wait to see those arguments/data before I conclude whether the idea is sound. > > I am not in favor of the new GFP flag approach. To me, this indicates > that our infrastructure surrounding nodemasks is lacking. I believe we > would benefit more by improving it rather than simply adding a GFP flag > on top. > The core of this series is not the GFP flag, it is the splitting of (cpuset.mems_allowed) into (cpuset.mems_allowed, cpuset.sysram_nodes) That is the nodemask infrastructure improvement. 
The GFP flag is one mechanism of loosening the validation logic from
limiting allocations from (sysram_nodes) to including all nodes present
in (mems_allowed).

> While I am not an expert in NUMA, it appears that the approach with
> default and opt-in NUMA nodes could be generally useful. Like,
> introduce a system-wide default NUMA nodemask that is a subset of all
> possible nodes.

This patch set does that (cpuset.sysram_nodes and mt_sysram_nodemask).

> This way, users can request the "special" nodes by using
> a wider mask than the default.
>

I describe in the response to David that this is possible, but creates
extreme tripping hazards for a large swath of existing software.

snippet
'''
Simple answer: We can choose how hard this guardrail is to break.

This initial attempt makes it "Hard":
You cannot "accidentally" allocate SPM, the call must be explicit.

Removing the GFP would work, and make it "Easier" to access SPM memory.

This would allow a trivial

    mbind(range, SPM_NODE_ID)

Which is great, but is also an incredible tripping hazard:

    numactl --interleave --all

and in kernel land:

    __alloc_pages_noprof(..., nodes[N_MEMORY])

These will now instantly be subject to SPM node memory.
'''

There are many places that use these patterns already.

But at the end of the day, it is preference: we can choose to do that.

> cpusets should allow to set both default and possible masks in a
> hierarchical manner where a child's default/possible mask cannot be
> wider than the parent's possible mask and default is not wider than
> its own possible.
>

This patch set implements exactly what you describe:
    sysram_nodes = default
    mems_allowed = possible

> > Userspace-driven allocations are restricted by the sysram_nodes mask,
> > nothing in userspace can explicitly request memory from SPM nodes.
> >
> > Instead, the intent is to create new components which understand memory
> > features and register those nodes with those components. This abstracts
> > the hardware complexity away from userland while also not requiring new
> > memory innovations to carry entirely new allocators.
>
> I don't see how it is a positive. It seems to be a negative side-effect
> of GFP being a leaky abstraction.
>

It's a matter of applying an isolation mechanism and then punching an
explicit hole in it. As it is right now, GFP is "leaky" in that there
are, basically, no walls. Reclaim even ignored cpuset controls until
recently, and the page_alloc code even says to ignore cpuset when
in an interrupt context.

The core of the proposal here is to provide a strong isolation mechanism
and then allow punching explicit holes in it. The GFP flag is one
pattern; I'm open to others.

~Gregory
On 2025-11-26 at 02:05 +1100, Gregory Price <gourry@gourry.net> wrote...
> On Tue, Nov 25, 2025 at 02:09:39PM +0000, Kiryl Shutsemau wrote:
> > On Wed, Nov 12, 2025 at 02:29:16PM -0500, Gregory Price wrote:
> > > With this set, we aim to enable allocation of "special purpose memory"
> > > with the page allocator (mm/page_alloc.c) without exposing the same
> > > memory as "System RAM". Unless a non-userland component requests it,
> > > and does so with the GFP_SPM_NODE flag, memory on these nodes cannot
> > > be allocated.
> >
> > How special is "special purpose memory"? If the only difference is a
> > latency/bandwidth discrepancy compared to "System RAM", I don't believe
> > it deserves this designation.
> >
>
> That is not the only discrepancy, but it can certainly be one of them.
>
> I do think, at a certain latency/bandwidth level, memory becomes
> "Specific Purpose" - because the performance implications become so
> dramatic that you cannot allow just anything to land there.
>
> In my head, I've been thinking about this list:
>
> 1) Plain old memory (<100ns)
> 2) Kinda slower, but basically still memory (100-300ns)
> 3) Slow Memory (>300ns, up to 2-3us loaded latencies)
> 4) Types 1-3, but with a special feature (such as compression)
> 5) Coherent Accelerator Memory (various interconnects now exist)
> 6) Non-coherent Shared Memory and PMEM (FAMFS, Optane, etc.)
>
> Originally I was considering [3,4], but with Alistair's comments I am
> also thinking about [5], since apparently some accelerators already
> toss their memory into the page allocator for management.

Thanks.

> Re: Slow memory --
>
> Think >500-700ns cache line fetches, or 1-2us loaded.
>
> It's still "basically just memory", but the scenarios in which
> you can use it transparently shrink significantly. If you can
> control what and how things can land there with good policy,
> this can still be a boon compared to hitting I/O.
>
> But you still want things like reclaim and compaction to run
> on this memory, and you still want buddy-allocation of this memory.
>
> Re: Compression
>
> This is a class of memory device which presents "usable memory"
> but which carries stipulations around its use.
>
> The compressed case is the example I use in this set. There is an
> inline compression mechanism on the device. If the compression ratio
> drops too low, writes can get dropped, resulting in memory poison.
>
> We could solve this kind of problem by only allowing allocation via
> demotion and hacking off the Write-bit in the PTE. This provides the
> interposition needed to fend off compression ratio issues.
>
> But... it's basically still "just memory" - you can even leave it
> mapped in the CPU page tables and allow userland to read unimpeded.
>
> In fact, we even want things like compaction and reclaim to run here.
> This cannot be done *unless* this memory is in the page allocator,
> and doing it elsewhere basically necessitates reimplementing all the
> core services the kernel provides.
>
> Re: Accelerators
>
> Alistair has described accelerators onlining their memory as NUMA
> nodes as an existing pattern (apparently not in-tree as far as I
> can see, though).

Yeah, sadly not yet :-( Hopefully "soon". Although onlining the memory
doesn't have much driver involvement, as the GPU memory all just
appears in the ACPI tables as a CPU-less memory node anyway (which is
why it ended up being easy for people to toss it into the page
allocator).

> General consensus is "don't do this" - and it should be obvious
> why. Memory pressure can cause non-workload memory to spill to
> these NUMA nodes as fallback allocation targets.

Indeed, this is a common complaint when people have done this.

> But if we had a strong isolation mechanism, this could be supported.
> I'm not convinced this kind of memory actually needs core services
> like reclaim, so I will wait to see those arguments/data before I
> conclude whether the idea is sound.

Sounds reasonable, I don't have strong arguments either way at the
moment so will see if we can gather some data.

> > I am not in favor of the new GFP flag approach. To me, this indicates
> > that our infrastructure surrounding nodemasks is lacking. I believe we
> > would benefit more by improving it rather than simply adding a GFP flag
> > on top.
> >
>
> The core of this series is not the GFP flag, it is the splitting of
> (cpuset.mems_allowed) into (cpuset.mems_allowed, cpuset.sysram_nodes).
>
> That is the nodemask infrastructure improvement. The GFP flag is one
> mechanism of loosening the validation logic from limiting allocations
> from (sysram_nodes) to including all nodes present in (mems_allowed).
>
> > While I am not an expert in NUMA, it appears that the approach with
> > default and opt-in NUMA nodes could be generally useful. Like,
> > introduce a system-wide default NUMA nodemask that is a subset of all
> > possible nodes.
>
> This patch set does that (cpuset.sysram_nodes and mt_sysram_nodemask)
>
> > This way, users can request the "special" nodes by using
> > a wider mask than the default.
> >
>
> I describe in the response to David that this is possible, but creates
> extreme tripping hazards for a large swath of existing software.
>
> snippet
> '''
> Simple answer: We can choose how hard this guardrail is to break.
>
> This initial attempt makes it "Hard":
> You cannot "accidentally" allocate SPM, the call must be explicit.
>
> Removing the GFP would work, and make it "Easier" to access SPM memory.
>
> This would allow a trivial
>
>     mbind(range, SPM_NODE_ID)
>
> Which is great, but is also an incredible tripping hazard:
>
>     numactl --interleave --all
>
> and in kernel land:
>
>     __alloc_pages_noprof(..., nodes[N_MEMORY])
>
> These will now instantly be subject to SPM node memory.
> '''
>
> There are many places that use these patterns already.
>
> But at the end of the day, it is preference: we can choose to do that.
>
> > cpusets should allow to set both default and possible masks in a
> > hierarchical manner where a child's default/possible mask cannot be
> > wider than the parent's possible mask and default is not wider than
> > its own possible.
> >
>
> This patch set implements exactly what you describe:
>     sysram_nodes = default
>     mems_allowed = possible
>
> > > Userspace-driven allocations are restricted by the sysram_nodes mask,
> > > nothing in userspace can explicitly request memory from SPM nodes.
> > >
> > > Instead, the intent is to create new components which understand memory
> > > features and register those nodes with those components. This abstracts
> > > the hardware complexity away from userland while also not requiring new
> > > memory innovations to carry entirely new allocators.
> >
> > I don't see how it is a positive. It seems to be a negative side-effect
> > of GFP being a leaky abstraction.
> >
>
> It's a matter of applying an isolation mechanism and then punching an
> explicit hole in it. As it is right now, GFP is "leaky" in that there
> are, basically, no walls. Reclaim even ignored cpuset controls until
> recently, and the page_alloc code even says to ignore cpuset when
> in an interrupt context.
>
> The core of the proposal here is to provide a strong isolation mechanism
> and then allow punching explicit holes in it. The GFP flag is one
> pattern; I'm open to others.
>
> ~Gregory
[...]
> 2) The addition of `cpuset.mems.sysram` which restricts allocations to
>    `mt_sysram_nodes` unless GFP_SPM_NODE is used.
>
>    SPM Nodes are still allowed in cpuset.mems.allowed and effective.
>
>    This is done to allow separate control over sysram and SPM node sets
>    by cgroups while maintaining the existing hierarchical rules.
>
>    current cpuset configuration
>      cpuset.mems_allowed
>      |.mems_effective < (mems_allowed ∩ parent.mems_effective)
>      |->tasks.mems_allowed < cpuset.mems_effective
>
>    new cpuset configuration
>      cpuset.mems_allowed
>      |.mems_effective < (mems_allowed ∩ parent.mems_effective)
>      |.sysram_nodes < (mems_effective ∩ default_sys_nodemask)
>      |->task.sysram_nodes < cpuset.sysram_nodes
>
>    This means mems_allowed still restricts all node usage in any given
>    task context, which is the existing behavior.
>
> 3) Addition of MHP_SPM_NODE flag to instruct memory_hotplug.c that the
>    capacity being added should mark the node as an SPM Node.

Sounds a bit like the wrong interface for configuring this. This smells
like a per-node setting that should be configured before hotplugging any
memory.

>    A node is either SysRAM or SPM - never both. Attempting to add
>    incompatible memory to a node results in hotplug failure.
>
>    DAX and CXL are made aware of the bit and have `spm_node` bits added
>    to their relevant subsystems.
>
> 4) Adding GFP_SPM_NODE - which allows page_alloc.c to request memory
>    from the provided node or nodemask. It changes the behavior of
>    the cpuset mems_allowed and mt_node_allowed() checks.

I wonder why that is required. Couldn't we disallow allocation from one
of these special nodes as default, and only allow it if someone
explicitly passes in the node for allocation?

What's the problem with that?

--
Cheers

David
On Mon, Nov 24, 2025 at 10:19:37AM +0100, David Hildenbrand (Red Hat) wrote:
> [...]
>
Apologies in advance for the wall of text, both of your questions really
do cut to the core of the series. The first (SPM nodes) is basically a
plumbing problem I haven't had time to address pre-LPC, the second (GFP)
is actually a design decision that is definitely up in the air.
So consider this a dump of everything I wouldn't have had time to cover
in the LPC session.
> > 3) Addition of MHP_SPM_NODE flag to instruct memory_hotplug.c that the
> > capacity being added should mark the node as an SPM Node.
>
> Sounds a bit like the wrong interface for configuring this. This smells like
> a per-node setting that should be configured before hotplugging any memory.
>
Assuming you're specifically talking about the MHP portion of this.
I agree, and I think the plumbing ultimately goes through acpi and
kernel configs. This was my shortest path to demonstrate a functional
prototype by LPC.
I think the most likely option is simply reserving additional NUMA nodes
for hotpluggable regions based on a Kconfig setting.
I think the real setup process should look as follows:
1. At __init time, Linux reserves additional SPM nodes based on some
configuration (build? runtime? etc)
Essentially create: nodes[N_SPM]
2. At SPM setup time, a driver registers an "Abstract Type" with
mm/memory-tiers.c which maps SPM->Type.
This gives the core some management callback infrastructure without
polluting the core with device specific nonsense.
This also gives the driver a chance to define things like SLIT
distances for those nodes, which otherwise won't exist.
3. At hotplug time, memory-hotplug.c should only have to flip a bit
in `mt_sysram_nodes` if NID is not in nodes[N_SPM]. That logic
is still there to ensure the base filtering works as intended.
I haven't quite figured out how to plumb out nodes[N_SPM] as described
above, but I did figure out how to demonstrate roughly the same effect
through memory-hotplug.c - hopefully that much is clear.
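For illustration only, the three steps can be modeled in plain userspace
C. Everything here is a hypothetical stand-in: `spm_reserve_nodes`,
`spm_hotadd`, and the 64-bit masks model `nodes[N_SPM]` and
`mt_sysram_nodes`, not the series' actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint64_t spm_possible;    /* models nodes[N_SPM], reserved at __init */
static uint64_t mt_sysram_nodes; /* models nodes currently usable as SysRAM */

/* Step 1: __init-time reservation of SPM node ids from some configuration */
static void spm_reserve_nodes(const int *nids, int count)
{
	for (int i = 0; i < count; i++)
		spm_possible |= 1ULL << nids[i];
}

/* Step 3: hotplug only flips the sysram bit for non-SPM nodes */
static bool spm_hotadd(int nid, bool spm_requested)
{
	bool is_spm = spm_possible & (1ULL << nid);

	/* a node is either SysRAM or SPM - never both */
	if (spm_requested != is_spm)
		return false; /* reject the incompatible hot-add */
	if (!is_spm)
		mt_sysram_nodes |= 1ULL << nid;
	return true;
}
```

The sketch also captures the "hotplug failure" rule from the cover
letter: adding SPM capacity to a SysRAM node (or vice versa) is
rejected rather than silently reclassifying the node.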
The problem with the above plan is whether that "makes sense" according
to ACPI specs and friends.
This operates in "Ambiguity Land", which is uncomfortable.
======== How Linux ingests ACPI Tables to make NUMA nodes =======
For the sake of completeness:
NUMA nodes are "marked as possible" primarily via entries in the ACPI
SRAT (System Resource Affinity Table).
https://docs.kernel.org/driver-api/cxl/platform/acpi/srat.html
Subtable Type : 01 [Memory Affinity]
Length : 28
Proximity Domain : 00000001 <- NUMA Node 1
A proximity domain (PXM) is simply a logical grouping of components
according to the OSPM. Linux takes PXMs and maps them to NUMA nodes.
In most cases (NR_PXM == NR_NODES), but not always. For example, if
the CXL Early Detection Table (CEDT) describes a CXL memory region for
which there is no SRAT entry, Linux reserves a "Fake PXM" id and
marks that ID as a "possible" NUMA node.
= drivers/acpi/numa/srat.c
int __init acpi_numa_init(void)
{
...
/* fake_pxm is the next unused PXM value after SRAT parsing */
for (i = 0, fake_pxm = -1; i < MAX_NUMNODES; i++) {
if (node_to_pxm_map[i] > fake_pxm)
fake_pxm = node_to_pxm_map[i];
}
last_real_pxm = fake_pxm;
fake_pxm++;
acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
&fake_pxm);
...
}
static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
void *arg, const unsigned long table_end)
{
...
/* No SRAT description. Create a new node. */
node = acpi_map_pxm_to_node(*fake_pxm);
...
node_set(node, numa_nodes_parsed); <- this is used to set N_POSSIBLE
}
Here's where we get into "Specification Ambiguity"
The ACPI spec does not prevent (as far as I can see) a memory region from
being associated with multiple proximity domains (NUMA nodes).
Therefore, the OSPM could actually report it multiple times in the SRAT
in order to reserve multiple NUMA node possibilities for the same device.
A further extension to ACPI could be used to mark such Memory PXMs as
"Specific Purpose" - similar to the EFI_MEMORY_SP bit used to mark
memory regions as "Soft Reserved".
(this would probably break quite a lot of existing Linux code; a quick
browse around gives you the sense that there's an assumption that a
given page can only be affiliated with one possible NUMA node)
But Linux could also utilize build or runtime settings to add additional
nodes which are reserved for SPM use - but are otherwise left out of
all the default maps. This at least seems reasonable.
Note: N_POSSIBLE nodes is set at __init time, and is more or less
expected to never change. It's probably preferable to work with this
restriction, rather than to try to change it. Many race conditions.
<skippable wall>
================= Spec nonsense for reference ====================
(ACPI 6.5 Spec)
5.2.16.2 Memory Affinity Structure
The Memory Affinity structure provides the following topology information statically to the operating system:
• The association between a memory range and the proximity domain to which it belongs
• Information about whether the memory range can be hot-plugged.
5.2.19 Maximum System Characteristics Table (MSCT)
This section describes the format of the Maximum System Characteristic Table (MSCT), which provides OSPM with
information characteristics of a system’s maximum topology capabilities. If the system maximum topology is not
known up front at boot time, then this table is not present. OSPM will use information provided by the MSCT only
when the System Resource Affinity Table (SRAT) exists. The MSCT must contain all proximity and clock domains
defined in the SRAT.
-- field: Maximum Number of Proximity Domains
Indicates the maximum number of Proximity Domains ever possible in the system.
In theory an OSPM could make (MAX_NODES > (NR_NODES in SRAT)) and
that delta could be used to indicate the presence of SPM nodes.
This doesn't solve the SLIT PXM distance problem.
6.2.14 _PXM (Proximity)
This optional object is used to describe proximity domain associations within a machine. _PXM evaluates to an integer
that identifies a device as belonging to a Proximity Domain defined in the System Resource Affinity Table (SRAT).
OSPM assumes that two devices in the same proximity domain are tightly coupled.
17.2.1 System Resource Affinity Table Definition
The optional System Resource Affinity Table (SRAT) provides the boot time description of the processor and memory
ranges belonging to a system locality. OSPM will consume the SRAT only at boot time. For any devices not in the
SRAT, OSPM should use _PXM (Proximity) for them or their ancestors that are hot-added into the system after boot
up.
The SRAT describes the system locality that all processors and memory present in a system belong to at system boot.
This includes memory that can be hot-added (that is memory that can be added to the system while it is running,
without requiring a reboot). OSPM can use this information to optimize the performance of NUMA architecture
systems. For example, OSPM could utilize this information to optimize allocation of memory resources and the
scheduling of software threads.
=============================================================
</skippable wall>
So TL;DR: Yes, I agree, this logic should be configured at __init time, but
while we work on that plumbing, the memory-hotplug.c interface can be
used to unblock exploratory work (such as Alistair's GPU interests).
> > 4) Adding GFP_SPM_NODE - which allows page_alloc.c to request memory
> > from the provided node or nodemask. It changes the behavior of
> > the cpuset mems_allowed and mt_node_allowed() checks.
>
> I wonder why that is required. Couldn't we disallow allocation from one of
> these special nodes as default, and only allow it if someone explicitly
> passes in the node for allocation?
>
> What's the problem with that?
>
Simple answer: We can choose how hard this guardrail is to break.
This initial attempt makes it "Hard":
You cannot "accidentally" allocate SPM, the call must be explicit.
Removing the GFP would work, and make it "Easier" to access SPM memory.
(There would be other adjustments needed, but the idea is the same).
To do this you would revert the mems_allowed check changes in cpuset
to check mems_allowed always (instead of sysram_nodes).
This would allow a trivial
mbind(range, SPM_NODE_ID)
Which is great, but is also an incredible tripping hazard:
numactl --interleave --all
and in kernel land:
__alloc_pages_noprof(..., nodes[N_MEMORY])
These will now instantly be subject to SPM node memory.
The first pass leverages the GFP flag to make all these tripping hazards
disappear. You can pass a completely garbage nodemask into the page
allocator and still rest assured that you won't touch SPM nodes.
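To make the guarantee concrete, here is a toy userspace model of the
"hard guardrail" per-node check. `node_allowed`, `GFP_SPM_NODE_SKETCH`,
and the fixed masks are hypothetical stand-ins for the series' symbols,
not its actual implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GFP_SPM_NODE_SKETCH (1u << 0) /* stand-in for __GFP_SPM_NODE */

static const uint64_t sysram_nodes = 0x3; /* nodes 0,1 are System RAM */
static const uint64_t spm_nodes    = 0x4; /* node 2 is SPM            */

/* Per-node admission, as the zone iteration loop might apply it */
static bool node_allowed(int nid, unsigned int gfp, uint64_t req_mask)
{
	uint64_t bit = 1ULL << nid;

	if (!(req_mask & bit))  /* caller's nodemask (may be garbage) */
		return false;
	if (sysram_nodes & bit) /* SysRAM is always fair game */
		return true;
	/* SPM is reachable only with the explicit opt-in flag */
	return (spm_nodes & bit) && (gfp & GFP_SPM_NODE_SKETCH);
}
```

Even an all-ones nodemask (the `numactl --interleave --all` hazard)
cannot reach node 2 without the flag, while the flag alone is not
enough either - the node must also be in the caller's mask.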
So TL;DR: "What do we want here?" (if anything at all)
For completeness, here are the page_alloc/cpuset/mempolicy interactions
which led me to a GFP flag as the "loosening mechanism" for the filter,
rather than allowing any nodemask to "just work".
Apologies again for the wall of text here, essentially dumping
~6 months of research and prototyping.
====================
There are basically 3 components which interact with each other:
1) the page allocator nodemask / zone logic
2) cpuset.mems_allowed
3) mempolicy (task, vma)
and now:
4) GFP_SPM_NODE
=== 1) the page allocator nodemask and zone iteration logic
- page allocator uses prepare_alloc_pages() to decide what
alloc_context.nodemask will contain
- nodemask can be NULL or a set of nodes.
- for_zone() iteration logic will iterate all zones if mask=NULL
Otherwise, it skips zones on nodes not present in the mask
- the value of alloc_context.nodemask may change
for example it may end up loosened if in an interrupt context or
if reclaim/compaction/fallbacks are invoked.
Some issues might be obvious:
It would be bad, for example, for an interrupt to have its allocation
context loosened to nodes[N_MEMORY] and end up allocating SPM memory
Capturing all of these scenarios would be very difficult if not
impossible.
The page allocator does an initial filtering of nodes if nodemask=NULL,
or it defers the filter operation to the allocation logic if a nodemask
is present (or we're in an interrupt context).
static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
int preferred_nid, nodemask_t *nodemask,
struct alloc_context *ac, gfp_t *alloc_gfp,
unsigned int *alloc_flags)
{
...
ac->nodemask = nodemask;
if (cpuset_enabled()) {
...
if (in_task() && !ac->nodemask)
ac->nodemask = &cpuset_current_mems_allowed;
^^^^ current_task.mems_allowed
else
*alloc_flags |= ALLOC_CPUSET;
^^^ apply cpuset check during allocation instead
}
}
Note here: If cpuset is not enabled, we don't filter!
patch 05/11 uses mt_sysram_nodes to filter in that scenario
In the actual allocation logic, we use this nodemask (or cpusets) to
filter out unwanted nodes.
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
const struct alloc_context *ac)
{
z = ac->preferred_zoneref;
for_next_zone_zonelist_nodemask(zone, z, ac->highest_zoneidx,
ac->nodemask) {
^ if nodemask=NULL - iterates ALL zones in all nodes ^
...
if (cpuset_enabled() &&
(alloc_flags & ALLOC_CPUSET) &&
!__cpuset_zone_allowed(zone, gfp_mask))
continue;
^^^^^^^^ Skip zone if not in mems_allowed ^^^^^^^^^
Of course we could change the page allocator logic more explicitly
to support this kind of scenario.
For example:
We might add alloc_spm_pages() which checks mems_allowed instead
of sysram_nodes.
I tried this, and the code duplication and spaghetti it resulted in
was embarrassing. It did work, but adding hundreds of lines to
page_alloc.c, with the risk of breaking something, just led me to
quickly discard it.
It also just bluntly made using SPM memory worse - you just want to
call alloc_pages(nodemask) and be done with it.
This is what led me to focus on modifying cpuset.mems_allowed and
add global filter logic when cpusets is disabled.
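A rough userspace sketch of that nodemask-selection step, modeled on the
`prepare_alloc_pages()` excerpt above: when the caller passes no mask, fall
back to sysram defaults rather than "all nodes". The names
(`pick_default_nodemask`, the two globals) are illustrative, and cpuset
state is reduced to a flag plus a mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool cpusets_enabled_flag;       /* models cpuset_enabled()        */
static uint64_t cpuset_sysram_nodes;    /* models current task's default  */
static uint64_t mt_sysram_nodemask;     /* models global default (05/11)  */

static uint64_t pick_default_nodemask(const uint64_t *caller_mask)
{
	if (caller_mask)
		return *caller_mask;        /* defer filtering to the zone loop */
	if (cpusets_enabled_flag)
		return cpuset_sysram_nodes; /* cgroup-scoped sysram default     */
	return mt_sysram_nodemask;      /* global sysram default            */
}
```

The point of the sketch: with cpusets disabled we no longer default to
"all nodes" but to the global sysram mask, which is what closes the
`nodemask=NULL` hole described above.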
=== 2) cpuset.mems
- cpuset.mems_allowed is the "primary filter" for most allocations
- if cpusets is not enabled, basically all nodes are "allowed"
- cpuset.mems_allowed is an *inherited value*
child cgroups are restricted by the parent's mems_allowed
cpuset.effective_mems is the actual nodemask filter.
cpuset.mems_allowed as-is cannot both restrict *AND* allow SPM nodes.
See the filtering functions above:
If you remove an SPM node from root_cgroup.cpuset.mems_allowed
to prevent all of its children from using it, you effectively prevent
ANYTHING from using it: the node is simply not allowed.
Since all tasks operate from within a the root context or its
children - you can never "Allow" the node.
If you don't remove the SPM node from the root cgroup, you aren't
preventing tasks in the root cgroup from accessing the memory.
I chose to break mems_allowed into (mems_allowed, sysram_nodes) to:
a) create simple nodemask=NULL default nodemask filters:
mt_sysram_nodes, cpuset.sysram_nodes, task.sysram_nodes
b) Leverage the existing cpuset filtering mechanism in
mems_allowed() checks
c) Simplify the non-cpuset filter mechanism to a 2-line change
in page_alloc.c -- from Patch 04/11:
@@ -3753,6 +3754,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
if ((alloc_flags & ALLOC_CPUSET) &&
!cpuset_zone_allowed(zone, gfp_mask))
continue;
+ else if (!mt_node_allowed(zone_to_nid(zone), gfp_mask))
+ continue;
page_alloc.c changes are much cleaner and easy to understand this way
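The hierarchical mask rules from points (a)-(c) can be sketched in a few
lines of userspace C. The struct and function names are illustrative, not
the kernel's; nodemasks are modeled as 64-bit masks:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal model of a cpuset's three masks as described above */
struct cpuset_sketch {
	uint64_t mems_allowed;
	uint64_t mems_effective;
	uint64_t sysram_nodes;
};

/*
 * Derivation rules:
 *   mems_effective = mems_allowed ∩ parent.mems_effective
 *   sysram_nodes   = mems_effective ∩ default sysram mask
 */
static void cpuset_derive(struct cpuset_sketch *cs,
			  const struct cpuset_sketch *parent,
			  uint64_t default_sysram)
{
	cs->mems_effective = cs->mems_allowed & parent->mems_effective;
	cs->sysram_nodes   = cs->mems_effective & default_sysram;
}
```

This is what lets a single cpuset both *allow* an SPM node (it stays in
mems_effective) and *restrict* it (it never enters sysram_nodes), which
the old single-mask scheme could not express.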
=== 3) mempolicy
- mempolicy allows you to change the task or vma node-policy, separate
from (but restricted by) cpuset.mems
- there are some policies like interleave which provide (ALL) options
which create, basically, a nodemask=nodes[N_MEMORY] scenario.
- This is entirely controllable via userspace.
- There exists a lot of software out there which makes use of this
interface via numactl syscalls (set_mempolicy, mbind, etc)
- There is a global "default" mempolicy which is leveraged when
task->mempolicy=NULL or vma->vm_policy=NULL.
The default policy is essentially "Allocate from local node, but
fallback to any possible node as-needed"
During my initial explorations I started by looking at whether a filter
function could be implemented via the global policy.
It should be somewhat obvious this falls apart completely as soon as you
find the page allocator actually filters using cpusets.
So mempolicy is dead as a candidate for any real isolation mechanism.
It is nothing more than a suggestion at best, and is actually explicitly
ignored by things like reclaim.
(cough: Mempolicy is dead, long live Memory Policy)
I was also very worried about introducing an SPM Node solution which
presented as an isolation mechanism... which then immediately crashed
and burned when deployed by anyone already using numactl.
I have since, however, been experimenting with how you might enable
mempolicy to include SPM nodes more explicitly (with the GFP flag).
(attached at the end, completely untested, just conceptual).
=== 4) GFP_SPM_NODE
Once the filtering functions are in place (sysram_nodes), we've hit
a point where absolutely nothing can actually touch those nodes at all.
So that was requirement #1... but of course we do actually want to
allocate this memory, that's the point. But now we have a choice...
If a node is present in the nodemask, we can:
1) filter it based on sysram_nodes
a) cpuset.sysram, or
b) mt_sysram_nodes
or
2) filter it based on mems_allowed
a) cpuset.effective_mems, or
b) nodes[N_MEMORY]
The first choice is "Hard Guardrails" - it requires both an explicit mask
AND the GFP flag to reach SPM memory.
The second choice is "Soft Guardrails" - more or less any nodemask is
allowed, and we trust the callers to be sane.
The cpuset filter functions already had a gfp argument, by the way:
bool cpuset_current_node_allowed(int node, gfp_t gfp_mask) {...}
I chose the former for the first pass due to the mempolicy section
above. If someone has an idea of how to apply this filtering logic
WITHOUT the GFP flag - I am absolutely open to suggestions.
My only other idea was separate alloc_spm_pages() interfaces, and that
just felt bad.
~Gregory
--------------- mempolicy extension ----------
mempolicy: add MPOL_F_SPM_NODE
Add a way for mempolicies to access SPM nodes.
Require MPOL_F_STATIC_NODES to prevent the policy mask from being
remapped onto other nodes.
Note: This doesn't work as-is because mempolicies are restricted by
cpuset.sysram_nodes instead of cpuset.mems_allowed, so the nodemask
will be rejected. This can be changed in the new/rebind mempolicy
interfaces.
Signed-off-by: Gregory Price
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 8fbbe613611a..c26aa8fb56d3 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -31,6 +31,7 @@ enum {
#define MPOL_F_STATIC_NODES (1 << 15)
#define MPOL_F_RELATIVE_NODES (1 << 14)
#define MPOL_F_NUMA_BALANCING (1 << 13) /* Optimize with NUMA balancing if possible */
+#define MPOL_F_SPM_NODE (1 << 12) /* Nodemask contains SPM Nodes */
/*
* MPOL_MODE_FLAGS is the union of all possible optional mode flags passed to
diff --git a/mm/memory.c b/mm/memory.c
index b59ae7ce42eb..7097d7045954 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3459,8 +3459,14 @@ static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
{
struct file *vm_file = vma->vm_file;
- if (vm_file)
- return mapping_gfp_mask(vm_file->f_mapping) | __GFP_FS | __GFP_IO;
+ if (vm_file) {
+ gfp_t gfp;
+ gfp = mapping_gfp_mask(vm_file->f_mapping) | __GFP_FS | __GFP_IO;
+ if (vma->vm_policy)
+ gfp |= (vma->vm_policy->flags & MPOL_F_SPM_NODE) ?
+ __GFP_SPM_NODE : 0;
+ return gfp;
+ }
/*
* Special mappings (e.g. VDSO) do not have any file so fake
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e1e8a1f3e1a2..2b4d23983ef8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1652,6 +1652,8 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
return -EINVAL;
if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
return -EINVAL;
+ if ((*flags & MPOL_F_SPM_NODE) && !(*flags & MPOL_F_STATIC_NODES))
+ return -EINVAL;
if (*flags & MPOL_F_NUMA_BALANCING) {
if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY)
*flags |= (MPOL_F_MOF | MPOL_F_MORON);
Just managed to go through the series, and I think there are very good
ideas here. It seems to cover the isolation requirements that are needed
for the devices with inline compression. As an RFC, I can try to build
something on top of it and test it more. I hope we find the right
abstractions for this to move forward.

On Tue, Nov 25, 2025 at 6:58 AM Gregory Price <gourry@gourry.net> wrote:
>
> On Mon, Nov 24, 2025 at 10:19:37AM +0100, David Hildenbrand (Red Hat) wrote:
> > [...]
> >
>
> Apologies in advance for the wall of text, both of your questions really
> do cut to the core of the series. The first (SPM nodes) is basically a
> plumbing problem I haven't had time to address pre-LPC, the second (GFP)
> is actually a design decision that is definitely up in the air.
>
> So consider this a dump of everything I wouldn't have had time to cover
> in the LPC session.
>
> > > 3) Addition of MHP_SPM_NODE flag to instruct memory_hotplug.c that the
> > >    capacity being added should mark the node as an SPM Node.
> >
> > Sounds a bit like the wrong interface for configuring this. This smells
> > like a per-node setting that should be configured before hotplugging any
> > memory.
> >
>
> Assuming you're specifically talking about the MHP portion of this.
>
> I agree, and I think the plumbing ultimately goes through acpi and
> kernel configs. This was my shortest path to demonstrate a functional
> prototype by LPC.
>
> I think the most likely option is simply reserving additional NUMA nodes
> for hotpluggable regions based on a Kconfig setting.
>
> I think the real setup process should look as follows:
>
> 1. At __init time, Linux reserves additional SPM nodes based on some
>    configuration (build? runtime? etc)
>
>    Essentially create: nodes[N_SPM]
>
> 2. At SPM setup time, a driver registers an "Abstract Type" with
>    mm/memory-tiers.c which maps SPM->Type.
>
>    This gives the core some management callback infrastructure without
>    polluting the core with device specific nonsense.
>
>    This also gives the driver a chance to define things like SLIT
>    distances for those nodes, which otherwise won't exist.
>
> 3. At hotplug time, memory-hotplug.c should only have to flip a bit
>    in `mt_sysram_nodes` if NID is not in nodes[N_SPM]. That logic
>    is still there to ensure the base filtering works as intended.
>
> I haven't quite figured out how to plumb out nodes[N_SPM] as described
> above, but I did figure out how to demonstrate roughly the same effect
> through memory-hotplug.c - hopefully that much is clear.
>
> The problem with the above plan is whether that "makes sense" according
> to ACPI specs and friends.
>
> This operates in "Ambiguity Land", which is uncomfortable.

What you describe at a high level above makes sense. And while I agree
that ACPI seems like a good layer for this, it could take a while for
things to converge. At the same time, different vendors might do things
differently (unsurprisingly, I guess...).

For example, it would not be an absurd idea that the "specialness" of
the device (e.g. compression) appears as a vendor specific capability
in CXL. So, it would make sense to allow specific device drivers to set
the respective node as SPM (as I understood you suggest above, right?)

Finally, going back to the isolation, I'm curious to see if this covers
GPU use cases as Alistair brought up, or HBMs in general. Maybe there
could be synergies with the HBM related talk in the device MC?

Best,
/Yiannis
On 2025-11-13 at 06:29 +1100, Gregory Price <gourry@gourry.net> wrote...
> This is a code RFC for discussion related to
>
> "Mempolicy is dead, long live memory policy!"
> https://lpc.events/event/19/contributions/2143/
>
> base-commit: 24172e0d79900908cf5ebf366600616d29c9b417
> (version notes at end)
>
> At LSF 2026, I plan to discuss:

Excellent! This all sounds quite interesting to me at least, so I've
added my two cents here, but I'm looking forward to discussing at LPC.

> - Why? (In short: shunting to DAX is a failed pattern for users)
> - Other designs I considered (mempolicy, cpusets, zone_device)

I'm interested in the contrast with zone_device, and in particular why
device_coherent memory doesn't end up being a good fit for this.

> - Why mempolicy.c and cpusets as-is are insufficient
> - SPM types seeking this form of interface (Accelerator, Compression)

I'm sure you can guess my interest is in GPUs, which also have memory
some people consider should only be used for specific purposes :-)
Currently our coherent GPUs online this as a normal NUMA node, for which
we have also generally found mempolicy, cpusets, etc. inadequate as
well, so it will be interesting to hear what shortcomings you have been
running into (I'm less familiar with the compression cases you talk
about here, though).

> - Platform extensions that would be nice to see (SPM-only Bits)
>
> Open Questions
> - Single SPM nodemask, or multiple based on features?
> - Apply SPM/SysRAM bit on-boot only or at-hotplug?
> - Allocate extra "possible" NUMA nodes for flexibility?

I guess this might make hotplug easier? Particularly in cases where FW
hasn't created the nodes.

> - Should SPM Nodes be zone-restricted? (MOVABLE only?)

For device-based memory I think so - otherwise you can never guarantee
devices can be removed, or drivers (if required to access the memory)
can be unbound, as you can't migrate things off the memory.

> - How to handle things like reclaim and compaction on these nodes.
>
> With this set, we aim to enable allocation of "special purpose memory"
> with the page allocator (mm/page_alloc.c) without exposing the same
> memory as "System RAM". Unless a non-userland component requests it,
> and does so with the GFP_SPM_NODE flag, memory on these nodes cannot
> be allocated.
>
> This isolation mechanism is a requirement for memory policies which
> depend on certain sets of memory never being used outside special
> interfaces (such as a specific mm/ component or driver).
>
> We present an example of using this mechanism within ZSWAP, as-if
> a "compressed memory node" was present. How to describe the features
> of memory present on nodes is left up to comment here and at LPC '26.
>
> Userspace-driven allocations are restricted by the sysram_nodes mask;
> nothing in userspace can explicitly request memory from SPM nodes.
>
> Instead, the intent is to create new components which understand memory
> features and register those nodes with those components. This abstracts
> the hardware complexity away from userland while also not requiring new
> memory innovations to carry entirely new allocators.
>
> The ZSwap example demonstrates this with the `mt_spm_nodemask`. This
> hack treats all spm nodes as-if they are compressed memory nodes, and
> we bypass the software compression logic in zswap in favor of simply
> copying memory directly to the allocated page. In a real design

So in your example (I get it's a hack), is the main advantage that you
can use all the same memory allocation policies (eg. cgroups) when
needing to allocate the pages? Given this is ZSwap, I guess these pages
would never be mapped directly into user-space, but would anything in
the design prevent that? For example, could a driver say allocate SPM
memory and then explicitly migrate an existing page to it?
> There are 4 major changes in this set:
>
> 1) Introducing mt_sysram_nodelist in mm/memory-tiers.c which denotes
>    the set of nodes which are eligible for use as normal system ram
>
>    Some existing users now pass mt_sysram_nodelist into the page
>    allocator instead of NULL, but passing a NULL pointer in will simply
>    have it replaced by mt_sysram_nodelist anyway. Should a fully NULL
>    pointer still make it to the page allocator, without GFP_SPM_NODE
>    SPM node zones will simply be skipped.
>
>    mt_sysram_nodelist is always guaranteed to contain the N_MEMORY nodes
>    present during __init, but if empty the use of mt_sysram_nodes()
>    will return a NULL to preserve current behavior.
>
> 2) The addition of `cpuset.mems.sysram` which restricts allocations to
>    `mt_sysram_nodes` unless GFP_SPM_NODE is used.
>
>    SPM Nodes are still allowed in cpuset.mems.allowed and effective.
>
>    This is done to allow separate control over sysram and SPM node sets
>    by cgroups while maintaining the existing hierarchical rules.
>
>    current cpuset configuration
>      cpuset.mems_allowed
>      |.mems_effective      < (mems_allowed ∩ parent.mems_effective)
>      |->tasks.mems_allowed < cpuset.mems_effective
>
>    new cpuset configuration
>      cpuset.mems_allowed
>      |.mems_effective      < (mems_allowed ∩ parent.mems_effective)
>      |.sysram_nodes        < (mems_effective ∩ default_sys_nodemask)
>      |->task.sysram_nodes  < cpuset.sysram_nodes
>
>    This means mems_allowed still restricts all node usage in any given
>    task context, which is the existing behavior.
>
> 3) Addition of MHP_SPM_NODE flag to instruct memory_hotplug.c that the
>    capacity being added should mark the node as an SPM Node.
>
>    A node is either SysRAM or SPM - never both. Attempting to add
>    incompatible memory to a node results in hotplug failure.
>
>    DAX and CXL are made aware of the bit and have `spm_node` bits added
>    to their relevant subsystems.
>
> 4) Adding GFP_SPM_NODE - which allows page_alloc.c to request memory
>    from the provided node or nodemask.
>    It changes the behavior of the cpuset mems_allowed and
>    mt_node_allowed() checks.
>
> v1->v2:
> - naming improvements
>     default_node -> sysram_node
>     protected -> spm (Specific Purpose Memory)
> - add missing constify patch
> - add patch to update callers of __cpuset_zone_allowed
> - add additional logic to the mm sysram_nodes patch
> - fix bot build issues (ifdef config builds)
> - fix out-of-tree driver build issues (function renames)
> - change compressed_nodelist to spm_nodelist
> - add latch mechanism for sysram/spm nodes (Dan Williams)
>   this drops some extra memory-hotplug logic which is nice
> v1: https://lore.kernel.org/linux-mm/20251107224956.477056-1-gourry@gourry.net/
>
> Gregory Price (11):
>   mm: constify oom_control, scan_control, and alloc_context nodemask
>   mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed
>   gfp: Add GFP_SPM_NODE for Specific Purpose Memory (SPM) allocations
>   memory-tiers: Introduce SysRAM and Specific Purpose Memory Nodes
>   mm: restrict slub, oom, compaction, and page_alloc to sysram by default
>   mm,cpusets: rename task->mems_allowed to task->sysram_nodes
>   cpuset: introduce cpuset.mems.sysram
>   mm/memory_hotplug: add MHP_SPM_NODE flag
>   drivers/dax: add spm_node bit to dev_dax
>   drivers/cxl: add spm_node bit to cxl region
>   [HACK] mm/zswap: compressed ram integration example
>
>  drivers/cxl/core/region.c       |  30 ++++++
>  drivers/cxl/cxl.h               |   2 +
>  drivers/dax/bus.c               |  39 ++++++++
>  drivers/dax/bus.h               |   1 +
>  drivers/dax/cxl.c               |   1 +
>  drivers/dax/dax-private.h       |   1 +
>  drivers/dax/kmem.c              |   2 +
>  fs/proc/array.c                 |   2 +-
>  include/linux/cpuset.h          |  62 +++++++------
>  include/linux/gfp_types.h       |   5 +
>  include/linux/memory-tiers.h    |  47 ++++++++++
>  include/linux/memory_hotplug.h  |  10 ++
>  include/linux/mempolicy.h       |   2 +-
>  include/linux/mm.h              |   4 +-
>  include/linux/mmzone.h          |   6 +-
>  include/linux/oom.h             |   2 +-
>  include/linux/sched.h           |   6 +-
>  include/linux/swap.h            |   2 +-
>  init/init_task.c                |   2 +-
>  kernel/cgroup/cpuset-internal.h |   8 ++
>  kernel/cgroup/cpuset-v1.c       |   7 ++
>  kernel/cgroup/cpuset.c          | 158 ++++++++++++++++++++------------
>  kernel/fork.c                   |   2 +-
>  kernel/sched/fair.c             |   4 +-
>  mm/compaction.c                 |  10 +-
>  mm/hugetlb.c                    |   8 +-
>  mm/internal.h                   |   2 +-
>  mm/memcontrol.c                 |   3 +-
>  mm/memory-tiers.c               |  66 ++++++++++++-
>  mm/memory_hotplug.c             |   7 ++
>  mm/mempolicy.c                  |  34 +++----
>  mm/migrate.c                    |   4 +-
>  mm/mmzone.c                     |   5 +-
>  mm/oom_kill.c                   |  11 ++-
>  mm/page_alloc.c                 |  57 +++++++----
>  mm/show_mem.c                   |  11 ++-
>  mm/slub.c                       |  15 ++-
>  mm/vmscan.c                     |   6 +-
>  mm/zswap.c                      |  66 ++++++++++++-
>  39 files changed, 532 insertions(+), 178 deletions(-)
>
> --
> 2.51.1
>
On Tue, Nov 18, 2025 at 06:02:02PM +1100, Alistair Popple wrote:
>
> I'm interested in the contrast with zone_device, and in particular why
> device_coherent memory doesn't end up being a good fit for this.
>
> > - Why mempolicy.c and cpusets as-is are insufficient
> > - SPM types seeking this form of interface (Accelerator, Compression)
>
> I'm sure you can guess my interest is in GPUs which also have memory some people
> consider should only be used for specific purposes :-) Currently our coherent
> GPUs online this as a normal NUMA node, for which we have also generally
> found mempolicy, cpusets, etc. inadequate as well, so it will be interesting to
> hear what shortcomings you have been running into (I'm less familiar with the
> Compression cases you talk about here though).
>
after some thought, talks, and doc readings, it seems like the
zone_device setups don't allow the CPU to map the devmem into page
tables, and instead depend on migrate_device logic (unless the docs are
out of sync with the code these days). That's at least what's described
in hmm and migrate_device.
Assuming this is out of date and ZONE_DEVICE memory is mappable into
page tables, then if you want sparse allocation, ZONE_DEVICE seems to
suggest you at least have to re-implement the buddy logic (which isn't
that tall of an ask).
But I could imagine an (overly simplistic) pattern with SPM Nodes:
fd = open("/dev/gpu_mem", ...)
buf = mmap(fd, ...)
buf[0]
1) driver takes the fault
2) driver calls alloc_page(..., gpu_node, GFP_SPM_NODE)
3) driver manages any special page table masks
Like marking pages RO/RW to manage ownership.
4) driver sends the gpu the (mapping_id, pfn, index) information
so that gpu can map the region in its page tables.
5) since the memory is cache coherent, gpu and cpu are free to
operate directly on the pages without any additional magic
(except typical concurrency controls).
Driver doesn't have to do much in the way of allocation management.
This is probably less compelling since you don't want general purpose
services like reclaim, migration, compaction, tiering - etc.
The value is clearly that you get to manage GPU memory like any other
memory, but without worry that other parts of the system will touch it.
I'm much more focused on the "I have memory that is otherwise general
purpose, and wants services like reclaim and compaction, but I want
strong controls over how things can land there in the first place".
~Gregory
On 2025-11-22 at 08:07 +1100, Gregory Price <gourry@gourry.net> wrote...
> On Tue, Nov 18, 2025 at 06:02:02PM +1100, Alistair Popple wrote:
> >
> > I'm interested in the contrast with zone_device, and in particular why
> > device_coherent memory doesn't end up being a good fit for this.
> >
> > > - Why mempolicy.c and cpusets as-is are insufficient
> > > - SPM types seeking this form of interface (Accelerator, Compression)
> >
> > I'm sure you can guess my interest is in GPUs which also have memory some people
> > consider should only be used for specific purposes :-) Currently our coherent
> > GPUs online this as a normal NUMA node, for which we have also generally
> > found mempolicy, cpusets, etc. inadequate as well, so it will be interesting to
> > hear what shortcomings you have been running into (I'm less familiar with the
> > Compression cases you talk about here though).
> >
>
> after some thought, talks, and doc readings, it seems like the
> zone_device setups don't allow the CPU to map the devmem into page
> tables, and instead depend on migrate_device logic (unless the docs are
> out of sync with the code these days). That's at least what's described
> in hmm and migrate_device.
There are multiple types here (DEVICE_PRIVATE and DEVICE_COHERENT). The former
is mostly irrelevant for this discussion but I'm including the descriptions here
for completeness. You are correct in saying that the only way either of these
currently get mapped into the page tables is via explicit migration of memory
to ZONE_DEVICE by a driver. There is also a corner case for first touch handling
which allows drivers to establish mappings to zero pages on a device if the page
hasn't been populated previously on the CPU.
These pages can, in some sense at least, be mapped on the CPU. DEVICE_COHERENT
pages are mapped normally (ie. the CPU can access them directly), whereas
DEVICE_PRIVATE pages are mapped using special swap entries so drivers can
emulate coherence by migrating pages back. This is used by devices without
coherent interconnects (ie. PCIe), whereas the former could be used by eg. CXL.
> Assuming this is out of date and ZONE_DEVICE memory is mappable into
> page tables, then if you want sparse allocation, ZONE_DEVICE seems to
> suggest you at least have to re-implement the buddy logic (which isn't
> that tall of an ask).
That's basically what happens - GPU drivers need memory allocation and therefore
re-implement some form of memory allocator. Agree that just being able to
reuse the buddy logic probably isn't that compelling though and isn't really of
interest (hence some of my original questions on what this is about).
> But I could imagine an (overly simplistic) pattern with SPM Nodes:
>
> fd = open("/dev/gpu_mem", ...)
> buf = mmap(fd, ...)
> buf[0]
> 1) driver takes the fault
> 2) driver calls alloc_page(..., gpu_node, GFP_SPM_NODE)
> 3) driver manages any special page table masks
> Like marking pages RO/RW to manage ownership.
Of course, as an aside, this needs to match the CPU PTE logic (this is
what hmm_range_fault() is primarily used for).
> 4) driver sends the gpu the (mapping_id, pfn, index) information
> so that gpu can map the region in its page tables.
On coherent systems this often just uses HW address translation services
(ATS), although I think the specific implementation of how page-tables are
mirrored/shared is orthogonal to this.
> 5) since the memory is cache coherent, gpu and cpu are free to
> operate directly on the pages without any additional magic
> (except typical concurrency controls).
This is roughly how things work with DEVICE_PRIVATE/COHERENT memory today,
except in the case of DEVICE_PRIVATE in step (5) above. In that case the page is
mapped as a non-present special swap entry that triggers a driver callback due
to the lack of cache coherence.
> Driver doesn't have to do much in the way of allocation management.
>
> This is probably less compelling since you don't want general purpose
> services like reclaim, migration, compaction, tiering - etc.
On at least some of our systems I'm told we do want this, hence my interest
here. Currently we have systems not using DEVICE_COHERENT and instead just
onlining everything as normal system managed memory in order to get reclaim
and tiering. Of course then people complain that it's managed as normal system
memory and non-GPU related things (ie. page-cache) end up in what's viewed as
special purpose memory.
> The value is clearly that you get to manage GPU memory like any other
> memory, but without worry that other parts of the system will touch it.
>
> I'm much more focused on the "I have memory that is otherwise general
> purpose, and wants services like reclaim and compaction, but I want
> strong controls over how things can land there in the first place".
So maybe there is some overlap here - what I have is memory that we want managed
much like normal memory but with strong controls over what it can be used for
(ie. just for tasks utilising the processing element on the accelerator).
- Alistair
> ~Gregory
>
On Mon, Nov 24, 2025 at 10:09:37AM +1100, Alistair Popple wrote:
> On 2025-11-22 at 08:07 +1100, Gregory Price <gourry@gourry.net> wrote...
> > On Tue, Nov 18, 2025 at 06:02:02PM +1100, Alistair Popple wrote:
> > >
>
> There are multiple types here (DEVICE_PRIVATE and DEVICE_COHERENT). The former
> is mostly irrelevant for this discussion but I'm including the descriptions here
> for completeness.
>
I appreciate you taking the time here. I'll maybe try to look at
updating the docs as this evolves.
> > But I could imagine an (overly simplistic) pattern with SPM Nodes:
> >
> > fd = open("/dev/gpu_mem", ...)
> > buf = mmap(fd, ...)
> > buf[0]
> > 1) driver takes the fault
> > 2) driver calls alloc_page(..., gpu_node, GFP_SPM_NODE)
> > 3) driver manages any special page table masks
> > Like marking pages RO/RW to manage ownership.
>
> Of course, as an aside, this needs to match the CPU PTE logic (this is
> what hmm_range_fault() is primarily used for).
>
This is actually the most interesting part of the series for me. I'm using
a compressed memory device as a stand-in for a memory type that requires
special page table entries (RO) to avoid compression ratios tanking
(resulting, eventually, in an MCE as there's no way to slow things down).
You can somewhat "Get there from here" through device coherent
ZONE_DEVICE, but you still don't have access to basic services like
compaction and reclaim - which you absolutely do want for such a memory
type (for the same reasons we groom zswap and zram).
I wonder if we can even re-use the hmm interfaces for SPM nodes to make
managing special page table policies easier as well. That seems
promising.
I said this during LSFMM: without isolation, "memory policy" is really
just a suggestion. What we're describing here is all predicated on the
isolation work, and all of a sudden much clearer examples of managing
memory on NUMA boundaries start to make a little more sense.
> > 4) driver sends the gpu the (mapping_id, pfn, index) information
> > so that gpu can map the region in its page tables.
>
> On coherent systems this often just uses HW address translation services
> (ATS), although I think the specific implementation of how page-tables are
> mirrored/shared is orthogonal to this.
>
Yeah, this part is completely foreign to me; I just presume there's some
way to tell the GPU how to reconstruct the virtually contiguous setup.
That mechanism would be entirely reusable here (I assume).
> This is roughly how things work with DEVICE_PRIVATE/COHERENT memory today,
> except in the case of DEVICE_PRIVATE in step (5) above. In that case the page is
> mapped as a non-present special swap entry that triggers a driver callback due
> to the lack of cache coherence.
>
Btw, just an aside, Lorenzo is moving to rename these entries to
softleaf (software-leaf) entries. I think you'll find it welcome.
https://lore.kernel.org/linux-mm/c879383aac77d96a03e4d38f7daba893cd35fc76.1762812360.git.lorenzo.stoakes@oracle.com/
> > Driver doesn't have to do much in the way of allocation management.
> >
> > This is probably less compelling since you don't want general purpose
> > services like reclaim, migration, compaction, tiering - etc.
>
> On at least some of our systems I'm told we do want this, hence my interest
> here. Currently we have systems not using DEVICE_COHERENT and instead just
> onlining everything as normal system managed memory in order to get reclaim
> and tiering. Of course then people complain that it's managed as normal system
> memory and non-GPU related things (ie. page-cache) end up in what's viewed as
> special purpose memory.
>
Ok, so now this gets interesting then. I don't understand how this
makes sense (not saying it doesn't, I simply don't understand).
I would presume that under no circumstance do you want device memory to
just suddenly disappear without some coordination from the driver.
Whether it's compaction or reclaim, you have some thread that's going to
migrate a virtual mapping from HPA(A) to HPA(B) and HPA(B) may or may not
even map to the same memory device.
That thread may not even be called in the context of a thread which
accesses GPU memory (although, I think we could enforce that on top
of SPM nodes, but devil is in the details).
Maybe that "all magically works" because of the ATS described above?
I suppose this assumes you have some kind of unified memory view between
host and device memory? Are there docs here you can point me at that
might explain this wizardry? (Sincerely, this is fascinating)
> > The value is clearly that you get to manage GPU memory like any other
> > memory, but without worry that other parts of the system will touch it.
> >
> > I'm much more focused on the "I have memory that is otherwise general
> > purpose, and wants services like reclaim and compaction, but I want
> > strong controls over how things can land there in the first place".
>
> So maybe there is some overlap here - what I have is memory that we want managed
> much like normal memory but with strong controls over what it can be used for
> (ie. just for tasks utilising the processing element on the accelerator).
>
I think it might be great if we could discuss this a bit more in depth,
as I've already been considering very mild refactors to reclaim to
enable a driver to engage it with an SPM node as the only shrink target.
This all becomes much more complicated due to per-memcg LRUs and such.
All that said, I'm focused on the isolation / allocation pieces first.
If that can't be agreed upon, the rest isn't worth exploring.
I do have a mild extension to mempolicy that allows mbind() to hit an
SPM node as an example as well. I'll discuss this in the response to
David's thread, as he had some related questions about the GFP flag.
~Gregory
On 2025-11-25 at 02:28 +1100, Gregory Price <gourry@gourry.net> wrote...
> On Mon, Nov 24, 2025 at 10:09:37AM +1100, Alistair Popple wrote:
> > On 2025-11-22 at 08:07 +1100, Gregory Price <gourry@gourry.net> wrote...
> > > On Tue, Nov 18, 2025 at 06:02:02PM +1100, Alistair Popple wrote:
> > > >
> >
> > There are multiple types here (DEVICE_PRIVATE and DEVICE_COHERENT). The former
> > is mostly irrelevant for this discussion but I'm including the descriptions here
> > for completeness.
> >
>
> I appreciate you taking the time here. I'll maybe try to look at
> updating the docs as this evolves.
I believe the DEVICE_PRIVATE bit is documented here
https://www.kernel.org/doc/Documentation/vm/hmm.rst , but if there is anything
there that you think needs improvement I'd be happy to look or review. I'm not
sure if that was updated for DEVICE_COHERENT though.
> > > But I could imagine an (overly simplistic) pattern with SPM Nodes:
> > >
> > > fd = open("/dev/gpu_mem", ...)
> > > buf = mmap(fd, ...)
> > > buf[0]
> > > 1) driver takes the fault
> > > 2) driver calls alloc_page(..., gpu_node, GFP_SPM_NODE)
> > > 3) driver manages any special page table masks
> > > Like marking pages RO/RW to manage ownership.
> >
> > Of course, as an aside, this needs to match the CPU PTE logic (this is
> > what hmm_range_fault() is primarily used for).
> >
>
> This is actually the most interesting part of the series for me. I'm using
> a compressed memory device as a stand-in for a memory type that requires
> special page table entries (RO) to avoid compression ratios tanking
> (resulting, eventually, in an MCE as there's no way to slow things down).
>
> You can somewhat "Get there from here" through device coherent
> ZONE_DEVICE, but you still don't have access to basic services like
> compaction and reclaim - which you absolutely do want for such a memory
> type (for the same reasons we groom zswap and zram).
>
> I wonder if we can even re-use the hmm interfaces for SPM nodes to make
> managing special page table policies easier as well. That seems
> promising.
It might depend on what exactly you're looking to do - HMM is really two
parts: one for mirroring page tables, and another for allowing special
non-present PTEs to be set up that map a dummy ZONE_DEVICE struct page
which notifies a driver when the CPU attempts access.
> I said this during LSFMM: without isolation, "memory policy" is really
> just a suggestion. What we're describing here is all predicated on the
> isolation work, and all of a sudden much clearer examples of managing
> memory on NUMA boundaries start to make a little more sense.
I very much agree with the views of memory policy that you shared in one of the
other threads. I don't think it is adequate for providing isolation, and agree
the isolation (and degree of isolation) is the interesting bit of the work here,
at least for now.
>
> > > 4) driver sends the gpu the (mapping_id, pfn, index) information
> > > so that gpu can map the region in its page tables.
> >
> > On coherent systems this often just uses HW address translation services
> > (ATS), although I think the specific implementation of how page-tables are
> > mirrored/shared is orthogonal to this.
> >
>
> Yeah, this part is completely foreign to me; I just presume there's some
> way to tell the GPU how to reconstruct the virtually contiguous setup.
> That mechanism would be entirely reusable here (I assume).
>
> > This is roughly how things work with DEVICE_PRIVATE/COHERENT memory today,
> > except in the case of DEVICE_PRIVATE in step (5) above. In that case the page is
> > mapped as a non-present special swap entry that triggers a driver callback due
> > to the lack of cache coherence.
> >
>
> Btw, just an aside, Lorenzo is moving to rename these entries to
> softleaf (software-leaf) entries. I think you'll find it welcome.
> https://lore.kernel.org/linux-mm/c879383aac77d96a03e4d38f7daba893cd35fc76.1762812360.git.lorenzo.stoakes@oracle.com/
>
> > > Driver doesn't have to do much in the way of allocation management.
> > >
> > > This is probably less compelling since you don't want general purpose
> > > services like reclaim, migration, compaction, tiering - etc.
> >
> > On at least some of our systems I'm told we do want this, hence my interest
> > here. Currently we have systems not using DEVICE_COHERENT and instead just
> > onlining everything as normal system managed memory in order to get reclaim
> > and tiering. Of course then people complain that it's managed as normal system
> > memory and non-GPU related things (ie. page-cache) end up in what's viewed as
> > special purpose memory.
> >
>
> Ok, so now this gets interesting then. I don't understand how this
> makes sense (not saying it doesn't, I simply don't understand).
>
> I would presume that under no circumstance do you want device memory to
> just suddenly disappear without some coordination from the driver.
>
> Whether it's compaction or reclaim, you have some thread that's going to
> migrate a virtual mapping from HPA(A) to HPA(B) and HPA(B) may or may not
> even map to the same memory device.
>
> That thread may not even be called in the context of a thread which
> accesses GPU memory (although, I think we could enforce that on top
> of SPM nodes, but devil is in the details).
>
> Maybe that "all magically works" because of the ATS described above?
Pretty much - both ATS and hmm_range_fault() are, conceptually at least, just
methods of sharing/mirroring the CPU page table to a device. So in your example
above, if a thread were to migrate a mapping from one page to another, this
"black magic" would keep everything in sync. Eg. for hmm_range_fault(), the
driver gets an mmu_notifier callback saying the virtual mapping no longer
points to HPA(A). If it needs to find the new mapping to HPA(B), it can look it
up using hmm_range_fault() and program its page tables with the new mapping.
At a sufficiently high level, ATS is just a HW-implemented equivalent of this.
> I suppose this assumes you have some kind of unified memory view between
> host and device memory? Are there docs here you can point me at that
> might explain this wizardry? (Sincerely, this is fascinating)
Right - it's all predicated on the host and device sharing the same view of the
virtual address space. I'm not sure of any good docs on this, but I will be at
LPC so would be happy to have a discussion there.
> > > The value is clearly that you get to manage GPU memory like any other
> > > memory, but without worry that other parts of the system will touch it.
> > >
> > > I'm much more focused on the "I have memory that is otherwise general
> > > purpose, and wants services like reclaim and compaction, but I want
> > > strong controls over how things can land there in the first place".
> >
> > So maybe there is some overlap here - what I have is memory that we want managed
> > much like normal memory but with strong controls over what it can be used for
> > (ie. just for tasks utilising the processing element on the accelerator).
> >
>
> I think it might be great if we could discuss this a bit more in-depth,
> as I've already been considering very mild refactors to reclaim to
> enable a driver to engage it with an SPM node as the only shrink target.
Absolutely! Looking forward to an in-person discussion.
- Alistair
> This all becomes much more complicated due to per-memcg LRUs and such.
>
> All that said, I'm focused on the isolation / allocation pieces first.
> If that can't be agreed upon, the rest isn't worth exploring.
>
> I do have a mild extension to mempolicy that allows mbind() to hit an
> SPM node as an example as well. I'll discuss this in the response to
> David's thread, as he had some related questions about the GFP flag.
>
> ~Gregory
>
On Tue, Nov 18, 2025 at 06:02:02PM +1100, Alistair Popple wrote: > On 2025-11-13 at 06:29 +1100, Gregory Price <gourry@gourry.net> wrote... > > - Why? (In short: shunting to DAX is a failed pattern for users) > > - Other designs I considered (mempolicy, cpusets, zone_device) > > I'm interested in the contrast with zone_device, and in particular why > device_coherent memory doesn't end up being a good fit for this. > I did consider zone_device briefly, but if you want sparse allocation you end up essentially re-implementing some form of buddy allocator. That seemed less then ideal, to say the least. Additionally, pgmap use precludes these pages from using LRU/Reclaim, and some devices may very well be compatible with such patterns. (I think compression will be, but it still needs work) > > - Why mempolicy.c and cpusets as-is are insufficient > > - SPM types seeking this form of interface (Accelerator, Compression) > > I'm sure you can guess my interest is in GPUs which also have memory some people > consider should only be used for specific purposes :-) Currently our coherent > GPUs online this as a normal NUMA noode, for which we have also generally > found mempolicy, cpusets, etc. inadequate as well, so it will be interesting to > hear what short comings you have been running into (I'm less familiar with the > Compression cases you talk about here though). > The TL;DR: cpusets as-designed doesn't really allow the concept of "Nothing can access XYZ node except specific things" because this would involve removing a node from the root cpusets.mems - and that can't be loosened. mempolicy is more of a suggestion and can be completely overridden. It is entirely ignored by things like demotion/reclaim/etc. I plan to discuss a bit of the specifics at LPC, but a lot of this stems from the zone-iteration logic in page_alloc.c and the rather... ermm... "complex" nature of how mempolicy and cpusets interacts with each other. 
I may add some additional notes on this thread prior to LPC, given that
time may be too short to get into the nasty bits in the session.

> > - Platform extensions that would be nice to see (SPM-only Bits)
> >
> > Open Questions
> > - Single SPM nodemask, or multiple based on features?
> > - Apply SPM/SysRAM bit on-boot only or at-hotplug?
> > - Allocate extra "possible" NUMA nodes for flexibility?
>
> I guess this might make hotplug easier? Particularly in cases where FW
> hasn't created the nodes.

In cases where you need to reach back to the device for some signal, you
likely need the driver for that device to manage the alloc/free patterns -
so this may (or may not) generalize to one device per node.

In the scenario where you want some flexibility in managing regions, this
may require multiple nodes per device. Maybe one device provides multiple
types of memory - you want those on separate nodes.

This doesn't seem like something you need to solve right away, just
something for folks to consider.

> > - Should SPM Nodes be zone-restricted? (MOVABLE only?)
>
> For device based memory I think so - otherwise you can never guarantee
> devices can be removed or drivers (if required to access the memory) can
> be unbound, as you can't migrate things off the memory.

Zones in this scenario are a bit of a square-peg/round-hole. Forcing
everything into ZONE_MOVABLE means you can't do page pinning or things
like 1GB gigantic pages. But the device driver should be capable of
managing hotplug anyway, so what's the point of ZONE_MOVABLE? :shrug:

> > The ZSwap example demonstrates this with the `mt_spm_nodemask`. This
> > hack treats all spm nodes as-if they are compressed memory nodes, and
> > we bypass the software compression logic in zswap in favor of simply
> > copying memory directly to the allocated page. In a real design
>
> So in your example (I get it's a hack) is the main advantage that you can
> use all the same memory allocation policies (eg.
> cgroups) when needing to allocate the pages? Given this is ZSwap I guess
> these pages would never be mapped directly into user-space, but would
> anything in the design prevent that?

This is, in fact, the long-term intent. As long as the device can manage
inline decompression with reasonable latencies, there's no reason you
shouldn't be able to leave the pages mapped read-only in user-space. The
driver would be responsible for migrating on write-fault, similar to a
NUMA hint fault in the existing transparent page placement system.

> For example could a driver say allocate SPM memory and then explicitly
> migrate an existing page to it?

You might even extend migrate_pages() with a new flag that simply drops
the writable flag from the page table mapping, and abstract that entire
complexity out of the driver :]

~Gregory