[RFC PATCH v3 5/8] Documentation/admin-guide/cgroups: update docs for mems_allowed

Posted by Gregory Price 1 month ago
Add new information about mems_allowed and sysram_nodes: mems_allowed
may contain the union of N_MEMORY and N_PRIVATE nodes, while
sysram_nodes may only contain a subset of N_MEMORY nodes.

cpuset.mems.sysram is a new read-only ABI which reports the list of
N_MEMORY nodes the cpuset is allowed to use, while cpuset.mems and
mems.effective may also contain N_PRIVATE nodes.
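
For illustration only, a minimal userspace reader of the two files (the
"test" cgroup path is a placeholder, and this assumes cgroup2 is
mounted at /sys/fs/cgroup):

  /* Example only: compare cpuset.mems.effective with the sysram subset.
   * The cgroup path below is a placeholder.
   */
  #include <stdio.h>

  static void show(const char *path)
  {
          char buf[256];
          FILE *f = fopen(path, "r");

          if (f && fgets(buf, sizeof(buf), f))
                  printf("%s: %s", path, buf);
          if (f)
                  fclose(f);
  }

  int main(void)
  {
          show("/sys/fs/cgroup/test/cpuset.mems.effective");
          show("/sys/fs/cgroup/test/cpuset.mems.sysram");
          return 0;
  }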

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 .../admin-guide/cgroup-v1/cpusets.rst         | 19 +++++++++++---
 Documentation/admin-guide/cgroup-v2.rst       | 26 +++++++++++++++++--
 Documentation/filesystems/proc.rst            |  2 +-
 3 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst
index c7909e5ac136..6d326056f7b4 100644
--- a/Documentation/admin-guide/cgroup-v1/cpusets.rst
+++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
@@ -158,21 +158,26 @@ new system calls are added for cpusets - all support for querying and
 modifying cpusets is via this cpuset file system.
 
 The /proc/<pid>/status file for each task has four added lines,
-displaying the task's cpus_allowed (on which CPUs it may be scheduled)
-and mems_allowed (on which Memory Nodes it may obtain memory),
-in the two formats seen in the following example::
+displaying the task's cpus_allowed (on which CPUs it may be scheduled),
+and mems_allowed (on which SystemRAM nodes it may obtain memory),
+in the formats seen in the following example::
 
   Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
   Cpus_allowed_list:      0-127
   Mems_allowed:   ffffffff,ffffffff
   Mems_allowed_list:      0-63
 
+Note that Mems_allowed only shows SystemRAM nodes (N_MEMORY), not
+Private Nodes.  Private Nodes may be accessible via __GFP_THISNODE
+allocations if they appear in the task's cpuset.effective_mems.
+
 Each cpuset is represented by a directory in the cgroup file system
 containing (on top of the standard cgroup files) the following
 files describing that cpuset:
 
  - cpuset.cpus: list of CPUs in that cpuset
  - cpuset.mems: list of Memory Nodes in that cpuset
+ - cpuset.mems.sysram: read-only list of SystemRAM nodes (excludes Private Nodes)
  - cpuset.memory_migrate flag: if set, move pages to cpusets nodes
  - cpuset.cpu_exclusive flag: is cpu placement exclusive?
  - cpuset.mem_exclusive flag: is memory placement exclusive?
@@ -227,7 +232,9 @@ nodes with memory--using the cpuset_track_online_nodes() hook.
 
 The cpuset.effective_cpus and cpuset.effective_mems files are
 normally read-only copies of cpuset.cpus and cpuset.mems files
-respectively.  If the cpuset cgroup filesystem is mounted with the
+respectively.  The cpuset.effective_mems file may include both
+regular SystemRAM nodes (N_MEMORY) and Private Nodes (N_PRIVATE).
+If the cpuset cgroup filesystem is mounted with the
 special "cpuset_v2_mode" option, the behavior of these files will become
 similar to the corresponding files in cpuset v2.  In other words, hotplug
 events will not change cpuset.cpus and cpuset.mems.  Those events will
@@ -236,6 +243,10 @@ the actual cpus and memory nodes that are currently used by this cpuset.
 See Documentation/admin-guide/cgroup-v2.rst for more information about
 cpuset v2 behavior.
 
+The cpuset.mems.sysram file shows only the SystemRAM nodes (N_MEMORY)
+from cpuset.effective_mems, excluding any Private Nodes. This
+represents the nodes available for general memory allocation.
+
 
 1.4 What are exclusive cpusets ?
 --------------------------------
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 7f5b59d95fce..6af54efb84a2 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2530,8 +2530,11 @@ Cpuset Interface Files
 	cpuset-enabled cgroups.
 
 	It lists the onlined memory nodes that are actually granted to
-	this cgroup by its parent. These memory nodes are allowed to
-	be used by tasks within the current cgroup.
+	this cgroup by its parent.  This includes both regular SystemRAM
+	nodes (N_MEMORY) and Private Nodes (N_PRIVATE) that provide
+	device-specific memory not intended for general consumption.
+	Tasks within this cgroup may access Private Nodes using explicit
+	__GFP_THISNODE allocations if the node is in this mask.
 
 	If "cpuset.mems" is empty, it shows all the memory nodes from the
 	parent cgroup that will be available to be used by this cgroup.
@@ -2541,6 +2544,25 @@ Cpuset Interface Files
 
 	Its value will be affected by memory nodes hotplug events.
 
+  cpuset.mems.sysram
+	A read-only multiple values file which exists on all
+	cpuset-enabled cgroups.
+
+	It lists the SystemRAM nodes (N_MEMORY) that are available for
+	general memory allocation by tasks within this cgroup.  This is
+	a subset of "cpuset.mems.effective" that excludes Private Nodes.
+
+	Normal page allocations are restricted to nodes in this mask.
+	The kernel page allocator, slab allocator, and compaction only
+	consider SystemRAM nodes when allocating memory for tasks.
+
+	Private Nodes are excluded from this mask because their memory
+	is managed by device drivers for specific purposes (e.g., CXL
+	compressed memory, accelerator memory) and should not be used
+	for general allocations.
+
+	Its value will be affected by memory nodes hotplug events.
+
   cpuset.cpus.exclusive
 	A read-write multiple values file which exists on non-root
 	cpuset-enabled cgroups.
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index c92e95e28047..68f3d8ffc03b 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -294,7 +294,7 @@ It's slow but very precise.
  Cpus_active_mm              mask of CPUs on which this process has an active
                              memory context
  Cpus_active_mm_list         Same as previous, but in "list format"
- Mems_allowed                mask of memory nodes allowed to this process
+ Mems_allowed                mask of SystemRAM nodes for general allocations
  Mems_allowed_list           Same as previous, but in "list format"
  voluntary_ctxt_switches     number of voluntary context switches
  nonvoluntary_ctxt_switches  number of non voluntary context switches
-- 
2.52.0
Re: [RFC PATCH v3 5/8] Documentation/admin-guide/cgroups: update docs for mems_allowed
Posted by Michal Koutný 4 weeks ago
Hello.

On Thu, Jan 08, 2026 at 03:37:52PM -0500, Gregory Price <gourry@gourry.net> wrote:
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -2530,8 +2530,11 @@ Cpuset Interface Files
>  	cpuset-enabled cgroups.
>  
>  	It lists the onlined memory nodes that are actually granted to
> -	this cgroup by its parent. These memory nodes are allowed to
> -	be used by tasks within the current cgroup.
> +	this cgroup by its parent.  This includes both regular SystemRAM
> +	nodes (N_MEMORY) and Private Nodes (N_PRIVATE) that provide
> +	device-specific memory not intended for general consumption.
> +	Tasks within this cgroup may access Private Nodes using explicit
> +	__GFP_THISNODE allocations if the node is in this mask.

Notice that these files are exposed to userspace. Hence I'm not sure
userspace tasks would be able to ask for allocations like this (or even
need to know about this implementation detail).

>  
>  	If "cpuset.mems" is empty, it shows all the memory nodes from the
>  	parent cgroup that will be available to be used by this cgroup.
> @@ -2541,6 +2544,25 @@ Cpuset Interface Files
>  
>  	Its value will be affected by memory nodes hotplug events.
>  
> +  cpuset.mems.sysram
> +	A read-only multiple values file which exists on all
> +	cpuset-enabled cgroups.
> +
> +	It lists the SystemRAM nodes (N_MEMORY) that are available for
> +	general memory allocation by tasks within this cgroup.  This is
> +	a subset of "cpuset.mems.effective" that excludes Private Nodes.
> +
> +	Normal page allocations are restricted to nodes in this mask.
> +	The kernel page allocator, slab allocator, and compaction only
> +	consider SystemRAM nodes when allocating memory for tasks.
> +
> +	Private Nodes are excluded from this mask because their memory
> +	is managed by device drivers for specific purposes (e.g., CXL
> +	compressed memory, accelerator memory) and should not be used
> +	for general allocations.

So I wonder whether the N_PRIVATE nodes should be included in
cpuset.mems[.effective] at all.
(It resembles CPU isolation to me a bit ~ cpuset.cpus.isolated.)
Maybe you only want to expose it on the root cpuset cg and inverted like
cpuset.mems.private?

Thanks,
Michal
Re: [RFC PATCH v3 5/8] Documentation/admin-guide/cgroups: update docs for mems_allowed
Posted by Gregory Price 4 weeks ago
On Mon, Jan 12, 2026 at 03:30:26PM +0100, Michal Koutný wrote:
> Hello.
> 
> On Thu, Jan 08, 2026 at 03:37:52PM -0500, Gregory Price <gourry@gourry.net> wrote:
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -2530,8 +2530,11 @@ Cpuset Interface Files
> >  	cpuset-enabled cgroups.
> >  
> >  	It lists the onlined memory nodes that are actually granted to
> > -	this cgroup by its parent. These memory nodes are allowed to
> > -	be used by tasks within the current cgroup.
> > +	this cgroup by its parent.  This includes both regular SystemRAM
> > +	nodes (N_MEMORY) and Private Nodes (N_PRIVATE) that provide
> > +	device-specific memory not intended for general consumption.
> > +	Tasks within this cgroup may access Private Nodes using explicit
> > +	__GFP_THISNODE allocations if the node is in this mask.
> 
> Notice that these files are exposed to userspace. Hence I'm not sure
> userspace tasks would be able to ask for allocations like this (or even
> need to know about this implementation detail).
>

Fair, I can drop this; the intent is actually to keep user-space from
needing to know about this at all.

> >  
> >  	If "cpuset.mems" is empty, it shows all the memory nodes from the
> >  	parent cgroup that will be available to be used by this cgroup.
> > @@ -2541,6 +2544,25 @@ Cpuset Interface Files
> >  
> >  	Its value will be affected by memory nodes hotplug events.
> >  
> > +  cpuset.mems.sysram
> > +	A read-only multiple values file which exists on all
> > +	cpuset-enabled cgroups.
> > +
> > +	It lists the SystemRAM nodes (N_MEMORY) that are available for
> > +	general memory allocation by tasks within this cgroup.  This is
> > +	a subset of "cpuset.mems.effective" that excludes Private Nodes.
> > +
> > +	Normal page allocations are restricted to nodes in this mask.
> > +	The kernel page allocator, slab allocator, and compaction only
> > +	consider SystemRAM nodes when allocating memory for tasks.
> > +
> > +	Private Nodes are excluded from this mask because their memory
> > +	is managed by device drivers for specific purposes (e.g., CXL
> > +	compressed memory, accelerator memory) and should not be used
> > +	for general allocations.
> 
> So I wonder whether the N_PRIVATE nodes should be included in
> cpuset.mems[.effective] at all.

I think it makes the control path easier (both more intuitive and easier
to write in the cpuset code), but I can take another look at this.

That said, I think omitting them from .effective prevents the user from
controlling whether their memory ends up on that node.

I.e. the user might be aware that they have compressed memory on node N
and have a cgroup that they don't want on node N; not having that node
included in mems.allowed / mems.effective means they can't control this.
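
To make that concrete (cgroup path and node numbers made up): if node 3
is the compressed-memory node, an admin can only keep a given cgroup off
of it if node 3 is something cpuset.mems can express, e.g.:

	/* Illustration only: restrict a cgroup to nodes 0-2, keeping it
	 * off a hypothetical private/compressed node 3.
	 */
	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/sys/fs/cgroup/test/cpuset.mems", "w");

		if (!f)
			return 1;
		fprintf(f, "0-2\n");
		return fclose(f) ? 1 : 0;
	}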

> (It resembles CPU isolation to me a bit ~ cpuset.cpus.isolated.)
> Maybe you only want to expose it on the root cpuset cg and inverted like
> cpuset.mems.private?
>

Hm, I had not considered adding a separate mask for .private as opposed
to .sysram.

If all we actually need is to change the allowed() callback to check an
additional nodemask, that might end up cleaner.
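
Roughly something like this (completely untested sketch; cs->private_mems
and the helper name are made up, only effective_mems exists today):

	/* Sketch: let __GFP_THISNODE reach a private node without adding
	 * private nodes to effective_mems.
	 */
	static bool cpuset_node_allowed_sketch(struct cpuset *cs, int nid,
					       gfp_t gfp_mask)
	{
		if (node_isset(nid, cs->effective_mems))
			return true;

		/* private nodes only via explicit node-bound allocations */
		return (gfp_mask & __GFP_THISNODE) &&
		       node_isset(nid, cs->private_mems);
	}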

Thank you, I'll take another look at this piece.

~Gregory