[PATCH v3 14/17] cxl: docs/allocation/page-allocator

Gregory Price posted 17 patches 9 months ago
[PATCH v3 14/17] cxl: docs/allocation/page-allocator
Posted by Gregory Price 9 months ago
Document some interesting interactions that occur when exposing CXL
memory capacity to the page allocator.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 .../cxl/allocation/page-allocator.rst         | 85 +++++++++++++++++++
 Documentation/driver-api/cxl/index.rst        |  1 +
 2 files changed, 86 insertions(+)
 create mode 100644 Documentation/driver-api/cxl/allocation/page-allocator.rst

diff --git a/Documentation/driver-api/cxl/allocation/page-allocator.rst b/Documentation/driver-api/cxl/allocation/page-allocator.rst
new file mode 100644
index 000000000000..7b8fe1b8d5bb
--- /dev/null
+++ b/Documentation/driver-api/cxl/allocation/page-allocator.rst
@@ -0,0 +1,85 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+The Page Allocator
+==================
+
+The kernel page allocator services all general page allocation requests, such
+as the pages ultimately backing :code:`kmalloc`.  CXL configuration steps
+affect the behavior of the page allocator depending on the `Memory Zone` and
+`NUMA node` the capacity is placed in.
+
+This section focuses on how these configuration choices affect the page
+allocator (as of Linux v6.15), rather than on page allocator behavior in
+general.
+
+NUMA nodes and mempolicy
+========================
+Unless a task explicitly registers a mempolicy, the default memory policy
+of the Linux kernel is to allocate memory from the `local NUMA node` first,
+and to fall back to other nodes only if the local node is under pressure.
+
+Generally, we expect to see local DRAM and CXL memory on separate NUMA nodes,
+with the CXL memory being non-local.  Technically, however, it is possible
+for a compute node to have no local DRAM, and for CXL memory to be the
+`local` capacity for that compute node.
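The local-first fallback described above can be sketched in a few lines of
Python.  This is an illustrative model, not kernel code; the function and
variable names (:code:`pick_node`, :code:`free_pages`) are invented for the
example:

```python
# Hypothetical model of the default "local" mempolicy: try the
# allocating CPU's node first, then fall back through the remaining
# nodes (in distance order) only when earlier nodes have no free pages.
def pick_node(local_node, fallback_order, free_pages):
    """Return the first node, in local-first order, with free capacity."""
    order = [local_node] + [n for n in fallback_order if n != local_node]
    for node in order:
        if free_pages.get(node, 0) > 0:
            return node
    return None  # every node depleted: reclaim / OOM territory

# Node 0: local DRAM, Node 1: CXL expander.
free = {0: 2, 1: 1000}
assert pick_node(0, [0, 1], free) == 0   # local DRAM wins while it has pages
free[0] = 0
assert pick_node(0, [0, 1], free) == 1   # CXL used only under local pressure
```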
+
+
+Memory Zones
+============
+CXL capacity may be onlined in :code:`ZONE_NORMAL` or :code:`ZONE_MOVABLE`.
+
+As of v6.15, the page allocator first attempts to satisfy an allocation from
+the highest compatible zone available on the local node.
+
+An example of a `zone incompatibility` is attempting to service an allocation
+marked :code:`GFP_KERNEL` from :code:`ZONE_MOVABLE`.  Kernel allocations are
+typically not migratable, and as a result can only be serviced from
+:code:`ZONE_NORMAL` or lower.
+
+To simplify this, the page allocator prefers :code:`ZONE_MOVABLE` over
+:code:`ZONE_NORMAL` by default, but if :code:`ZONE_MOVABLE` is depleted, it
+will fall back to allocating from :code:`ZONE_NORMAL`.
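The zone-compatibility rule and the movable-first preference can be modeled
with a short Python sketch.  Again, this is not kernel code; the zone
constants and :code:`pick_zone` helper are invented for illustration:

```python
# Hypothetical model: GFP_KERNEL-style allocations must come from
# ZONE_NORMAL or lower; movable allocations start at ZONE_MOVABLE and
# fall back downward when it is depleted.
ZONE_NORMAL, ZONE_MOVABLE = 0, 1  # larger number == "higher" zone

def pick_zone(gfp_movable, free_pages):
    highest = ZONE_MOVABLE if gfp_movable else ZONE_NORMAL
    for zone in range(highest, -1, -1):   # highest compatible zone first
        if free_pages.get(zone, 0) > 0:
            return zone
    return None

free = {ZONE_NORMAL: 100, ZONE_MOVABLE: 100}
assert pick_zone(gfp_movable=False, free_pages=free) == ZONE_NORMAL
assert pick_zone(gfp_movable=True,  free_pages=free) == ZONE_MOVABLE
free[ZONE_MOVABLE] = 0
assert pick_zone(gfp_movable=True,  free_pages=free) == ZONE_NORMAL
```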
+
+
+Zone and Node Quirks
+====================
+Let's consider a configuration where the local DRAM capacity is largely onlined
+into :code:`ZONE_NORMAL`, with no :code:`ZONE_MOVABLE` capacity present. The
+CXL capacity has the opposite configuration - all onlined in
+:code:`ZONE_MOVABLE`.
+
+Under the default allocation policy, the page allocator will completely skip
+:code:`ZONE_MOVABLE` as a valid allocation target.  This is because, as of
+Linux v6.15, the page allocator does (approximately) the following: ::
+
+  for (each zone in local_node):
+
+    for (each node in fallback_order):
+
+      attempt_allocation(gfp_flags);
+
+Because the local node does not have :code:`ZONE_MOVABLE`, the CXL node is
+functionally unreachable for direct allocation.  As a result, the only way
+for CXL capacity to be used is via `demotion` in the reclaim path.
+
+This configuration also means that if the DRAM node has :code:`ZONE_MOVABLE`
+capacity, then when that capacity is depleted, the page allocator will actually
+prefer CXL :code:`ZONE_MOVABLE` pages over DRAM :code:`ZONE_NORMAL` pages.
+
+We may wish to invert this priority in future Linux versions.
+
+If `demotion` and `swap` are disabled, Linux will begin to invoke the OOM
+killer when the DRAM nodes are depleted.  See the reclaim section for more
+details.
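The quirk described in this section follows directly from the loop nesting in
the pseudocode above: zones are taken from the *local* node, so a zone that is
absent locally is never tried on any fallback node.  A small Python sketch
(illustrative only; :code:`reachable` and the zone-set layout are invented)
makes this concrete:

```python
# Hypothetical model of the nested iteration: outer loop over the local
# node's zones, inner loop over the node fallback order.  A zone the
# local node lacks never appears in the outer loop, so remote capacity
# in that zone is unreachable for direct allocation.
def reachable(local_zones, node_zones, fallback_order):
    """Return the (node, zone) pairs the allocator would ever consider."""
    targets = []
    for zone in local_zones:            # zones of the *local* node only
        for node in fallback_order:     # node fallback within each zone
            if zone in node_zones[node]:
                targets.append((node, zone))
    return targets

# Node 0 (DRAM): ZONE_NORMAL only.  Node 1 (CXL): ZONE_MOVABLE only.
zones = {0: {"NORMAL"}, 1: {"MOVABLE"}}
targets = reachable(zones[0], zones, fallback_order=[0, 1])
assert targets == [(0, "NORMAL")]   # CXL's ZONE_MOVABLE is never considered
```

If the DRAM node is instead given some :code:`ZONE_MOVABLE` capacity, the
outer loop visits that zone, and the CXL node's movable pages become reachable
ahead of DRAM's :code:`ZONE_NORMAL` pages, which is exactly the inverted
preference noted above.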
+
+
+CGroups and CPUSets
+===================
+Finally, assuming CXL memory is reachable via the page allocator (i.e. onlined
+in :code:`ZONE_NORMAL`), :code:`cpusets.mems_allowed` may be used by
+containers to limit the accessibility of certain NUMA nodes for tasks in that
+container.  Users may wish to utilize this on multi-tenant systems where some
+tasks prefer not to use slower memory.
+
+In the reclaim section we'll discuss some limitations of this interface when
+trying to prevent demotions of shared data to CXL memory (if demotions are
+enabled).
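The effect of an allowed-node mask on the fallback order can be sketched in
one line of Python.  This is a toy model of the behavior described above, not
the kernel's implementation; :code:`allowed_targets` is an invented name:

```python
# Hypothetical model: the container's allowed-node mask filters which
# NUMA nodes the allocator may use for that container's tasks.
def allowed_targets(fallback_order, mems_allowed):
    return [n for n in fallback_order if n in mems_allowed]

# Node 0: DRAM, Node 1: CXL.  A latency-sensitive container pins to DRAM.
assert allowed_targets([0, 1], mems_allowed={0}) == [0]
# An unconstrained container may spill to CXL as usual.
assert allowed_targets([0, 1], mems_allowed={0, 1}) == [0, 1]
```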
+
diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
index 6e7497f4811a..7acab7e7df96 100644
--- a/Documentation/driver-api/cxl/index.rst
+++ b/Documentation/driver-api/cxl/index.rst
@@ -45,5 +45,6 @@ that have impacts on each other.  The docs here break up configurations steps.
    :caption: Memory Allocation
 
    allocation/dax
+   allocation/page-allocator
 
 .. only::  subproject and html
-- 
2.49.0
Re: [PATCH v3 14/17] cxl: docs/allocation/page-allocator
Posted by Matthew Wilcox 9 months ago
On Mon, May 12, 2025 at 12:21:31PM -0400, Gregory Price wrote:
> Document some interesting interactions that occur when exposing CXL
> memory capacity to page allocator.

We should not do this.  Asking the page allocator for memory (eg for
slab) should never return memory on CXL.  There need to be special
interfaces for clients that know they can tolerate the added latency.

NAK this concept, and NAK this specific document.  I have no comment on
the previous documents.
Re: [PATCH v3 14/17] cxl: docs/allocation/page-allocator
Posted by Gregory Price 9 months ago
On Mon, May 12, 2025 at 05:34:56PM +0100, Matthew Wilcox wrote:
> On Mon, May 12, 2025 at 12:21:31PM -0400, Gregory Price wrote:
> > Document some interesting interactions that occur when exposing CXL
> > memory capacity to page allocator.
> 
> We should not do this.  Asking the page allocator for memory (eg for
> slab) should never return memory on CXL.  There need to be special
> interfaces for clients that know they can tolerate the added latency.
> 
> NAK this concept, and NAK this specific document.  I have no comment on
> the previous documents.

This describes what presently exists, so I'm not sure of what value a
NAK here is.

Feel free to submit patches that delete the existing code if you want
it removed from the documentation.

~Gregory
Re: [PATCH v3 14/17] cxl: docs/allocation/page-allocator
Posted by Matthew Wilcox 9 months ago
On Mon, May 12, 2025 at 12:38:47PM -0400, Gregory Price wrote:
> On Mon, May 12, 2025 at 05:34:56PM +0100, Matthew Wilcox wrote:
> > On Mon, May 12, 2025 at 12:21:31PM -0400, Gregory Price wrote:
> > > Document some interesting interactions that occur when exposing CXL
> > > memory capacity to page allocator.
> > 
> > We should not do this.  Asking the page allocator for memory (eg for
> > slab) should never return memory on CXL.  There need to be special
> > interfaces for clients that know they can tolerate the added latency.
> > 
> > NAK this concept, and NAK this specific document.  I have no comment on
> > the previous documents.
> 
> This describes what presently exists, so i'm not sure of what value a
> NAK here is.
> 
> Feel free to submit patches that deletes the existing code if you want
> it removed from the documentation.

Who sneaked that in when?
Re: [PATCH v3 14/17] cxl: docs/allocation/page-allocator
Posted by Gregory Price 9 months ago
On Mon, May 12, 2025 at 06:52:31PM +0100, Matthew Wilcox wrote:
> > 
> > Feel free to submit patches that deletes the existing code if you want
> > it removed from the documentation.
> 
> Who sneaked that in when?

The ACPI and EFI folks when they allowed for CXL memory to be marked 
EFI_CONVENTIONAL_MEMORY - which means Linux can't actually differentiate
between DRAM and CXL during __init and brings it online in the page
allocator as SystemRAM in ZONE_NORMAL (attached to the NUMA node that
maps to the Proximity Domain in the SRAT).

Not sure there's anything you can do about that.

And for DAX:

09d09e04d2 (cxl/dax: Create dax devices for CXL RAM regions)

Which allows for EFI_MEMORY_SP / Soft Reserved CXL regions to be brought
up as a DAX devices (which can be bound to SystemRAM via DAX kmem).

Wasn't much sneaking going on here - DAX kmem has been around and hacked
on since 2019, and probably some years before that.

~Gregory
Re: [PATCH v3 14/17] cxl: docs/allocation/page-allocator
Posted by dan.j.williams@intel.com 9 months ago
Gregory Price wrote:
> On Mon, May 12, 2025 at 06:52:31PM +0100, Matthew Wilcox wrote:
> > > 
> > > Feel free to submit patches that deletes the existing code if you want
> > > it removed from the documentation.
> > 
> > Who sneaked that in when?
> 
> The ACPI and EFI folks when they allowed for CXL memory to be marked 
> EFI_CONVENTIONAL_MEMORY - which means Linux can't actually differentiate
> between DRAM and CXL during __init and brings it online in the page
> allocator as SystemRAM in ZONE_NORMAL (attached to the NUMA node that
> maps to the Proximity Domain in the SRAT).
> 
> Not sure there's anything you can do about that.
> 
> And for DAX:
> 
> 09d09e04d2 (cxl/dax: Create dax devices for CXL RAM regions)
> 
> Which allows for EFI_MEMORY_SP / Soft Reserved CXL regions to be brought
> up as a DAX devices (which can be bound to SystemRAM via DAX kmem).
> 
> Wasn't much sneaking going on here - DAX kmem has been around and hacked
> on since 2019, and probably some years before that.

Right.

These interfaces have been there for a long time and this documentation
is simply catching up with what is there today. I called for all of this
documentation to go upstream and have no problem defending it to Linus.
Appreciate all the work here Gregory!

Now, is device-dax and dax_kmem the long term solution for exposing
memory of this relative performance class? After LSF/MM this year I am
convinced the answer is "no". Specifically I want to see a solution that
meets what this astute LWN commenter recommended:

https://lwn.net/Articles/1017142/

We can delete documentation and infrastructure once we have the
replacement interface upstream and can start a deprecation process.