[LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Rakie Kim 3 weeks ago
This patch series is an RFC to propose and discuss the overall design
and concept of a socket-aware weighted interleave mechanism. As there
are areas requiring further refinement, the primary goal at this stage
is to gather feedback on the architectural approach rather than focusing
on fine-grained implementation details.

Weighted interleave distributes page allocations across multiple nodes
based on configured weights. However, the current implementation applies
a single global weight vector. In multi-socket systems, this creates a
mismatch between configured weights and actual hardware performance, as
it cannot account for inter-socket interconnect costs. To address this,
we propose a socket-aware approach that restricts candidate nodes to
the local socket before applying weights.

Flat weighted interleave applies one global weight vector regardless of
where a task runs. On multi-socket systems, this ignores inter-socket
interconnect costs, meaning the configured weights do not accurately
reflect the actual hardware performance.
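As a concrete illustration, the flat policy amounts to consuming one
global weight vector round-robin, the same way from every CPU. A minimal
userspace sketch (hypothetical names; not the actual mm/mempolicy.c code):

```c
/*
 * Userspace sketch of flat weighted interleave (illustrative only).
 * One global weight vector is consumed round-robin regardless of
 * which CPU/socket the allocating task runs on.
 */
#include <assert.h>

#define MAX_NODES 4

static const unsigned char iw_table[MAX_NODES] = { 3, 3, 1, 1 };

/*
 * Return the node for the nth allocation: each node is visited
 * iw_table[node] times per cycle of sum(weights) allocations.
 */
static int flat_weighted_node(unsigned long n)
{
	unsigned int total = 0, node;

	for (node = 0; node < MAX_NODES; node++)
		total += iw_table[node];

	n %= total;			/* position within one cycle */
	for (node = 0; node < MAX_NODES; node++) {
		if (n < iw_table[node])
			return node;
		n -= iw_table[node];
	}
	return 0;			/* unreachable with valid weights */
}
```

With weights 3/3/1/1, every cycle of 8 allocations lands 3 pages on
node0, 3 on node1, 1 on node2 and 1 on node3, identically from every CPU.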

Consider a dual-socket system:

          node0             node1
        +-------+         +-------+
        | CPU 0 |---------| CPU 1 |
        +-------+         +-------+
        | DRAM0 |         | DRAM1 |
        +---+---+         +---+---+
            |                 |
        +---+---+         +---+---+
        | CXL 0 |         | CXL 1 |
        +-------+         +-------+
          node2             node3

Assuming local DRAM provides 300 GB/s and local CXL provides 100 GB/s,
the effective bandwidth varies significantly from the perspective of
each CPU due to inter-socket interconnect penalties.

Local device capabilities (GB/s) vs. cross-socket effective bandwidth:

         0     1     2     3
CPU 0  300   150   100    50
CPU 1  150   300    50   100

A reasonable global weight vector reflecting the base capabilities is:

     node0=3 node1=3 node2=1 node3=1

However, because these configured node weights do not account for
interconnect degradation between sockets, applying them flatly to all
sources yields the following effective map from each CPU's perspective:

         0     1     2     3
CPU 0    3     3     1     1
CPU 1    3     3     1     1

This does not account for the interconnect penalty (e.g., node0->node1
drops 300->150, node0->node3 drops 100->50) and thus forces allocations
that cause a mismatch with actual performance.

This patch makes weighted interleave socket-aware. Before weighting is
applied, the candidate nodes are restricted to the current socket; only
if no eligible local nodes remain does the policy fall back to the
wider set.

Even if the configured global weights remain identically set:

     node0=3 node1=3 node2=1 node3=1

The resulting effective map from the perspective of each CPU becomes:

         0     1     2     3
CPU 0    3     0     1     0
CPU 1    0     3     0     1

Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
effective bandwidth, preserves NUMA locality, and reduces cross-socket
traffic.
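The restriction step can be sketched as a mask intersection with a
fallback (illustrative userspace code; the helper name and bitmask
representation are assumptions, not the series' actual interface):

```c
/*
 * Sketch of the socket-restriction step: candidate nodes are first
 * masked to the allocating CPU's socket, and the global weights are
 * then applied only to the survivors.  When the intersection is
 * empty, the policy falls back to the full mask.
 */
#include <assert.h>

#define MAX_NODES 4

/* socket_of_node[i]: package that node i belongs to (example topology:
 * node0/node2 on socket 0, node1/node3 on socket 1) */
static const int socket_of_node[MAX_NODES] = { 0, 1, 0, 1 };

static unsigned int local_candidates(int cpu_socket, unsigned int allowed)
{
	unsigned int mask = 0;
	int node;

	for (node = 0; node < MAX_NODES; node++)
		if ((allowed & (1u << node)) &&
		    socket_of_node[node] == cpu_socket)
			mask |= 1u << node;

	/* fall back to the wider set if nothing local is eligible */
	return mask ? mask : allowed;
}
```

From socket 0 this yields nodes {0, 2}; from socket 1, nodes {1, 3};
and a task whose allowed mask contains no local node keeps its full mask.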

To make this possible, the system requires a mechanism to understand
the physical topology. The existing NUMA distance model provides only
relative latency values between nodes and lacks any notion of
structural grouping such as socket boundaries. This is especially
problematic for CXL memory nodes, which appear without an explicit
socket association.

This patch series introduces a socket-aware topology management layer
that groups NUMA nodes according to their physical package. It
explicitly links CPU and memory-only nodes (such as CXL) under the
same socket using an initiator CPU node. This captures the true
hardware hierarchy rather than relying solely on flat distance values.
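The grouping such a layer maintains might look roughly like this (field
and function names are illustrative only; the RFC's real structures live
in include/linux/memory-tiers.h and mm/memory-tiers.c):

```c
/*
 * Rough shape of a "memory package": CPU nodes and CPU-less memory
 * nodes (e.g. CXL) are grouped under the same physical socket via
 * an initiator CPU node.  Illustrative sketch only.
 */
#include <assert.h>

#define MAX_PKGS 2

struct memory_package {
	int           initiator_node;	/* CPU node anchoring this socket */
	unsigned long member_nodes;	/* bitmask of grouped NUMA nodes */
};

/* Example: two sockets, each initially holding only its CPU node. */
static struct memory_package pkgs[MAX_PKGS] = {
	{ .initiator_node = 0, .member_nodes = 1ul << 0 },
	{ .initiator_node = 1, .member_nodes = 1ul << 1 },
};

/*
 * Bind a CPU-less memory node to the package of its initiator CPU
 * node; returns 0 on success, -1 if no such package exists.
 */
static int register_mem_node(int node, int initiator)
{
	int i;

	for (i = 0; i < MAX_PKGS; i++) {
		if (pkgs[i].initiator_node == initiator) {
			pkgs[i].member_nodes |= 1ul << node;
			return 0;
		}
	}
	return -1;
}
```

Registering CXL node2 with initiator 0 and node3 with initiator 1 then
reproduces the {0,2}/{1,3} grouping from the dual-socket example above.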


[Experimental Results]

System Configuration:
- Processor: Dual-Socket Intel Xeon 6980P (Granite Rapids)

               node0                       node1
             +-------+                   +-------+
             | CPU 0 |-------------------| CPU 1 |
             +-------+                   +-------+
12 Channels  | DRAM0 |                   | DRAM1 |  12 Channels
DDR5-6400    +---+---+                   +---+---+  DDR5-6400
                 |                           |
             +---+---+                   +---+---+
8 Channels   | CXL 0 |                   | CXL 1 |  8 Channels
DDR5-6400    +-------+                   +-------+  DDR5-6400
               node2                       node3

1) Throughput (System Bandwidth)
   - DRAM Only: 966 GB/s
   - Weighted Interleave: 903 GB/s (7% decrease compared to DRAM Only)
   - Socket-Aware Weighted Interleave: 1329 GB/s (1.33TB/s)
     (38% increase compared to DRAM Only,
      47% increase compared to Weighted Interleave)

2) Loaded Latency (Under High Bandwidth)
   - DRAM Only: 544 ns
   - Weighted Interleave: 545 ns
   - Socket-Aware Weighted Interleave: 436 ns
     (20% reduction compared to both)


[Additional Considerations]

Please note that this series includes modifications to the CXL driver
to register these nodes. However, the necessity and the approach of
these driver-side changes require further discussion and consideration.
Additionally, this topology layer was originally designed to support
both memory tiering and weighted interleave. Currently, it is only
utilized by the weighted interleave policy. As a result, several
functions exposed by this layer are not actively used in this RFC.
Unused portions will be cleaned up and removed in the final patch
submission.

Summary of patches:

  [PATCH 1/4] mm/numa: introduce nearest_nodes_nodemask()
  This patch adds a new NUMA helper function to find all nodes in a
  given nodemask that share the minimum distance from a specified
  source node.

  [PATCH 2/4] mm/memory-tiers: introduce socket-aware topology mgmt
  This patch introduces a management layer that groups NUMA nodes by
  their physical package (socket). It forms a "memory package" to
  abstract real hardware locality for predictable NUMA memory
  management.

  [PATCH 3/4] mm/memory-tiers: register CXL nodes to socket packages
  This patch implements a registration path to bind CXL memory nodes
  to a socket-aware memory package using an initiator CPU node. This
  ensures CXL nodes are deterministically grouped with the CPUs they
  service.

  [PATCH 4/4] mm/mempolicy: enhance weighted interleave with locality
  This patch modifies the weighted interleave policy to restrict
  candidate nodes to the current socket before applying weights. It
  reduces cross-socket traffic and aligns memory allocation with
  actual bandwidth.
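As a rough illustration of the patch-1 helper's semantics, here is a
userspace sketch over a toy distance matrix (names, matrix values, and
the bitmask representation are illustrative; this is not the kernel's
nearest_nodes_nodemask() signature):

```c
/*
 * Find all nodes in @mask that share the minimum distance from @from
 * (userspace sketch; bitmask instead of nodemask_t).
 */
#include <assert.h>

#define MAX_NODES 4

/* Toy symmetric distance matrix (LOCAL_DISTANCE = 10). */
static const int node_distance[MAX_NODES][MAX_NODES] = {
	{ 10, 14, 14, 24 },
	{ 14, 10, 24, 14 },
	{ 14, 24, 10, 28 },
	{ 24, 14, 28, 10 },
};

static unsigned int nearest_nodes(int from, unsigned int mask)
{
	int best = -1, node;
	unsigned int out = 0;

	for (node = 0; node < MAX_NODES; node++) {
		if (!(mask & (1u << node)))
			continue;
		if (best < 0 || node_distance[from][node] < best) {
			best = node_distance[from][node];
			out = 1u << node;	/* new minimum: restart set */
		} else if (node_distance[from][node] == best) {
			out |= 1u << node;	/* tie: extend the set */
		}
	}
	return out;
}
```

Note that the result can contain several nodes when distances tie, which
is exactly why the helper returns a mask rather than a single node.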

Any feedback and discussions are highly appreciated.

Thanks

Rakie Kim (4):
  mm/numa: introduce nearest_nodes_nodemask()
  mm/memory-tiers: introduce socket-aware topology management for NUMA
    nodes
  mm/memory-tiers: register CXL nodes to socket-aware packages via
    initiator
  mm/mempolicy: enhance weighted interleave with socket-aware locality

 drivers/cxl/core/region.c    |  46 +++
 drivers/cxl/cxl.h            |   1 +
 drivers/dax/kmem.c           |   2 +
 include/linux/memory-tiers.h |  93 +++++
 include/linux/numa.h         |   8 +
 mm/memory-tiers.c            | 766 +++++++++++++++++++++++++++++++++++
 mm/mempolicy.c               | 135 +++++-
 7 files changed, 1047 insertions(+), 4 deletions(-)


base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
-- 
2.34.1
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Jonathan Cameron 2 weeks, 5 days ago
On Mon, 16 Mar 2026 14:12:48 +0900
Rakie Kim <rakie.kim@sk.com> wrote:

> This patch series is an RFC to propose and discuss the overall design
> and concept of a socket-aware weighted interleave mechanism. As there
> are areas requiring further refinement, the primary goal at this stage
> is to gather feedback on the architectural approach rather than focusing
> on fine-grained implementation details.
> 
> Weighted interleave distributes page allocations across multiple nodes
> based on configured weights. However, the current implementation applies
> a single global weight vector. In multi-socket systems, this creates a
> mismatch between configured weights and actual hardware performance, as
> it cannot account for inter-socket interconnect costs. To address this,
> we propose a socket-aware approach that restricts candidate nodes to
> the local socket before applying weights.
> 
> Flat weighted interleave applies one global weight vector regardless of
> where a task runs. On multi-socket systems, this ignores inter-socket
> interconnect costs, meaning the configured weights do not accurately
> reflect the actual hardware performance.
> 
> Consider a dual-socket system:
> 
>           node0             node1
>         +-------+         +-------+
>         | CPU 0 |---------| CPU 1 |
>         +-------+         +-------+
>         | DRAM0 |         | DRAM1 |
>         +---+---+         +---+---+
>             |                 |
>         +---+---+         +---+---+
>         | CXL 0 |         | CXL 1 |
>         +-------+         +-------+
>           node2             node3
> 
> Assuming local DRAM provides 300 GB/s and local CXL provides 100 GB/s,
> the effective bandwidth varies significantly from the perspective of
> each CPU due to inter-socket interconnect penalties.

I'm fully on board with this problem and very pleased to see someone
working on it!

I have some questions about the example.
The condition definitely applies when the local node to
CXL bandwidth > interconnect bandwidth, but that's not true here so this is
a more complex and I'm curious about the example

> 
> Local device capabilities (GB/s) vs. cross-socket effective bandwidth:
> 
>          0     1     2     3
> CPU 0  300   150   100    50
> CPU 1  150   300    50   100

These numbers don't seem consistent with the 100 / 300 numbers above.
These aren't low load bandwidths because if they were you'd not see any
drop on the CXL numbers as the bottleneck is still the CXL bus.  Given the
game here is bandwidth interleaving - fair enough that these should be
loaded bandwidths.

If these are fully loaded bandwidths then the headline DRAM / CXL numbers need
to be the sum of all access paths, so DRAM must be 450GiB/s and CXL 150GiB/s.
The cross CPU interconnect is 200GiB/s in each direction I think.
This is ignoring caching etc which can make judging interconnect effects tricky
at best!
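Written out for the example numbers (all figures GiB/s; an editorial
sketch, with illustrative helper names), that consistency check is:

```c
/*
 * If the table rows are fully loaded bandwidths, each device's
 * headline number must be the sum over all access paths to it, and
 * the cross-socket traffic must fit within the interconnect.
 */
#include <assert.h>

/* rows: CPU0, CPU1; cols: node0 (DRAM0), node1 (DRAM1),
 * node2 (CXL0), node3 (CXL1) */
static const int bw[2][4] = {
	{ 300, 150, 100,  50 },
	{ 150, 300,  50, 100 },
};

/* Sum of both CPUs' paths to @node: the device's total loaded BW. */
static int device_total(int node)
{
	return bw[0][node] + bw[1][node];
}

/* Remote DRAM + remote CXL traffic generated by @cpu: what the
 * cross-socket interconnect must carry in one direction. */
static int interconnect_load(int cpu)
{
	int remote_dram = (cpu == 0) ? 1 : 0;
	int remote_cxl  = (cpu == 0) ? 3 : 2;

	return bw[cpu][remote_dram] + bw[cpu][remote_cxl];
}
```

The sums come out to 450 for each DRAM node, 150 for each CXL node, and
200 of interconnect traffic in each direction, matching the figures above.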

Years ago there were some attempts to standardize the information available
on topology under load. To put it lightly it got tricky fast and no one
could agree on how to measure it for an empirical solution.

> 
> A reasonable global weight vector reflecting the base capabilities is:
> 
>      node0=3 node1=3 node2=1 node3=1
> 
> However, because these configured node weights do not account for
> interconnect degradation between sockets, applying them flatly to all
> sources yields the following effective map from each CPU's perspective:
> 
>          0     1     2     3
> CPU 0    3     3     1     1
> CPU 1    3     3     1     1
> 
> This does not account for the interconnect penalty (e.g., node0->node1
> drops 300->150, node0->node3 drops 100->50) and thus forces allocations
> that cause a mismatch with actual performance.
> 
> This patch makes weighted interleave socket-aware. Before weighting is
> applied, the candidate nodes are restricted to the current socket; only
> if no eligible local nodes remain does the policy fall back to the
> wider set.
> 
> Even if the configured global weights remain identically set:
> 
>      node0=3 node1=3 node2=1 node3=1
> 
> The resulting effective map from the perspective of each CPU becomes:
> 
>          0     1     2     3
> CPU 0    3     0     1     0
> CPU 1    0     3     0     1
> 
> Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
> node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
> effective bandwidth, preserves NUMA locality, and reduces cross-socket
> traffic.

Workload-wise, this kind of assumes each NUMA node is doing something
similar and keeping to itself. For a nicely balanced setup that is
fine. However, with certain CPU topologies you are likely to see
slightly messier things.

> 
> To make this possible, the system requires a mechanism to understand
> the physical topology. The existing NUMA distance model provides only
> relative latency values between nodes and lacks any notion of
> structural grouping such as socket boundaries. This is especially
> problematic for CXL memory nodes, which appear without an explicit
> socket association.

So in a general sense, the missing info here is effectively the same
stuff we are missing from the HMAT presentation (it's there in the
table and it's there to compute in CXL cases) just because we decided
not to surface anything other than distances to memory from nearest
initiator.  I chatted to Joshua and Keith about filling in that stuff
at last LSFMM. To me that's just a bit of engineering work that needs
doing now we have proven use cases for the data. Mostly it's figuring out
the presentation to userspace and kernel data structures as it's a
lot of data in a big system (typically at least 32 NUMA nodes).

> 
> This patch series introduces a socket-aware topology management layer
> that groups NUMA nodes according to their physical package. It
> explicitly links CPU and memory-only nodes (such as CXL) under the
> same socket using an initiator CPU node. This captures the true
> hardware hierarchy rather than relying solely on flat distance values.
> 
> 
> [Experimental Results]
> 
> System Configuration:
> - Processor: Dual-Socket Intel Xeon 6980P (Granite Rapids)
> 
>                node0                       node1
>              +-------+                   +-------+
>              | CPU 0 |-------------------| CPU 1 |
>              +-------+                   +-------+
> 12 Channels  | DRAM0 |                   | DRAM1 |  12 Channels
> DDR5-6400    +---+---+                   +---+---+  DDR5-6400
>                  |                           |
>              +---+---+                   +---+---+
> 8 Channels   | CXL 0 |                   | CXL 1 |  8 Channels
> DDR5-6400    +-------+                   +-------+  DDR5-6400
>                node2                       node3
> 
> 1) Throughput (System Bandwidth)
>    - DRAM Only: 966 GB/s
>    - Weighted Interleave: 903 GB/s (7% decrease compared to DRAM Only)
>    - Socket-Aware Weighted Interleave: 1329 GB/s (1.33TB/s)
>      (38% increase compared to DRAM Only,
>       47% increase compared to Weighted Interleave)
> 
> 2) Loaded Latency (Under High Bandwidth)
>    - DRAM Only: 544 ns
>    - Weighted Interleave: 545 ns
>    - Socket-Aware Weighted Interleave: 436 ns
>      (20% reduction compared to both)
> 

This may prove too simplistic, so we need to be a little careful.
It may be enough for now though, so I'm not saying we necessarily
need to change things (yet)! Just highlighting things I've seen
turn up before in such discussions.

Simplest one is that we have more CXL memory on some nodes than
others.  Only so many lanes and we probably want some of them for
other purposes!

More fun: multi-NUMA-node-per-socket systems.

A typical CPU die with memory controllers (e.g. taking one of our
old parts, the Kunpeng 920, for which there are die shots online,
to avoid any chance of leaking anything...).

                  Socket 0             Socket 1
 |    node0      |   node 1|       | node2 | |    node 3     |
 +-----+ +-------+ +-------+       +-------+ +-------+ +-----+
 | IO  | | CPU 0 | | CPU 1 |-------| CPU 2 | | CPU 3 | | IO  |
 | DIE | +-------+ +-------+       +-------+ +-------+ | DIE |
 +--+--+ | DRAM0 | | DRAM1 |       | DRAM2 | | DRAM3 | +--+--+
    |    +-------+ +-------+       +-------+ +-------+    |
    |                                                     |
+---+---+                                             +---+---+ 
| CXL 0 |                                             | CXL 1 |
+-------+                                             +-------+

So there is only a single CXL device per socket, and the socket comprises
multiple NUMA nodes, as the DRAM interfaces are on the CPU dies (unlike
some other parts where they are on the IO die alongside the CXL interfaces).

CXL topology cases:

A simple dual socket setup with a CXL switch and MLD below it
makes for a shared link to the CXL memory (and hence a bandwidth
restriction) that this can't model.

                node0                       node1
              +-------+                   +-------+
              | CPU 0 |-------------------| CPU 1 |
              +-------+                   +-------+
 12 Channels  | DRAM0 |                   | DRAM1 |  12 Channels
 DDR5-6400    +---+---+                   +---+---+  DDR5-6400
                  |                           |
                  |___________________________| 
                                |
                                |
                            +---+---+       
            Many Channels   | CXL 0 |    
               DDR5-6400    +-------+   
                node2/3     
 
Note it's still two nodes for the CXL, as we aren't accessing the same DPA
ranges from each host node, but their actual memory is interleaved across
the same devices to give peak BW.

The reason you might do this is load balancing across lots of CXL devices
downstream of the switch.

Note this also effectively happens with MHDs; it's just that the load
balancing is across backend memory provided via multiple heads.  Whether
people wire MHDs that way or tend to have multiple top-of-rack devices
with each CPU socket connecting to a different one is an open question
to me.

I have no idea yet how you'd present the resulting bandwidth
interference effects of such a setup.

IO Expanders on the CPU interconnect:

Just for fun, on similar interconnects we've previously also seen
the following and I'd be surprised if those going for max bandwidth
don't do this for CXL at some point soon.


                node0                       node1
              +-------+                   +-------+
              | CPU 0 |-------------------| CPU 1 |
              +-------+                   +-------+
 12 Channels  | DRAM0 |                   | DRAM1 |  12 Channels
 DDR5-6400    +---+---+                   +---+---+  DDR5-6400
                  |                           |
                  |___________________________|
                      |  IO Expander      |
                      |  CPU interconnect |
                      |___________________|
                                |
                            +---+---+       
            Many Channels   | CXL 0 |    
               DDR5-6400    +-------+   
                node2

That is the CXL memory is effectively the same distance from
CPU0 and CPU1 - they probably have their own local CXL as well
as this approach is done to scale up interconnect lanes in a system
when bandwidth is way more important than compute. Similar to the
MHD case but in this case we are accessing the same DPAs via
both paths.

Anyhow, the exact details of those don't matter beyond the general
point that even in 'balanced' high performance configurations there
may not be a clean 1:1 relationship between NUMA nodes and CXL memory
devices.  Maybe some maths that aggregates some groups of nodes
together would be enough. I've not really thought it through yet.

Fun and useful topic.  Whilst I won't be at LSFMM it is definitely
something I'd like to see move forward in general.

Thanks,

Jonathan

> 
> [Additional Considerations]
> 
> Please note that this series includes modifications to the CXL driver
> to register these nodes. However, the necessity and the approach of
> these driver-side changes require further discussion and consideration.
> Additionally, this topology layer was originally designed to support
> both memory tiering and weighted interleave. Currently, it is only
> utilized by the weighted interleave policy. As a result, several
> functions exposed by this layer are not actively used in this RFC.
> Unused portions will be cleaned up and removed in the final patch
> submission.
> 
> Summary of patches:
> 
>   [PATCH 1/4] mm/numa: introduce nearest_nodes_nodemask()
>   This patch adds a new NUMA helper function to find all nodes in a
>   given nodemask that share the minimum distance from a specified
>   source node.
> 
>   [PATCH 2/4] mm/memory-tiers: introduce socket-aware topology mgmt
>   This patch introduces a management layer that groups NUMA nodes by
>   their physical package (socket). It forms a "memory package" to
>   abstract real hardware locality for predictable NUMA memory
>   management.
> 
>   [PATCH 3/4] mm/memory-tiers: register CXL nodes to socket packages
>   This patch implements a registration path to bind CXL memory nodes
>   to a socket-aware memory package using an initiator CPU node. This
>   ensures CXL nodes are deterministically grouped with the CPUs they
>   service.
> 
>   [PATCH 4/4] mm/mempolicy: enhance weighted interleave with locality
>   This patch modifies the weighted interleave policy to restrict
>   candidate nodes to the current socket before applying weights. It
>   reduces cross-socket traffic and aligns memory allocation with
>   actual bandwidth.
> 
> Any feedback and discussions are highly appreciated.
> 
> Thanks
> 
> Rakie Kim (4):
>   mm/numa: introduce nearest_nodes_nodemask()
>   mm/memory-tiers: introduce socket-aware topology management for NUMA
>     nodes
>   mm/memory-tiers: register CXL nodes to socket-aware packages via
>     initiator
>   mm/mempolicy: enhance weighted interleave with socket-aware locality
> 
>  drivers/cxl/core/region.c    |  46 +++
>  drivers/cxl/cxl.h            |   1 +
>  drivers/dax/kmem.c           |   2 +
>  include/linux/memory-tiers.h |  93 +++++
>  include/linux/numa.h         |   8 +
>  mm/memory-tiers.c            | 766 +++++++++++++++++++++++++++++++++++
>  mm/mempolicy.c               | 135 +++++-
>  7 files changed, 1047 insertions(+), 4 deletions(-)
> 
> 
> base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Rakie Kim 2 weeks, 4 days ago
On Wed, 18 Mar 2026 12:02:45 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> On Mon, 16 Mar 2026 14:12:48 +0900
> Rakie Kim <rakie.kim@sk.com> wrote:
> 

Hello Jonathan,

Thanks for your detailed review and the insights on various topology cases.

> > This patch series is an RFC to propose and discuss the overall design
> > and concept of a socket-aware weighted interleave mechanism. As there
> > are areas requiring further refinement, the primary goal at this stage
> > is to gather feedback on the architectural approach rather than focusing
> > on fine-grained implementation details.
> > 
> > Weighted interleave distributes page allocations across multiple nodes
> > based on configured weights. However, the current implementation applies
> > a single global weight vector. In multi-socket systems, this creates a
> > mismatch between configured weights and actual hardware performance, as
> > it cannot account for inter-socket interconnect costs. To address this,
> > we propose a socket-aware approach that restricts candidate nodes to
> > the local socket before applying weights.
> > 
> > Flat weighted interleave applies one global weight vector regardless of
> > where a task runs. On multi-socket systems, this ignores inter-socket
> > interconnect costs, meaning the configured weights do not accurately
> > reflect the actual hardware performance.
> > 
> > Consider a dual-socket system:
> > 
> >           node0             node1
> >         +-------+         +-------+
> >         | CPU 0 |---------| CPU 1 |
> >         +-------+         +-------+
> >         | DRAM0 |         | DRAM1 |
> >         +---+---+         +---+---+
> >             |                 |
> >         +---+---+         +---+---+
> >         | CXL 0 |         | CXL 1 |
> >         +-------+         +-------+
> >           node2             node3
> > 
> > Assuming local DRAM provides 300 GB/s and local CXL provides 100 GB/s,
> > the effective bandwidth varies significantly from the perspective of
> > each CPU due to inter-socket interconnect penalties.
> 
> I'm fully on board with this problem and very pleased to see someone
> working on it!
> 
> I have some questions about the example.
> The condition definitely applies when the local-node-to-CXL bandwidth >
> interconnect bandwidth, but that's not true here, so this is a more
> complex case and I'm curious about the example.
> 
> > 
> > Local device capabilities (GB/s) vs. cross-socket effective bandwidth:
> > 
> >          0     1     2     3
> > CPU 0  300   150   100    50
> > CPU 1  150   300    50   100
> 
> These numbers don't seem consistent with the 100 / 300 numbers above.
> These aren't low load bandwidths because if they were you'd not see any
> drop on the CXL numbers as the bottleneck is still the CXL bus.  Given the
> game here is bandwidth interleaving - fair enough that these should be
> loaded bandwidths.
> 
> If these are fully loaded bandwidths then the headline DRAM / CXL numbers need
> to be the sum of all access paths, so DRAM must be 450GiB/s and CXL 150GiB/s.
> The cross CPU interconnect is 200GiB/s in each direction I think.
> This is ignoring caching etc which can make judging interconnect effects tricky
> at best!
> 
> Years ago there were some attempts to standardize the information available
> on topology under load. To put it lightly it got tricky fast and no one
> could agree on how to measure it for an empirical solution.
> 

You are exactly right about the numbers. The values used in the example
were overly simplified just to briefly illustrate the concept of the
interconnect penalty. I realize that this oversimplification caused
confusion regarding the actual bottleneck and fully loaded bandwidth.
In the next update, I will revise the example to use more accurate
numbers based on the actual system I am currently using.

> > 
> > A reasonable global weight vector reflecting the base capabilities is:
> > 
> >      node0=3 node1=3 node2=1 node3=1
> > 
> > However, because these configured node weights do not account for
> > interconnect degradation between sockets, applying them flatly to all
> > sources yields the following effective map from each CPU's perspective:
> > 
> >          0     1     2     3
> > CPU 0    3     3     1     1
> > CPU 1    3     3     1     1
> > 
> > This does not account for the interconnect penalty (e.g., node0->node1
> > drops 300->150, node0->node3 drops 100->50) and thus forces allocations
> > that cause a mismatch with actual performance.
> > 
> > This patch makes weighted interleave socket-aware. Before weighting is
> > applied, the candidate nodes are restricted to the current socket; only
> > if no eligible local nodes remain does the policy fall back to the
> > wider set.
> > 
> > Even if the configured global weights remain identically set:
> > 
> >      node0=3 node1=3 node2=1 node3=1
> > 
> > The resulting effective map from the perspective of each CPU becomes:
> > 
> >          0     1     2     3
> > CPU 0    3     0     1     0
> > CPU 1    0     3     0     1
> > 
> > Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
> > node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
> > effective bandwidth, preserves NUMA locality, and reduces cross-socket
> > traffic.
> 
> Workload-wise, this kind of assumes each NUMA node is doing something
> similar and keeping to itself. For a nicely balanced setup that is
> fine. However, with certain CPU topologies you are likely to see
> slightly messier things.
> 

I agree with your point. Since the current design is still an early draft,
I understand that this assumption may not hold true for all workloads.
This is an area that requires further consideration.

> > 
> > To make this possible, the system requires a mechanism to understand
> > the physical topology. The existing NUMA distance model provides only
> > relative latency values between nodes and lacks any notion of
> > structural grouping such as socket boundaries. This is especially
> > problematic for CXL memory nodes, which appear without an explicit
> > socket association.
> 
> So in a general sense, the missing info here is effectively the same
> stuff we are missing from the HMAT presentation (it's there in the
> table and it's there to compute in CXL cases) just because we decided
> not to surface anything other than distances to memory from nearest
> initiator.  I chatted to Joshua and Keith about filling in that stuff
> at last LSFMM. To me that's just a bit of engineering work that needs
> doing now we have proven use cases for the data. Mostly it's figuring out
> the presentation to userspace and kernel data structures as it's a
> lot of data in a big system (typically at least 32 NUMA nodes).
> 

Hearing about the discussion on exposing HMAT data is very welcome news.
Because this detailed topology information is not yet fully exposed to
the kernel and userspace, I used a temporary package-based restriction.
Figuring out how to expose and integrate this data into the kernel data
structures is indeed a crucial engineering task we need to solve.

Actually, when I first started this work, I considered fetching the
topology information from HMAT before adopting the current approach.
However, I encountered a firmware issue on my test systems
(Granite Rapids and Sierra Forest).

Although each socket has its own locally attached CXL device, the HMAT
only registers node1 (Socket 1) as the initiator for both CXL memory
nodes (node2 and node3). As a result, the sysfs HMAT initiators for
both node2 and node3 only expose node1.

Even though the distance map shows node2 is physically closer to
Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
routing path strictly through Socket 1. Because the HMAT alone made it
difficult to determine the exact physical socket connections on these
systems, I ended up using the current CXL driver-based approach.

I wonder if others have experienced similar broken HMAT cases with CXL.
If HMAT information becomes more reliable in the future, we could
build a much more efficient structure.

> > 
> > This patch series introduces a socket-aware topology management layer
> > that groups NUMA nodes according to their physical package. It
> > explicitly links CPU and memory-only nodes (such as CXL) under the
> > same socket using an initiator CPU node. This captures the true
> > hardware hierarchy rather than relying solely on flat distance values.
> > 
> > 
> > [Experimental Results]
> > 
> > System Configuration:
> > - Processor: Dual-Socket Intel Xeon 6980P (Granite Rapids)
> > 
> >                node0                       node1
> >              +-------+                   +-------+
> >              | CPU 0 |-------------------| CPU 1 |
> >              +-------+                   +-------+
> > 12 Channels  | DRAM0 |                   | DRAM1 |  12 Channels
> > DDR5-6400    +---+---+                   +---+---+  DDR5-6400
> >                  |                           |
> >              +---+---+                   +---+---+
> > 8 Channels   | CXL 0 |                   | CXL 1 |  8 Channels
> > DDR5-6400    +-------+                   +-------+  DDR5-6400
> >                node2                       node3
> > 
> > 1) Throughput (System Bandwidth)
> >    - DRAM Only: 966 GB/s
> >    - Weighted Interleave: 903 GB/s (7% decrease compared to DRAM Only)
> >    - Socket-Aware Weighted Interleave: 1329 GB/s (1.33TB/s)
> >      (38% increase compared to DRAM Only,
> >       47% increase compared to Weighted Interleave)
> > 
> > 2) Loaded Latency (Under High Bandwidth)
> >    - DRAM Only: 544 ns
> >    - Weighted Interleave: 545 ns
> >    - Socket-Aware Weighted Interleave: 436 ns
> >      (20% reduction compared to both)
> > 
> 
> This may prove too simplistic, so we need to be a little careful.
> It may be enough for now though, so I'm not saying we necessarily
> need to change things (yet)! Just highlighting things I've seen
> turn up before in such discussions.
> 
> Simplest one is that we have more CXL memory on some nodes than
> others.  Only so many lanes and we probably want some of them for
> other purposes!
> 
> More fun, multi NUMA node per sockets systems.
> 
> A typical CPU Die with memory controllers (e.g. taking one of
> our old parts where there are dieshots online kunpeng 920 to
> avoid any chance of leaking anything...).
> 
>                   Socket 0             Socket 1
>  |    node0      |   node 1|       | node2 | |    node 3     |
>  +-----+ +-------+ +-------+       +-------+ +-------+ +-----+
>  | IO  | | CPU 0 | | CPU 1 |-------| CPU 2 | | CPU 3 | | IO  |
>  | DIE | +-------+ +-------+       +-------+ +-------+ | DIE |
>  +--+--+ | DRAM0 | | DRAM1 |       | DRAM2 | | DRAM3 | +--+--+
>     |    +-------+ +-------+       +-------+ +-------+    |
>     |                                                     |
> +---+---+                                             +---+---+ 
> | CXL 0 |                                             | CXL 1 |
> +-------+                                             +-------+
> 
> So only a single CXL device per socket and the socket is multiple
> NUMA nodes as the DRAM interfaces are on the CPU Dies (unlike some
> others where they are on the IO Die alongside the CXL interfaces).
> 
> CXL topology cases:
> 
> A simple dual socket setup with a CXL switch and MLD below it
> makes for a shared link to the CXL memory (and hence a bandwidth
> restriction) that this can't model.
> 
>                 node0                       node1
>               +-------+                   +-------+
>               | CPU 0 |-------------------| CPU 1 |
>               +-------+                   +-------+
>  12 Channels  | DRAM0 |                   | DRAM1 |  12 Channels
>  DDR5-6400    +---+---+                   +---+---+  DDR5-6400
>                   |                           |
>                   |___________________________| 
>                                 |
>                                 |
>                             +---+---+       
>             Many Channels   | CXL 0 |    
>                DDR5-6400    +-------+   
>                 node2/3     
>  
> Note it's still two nodes for the CXL as we aren't accessing the same DPA for
> each host node but their actual memory is interleaved across the same devices
> to give peak BW.
> 
> The reason you might do this is load balancing across lots of CXL devices
> downstream of the switch.
> 
> Note this also effectively happens with MHDs just the load balancing is across
> backend memory being provided via multiple heads.  Whether people wire MHDs
> that way or tend to have multiple top of rack devices with each CPU
> socket connecting to a different one is an open question to me.
> 
> I have no idea yet on how you'd present the resulting bandwidth interference
> effects of such a setup.
> 
> IO Expanders on the CPU interconnect:
> 
> Just for fun, on similar interconnects we've previously also seen
> the following and I'd be surprised if those going for max bandwidth
> don't do this for CXL at some point soon.
> 
> 
>                 node0                       node1
>               +-------+                   +-------+
>               | CPU 0 |-------------------| CPU 1 |
>               +-------+                   +-------+
>  12 Channels  | DRAM0 |                   | DRAM1 |  12 Channels
>  DDR5-6400    +---+---+                   +---+---+  DDR5-6400
>                   |                           |
>                   |___________________________|
>                       |  IO Expander      |
>                       |  CPU interconnect |
>                       |___________________|
>                                 |
>                             +---+---+       
>             Many Channels   | CXL 0 |    
>                DDR5-6400    +-------+   
>                 node2
> 
> That is the CXL memory is effectively the same distance from
> CPU0 and CPU1 - they probably have their own local CXL as well
> as this approach is done to scale up interconnect lanes in a system
> when bandwidth is way more important than compute. Similar to the
> MHD case but in this case we are accessing the same DPAs via
> both paths.
> 
> Anyhow, the exact details of those don't matter beyond the general
> point that even in 'balanced' high performance configurations there
> may not be a clean 1:1 relationship between NUMA nodes and CXL memory
> devices.  Maybe some maths that aggregates some groups of nodes
> together would be enough. I've not really thought it through yet.
> 
> Fun and useful topic.  Whilst I won't be at LSFMM it is definitely
> something I'd like to see move forward in general.
> 
> Thanks,
> 
> Jonathan
> 

The complex topology cases you presented, such as multi-NUMA per socket,
shared CXL switches, and IO expanders, are very important points.
I clearly understand that the simple package-level grouping does not fully
reflect the 1:1 relationship in these future hardware architectures.

I have also thought about the shared CXL switch scenario you mentioned,
and I know the current design falls short in addressing it properly.
While the current implementation starts with a simple socket-local
restriction, I plan to evolve it into a more flexible node aggregation
model to properly reflect all the diverse topologies you suggested.

Thanks again for your time and review.

Rakie Kim

> > 
> > [Additional Considerations]
> > 
> > Please note that this series includes modifications to the CXL driver
> > to register these nodes. However, the necessity and the approach of
> > these driver-side changes require further discussion and consideration.
> > Additionally, this topology layer was originally designed to support
> > both memory tiering and weighted interleave. Currently, it is only
> > utilized by the weighted interleave policy. As a result, several
> > functions exposed by this layer are not actively used in this RFC.
> > Unused portions will be cleaned up and removed in the final patch
> > submission.
> > 
> > Summary of patches:
> > 
> >   [PATCH 1/4] mm/numa: introduce nearest_nodes_nodemask()
> >   This patch adds a new NUMA helper function to find all nodes in a
> >   given nodemask that share the minimum distance from a specified
> >   source node.
> > 
> >   [PATCH 2/4] mm/memory-tiers: introduce socket-aware topology mgmt
> >   This patch introduces a management layer that groups NUMA nodes by
> >   their physical package (socket). It forms a "memory package" to
> >   abstract real hardware locality for predictable NUMA memory
> >   management.
> > 
> >   [PATCH 3/4] mm/memory-tiers: register CXL nodes to socket packages
> >   This patch implements a registration path to bind CXL memory nodes
> >   to a socket-aware memory package using an initiator CPU node. This
> >   ensures CXL nodes are deterministically grouped with the CPUs they
> >   service.
> > 
> >   [PATCH 4/4] mm/mempolicy: enhance weighted interleave with locality
> >   This patch modifies the weighted interleave policy to restrict
> >   candidate nodes to the current socket before applying weights. It
> >   reduces cross-socket traffic and aligns memory allocation with
> >   actual bandwidth.
> > 
> > Any feedback and discussions are highly appreciated.
> > 
> > Thanks
> > 
> > Rakie Kim (4):
> >   mm/numa: introduce nearest_nodes_nodemask()
> >   mm/memory-tiers: introduce socket-aware topology management for NUMA
> >     nodes
> >   mm/memory-tiers: register CXL nodes to socket-aware packages via
> >     initiator
> >   mm/mempolicy: enhance weighted interleave with socket-aware locality
> > 
> >  drivers/cxl/core/region.c    |  46 +++
> >  drivers/cxl/cxl.h            |   1 +
> >  drivers/dax/kmem.c           |   2 +
> >  include/linux/memory-tiers.h |  93 +++++
> >  include/linux/numa.h         |   8 +
> >  mm/memory-tiers.c            | 766 +++++++++++++++++++++++++++++++++++
> >  mm/mempolicy.c               | 135 +++++-
> >  7 files changed, 1047 insertions(+), 4 deletions(-)
> > 
> > 
> > base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
>
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Jonathan Cameron 2 weeks, 3 days ago
> > > 
> > > To make this possible, the system requires a mechanism to understand
> > > the physical topology. The existing NUMA distance model provides only
> > > relative latency values between nodes and lacks any notion of
> > > structural grouping such as socket boundaries. This is especially
> > > problematic for CXL memory nodes, which appear without an explicit
> > > socket association.  
> > 
> > So in a general sense, the missing info here is effectively the same
> > stuff we are missing from the HMAT presentation (it's there in the
> > table and it's there to compute in CXL cases) just because we decided
> > not to surface anything other than distances to memory from nearest
> > initiator.  I chatted to Joshua and Keith about filling in that stuff
> > at last LSFMM. To me that's just a bit of engineering work that needs
> > doing now we have proven use cases for the data. Mostly it's figuring out
> > the presentation to userspace and kernel data structures as it's a
> > lot of data in a big system (typically at least 32 NUMA nodes).
> >   
> 
> Hearing about the discussion on exposing HMAT data is very welcome news.
> Because this detailed topology information is not yet fully exposed to
> the kernel and userspace, I used a temporary package-based restriction.
> Figuring out how to expose and integrate this data into the kernel data
> structures is indeed a crucial engineering task we need to solve.
> 
> Actually, when I first started this work, I considered fetching the
> topology information from HMAT before adopting the current approach.
> However, I encountered a firmware issue on my test systems
> (Granite Rapids and Sierra Forest).
> 
> Although each socket has its own locally attached CXL device, the HMAT
> only registers node1 (Socket 1) as the initiator for both CXL memory
> nodes (node2 and node3). As a result, the sysfs HMAT initiators for
> both node2 and node3 only expose node1.

Do you mean the Memory Proximity Domain Attributes Structure has
the "Proximity Domain for the Attached Initiator" set wrong?
Was this for its presentation of the full path to CXL mem nodes, or
to a PXM with a generic port?  Sounds like you have SRAT covering
the CXL mem so ideal would be to have the HMAT data to GP and to
the CXL PXMs that BIOS has set up.

Either way having that set at all for CXL memory is fishy as it's about
where the 'memory controller' is and on CXL mem that should be at the
device end of the link.  My understanding is that it was only meant
to be set when you have separate memory-only nodes where the physical
controller is in a particular other node (e.g. what you do
if you have a CPU with DRAM and HBM).  Maybe we need to make the
kernel warn + ignore that if it is set to something odd like yours.

> 
> Even though the distance map shows node2 is physically closer to
> Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> routing path strictly through Socket 1. Because the HMAT alone made it
> difficult to determine the exact physical socket connections on these
> systems, I ended up using the current CXL driver-based approach.

Are the HMAT latencies and bandwidths all there?  Or are some missing
and you have to use SLIT (which generally is garbage for historical
reasons of tuning SLIT to particular OS behaviour).

> 
> I wonder if others have experienced similar broken HMAT cases with CXL.
> If HMAT information becomes more reliable in the future, we could
> build a much more efficient structure.

Given it's being lightly used I suspect there will be many bugs :(
I hope we can assume they will get fixed however!

...

> 
> The complex topology cases you presented, such as multi-NUMA per socket,
> shared CXL switches, and IO expanders, are very important points.
> I clearly understand that the simple package-level grouping does not fully
> reflect the 1:1 relationship in these future hardware architectures.
> 
> I have also thought about the shared CXL switch scenario you mentioned,
> and I know the current design falls short in addressing it properly.
> While the current implementation starts with a simple socket-local
> restriction, I plan to evolve it into a more flexible node aggregation
> model to properly reflect all the diverse topologies you suggested.

If we can ensure it fails cleanly when it finds a topology that it can't
cope with (and I guess falls back to current) then I'm fine with a partial
solution that evolves.


> 
> Thanks again for your time and review.

You are welcome.

Thanks

Jonathan

> 
> Rakie Kim
>
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Rakie Kim 1 week, 6 days ago
On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> 
> > > > 
> > > > To make this possible, the system requires a mechanism to understand
> > > > the physical topology. The existing NUMA distance model provides only
> > > > relative latency values between nodes and lacks any notion of
> > > > structural grouping such as socket boundaries. This is especially
> > > > problematic for CXL memory nodes, which appear without an explicit
> > > > socket association.  
> > > 
> > > So in a general sense, the missing info here is effectively the same
> > > stuff we are missing from the HMAT presentation (it's there in the
> > > table and it's there to compute in CXL cases) just because we decided
> > > not to surface anything other than distances to memory from nearest
> > > initiator.  I chatted to Joshua and Keith about filling in that stuff
> > > at last LSFMM. To me that's just a bit of engineering work that needs
> > > doing now we have proven use cases for the data. Mostly it's figuring out
> > > the presentation to userspace and kernel data structures as it's a
> > > lot of data in a big system (typically at least 32 NUMA nodes).
> > >   
> > 
> > Hearing about the discussion on exposing HMAT data is very welcome news.
> > Because this detailed topology information is not yet fully exposed to
> > the kernel and userspace, I used a temporary package-based restriction.
> > Figuring out how to expose and integrate this data into the kernel data
> > structures is indeed a crucial engineering task we need to solve.
> > 
> > Actually, when I first started this work, I considered fetching the
> > topology information from HMAT before adopting the current approach.
> > However, I encountered a firmware issue on my test systems
> > (Granite Rapids and Sierra Forest).
> > 
> > Although each socket has its own locally attached CXL device, the HMAT
> > only registers node1 (Socket 1) as the initiator for both CXL memory
> > nodes (node2 and node3). As a result, the sysfs HMAT initiators for
> > both node2 and node3 only expose node1.
> 
> Do you mean the Memory Proximity Domain Attributes Structure has
> the "Proximity Domain for the Attached Initiator" set wrong?
> Was this for its presentation of the full path to CXL mem nodes, or
> to a PXM with a generic port?  Sounds like you have SRAT covering
> the CXL mem so ideal would be to have the HMAT data to GP and to
> the CXL PXMs that BIOS has set up.
> 
> Either way having that set at all for CXL memory is fishy as it's about
> where the 'memory controller' is and on CXL mem that should be at the
> device end of the link.  My understanding is that it was only meant
> to be set when you have separate memory-only nodes where the physical
> controller is in a particular other node (e.g. what you do
> if you have a CPU with DRAM and HBM).  Maybe we need to make the
> kernel warn + ignore that if it is set to something odd like yours.
> 

Hello Jonathan,

Your insight is incredibly accurate. To clarify the situation, here is
the actual configuration of my system:

NODE   Type          PXM
node0  local memory  0x00
node1  local memory  0x01
node2  cxl memory    0x0A
node3  cxl memory    0x0B

Physically, the node2 CXL is attached to node0 (Socket 0), and the
node3 CXL is attached to node1 (Socket 1). However, extracting the
HMAT.dsl reveals the following:

- local memory
  [028h] Flags: 0001 (Processor Proximity Domain Valid = 1)
         Attached Initiator Proximity Domain: 0x00
         Memory Proximity Domain: 0x00
  [050h] Flags: 0001 (Processor Proximity Domain Valid = 1)
         Attached Initiator Proximity Domain: 0x01
         Memory Proximity Domain: 0x01

- cxl memory
  [078h] Flags: 0000 (Processor Proximity Domain Valid = 0)
         Attached Initiator Proximity Domain: 0x00
         Memory Proximity Domain: 0x0A
  [0A0h] Flags: 0000 (Processor Proximity Domain Valid = 0)
         Attached Initiator Proximity Domain: 0x00
         Memory Proximity Domain: 0x0B

As you correctly suspected, the flags for the CXL memory are 0000,
meaning the Processor Proximity Domain is marked as invalid. But when
checking the sysfs initiator configurations, it shows a different story:

Node   access0 Initiator  access1 Initiator
node0  node0              node0
node1  node1              node1
node2  node1              node1
node3  node1              node1

Although the Attached Initiator is set to 0 in HMAT with an invalid
flag, sysfs strangely registers node1 as the initiator for both CXL
nodes. Because both HMAT and sysfs are exposing abnormal values, it was
impossible for me to determine the true socket connections for CXL
using this data.
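
For reference, the invalid-flag case can be demonstrated directly from the raw table bytes. The sketch below decodes a single Memory Proximity Domain Attributes Structure; the field layout is my reading of the ACPI spec (type, reserved, length, flags, reserved, initiator PXM, memory PXM, then 20 reserved bytes), so double-check it against your own acpidump. It ignores the initiator field when the valid bit is clear, which is arguably what consumers of this data should do too:

```python
import struct

# ACPI HMAT Memory Proximity Domain Attributes Structure (type 0):
# type(u16), reserved(u16), length(u32), flags(u16), reserved(u16),
# attached initiator PXM(u32), memory PXM(u32), 20 reserved bytes.
HMAT_MPDA_FMT = "<HHIHHII"

def decode_mpda(raw):
    stype, _, _, flags, _, init_pxm, mem_pxm = \
        struct.unpack_from(HMAT_MPDA_FMT, raw)
    if stype != 0:
        raise ValueError(f"not a type-0 HMAT structure: {stype}")
    # Bit 0 = "Processor Proximity Domain Valid"; when clear, the
    # initiator field carries no information and must be ignored.
    valid = bool(flags & 1)
    return (init_pxm if valid else None), mem_pxm

# One of the CXL entries from the dump above: flags 0000, initiator 0x00.
raw = struct.pack(HMAT_MPDA_FMT, 0, 0, 40, 0x0000, 0, 0x00, 0x0A) + b"\x00" * 20
print(decode_mpda(raw))  # (None, 10): no trustworthy initiator
```

Under this reading, sysfs reporting node1 as initiator for both CXL nodes cannot have come from the attached-initiator field at all, since the valid bit is clear.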

> > 
> > Even though the distance map shows node2 is physically closer to
> > Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> > routing path strictly through Socket 1. Because the HMAT alone made it
> > difficult to determine the exact physical socket connections on these
> > systems, I ended up using the current CXL driver-based approach.
> 
> Are the HMAT latencies and bandwidths all there?  Or are some missing
> and you have to use SLIT (which generally is garbage for historical
> reasons of tuning SLIT to particular OS behaviour).
> 

The HMAT latencies and bandwidths are present, but the values seem
broken. Here is the latency table:

Init->Target | node0 | node1 | node2 | node3
node0        | 0x38B | 0x89F | 0x9C4 | 0x3AFC
node1        | 0x89F | 0x38B | 0x3AFC| 0x4268

I used the identical type of DRAM and CXL memory for both sockets.
However, looking at the table, the local CXL access latency from
node0->node2 (0x9C4) and node1->node3 (0x4268) show a massive,
unjustified difference. This asymmetry proves that the table is
currently unreliable.
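
As a quick sanity check, the numbers above can be compared mechanically (values copied verbatim from the table; units are whatever the firmware reported):

```python
# HMAT latency entries from the table above, (initiator, target) -> value.
lat = {
    (0, 0): 0x38B, (0, 1): 0x89F, (0, 2): 0x9C4,  (0, 3): 0x3AFC,
    (1, 0): 0x89F, (1, 1): 0x38B, (1, 2): 0x3AFC, (1, 3): 0x4268,
}

# Cross-socket DRAM latency is symmetric, as expected.
assert lat[(0, 1)] == lat[(1, 0)]

# With identical CXL parts on both sockets, the two local CXL entries
# should be near-equal; instead node1->node3 is 6.8x node0->node2.
ratio = lat[(1, 3)] / lat[(0, 2)]
print(f"local CXL: {lat[(0, 2)]} vs {lat[(1, 3)]} (ratio {ratio:.1f}x)")
```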

> > 
> > I wonder if others have experienced similar broken HMAT cases with CXL.
> > If HMAT information becomes more reliable in the future, we could
> > build a much more efficient structure.
> 
> Given it's being lightly used I suspect there will be many bugs :(
> I hope we can assume they will get fixed however!
> 
> ...
> 

The most critical issue caused by this broken initiator setting is that
topology analysis tools like `hwloc` are completely misled. Currently,
`hwloc` displays both CXL nodes as being attached to Socket 1.

I observed this exact same issue on both Sierra Forest and Granite
Rapids systems. I believe this broken topology exposure is a severe
problem that must be addressed, though I am not entirely sure what the
best fix would be yet. I would love to hear your thoughts on this.

> > 
> > The complex topology cases you presented, such as multi-NUMA per socket,
> > shared CXL switches, and IO expanders, are very important points.
> > I clearly understand that the simple package-level grouping does not fully
> > reflect the 1:1 relationship in these future hardware architectures.
> > 
> > I have also thought about the shared CXL switch scenario you mentioned,
> > and I know the current design falls short in addressing it properly.
> > While the current implementation starts with a simple socket-local
> > restriction, I plan to evolve it into a more flexible node aggregation
> > model to properly reflect all the diverse topologies you suggested.
> 
> If we can ensure it fails cleanly when it finds a topology that it can't
> cope with (and I guess falls back to current) then I'm fine with a partial
> solution that evolves.
> 

I completely agree with ensuring a clean failure. To stabilize this
partial solution, I am currently considering a few options for the
next version:

1. Enable this feature only when a strict 1:1 topology is detected.
2. Provide a sysfs knob allowing users to enable/disable it.
3. Allow users to manually override/configure the topology via sysfs.
4. Implement dynamic fallback behaviors depending on the detected
   topology shape (needs further thought).

By the way, when I first posted this RFC, I accidentally missed adding
lsf-pc@lists.linux-foundation.org to the CC list. I am considering
re-posting it to ensure it reaches the lsf-pc.

Thanks again for your profound insights and time. It is tremendously
helpful.

Rakie Kim

> 
> > 
> > Thanks again for your time and review.
> 
> You are welcome.
> 
> Thanks
> 
> Jonathan
> 
> > 
> > Rakie Kim
> >
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Gregory Price 1 week, 4 days ago
On Tue, Mar 24, 2026 at 02:35:45PM +0900, Rakie Kim wrote:
> On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> 
> Init->Target | node0 | node1 | node2 | node3
> node0        | 0x38B | 0x89F | 0x9C4 | 0x3AFC
> node1        | 0x89F | 0x38B | 0x3AFC| 0x4268
> 
> I used the identical type of DRAM and CXL memory for both sockets.
> However, looking at the table, the local CXL access latency from
> node0->node2 (0x9C4) and node1->node3 (0x4268) show a massive,
> unjustified difference. This asymmetry proves that the table is
> currently unreliable.
>

Can you dump your CDAT for each device so you can at least check whether
the device reports the same latency?

Would at least tell the interested parties whether this is firmware or
BIOS issue.

sudo cat /sys/bus/cxl/devices/endpointN/CDAT | python3 cdat_dump.py

~Gregory

---


#!/usr/bin/env python3
# SPDX-License-Identifier: GPL-2.0-only
# Copyright(c) 2026 Meta Platforms, Inc. and affiliates.
#
# cdat_dump.py - Dump and decode CDAT (Coherent Device Attribute Table)
#                from CXL devices via sysfs
#
# Usage:
#   cdat_dump.py                          # dump all CXL devices with CDAT
#   cdat_dump.py /sys/bus/cxl/devices/endpoint0/CDAT
#   cdat_dump.py --raw cdat_binary.bin    # decode from raw file
#   cdat_dump.py --hex                    # include hex dump of each entry

import argparse
import glob
import os
import struct
import sys

# CDAT Header: u32 length, u8 revision, u8 checksum, u8 reserved[6], u32 sequence
CDAT_HDR_FMT = "<IBBBBBBBBI"  # u32 + 8 x u8 + u32 = 16 bytes
CDAT_HDR_SIZE = 16

# Common subtable header: u8 type, u8 reserved, u16 length
CDAT_SUBTBL_HDR_FMT = "<BBH"
CDAT_SUBTBL_HDR_SIZE = 4

# DSMAS (type 0): handle, flags, reserved(u16), dpa_base(u64), dpa_length(u64)
DSMAS_FMT = "<BBHQQ"
DSMAS_SIZE = 20

# DSLBIS (type 1): handle, flags, data_type, reserved, entry_base_unit(u64),
#                  entry[3](u16 x3), reserved2(u16)
DSLBIS_FMT = "<BBBBQHHHH"
DSLBIS_SIZE = 20

# DSMSCIS (type 2): dsmas_handle, reserved[3], side_cache_size(u64),
#                   cache_attributes(u32)
DSMSCIS_FMT = "<BBBBQI"
DSMSCIS_SIZE = 16

# DSIS (type 3): flags, handle, reserved(u16)
DSIS_FMT = "<BBH"
DSIS_SIZE = 4

# DSEMTS (type 4): dsmas_handle, memory_type, reserved(u16),
#                  dpa_offset(u64), range_length(u64)
DSEMTS_FMT = "<BBHQQ"
DSEMTS_SIZE = 20

# SSLBIS (type 5) fixed part: data_type, reserved[3], entry_base_unit(u64)
SSLBIS_FMT = "<BBBBQ"
SSLBIS_SIZE = 12

# SSLBE entry: portx_id(u16), porty_id(u16), latency_or_bandwidth(u16),
#              reserved(u16)
SSLBE_FMT = "<HHHH"
SSLBE_SIZE = 8

CDAT_TYPE_NAMES = {
    0: "DSMAS  (Device Scoped Memory Affinity Structure)",
    1: "DSLBIS (Device Scoped Latency and Bandwidth Information Structure)",
    2: "DSMSCIS (Device Scoped Memory Side Cache Information Structure)",
    3: "DSIS   (Device Scoped Initiator Structure)",
    4: "DSEMTS (Device Scoped EFI Memory Type Structure)",
    5: "SSLBIS (Switch Scoped Latency and Bandwidth Information Structure)",
}

HMAT_DATA_TYPES = {
    0: "Access Latency",
    1: "Read Latency",
    2: "Write Latency",
    3: "Access Bandwidth",
    4: "Read Bandwidth",
    5: "Write Bandwidth",
}

EFI_MEM_TYPES = {
    0: "EfiConventionalMemory",
    1: "EfiConventionalMemory (EFI_MEMORY_SP)",
    2: "EfiReservedMemoryType",
}

CACHE_ASSOCIATIVITY = {
    0: "None",
    1: "Direct Mapped",
    2: "Complex Cache Indexing",
}

CACHE_WRITE_POLICY = {
    0: "None",
    1: "Write Back",
    2: "Write Through",
}


def hexdump(data, indent="  "):
    lines = []
    for i in range(0, len(data), 16):
        chunk = data[i:i+16]
        hexstr = " ".join(f"{b:02x}" for b in chunk)
        ascstr = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{indent}{i:04x}: {hexstr:<48s}  {ascstr}")
    return "\n".join(lines)


def fmt_size(size):
    if size >= (1 << 40):
        return f"{size / (1 << 40):.2f} TiB"
    if size >= (1 << 30):
        return f"{size / (1 << 30):.2f} GiB"
    if size >= (1 << 20):
        return f"{size / (1 << 20):.2f} MiB"
    if size >= (1 << 10):
        return f"{size / (1 << 10):.2f} KiB"
    return f"{size} B"


def fmt_port(port_id):
    if port_id == 0xFFFF:
        return "ANY"
    if port_id == 0x0100:
        return "USP (upstream)"
    return f"DSP {port_id}"


def decode_latency_bandwidth(entry_val, base_unit, data_type):
    """Decode a DSLBIS/SSLBIS entry value into human-readable form."""
    if entry_val == 0xFFFF or entry_val == 0:
        return "N/A"

    raw = entry_val * base_unit
    if data_type <= 2:  # latency types (picoseconds -> nanoseconds)
        ns = raw / 1000.0
        if ns >= 1000:
            return f"{ns/1000:.2f} us ({raw} ps)"
        return f"{ns:.2f} ns ({raw} ps)"
    else:  # bandwidth types (MB/s)
        if raw >= 1024:
            return f"{raw/1024:.2f} GB/s ({raw} MB/s)"
        return f"{raw} MB/s"


def decode_dsmas(data, show_hex):
    handle, flags, _, dpa_base, dpa_length = struct.unpack_from(DSMAS_FMT, data)

    flag_strs = []
    if flags & (1 << 2):
        flag_strs.append("NonVolatile")
    if flags & (1 << 3):
        flag_strs.append("Shareable")
    if flags & (1 << 6):
        flag_strs.append("ReadOnly")
    flag_desc = ", ".join(flag_strs) if flag_strs else "None"

    print(f"    DSMAD Handle:  {handle}")
    print(f"    Flags:         0x{flags:02x} ({flag_desc})")
    print(f"    DPA Base:      0x{dpa_base:016x}")
    print(f"    DPA Length:    0x{dpa_length:016x} ({fmt_size(dpa_length)})")


def decode_dslbis(data, show_hex):
    handle, flags, data_type, _, base_unit, e0, e1, e2, _ = \
        struct.unpack_from(DSLBIS_FMT, data)

    dt_name = HMAT_DATA_TYPES.get(data_type, f"Unknown ({data_type})")

    print(f"    Handle:        {handle}")
    print(f"    Flags:         0x{flags:02x}")
    print(f"    Data Type:     {data_type} ({dt_name})")
    print(f"    Base Unit:     {base_unit}")
    print(f"    Entry[0]:      {e0} -> {decode_latency_bandwidth(e0, base_unit, data_type)}")
    if e1:
        print(f"    Entry[1]:      {e1} -> {decode_latency_bandwidth(e1, base_unit, data_type)}")
    if e2:
        print(f"    Entry[2]:      {e2} -> {decode_latency_bandwidth(e2, base_unit, data_type)}")


def decode_dsmscis(data, show_hex):
    dsmas_handle, _, _, _, cache_size, cache_attr = \
        struct.unpack_from(DSMSCIS_FMT, data)

    total_levels = cache_attr & 0xF
    cache_level = (cache_attr >> 4) & 0xF
    assoc = (cache_attr >> 8) & 0xF
    write_pol = (cache_attr >> 12) & 0xF
    line_size = (cache_attr >> 16) & 0xFFFF

    assoc_str = CACHE_ASSOCIATIVITY.get(assoc, f"Unknown ({assoc})")
    wp_str = CACHE_WRITE_POLICY.get(write_pol, f"Unknown ({write_pol})")

    print(f"    DSMAS Handle:  {dsmas_handle}")
    print(f"    Cache Size:    0x{cache_size:016x} ({fmt_size(cache_size)})")
    print(f"    Cache Attrs:   0x{cache_attr:08x}")
    print(f"      Total Levels:    {total_levels}")
    print(f"      Cache Level:     {cache_level}")
    print(f"      Associativity:   {assoc_str}")
    print(f"      Write Policy:    {wp_str}")
    print(f"      Line Size:       {line_size} bytes")


def decode_dsis(data, show_hex):
    flags, handle, _ = struct.unpack_from(DSIS_FMT, data)

    mem_attached = bool(flags & 1)
    if mem_attached:
        handle_desc = f"DSMAS handle {handle}"
    else:
        handle_desc = f"Initiator handle {handle} (no memory)"

    print(f"    Flags:         0x{flags:02x} (Memory Attached: {mem_attached})")
    print(f"    Handle:        {handle} ({handle_desc})")


def decode_dsemts(data, show_hex):
    dsmas_handle, mem_type, _, dpa_offset, range_length = \
        struct.unpack_from(DSEMTS_FMT, data)

    mt_str = EFI_MEM_TYPES.get(mem_type, f"Reserved ({mem_type})")

    print(f"    DSMAS Handle:  {dsmas_handle}")
    print(f"    Memory Type:   {mem_type} ({mt_str})")
    print(f"    DPA Offset:    0x{dpa_offset:016x}")
    print(f"    Range Length:  0x{range_length:016x} ({fmt_size(range_length)})")


def decode_sslbis(data, total_len, show_hex):
    dt, _, _, _, base_unit = struct.unpack_from(SSLBIS_FMT, data)
    dt_name = HMAT_DATA_TYPES.get(dt, f"Unknown ({dt})")

    print(f"    Data Type:     {dt} ({dt_name})")
    print(f"    Base Unit:     {base_unit}")

    # Variable number of SSLBE entries after the fixed header
    entries_data = data[SSLBIS_SIZE:]
    n_entries = len(entries_data) // SSLBE_SIZE

    for i in range(n_entries):
        off = i * SSLBE_SIZE
        px, py, val, _ = struct.unpack_from(SSLBE_FMT, entries_data, off)
        decoded = decode_latency_bandwidth(val, base_unit, dt)
        print(f"    Entry[{i}]:  {fmt_port(px)} <-> {fmt_port(py)}: "
              f"{val} -> {decoded}")


DECODERS = {
    0: decode_dsmas,
    1: decode_dslbis,
    2: decode_dsmscis,
    3: decode_dsis,
    4: decode_dsemts,
}


def decode_cdat(data, source="", show_hex=False):
    if len(data) < CDAT_HDR_SIZE:
        print(f"Error: data too short for CDAT header ({len(data)} < {CDAT_HDR_SIZE})")
        return False

    # Parse header
    vals = struct.unpack_from(CDAT_HDR_FMT, data)
    length = vals[0]
    revision = vals[1]
    checksum = vals[2]
    # vals[3:9] are the 6 reserved bytes
    sequence = vals[9]

    # Verify checksum
    cksum = sum(data[:length]) & 0xFF
    cksum_ok = "OK" if cksum == 0 else f"FAIL (sum=0x{cksum:02x})"

    if source:
        print(f"=== CDAT from {source} ===")
    print(f"CDAT Header:")
    print(f"  Length:     {length} bytes")
    print(f"  Revision:   {revision}")
    print(f"  Checksum:   0x{checksum:02x} ({cksum_ok})")
    print(f"  Sequence:   {sequence}")

    if show_hex:
        print("  Raw header:")
        print(hexdump(data[:CDAT_HDR_SIZE], "    "))

    if length > len(data):
        print(f"Warning: CDAT length ({length}) > available data ({len(data)})")
        length = len(data)

    # Parse subtables
    offset = CDAT_HDR_SIZE
    entry_num = 0
    counts = {}

    while offset + CDAT_SUBTBL_HDR_SIZE <= length:
        stype, _, slen = struct.unpack_from(CDAT_SUBTBL_HDR_FMT, data, offset)

        if slen < CDAT_SUBTBL_HDR_SIZE:
            print(f"\nError: subtable at offset {offset} has invalid length {slen}")
            break
        if offset + slen > length:
            print(f"\nError: subtable at offset {offset} extends past end "
                  f"(offset+len={offset+slen} > {length})")
            break

        counts[stype] = counts.get(stype, 0) + 1
        type_name = CDAT_TYPE_NAMES.get(stype, f"Unknown (type={stype})")

        print(f"\n  [{entry_num}] {type_name}")
        print(f"    Offset: {offset}, Length: {slen}")

        if show_hex:
            print(hexdump(data[offset:offset+slen], "    "))

        # Decode the subtable body (skip the 4-byte common header)
        body = data[offset + CDAT_SUBTBL_HDR_SIZE:offset + slen]

        if stype == 5:
            # SSLBIS has variable length, pass total subtable body length
            decode_sslbis(body, slen - CDAT_SUBTBL_HDR_SIZE, show_hex)
        elif stype in DECODERS:
            DECODERS[stype](body, show_hex)
        else:
            print("    (unknown type, raw data follows)")
            print(hexdump(body, "    "))

        offset += slen
        entry_num += 1

    # Summary
    print(f"\nSummary: {entry_num} entries")
    for t in sorted(counts):
        name = CDAT_TYPE_NAMES.get(t, f"Unknown ({t})")
        print(f"  {name}: {counts[t]}")

    if offset < length:
        trailing = length - offset
        print(f"\nWarning: {trailing} trailing bytes after last subtable")

    print()
    return True


def find_cdat_sysfs():
    """Find all CXL devices with CDAT attributes in sysfs."""
    paths = []
    for dev_path in sorted(glob.glob("/sys/bus/cxl/devices/*")):
        cdat_path = os.path.join(dev_path, "CDAT")
        if os.path.exists(cdat_path):
            paths.append(cdat_path)
    return paths


def read_cdat(path):
    """Read binary CDAT data from a sysfs attribute or file."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except PermissionError:
        print(f"Error: permission denied reading {path} (need root?)")
        return None
    except OSError as e:
        print(f"Error reading {path}: {e}")
        return None


def main():
    parser = argparse.ArgumentParser(
        description="Dump and decode CXL CDAT (Coherent Device Attribute Table)",
        epilog="Without arguments, discovers and dumps CDAT from all CXL devices.\n"
               "Requires root access to read sysfs CDAT attributes.",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "path", nargs="*",
        help="Path to sysfs CDAT attribute or raw CDAT binary file",
    )
    parser.add_argument(
        "--raw", action="store_true",
        help="Treat input as raw CDAT binary file (not sysfs)",
    )
    parser.add_argument(
        "--hex", action="store_true",
        help="Include hex dump of each entry",
    )
    args = parser.parse_args()

    paths = args.path

    # Read from stdin when piped with no paths given, or when "-" is given
    if (not sys.stdin.isatty() and not paths) or paths == ["-"]:
        data = sys.stdin.buffer.read()
        if not data:
            print("Error: no data on stdin")
            return 1
        return 0 if decode_cdat(data, "stdin", show_hex=args.hex) else 1

    if not paths:
        paths = find_cdat_sysfs()
        if not paths:
            print("No CXL devices with CDAT found in sysfs.")
            print("Check that CXL devices are present and the cxl_port driver is loaded.")
            return 1

    ok = True
    for path in paths:
        data = read_cdat(path)
        if data is None:
            ok = False
            continue

        if not data:
            dev = os.path.basename(os.path.dirname(path)) if not args.raw else path
            print(f"{dev}: CDAT is empty (read from device failed at probe time)")
            ok = False
            continue

        source = path
        if not args.raw and "/sys/" in path:
            source = os.path.basename(os.path.dirname(path))

        if not decode_cdat(data, source, show_hex=args.hex):
            ok = False

    return 0 if ok else 1


if __name__ == "__main__":
    sys.exit(main())
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Rakie Kim 1 week ago
On Thu, 26 Mar 2026 21:54:30 -0400 Gregory Price <gourry@gourry.net> wrote:
> On Tue, Mar 24, 2026 at 02:35:45PM +0900, Rakie Kim wrote:
> > On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> > 
> > Init->Target | node0 | node1 | node2 | node3
> > node0        | 0x38B | 0x89F | 0x9C4 | 0x3AFC
> > node1        | 0x89F | 0x38B | 0x3AFC | 0x4268
> > 
> > I used the identical type of DRAM and CXL memory for both sockets.
> > However, looking at the table, the local CXL access latency from
> > node0->node2 (0x9C4) and node1->node3 (0x4268) shows a massive,
> > unjustified difference. This asymmetry proves that the table is
> > currently unreliable.
> >
> 
> Can you dump your CDAT for each device so you can at least check whether
> the device reports the same latency?
> 
> Would at least tell the interested parties whether this is firmware or
> BIOS issue.
> 
> sudo cat /sys/bus/cxl/devices/endpointN/CDAT | python3 cdat_dump.py
> 
> ~Gregory
> 
< --snip-- >

Hello Gregory,

Thank you for providing the Python script; it was incredibly helpful.
I ran it across all the CDAT files found in my sysfs.

For context, my system currently has 16 CXL endpoints attached
(from `endpoint9` to `endpoint24`):
/sys/devices/platform/ACPI0017:00/root0/port.../endpoint.../CDAT

As Dave Jiang recently pointed out in this thread, the Intel BIOS
team confirmed that the HMAT values actually represent "end-to-end"
latency. By comparing these CDAT dump results (which show the
device-level latency) with the HMAT numbers, we should have a much
clearer picture of whether the massive asymmetry originates from the
device firmware itself or from the BIOS calculations.
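
For anyone following along, the split between the two contributions can
be sketched with some back-of-the-envelope arithmetic. This is only a
rough check: it assumes the HMAT read-latency values quoted earlier in
this thread are already in nanoseconds, and it uses the 110 ns Access
Latency that every endpoint below reports via CDAT.

```python
# Rough split of the HMAT end-to-end latency into the device-reported
# (CDAT) part and the remaining fabric/host part.
# Assumption: the quoted HMAT values are in nanoseconds.

CDAT_DEVICE_NS = 110  # Access Latency reported by every endpoint below

# HMAT end-to-end read latencies quoted earlier in the thread
hmat_ns = {
    ("node0", "node2"): 0x9C4,   # local-socket CXL
    ("node1", "node3"): 0x4268,  # same topology on the other socket
}

for (init, tgt), total in hmat_ns.items():
    non_device = total - CDAT_DEVICE_NS
    print(f"{init}->{tgt}: end-to-end {total} ns, "
          f"non-device portion {non_device} ns")
```

Since both endpoints report an identical 110 ns, essentially all of the
node1->node3 excess (16890 ns vs 2390 ns of non-device latency) would
have to come from outside the devices, which points away from the
device firmware.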

I have attached the extracted CDAT dump results for the devices below.

Thanks again for your help in isolating this issue!

Rakie Kim


=== CDAT from endpoint16 ===
CDAT Header:
  Length:     208 bytes
  Revision:   1
  Checksum:   0xc3 (OK)
  Sequence:   0

  [0] DSMAS  (Device Scoped Memory Affinity Structure)
    Offset: 16, Length: 24
    DSMAD Handle:  0
    Flags:         0x00 (None)
    DPA Base:      0x0000000000000000
    DPA Length:    0x0000002000000000 (128.00 GiB)

  [1] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 40, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     0 (Access Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [2] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 64, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     1 (Read Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [3] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 88, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     2 (Write Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [4] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 112, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     3 (Access Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [5] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 136, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     4 (Read Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [6] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 160, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     5 (Write Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [7] DSEMTS (Device Scoped EFI Memory Type Structure)
    Offset: 184, Length: 24
    DSMAS Handle:  0
    Memory Type:   0 (EfiConventionalMemory)
    DPA Offset:    0x0000000000000000
    Range Length:  0x0000002000000000 (128.00 GiB)

Summary: 8 entries
  DSMAS  (Device Scoped Memory Affinity Structure): 1
  DSLBIS (Device Scoped Latency and Bandwidth Information Structure): 6
  DSEMTS (Device Scoped EFI Memory Type Structure): 1

=== CDAT from endpoint20 ===
CDAT Header:
  Length:     208 bytes
  Revision:   1
  Checksum:   0xc3 (OK)
  Sequence:   0

  [0] DSMAS  (Device Scoped Memory Affinity Structure)
    Offset: 16, Length: 24
    DSMAD Handle:  0
    Flags:         0x00 (None)
    DPA Base:      0x0000000000000000
    DPA Length:    0x0000002000000000 (128.00 GiB)

  [1] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 40, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     0 (Access Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [2] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 64, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     1 (Read Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [3] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 88, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     2 (Write Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [4] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 112, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     3 (Access Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [5] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 136, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     4 (Read Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [6] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 160, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     5 (Write Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [7] DSEMTS (Device Scoped EFI Memory Type Structure)
    Offset: 184, Length: 24
    DSMAS Handle:  0
    Memory Type:   0 (EfiConventionalMemory)
    DPA Offset:    0x0000000000000000
    Range Length:  0x0000002000000000 (128.00 GiB)

Summary: 8 entries
  DSMAS  (Device Scoped Memory Affinity Structure): 1
  DSLBIS (Device Scoped Latency and Bandwidth Information Structure): 6
  DSEMTS (Device Scoped EFI Memory Type Structure): 1

=== CDAT from endpoint18 ===
CDAT Header:
  Length:     208 bytes
  Revision:   1
  Checksum:   0xc3 (OK)
  Sequence:   0

  [0] DSMAS  (Device Scoped Memory Affinity Structure)
    Offset: 16, Length: 24
    DSMAD Handle:  0
    Flags:         0x00 (None)
    DPA Base:      0x0000000000000000
    DPA Length:    0x0000002000000000 (128.00 GiB)

  [1] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 40, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     0 (Access Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [2] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 64, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     1 (Read Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [3] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 88, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     2 (Write Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [4] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 112, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     3 (Access Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [5] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 136, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     4 (Read Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [6] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 160, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     5 (Write Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [7] DSEMTS (Device Scoped EFI Memory Type Structure)
    Offset: 184, Length: 24
    DSMAS Handle:  0
    Memory Type:   0 (EfiConventionalMemory)
    DPA Offset:    0x0000000000000000
    Range Length:  0x0000002000000000 (128.00 GiB)

Summary: 8 entries
  DSMAS  (Device Scoped Memory Affinity Structure): 1
  DSLBIS (Device Scoped Latency and Bandwidth Information Structure): 6
  DSEMTS (Device Scoped EFI Memory Type Structure): 1

=== CDAT from endpoint17 ===
CDAT Header:
  Length:     208 bytes
  Revision:   1
  Checksum:   0xc3 (OK)
  Sequence:   0

  [0] DSMAS  (Device Scoped Memory Affinity Structure)
    Offset: 16, Length: 24
    DSMAD Handle:  0
    Flags:         0x00 (None)
    DPA Base:      0x0000000000000000
    DPA Length:    0x0000002000000000 (128.00 GiB)

  [1] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 40, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     0 (Access Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [2] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 64, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     1 (Read Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [3] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 88, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     2 (Write Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [4] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 112, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     3 (Access Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [5] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 136, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     4 (Read Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [6] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 160, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     5 (Write Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [7] DSEMTS (Device Scoped EFI Memory Type Structure)
    Offset: 184, Length: 24
    DSMAS Handle:  0
    Memory Type:   0 (EfiConventionalMemory)
    DPA Offset:    0x0000000000000000
    Range Length:  0x0000002000000000 (128.00 GiB)

Summary: 8 entries
  DSMAS  (Device Scoped Memory Affinity Structure): 1
  DSLBIS (Device Scoped Latency and Bandwidth Information Structure): 6
  DSEMTS (Device Scoped EFI Memory Type Structure): 1

=== CDAT from endpoint14 ===
CDAT Header:
  Length:     208 bytes
  Revision:   1
  Checksum:   0xc3 (OK)
  Sequence:   0

  [0] DSMAS  (Device Scoped Memory Affinity Structure)
    Offset: 16, Length: 24
    DSMAD Handle:  0
    Flags:         0x00 (None)
    DPA Base:      0x0000000000000000
    DPA Length:    0x0000002000000000 (128.00 GiB)

  [1] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 40, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     0 (Access Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [2] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 64, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     1 (Read Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [3] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 88, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     2 (Write Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [4] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 112, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     3 (Access Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [5] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 136, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     4 (Read Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [6] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 160, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     5 (Write Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [7] DSEMTS (Device Scoped EFI Memory Type Structure)
    Offset: 184, Length: 24
    DSMAS Handle:  0
    Memory Type:   0 (EfiConventionalMemory)
    DPA Offset:    0x0000000000000000
    Range Length:  0x0000002000000000 (128.00 GiB)

Summary: 8 entries
  DSMAS  (Device Scoped Memory Affinity Structure): 1
  DSLBIS (Device Scoped Latency and Bandwidth Information Structure): 6
  DSEMTS (Device Scoped EFI Memory Type Structure): 1

=== CDAT from endpoint10 ===
CDAT Header:
  Length:     208 bytes
  Revision:   1
  Checksum:   0xc3 (OK)
  Sequence:   0

  [0] DSMAS  (Device Scoped Memory Affinity Structure)
    Offset: 16, Length: 24
    DSMAD Handle:  0
    Flags:         0x00 (None)
    DPA Base:      0x0000000000000000
    DPA Length:    0x0000002000000000 (128.00 GiB)

  [1] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 40, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     0 (Access Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [2] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 64, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     1 (Read Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [3] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 88, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     2 (Write Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [4] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 112, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     3 (Access Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [5] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 136, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     4 (Read Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [6] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 160, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     5 (Write Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [7] DSEMTS (Device Scoped EFI Memory Type Structure)
    Offset: 184, Length: 24
    DSMAS Handle:  0
    Memory Type:   0 (EfiConventionalMemory)
    DPA Offset:    0x0000000000000000
    Range Length:  0x0000002000000000 (128.00 GiB)

Summary: 8 entries
  DSMAS  (Device Scoped Memory Affinity Structure): 1
  DSLBIS (Device Scoped Latency and Bandwidth Information Structure): 6
  DSEMTS (Device Scoped EFI Memory Type Structure): 1

=== CDAT from endpoint23 ===
CDAT Header:
  Length:     208 bytes
  Revision:   1
  Checksum:   0xc3 (OK)
  Sequence:   0

  [0] DSMAS  (Device Scoped Memory Affinity Structure)
    Offset: 16, Length: 24
    DSMAD Handle:  0
    Flags:         0x00 (None)
    DPA Base:      0x0000000000000000
    DPA Length:    0x0000002000000000 (128.00 GiB)

  [1] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 40, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     0 (Access Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [2] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 64, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     1 (Read Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [3] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 88, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     2 (Write Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [4] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 112, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     3 (Access Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [5] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 136, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     4 (Read Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [6] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 160, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     5 (Write Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [7] DSEMTS (Device Scoped EFI Memory Type Structure)
    Offset: 184, Length: 24
    DSMAS Handle:  0
    Memory Type:   0 (EfiConventionalMemory)
    DPA Offset:    0x0000000000000000
    Range Length:  0x0000002000000000 (128.00 GiB)

Summary: 8 entries
  DSMAS  (Device Scoped Memory Affinity Structure): 1
  DSLBIS (Device Scoped Latency and Bandwidth Information Structure): 6
  DSEMTS (Device Scoped EFI Memory Type Structure): 1

=== CDAT from endpoint22 ===
CDAT Header:
  Length:     208 bytes
  Revision:   1
  Checksum:   0xc3 (OK)
  Sequence:   0

  [0] DSMAS  (Device Scoped Memory Affinity Structure)
    Offset: 16, Length: 24
    DSMAD Handle:  0
    Flags:         0x00 (None)
    DPA Base:      0x0000000000000000
    DPA Length:    0x0000002000000000 (128.00 GiB)

  [1] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 40, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     0 (Access Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [2] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 64, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     1 (Read Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [3] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 88, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     2 (Write Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [4] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 112, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     3 (Access Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [5] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 136, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     4 (Read Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [6] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 160, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     5 (Write Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [7] DSEMTS (Device Scoped EFI Memory Type Structure)
    Offset: 184, Length: 24
    DSMAS Handle:  0
    Memory Type:   0 (EfiConventionalMemory)
    DPA Offset:    0x0000000000000000
    Range Length:  0x0000002000000000 (128.00 GiB)

Summary: 8 entries
  DSMAS  (Device Scoped Memory Affinity Structure): 1
  DSLBIS (Device Scoped Latency and Bandwidth Information Structure): 6
  DSEMTS (Device Scoped EFI Memory Type Structure): 1

=== CDAT from endpoint15 ===
CDAT Header:
  Length:     208 bytes
  Revision:   1
  Checksum:   0xc3 (OK)
  Sequence:   0

  [0] DSMAS  (Device Scoped Memory Affinity Structure)
    Offset: 16, Length: 24
    DSMAD Handle:  0
    Flags:         0x00 (None)
    DPA Base:      0x0000000000000000
    DPA Length:    0x0000002000000000 (128.00 GiB)

  [1] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 40, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     0 (Access Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [2] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 64, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     1 (Read Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [3] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 88, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     2 (Write Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [4] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 112, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     3 (Access Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [5] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 136, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     4 (Read Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [6] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 160, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     5 (Write Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [7] DSEMTS (Device Scoped EFI Memory Type Structure)
    Offset: 184, Length: 24
    DSMAS Handle:  0
    Memory Type:   0 (EfiConventionalMemory)
    DPA Offset:    0x0000000000000000
    Range Length:  0x0000002000000000 (128.00 GiB)

Summary: 8 entries
  DSMAS  (Device Scoped Memory Affinity Structure): 1
  DSLBIS (Device Scoped Latency and Bandwidth Information Structure): 6
  DSEMTS (Device Scoped EFI Memory Type Structure): 1

=== CDAT from endpoint11 ===
CDAT Header:
  Length:     208 bytes
  Revision:   1
  Checksum:   0xc3 (OK)
  Sequence:   0

  [0] DSMAS  (Device Scoped Memory Affinity Structure)
    Offset: 16, Length: 24
    DSMAD Handle:  0
    Flags:         0x00 (None)
    DPA Base:      0x0000000000000000
    DPA Length:    0x0000002000000000 (128.00 GiB)

  [1] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 40, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     0 (Access Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [2] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 64, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     1 (Read Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [3] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 88, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     2 (Write Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [4] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 112, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     3 (Access Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [5] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 136, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     4 (Read Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [6] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 160, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     5 (Write Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [7] DSEMTS (Device Scoped EFI Memory Type Structure)
    Offset: 184, Length: 24
    DSMAS Handle:  0
    Memory Type:   0 (EfiConventionalMemory)
    DPA Offset:    0x0000000000000000
    Range Length:  0x0000002000000000 (128.00 GiB)

Summary: 8 entries
  DSMAS  (Device Scoped Memory Affinity Structure): 1
  DSLBIS (Device Scoped Latency and Bandwidth Information Structure): 6
  DSEMTS (Device Scoped EFI Memory Type Structure): 1

=== CDAT from endpoint12 ===
CDAT Header:
  Length:     208 bytes
  Revision:   1
  Checksum:   0xc3 (OK)
  Sequence:   0

  [0] DSMAS  (Device Scoped Memory Affinity Structure)
    Offset: 16, Length: 24
    DSMAD Handle:  0
    Flags:         0x00 (None)
    DPA Base:      0x0000000000000000
    DPA Length:    0x0000002000000000 (128.00 GiB)

  [1] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 40, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     0 (Access Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [2] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 64, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     1 (Read Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [3] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 88, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     2 (Write Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [4] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 112, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     3 (Access Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [5] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 136, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     4 (Read Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [6] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 160, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     5 (Write Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [7] DSEMTS (Device Scoped EFI Memory Type Structure)
    Offset: 184, Length: 24
    DSMAS Handle:  0
    Memory Type:   0 (EfiConventionalMemory)
    DPA Offset:    0x0000000000000000
    Range Length:  0x0000002000000000 (128.00 GiB)

Summary: 8 entries
  DSMAS  (Device Scoped Memory Affinity Structure): 1
  DSLBIS (Device Scoped Latency and Bandwidth Information Structure): 6
  DSEMTS (Device Scoped EFI Memory Type Structure): 1

=== CDAT from endpoint19 ===
CDAT Header:
  Length:     208 bytes
  Revision:   1
  Checksum:   0xc3 (OK)
  Sequence:   0

  [0] DSMAS  (Device Scoped Memory Affinity Structure)
    Offset: 16, Length: 24
    DSMAD Handle:  0
    Flags:         0x00 (None)
    DPA Base:      0x0000000000000000
    DPA Length:    0x0000002000000000 (128.00 GiB)

  [1] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 40, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     0 (Access Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [2] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 64, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     1 (Read Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [3] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 88, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     2 (Write Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [4] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 112, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     3 (Access Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [5] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 136, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     4 (Read Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [6] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 160, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     5 (Write Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [7] DSEMTS (Device Scoped EFI Memory Type Structure)
    Offset: 184, Length: 24
    DSMAS Handle:  0
    Memory Type:   0 (EfiConventionalMemory)
    DPA Offset:    0x0000000000000000
    Range Length:  0x0000002000000000 (128.00 GiB)

Summary: 8 entries
  DSMAS  (Device Scoped Memory Affinity Structure): 1
  DSLBIS (Device Scoped Latency and Bandwidth Information Structure): 6
  DSEMTS (Device Scoped EFI Memory Type Structure): 1

=== CDAT from endpoint13 ===
CDAT Header:
  Length:     208 bytes
  Revision:   1
  Checksum:   0xc3 (OK)
  Sequence:   0

  [0] DSMAS  (Device Scoped Memory Affinity Structure)
    Offset: 16, Length: 24
    DSMAD Handle:  0
    Flags:         0x00 (None)
    DPA Base:      0x0000000000000000
    DPA Length:    0x0000002000000000 (128.00 GiB)

  [1] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 40, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     0 (Access Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [2] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 64, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     1 (Read Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [3] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 88, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     2 (Write Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [4] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 112, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     3 (Access Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [5] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 136, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     4 (Read Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [6] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 160, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     5 (Write Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [7] DSEMTS (Device Scoped EFI Memory Type Structure)
    Offset: 184, Length: 24
    DSMAS Handle:  0
    Memory Type:   0 (EfiConventionalMemory)
    DPA Offset:    0x0000000000000000
    Range Length:  0x0000002000000000 (128.00 GiB)

Summary: 8 entries
  DSMAS  (Device Scoped Memory Affinity Structure): 1
  DSLBIS (Device Scoped Latency and Bandwidth Information Structure): 6
  DSEMTS (Device Scoped EFI Memory Type Structure): 1

=== CDAT from endpoint9 ===
CDAT Header:
  Length:     208 bytes
  Revision:   1
  Checksum:   0xc3 (OK)
  Sequence:   0

  [0] DSMAS  (Device Scoped Memory Affinity Structure)
    Offset: 16, Length: 24
    DSMAD Handle:  0
    Flags:         0x00 (None)
    DPA Base:      0x0000000000000000
    DPA Length:    0x0000002000000000 (128.00 GiB)

  [1] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 40, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     0 (Access Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [2] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 64, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     1 (Read Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [3] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 88, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     2 (Write Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [4] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 112, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     3 (Access Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [5] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 136, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     4 (Read Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [6] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 160, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     5 (Write Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [7] DSEMTS (Device Scoped EFI Memory Type Structure)
    Offset: 184, Length: 24
    DSMAS Handle:  0
    Memory Type:   0 (EfiConventionalMemory)
    DPA Offset:    0x0000000000000000
    Range Length:  0x0000002000000000 (128.00 GiB)

Summary: 8 entries
  DSMAS  (Device Scoped Memory Affinity Structure): 1
  DSLBIS (Device Scoped Latency and Bandwidth Information Structure): 6
  DSEMTS (Device Scoped EFI Memory Type Structure): 1

=== CDAT from endpoint21 ===
CDAT Header:
  Length:     208 bytes
  Revision:   1
  Checksum:   0xc3 (OK)
  Sequence:   0

  [0] DSMAS  (Device Scoped Memory Affinity Structure)
    Offset: 16, Length: 24
    DSMAD Handle:  0
    Flags:         0x00 (None)
    DPA Base:      0x0000000000000000
    DPA Length:    0x0000002000000000 (128.00 GiB)

  [1] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 40, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     0 (Access Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [2] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 64, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     1 (Read Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [3] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 88, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     2 (Write Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [4] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 112, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     3 (Access Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [5] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 136, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     4 (Read Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [6] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 160, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     5 (Write Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [7] DSEMTS (Device Scoped EFI Memory Type Structure)
    Offset: 184, Length: 24
    DSMAS Handle:  0
    Memory Type:   0 (EfiConventionalMemory)
    DPA Offset:    0x0000000000000000
    Range Length:  0x0000002000000000 (128.00 GiB)

Summary: 8 entries
  DSMAS  (Device Scoped Memory Affinity Structure): 1
  DSLBIS (Device Scoped Latency and Bandwidth Information Structure): 6
  DSEMTS (Device Scoped EFI Memory Type Structure): 1

=== CDAT from endpoint24 ===
CDAT Header:
  Length:     208 bytes
  Revision:   1
  Checksum:   0xc3 (OK)
  Sequence:   0

  [0] DSMAS  (Device Scoped Memory Affinity Structure)
    Offset: 16, Length: 24
    DSMAD Handle:  0
    Flags:         0x00 (None)
    DPA Base:      0x0000000000000000
    DPA Length:    0x0000002000000000 (128.00 GiB)

  [1] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 40, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     0 (Access Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [2] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 64, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     1 (Read Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [3] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 88, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     2 (Write Latency)
    Base Unit:     1000
    Entry[0]:      110 -> 110.00 ns (110000 ps)

  [4] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 112, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     3 (Access Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [5] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 136, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     4 (Read Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [6] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
    Offset: 160, Length: 24
    Handle:        0
    Flags:         0x00
    Data Type:     5 (Write Bandwidth)
    Base Unit:     1000
    Entry[0]:      45 -> 43.95 GB/s (45000 MB/s)

  [7] DSEMTS (Device Scoped EFI Memory Type Structure)
    Offset: 184, Length: 24
    DSMAS Handle:  0
    Memory Type:   0 (EfiConventionalMemory)
    DPA Offset:    0x0000000000000000
    Range Length:  0x0000002000000000 (128.00 GiB)

Summary: 8 entries
  DSMAS  (Device Scoped Memory Affinity Structure): 1
  DSLBIS (Device Scoped Latency and Bandwidth Information Structure): 6
  DSEMTS (Device Scoped EFI Memory Type Structure): 1
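
For reference, the human-readable values in these dumps come straight from the `Entry * Base Unit` scaling the CDAT specification defines for DSLBIS: latency entries scale to picoseconds, bandwidth entries to MB/s. A small sketch of that conversion (a hypothetical `decode_dslbis` helper, not part of any existing tool):

```python
# Hypothetical helper reproducing the human-readable DSLBIS values in the
# dumps above. Per the CDAT spec, latency entries are Entry * Base Unit
# picoseconds and bandwidth entries are Entry * Base Unit MB/s.

LATENCY_TYPES = {0: "Access Latency", 1: "Read Latency", 2: "Write Latency"}
BANDWIDTH_TYPES = {3: "Access Bandwidth", 4: "Read Bandwidth", 5: "Write Bandwidth"}

def decode_dslbis(data_type: int, entry: int, base_unit: int) -> str:
    if data_type in LATENCY_TYPES:
        ps = entry * base_unit                  # picoseconds
        return f"{ps / 1000:.2f} ns ({ps} ps)"
    if data_type in BANDWIDTH_TYPES:
        mbps = entry * base_unit                # MB/s
        return f"{mbps / 1024:.2f} GB/s ({mbps} MB/s)"
    raise ValueError(f"unknown DSLBIS data type {data_type}")

print(decode_dslbis(0, 110, 1000))  # 110.00 ns (110000 ps)
print(decode_dslbis(3, 45, 1000))   # 43.95 GB/s (45000 MB/s)
```

The outputs match the `Entry[0]` lines in the endpoint dumps, e.g. `110 -> 110.00 ns (110000 ps)` and `45 -> 43.95 GB/s (45000 MB/s)`.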
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Dave Jiang 1 week, 4 days ago

On 3/23/26 10:35 PM, Rakie Kim wrote:
> On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
>>
< --snip-- >

> 
> The HMAT latencies and bandwidths are present, but the values seem
> broken. Here is the latency table:
> 
> Init->Target | node0 | node1 | node2 | node3
> node0        | 0x38B | 0x89F | 0x9C4 | 0x3AFC
> node1        | 0x89F | 0x38B | 0x3AFC| 0x4268

Do you have the iasl -d outputs of the SRAT and the HMAT somewhere we can look at?

DJ
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Rakie Kim 1 week, 1 day ago
On Thu, 26 Mar 2026 15:24:26 -0700 Dave Jiang <dave.jiang@intel.com> wrote:
> 
> 
> On 3/23/26 10:35 PM, Rakie Kim wrote:
> > On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> >>
> < --snip-- >
> 
> > 
> > The HMAT latencies and bandwidths are present, but the values seem
> > broken. Here is the latency table:
> > 
> > Init->Target | node0 | node1 | node2 | node3
> > node0        | 0x38B | 0x89F | 0x9C4 | 0x3AFC
> > node1        | 0x89F | 0x38B | 0x3AFC| 0x4268
> 
> Do you have the iasl -d outputs of the SRAT and the HMAT somewhere we can look at?
> 
> DJ
> 

Hello Dave,

The full ACPI dumps would be too long to post to the list, so I have
extracted the relevant structures from the `iasl -d` output for the
local memory nodes (PXM 0, 1) and the CXL memory nodes (PXM A, B).

Here is the truncated `HMAT.dsl` showing the core topology mappings
and the latency matrix we are discussing:

----------------------------------------------------------------------
[000h 0000   4]                    Signature : "HMAT"    [Heterogeneous Memory Attributes Table]
[004h 0004   4]                 Table Length : 00000668
[008h 0008   1]                     Revision : 02
[009h 0009   1]                     Checksum : 6A
[00Ah 0010   6]                       Oem ID : "GBT   "
[010h 0016   8]                 Oem Table ID : "GBTUACPI"
[018h 0024   4]                 Oem Revision : 01072009
[01Ch 0028   4]              Asl Compiler ID : "AMI "
[020h 0032   4]        Asl Compiler Revision : 20230628

[024h 0036   4]                     Reserved : 00000000

[028h 0040   2]               Structure Type : 0000 [Memory Proximity Domain Attributes]
[02Ah 0042   2]                     Reserved : 0000
[02Ch 0044   4]                       Length : 00000028
[030h 0048   2]        Flags (decoded below) : 0001
            Processor Proximity Domain Valid : 1
[032h 0050   2]                    Reserved1 : 0000
[034h 0052   4] Attached Initiator Proximity Domain : 00000000
[038h 0056   4]      Memory Proximity Domain : 00000000

...

[050h 0080   2]               Structure Type : 0000 [Memory Proximity Domain Attributes]
[052h 0082   2]                     Reserved : 0000
[054h 0084   4]                       Length : 00000028
[058h 0088   2]        Flags (decoded below) : 0001
            Processor Proximity Domain Valid : 1
[05Ah 0090   2]                    Reserved1 : 0000
[05Ch 0092   4] Attached Initiator Proximity Domain : 00000001
[060h 0096   4]      Memory Proximity Domain : 00000001

...

[078h 0120   2]               Structure Type : 0000 [Memory Proximity Domain Attributes]
[07Ah 0122   2]                     Reserved : 0000
[07Ch 0124   4]                       Length : 00000028
[080h 0128   2]        Flags (decoded below) : 0000
            Processor Proximity Domain Valid : 0
[082h 0130   2]                    Reserved1 : 0000
[084h 0132   4] Attached Initiator Proximity Domain : 00000000
[088h 0136   4]      Memory Proximity Domain : 0000000A

...

[0A0h 0160   2]               Structure Type : 0000 [Memory Proximity Domain Attributes]
[0A2h 0162   2]                     Reserved : 0000
[0A4h 0164   4]                       Length : 00000028
[0A8h 0168   2]        Flags (decoded below) : 0000
            Processor Proximity Domain Valid : 0
[0AAh 0170   2]                    Reserved1 : 0000
[0ACh 0172   4] Attached Initiator Proximity Domain : 00000000
[0B0h 0176   4]      Memory Proximity Domain : 0000000B

...

[0C8h 0200   2]               Structure Type : 0001 [System Locality Latency and Bandwidth Information]
[0CAh 0202   2]                     Reserved : 0000
[0CCh 0204   4]                       Length : 00000168
[0D0h 0208   1]        Flags (decoded below) : 00
                            Memory Hierarchy : 0
[0D1h 0209   1]                    Data Type : 01
[0D2h 0210   2]                    Reserved1 : 0000
[0D4h 0212   4] Initiator Proximity Domains # : 0000000A
[0D8h 0216   4]   Target Proximity Domains # : 0000000C

...

[140h 0320   2]                        Entry : 038B
[142h 0322   2]                        Entry : 089F
[144h 0324   2]                        Entry : 09C4
[146h 0326   2]                        Entry : 09C4
[148h 0328   2]                        Entry : 09C4
[14Ah 0330   2]                        Entry : 09C4
[14Ch 0332   2]                        Entry : 157C
[14Eh 0334   2]                        Entry : 157C
[150h 0336   2]                        Entry : 157C
[152h 0338   2]                        Entry : 157C
[154h 0340   2]                        Entry : 3AFC
[156h 0342   2]                        Entry : 4268
[158h 0344   2]                        Entry : 089F
[15Ah 0346   2]                        Entry : 038B
[15Ch 0348   2]                        Entry : 157C
[15Eh 0350   2]                        Entry : 157C
[160h 0352   2]                        Entry : 157C
[162h 0354   2]                        Entry : 157C
[164h 0356   2]                        Entry : 09C4
[166h 0358   2]                        Entry : 09C4
[168h 0360   2]                        Entry : 09C4
[16Ah 0362   2]                        Entry : 09C4
[16Ch 0364   2]                        Entry : 4268
[16Eh 0366   2]                        Entry : 3AFC
----------------------------------------------------------------------
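
To make the asymmetry easier to see, here is a small sketch (mine, not from the dump) converting the hex entries from the Init->Target table discussed in this thread to decimal. Note the Entry Base Unit line is elided in the truncated dump above, so these are raw entry values rather than absolute picoseconds:

```python
# Raw HMAT latency entries from the Init->Target table under discussion.
# The Entry Base Unit is not shown in the truncated dump, so these are
# unscaled entry values, comparable only to each other.
latency = {
    ("node0", "node0"): 0x38B, ("node0", "node1"): 0x89F,
    ("node0", "node2"): 0x9C4, ("node0", "node3"): 0x3AFC,
    ("node1", "node0"): 0x89F, ("node1", "node1"): 0x38B,
    ("node1", "node2"): 0x3AFC, ("node1", "node3"): 0x4268,
}

# The suspicious pair: socket-local CXL access from each side.
print(latency[("node0", "node2")])  # 2500
print(latency[("node1", "node3")])  # 17000
```

With identical CXL devices on both sockets, a 6.8x difference between the two socket-local CXL latencies (2500 vs 17000) is the asymmetry in question.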

And here is the relevant extraction from `SRAT.dsl`. As you suspected,
the CXL memory ranges are indeed statically defined in the SRAT at boot:

----------------------------------------------------------------------
[000h 0000   4]                    Signature : "SRAT"    [System Resource Affinity Table]
[004h 0004   4]                 Table Length : 0000A1F8
[008h 0008   1]                     Revision : 03
[009h 0009   1]                     Checksum : 54
[00Ah 0010   6]                       Oem ID : "GBT   "
[010h 0016   8]                 Oem Table ID : "GBTUACPI"
[018h 0024   4]                 Oem Revision : 00000002
[01Ch 0028   4]              Asl Compiler ID : "AMI "
[020h 0032   4]        Asl Compiler Revision : 20230628

[024h 0036   4]               Table Revision : 00000001
[028h 0040   8]                     Reserved : 0000000000000000

[030h 0048   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[031h 0049   1]                       Length : 10

[032h 0050   1]      Proximity Domain Low(8) : 00
[033h 0051   1]                      Apic ID : FF
[034h 0052   4]        Flags (decoded below) : 00000000
                                     Enabled : 0
[038h 0056   1]              Local Sapic EID : 00
[039h 0057   3]    Proximity Domain High(24) : 000000
[03Ch 0060   4]                 Clock Domain : 00000000

[040h 0064   1]                Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[041h 0065   1]                       Length : 10

[042h 0066   1]      Proximity Domain Low(8) : 00
[043h 0067   1]                      Apic ID : FF
[044h 0068   4]        Flags (decoded below) : 00000000
                                     Enabled : 0
[048h 0072   1]              Local Sapic EID : 00

...

[A1A8h 41384   1]                Subtable Type : 01 [Memory Affinity]
[A1A9h 41385   1]                       Length : 28

[A1AAh 41386   4]             Proximity Domain : 0000000A
[A1AEh 41390   2]                    Reserved1 : 0000
[A1B0h 41392   8]                 Base Address : 000000C040000000
[A1B8h 41400   8]               Address Length : 0000010000000000
[A1C0h 41408   4]                    Reserved2 : 00000000
[A1C4h 41412   4]        Flags (decoded below) : 0000000B
                                     Enabled : 1

[A1D0h 41424   1]                Subtable Type : 01 [Memory Affinity]
[A1D1h 41425   1]                       Length : 28

[A1D2h 41426   4]             Proximity Domain : 0000000B
[A1D6h 41430   2]                    Reserved1 : 0000
[A1D8h 41432   8]                 Base Address : 0000071E40000000
[A1E0h 41440   8]               Address Length : 0000010000000000
[A1E8h 41448   4]                    Reserved2 : 00000000
[A1ECh 41452   4]        Flags (decoded below) : 0000000B
                                     Enabled : 1
----------------------------------------------------------------------
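
As a side note, the Flags value 0x0000000B on these Memory Affinity entries decodes to more than just `Enabled`. A hypothetical decoder, with the bit layout per the ACPI SRAT Memory Affinity definition (bit 0 Enabled, bit 1 Hot-Pluggable, bit 2 Non-Volatile, higher bits reserved):

```python
# Hypothetical decoder for the SRAT Memory Affinity flags shown above.
def decode_srat_mem_flags(flags: int) -> dict:
    return {
        "enabled":       bool(flags & 0x1),  # bit 0
        "hot_pluggable": bool(flags & 0x2),  # bit 1
        "non_volatile":  bool(flags & 0x4),  # bit 2
        "reserved":      flags >> 3,         # non-zero here (bit 3 is set)
    }

print(decode_srat_mem_flags(0x0000000B))
# {'enabled': True, 'hot_pluggable': True, 'non_volatile': False, 'reserved': 1}
```

So both CXL ranges are marked Enabled and Hot-Pluggable, and the firmware also sets a reserved bit that the truncated decode does not surface.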

Rakie Kim
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Dan Williams 1 week, 4 days ago
Rakie Kim wrote:
[..]
> Hello Jonathan,
> 
> Your insight is incredibly accurate. To clarify the situation, here is
> the actual configuration of my system:
> 
> NODE   Type          PXM
> node0  local memory  0x00
> node1  local memory  0x01
> node2  cxl memory    0x0A
> node3  cxl memory    0x0B
> 
> Physically, the node2 CXL is attached to node0 (Socket 0), and the
> node3 CXL is attached to node1 (Socket 1). However, extracting the
> HMAT.dsl reveals the following:
> 
> - local memory
>   [028h] Flags: 0001 (Processor Proximity Domain Valid = 1)
>          Attached Initiator Proximity Domain: 0x00
>          Memory Proximity Domain: 0x00
>   [050h] Flags: 0001 (Processor Proximity Domain Valid = 1)
>          Attached Initiator Proximity Domain: 0x01
>          Memory Proximity Domain: 0x01
> 
> - cxl memory
>   [078h] Flags: 0000 (Processor Proximity Domain Valid = 0)
>          Attached Initiator Proximity Domain: 0x00
>          Memory Proximity Domain: 0x0A
>   [0A0h] Flags: 0000 (Processor Proximity Domain Valid = 0)
>          Attached Initiator Proximity Domain: 0x00
>          Memory Proximity Domain: 0x0B

This looks good.

Unless the CPU is directly attached to the memory controller then there
is no attached initiator. For example, if you wanted to run an x86
memory controller configuration instruction like PCONFIG you would issue
an IPI to the CPU attached to the target memory controller. There is no
such connection for a CPU to do the same for a CXL proximity domain.

> As you correctly suspected, the flags for the CXL memory are 0000,
> meaning the Processor Proximity Domain is marked as invalid. But when
> checking the sysfs initiator configurations, it shows a different story:
> 
> Node   access0 Initiator  access1 Initiator
> node0  node0              node0
> node1  node1              node1
> node2  node1              node1
> node3  node1              node1

2 comments. HMAT is not a physical topology layout table. The
fallback determination of "best" initiator when "Attached Initiator PXM"
is not set is just a heuristic. That heuristic probably has not been
touched since the initial HMAT support went upstream.

> Although the Attached Initiator is set to 0 in HMAT with an invalid
> flag, sysfs strangely registers node1 as the initiator for both CXL
> nodes. Because both HMAT and sysfs are exposing abnormal values, it was
> impossible for me to determine the true socket connections for CXL
> using this data.

Yeah, this sounds more like a kernel bug report than a firmware bug
report at this point.


> > > Even though the distance map shows node2 is physically closer to
> > > Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> > > routing path strictly through Socket 1. Because the HMAT alone made it
> > > difficult to determine the exact physical socket connections on these
> > > systems, I ended up using the current CXL driver-based approach.
> > 
> > Are the HMAT latencies and bandwidths all there?  Or are some missing
> > and you have to use SLIT (which generally is garbage for historical
> > reasons of tuning SLIT to particular OS behaviour).
> > 
> 
> The HMAT latencies and bandwidths are present, but the values seem
> broken. Here is the latency table:
> 
> Init->Target | node0 | node1 | node2 | node3
> node0        | 0x38B | 0x89F | 0x9C4 | 0x3AFC
> node1        | 0x89F | 0x38B | 0x3AFC| 0x4268
> 
> I used the identical type of DRAM and CXL memory for both sockets.
> However, looking at the table, the local CXL access latency from
> node0->node2 (0x9C4) and node1->node3 (0x4268) shows a massive,
> unjustified difference. This asymmetry proves that the table is
> currently unreliable.

...or it is telling the truth. Would need more data.

> > > I wonder if others have experienced similar broken HMAT cases with CXL.
> > > If HMAT information becomes more reliable in the future, we could
> > > build a much more efficient structure.
> > 
> > Given it's being lightly used I suspect there will be many bugs :(
> > I hope we can assume they will get fixed however!
> > 
> > ...
> > 
> 
> The most critical issue caused by this broken initiator setting is that
> topology analysis tools like `hwloc` are completely misled. Currently,
> `hwloc` displays both CXL nodes as being attached to Socket 1.
> 
> I observed this exact same issue on both Sierra Forest and Granite
> Rapids systems. I believe this broken topology exposure is a severe
> problem that must be addressed, though I am not entirely sure what the
> best fix would be yet. I would love to hear your thoughts on this.

Before determining that these numbers are wrong you would need to redo
the calculation from CDAT data to see if you get a different answer.

The driver currently does this calculation as part of determining a QoS
class. It would be reasonable to also use that same calculation to double
check the BIOS firmware numbers for CXL proximity domains established at
boot.

> > > The complex topology cases you presented, such as multi-NUMA per socket,
> > > shared CXL switches, and IO expanders, are very important points.
> > > I clearly understand that the simple package-level grouping does not fully
> > > reflect the 1:1 relationship in these future hardware architectures.
> > > 
> > > I have also thought about the shared CXL switch scenario you mentioned,
> > > and I know the current design falls short in addressing it properly.
> > > While the current implementation starts with a simple socket-local
> > > restriction, I plan to evolve it into a more flexible node aggregation
> > > model to properly reflect all the diverse topologies you suggested.
> > 
> > If we can ensure it fails cleanly when it finds a topology that it can't
> > cope with (and I guess falls back to current) then I'm fine with a partial
> > solution that evolves.
> > 
> 
> I completely agree with ensuring a clean failure. To stabilize this
> partial solution, I am currently considering a few options for the
> next version:
> 
> 1. Enable this feature only when a strict 1:1 topology is detected.
> 2. Provide a sysfs interface allowing users to enable/disable it.
> 3. Allow users to manually override/configure the topology via sysfs.
> 4. Implement dynamic fallback behaviors depending on the detected
>    topology shape (needs further thought).

The advice is always start as simple as possible but no simpler.

It may be the case that Linux indeed finds that platform firmware comes
to a different result than expected. When that happens the CXL subsystem
can probably emit the mismatch details, or otherwise validate the HMAT.

As for actual physical topology layout determination, that is out of
scope for HMAT, but the CXL CDAT calculations do consider PCI link
details.

> By the way, when I first posted this RFC, I accidentally missed adding
> lsf-pc@lists.linux-foundation.org to the CC list. I am considering
> re-posting it to ensure it reaches the lsf-pc.

They are on the Cc: now, I expect that is sufficient.
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Rakie Kim 1 week ago
On Thu, 26 Mar 2026 13:13:40 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> Rakie Kim wrote:
> [..]
> > Hello Jonathan,
> > 
> > Your insight is incredibly accurate. To clarify the situation, here is
> > the actual configuration of my system:
> > 
> > NODE   Type          PXM
> > node0  local memory  0x00
> > node1  local memory  0x01
> > node2  cxl memory    0x0A
> > node3  cxl memory    0x0B
> > 
> > Physically, the node2 CXL is attached to node0 (Socket 0), and the
> > node3 CXL is attached to node1 (Socket 1). However, extracting the
> > HMAT.dsl reveals the following:
> > 
> > - local memory
> >   [028h] Flags: 0001 (Processor Proximity Domain Valid = 1)
> >          Attached Initiator Proximity Domain: 0x00
> >          Memory Proximity Domain: 0x00
> >   [050h] Flags: 0001 (Processor Proximity Domain Valid = 1)
> >          Attached Initiator Proximity Domain: 0x01
> >          Memory Proximity Domain: 0x01
> > 
> > - cxl memory
> >   [078h] Flags: 0000 (Processor Proximity Domain Valid = 0)
> >          Attached Initiator Proximity Domain: 0x00
> >          Memory Proximity Domain: 0x0A
> >   [0A0h] Flags: 0000 (Processor Proximity Domain Valid = 0)
> >          Attached Initiator Proximity Domain: 0x00
> >          Memory Proximity Domain: 0x0B
> 
> This looks good.
> 
> Unless the CPU is directly attached to the memory controller then there
> is no attached initiator. For example, if you wanted to run an x86
> memory controller configuration instruction like PCONFIG you would issue
> an IPI to the CPU attached to the target memory controller. There is no
> such connection for a CPU to do the same for a CXL proximity domain.
> 
> > As you correctly suspected, the flags for the CXL memory are 0000,
> > meaning the Processor Proximity Domain is marked as invalid. But when
> > checking the sysfs initiator configurations, it shows a different story:
> > 
> > Node   access0 Initiator  access1 Initiator
> > node0  node0              node0
> > node1  node1              node1
> > node2  node1              node1
> > node3  node1              node1
> 
> 2 comments. HMAT is not a physical topology layout table. The
> fallback determination of "best" initiator when "Attached Initiator PXM"
> is not set is just a heuristic. That heuristic probably has not been
> touched since the initial HMAT support went upstream.
> 
> > Although the Attached Initiator is set to 0 in HMAT with an invalid
> > flag, sysfs strangely registers node1 as the initiator for both CXL
> > nodes. Because both HMAT and sysfs are exposing abnormal values, it was
> > impossible for me to determine the true socket connections for CXL
> > using this data.
> 
> Yeah, this sounds more like a kernel bug report than a firmware bug
> report at this point.
> 

You are right. From the hardware's perspective, the `0000`
flag makes perfect sense since the CPU is not directly attached to
the CXL memory controller. I completely agree with your assessment
that this points directly to a bug in the kernel's outdated fallback
heuristic logic, rather than a firmware error.

> 
> > > > Even though the distance map shows node2 is physically closer to
> > > > Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> > > > routing path strictly through Socket 1. Because the HMAT alone made it
> > > > difficult to determine the exact physical socket connections on these
> > > > systems, I ended up using the current CXL driver-based approach.
> > > 
> > > Are the HMAT latencies and bandwidths all there?  Or are some missing
> > > and you have to use SLIT (which generally is garbage for historical
> > > reasons of tuning SLIT to particular OS behaviour).
> > > 
> > 
> > The HMAT latencies and bandwidths are present, but the values seem
> > broken. Here is the latency table:
> > 
> > Init->Target | node0 | node1 | node2 | node3
> > node0        | 0x38B | 0x89F | 0x9C4 | 0x3AFC
> > node1        | 0x89F | 0x38B | 0x3AFC| 0x4268
> > 
> > I used the identical type of DRAM and CXL memory for both sockets.
> > However, looking at the table, the local CXL access latency from
> > node0->node2 (0x9C4) and node1->node3 (0x4268) shows a massive,
> > unjustified difference. This asymmetry proves that the table is
> > currently unreliable.
> 
> ...or it is telling the truth. Would need more data.
> 
> > > > I wonder if others have experienced similar broken HMAT cases with CXL.
> > > > If HMAT information becomes more reliable in the future, we could
> > > > build a much more efficient structure.
> > > 
> > > Given it's being lightly used I suspect there will be many bugs :(
> > > I hope we can assume they will get fixed however!
> > > 
> > > ...
> > > 
> > 
> > The most critical issue caused by this broken initiator setting is that
> > topology analysis tools like `hwloc` are completely misled. Currently,
> > `hwloc` displays both CXL nodes as being attached to Socket 1.
> > 
> > I observed this exact same issue on both Sierra Forest and Granite
> > Rapids systems. I believe this broken topology exposure is a severe
> > problem that must be addressed, though I am not entirely sure what the
> > best fix would be yet. I would love to hear your thoughts on this.
> 
> Before determining that these numbers are wrong you would need to redo
> the calculation from CDAT data to see if you get a different answer.
> 
> The driver currently does this calculation as part of determining a QoS
> class. It would be reasonable to also use that same calculation to double
> check the BIOS firmware numbers for CXL proximity domains established at
> boot.
> 

It was indeed premature of me to conclude the table was broken solely
based on the large and asymmetric numbers.

Interestingly, Dave Jiang just mentioned in another reply that the
Intel BIOS folks confirmed these HMAT values actually represent
"end-to-end" latency, which perfectly explains why the numbers are
so much larger than expected.

Also, I have just posted the detailed `SRAT` and `HMAT` dumps in my
reply to Dave Jiang. Please feel free to refer to the exact firmware
structures we are discussing here:
https://lore.kernel.org/all/20260330025914.361-1-rakie.kim@sk.com/

> > > > The complex topology cases you presented, such as multi-NUMA per socket,
> > > > shared CXL switches, and IO expanders, are very important points.
> > > > I clearly understand that the simple package-level grouping does not fully
> > > > reflect the 1:1 relationship in these future hardware architectures.
> > > > 
> > > > I have also thought about the shared CXL switch scenario you mentioned,
> > > > and I know the current design falls short in addressing it properly.
> > > > While the current implementation starts with a simple socket-local
> > > > restriction, I plan to evolve it into a more flexible node aggregation
> > > > model to properly reflect all the diverse topologies you suggested.
> > > 
> > > If we can ensure it fails cleanly when it finds a topology that it can't
> > > cope with (and I guess falls back to current) then I'm fine with a partial
> > > solution that evolves.
> > > 
> > 
> > I completely agree with ensuring a clean failure. To stabilize this
> > partial solution, I am currently considering a few options for the
> > next version:
> > 
> > 1. Enable this feature only when a strict 1:1 topology is detected.
> > 2. Provide a sysfs allowing users to enable/disable it.
> > 3. Allow users to manually override/configure the topology via sysfs.
> > 4. Implement dynamic fallback behaviors depending on the detected
> >    topology shape (needs further thought).
> 
> The advice is always to start as simple as possible, but no simpler.
> 
> It may be the case that Linux indeed finds that platform firmware comes
> to a different result than expected. When that happens the CXL subsystem
> can probably emit the mismatch details, or otherwise validate the HMAT.
> 
> As for actual physical topology layout determination, that is out of
> scope for HMAT, but the CXL CDAT calculations do consider PCI link
> details.
> 


Thank you for the clear architectural guidance.

Knowing that physical topology determination is strictly out of scope
for HMAT reassures me that leveraging the PCI link details is indeed
the correct direction for this Socket-aware feature.

To discover the topology, I actually implemented a method to retrieve
this information directly from the CXL driver in PATCH 3 of this RFC:
https://lore.kernel.org/all/20260316051258.246-4-rakie.kim@sk.com/

However, I am still wondering whether this specific implementation is
truly the correct and most appropriate way to achieve this in the kernel.
Any thoughts on that specific approach would be highly appreciated.

I will keep your advice in mind and ensure the fallback and policy
designs are kept as simple as possible for the next version.

Thanks again for your time and all the valuable insights.

Rakie Kim
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Jonathan Cameron 1 week, 5 days ago
On Tue, 24 Mar 2026 14:35:45 +0900
Rakie Kim <rakie.kim@sk.com> wrote:

> On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> >   
> > > > > 
> > > > > To make this possible, the system requires a mechanism to understand
> > > > > the physical topology. The existing NUMA distance model provides only
> > > > > relative latency values between nodes and lacks any notion of
> > > > > structural grouping such as socket boundaries. This is especially
> > > > > problematic for CXL memory nodes, which appear without an explicit
> > > > > socket association.    
> > > > 
> > > > So in a general sense, the missing info here is effectively the same
> > > > stuff we are missing from the HMAT presentation (it's there in the
> > > > table and it's there to compute in CXL cases) just because we decided
> > > > not to surface anything other than distances to memory from nearest
> > > > initiator.  I chatted to Joshua and Kieth about filling in that stuff
> > > > at last LSFMM. To me that's just a bit of engineering work that needs
> > > > doing now we have proven use cases for the data. Mostly it's figuring out
> > > > the presentation to userspace and kernel data structures as it's a
> > > > lot of data in a big system (typically at least 32 NUMA nodes).
> > > >     
> > > 
> > > Hearing about the discussion on exposing HMAT data is very welcome news.
> > > Because this detailed topology information is not yet fully exposed to
> > > the kernel and userspace, I used a temporary package-based restriction.
> > > Figuring out how to expose and integrate this data into the kernel data
> > > structures is indeed a crucial engineering task we need to solve.
> > > 
> > > Actually, when I first started this work, I considered fetching the
> > > topology information from HMAT before adopting the current approach.
> > > However, I encountered a firmware issue on my test systems
> > > (Granite Rapids and Sierra Forest).
> > > 
> > > Although each socket has its own locally attached CXL device, the HMAT
> > > only registers node1 (Socket 1) as the initiator for both CXL memory
> > > nodes (node2 and node3). As a result, the sysfs HMAT initiators for
> > > both node2 and node3 only expose node1.  
> > 
> > Do you mean the Memory Proximity Domain Attributes Structure has
> > the "Proximity Domain for the Attached Initiator" set wrong?
> > Was this for its presentation of the full path to CXL mem nodes, or
> > to a PXM with a generic port?  Sounds like you have SRAT covering
> > the CXL mem so ideal would be to have the HMAT data to GP and to
> > the CXL PXMs that BIOS has set up.
> > 
> > Either way having that set at all for CXL memory is fishy as it's about
> > where the 'memory controller' is and on CXL mem that should be at the
> > device end of the link.  My understanding is that it was only meant
> > to be set when you have separate memory only Nodes where the physical
> > controller is in a particular other node (e.g. what you do
> > if you have a CPU with DRAM and HBM).  Maybe we need to make the
> > kernel warn + ignore that if it is set to something odd like yours.
> >   
> 
> Hello Jonathan,
> 
> Your insight is incredibly accurate. To clarify the situation, here is
> the actual configuration of my system:
> 
> NODE   Type          PXD
> node0  local memory  0x00
> node1  local memory  0x01
> node2  cxl memory    0x0A
> node3  cxl memory    0x0B
> 
> Physically, the node2 CXL is attached to node0 (Socket 0), and the
> node3 CXL is attached to node1 (Socket 1). However, extracting the
> HMAT.dsl reveals the following:
> 
> - local memory
>   [028h] Flags: 0001 (Processor Proximity Domain Valid = 1)
>          Attached Initiator Proximity Domain: 0x00
>          Memory Proximity Domain: 0x00
>   [050h] Flags: 0001 (Processor Proximity Domain Valid = 1)
>          Attached Initiator Proximity Domain: 0x01
>          Memory Proximity Domain: 0x01
> 
> - cxl memory
>   [078h] Flags: 0000 (Processor Proximity Domain Valid = 0)
>          Attached Initiator Proximity Domain: 0x00
>          Memory Proximity Domain: 0x0A
>   [0A0h] Flags: 0000 (Processor Proximity Domain Valid = 0)
>          Attached Initiator Proximity Domain: 0x00
>          Memory Proximity Domain: 0x0B

That's faintly amusing given it conveys no information at all.
Still, unless we have a bug it shouldn't cause anything odd.

> 
> As you correctly suspected, the flags for the CXL memory are 0000,
> meaning the Processor Proximity Domain is marked as invalid. But when
> checking the sysfs initiator configurations, it shows a different story:
> 
> Node   access0 Initiator  access1 Initiator
> node0  node0              node0
> node1  node1              node1
> node2  node1              node1
> node3  node1              node1
> 
> Although the Attached Initiator is set to 0 in HMAT with an invalid
> flag, sysfs strangely registers node1 as the initiator for both CXL
> nodes.
It's been a while since I looked at the HMAT parser...

If ACPI_HMAT_PROCESSOR_PD_VALID isn't set, hmat_parse_proximity_domain()
shouldn't set the target. At the end of that function it should be set to
PXM_INVALID.

It should therefore retain the state from alloc_memory_initiator(), I think?

Given I did all my testing without the PD_VALID set (as it wasn't on my
test system) it should be fine with that.  Anyhow, let's look at the data
for proximity.



> Because both HMAT and sysfs are exposing abnormal values, it was
> impossible for me to determine the true socket connections for CXL
> using this data.
> 
> > > 
> > > Even though the distance map shows node2 is physically closer to
> > > Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> > > routing path strictly through Socket 1. Because the HMAT alone made it
> > > difficult to determine the exact physical socket connections on these
> > > systems, I ended up using the current CXL driver-based approach.  
> > 
> > Are the HMAT latencies and bandwidths all there?  Or are some missing
> > and you have to use SLIT (which generally is garbage for historical
> > reasons of tuning SLIT to particular OS behaviour).
> >   
> 
> The HMAT latencies and bandwidths are present, but the values seem
> broken. Here is the latency table:
> 
> Init->Target | node0 | node1 | node2 | node3
> node0        | 0x38B | 0x89F | 0x9C4 | 0x3AFC
> node1        | 0x89F | 0x38B | 0x3AFC| 0x4268

Yeah. That would do it...  Looks like that final value is garbage.

> 
> I used the identical type of DRAM and CXL memory for both sockets.
> However, looking at the table, the local CXL access latency from
> node0->node2 (0x9C4) and node1->node3 (0x4268) shows a massive,
> unjustified difference. This asymmetry proves that the table is
> currently unreliable.

Poke your favourite bios vendor I guess.

I asked one of the Intel folks to take a look and see if this is a broader
issue or just one particular BIOS.

> 
> > > 
> > > I wonder if others have experienced similar broken HMAT cases with CXL.
> > > If HMAT information becomes more reliable in the future, we could
> > > build a much more efficient structure.  
> > 
> > Given it's being lightly used I suspect there will be many bugs :(
> > I hope we can assume they will get fixed however!
> > 
> > ...
> >   
> 
> The most critical issue caused by this broken initiator setting is that
> topology analysis tools like `hwloc` are completely misled. Currently,
> `hwloc` displays both CXL nodes as being attached to Socket 1.
> 
> I observed this exact same issue on both Sierra Forest and Granite
> Rapids systems. I believe this broken topology exposure is a severe
> problem that must be addressed, though I am not entirely sure what the
> best fix would be yet. I would love to hear your thoughts on this.

Fix the BIOS then.  If you don't mind, can you provide a dump of
`cat /sys/firmware/acpi/tables/HMAT`, just so we can check there is
nothing wrong with the parser.

> 
> > > 
> > > The complex topology cases you presented, such as multi-NUMA per socket,
> > > shared CXL switches, and IO expanders, are very important points.
> > > I clearly understand that the simple package-level grouping does not fully
> > > reflect the 1:1 relationship in these future hardware architectures.
> > > 
> > > I have also thought about the shared CXL switch scenario you mentioned,
> > > and I know the current design falls short in addressing it properly.
> > > While the current implementation starts with a simple socket-local
> > > restriction, I plan to evolve it into a more flexible node aggregation
> > > model to properly reflect all the diverse topologies you suggested.  
> > 
> > If we can ensure it fails cleanly when it finds a topology that it can't
> > cope with (and I guess falls back to current) then I'm fine with a partial
> > solution that evolves.
> >   
> 
> I completely agree with ensuring a clean failure. To stabilize this
> partial solution, I am currently considering a few options for the
> next version:
> 
> 1. Enable this feature only when a strict 1:1 topology is detected.
Definitely default to off.  Maybe allow a user to say they want to do it
anyway. I can see there might be systems that are only a tiny bit off and
it makes no practical difference.

> 2. Provide a sysfs allowing users to enable/disable it.
Makes sense.
> 3. Allow users to manually override/configure the topology via sysfs.

No.  If people are in this state we should apply fixes to the HMAT table
either by injection of real data or some quirking.  If we add userspace
control via simpler means the motivation for people to fix bios goes out
the window and it never gets resolved.

> 4. Implement dynamic fallback behaviors depending on the detected
>    topology shape (needs further thought).

That would be interesting. But maybe not a 1st version thing :)

> 
> By the way, when I first posted this RFC, I accidentally missed adding
> lsf-pc@lists.linux-foundation.org to the CC list. I am considering
> re-posting it to ensure it reaches the lsf-pc.

Makes sense. Make sure to add a back link to this thread so the
discussion already going on is visible.
> 
> Thanks again for your profound insights and time. It is tremendously
> helpful.

Thanks to you for starting to solve the problem!

J
> 
> Rakie Kim
> 
> >   
> > > 
> > > Thanks again for your time and review.  
> > 
> > You are welcome.
> > 
> > Thanks
> > 
> > Jonathan
> >   
> > > 
> > > Rakie Kim
> > >   
>
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Rakie Kim 1 week, 4 days ago
On Wed, 25 Mar 2026 12:33:50 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> On Tue, 24 Mar 2026 14:35:45 +0900
> Rakie Kim <rakie.kim@sk.com> wrote:
> 
> > On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> > >   
> > > > > > 
> > > > > > To make this possible, the system requires a mechanism to understand
> > > > > > the physical topology. The existing NUMA distance model provides only
> > > > > > relative latency values between nodes and lacks any notion of
> > > > > > structural grouping such as socket boundaries. This is especially
> > > > > > problematic for CXL memory nodes, which appear without an explicit
> > > > > > socket association.    
> > > > > 
> > > > > So in a general sense, the missing info here is effectively the same
> > > > > stuff we are missing from the HMAT presentation (it's there in the
> > > > > table and it's there to compute in CXL cases) just because we decided
> > > > > not to surface anything other than distances to memory from nearest
> > > > > initiator.  I chatted to Joshua and Kieth about filling in that stuff
> > > > > at last LSFMM. To me that's just a bit of engineering work that needs
> > > > > doing now we have proven use cases for the data. Mostly it's figuring out
> > > > > the presentation to userspace and kernel data structures as it's a
> > > > > lot of data in a big system (typically at least 32 NUMA nodes).
> > > > >     
> > > > 
> > > > Hearing about the discussion on exposing HMAT data is very welcome news.
> > > > Because this detailed topology information is not yet fully exposed to
> > > > the kernel and userspace, I used a temporary package-based restriction.
> > > > Figuring out how to expose and integrate this data into the kernel data
> > > > structures is indeed a crucial engineering task we need to solve.
> > > > 
> > > > Actually, when I first started this work, I considered fetching the
> > > > topology information from HMAT before adopting the current approach.
> > > > However, I encountered a firmware issue on my test systems
> > > > (Granite Rapids and Sierra Forest).
> > > > 
> > > > Although each socket has its own locally attached CXL device, the HMAT
> > > > only registers node1 (Socket 1) as the initiator for both CXL memory
> > > > nodes (node2 and node3). As a result, the sysfs HMAT initiators for
> > > > both node2 and node3 only expose node1.  
> > > 
> > > Do you mean the Memory Proximity Domain Attributes Structure has
> > > the "Proximity Domain for the Attached Initiator" set wrong?
> > > Was this for its presentation of the full path to CXL mem nodes, or
> > > to a PXM with a generic port?  Sounds like you have SRAT covering
> > > the CXL mem so ideal would be to have the HMAT data to GP and to
> > > the CXL PXMs that BIOS has set up.
> > > 
> > > Either way having that set at all for CXL memory is fishy as it's about
> > > where the 'memory controller' is and on CXL mem that should be at the
> > > device end of the link.  My understanding is that it was only meant
> > > to be set when you have separate memory only Nodes where the physical
> > > controller is in a particular other node (e.g. what you do
> > > if you have a CPU with DRAM and HBM).  Maybe we need to make the
> > > kernel warn + ignore that if it is set to something odd like yours.
> > >   
> > 
> > Hello Jonathan,
> > 
> > Your insight is incredibly accurate. To clarify the situation, here is
> > the actual configuration of my system:
> > 
> > NODE   Type          PXD
> > node0  local memory  0x00
> > node1  local memory  0x01
> > node2  cxl memory    0x0A
> > node3  cxl memory    0x0B
> > 
> > Physically, the node2 CXL is attached to node0 (Socket 0), and the
> > node3 CXL is attached to node1 (Socket 1). However, extracting the
> > HMAT.dsl reveals the following:
> > 
> > - local memory
> >   [028h] Flags: 0001 (Processor Proximity Domain Valid = 1)
> >          Attached Initiator Proximity Domain: 0x00
> >          Memory Proximity Domain: 0x00
> >   [050h] Flags: 0001 (Processor Proximity Domain Valid = 1)
> >          Attached Initiator Proximity Domain: 0x01
> >          Memory Proximity Domain: 0x01
> > 
> > - cxl memory
> >   [078h] Flags: 0000 (Processor Proximity Domain Valid = 0)
> >          Attached Initiator Proximity Domain: 0x00
> >          Memory Proximity Domain: 0x0A
> >   [0A0h] Flags: 0000 (Processor Proximity Domain Valid = 0)
> >          Attached Initiator Proximity Domain: 0x00
> >          Memory Proximity Domain: 0x0B
> 
> That's faintly amusing given it conveys no information at all.
> Still, unless we have a bug it shouldn't cause anything odd.
> 
> > 
> > As you correctly suspected, the flags for the CXL memory are 0000,
> > meaning the Processor Proximity Domain is marked as invalid. But when
> > checking the sysfs initiator configurations, it shows a different story:
> > 
> > Node   access0 Initiator  access1 Initiator
> > node0  node0              node0
> > node1  node1              node1
> > node2  node1              node1
> > node3  node1              node1
> > 
> > Although the Attached Initiator is set to 0 in HMAT with an invalid
> > flag, sysfs strangely registers node1 as the initiator for both CXL
> > nodes.
> It's been a while since I looked at the HMAT parser...
> 
> If ACPI_HMAT_PROCESSOR_PD_VALID isn't set, hmat_parse_proximity_domain()
> shouldn't set the target. At the end of that function it should be set to
> PXM_INVALID.
> 
> It should therefore retain the state from alloc_memory_initiator(), I think?
> 
> Given I did all my testing without the PD_VALID set (as it wasn't on my
> test system) it should be fine with that.  Anyhow, let's look at the data
> for proximity.
> 
> 

Hello Jonathan,

Thank you for the deep insight into the HMAT parser code. As you
mentioned, considering the current state where node 1 is still
registered as the initiator in sysfs despite the flag being 0, it
seems highly likely that the kernel parser logic is not handling
this specific situation gracefully.

> 
> > Because both HMAT and sysfs are exposing abnormal values, it was
> > impossible for me to determine the true socket connections for CXL
> > using this data.
> > 
> > > > 
> > > > Even though the distance map shows node2 is physically closer to
> > > > Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> > > > routing path strictly through Socket 1. Because the HMAT alone made it
> > > > difficult to determine the exact physical socket connections on these
> > > > systems, I ended up using the current CXL driver-based approach.  
> > > 
> > > Are the HMAT latencies and bandwidths all there?  Or are some missing
> > > and you have to use SLIT (which generally is garbage for historical
> > > reasons of tuning SLIT to particular OS behaviour).
> > >   
> > 
> > The HMAT latencies and bandwidths are present, but the values seem
> > broken. Here is the latency table:
> > 
> > Init->Target | node0 | node1 | node2 | node3
> > node0        | 0x38B | 0x89F | 0x9C4 | 0x3AFC
> > node1        | 0x89F | 0x38B | 0x3AFC| 0x4268
> 
> Yeah. That would do it...  Looks like that final value is garbage.
> 
> > 
> > I used the identical type of DRAM and CXL memory for both sockets.
> > However, looking at the table, the local CXL access latency from
> > node0->node2 (0x9C4) and node1->node3 (0x4268) shows a massive,
> > unjustified difference. This asymmetry proves that the table is
> > currently unreliable.
> 
> Poke your favourite bios vendor I guess.
> 
> I asked one of the Intel folks to take a look and see if this is a broader
> issue or just one particular BIOS.
> 

I really appreciate you reaching out to the Intel contact to check if
this is a broader platform issue. I will also try to find a way to
report this BIOS issue to our system vendor, though I might need to
figure out the proper channel since I am not the system administrator.

Regarding the HMAT dump you requested, how should I provide it to you?
Would a hex dump converted via a utility like `xxd` be acceptable,
something like the snippet below?

00000000: 484d 4154 6806 0000 026a 4742 5420 2020  HMATh....jGBT
00000010: 4742 5455 4143 5049 0920 0701 414d 4920  GBTUACPI. ..AMI
00000020: 2806 2320 0000 0000 0000 0000 2800 0000  (.# ........(...
00000030: 0100 0000 0000 0000 0000 0000 0000 0000  ................

> > 
> > > > 
> > > > I wonder if others have experienced similar broken HMAT cases with CXL.
> > > > If HMAT information becomes more reliable in the future, we could
> > > > build a much more efficient structure.  
> > > 
> > > Given it's being lightly used I suspect there will be many bugs :(
> > > I hope we can assume they will get fixed however!
> > > 
> > > ...
> > >   
> > 
> > The most critical issue caused by this broken initiator setting is that
> > topology analysis tools like `hwloc` are completely misled. Currently,
> > `hwloc` displays both CXL nodes as being attached to Socket 1.
> > 
> > I observed this exact same issue on both Sierra Forest and Granite
> > Rapids systems. I believe this broken topology exposure is a severe
> > problem that must be addressed, though I am not entirely sure what the
> > best fix would be yet. I would love to hear your thoughts on this.
> 
> Fix the BIOS then.  If you don't mind, can you provide a dump of
> `cat /sys/firmware/acpi/tables/HMAT`, just so we can check there is
> nothing wrong with the parser.
> 
> > 
> > > > 
> > > > The complex topology cases you presented, such as multi-NUMA per socket,
> > > > shared CXL switches, and IO expanders, are very important points.
> > > > I clearly understand that the simple package-level grouping does not fully
> > > > reflect the 1:1 relationship in these future hardware architectures.
> > > > 
> > > > I have also thought about the shared CXL switch scenario you mentioned,
> > > > and I know the current design falls short in addressing it properly.
> > > > While the current implementation starts with a simple socket-local
> > > > restriction, I plan to evolve it into a more flexible node aggregation
> > > > model to properly reflect all the diverse topologies you suggested.  
> > > 
> > > If we can ensure it fails cleanly when it finds a topology that it can't
> > > cope with (and I guess falls back to current) then I'm fine with a partial
> > > solution that evolves.
> > >   
> > 
> > I completely agree with ensuring a clean failure. To stabilize this
> > partial solution, I am currently considering a few options for the
> > next version:
> > 
> > 1. Enable this feature only when a strict 1:1 topology is detected.
> Definitely default to off.  Maybe allow a user to say they want to do it
> anyway. I can see there might be systems that are only a tiny bit off and
> it makes no practical difference.
> 

Your suggestion is very reasonable. I will proceed with this approach
for the next version, keeping the feature disabled by default.

> > 2. Provide a sysfs allowing users to enable/disable it.
> Makes sense.

I will include this sysfs enable/disable feature in the next version.

> > 3. Allow users to manually override/configure the topology via sysfs.
> 
> No.  If people are in this state we should apply fixes to the HMAT table
> either by injection of real data or some quirking.  If we add userspace
> control via simpler means the motivation for people to fix bios goes out
> the window and it never gets resolved.
> 

Your reasoning is absolutely correct. I will not allow users to modify
the topology via sysfs. However, I plan to provide a read-only sysfs
interface so users can at least check the current topology information.

> > 4. Implement dynamic fallback behaviors depending on the detected
> >    topology shape (needs further thought).
> 
> That would be interesting. But maybe not a 1st version thing :)
> 

This is an area I also need to think more deeply about. I will not
include it in the initial version, but will consider implementing it
in the future.

Once again, I deeply appreciate your time, thorough review, and for
reaching out to Intel for further clarification. It is a huge help.

Rakie Kim
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Dave Jiang 1 week, 4 days ago

On 3/26/26 1:54 AM, Rakie Kim wrote:
> On Wed, 25 Mar 2026 12:33:50 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
>> On Tue, 24 Mar 2026 14:35:45 +0900
>> Rakie Kim <rakie.kim@sk.com> wrote:
>>
>>> On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:

<--snip-->

 
> Hello Jonathan,
> 
> Thank you for the deep insight into the HMAT parser code. As you
> mentioned, considering the current state where node 1 is still
> registered as the initiator in sysfs despite the flag being 0, it
> seems highly likely that the kernel parser logic is not handling
> this specific situation gracefully.
> 
>>
>>> Because both HMAT and sysfs are exposing abnormal values, it was
>>> impossible for me to determine the true socket connections for CXL
>>> using this data.
>>>
>>>>>
>>>>> Even though the distance map shows node2 is physically closer to
>>>>> Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
>>>>> routing path strictly through Socket 1. Because the HMAT alone made it
>>>>> difficult to determine the exact physical socket connections on these
>>>>> systems, I ended up using the current CXL driver-based approach.  
>>>>
>>>> Are the HMAT latencies and bandwidths all there?  Or are some missing
>>>> and you have to use SLIT (which generally is garbage for historical
>>>> reasons of tuning SLIT to particular OS behaviour).
>>>>   
>>>
>>> The HMAT latencies and bandwidths are present, but the values seem
>>> broken. Here is the latency table:
>>>
>>> Init->Target | node0 | node1 | node2 | node3
>>> node0        | 0x38B | 0x89F | 0x9C4 | 0x3AFC
>>> node1        | 0x89F | 0x38B | 0x3AFC| 0x4268
>>
>> Yeah. That would do it...  Looks like that final value is garbage.

Hi Rakie,
So I talked to the Intel BIOS folks, and apparently for devices that are not hot-plugged (with memory ranges provided in SRAT), those HMAT values are end-to-end values, not just CPU to Gen Port. That's why they look so much bigger. So there are a couple of things we'll have to consider:
1. Make sure that Intel, AMD, and ARM HMATs are all created the same way and that this is the agreed-on way to do it. Hopefully someone from the AMD and ARM vendors can comment. We all need to be on the same page for the CXL kernel code to work properly.

2. Add code in the CXL driver to detect whether the range is in SRAT and skip the end-to-end perf calculation if that is the case.

DJ


<--snip-->
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Rakie Kim 1 week, 1 day ago
On Thu, 26 Mar 2026 14:41:32 -0700 Dave Jiang <dave.jiang@intel.com> wrote:
> 
> 
> On 3/26/26 1:54 AM, Rakie Kim wrote:
> > On Wed, 25 Mar 2026 12:33:50 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> >> On Tue, 24 Mar 2026 14:35:45 +0900
> >> Rakie Kim <rakie.kim@sk.com> wrote:
> >>
> >>> On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> 
> <--snip-->
> 
>  
> > Hello Jonathan,
> > 
> > Thank you for the deep insight into the HMAT parser code. As you
> > mentioned, considering the current state where node 1 is still
> > registered as the initiator in sysfs despite the flag being 0, it
> > seems highly likely that the kernel parser logic is not handling
> > this specific situation gracefully.
> > 
> >>
> >>> Because both HMAT and sysfs are exposing abnormal values, it was
> >>> impossible for me to determine the true socket connections for CXL
> >>> using this data.
> >>>
> >>>>>
> >>>>> Even though the distance map shows node2 is physically closer to
> >>>>> Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> >>>>> routing path strictly through Socket 1. Because the HMAT alone made it
> >>>>> difficult to determine the exact physical socket connections on these
> >>>>> systems, I ended up using the current CXL driver-based approach.  
> >>>>
> >>>> Are the HMAT latencies and bandwidths all there?  Or are some missing
> >>>> and you have to use SLIT (which generally is garbage for historical
> >>>> reasons of tuning SLIT to particular OS behaviour).
> >>>>   
> >>>
> >>> The HMAT latencies and bandwidths are present, but the values seem
> >>> broken. Here is the latency table:
> >>>
> >>> Init->Target | node0 | node1 | node2 | node3
> >>> node0        | 0x38B | 0x89F | 0x9C4 | 0x3AFC
> >>> node1        | 0x89F | 0x38B | 0x3AFC| 0x4268
> >>
> >> Yeah. That would do it...  Looks like that final value is garbage.
> 
> Hi Rakie,
> So I talked to the Intel BIOS folks and apparently for devices that are not hot-plugged (with memory ranges provided in SRAT), those HMAT values are the value for end to end and not just CPU to Gen Port. That's why they look so much bigger. So there are couple things we'll have to consider:
> 1. Make sure that Intel, AMD, and ARM HMATs are all created the same way and this is the agreed on way to do this. Hopefully someone from AMD and ARM vendors can comment. We all should get on the same page for the CXL kernel code to work properly.
> 
> 2. Add code in the CXL driver to detect whether the range is in SRAT and then skip the end to end perf calculation if that is the case. 
> 
> DJ
> 

Hello Dave,

Thank you for reaching out to the Intel BIOS folks directly. Knowing
that the HMAT values represent end-to-end latency completely explains
why the numbers seemed so disproportionately large.

I strongly agree with your two points. Establishing a consensus
across all architecture vendors (Intel, AMD, ARM) on how these
values are interpreted is crucial. Also, adding logic to the CXL
driver to detect the SRAT presence and skip redundant calculations
sounds like the exact right direction.

I have posted the detailed SRAT and HMAT information at the link below:
https://lore.kernel.org/all/20260330025914.361-1-rakie.kim@sk.com/

Rakie Kim

> 
> <--snip-->
>
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Dave Jiang 1 week, 4 days ago

On 3/26/26 2:41 PM, Dave Jiang wrote:
> 
> 
> On 3/26/26 1:54 AM, Rakie Kim wrote:
>> On Wed, 25 Mar 2026 12:33:50 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
>>> On Tue, 24 Mar 2026 14:35:45 +0900
>>> Rakie Kim <rakie.kim@sk.com> wrote:
>>>
>>>> On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> 
> <--snip-->
> 
>  
>> Hello Jonathan,
>>
>> Thank you for the deep insight into the HMAT parser code. As you
>> mentioned, considering the current state where node 1 is still
>> registered as the initiator in sysfs despite the flag being 0, it
>> seems highly likely that the kernel parser logic is not handling
>> this specific situation gracefully.
>>
>>>
>>>> Because both HMAT and sysfs are exposing abnormal values, it was
>>>> impossible for me to determine the true socket connections for CXL
>>>> using this data.
>>>>
>>>>>>
>>>>>> Even though the distance map shows node2 is physically closer to
>>>>>> Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
>>>>>> routing path strictly through Socket 1. Because the HMAT alone made it
>>>>>> difficult to determine the exact physical socket connections on these
>>>>>> systems, I ended up using the current CXL driver-based approach.  
>>>>>
>>>>> Are the HMAT latencies and bandwidths all there?  Or are some missing
>>>>> and you have to use SLIT (which generally is garbage for historical
>>>>> reasons of tuning SLIT to particular OS behaviour).
>>>>>   
>>>>
>>>> The HMAT latencies and bandwidths are present, but the values seem
>>>> broken. Here is the latency table:
>>>>
>>>> Init->Target | node0 | node1 | node2 | node3
>>>> node0        | 0x38B | 0x89F | 0x9C4 | 0x3AFC
>>>> node1        | 0x89F | 0x38B | 0x3AFC| 0x4268
>>>
>>> Yeah. That would do it...  Looks like that final value is garbage.
> 
> Hi Rakie,
> So I talked to the Intel BIOS folks and apparently for devices that are not hot-plugged (with memory ranges provided in SRAT), those HMAT values are the value for end to end and not just CPU to Gen Port. That's why they look so much bigger. So there are couple things we'll have to consider:
> 1. Make sure that Intel, AMD, and ARM HMATs are all created the same way and this is the agreed on way to do this. Hopefully someone from AMD and ARM vendors can comment. We all should get on the same page for the CXL kernel code to work properly.
> 
> 2. Add code in the CXL driver to detect whether the range is in SRAT and then skip the end to end perf calculation if that is the case. 

After further talking to Jonathan, I don't think this part, at least, is an issue. The devices that are attached at boot do not have Generic Ports in the SRAT.

> 
> DJ
> 
> 
> <--snip-->
> 
>
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Rakie Kim 1 week ago
On Thu, 26 Mar 2026 15:19:49 -0700 Dave Jiang <dave.jiang@intel.com> wrote:
> 
> 
> On 3/26/26 2:41 PM, Dave Jiang wrote:
> > 
> > 
> > On 3/26/26 1:54 AM, Rakie Kim wrote:
> >> On Wed, 25 Mar 2026 12:33:50 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> >>> On Tue, 24 Mar 2026 14:35:45 +0900
> >>> Rakie Kim <rakie.kim@sk.com> wrote:
> >>>
> >>>> On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> > 
> > <--snip-->
> > 
> >  
> >> Hello Jonathan,
> >>
> >> Thank you for the deep insight into the HMAT parser code. As you
> >> mentioned, considering the current state where node 1 is still
> >> registered as the initiator in sysfs despite the flag being 0, it
> >> seems highly likely that the kernel parser logic is not handling
> >> this specific situation gracefully.
> >>
> >>>
> >>>> Because both HMAT and sysfs are exposing abnormal values, it was
> >>>> impossible for me to determine the true socket connections for CXL
> >>>> using this data.
> >>>>
> >>>>>>
> >>>>>> Even though the distance map shows node2 is physically closer to
> >>>>>> Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> >>>>>> routing path strictly through Socket 1. Because the HMAT alone made it
> >>>>>> difficult to determine the exact physical socket connections on these
> >>>>>> systems, I ended up using the current CXL driver-based approach.  
> >>>>>
> >>>>> Are the HMAT latencies and bandwidths all there?  Or are some missing
> >>>>> and you have to use SLIT (which generally is garbage for historical
> >>>>> reasons of tuning SLIT to particular OS behaviour).
> >>>>>   
> >>>>
> >>>> The HMAT latencies and bandwidths are present, but the values seem
> >>>> broken. Here is the latency table:
> >>>>
> >>>> Init->Target | node0 | node1 | node2 | node3
> >>>> node0        | 0x38B | 0x89F | 0x9C4 | 0x3AFC
> >>>> node1        | 0x89F | 0x38B | 0x3AFC| 0x4268
> >>>
> >>> Yeah. That would do it...  Looks like that final value is garbage.
> > 
> > Hi Rakie,
> > So I talked to the Intel BIOS folks and apparently for devices that are not hot-plugged (with memory ranges provided in SRAT), those HMAT values are the value for end to end and not just CPU to Gen Port. That's why they look so much bigger. So there are couple things we'll have to consider:
> > 1. Make sure that Intel, AMD, and ARM HMATs are all created the same way and this is the agreed on way to do this. Hopefully someone from AMD and ARM vendors can comment. We all should get on the same page for the CXL kernel code to work properly.
> > 
> > 2. Add code in the CXL driver to detect whether the range is in SRAT and then skip the end to end perf calculation if that is the case. 
> 
> After further talking to Jonathan, I don't think at least this part is an issue. The devices that are attached at boot do not have Generic Ports in the SRAT.
> 

Hello Dave,

Understood. I'm glad that the discussion with Jonathan helped clarify
the situation regarding the Generic Ports in the SRAT. Thank you for
the quick update on this.

Thanks again for looking into this so thoroughly and keeping me in the loop.

Rakie Kim
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Joshua Hahn 3 weeks ago
Hello Rakie! I hope you have been doing well. Thank you for this
RFC, I think it is a very interesting idea. 

[...snip...]

> Consider a dual-socket system:
> 
>           node0             node1
>         +-------+         +-------+
>         | CPU 0 |---------| CPU 1 |
>         +-------+         +-------+
>         | DRAM0 |         | DRAM1 |
>         +---+---+         +---+---+
>             |                 |
>         +---+---+         +---+---+
>         | CXL 0 |         | CXL 1 |
>         +-------+         +-------+
>           node2             node3
> 
> Assuming local DRAM provides 300 GB/s and local CXL provides 100 GB/s,
> the effective bandwidth varies significantly from the perspective of
> each CPU due to inter-socket interconnect penalties.
> 
> Local device capabilities (GB/s) vs. cross-socket effective bandwidth:
> 
>          0     1     2     3
> CPU 0  300   150   100    50
> CPU 1  150   300    50   100
> 
> A reasonable global weight vector reflecting the base capabilities is:
> 
>      node0=3 node1=3 node2=1 node3=1
> 
> However, because these configured node weights do not account for
> interconnect degradation between sockets, applying them flatly to all
> sources yields the following effective map from each CPU's perspective:
> 
>          0     1     2     3
> CPU 0    3     3     1     1
> CPU 1    3     3     1     1
> 
> This does not account for the interconnect penalty (e.g., node0->node1
> drops 300->150, node0->node3 drops 100->50) and thus forces allocations
> that cause a mismatch with actual performance.
> 
> This patch makes weighted interleave socket-aware. Before weighting is
> applied, the candidate nodes are restricted to the current socket; only
> if no eligible local nodes remain does the policy fall back to the
> wider set.

So when I saw this, I thought the idea was that we would attempt an
allocation with these socket-aware weights, and upon failure, fall back
to the global weights that are set so that we can try to fulfill the
allocation from cross-socket nodes.

However, reading the implementation in 4/4, it seems like what is meant
by "fallback" here is not in the sense of a fallback allocation, but
in the sense of "if there is a misconfiguration and the intersection
between policy nodes and the CPU's package is empty, use the global
nodes instead". 

Am I understanding this correctly? 

And, it seems like what this also means is that under sane configurations,
there is no more cross socket memory allocation, since it will always
try to fulfill it from the local node. 

> Even if the configured global weights remain identically set:
> 
>      node0=3 node1=3 node2=1 node3=1
> 
> The resulting effective map from the perspective of each CPU becomes:
> 
>          0     1     2     3
> CPU 0    3     0     1     0
> CPU 1    0     3     0     1

> Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
> node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
> effective bandwidth, preserves NUMA locality, and reduces cross-socket
> traffic.

In that sense I thought the word "prefer" was a bit confusing, since I
thought it would mean that it would try to fulfill the allocations
from within a packet first, then fall back to remote packets if that
failed. (Or maybe I am just misunderstanding your explanation. Please
do let me know if that is the case : -) )

If what I understand is the case, I think this is the same thing as
just restricting allocations to be socket-local. I also wonder if
this idea applies to other mempolicies as well (i.e. unweighted interleave)

I think we should consider what the expected and desirable behavior is
when one socket is fully saturated but the other socket is empty. In my
mind this is no different from considering within-packet remote NUMA
allocations; the tradeoff becomes between reclaiming locally and
keeping allocations local, vs. skipping reclaiming and consuming
free memory while eating the remote access latency, similar to
zone_reclaim mode (packet_reclaim_mode? ; -) )

In my mind (without doing any benchmarking myself or looking at the numbers)
I imagine that there are some scenarios where we actually do want cross
socket allocations, like in the example above when we have very asymmetric
saturations across sockets. Is this something that could be worth
benchmarking as well?

I will end by saying that in the normal case (sockets have similar saturation)
I think this series is a definite win and improvement to weighted interleave.
I just was curious whether we can handle the worst-case scenarios.

Thank you again for the series. Have a great day!
Joshua
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Rakie Kim 2 weeks, 6 days ago
On Mon, 16 Mar 2026 08:19:32 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> Hello Rakie! I hope you have been doing well. Thank you for this
> RFC, I think it is a very interesting idea. 

Hello Joshua,

I hope you are doing well. Thanks for your review and feedback on this RFC.

> 
> [...snip...]
> 
> > Consider a dual-socket system:
> > 
> >           node0             node1
> >         +-------+         +-------+
> >         | CPU 0 |---------| CPU 1 |
> >         +-------+         +-------+
> >         | DRAM0 |         | DRAM1 |
> >         +---+---+         +---+---+
> >             |                 |
> >         +---+---+         +---+---+
> >         | CXL 0 |         | CXL 1 |
> >         +-------+         +-------+
> >           node2             node3
> > 
> > Assuming local DRAM provides 300 GB/s and local CXL provides 100 GB/s,
> > the effective bandwidth varies significantly from the perspective of
> > each CPU due to inter-socket interconnect penalties.
> > 
> > Local device capabilities (GB/s) vs. cross-socket effective bandwidth:
> > 
> >          0     1     2     3
> > CPU 0  300   150   100    50
> > CPU 1  150   300    50   100
> > 
> > A reasonable global weight vector reflecting the base capabilities is:
> > 
> >      node0=3 node1=3 node2=1 node3=1
> > 
> > However, because these configured node weights do not account for
> > interconnect degradation between sockets, applying them flatly to all
> > sources yields the following effective map from each CPU's perspective:
> > 
> >          0     1     2     3
> > CPU 0    3     3     1     1
> > CPU 1    3     3     1     1
> > 
> > This does not account for the interconnect penalty (e.g., node0->node1
> > drops 300->150, node0->node3 drops 100->50) and thus forces allocations
> > that cause a mismatch with actual performance.
> > 
> > This patch makes weighted interleave socket-aware. Before weighting is
> > applied, the candidate nodes are restricted to the current socket; only
> > if no eligible local nodes remain does the policy fall back to the
> > wider set.
> 
> So when I saw this, I thought the idea was that we would attempt an
> allocation with these socket-aware weights, and upon failure, fall back
> to the global weights that are set so that we can try to fulfill the
> allocation from cross-socket nodes.
> 
> However, reading the implementation in 4/4, it seems like what is meant
> by "fallback" here is not in the sense of a fallback allocation, but
> in the sense of "if there is a misconfiguration and the intersection
> between policy nodes and the CPU's package is empty, use the global
> nodes instead". 
> 
> Am I understanding this correctly? 
> 
> And, it seems like what this also means is that under sane configurations,
> there is no more cross socket memory allocation, since it will always
> try to fulfill it from the local node. 
> 

Your analysis of the code in patch 4/4 is exactly correct. I apologize
for using the term "fallback" in the cover letter, which caused some
confusion. As you understood, the current implementation strictly
restricts allocations to the local socket to avoid cross-socket traffic.

> > Even if the configured global weights remain identically set:
> > 
> >      node0=3 node1=3 node2=1 node3=1
> > 
> > The resulting effective map from the perspective of each CPU becomes:
> > 
> >          0     1     2     3
> > CPU 0    3     0     1     0
> > CPU 1    0     3     0     1
> 
> > Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
> > node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
> > effective bandwidth, preserves NUMA locality, and reduces cross-socket
> > traffic.
> 
> In that sense I thought the word "prefer" was a bit confusing, since I
> thought it would mean that it would try to fulfill the allocations
> from within a packet first, then fall back to remote packets if that
> failed. (Or maybe I am just misunderstanding your explanation. Please
> do let me know if that is the case : -) )
> 
> If what I understand is the case, I think this is the same thing as
> just restricting allocations to be socket-local. I also wonder if
> this idea applies to other mempolicies as well (i.e. unweighted interleave)

Again, I apologize for the confusion caused by words like "prefer" and
"fallback" in the commit message. Your understanding is correct; the
current code strictly restricts allocations to the socket-local nodes.

To determine where memory may be allocated within a socket, the code uses
a function named policy_resolve_package_nodes(). As described in the
comments, the logic works as follows:

1. Success case: It tries to use the intersection of the current CPU's
   package nodes and the user's preselected policy nodes. If the
   intersection is not empty, it uses these local nodes.
2. Failure case: If the intersection is empty (e.g., the user opted out
   of the current package), it finds the package of another node in the
   policy nodes and gets the intersection again. If this also yields an
   empty set, it completely falls back to the original global policy nodes.

This early version does not yet handle all of the detailed corner
cases. Also, as you pointed out, applying this strict local restriction
directly to other policies like unweighted interleave might be
difficult, as it could conflict with the original purpose of
interleaving. I plan to consider these aspects further and prepare a
more complete design.

> 
> I think we should consider what the expected and desirable behavior is
> when one socket is fully saturated but the other socket is empty. In my
> mind this is no different from considering within-packet remote NUMA
> allocations; the tradeoff becomes between reclaiming locally and
> keeping allocations local, vs. skipping reclaiming and consuming
> free memory while eating the remote access latency, similar to
> zone_reclaim mode (packet_reclaim_mode? ; -) )

This is an issue I have been thinking about since the early design phase,
and it must be resolved to improve this patch series. The trade-off
between forcing local reclaim to keep allocations on the socket versus
accepting the latency penalty of using a remote socket is a point we
need to address. I will continue to think about how to handle this
properly.

> 
> In my mind (without doing any benchmarking myself or looking at the numbers)
> I imagine that there are some scenarios where we actually do want cross
> socket allocations, like in the example above when we have very asymmetric
> saturations across sockets. Is this something that could be worth
> benchmarking as well?

Your suggestion is valid and worth considering. I am currently analyzing
the behavior of this feature under various workloads. I will also
consider the asymmetric saturation scenarios you suggested.

> 
> I will end by saying that in the normal case (sockets have similar saturation)
> I think this series is a definite win and improvement to weighted interleave.
> I just was curious whether we can handle the worst-case scenarios.
> 
> Thank you again for the series. Have a great day!
> Joshua

Thanks again for the review. I will prepare a more considered design
for the next version based on these points.

Rakie Kim
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Gregory Price 3 weeks ago
On Mon, Mar 16, 2026 at 08:19:32AM -0700, Joshua Hahn wrote:
> 
> In that sense I thought the word "prefer" was a bit confusing, since I
> thought it would mean that it would try to fulfill the allocations
> from within a packet first, then fall back to remote packets if that
> failed. (Or maybe I am just misunderstanding your explanation. Please
> do let me know if that is the case : -) )
> 
> If what I understand is the case, I think this is the same thing as
> just restricting allocations to be socket-local. I also wonder if
> this idea applies to other mempolicies as well (i.e. unweighted interleave)
> 

I was thinking about this as well, and in my head I think you have to
consider a 2x2 situation:

cpuset             |   multi-socket-cpu      single-socket-cpu
==================================================================
single-socket-mem  |     mem-package            mem-package
------------------------------------------------------------------
multi-socket-mem   |       global                 global
------------------------------------------------------------------

But I think this reduces to the cpuset nodes dictating the weights used -
which should already be the case with the existing code.

I think you are right that we need to be very explicit about the
fallback semantics here - but that may just be a matter of dictating
whether the allocation falls back or prefers direct reclaim to push
pages out of their requested nodes.

~Gregory
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Rakie Kim 2 weeks, 6 days ago
On Mon, 16 Mar 2026 15:45:24 -0400 Gregory Price <gourry@gourry.net> wrote:
> On Mon, Mar 16, 2026 at 08:19:32AM -0700, Joshua Hahn wrote:
> > 
> > In that sense I thought the word "prefer" was a bit confusing, since I
> > thought it would mean that it would try to fulfill the allocations
> > from within a packet first, then fall back to remote packets if that
> > failed. (Or maybe I am just misunderstanding your explanation. Please
> > do let me know if that is the case : -) )
> > 
> > If what I understand is the case, I think this is the same thing as
> > just restricting allocations to be socket-local. I also wonder if
> > this idea applies to other mempolicies as well (i.e. unweighted interleave)
> > 
> 
> I was thinking about this as well, and in my head i think you have to
> consider a 2x2 situation
> 
> cpuset             |   multi-socket-cpu      single-socket-cpu
> ==================================================================
> single-socket-mem  |     mem-package            mem-package
> ------------------------------------------------------------------
> multi-socket-mem   |       global                 global
> ------------------------------------------------------------------
> 
> But I think this reduces to cpuset nodes dictates the weights used -
> which should already be the case with the existing code.

Hello Gregory,

Thanks for your additional feedback.

I agree with your analysis. The final behavior should follow the nodes
dictated by the cpuset or mempolicy configurations.

> 
> I think you are right that we need to be very explicit about the
> fallback semantics here - but that may just be a matter of dictating
> whether the allocation falls back or prefers direct reclaim to push
> pages out of their requested nodes.
> 
> ~Gregory

As you and Joshua pointed out, making the fallback semantics explicit
is the most critical issue for this patch series. We need a clear policy
to decide whether the allocation should fall back to a remote node or
force direct reclaim to keep the allocation local.

I will explicitly define these fallback semantics and address this
trade-off in the design for the next version.

Thanks again for your time and review.

Rakie Kim
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Gregory Price 3 weeks ago
On Mon, Mar 16, 2026 at 02:12:48PM +0900, Rakie Kim wrote:
> This patch series is an RFC to propose and discuss the overall design
> and concept of a socket-aware weighted interleave mechanism. As there
> are areas requiring further refinement, the primary goal at this stage
> is to gather feedback on the architectural approach rather than focusing
> on fine-grained implementation details.
> 

I gave this a brief browse this morning, and I rather like this
approach, more so than the original proposals for socket-awareness
that encoded the weights in a 2-dimensional array.

I think this would be a great discussion at LSF, and I wonder if
something like memory-package could be used for more purposes than just
weighted interleave.

~Gregory
Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Posted by Rakie Kim 2 weeks, 6 days ago
On Mon, 16 Mar 2026 10:01:48 -0400 Gregory Price <gourry@gourry.net> wrote:
> On Mon, Mar 16, 2026 at 02:12:48PM +0900, Rakie Kim wrote:
> > This patch series is an RFC to propose and discuss the overall design
> > and concept of a socket-aware weighted interleave mechanism. As there
> > are areas requiring further refinement, the primary goal at this stage
> > is to gather feedback on the architectural approach rather than focusing
> > on fine-grained implementation details.
> > 
> 
> I gave this a brief browse this morning, and I rather like this
> approach, more-so than the original proposals for socket-awareness
> that encoded the weights in a 2-dimensional array.

Hello Gregory,

Thanks for your review and feedback. I also think this approach is
much better than the previous 2-dimensional array idea. Since this
is still an early draft, I hope this code will be developed into a
better design through community discussions.

> 
> I think this would be a great discussion at LSF, and I wonder if
> something like memory-package could be used for more purposes than just
> weighted interleave.
> 
> ~Gregory

Honggyu Kim and I are actually preparing to propose this exact topic
for the upcoming LSF/MM/BPF summit. However, I accidentally missed
adding lsf-pc@lists.linux-foundation.org to the CC list. I will
re-post or forward this to the LSF PC list soon.

You are exactly right about the memory-package. When I first designed
it, I wanted to use it for memory tiering and other areas, not just
for weighted interleave. For now, weighted interleave is the only
implemented use case, but I hope to keep improving it so it can be
used in other subsystems as well.

Thanks again for your time and review.

Rakie Kim