This patch series is an RFC to propose and discuss the overall design
and concept of a socket-aware weighted interleave mechanism. As there
are areas requiring further refinement, the primary goal at this stage
is to gather feedback on the architectural approach rather than focusing
on fine-grained implementation details.
Weighted interleave distributes page allocations across multiple nodes
based on configured weights. However, the current implementation applies
a single global weight vector. In multi-socket systems, this creates a
mismatch between configured weights and actual hardware performance, as
it cannot account for inter-socket interconnect costs. To address this,
we propose a socket-aware approach that restricts candidate nodes to
the local socket before applying weights.
Flat weighted interleave applies one global weight vector regardless of
where a task runs. On multi-socket systems, this ignores inter-socket
interconnect costs, meaning the configured weights do not accurately
reflect the actual hardware performance.
Consider a dual-socket system:
node0 node1
+-------+ +-------+
| CPU 0 |---------| CPU 1 |
+-------+ +-------+
| DRAM0 | | DRAM1 |
+---+---+ +---+---+
| |
+---+---+ +---+---+
| CXL 0 | | CXL 1 |
+-------+ +-------+
node2 node3
Assuming local DRAM provides 300 GB/s and local CXL provides 100 GB/s,
the effective bandwidth varies significantly from the perspective of
each CPU due to inter-socket interconnect penalties.
Local device capabilities (GB/s) vs. cross-socket effective bandwidth:
0 1 2 3
CPU 0 300 150 100 50
CPU 1 150 300 50 100
A reasonable global weight vector reflecting the base capabilities is:
node0=3 node1=3 node2=1 node3=1
However, because these configured node weights do not account for
interconnect degradation between sockets, applying them flatly to all
sources yields the following effective map from each CPU's perspective:
0 1 2 3
CPU 0 3 3 1 1
CPU 1 3 3 1 1
This does not account for the interconnect penalty (e.g., node0->node1
drops 300->150, node0->node3 drops 100->50) and thus forces allocations
that cause a mismatch with actual performance.
This patch makes weighted interleave socket-aware. Before weighting is
applied, the candidate nodes are restricted to the current socket; only
if no eligible local nodes remain does the policy fall back to the
wider set.
Even if the configured global weights remain identically set:
node0=3 node1=3 node2=1 node3=1
The resulting effective map from the perspective of each CPU becomes:
0 1 2 3
CPU 0 3 0 1 0
CPU 1 0 3 0 1
Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
effective bandwidth, preserves NUMA locality, and reduces cross-socket
traffic.
To make this possible, the system requires a mechanism to understand
the physical topology. The existing NUMA distance model provides only
relative latency values between nodes and lacks any notion of
structural grouping such as socket boundaries. This is especially
problematic for CXL memory nodes, which appear without an explicit
socket association.
This patch series introduces a socket-aware topology management layer
that groups NUMA nodes according to their physical package. It
explicitly links CPU and memory-only nodes (such as CXL) under the
same socket using an initiator CPU node. This captures the true
hardware hierarchy rather than relying solely on flat distance values.
[Experimental Results]
System Configuration:
- Processor: Dual-Socket Intel Xeon 6980P (Granite Rapids)
node0 node1
+-------+ +-------+
| CPU 0 |-------------------| CPU 1 |
+-------+ +-------+
12 Channels | DRAM0 | | DRAM1 | 12 Channels
DDR5-6400 +---+---+ +---+---+ DDR5-6400
| |
+---+---+ +---+---+
8 Channels | CXL 0 | | CXL 1 | 8 Channels
DDR5-6400 +-------+ +-------+ DDR5-6400
node2 node3
1) Throughput (System Bandwidth)
- DRAM Only: 966 GB/s
- Weighted Interleave: 903 GB/s (7% decrease compared to DRAM Only)
- Socket-Aware Weighted Interleave: 1329 GB/s (1.33TB/s)
(38% increase compared to DRAM Only,
47% increase compared to Weighted Interleave)
2) Loaded Latency (Under High Bandwidth)
- DRAM Only: 544 ns
- Weighted Interleave: 545 ns
- Socket-Aware Weighted Interleave: 436 ns
(20% reduction compared to both)
[Additional Considerations]
Please note that this series includes modifications to the CXL driver
to register these nodes. However, the necessity and the approach of
these driver-side changes require further discussion and consideration.
Additionally, this topology layer was originally designed to support
both memory tiering and weighted interleave. Currently, it is only
utilized by the weighted interleave policy. As a result, several
functions exposed by this layer are not actively used in this RFC.
Unused portions will be cleaned up and removed in the final patch
submission.
Summary of patches:
[PATCH 1/4] mm/numa: introduce nearest_nodes_nodemask()
This patch adds a new NUMA helper function to find all nodes in a
given nodemask that share the minimum distance from a specified
source node.
[PATCH 2/4] mm/memory-tiers: introduce socket-aware topology mgmt
This patch introduces a management layer that groups NUMA nodes by
their physical package (socket). It forms a "memory package" to
abstract real hardware locality for predictable NUMA memory
management.
[PATCH 3/4] mm/memory-tiers: register CXL nodes to socket packages
This patch implements a registration path to bind CXL memory nodes
to a socket-aware memory package using an initiator CPU node. This
ensures CXL nodes are deterministically grouped with the CPUs they
service.
[PATCH 4/4] mm/mempolicy: enhance weighted interleave with locality
This patch modifies the weighted interleave policy to restrict
candidate nodes to the current socket before applying weights. It
reduces cross-socket traffic and aligns memory allocation with
actual bandwidth.
Any feedback and discussions are highly appreciated.
Thanks
Rakie Kim (4):
mm/numa: introduce nearest_nodes_nodemask()
mm/memory-tiers: introduce socket-aware topology management for NUMA
nodes
mm/memory-tiers: register CXL nodes to socket-aware packages via
initiator
mm/mempolicy: enhance weighted interleave with socket-aware locality
drivers/cxl/core/region.c | 46 +++
drivers/cxl/cxl.h | 1 +
drivers/dax/kmem.c | 2 +
include/linux/memory-tiers.h | 93 +++++
include/linux/numa.h | 8 +
mm/memory-tiers.c | 766 +++++++++++++++++++++++++++++++++++
mm/mempolicy.c | 135 +++++-
7 files changed, 1047 insertions(+), 4 deletions(-)
base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
--
2.34.1
On Mon, 16 Mar 2026 14:12:48 +0900
Rakie Kim <rakie.kim@sk.com> wrote:
> This patch series is an RFC to propose and discuss the overall design
> and concept of a socket-aware weighted interleave mechanism. As there
> are areas requiring further refinement, the primary goal at this stage
> is to gather feedback on the architectural approach rather than focusing
> on fine-grained implementation details.
>
> Weighted interleave distributes page allocations across multiple nodes
> based on configured weights. However, the current implementation applies
> a single global weight vector. In multi-socket systems, this creates a
> mismatch between configured weights and actual hardware performance, as
> it cannot account for inter-socket interconnect costs. To address this,
> we propose a socket-aware approach that restricts candidate nodes to
> the local socket before applying weights.
>
> Flat weighted interleave applies one global weight vector regardless of
> where a task runs. On multi-socket systems, this ignores inter-socket
> interconnect costs, meaning the configured weights do not accurately
> reflect the actual hardware performance.
>
> Consider a dual-socket system:
>
> node0 node1
> +-------+ +-------+
> | CPU 0 |---------| CPU 1 |
> +-------+ +-------+
> | DRAM0 | | DRAM1 |
> +---+---+ +---+---+
> | |
> +---+---+ +---+---+
> | CXL 0 | | CXL 1 |
> +-------+ +-------+
> node2 node3
>
> Assuming local DRAM provides 300 GB/s and local CXL provides 100 GB/s,
> the effective bandwidth varies significantly from the perspective of
> each CPU due to inter-socket interconnect penalties.
I'm fully on board with this problem and very pleased to see someone
working on it!
I have some questions about the example.
The condition definitely applies when the local node to
CXL bandwidth > interconnect bandwidth, but that's not true here, so this
is a more complex case and I'm curious about the example.
>
> Local device capabilities (GB/s) vs. cross-socket effective bandwidth:
>
> 0 1 2 3
> CPU 0 300 150 100 50
> CPU 1 150 300 50 100
These numbers don't seem consistent with the 100 / 300 numbers above.
These aren't low load bandwidths because if they were you'd not see any
drop on the CXL numbers as the bottleneck is still the CXL bus. Given the
game here is bandwidth interleaving - fair enough that these should be
loaded bandwidths.
If these are fully loaded bandwidth then the headline DRAM / CXL numbers need
to be the sum of all access paths. So DRAM must be 450GiB/s and CXL 150GiB/s
The cross CPU interconnect is 200GiB/s in each direction I think.
This is ignoring caching etc which can make judging interconnect effects tricky
at best!
Years ago there were some attempts to standardize the information available
on topology under load. To put it lightly it got tricky fast and no one
could agree on how to measure it for an empirical solution.
>
> A reasonable global weight vector reflecting the base capabilities is:
>
> node0=3 node1=3 node2=1 node3=1
>
> However, because these configured node weights do not account for
> interconnect degradation between sockets, applying them flatly to all
> sources yields the following effective map from each CPU's perspective:
>
> 0 1 2 3
> CPU 0 3 3 1 1
> CPU 1 3 3 1 1
>
> This does not account for the interconnect penalty (e.g., node0->node1
> drops 300->150, node0->node3 drops 100->50) and thus forces allocations
> that cause a mismatch with actual performance.
>
> This patch makes weighted interleave socket-aware. Before weighting is
> applied, the candidate nodes are restricted to the current socket; only
> if no eligible local nodes remain does the policy fall back to the
> wider set.
>
> Even if the configured global weights remain identically set:
>
> node0=3 node1=3 node2=1 node3=1
>
> The resulting effective map from the perspective of each CPU becomes:
>
> 0 1 2 3
> CPU 0 3 0 1 0
> CPU 1 0 3 0 1
>
> Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
> node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
> effective bandwidth, preserves NUMA locality, and reduces cross-socket
> traffic.
Workload wise this is kind of assuming each NUMA node is doing something
similar and keeping to itself. Assuming a nice balanced setup that is
fine. However, with certain CPU topologies you are likely to see slightly
messier things.
>
> To make this possible, the system requires a mechanism to understand
> the physical topology. The existing NUMA distance model provides only
> relative latency values between nodes and lacks any notion of
> structural grouping such as socket boundaries. This is especially
> problematic for CXL memory nodes, which appear without an explicit
> socket association.
So in a general sense, the missing info here is effectively the same
stuff we are missing from the HMAT presentation (it's there in the
table and it's there to compute in CXL cases) just because we decided
not to surface anything other than distances to memory from nearest
initiator. I chatted to Joshua and Keith about filling in that stuff
at last LSFMM. To me that's just a bit of engineering work that needs
doing now we have proven use cases for the data. Mostly it's figuring out
the presentation to userspace and kernel data structures as it's a
lot of data in a big system (typically at least 32 NUMA nodes).
>
> This patch series introduces a socket-aware topology management layer
> that groups NUMA nodes according to their physical package. It
> explicitly links CPU and memory-only nodes (such as CXL) under the
> same socket using an initiator CPU node. This captures the true
> hardware hierarchy rather than relying solely on flat distance values.
>
>
> [Experimental Results]
>
> System Configuration:
> - Processor: Dual-Socket Intel Xeon 6980P (Granite Rapids)
>
> node0 node1
> +-------+ +-------+
> | CPU 0 |-------------------| CPU 1 |
> +-------+ +-------+
> 12 Channels | DRAM0 | | DRAM1 | 12 Channels
> DDR5-6400 +---+---+ +---+---+ DDR5-6400
> | |
> +---+---+ +---+---+
> 8 Channels | CXL 0 | | CXL 1 | 8 Channels
> DDR5-6400 +-------+ +-------+ DDR5-6400
> node2 node3
>
> 1) Throughput (System Bandwidth)
> - DRAM Only: 966 GB/s
> - Weighted Interleave: 903 GB/s (7% decrease compared to DRAM Only)
> - Socket-Aware Weighted Interleave: 1329 GB/s (1.33TB/s)
> (38% increase compared to DRAM Only,
> 47% increase compared to Weighted Interleave)
>
> 2) Loaded Latency (Under High Bandwidth)
> - DRAM Only: 544 ns
> - Weighted Interleave: 545 ns
> - Socket-Aware Weighted Interleave: 436 ns
> (20% reduction compared to both)
>
This may prove too simplistic so we need to be a little careful.
It may be enough for now though so I'm not saying we necessarily
need to change things (yet)! Just highlighting things I've seen
turn up before in such discussions.
Simplest one is that we have more CXL memory on some nodes than
others. Only so many lanes and we probably want some of them for
other purposes!
More fun, multi NUMA node per sockets systems.
A typical CPU Die with memory controllers (e.g. taking one of
our old parts, the Kunpeng 920, for which there are dieshots online,
to avoid any chance of leaking anything...).
Socket 0 Socket 1
| node0 | node 1| | node2 | | node 3 |
+-----+ +-------+ +-------+ +-------+ +-------+ +-----+
| IO | | CPU 0 | | CPU 1 |-------| CPU 2 | | CPU 3 | | IO |
| DIE | +-------+ +-------+ +-------+ +-------+ | DIE |
+--+--+ | DRAM0 | | DRAM1 | | DRAM2 | | DRAM3 | +--+--+
| +-------+ +-------+ +-------+ +-------+ |
| |
+---+---+ +---+---+
| CXL 0 | | CXL 1 |
+-------+ +-------+
So only a single CXL device per socket and the socket is multiple
NUMA nodes as the DRAM interfaces are on the CPU Dies (unlike some
others where they are on the IO Die alongside the CXL interfaces).
CXL topology cases:
A simple dual socket setup with a CXL switch and MLD below it
makes for a shared link to the CXL memory (and hence a bandwidth
restriction) that this can't model.
node0 node1
+-------+ +-------+
| CPU 0 |-------------------| CPU 1 |
+-------+ +-------+
12 Channels | DRAM0 | | DRAM1 | 12 Channels
DDR5-6400 +---+---+ +---+---+ DDR5-6400
| |
|___________________________|
|
|
+---+---+
Many Channels | CXL 0 |
DDR5-6400 +-------+
node2/3
Note it's still two nodes for the CXL as we aren't accessing the same DPA for
each host node but their actual memory is interleaved across the same devices
to give peak BW.
The reason you might do this is load balancing across lots of CXL devices
downstream of the switch.
Note this also effectively happens with MHDs; it's just that the load balancing is across
backend memory being provided via multiple heads. Whether people wire MHDs
that way or tend to have multiple top of rack devices with each CPU
socket connecting to a different one is an open question to me.
I have no idea yet on how you'd present the resulting bandwidth interference
effects of such a setup.
IO Expanders on the CPU interconnect:
Just for fun, on similar interconnects we've previously also seen
the following and I'd be surprised if those going for max bandwidth
don't do this for CXL at some point soon.
node0 node1
+-------+ +-------+
| CPU 0 |-------------------| CPU 1 |
+-------+ +-------+
12 Channels | DRAM0 | | DRAM1 | 12 Channels
DDR5-6400 +---+---+ +---+---+ DDR5-6400
| |
|___________________________|
| IO Expander |
| CPU interconnect |
|___________________|
|
+---+---+
Many Channels | CXL 0 |
DDR5-6400 +-------+
node2
That is the CXL memory is effectively the same distance from
CPU0 and CPU1 - they probably have their own local CXL as well
as this approach is done to scale up interconnect lanes in a system
when bandwidth is way more important than compute. Similar to the
MHD case but in this case we are accessing the same DPAs via
both paths.
Anyhow, the exact details of those don't matter beyond the general
point that even in 'balanced' high performance configurations there
may not be a clean 1:1 relationship between NUMA nodes and CXL memory
devices. Maybe some maths that aggregates some groups of nodes
together would be enough. I've not really thought it through yet.
Fun and useful topic. Whilst I won't be at LSFMM it is definitely
something I'd like to see move forward in general.
Thanks,
Jonathan
>
> [Additional Considerations]
>
> Please note that this series includes modifications to the CXL driver
> to register these nodes. However, the necessity and the approach of
> these driver-side changes require further discussion and consideration.
> Additionally, this topology layer was originally designed to support
> both memory tiering and weighted interleave. Currently, it is only
> utilized by the weighted interleave policy. As a result, several
> functions exposed by this layer are not actively used in this RFC.
> Unused portions will be cleaned up and removed in the final patch
> submission.
>
> Summary of patches:
>
> [PATCH 1/4] mm/numa: introduce nearest_nodes_nodemask()
> This patch adds a new NUMA helper function to find all nodes in a
> given nodemask that share the minimum distance from a specified
> source node.
>
> [PATCH 2/4] mm/memory-tiers: introduce socket-aware topology mgmt
> This patch introduces a management layer that groups NUMA nodes by
> their physical package (socket). It forms a "memory package" to
> abstract real hardware locality for predictable NUMA memory
> management.
>
> [PATCH 3/4] mm/memory-tiers: register CXL nodes to socket packages
> This patch implements a registration path to bind CXL memory nodes
> to a socket-aware memory package using an initiator CPU node. This
> ensures CXL nodes are deterministically grouped with the CPUs they
> service.
>
> [PATCH 4/4] mm/mempolicy: enhance weighted interleave with locality
> This patch modifies the weighted interleave policy to restrict
> candidate nodes to the current socket before applying weights. It
> reduces cross-socket traffic and aligns memory allocation with
> actual bandwidth.
>
> Any feedback and discussions are highly appreciated.
>
> Thanks
>
> Rakie Kim (4):
> mm/numa: introduce nearest_nodes_nodemask()
> mm/memory-tiers: introduce socket-aware topology management for NUMA
> nodes
> mm/memory-tiers: register CXL nodes to socket-aware packages via
> initiator
> mm/mempolicy: enhance weighted interleave with socket-aware locality
>
> drivers/cxl/core/region.c | 46 +++
> drivers/cxl/cxl.h | 1 +
> drivers/dax/kmem.c | 2 +
> include/linux/memory-tiers.h | 93 +++++
> include/linux/numa.h | 8 +
> mm/memory-tiers.c | 766 +++++++++++++++++++++++++++++++++++
> mm/mempolicy.c | 135 +++++-
> 7 files changed, 1047 insertions(+), 4 deletions(-)
>
>
> base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
On Wed, 18 Mar 2026 12:02:45 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote: > On Mon, 16 Mar 2026 14:12:48 +0900 > Rakie Kim <rakie.kim@sk.com> wrote: > Hello Jonathan, Thanks for your detailed review and the insights on various topology cases. > > This patch series is an RFC to propose and discuss the overall design > > and concept of a socket-aware weighted interleave mechanism. As there > > are areas requiring further refinement, the primary goal at this stage > > is to gather feedback on the architectural approach rather than focusing > > on fine-grained implementation details. > > > > Weighted interleave distributes page allocations across multiple nodes > > based on configured weights. However, the current implementation applies > > a single global weight vector. In multi-socket systems, this creates a > > mismatch between configured weights and actual hardware performance, as > > it cannot account for inter-socket interconnect costs. To address this, > > we propose a socket-aware approach that restricts candidate nodes to > > the local socket before applying weights. > > > > Flat weighted interleave applies one global weight vector regardless of > > where a task runs. On multi-socket systems, this ignores inter-socket > > interconnect costs, meaning the configured weights do not accurately > > reflect the actual hardware performance. > > > > Consider a dual-socket system: > > > > node0 node1 > > +-------+ +-------+ > > | CPU 0 |---------| CPU 1 | > > +-------+ +-------+ > > | DRAM0 | | DRAM1 | > > +---+---+ +---+---+ > > | | > > +---+---+ +---+---+ > > | CXL 0 | | CXL 1 | > > +-------+ +-------+ > > node2 node3 > > > > Assuming local DRAM provides 300 GB/s and local CXL provides 100 GB/s, > > the effective bandwidth varies significantly from the perspective of > > each CPU due to inter-socket interconnect penalties. > > I'm fully on board with this problem and very pleased to see someone > working on it! 
> > I have some questions about the example. > The condition definitely applies when the local node to > CXL bandwidth > interconnect bandwidth, but that's not true here so this is > a more complex and I'm curious about the example > > > > > Local device capabilities (GB/s) vs. cross-socket effective bandwidth: > > > > 0 1 2 3 > > CPU 0 300 150 100 50 > > CPU 1 150 300 50 100 > > These numbers don't seem consistent with the 100 / 300 numbers above. > These aren't low load bandwidths because if they were you'd not see any > drop on the CXL numbers as the bottleneck is still the CXL bus. Given the > game here is bandwidth interleaving - fair enough that these should be > loaded bandwidths. > > If these are fully loaded bandwidth then the headline DRAM / CXL numbers need > to be the sum of all access paths. So DRAM must be 450GiB/s and CXL 150GiB/s > The cross CPU interconnect is 200GiB/s in each direction I think. > This is ignoring caching etc which can make judging interconnect effects tricky > at best! > > Years ago there were some attempts to standardize the information available > on topology under load. To put it lightly it got tricky fast and no one > could agree on how to measure it for an empirical solution. > You are exactly right about the numbers. The values used in the example were overly simplified just to briefly illustrate the concept of the interconnect penalty. I realize that this oversimplification caused confusion regarding the actual bottleneck and fully loaded bandwidth. In the next update, I will revise the example to use more accurate numbers based on the actual system I am currently using. 
> > > > A reasonable global weight vector reflecting the base capabilities is: > > > > node0=3 node1=3 node2=1 node3=1 > > > > However, because these configured node weights do not account for > > interconnect degradation between sockets, applying them flatly to all > > sources yields the following effective map from each CPU's perspective: > > > > 0 1 2 3 > > CPU 0 3 3 1 1 > > CPU 1 3 3 1 1 > > > > This does not account for the interconnect penalty (e.g., node0->node1 > > drops 300->150, node0->node3 drops 100->50) and thus forces allocations > > that cause a mismatch with actual performance. > > > > This patch makes weighted interleave socket-aware. Before weighting is > > applied, the candidate nodes are restricted to the current socket; only > > if no eligible local nodes remain does the policy fall back to the > > wider set. > > > > Even if the configured global weights remain identically set: > > > > node0=3 node1=3 node2=1 node3=1 > > > > The resulting effective map from the perspective of each CPU becomes: > > > > 0 1 2 3 > > CPU 0 3 0 1 0 > > CPU 1 0 3 0 1 > > > > Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on > > node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual > > effective bandwidth, preserves NUMA locality, and reduces cross-socket > > traffic. > > Workload wise this is kind of assuming each NUMA node is doing something > similar and keeping to itself. Assuming a nice balanced setup that is > fine. However, with certain CPU topologies you are likely to see slightly > messier things. > I agree with your point. Since the current design is still an early draft, I understand that this assumption may not hold true for all workloads. This is an area that requires further consideration. > > > > To make this possible, the system requires a mechanism to understand > > the physical topology. 
The existing NUMA distance model provides only > > relative latency values between nodes and lacks any notion of > > structural grouping such as socket boundaries. This is especially > > problematic for CXL memory nodes, which appear without an explicit > > socket association. > > So in a general sense, the missing info here is effectively the same > stuff we are missing from the HMAT presentation (it's there in the > table and it's there to compute in CXL cases) just because we decided > not to surface anything other than distances to memory from nearest > initiator. I chatted to Joshua and Kieth about filling in that stuff > at last LSFMM. To me that's just a bit of engineering work that needs > doing now we have proven use cases for the data. Mostly it's figuring out > the presentation to userspace and kernel data structures as it's a > lot of data in a big system (typically at least 32 NUMA nodes). > Hearing about the discussion on exposing HMAT data is very welcome news. Because this detailed topology information is not yet fully exposed to the kernel and userspace, I used a temporary package-based restriction. Figuring out how to expose and integrate this data into the kernel data structures is indeed a crucial engineering task we need to solve. Actually, when I first started this work, I considered fetching the topology information from HMAT before adopting the current approach. However, I encountered a firmware issue on my test systems (Granite Rapids and Sierra Forest). Although each socket has its own locally attached CXL device, the HMAT only registers node1 (Socket 1) as the initiator for both CXL memory nodes (node2 and node3). As a result, the sysfs HMAT initiators for both node2 and node3 only expose node1. Even though the distance map shows node2 is physically closer to Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the routing path strictly through Socket 1. 
Because the HMAT alone made it difficult to determine the exact physical socket connections on these systems, I ended up using the current CXL driver-based approach. I wonder if others have experienced similar broken HMAT cases with CXL. If HMAT information becomes more reliable in the future, we could build a much more efficient structure. > > > > This patch series introduces a socket-aware topology management layer > > that groups NUMA nodes according to their physical package. It > > explicitly links CPU and memory-only nodes (such as CXL) under the > > same socket using an initiator CPU node. This captures the true > > hardware hierarchy rather than relying solely on flat distance values. > > > > > > [Experimental Results] > > > > System Configuration: > > - Processor: Dual-Socket Intel Xeon 6980P (Granite Rapids) > > > > node0 node1 > > +-------+ +-------+ > > | CPU 0 |-------------------| CPU 1 | > > +-------+ +-------+ > > 12 Channels | DRAM0 | | DRAM1 | 12 Channels > > DDR5-6400 +---+---+ +---+---+ DDR5-6400 > > | | > > +---+---+ +---+---+ > > 8 Channels | CXL 0 | | CXL 1 | 8 Channels > > DDR5-6400 +-------+ +-------+ DDR5-6400 > > node2 node3 > > > > 1) Throughput (System Bandwidth) > > - DRAM Only: 966 GB/s > > - Weighted Interleave: 903 GB/s (7% decrease compared to DRAM Only) > > - Socket-Aware Weighted Interleave: 1329 GB/s (1.33TB/s) > > (38% increase compared to DRAM Only, > > 47% increase compared to Weighted Interleave) > > > > 2) Loaded Latency (Under High Bandwidth) > > - DRAM Only: 544 ns > > - Weighted Interleave: 545 ns > > - Socket-Aware Weighted Interleave: 436 ns > > (20% reduction compared to both) > > > > This may prove too simplistic so we need to be a little careful. > It may be enough for now though so I'm not saying we necessarily > need to change things (yet)!. Just highlighting things I've seen > turn up before in such discussions. > > Simplest one is that we have more CXL memory on some nodes than > others. 
Only so many lanes and we probably want some of them for > other purposes! > > More fun, multi NUMA node per sockets systems. > > A typical CPU Die with memory controllers (e.g. taking one of > our old parts where there are dieshots online kunpeng 920 to > avoid any chance of leaking anything...). > > Socket 0 Socket 1 > | node0 | node 1| | node2 | | node 3 | > +-----+ +-------+ +-------+ +-------+ +-------+ +-----+ > | IO | | CPU 0 | | CPU 1 |-------| CPU 2 | | CPU 3 | | IO | > | DIE | +-------+ +-------+ +-------+ +-------+ | DIE | > +--+--+ | DRAM0 | | DRAM1 | | DRAM2 | | DRAM2 | +--+--+ > | +-------+ +-------+ +-------+ +-------+ | > | | > +---+---+ +---+---+ > | CXL 0 | | CXL 1 | > +-------+ +-------+ > > So only a single CXL device per socket and the socket is multiple > NUMA nodes as the DRAM interfaces are on the CPU Dies (unlike some > others where they are on the IO Die alongside the CXL interfaces). > > CXL topology cases: > > A simple dual socket setup with a CXL switch and MLD below it > makes for a shared link to the CXL memory (and hence a bandwidth > restriction) that this can't model. > > node0 node1 > +-------+ +-------+ > | CPU 0 |-------------------| CPU 1 | > +-------+ +-------+ > 12 Channels | DRAM0 | | DRAM1 | 12 Channels > DDR5-6400 +---+---+ +---+---+ DDR5-6400 > | | > |___________________________| > | > | > +---+---+ > Many Channels | CXL 0 | > DDR5-6400 +-------+ > node2/3 > > Note it's still two nodes for the CXL as we aren't accessing the same DPA for > each host node but their actual memory is interleaved across the same devices > to give peak BW. > > The reason you might do this is load balancing across lots of CXL devices > downstream of the switch. > > Note this also effectively happens with MHDs just the load balancing is across > backend memory being provided via multiple heads. 
Whether people wire MHDs > that way or tend to have multiple top of rack devices with each CPU > socket connecting to a different one is an open question to me. > > I have no idea yet on how you'd present the resulting bandwidth interference > effects of such as setup. > > IO Expanders on the CPU interconnect: > > Just for fun, on similar interconnects we've previously also seen > the following and I'd be surprised if those going for max bandwidth > don't do this for CXL at some point soon. > > > node0 node1 > +-------+ +-------+ > | CPU 0 |-------------------| CPU 1 | > +-------+ +-------+ > 12 Channels | DRAM0 | | DRAM1 | 12 Channels > DDR5-6400 +---+---+ +---+---+ DDR5-6400 > | | > |___________________________| > | IO Expander | > | CPU interconnect | > |___________________| > | > +---+---+ > Many Channels | CXL 0 | > DDR5-6400 +-------+ > node2 > > That is the CXL memory is effectively the same distance from > CPU0 and CPU1 - they probably have their own local CXL as well > as this approach is done to scale up interconnect lanes in a system > when bandwidth is way more important than compute. Similar to the > MHD case but in this case we are accessing the same DPAs via > both paths. > > Anyhow, the exact details of those don't matter beyond the general > point that even in 'balanced' high performance configurations there > may not be a clean 1:1 relationship between NUMA nodes and CXL memory > devices. Maybe some maths that aggregates some groups of nodes > together would be enough. I've not really thought it through yet. > > Fun and useful topic. Whilst I won't be at LSFMM it is definitely > something I'd like to see move forward in general. > > Thanks, > > Jonathan > The complex topology cases you presented, such as multi-NUMA per socket, shared CXL switches, and IO expanders, are very important points. I clearly understand that the simple package-level grouping does not fully reflect the 1:1 relationship in these future hardware architectures. 
I have also thought about the shared CXL switch scenario you mentioned,
and I know the current design falls short in addressing it properly.
While the current implementation starts with a simple socket-local
restriction, I plan to evolve it into a more flexible node aggregation
model to properly reflect all the diverse topologies you suggested.

Thanks again for your time and review.

Rakie Kim

> >
> > [Additional Considerations]
> >
> > Please note that this series includes modifications to the CXL driver
> > to register these nodes. However, the necessity and the approach of
> > these driver-side changes require further discussion and consideration.
> > Additionally, this topology layer was originally designed to support
> > both memory tiering and weighted interleave. Currently, it is only
> > utilized by the weighted interleave policy. As a result, several
> > functions exposed by this layer are not actively used in this RFC.
> > Unused portions will be cleaned up and removed in the final patch
> > submission.
> >
> > Summary of patches:
> >
> > [PATCH 1/4] mm/numa: introduce nearest_nodes_nodemask()
> > This patch adds a new NUMA helper function to find all nodes in a
> > given nodemask that share the minimum distance from a specified
> > source node.
> >
> > [PATCH 2/4] mm/memory-tiers: introduce socket-aware topology mgmt
> > This patch introduces a management layer that groups NUMA nodes by
> > their physical package (socket). It forms a "memory package" to
> > abstract real hardware locality for predictable NUMA memory
> > management.
> >
> > [PATCH 3/4] mm/memory-tiers: register CXL nodes to socket packages
> > This patch implements a registration path to bind CXL memory nodes
> > to a socket-aware memory package using an initiator CPU node. This
> > ensures CXL nodes are deterministically grouped with the CPUs they
> > service.
> >
> > [PATCH 4/4] mm/mempolicy: enhance weighted interleave with locality
> > This patch modifies the weighted interleave policy to restrict
> > candidate nodes to the current socket before applying weights. It
> > reduces cross-socket traffic and aligns memory allocation with
> > actual bandwidth.
> >
> > Any feedback and discussions are highly appreciated.
> >
> > Thanks
> >
> > Rakie Kim (4):
> >   mm/numa: introduce nearest_nodes_nodemask()
> >   mm/memory-tiers: introduce socket-aware topology management for NUMA
> >     nodes
> >   mm/memory-tiers: register CXL nodes to socket-aware packages via
> >     initiator
> >   mm/mempolicy: enhance weighted interleave with socket-aware locality
> >
> >  drivers/cxl/core/region.c    |  46 +++
> >  drivers/cxl/cxl.h            |   1 +
> >  drivers/dax/kmem.c           |   2 +
> >  include/linux/memory-tiers.h |  93 +++++
> >  include/linux/numa.h         |   8 +
> >  mm/memory-tiers.c            | 766 +++++++++++++++++++++++++++++++++++
> >  mm/mempolicy.c               | 135 +++++-
> >  7 files changed, 1047 insertions(+), 4 deletions(-)
> >
> >
> > base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
>
> > >
> > > To make this possible, the system requires a mechanism to understand
> > > the physical topology. The existing NUMA distance model provides only
> > > relative latency values between nodes and lacks any notion of
> > > structural grouping such as socket boundaries. This is especially
> > > problematic for CXL memory nodes, which appear without an explicit
> > > socket association.
> >
> > So in a general sense, the missing info here is effectively the same
> > stuff we are missing from the HMAT presentation (it's there in the
> > table and it's there to compute in CXL cases) just because we decided
> > not to surface anything other than distances to memory from nearest
> > initiator. I chatted to Joshua and Keith about filling in that stuff
> > at last LSFMM. To me that's just a bit of engineering work that needs
> > doing now we have proven use cases for the data. Mostly it's figuring out
> > the presentation to userspace and kernel data structures as it's a
> > lot of data in a big system (typically at least 32 NUMA nodes).
> >
>
> Hearing about the discussion on exposing HMAT data is very welcome news.
> Because this detailed topology information is not yet fully exposed to
> the kernel and userspace, I used a temporary package-based restriction.
> Figuring out how to expose and integrate this data into the kernel data
> structures is indeed a crucial engineering task we need to solve.
>
> Actually, when I first started this work, I considered fetching the
> topology information from HMAT before adopting the current approach.
> However, I encountered a firmware issue on my test systems
> (Granite Rapids and Sierra Forest).
>
> Although each socket has its own locally attached CXL device, the HMAT
> only registers node1 (Socket 1) as the initiator for both CXL memory
> nodes (node2 and node3). As a result, the sysfs HMAT initiators for
> both node2 and node3 only expose node1.
Do you mean the Memory Proximity Domain Attributes Structure has
the "Proximity Domain for the Attached Initiator" set wrong?
Was this for its presentation of the full path to CXL mem nodes, or
to a PXM with a generic port? Sounds like you have SRAT covering
the CXL mem so ideal would be to have the HMAT data to GP and to
the CXL PXMs that BIOS has set up.

Either way having that set at all for CXL memory is fishy as it's about
where the 'memory controller' is and on CXL mem that should be at the
device end of the link. My understanding is that it was only meant
to be set when you have separate memory only Nodes where the physical
controller is in a particular other node (e.g. what you do
if you have a CPU with DRAM and HBM). Maybe we need to make the
kernel warn + ignore that if it is set to something odd like yours.

>
> Even though the distance map shows node2 is physically closer to
> Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> routing path strictly through Socket 1. Because the HMAT alone made it
> difficult to determine the exact physical socket connections on these
> systems, I ended up using the current CXL driver-based approach.

Are the HMAT latencies and bandwidths all there? Or are some missing
and you have to use SLIT (which generally is garbage for historical
reasons of tuning SLIT to particular OS behaviour).

>
> I wonder if others have experienced similar broken HMAT cases with CXL.
> If HMAT information becomes more reliable in the future, we could
> build a much more efficient structure.

Given it's being lightly used I suspect there will be many bugs :(
I hope we can assume they will get fixed however!

...

>
> The complex topology cases you presented, such as multi-NUMA per socket,
> shared CXL switches, and IO expanders, are very important points.
> I clearly understand that the simple package-level grouping does not fully
> reflect the 1:1 relationship in these future hardware architectures.
>
> I have also thought about the shared CXL switch scenario you mentioned,
> and I know the current design falls short in addressing it properly.
> While the current implementation starts with a simple socket-local
> restriction, I plan to evolve it into a more flexible node aggregation
> model to properly reflect all the diverse topologies you suggested.

If we can ensure it fails cleanly when it finds a topology that it can't
cope with (and I guess falls back to current) then I'm fine with a partial
solution that evolves.

>
> Thanks again for your time and review.

You are welcome.

Thanks

Jonathan

>
> Rakie Kim
>
On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
>
> > > >
> > > > To make this possible, the system requires a mechanism to understand
> > > > the physical topology. The existing NUMA distance model provides only
> > > > relative latency values between nodes and lacks any notion of
> > > > structural grouping such as socket boundaries. This is especially
> > > > problematic for CXL memory nodes, which appear without an explicit
> > > > socket association.
> > >
> > > So in a general sense, the missing info here is effectively the same
> > > stuff we are missing from the HMAT presentation (it's there in the
> > > table and it's there to compute in CXL cases) just because we decided
> > > not to surface anything other than distances to memory from nearest
> > > initiator. I chatted to Joshua and Keith about filling in that stuff
> > > at last LSFMM. To me that's just a bit of engineering work that needs
> > > doing now we have proven use cases for the data. Mostly it's figuring out
> > > the presentation to userspace and kernel data structures as it's a
> > > lot of data in a big system (typically at least 32 NUMA nodes).
> > >
> >
> > Hearing about the discussion on exposing HMAT data is very welcome news.
> > Because this detailed topology information is not yet fully exposed to
> > the kernel and userspace, I used a temporary package-based restriction.
> > Figuring out how to expose and integrate this data into the kernel data
> > structures is indeed a crucial engineering task we need to solve.
> >
> > Actually, when I first started this work, I considered fetching the
> > topology information from HMAT before adopting the current approach.
> > However, I encountered a firmware issue on my test systems
> > (Granite Rapids and Sierra Forest).
> >
> > Although each socket has its own locally attached CXL device, the HMAT
> > only registers node1 (Socket 1) as the initiator for both CXL memory
> > nodes (node2 and node3). As a result, the sysfs HMAT initiators for
> > both node2 and node3 only expose node1.
>
> Do you mean the Memory Proximity Domain Attributes Structure has
> the "Proximity Domain for the Attached Initiator" set wrong?
> Was this for its presentation of the full path to CXL mem nodes, or
> to a PXM with a generic port? Sounds like you have SRAT covering
> the CXL mem so ideal would be to have the HMAT data to GP and to
> the CXL PXMs that BIOS has set up.
>
> Either way having that set at all for CXL memory is fishy as it's about
> where the 'memory controller' is and on CXL mem that should be at the
> device end of the link. My understanding is that it was only meant
> to be set when you have separate memory only Nodes where the physical
> controller is in a particular other node (e.g. what you do
> if you have a CPU with DRAM and HBM). Maybe we need to make the
> kernel warn + ignore that if it is set to something odd like yours.
>
Hello Jonathan,
Your suspicion is exactly right. To clarify the situation, here is
the actual configuration of my system:
  NODE    Type          PXD
  node0   local memory  0x00
  node1   local memory  0x01
  node2   cxl memory    0x0A
  node3   cxl memory    0x0B
Physically, the node2 CXL is attached to node0 (Socket 0), and the
node3 CXL is attached to node1 (Socket 1). However, extracting the
HMAT.dsl reveals the following:
- local memory
[028h] Flags: 0001 (Processor Proximity Domain Valid = 1)
Attached Initiator Proximity Domain: 0x00
Memory Proximity Domain: 0x00
[050h] Flags: 0001 (Processor Proximity Domain Valid = 1)
Attached Initiator Proximity Domain: 0x01
Memory Proximity Domain: 0x01
- cxl memory
[078h] Flags: 0000 (Processor Proximity Domain Valid = 0)
Attached Initiator Proximity Domain: 0x00
Memory Proximity Domain: 0x0A
[0A0h] Flags: 0000 (Processor Proximity Domain Valid = 0)
Attached Initiator Proximity Domain: 0x00
Memory Proximity Domain: 0x0B
As you correctly suspected, the flags for the CXL memory are 0000,
meaning the Processor Proximity Domain is marked as invalid. But when
checking the sysfs initiator configurations, it shows a different story:
  Node    access0 Initiator   access1 Initiator
  node0   node0               node0
  node1   node1               node1
  node2   node1               node1
  node3   node1               node1
Although the Attached Initiator is set to 0 in HMAT with an invalid
flag, sysfs strangely registers node1 as the initiator for both CXL
nodes. Because both HMAT and sysfs are exposing abnormal values, it was
impossible for me to determine the true socket connections for CXL
using this data.
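For anyone wanting to reproduce the sysfs table above on their own box,
here is a short sketch that walks the access class directories the kernel
derives from HMAT. The sysfs layout follows
Documentation/admin-guide/mm/numaperf.rst; the helper names themselves are
just illustrative, not anything from this patch set:

```python
import glob
import os

def initiators_for(node_dir, cls):
    """Return the sorted initiator node names for one access class,
    or None if the kernel did not create that access directory."""
    init_dir = os.path.join(node_dir, cls, "initiators")
    if not os.path.isdir(init_dir):
        return None
    return sorted(e for e in os.listdir(init_dir) if e.startswith("node"))

def dump_initiators(root="/sys/devices/system/node"):
    print(f"{'Node':8s} {'access0':>12s} {'access1':>12s}")
    nodes = glob.glob(os.path.join(root, "node[0-9]*"))
    # Sort numerically so node10 does not precede node2
    for node_dir in sorted(nodes, key=lambda p: int(p.rsplit("node", 1)[1])):
        cols = []
        for cls in ("access0", "access1"):
            inits = initiators_for(node_dir, cls)
            cols.append(",".join(inits) if inits else "-")
        print(f"{os.path.basename(node_dir):8s} {cols[0]:>12s} {cols[1]:>12s}")

if __name__ == "__main__":
    dump_initiators()
```

On the system described above this would print node1 in both access
columns for node2 and node3, reproducing the mismatch.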
> >
> > Even though the distance map shows node2 is physically closer to
> > Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> > routing path strictly through Socket 1. Because the HMAT alone made it
> > difficult to determine the exact physical socket connections on these
> > systems, I ended up using the current CXL driver-based approach.
>
> Are the HMAT latencies and bandwidths all there? Or are some missing
> and you have to use SLIT (which generally is garbage for historical
> reasons of tuning SLIT to particular OS behaviour).
>
The HMAT latencies and bandwidths are present, but the values seem
broken. Here is the latency table:
Init->Target | node0 | node1 | node2 | node3
node0 | 0x38B | 0x89F | 0x9C4 | 0x3AFC
node1 | 0x89F | 0x38B | 0x3AFC| 0x4268
I used the identical type of DRAM and CXL memory for both sockets.
However, looking at the table, the local CXL access latency from
node0->node2 (0x9C4) and node1->node3 (0x4268) shows a massive,
unjustified difference. This asymmetry proves that the table is
currently unreliable.
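To make the asymmetry concrete, converting the raw entries shows the two
"local" CXL latencies differ by almost 7x despite the devices being
identical. (The absolute unit depends on the HMAT Entry Base Unit, which
is not shown above, so only the ratio is meaningful here.)

```python
# Raw HMAT latency entries from the table above; the absolute unit
# depends on the table's Entry Base Unit, so only ratios matter.
lat = {
    ("node0", "node0"): 0x38B,  ("node0", "node1"): 0x89F,
    ("node0", "node2"): 0x9C4,  ("node0", "node3"): 0x3AFC,
    ("node1", "node0"): 0x89F,  ("node1", "node1"): 0x38B,
    ("node1", "node2"): 0x3AFC, ("node1", "node3"): 0x4268,
}

local_cxl_socket0 = lat[("node0", "node2")]   # 0x9C4  = 2500
local_cxl_socket1 = lat[("node1", "node3")]   # 0x4268 = 17000
ratio = local_cxl_socket1 / local_cxl_socket0
print(f"socket0 local CXL entry: {local_cxl_socket0}")
print(f"socket1 local CXL entry: {local_cxl_socket1}")
print(f"ratio: {ratio:.1f}x")   # 6.8x for nominally identical devices
```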
> >
> > I wonder if others have experienced similar broken HMAT cases with CXL.
> > If HMAT information becomes more reliable in the future, we could
> > build a much more efficient structure.
>
> Given it's being lightly used I suspect there will be many bugs :(
> I hope we can assume they will get fixed however!
>
> ...
>
The most critical issue caused by this broken initiator setting is that
topology analysis tools like `hwloc` are completely misled. Currently,
`hwloc` displays both CXL nodes as being attached to Socket 1.
I observed this exact same issue on both Sierra Forest and Granite
Rapids systems. I believe this broken topology exposure is a severe
problem that must be addressed, though I am not entirely sure what the
best fix would be yet. I would love to hear your thoughts on this.
> >
> > The complex topology cases you presented, such as multi-NUMA per socket,
> > shared CXL switches, and IO expanders, are very important points.
> > I clearly understand that the simple package-level grouping does not fully
> > reflect the 1:1 relationship in these future hardware architectures.
> >
> > I have also thought about the shared CXL switch scenario you mentioned,
> > and I know the current design falls short in addressing it properly.
> > While the current implementation starts with a simple socket-local
> > restriction, I plan to evolve it into a more flexible node aggregation
> > model to properly reflect all the diverse topologies you suggested.
>
> If we can ensure it fails cleanly when it finds a topology that it can't
> cope with (and I guess falls back to current) then I'm fine with a partial
> solution that evolves.
>
I completely agree with ensuring a clean failure. To stabilize this
partial solution, I am currently considering a few options for the
next version:
1. Enable this feature only when a strict 1:1 topology is detected.
2. Provide a sysfs knob allowing users to enable/disable it.
3. Allow users to manually override/configure the topology via sysfs.
4. Implement dynamic fallback behaviors depending on the detected
topology shape (needs further thought).
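Option 1 could be prototyped from userspace before settling on a kernel
policy. A hedged sketch of one possible pairing rule (the function name,
input format, and the rule itself are all illustrative, not part of the
patch set): treat the topology as strict 1:1 only if every CPU-less
memory node has exactly one nearest CPU node and no CPU node is claimed
twice.

```python
def is_strict_1to1(nearest_cpu_by_memnode):
    """Return True if every CPU-less memory node maps to exactly one
    CPU node and no CPU node serves two memory nodes (illustrative
    detection rule for option 1 above)."""
    claimed = set()
    for memnode, cpus in nearest_cpu_by_memnode.items():
        if len(cpus) != 1:
            return False          # ambiguous nearest initiator
        cpu = next(iter(cpus))
        if cpu in claimed:
            return False          # shared CXL (e.g. switch/MHD) shape
        claimed.add(cpu)
    return True

# The dual-socket system from this thread: node2 behind socket 0,
# node3 behind socket 1 -> strict 1:1, the feature could be enabled.
print(is_strict_1to1({"node2": {"node0"}, "node3": {"node1"}}))
# Shared-switch shape: both CXL nodes nearest to node1 -> fall back.
print(is_strict_1to1({"node2": {"node1"}, "node3": {"node1"}}))
```

Note this rule would also reject the legitimate multi-NUMA-per-socket
case Jonathan described, so a kernel version would likely need to group
by package rather than by single CPU node.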
By the way, when I first posted this RFC, I accidentally missed adding
lsf-pc@lists.linux-foundation.org to the CC list. I am considering
re-posting it to ensure it reaches the lsf-pc.
Thanks again for your profound insights and time. It is tremendously
helpful.
Rakie Kim
>
> >
> > Thanks again for your time and review.
>
> You are welcome.
>
> Thanks
>
> Jonathan
>
> >
> > Rakie Kim
> >
On Tue, Mar 24, 2026 at 02:35:45PM +0900, Rakie Kim wrote:
> On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
>
> Init->Target | node0 | node1 | node2 | node3
> node0 | 0x38B | 0x89F | 0x9C4 | 0x3AFC
> node1 | 0x89F | 0x38B | 0x3AFC| 0x4268
>
> I used the identical type of DRAM and CXL memory for both sockets.
> However, looking at the table, the local CXL access latency from
> node0->node2 (0x9C4) and node1->node3 (0x4268) shows a massive,
> unjustified difference. This asymmetry proves that the table is
> currently unreliable.
>
Can you dump your CDAT for each device so you can at least check whether
the device reports the same latency?
Would at least tell the interested parties whether this is firmware or
BIOS issue.
sudo cat /sys/bus/cxl/devices/endpointN/CDAT | python3 cdat_dump.py
~Gregory
---
#!/usr/bin/env python3
# SPDX-License-Identifier: GPL-2.0-only
# Copyright(c) 2026 Meta Platforms, Inc. and affiliates.
#
# cdat_dump.py - Dump and decode CDAT (Coherent Device Attribute Table)
# from CXL devices via sysfs
#
# Usage:
#   cdat_dump.py                            # dump all CXL devices with CDAT
#   cdat_dump.py /sys/bus/cxl/devices/endpoint0/CDAT
#   cdat_dump.py --raw cdat_binary.bin      # decode from raw file
#   cdat_dump.py --hex                      # include hex dump of each entry

import argparse
import glob
import os
import struct
import sys

# CDAT Header: u32 length, u8 revision, u8 checksum, u8 reserved[6], u32 sequence
CDAT_HDR_FMT = "<IBBBBBBBBI"
CDAT_HDR_SIZE = 16

# Common subtable header: u8 type, u8 reserved, u16 length
CDAT_SUBTBL_HDR_FMT = "<BBH"
CDAT_SUBTBL_HDR_SIZE = 4

# DSMAS (type 0): handle, flags, reserved(u16), dpa_base(u64), dpa_length(u64)
DSMAS_FMT = "<BBHQQ"
DSMAS_SIZE = 20

# DSLBIS (type 1): handle, flags, data_type, reserved, entry_base_unit(u64),
# entry[3](u16 x3), reserved2(u16)
DSLBIS_FMT = "<BBBBQHHHH"
DSLBIS_SIZE = 20

# DSMSCIS (type 2): dsmas_handle, reserved[3], side_cache_size(u64),
# cache_attributes(u32)
DSMSCIS_FMT = "<BBBBQI"
DSMSCIS_SIZE = 16

# DSIS (type 3): flags, handle, reserved(u16)
DSIS_FMT = "<BBH"
DSIS_SIZE = 4

# DSEMTS (type 4): dsmas_handle, memory_type, reserved(u16),
# dpa_offset(u64), range_length(u64)
DSEMTS_FMT = "<BBHQQ"
DSEMTS_SIZE = 20

# SSLBIS (type 5) fixed part: data_type, reserved[3], entry_base_unit(u64)
SSLBIS_FMT = "<BBBBQ"
SSLBIS_SIZE = 12

# SSLBE entry: portx_id(u16), porty_id(u16), latency_or_bandwidth(u16),
# reserved(u16)
SSLBE_FMT = "<HHHH"
SSLBE_SIZE = 8

CDAT_TYPE_NAMES = {
    0: "DSMAS (Device Scoped Memory Affinity Structure)",
    1: "DSLBIS (Device Scoped Latency and Bandwidth Information Structure)",
    2: "DSMSCIS (Device Scoped Memory Side Cache Information Structure)",
    3: "DSIS (Device Scoped Initiator Structure)",
    4: "DSEMTS (Device Scoped EFI Memory Type Structure)",
    5: "SSLBIS (Switch Scoped Latency and Bandwidth Information Structure)",
}

HMAT_DATA_TYPES = {
    0: "Access Latency",
    1: "Read Latency",
    2: "Write Latency",
    3: "Access Bandwidth",
    4: "Read Bandwidth",
    5: "Write Bandwidth",
}

EFI_MEM_TYPES = {
    0: "EfiConventionalMemory",
    1: "EfiConventionalMemory (EFI_MEMORY_SP)",
    2: "EfiReservedMemoryType",
}

CACHE_ASSOCIATIVITY = {
    0: "None",
    1: "Direct Mapped",
    2: "Complex Cache Indexing",
}

CACHE_WRITE_POLICY = {
    0: "None",
    1: "Write Back",
    2: "Write Through",
}


def hexdump(data, indent="    "):
    lines = []
    for i in range(0, len(data), 16):
        chunk = data[i:i+16]
        hexstr = " ".join(f"{b:02x}" for b in chunk)
        ascstr = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{indent}{i:04x}: {hexstr:<48s} {ascstr}")
    return "\n".join(lines)


def fmt_size(size):
    if size >= (1 << 40):
        return f"{size / (1 << 40):.2f} TiB"
    if size >= (1 << 30):
        return f"{size / (1 << 30):.2f} GiB"
    if size >= (1 << 20):
        return f"{size / (1 << 20):.2f} MiB"
    if size >= (1 << 10):
        return f"{size / (1 << 10):.2f} KiB"
    return f"{size} B"


def fmt_port(port_id):
    if port_id == 0xFFFF:
        return "ANY"
    if port_id == 0x0100:
        return "USP (upstream)"
    return f"DSP {port_id}"


def decode_latency_bandwidth(entry_val, base_unit, data_type):
    """Decode a DSLBIS/SSLBIS entry value into human-readable form."""
    if entry_val == 0xFFFF or entry_val == 0:
        return "N/A"
    raw = entry_val * base_unit
    if data_type <= 2:  # latency types (picoseconds -> nanoseconds)
        ns = raw / 1000.0
        if ns >= 1000:
            return f"{ns/1000:.2f} us ({raw} ps)"
        return f"{ns:.2f} ns ({raw} ps)"
    else:  # bandwidth types (MB/s)
        if raw >= 1024:
            return f"{raw/1024:.2f} GB/s ({raw} MB/s)"
        return f"{raw} MB/s"


def decode_dsmas(data, show_hex):
    handle, flags, _, dpa_base, dpa_length = struct.unpack_from(DSMAS_FMT, data)
    flag_strs = []
    if flags & (1 << 2):
        flag_strs.append("NonVolatile")
    if flags & (1 << 3):
        flag_strs.append("Shareable")
    if flags & (1 << 6):
        flag_strs.append("ReadOnly")
    flag_desc = ", ".join(flag_strs) if flag_strs else "None"
    print(f"        DSMAD Handle: {handle}")
    print(f"        Flags:        0x{flags:02x} ({flag_desc})")
    print(f"        DPA Base:     0x{dpa_base:016x}")
    print(f"        DPA Length:   0x{dpa_length:016x} ({fmt_size(dpa_length)})")


def decode_dslbis(data, show_hex):
    handle, flags, data_type, _, base_unit, e0, e1, e2, _ = \
        struct.unpack_from(DSLBIS_FMT, data)
    dt_name = HMAT_DATA_TYPES.get(data_type, f"Unknown ({data_type})")
    print(f"        Handle:    {handle}")
    print(f"        Flags:     0x{flags:02x}")
    print(f"        Data Type: {data_type} ({dt_name})")
    print(f"        Base Unit: {base_unit}")
    print(f"        Entry[0]:  {e0} -> {decode_latency_bandwidth(e0, base_unit, data_type)}")
    if e1:
        print(f"        Entry[1]:  {e1} -> {decode_latency_bandwidth(e1, base_unit, data_type)}")
    if e2:
        print(f"        Entry[2]:  {e2} -> {decode_latency_bandwidth(e2, base_unit, data_type)}")


def decode_dsmscis(data, show_hex):
    dsmas_handle, _, _, _, cache_size, cache_attr = \
        struct.unpack_from(DSMSCIS_FMT, data)
    total_levels = cache_attr & 0xF
    cache_level = (cache_attr >> 4) & 0xF
    assoc = (cache_attr >> 8) & 0xF
    write_pol = (cache_attr >> 12) & 0xF
    line_size = (cache_attr >> 16) & 0xFFFF
    assoc_str = CACHE_ASSOCIATIVITY.get(assoc, f"Unknown ({assoc})")
    wp_str = CACHE_WRITE_POLICY.get(write_pol, f"Unknown ({write_pol})")
    print(f"        DSMAS Handle:  {dsmas_handle}")
    print(f"        Cache Size:    0x{cache_size:016x} ({fmt_size(cache_size)})")
    print(f"        Cache Attrs:   0x{cache_attr:08x}")
    print(f"          Total Levels:  {total_levels}")
    print(f"          Cache Level:   {cache_level}")
    print(f"          Associativity: {assoc_str}")
    print(f"          Write Policy:  {wp_str}")
    print(f"          Line Size:     {line_size} bytes")


def decode_dsis(data, show_hex):
    flags, handle, _ = struct.unpack_from(DSIS_FMT, data)
    mem_attached = bool(flags & 1)
    if mem_attached:
        handle_desc = f"DSMAS handle {handle}"
    else:
        handle_desc = f"Initiator handle {handle} (no memory)"
    print(f"        Flags:  0x{flags:02x} (Memory Attached: {mem_attached})")
    print(f"        Handle: {handle} ({handle_desc})")


def decode_dsemts(data, show_hex):
    dsmas_handle, mem_type, _, dpa_offset, range_length = \
        struct.unpack_from(DSEMTS_FMT, data)
    mt_str = EFI_MEM_TYPES.get(mem_type, f"Reserved ({mem_type})")
    print(f"        DSMAS Handle: {dsmas_handle}")
    print(f"        Memory Type:  {mem_type} ({mt_str})")
    print(f"        DPA Offset:   0x{dpa_offset:016x}")
    print(f"        Range Length: 0x{range_length:016x} ({fmt_size(range_length)})")


def decode_sslbis(data, total_len, show_hex):
    dt, _, _, _, base_unit = struct.unpack_from(SSLBIS_FMT, data)
    dt_name = HMAT_DATA_TYPES.get(dt, f"Unknown ({dt})")
    print(f"        Data Type: {dt} ({dt_name})")
    print(f"        Base Unit: {base_unit}")
    # Variable number of SSLBE entries after the fixed header
    entries_data = data[SSLBIS_SIZE:]
    n_entries = len(entries_data) // SSLBE_SIZE
    for i in range(n_entries):
        off = i * SSLBE_SIZE
        px, py, val, _ = struct.unpack_from(SSLBE_FMT, entries_data, off)
        decoded = decode_latency_bandwidth(val, base_unit, dt)
        print(f"        Entry[{i}]: {fmt_port(px)} <-> {fmt_port(py)}: "
              f"{val} -> {decoded}")


DECODERS = {
    0: decode_dsmas,
    1: decode_dslbis,
    2: decode_dsmscis,
    3: decode_dsis,
    4: decode_dsemts,
}


def decode_cdat(data, source="", show_hex=False):
    if len(data) < CDAT_HDR_SIZE:
        print(f"Error: data too short for CDAT header ({len(data)} < {CDAT_HDR_SIZE})")
        return False
    # Parse header
    vals = struct.unpack_from(CDAT_HDR_FMT, data)
    length = vals[0]
    revision = vals[1]
    checksum = vals[2]
    # vals[3:9] are the 6 reserved bytes
    sequence = vals[9]
    # Verify checksum
    cksum = sum(data[:length]) & 0xFF
    cksum_ok = "OK" if cksum == 0 else f"FAIL (sum=0x{cksum:02x})"
    if source:
        print(f"=== CDAT from {source} ===")
    print("CDAT Header:")
    print(f"    Length:   {length} bytes")
    print(f"    Revision: {revision}")
    print(f"    Checksum: 0x{checksum:02x} ({cksum_ok})")
    print(f"    Sequence: {sequence}")
    if show_hex:
        print("    Raw header:")
        print(hexdump(data[:CDAT_HDR_SIZE], "        "))
    if length > len(data):
        print(f"Warning: CDAT length ({length}) > available data ({len(data)})")
        length = len(data)
    # Parse subtables
    offset = CDAT_HDR_SIZE
    entry_num = 0
    counts = {}
    while offset + CDAT_SUBTBL_HDR_SIZE <= length:
        stype, _, slen = struct.unpack_from(CDAT_SUBTBL_HDR_FMT, data, offset)
        if slen < CDAT_SUBTBL_HDR_SIZE:
            print(f"\nError: subtable at offset {offset} has invalid length {slen}")
            break
        if offset + slen > length:
            print(f"\nError: subtable at offset {offset} extends past end "
                  f"(offset+len={offset+slen} > {length})")
            break
        counts[stype] = counts.get(stype, 0) + 1
        type_name = CDAT_TYPE_NAMES.get(stype, f"Unknown (type={stype})")
        print(f"\n    [{entry_num}] {type_name}")
        print(f"        Offset: {offset}, Length: {slen}")
        if show_hex:
            print(hexdump(data[offset:offset+slen], "        "))
        # Decode the subtable body (skip the 4-byte common header)
        body = data[offset + CDAT_SUBTBL_HDR_SIZE:offset + slen]
        if stype == 5:
            # SSLBIS has variable length, pass total subtable body length
            decode_sslbis(body, slen - CDAT_SUBTBL_HDR_SIZE, show_hex)
        elif stype in DECODERS:
            DECODERS[stype](body, show_hex)
        else:
            print("        (unknown type, raw data follows)")
            print(hexdump(body, "        "))
        offset += slen
        entry_num += 1
    # Summary
    print(f"\nSummary: {entry_num} entries")
    for t in sorted(counts):
        name = CDAT_TYPE_NAMES.get(t, f"Unknown ({t})")
        print(f"    {name}: {counts[t]}")
    if offset < length:
        trailing = length - offset
        print(f"\nWarning: {trailing} trailing bytes after last subtable")
    print()
    return True


def find_cdat_sysfs():
    """Find all CXL devices with CDAT attributes in sysfs."""
    paths = []
    for dev_path in sorted(glob.glob("/sys/bus/cxl/devices/*")):
        cdat_path = os.path.join(dev_path, "CDAT")
        if os.path.exists(cdat_path):
            paths.append(cdat_path)
    return paths


def read_cdat(path):
    """Read binary CDAT data from a sysfs attribute or file."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except PermissionError:
        print(f"Error: permission denied reading {path} (need root?)")
        return None
    except OSError as e:
        print(f"Error reading {path}: {e}")
        return None


def main():
    parser = argparse.ArgumentParser(
        description="Dump and decode CXL CDAT (Coherent Device Attribute Table)",
        epilog="Without arguments, discovers and dumps CDAT from all CXL devices.\n"
               "Requires root access to read sysfs CDAT attributes.",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "path", nargs="*",
        help="Path to sysfs CDAT attribute or raw CDAT binary file",
    )
    parser.add_argument(
        "--raw", action="store_true",
        help="Treat input as raw CDAT binary file (not sysfs)",
    )
    parser.add_argument(
        "--hex", action="store_true",
        help="Include hex dump of each entry",
    )
    args = parser.parse_args()
    paths = args.path
    # Read from stdin if piped or explicitly given "-"
    if not sys.stdin.isatty() and not paths:
        data = sys.stdin.buffer.read()
        if not data:
            print("Error: no data on stdin")
            return 1
        return 0 if decode_cdat(data, "stdin", show_hex=args.hex) else 1
    if paths == ["-"]:
        data = sys.stdin.buffer.read()
        if not data:
            print("Error: no data on stdin")
            return 1
        return 0 if decode_cdat(data, "stdin", show_hex=args.hex) else 1
    if not paths:
        paths = find_cdat_sysfs()
        if not paths:
            print("No CXL devices with CDAT found in sysfs.")
            print("Check that CXL devices are present and the cxl_port driver is loaded.")
            return 1
    ok = True
    for path in paths:
        data = read_cdat(path)
        if data is None:
            ok = False
            continue
        if not data:
            dev = os.path.basename(os.path.dirname(path)) if not args.raw else path
            print(f"{dev}: CDAT is empty (read from device failed at probe time)")
            ok = False
            continue
        source = path
        if not args.raw and "/sys/" in path:
            source = os.path.basename(os.path.dirname(path))
        if not decode_cdat(data, source, show_hex=args.hex):
            ok = False
    return 0 if ok else 1


if __name__ == "__main__":
    sys.exit(main())
On Thu, 26 Mar 2026 21:54:30 -0400 Gregory Price <gourry@gourry.net> wrote:
> On Tue, Mar 24, 2026 at 02:35:45PM +0900, Rakie Kim wrote:
> > On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> >
> > Init->Target | node0 | node1 | node2 | node3
> > node0 | 0x38B | 0x89F | 0x9C4 | 0x3AFC
> > node1 | 0x89F | 0x38B | 0x3AFC| 0x4268
> >
> > I used the identical type of DRAM and CXL memory for both sockets.
> > However, looking at the table, the local CXL access latency from
> > node0->node2 (0x9C4) and node1->node3 (0x4268) shows a massive,
> > unjustified difference. This asymmetry proves that the table is
> > currently unreliable.
> >
>
> Can you dump your CDAT for each device so you can at least check whether
> the device reports the same latency?
>
> Would at least tell the interested parties whether this is firmware or
> BIOS issue.
>
> sudo cat /sys/bus/cxl/devices/endpointN/CDAT | python3 cdat_dump.py
>
> ~Gregory
>
< --snip-- >
Hello Gregory,
Thank you for providing the python script. It was incredibly helpful.
I ran it across all the CDAT files found in my sysfs.
For context, my system currently has 16 CXL endpoints attached
(from `endpoint9` to `endpoint24`):
/sys/devices/platform/ACPI0017:00/root0/port.../endpoint.../CDAT
As Dave Jiang recently pointed out in this thread, the Intel BIOS
team confirmed that the HMAT values actually represent "end-to-end"
latency. By comparing these CDAT dump results (which show the
device-level latency) with the HMAT numbers, we should have a much
clearer picture of whether the massive asymmetry originates from the
device firmware itself or from the BIOS calculations.
I have attached the extracted CDAT dump results for the devices below.
Thanks again for your help in isolating this issue!
Rakie Kim
=== CDAT from endpoint16 ===
CDAT Header:
Length: 208 bytes
Revision: 1
Checksum: 0xc3 (OK)
Sequence: 0
[0] DSMAS (Device Scoped Memory Affinity Structure)
Offset: 16, Length: 24
DSMAD Handle: 0
Flags: 0x00 (None)
DPA Base: 0x0000000000000000
DPA Length: 0x0000002000000000 (128.00 GiB)
[1] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
Offset: 40, Length: 24
Handle: 0
Flags: 0x00
Data Type: 0 (Access Latency)
Base Unit: 1000
Entry[0]: 110 -> 110.00 ns (110000 ps)
[2] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
Offset: 64, Length: 24
Handle: 0
Flags: 0x00
Data Type: 1 (Read Latency)
Base Unit: 1000
Entry[0]: 110 -> 110.00 ns (110000 ps)
[3] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
Offset: 88, Length: 24
Handle: 0
Flags: 0x00
Data Type: 2 (Write Latency)
Base Unit: 1000
Entry[0]: 110 -> 110.00 ns (110000 ps)
[4] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
Offset: 112, Length: 24
Handle: 0
Flags: 0x00
Data Type: 3 (Access Bandwidth)
Base Unit: 1000
Entry[0]: 45 -> 43.95 GB/s (45000 MB/s)
[5] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
Offset: 136, Length: 24
Handle: 0
Flags: 0x00
Data Type: 4 (Read Bandwidth)
Base Unit: 1000
Entry[0]: 45 -> 43.95 GB/s (45000 MB/s)
[6] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
Offset: 160, Length: 24
Handle: 0
Flags: 0x00
Data Type: 5 (Write Bandwidth)
Base Unit: 1000
Entry[0]: 45 -> 43.95 GB/s (45000 MB/s)
[7] DSEMTS (Device Scoped EFI Memory Type Structure)
Offset: 184, Length: 24
DSMAS Handle: 0
Memory Type: 0 (EfiConventionalMemory)
DPA Offset: 0x0000000000000000
Range Length: 0x0000002000000000 (128.00 GiB)
Summary: 8 entries
DSMAS (Device Scoped Memory Affinity Structure): 1
DSLBIS (Device Scoped Latency and Bandwidth Information Structure): 6
DSEMTS (Device Scoped EFI Memory Type Structure): 1
=== CDAT from endpoint20 ===
CDAT Header:
Length: 208 bytes
Revision: 1
Checksum: 0xc3 (OK)
Sequence: 0
[0] DSMAS (Device Scoped Memory Affinity Structure)
Offset: 16, Length: 24
DSMAD Handle: 0
Flags: 0x00 (None)
DPA Base: 0x0000000000000000
DPA Length: 0x0000002000000000 (128.00 GiB)
[1] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
Offset: 40, Length: 24
Handle: 0
Flags: 0x00
Data Type: 0 (Access Latency)
Base Unit: 1000
Entry[0]: 110 -> 110.00 ns (110000 ps)
[2] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
Offset: 64, Length: 24
Handle: 0
Flags: 0x00
Data Type: 1 (Read Latency)
Base Unit: 1000
Entry[0]: 110 -> 110.00 ns (110000 ps)
[3] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
Offset: 88, Length: 24
Handle: 0
Flags: 0x00
Data Type: 2 (Write Latency)
Base Unit: 1000
Entry[0]: 110 -> 110.00 ns (110000 ps)
[4] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
Offset: 112, Length: 24
Handle: 0
Flags: 0x00
Data Type: 3 (Access Bandwidth)
Base Unit: 1000
Entry[0]: 45 -> 43.95 GB/s (45000 MB/s)
[5] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
Offset: 136, Length: 24
Handle: 0
Flags: 0x00
Data Type: 4 (Read Bandwidth)
Base Unit: 1000
Entry[0]: 45 -> 43.95 GB/s (45000 MB/s)
[6] DSLBIS (Device Scoped Latency and Bandwidth Information Structure)
Offset: 160, Length: 24
Handle: 0
Flags: 0x00
Data Type: 5 (Write Bandwidth)
Base Unit: 1000
Entry[0]: 45 -> 43.95 GB/s (45000 MB/s)
[7] DSEMTS (Device Scoped EFI Memory Type Structure)
Offset: 184, Length: 24
DSMAS Handle: 0
Memory Type: 0 (EfiConventionalMemory)
DPA Offset: 0x0000000000000000
Range Length: 0x0000002000000000 (128.00 GiB)
Summary: 8 entries
DSMAS (Device Scoped Memory Affinity Structure): 1
DSLBIS (Device Scoped Latency and Bandwidth Information Structure): 6
DSEMTS (Device Scoped EFI Memory Type Structure): 1
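The DSLBIS entries above all decode the same way: the raw entry value times the Base Unit gives picoseconds for the latency data types (0-2) and MB/s for the bandwidth data types (3-5), which is how 110/1000 becomes 110.00 ns and 45/1000 becomes 45000 MB/s (43.95 GB/s when divided by 1024). A minimal sketch of that conversion; `decode_dslbis` is a hypothetical helper, not part of any tool in this thread:

```python
# Hypothetical helper mirroring how the dumps above render DSLBIS entries.
# Latency entries are (entry * base_unit) picoseconds; bandwidth entries
# are (entry * base_unit) MB/s.

def decode_dslbis(data_type: int, base_unit: int, entry: int) -> str:
    """Render one DSLBIS entry the way the dumps above do."""
    raw = entry * base_unit
    if data_type in (0, 1, 2):      # Access/Read/Write Latency
        return f"{raw / 1000:.2f} ns ({raw} ps)"
    if data_type in (3, 4, 5):      # Access/Read/Write Bandwidth
        return f"{raw / 1024:.2f} GB/s ({raw} MB/s)"
    return f"raw={raw}"

print(decode_dslbis(0, 1000, 110))  # 110.00 ns (110000 ps)
print(decode_dslbis(3, 1000, 45))   # 43.95 GB/s (45000 MB/s)
```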
=== CDAT from endpoint18 / endpoint17 / endpoint14 / endpoint10 / endpoint23 / endpoint22 / endpoint15 / endpoint11 / endpoint12 / endpoint19 / endpoint13 / endpoint9 / endpoint21 / endpoint24 ===
< --snip-- identical to endpoint20 above: 208-byte CDAT, checksum 0xc3 (OK),
1 DSMAS, 6 DSLBIS, 1 DSEMTS; 110 ns latency, 45000 MB/s bandwidth,
128.00 GiB DPA range >
On Thu, 26 Mar 2026 15:24:26 -0700 Dave Jiang <dave.jiang@intel.com> wrote:
>
>
> On 3/23/26 10:35 PM, Rakie Kim wrote:
> > On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> >>
> < --snip-- >
>
> >
> > The HMAT latencies and bandwidths are present, but the values seem
> > broken. Here is the latency table:
> >
> > Init->Target | node0 | node1 | node2 | node3
> > node0 | 0x38B | 0x89F | 0x9C4 | 0x3AFC
> > node1 | 0x89F | 0x38B | 0x3AFC| 0x4268
>
> Do you have the iasl -d outputs of the SRAT and the HMAT somewhere we can look at?
>
> DJ
>
Hello Dave,
Posting the entire ACPI dump would be too long for the list.
Instead, I have extracted the relevant structures from the `iasl -d`
output for the local DRAM nodes (PXM 0, 1) and the CXL memory nodes (PXM A, B).
Here is the truncated `HMAT.dsl` showing the core topology mappings
and the latency matrix we are discussing:
----------------------------------------------------------------------
[000h 0000 4] Signature : "HMAT" [Heterogeneous Memory Attributes Table]
[004h 0004 4] Table Length : 00000668
[008h 0008 1] Revision : 02
[009h 0009 1] Checksum : 6A
[00Ah 0010 6] Oem ID : "GBT "
[010h 0016 8] Oem Table ID : "GBTUACPI"
[018h 0024 4] Oem Revision : 01072009
[01Ch 0028 4] Asl Compiler ID : "AMI "
[020h 0032 4] Asl Compiler Revision : 20230628
[024h 0036 4] Reserved : 00000000
[028h 0040 2] Structure Type : 0000 [Memory Proximity Domain Attributes]
[02Ah 0042 2] Reserved : 0000
[02Ch 0044 4] Length : 00000028
[030h 0048 2] Flags (decoded below) : 0001
Processor Proximity Domain Valid : 1
[032h 0050 2] Reserved1 : 0000
[034h 0052 4] Attached Initiator Proximity Domain : 00000000
[038h 0056 4] Memory Proximity Domain : 00000000
...
[050h 0080 2] Structure Type : 0000 [Memory Proximity Domain Attributes]
[052h 0082 2] Reserved : 0000
[054h 0084 4] Length : 00000028
[058h 0088 2] Flags (decoded below) : 0001
Processor Proximity Domain Valid : 1
[05Ah 0090 2] Reserved1 : 0000
[05Ch 0092 4] Attached Initiator Proximity Domain : 00000001
[060h 0096 4] Memory Proximity Domain : 00000001
...
[078h 0120 2] Structure Type : 0000 [Memory Proximity Domain Attributes]
[07Ah 0122 2] Reserved : 0000
[07Ch 0124 4] Length : 00000028
[080h 0128 2] Flags (decoded below) : 0000
Processor Proximity Domain Valid : 0
[082h 0130 2] Reserved1 : 0000
[084h 0132 4] Attached Initiator Proximity Domain : 00000000
[088h 0136 4] Memory Proximity Domain : 0000000A
...
[0A0h 0160 2] Structure Type : 0000 [Memory Proximity Domain Attributes]
[0A2h 0162 2] Reserved : 0000
[0A4h 0164 4] Length : 00000028
[0A8h 0168 2] Flags (decoded below) : 0000
Processor Proximity Domain Valid : 0
[0AAh 0170 2] Reserved1 : 0000
[0ACh 0172 4] Attached Initiator Proximity Domain : 00000000
[0B0h 0176 4] Memory Proximity Domain : 0000000B
...
[0C8h 0200 2] Structure Type : 0001 [System Locality Latency and Bandwidth Information]
[0CAh 0202 2] Reserved : 0000
[0CCh 0204 4] Length : 00000168
[0D0h 0208 1] Flags (decoded below) : 00
Memory Hierarchy : 0
[0D1h 0209 1] Data Type : 01
[0D2h 0210 2] Reserved1 : 0000
[0D4h 0212 4] Initiator Proximity Domains # : 0000000A
[0D8h 0216 4] Target Proximity Domains # : 0000000C
...
[140h 0320 2] Entry : 038B
[142h 0322 2] Entry : 089F
[144h 0324 2] Entry : 09C4
[146h 0326 2] Entry : 09C4
[148h 0328 2] Entry : 09C4
[14Ah 0330 2] Entry : 09C4
[14Ch 0332 2] Entry : 157C
[14Eh 0334 2] Entry : 157C
[150h 0336 2] Entry : 157C
[152h 0338 2] Entry : 157C
[154h 0340 2] Entry : 3AFC
[156h 0342 2] Entry : 4268
[158h 0344 2] Entry : 089F
[15Ah 0346 2] Entry : 038B
[15Ch 0348 2] Entry : 157C
[15Eh 0350 2] Entry : 157C
[160h 0352 2] Entry : 157C
[162h 0354 2] Entry : 157C
[164h 0356 2] Entry : 09C4
[166h 0358 2] Entry : 09C4
[168h 0360 2] Entry : 09C4
[16Ah 0362 2] Entry : 09C4
[16Ch 0364 2] Entry : 4268
[16Eh 0366 2] Entry : 3AFC
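The SLLBIS entries above are a flat row-major list: one row per initiator domain, one column per target domain (per ACPI, the latency is the entry value times the structure's Entry Base Unit, in picoseconds; the base-unit field is elided by the "..." in this excerpt). A small sketch of how the two CPU-bearing initiator rows shown above reshape into a matrix — the row/column grouping follows the Initiator/Target counts in the header, everything else is illustrative:

```python
def latency_matrix(entries, num_initiators, num_targets):
    """Group the flat SLLBIS Entry list row-major: one row per initiator."""
    assert len(entries) == num_initiators * num_targets
    return [entries[i * num_targets:(i + 1) * num_targets]
            for i in range(num_initiators)]

# The two rows excerpted above (12 targets each; the header declares
# 10 initiator domains, but only the two CPU rows are shown here).
row_node0 = [0x038B, 0x089F, 0x09C4, 0x09C4, 0x09C4, 0x09C4,
             0x157C, 0x157C, 0x157C, 0x157C, 0x3AFC, 0x4268]
row_node1 = [0x089F, 0x038B, 0x157C, 0x157C, 0x157C, 0x157C,
             0x09C4, 0x09C4, 0x09C4, 0x09C4, 0x4268, 0x3AFC]

matrix = latency_matrix(row_node0 + row_node1, 2, 12)
```

With the matrix in hand, `matrix[i][t]` is initiator `i`'s raw latency entry toward target `t`, which is how the asymmetric CXL values discussed below can be compared side by side.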
----------------------------------------------------------------------
And here is the relevant extraction from `SRAT.dsl`. As you suspected,
the CXL memory ranges are indeed statically defined in the SRAT at boot:
----------------------------------------------------------------------
[000h 0000 4] Signature : "SRAT" [System Resource Affinity Table]
[004h 0004 4] Table Length : 0000A1F8
[008h 0008 1] Revision : 03
[009h 0009 1] Checksum : 54
[00Ah 0010 6] Oem ID : "GBT "
[010h 0016 8] Oem Table ID : "GBTUACPI"
[018h 0024 4] Oem Revision : 00000002
[01Ch 0028 4] Asl Compiler ID : "AMI "
[020h 0032 4] Asl Compiler Revision : 20230628
[024h 0036 4] Table Revision : 00000001
[028h 0040 8] Reserved : 0000000000000000
[030h 0048 1] Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[031h 0049 1] Length : 10
[032h 0050 1] Proximity Domain Low(8) : 00
[033h 0051 1] Apic ID : FF
[034h 0052 4] Flags (decoded below) : 00000000
Enabled : 0
[038h 0056 1] Local Sapic EID : 00
[039h 0057 3] Proximity Domain High(24) : 000000
[03Ch 0060 4] Clock Domain : 00000000
[040h 0064 1] Subtable Type : 00 [Processor Local APIC/SAPIC Affinity]
[041h 0065 1] Length : 10
[042h 0066 1] Proximity Domain Low(8) : 00
[043h 0067 1] Apic ID : FF
[044h 0068 4] Flags (decoded below) : 00000000
Enabled : 0
[048h 0072 1] Local Sapic EID : 00
...
[A1A8h 41384 1] Subtable Type : 01 [Memory Affinity]
[A1A9h 41385 1] Length : 28
[A1AAh 41386 4] Proximity Domain : 0000000A
[A1AEh 41390 2] Reserved1 : 0000
[A1B0h 41392 8] Base Address : 000000C040000000
[A1B8h 41400 8] Address Length : 0000010000000000
[A1C0h 41408 4] Reserved2 : 00000000
[A1C4h 41412 4] Flags (decoded below) : 0000000B
Enabled : 1
[A1D0h 41424 1] Subtable Type : 01 [Memory Affinity]
[A1D1h 41425 1] Length : 28
[A1D2h 41426 4] Proximity Domain : 0000000B
[A1D6h 41430 2] Reserved1 : 0000
[A1D8h 41432 8] Base Address : 0000071E40000000
[A1E0h 41440 8] Address Length : 0000010000000000
[A1E8h 41448 4] Reserved2 : 00000000
[A1ECh 41452 4] Flags (decoded below) : 0000000B
Enabled : 1
----------------------------------------------------------------------
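For reference, the Memory Affinity subtables above each describe a physical address window: Base Address plus Address Length (0x0000010000000000 is 1 TiB here), with flags bit 0 as the Enabled bit. A minimal sketch of that decoding, using the PXM 0x0A entry from the dump:

```python
def srat_window(base, length, flags):
    """Decode one SRAT Memory Affinity entry into (start, end, enabled)."""
    enabled = bool(flags & 0x1)  # bit 0 of the Flags field
    return (base, base + length - 1, enabled)

# PXM 0x0A entry at [A1A8h]: 1 TiB window starting at 0xC040000000
start, end, enabled = srat_window(0x000000C040000000,
                                  0x0000010000000000, 0x0B)
```

This is only a decoding illustration; the kernel's own SRAT parsing lives in drivers/acpi/numa/.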
Rakie Kim
Rakie Kim wrote:
[..]
> Hello Jonathan,
>
> Your insight is incredibly accurate. To clarify the situation, here is
> the actual configuration of my system:
>
> NODE     Type          PXD
> node0    local memory  0x00
> node1    local memory  0x01
> node2    cxl memory    0x0A
> node3    cxl memory    0x0B
>
> Physically, the node2 CXL is attached to node0 (Socket 0), and the
> node3 CXL is attached to node1 (Socket 1). However, extracting the
> HMAT.dsl reveals the following:
>
> - local memory
> [028h] Flags: 0001 (Processor Proximity Domain Valid = 1)
>        Attached Initiator Proximity Domain: 0x00
>        Memory Proximity Domain: 0x00
> [050h] Flags: 0001 (Processor Proximity Domain Valid = 1)
>        Attached Initiator Proximity Domain: 0x01
>        Memory Proximity Domain: 0x01
>
> - cxl memory
> [078h] Flags: 0000 (Processor Proximity Domain Valid = 0)
>        Attached Initiator Proximity Domain: 0x00
>        Memory Proximity Domain: 0x0A
> [0A0h] Flags: 0000 (Processor Proximity Domain Valid = 0)
>        Attached Initiator Proximity Domain: 0x00
>        Memory Proximity Domain: 0x0B

This looks good.

Unless the CPU is directly attached to the memory controller then there
is no attached initiator. For example, if you wanted to run an x86
memory controller configuration instruction like PCONFIG you would issue
an IPI to the CPU attached to the target memory controller. There is no
such connection for a CPU to do the same for a CXL proximity domain.

> As you correctly suspected, the flags for the CXL memory are 0000,
> meaning the Processor Proximity Domain is marked as invalid. But when
> checking the sysfs initiator configurations, it shows a different story:
>
> Node    access0 Initiator    access1 Initiator
> node0   node0                node0
> node1   node1                node1
> node2   node1                node1
> node3   node1                node1

2 comments. HMAT is not a physical topology layout table. The fallback
determination of "best" initiator when "Attached Initiator PXM" is not
set is just a heuristic. That heuristic probably has not been touched
since the initial HMAT support went upstream.

> Although the Attached Initiator is set to 0 in HMAT with an invalid
> flag, sysfs strangely registers node1 as the initiator for both CXL
> nodes. Because both HMAT and sysfs are exposing abnormal values, it was
> impossible for me to determine the true socket connections for CXL
> using this data.

Yeah, this sounds more like a kernel bug report than a firmware bug
report at this point.

> > > Even though the distance map shows node2 is physically closer to
> > > Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> > > routing path strictly through Socket 1. Because the HMAT alone made it
> > > difficult to determine the exact physical socket connections on these
> > > systems, I ended up using the current CXL driver-based approach.
> >
> > Are the HMAT latencies and bandwidths all there? Or are some missing
> > and you have to use SLIT (which generally is garbage for historical
> > reasons of tuning SLIT to particular OS behaviour).
>
> The HMAT latencies and bandwidths are present, but the values seem
> broken. Here is the latency table:
>
> Init->Target | node0 | node1 | node2 | node3
> node0        | 0x38B | 0x89F | 0x9C4 | 0x3AFC
> node1        | 0x89F | 0x38B | 0x3AFC| 0x4268
>
> I used the identical type of DRAM and CXL memory for both sockets.
> However, looking at the table, the local CXL access latency from
> node0->node2 (0x9C4) and node1->node3 (0x4268) shows a massive,
> unjustified difference. This asymmetry proves that the table is
> currently unreliable.

...or it is telling the truth. Would need more data.

> > > I wonder if others have experienced similar broken HMAT cases with CXL.
> > > If HMAT information becomes more reliable in the future, we could
> > > build a much more efficient structure.
> >
> > Given it's being lightly used I suspect there will be many bugs :(
> > I hope we can assume they will get fixed however!
> >
> > ...
> >
>
> The most critical issue caused by this broken initiator setting is that
> topology analysis tools like `hwloc` are completely misled. Currently,
> `hwloc` displays both CXL nodes as being attached to Socket 1.
>
> I observed this exact same issue on both Sierra Forest and Granite
> Rapids systems. I believe this broken topology exposure is a severe
> problem that must be addressed, though I am not entirely sure what the
> best fix would be yet. I would love to hear your thoughts on this.

Before determining that these numbers are wrong you would need to redo
the calculation from CDAT data to see if you get a different answer.

The driver currently does this calculation as part of determining a QoS
class. It would be reasonable to also use that same calculation to double
check the BIOS firmware numbers for CXL proximity domains established at
boot.

> > > The complex topology cases you presented, such as multi-NUMA per socket,
> > > shared CXL switches, and IO expanders, are very important points.
> > > I clearly understand that the simple package-level grouping does not fully
> > > reflect the 1:1 relationship in these future hardware architectures.
> > >
> > > I have also thought about the shared CXL switch scenario you mentioned,
> > > and I know the current design falls short in addressing it properly.
> > > While the current implementation starts with a simple socket-local
> > > restriction, I plan to evolve it into a more flexible node aggregation
> > > model to properly reflect all the diverse topologies you suggested.
> >
> > If we can ensure it fails cleanly when it finds a topology that it can't
> > cope with (and I guess falls back to current) then I'm fine with a partial
> > solution that evolves.
>
> I completely agree with ensuring a clean failure. To stabilize this
> partial solution, I am currently considering a few options for the
> next version:
>
> 1. Enable this feature only when a strict 1:1 topology is detected.
> 2. Provide a sysfs allowing users to enable/disable it.
> 3. Allow users to manually override/configure the topology via sysfs.
> 4. Implement dynamic fallback behaviors depending on the detected
>    topology shape (needs further thought).

The advice is always start as simple as possible but no simpler.

It may be the case that Linux indeed finds that platform firmware comes
to a different result than expected. When that happens the CXL subsystem
can probably emit the mismatch details, or otherwise validate the HMAT.

As for actual physical topology layout determination, that is out of
scope for HMAT, but the CXL CDAT calculations do consider PCI link
details.

> By the way, when I first posted this RFC, I accidentally missed adding
> lsf-pc@lists.linux-foundation.org to the CC list. I am considering
> re-posting it to ensure it reaches the lsf-pc.

They are on the Cc: now, I expect that is sufficient.
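The "best initiator" fallback heuristic discussed in the message above can be modeled in a few lines. This is a simplified illustration, not the kernel's actual code path: when a target node carries no valid "Attached Initiator PXM", pick the initiator with the lowest HMAT read latency toward it.

```python
def best_initiator(latency_by_initiator):
    """Simplified model of the fallback: given {initiator_node: latency
    entry} for one target, return the initiator with the lowest latency."""
    return min(latency_by_initiator, key=latency_by_initiator.get)

# With sane per-socket values (node0's local CXL latency far below
# node1's cross-socket latency), the CXL target resolves to node0.
winner = best_initiator({0: 0x09C4, 1: 0x3AFC})
```

The point of the model is that the heuristic is only as good as the latency table it consumes: a garbage entry silently flips which socket a CXL node appears to belong to.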
On Thu, 26 Mar 2026 13:13:40 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> Rakie Kim wrote:
> [..]
> > Hello Jonathan,
> >
> > Your insight is incredibly accurate. To clarify the situation, here is
> > the actual configuration of my system:
> >
> > NODE     Type          PXD
> > node0    local memory  0x00
> > node1    local memory  0x01
> > node2    cxl memory    0x0A
> > node3    cxl memory    0x0B
> >
> > Physically, the node2 CXL is attached to node0 (Socket 0), and the
> > node3 CXL is attached to node1 (Socket 1). However, extracting the
> > HMAT.dsl reveals the following:
> >
> > - local memory
> > [028h] Flags: 0001 (Processor Proximity Domain Valid = 1)
> >        Attached Initiator Proximity Domain: 0x00
> >        Memory Proximity Domain: 0x00
> > [050h] Flags: 0001 (Processor Proximity Domain Valid = 1)
> >        Attached Initiator Proximity Domain: 0x01
> >        Memory Proximity Domain: 0x01
> >
> > - cxl memory
> > [078h] Flags: 0000 (Processor Proximity Domain Valid = 0)
> >        Attached Initiator Proximity Domain: 0x00
> >        Memory Proximity Domain: 0x0A
> > [0A0h] Flags: 0000 (Processor Proximity Domain Valid = 0)
> >        Attached Initiator Proximity Domain: 0x00
> >        Memory Proximity Domain: 0x0B
>
> This looks good.
>
> Unless the CPU is directly attached to the memory controller then there
> is no attached initiator. For example, if you wanted to run an x86
> memory controller configuration instruction like PCONFIG you would issue
> an IPI to the CPU attached to the target memory controller. There is no
> such connection for a CPU to do the same for a CXL proximity domain.
>
> > As you correctly suspected, the flags for the CXL memory are 0000,
> > meaning the Processor Proximity Domain is marked as invalid. But when
> > checking the sysfs initiator configurations, it shows a different story:
> >
> > Node    access0 Initiator    access1 Initiator
> > node0   node0                node0
> > node1   node1                node1
> > node2   node1                node1
> > node3   node1                node1
>
> 2 comments. HMAT is not a physical topology layout table. The
> fallback determination of "best" initiator when "Attached Initiator PXM"
> is not set is just a heuristic. That heuristic probably has not been
> touched since the initial HMAT support went upstream.
>
> > Although the Attached Initiator is set to 0 in HMAT with an invalid
> > flag, sysfs strangely registers node1 as the initiator for both CXL
> > nodes. Because both HMAT and sysfs are exposing abnormal values, it was
> > impossible for me to determine the true socket connections for CXL
> > using this data.
>
> Yeah, this sounds more like a kernel bug report than a firmware bug
> report at this point.
>

You are right. From the hardware's perspective, the `0000` flag makes
perfect sense since the CPU is not directly attached to the CXL memory
controller. I completely agree with your assessment that this points
directly to a bug in the kernel's outdated fallback heuristic logic,
rather than a firmware error.

> > > > Even though the distance map shows node2 is physically closer to
> > > > Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> > > > routing path strictly through Socket 1. Because the HMAT alone made it
> > > > difficult to determine the exact physical socket connections on these
> > > > systems, I ended up using the current CXL driver-based approach.
> > >
> > > Are the HMAT latencies and bandwidths all there? Or are some missing
> > > and you have to use SLIT (which generally is garbage for historical
> > > reasons of tuning SLIT to particular OS behaviour).
> >
> > The HMAT latencies and bandwidths are present, but the values seem
> > broken. Here is the latency table:
> >
> > Init->Target | node0 | node1 | node2 | node3
> > node0        | 0x38B | 0x89F | 0x9C4 | 0x3AFC
> > node1        | 0x89F | 0x38B | 0x3AFC| 0x4268
> >
> > I used the identical type of DRAM and CXL memory for both sockets.
> > However, looking at the table, the local CXL access latency from
> > node0->node2 (0x9C4) and node1->node3 (0x4268) shows a massive,
> > unjustified difference. This asymmetry proves that the table is
> > currently unreliable.
>
> ...or it is telling the truth. Would need more data.
>
> > > > I wonder if others have experienced similar broken HMAT cases with CXL.
> > > > If HMAT information becomes more reliable in the future, we could
> > > > build a much more efficient structure.
> > >
> > > Given it's being lightly used I suspect there will be many bugs :(
> > > I hope we can assume they will get fixed however!
> > >
> > > ...
> >
> > The most critical issue caused by this broken initiator setting is that
> > topology analysis tools like `hwloc` are completely misled. Currently,
> > `hwloc` displays both CXL nodes as being attached to Socket 1.
> >
> > I observed this exact same issue on both Sierra Forest and Granite
> > Rapids systems. I believe this broken topology exposure is a severe
> > problem that must be addressed, though I am not entirely sure what the
> > best fix would be yet. I would love to hear your thoughts on this.
>
> Before determining that these numbers are wrong you would need to redo
> the calculation from CDAT data to see if you get a different answer.
>
> The driver currently does this calculation as part of determining a QoS
> class. It would be reasonable to also use that same calculation to double
> check the BIOS firmware numbers for CXL proximity domains established at
> boot.
>

It was indeed premature of me to conclude the table was broken solely
based on the large and asymmetric numbers. Interestingly, Dave Jiang
just mentioned in another reply that the Intel BIOS folks confirmed
these HMAT values actually represent "end-to-end" latency, which
perfectly explains why the numbers are so much larger than expected.

Also, I have just posted the detailed `SRAT` and `HMAT` dumps in my
reply to Dave Jiang. Please feel free to refer to the exact firmware
structures we are discussing here:
https://lore.kernel.org/all/20260330025914.361-1-rakie.kim@sk.com/

> > > > The complex topology cases you presented, such as multi-NUMA per socket,
> > > > shared CXL switches, and IO expanders, are very important points.
> > > > I clearly understand that the simple package-level grouping does not fully
> > > > reflect the 1:1 relationship in these future hardware architectures.
> > > >
> > > > I have also thought about the shared CXL switch scenario you mentioned,
> > > > and I know the current design falls short in addressing it properly.
> > > > While the current implementation starts with a simple socket-local
> > > > restriction, I plan to evolve it into a more flexible node aggregation
> > > > model to properly reflect all the diverse topologies you suggested.
> > >
> > > If we can ensure it fails cleanly when it finds a topology that it can't
> > > cope with (and I guess falls back to current) then I'm fine with a partial
> > > solution that evolves.
> >
> > I completely agree with ensuring a clean failure. To stabilize this
> > partial solution, I am currently considering a few options for the
> > next version:
> >
> > 1. Enable this feature only when a strict 1:1 topology is detected.
> > 2. Provide a sysfs allowing users to enable/disable it.
> > 3. Allow users to manually override/configure the topology via sysfs.
> > 4. Implement dynamic fallback behaviors depending on the detected
> >    topology shape (needs further thought).
>
> The advice is always start as simple as possible but no simpler.
>
> It may be the case that Linux indeed finds that platform firmware comes
> to a different result than expected. When that happens the CXL subsystem
> can probably emit the mismatch details, or otherwise validate the HMAT.
>
> As for actual physical topology layout determination, that is out of
> scope for HMAT, but the CXL CDAT calculations do consider PCI link
> details.
>

Thank you for the clear architectural guidance. Knowing that physical
topology determination is strictly out of scope for HMAT reassures me
that leveraging the PCI link details is indeed the correct direction
for this Socket-aware feature.

To discover the topology, I actually implemented a method to retrieve
this information directly from the CXL driver in PATCH 3 of this RFC:
https://lore.kernel.org/all/20260316051258.246-4-rakie.kim@sk.com/

However, I am still wondering if this specific implementation is the
truly correct and most appropriate way to achieve it in the kernel.
Any thoughts on that specific approach would be highly appreciated.

I will keep your advice in mind and ensure the fallback and policy
designs are kept as simple as possible for the next version.

Thanks again for your time and all the valuable insights.

Rakie Kim
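As a side note on the sysfs initiator table quoted in this exchange, the reported initiators can be read back programmatically. A hypothetical helper modeled on the `/sys/devices/system/node/nodeN/accessM/initiators/` layout (the function name and the fake tree below are illustrative, not a kernel API), demonstrated against a throwaway directory tree mirroring the table in the mail:

```python
import os
import tempfile

def access_initiators(node_root, node, access_class=0):
    """List the node* entries under nodeN/accessM/initiators/, i.e. the
    nodes sysfs reports as that access class's initiators."""
    path = os.path.join(node_root, f"node{node}",
                        f"access{access_class}", "initiators")
    return sorted(e for e in os.listdir(path) if e.startswith("node"))

# Fake tree reproducing the reported (suspect) state: node2's access0
# initiator shows up as node1 rather than node0.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "node2", "access0", "initiators", "node1"))
```

On a real system one would pass `/sys/devices/system/node` as `node_root`; comparing the result against the physical attachment is exactly the mismatch being debated in this thread.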
On Tue, 24 Mar 2026 14:35:45 +0900
Rakie Kim <rakie.kim@sk.com> wrote:
> On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> >
> > > > >
> > > > > To make this possible, the system requires a mechanism to understand
> > > > > the physical topology. The existing NUMA distance model provides only
> > > > > relative latency values between nodes and lacks any notion of
> > > > > structural grouping such as socket boundaries. This is especially
> > > > > problematic for CXL memory nodes, which appear without an explicit
> > > > > socket association.
> > > >
> > > > So in a general sense, the missing info here is effectively the same
> > > > stuff we are missing from the HMAT presentation (it's there in the
> > > > table and it's there to compute in CXL cases) just because we decided
> > > > not to surface anything other than distances to memory from nearest
> > > > initiator. I chatted to Joshua and Keith about filling in that stuff
> > > > at last LSFMM. To me that's just a bit of engineering work that needs
> > > > doing now we have proven use cases for the data. Mostly it's figuring out
> > > > the presentation to userspace and kernel data structures as it's a
> > > > lot of data in a big system (typically at least 32 NUMA nodes).
> > > >
> > > Hearing about the discussion on exposing HMAT data is very welcome news.
> > > Because this detailed topology information is not yet fully exposed to
> > > the kernel and userspace, I used a temporary package-based restriction.
> > > Figuring out how to expose and integrate this data into the kernel data
> > > structures is indeed a crucial engineering task we need to solve.
> > >
> > > Actually, when I first started this work, I considered fetching the
> > > topology information from HMAT before adopting the current approach.
> > > However, I encountered a firmware issue on my test systems
> > > (Granite Rapids and Sierra Forest).
> > >
> > > Although each socket has its own locally attached CXL device, the HMAT
> > > only registers node1 (Socket 1) as the initiator for both CXL memory
> > > nodes (node2 and node3). As a result, the sysfs HMAT initiators for
> > > both node2 and node3 only expose node1.
> >
> > Do you mean the Memory Proximity Domain Attributes Structure has
> > the "Proximity Domain for the Attached Initiator" set wrong?
> > Was this for its presentation of the full path to CXL mem nodes, or
> > to a PXM with a generic port? Sounds like you have SRAT covering
> > the CXL mem so ideal would be to have the HMAT data to GP and to
> > the CXL PXMs that BIOS has set up.
> >
> > Either way having that set at all for CXL memory is fishy as it's about
> > where the 'memory controller' is and on CXL mem that should be at the
> > device end of the link. My understanding is that it was only meant
> > to be set when you have separate memory only Nodes where the physical
> > controller is in a particular other node (e.g. what you do
> > if you have a CPU with DRAM and HBM). Maybe we need to make the
> > kernel warn + ignore that if it is set to something odd like yours.
> >
>
> Hello Jonathan,
>
> Your insight is incredibly accurate. To clarify the situation, here is
> the actual configuration of my system:
>
> NODE     Type          PXD
> node0    local memory  0x00
> node1    local memory  0x01
> node2    cxl memory    0x0A
> node3    cxl memory    0x0B
>
> Physically, the node2 CXL is attached to node0 (Socket 0), and the
> node3 CXL is attached to node1 (Socket 1). However, extracting the
> HMAT.dsl reveals the following:
>
> - local memory
> [028h] Flags: 0001 (Processor Proximity Domain Valid = 1)
>        Attached Initiator Proximity Domain: 0x00
>        Memory Proximity Domain: 0x00
> [050h] Flags: 0001 (Processor Proximity Domain Valid = 1)
>        Attached Initiator Proximity Domain: 0x01
>        Memory Proximity Domain: 0x01
>
> - cxl memory
> [078h] Flags: 0000 (Processor Proximity Domain Valid = 0)
>        Attached Initiator Proximity Domain: 0x00
>        Memory Proximity Domain: 0x0A
> [0A0h] Flags: 0000 (Processor Proximity Domain Valid = 0)
>        Attached Initiator Proximity Domain: 0x00
>        Memory Proximity Domain: 0x0B

That's faintly amusing given it conveys no information at all.
Still unless we have a bug shouldn't cause anything odd.

> As you correctly suspected, the flags for the CXL memory are 0000,
> meaning the Processor Proximity Domain is marked as invalid. But when
> checking the sysfs initiator configurations, it shows a different story:
>
> Node    access0 Initiator    access1 Initiator
> node0   node0                node0
> node1   node1                node1
> node2   node1                node1
> node3   node1                node1
>
> Although the Attached Initiator is set to 0 in HMAT with an invalid
> flag, sysfs strangely registers node1 as the initiator for both CXL
> nodes.

Been a while since I looked at the hmat parser..

If ACPI_HMAT_PROCESSOR_PD_VALID isn't set, hmat_parse_proximity_domain()
shouldn't set the target. At the end of that function it should be set
to PXM_INVALID.

It should therefore retain the state from alloc_memory_initiator() I think?

Given I did all my testing without the PD_VALID set (as it wasn't on my
test system) it should be fine with that. Anyhow, let's look at the data
for proximity.

> Because both HMAT and sysfs are exposing abnormal values, it was
> impossible for me to determine the true socket connections for CXL
> using this data.
>
> > >
> > > Even though the distance map shows node2 is physically closer to
> > > Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> > > routing path strictly through Socket 1. Because the HMAT alone made it
> > > difficult to determine the exact physical socket connections on these
> > > systems, I ended up using the current CXL driver-based approach.
> >
> > Are the HMAT latencies and bandwidths all there? Or are some missing
> > and you have to use SLIT (which generally is garbage for historical
> > reasons of tuning SLIT to particular OS behaviour).
> >
>
> The HMAT latencies and bandwidths are present, but the values seem
> broken. Here is the latency table:
>
> Init->Target | node0 | node1 | node2 | node3
> node0        | 0x38B | 0x89F | 0x9C4 | 0x3AFC
> node1        | 0x89F | 0x38B | 0x3AFC| 0x4268

Yeah. That would do it... Looks like that final value is garbage.

> I used the identical type of DRAM and CXL memory for both sockets.
> However, looking at the table, the local CXL access latency from
> node0->node2 (0x9C4) and node1->node3 (0x4268) shows a massive,
> unjustified difference. This asymmetry proves that the table is
> currently unreliable.

Poke your favourite bios vendor I guess.

I asked one of the intel folk to take a look and see if this is a
broader issue or just one particular bios.

> > > I wonder if others have experienced similar broken HMAT cases with CXL.
> > > If HMAT information becomes more reliable in the future, we could
> > > build a much more efficient structure.
> >
> > Given it's being lightly used I suspect there will be many bugs :(
> > I hope we can assume they will get fixed however!
> >
> > ...
>
> The most critical issue caused by this broken initiator setting is that
> topology analysis tools like `hwloc` are completely misled. Currently,
> `hwloc` displays both CXL nodes as being attached to Socket 1.
>
> I observed this exact same issue on both Sierra Forest and Granite
> Rapids systems. I believe this broken topology exposure is a severe
> problem that must be addressed, though I am not entirely sure what the
> best fix would be yet. I would love to hear your thoughts on this.

Fix the bios. If you don't mind, can you provide dumps of
cat /sys/firmware/acpi/tables/HMAT just so we can check there is nothing
wrong with the parser.

> > > The complex topology cases you presented, such as multi-NUMA per socket,
> > > shared CXL switches, and IO expanders, are very important points.
> > > I clearly understand that the simple package-level grouping does not fully
> > > reflect the 1:1 relationship in these future hardware architectures.
> > >
> > > I have also thought about the shared CXL switch scenario you mentioned,
> > > and I know the current design falls short in addressing it properly.
> > > While the current implementation starts with a simple socket-local
> > > restriction, I plan to evolve it into a more flexible node aggregation
> > > model to properly reflect all the diverse topologies you suggested.
> >
> > If we can ensure it fails cleanly when it finds a topology that it can't
> > cope with (and I guess falls back to current) then I'm fine with a partial
> > solution that evolves.
> >
>
> I completely agree with ensuring a clean failure. To stabilize this
> partial solution, I am currently considering a few options for the
> next version:
>
> 1. Enable this feature only when a strict 1:1 topology is detected.

Definitely default to off. Maybe allow a user to say they want to do it
anyway. I can see there might be systems that are only a tiny bit off
and it makes no practical difference.

> 2. Provide a sysfs allowing users to enable/disable it.

Makes sense.

> 3. Allow users to manually override/configure the topology via sysfs.

No. If people are in this state we should apply fixes to the HMAT table
either by injection of real data or some quirking. If we add userspace
control via simpler means the motivation for people to fix bios goes out
the window and it never gets resolved.

> 4. Implement dynamic fallback behaviors depending on the detected
>    topology shape (needs further thought).

That would be interesting. But maybe not a 1st version thing :)

> By the way, when I first posted this RFC, I accidentally missed adding
> lsf-pc@lists.linux-foundation.org to the CC list. I am considering
> re-posting it to ensure it reaches the lsf-pc.

Makes sense. Make sure to add a back link to this so it is visible
discussion already going on.

> Thanks again for your profound insights and time. It is tremendously
> helpful.

Thanks to you for starting to solve the problem!

J

>
> Rakie Kim
>
> > > > Thanks again for your time and review.
> >
> > You are welcome.
> >
> > Thanks
> >
> > Jonathan
> >
> > > > Rakie Kim
> >
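One quick sanity check worth running on the raw `/sys/firmware/acpi/tables/HMAT` dump Jonathan asks for above, before suspecting the parser: every ACPI table is defined so that all of its bytes, including the checksum field, sum to zero modulo 256. A minimal sketch (the toy blob and its checksum-byte position are illustrative only; in a real table the checksum lives in the standard header):

```python
def acpi_checksum_ok(table_bytes):
    """True if the table's bytes sum to 0 mod 256, per the ACPI spec."""
    return sum(table_bytes) % 256 == 0

# Toy 8-byte "table": pick byte index 4 (arbitrary, for the sketch) so
# that the whole blob sums to 0 mod 256.
blob = bytearray(b"HMAT\x00\x00\x00\x00")
blob[4] = (256 - sum(blob) % 256) % 256
```

A table that fails this check was corrupted in transit (or by the dump tooling) and is not worth debugging further against the kernel parser.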
On Wed, 25 Mar 2026 12:33:50 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote: > On Tue, 24 Mar 2026 14:35:45 +0900 > Rakie Kim <rakie.kim@sk.com> wrote: > > > On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote: > > > > > > > > > > > > > > > To make this possible, the system requires a mechanism to understand > > > > > > the physical topology. The existing NUMA distance model provides only > > > > > > relative latency values between nodes and lacks any notion of > > > > > > structural grouping such as socket boundaries. This is especially > > > > > > problematic for CXL memory nodes, which appear without an explicit > > > > > > socket association. > > > > > > > > > > So in a general sense, the missing info here is effectively the same > > > > > stuff we are missing from the HMAT presentation (it's there in the > > > > > table and it's there to compute in CXL cases) just because we decided > > > > > not to surface anything other than distances to memory from nearest > > > > > initiator. I chatted to Joshua and Kieth about filling in that stuff > > > > > at last LSFMM. To me that's just a bit of engineering work that needs > > > > > doing now we have proven use cases for the data. Mostly it's figuring out > > > > > the presentation to userspace and kernel data structures as it's a > > > > > lot of data in a big system (typically at least 32 NUMA nodes). > > > > > > > > > > > > > Hearing about the discussion on exposing HMAT data is very welcome news. > > > > Because this detailed topology information is not yet fully exposed to > > > > the kernel and userspace, I used a temporary package-based restriction. > > > > Figuring out how to expose and integrate this data into the kernel data > > > > structures is indeed a crucial engineering task we need to solve. > > > > > > > > Actually, when I first started this work, I considered fetching the > > > > topology information from HMAT before adopting the current approach. 
> > > > However, I encountered a firmware issue on my test systems > > > > (Granite Rapids and Sierra Forest). > > > > > > > > Although each socket has its own locally attached CXL device, the HMAT > > > > only registers node1 (Socket 1) as the initiator for both CXL memory > > > > nodes (node2 and node3). As a result, the sysfs HMAT initiators for > > > > both node2 and node3 only expose node1. > > > > > > Do you mean the Memory Proximity Domain Attributes Structure has > > > the "Proximity Domain for the Attached Initiator" set wrong? > > > Was this for it's presentation of the full path to CXL mem nodes, or > > > to a PXM with a generic port? Sounds like you have SRAT covering > > > the CXL mem so ideal would be to have the HMAT data to GP and to > > > the CXL PXMs that BIOS has set up. > > > > > > Either way having that set at all for CXL memory is fishy as it's about > > > where the 'memory controller' is and on CXL mem that should be at the > > > device end of the link. My understanding of that is was only meant > > > to be set when you have separate memory only Nodes where the physical > > > controller is in a particular other node (e.g. what you do > > > if you have a CPU with DRAM and HBM). Maybe we need to make the > > > kernel warn + ignore that if it is set to something odd like yours. > > > > > > > Hello Jonathan, > > > > Your insight is incredibly accurate. To clarify the situation, here is > > the actual configuration of my system: > > > > NODE Type PXD > > node0 local memory 0x00 > > node1 local memory 0x01 > > node2 cxl memory 0x0A > > node3 cxl memory 0x0B > > > > Physically, the node2 CXL is attached to node0 (Socket 0), and the > > node3 CXL is attached to node1 (Socket 1). 
However, extracting the > > HMAT.dsl reveals the following: > > > > - local memory > > [028h] Flags: 0001 (Processor Proximity Domain Valid = 1) > > Attached Initiator Proximity Domain: 0x00 > > Memory Proximity Domain: 0x00 > > [050h] Flags: 0001 (Processor Proximity Domain Valid = 1) > > Attached Initiator Proximity Domain: 0x01 > > Memory Proximity Domain: 0x01 > > > > - cxl memory > > [078h] Flags: 0000 (Processor Proximity Domain Valid = 0) > > Attached Initiator Proximity Domain: 0x00 > > Memory Proximity Domain: 0x0A > > [0A0h] Flags: 0000 (Processor Proximity Domain Valid = 0) > > Attached Initiator Proximity Domain: 0x00 > > Memory Proximity Domain: 0x0B > > That's faintly amusing given it conveys no information at all. > Still unless we have a bug shouldn't cause anything odd. > > > > > As you correctly suspected, the flags for the CXL memory are 0000, > > meaning the Processor Proximity Domain is marked as invalid. But when > > checking the sysfs initiator configurations, it shows a different story: > > > > Node access0 Initiator access1 Initiator > > node0 node0 node0 > > node1 node1 node1 > > node2 node1 node1 > > node3 node1 node1 > > > > Although the Attached Initiator is set to 0 in HMAT with an invalid > > flag, sysfs strangely registers node1 as the initiator for both CXL > > nodes. > Been a while since I looked the hmat parser.. > > If ACPI_HMAT_PROCESSOR_PD_VALID isn't set, hmat_parse_proximity_domain() > shouldn't set the target. At end of that function should be set to PXM_INVALID. > > It should therefore retain the state from alloc_memory_intiator() I think? > > Given I did all my testing without the PD_VALID set (as it wasn't on my > test system) it should be fine with that. Anyhow, let's look at the data > for proximity. > > Hello Jonathan, Thank you for the deep insight into the HMAT parser code. 
As you mentioned, considering the current state where node1 is still
registered as the initiator in sysfs despite the flag being 0, it seems
highly likely that the kernel parser logic is not handling this specific
situation gracefully.

> > Because both HMAT and sysfs are exposing abnormal values, it was
> > impossible for me to determine the true socket connections for CXL
> > using this data.
> >
> > > > Even though the distance map shows node2 is physically closer to
> > > > Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> > > > routing path strictly through Socket 1. Because the HMAT alone made it
> > > > difficult to determine the exact physical socket connections on these
> > > > systems, I ended up using the current CXL driver-based approach.
> > >
> > > Are the HMAT latencies and bandwidths all there? Or are some missing
> > > and you have to use SLIT (which generally is garbage for historical
> > > reasons of tuning SLIT to particular OS behaviour).
> >
> > The HMAT latencies and bandwidths are present, but the values seem
> > broken. Here is the latency table:
> >
> >     Init->Target | node0 | node1 | node2  | node3
> >     node0        | 0x38B | 0x89F | 0x9C4  | 0x3AFC
> >     node1        | 0x89F | 0x38B | 0x3AFC | 0x4268
>
> Yeah. That would do it... Looks like that final value is garbage.
>
> > I used the identical type of DRAM and CXL memory for both sockets.
> > However, looking at the table, the local CXL access latency from
> > node0->node2 (0x9C4) and node1->node3 (0x4268) shows a massive,
> > unjustified difference. This asymmetry proves that the table is
> > currently unreliable.
>
> Poke your favourite BIOS vendor, I guess.
>
> I asked one of the Intel folks to take a look and see if this is a
> broader issue or just one particular BIOS.

I really appreciate you reaching out to the Intel contact to check
whether this is a broader platform issue.
I will also try to find a way to report this BIOS issue to our system
vendor, though I might need to figure out the proper channel since I am
not the system administrator.

Regarding the HMAT dump you requested, how should I provide it to you?
Would a hex dump converted via a utility like `xxd` be acceptable,
something like the snippet below?

    00000000: 484d 4154 6806 0000 026a 4742 5420 2020  HMATh....jGBT
    00000010: 4742 5455 4143 5049 0920 0701 414d 4920  GBTUACPI. ..AMI
    00000020: 2806 2320 0000 0000 0000 0000 2800 0000  (.# ........(...
    00000030: 0100 0000 0000 0000 0000 0000 0000 0000  ................

> > > > I wonder if others have experienced similar broken HMAT cases with CXL.
> > > > If HMAT information becomes more reliable in the future, we could
> > > > build a much more efficient structure.
> > >
> > > Given it's being lightly used I suspect there will be many bugs :(
> > > I hope we can assume they will get fixed however!
> > >
> > > ...
> >
> > The most critical issue caused by this broken initiator setting is that
> > topology analysis tools like `hwloc` are completely misled. Currently,
> > `hwloc` displays both CXL nodes as being attached to Socket 1.
> >
> > I observed this exact same issue on both Sierra Forest and Granite
> > Rapids systems. I believe this broken topology exposure is a severe
> > problem that must be addressed, though I am not entirely sure what the
> > best fix would be yet. I would love to hear your thoughts on this.
>
> Fix the BIOS then. If you don't mind, can you provide dumps of
> `cat /sys/firmware/acpi/tables/HMAT` just so we can check there is
> nothing wrong with the parser?
>
> > > > The complex topology cases you presented, such as multi-NUMA per socket,
> > > > shared CXL switches, and IO expanders, are very important points.
> > > > I clearly understand that the simple package-level grouping does not fully
> > > > reflect the 1:1 relationship in these future hardware architectures.
> > > > I have also thought about the shared CXL switch scenario you mentioned,
> > > > and I know the current design falls short in addressing it properly.
> > > > While the current implementation starts with a simple socket-local
> > > > restriction, I plan to evolve it into a more flexible node aggregation
> > > > model to properly reflect all the diverse topologies you suggested.
> > >
> > > If we can ensure it fails cleanly when it finds a topology that it
> > > can't cope with (and I guess falls back to current behaviour) then
> > > I'm fine with a partial solution that evolves.
> >
> > I completely agree with ensuring a clean failure. To stabilize this
> > partial solution, I am currently considering a few options for the
> > next version:
> >
> > 1. Enable this feature only when a strict 1:1 topology is detected.
>
> Definitely default to off. Maybe allow a user to say they want to do it
> anyway. I can see there might be systems that are only a tiny bit off
> and it makes no practical difference.

Your suggestion is very reasonable. I will proceed with this approach
for the next version, keeping the feature disabled by default.

> > 2. Provide a sysfs knob allowing users to enable/disable it.
>
> Makes sense.

I will include this sysfs enable/disable feature in the next version.

> > 3. Allow users to manually override/configure the topology via sysfs.
>
> No. If people are in this state we should apply fixes to the HMAT table,
> either by injection of real data or some quirking. If we add userspace
> control via simpler means, the motivation for people to fix the BIOS
> goes out the window and it never gets resolved.

Your reasoning is absolutely correct. I will not allow users to modify
the topology via sysfs. However, I plan to provide a read-only sysfs
interface so users can at least check the current topology information.

> > 4. Implement dynamic fallback behaviors depending on the detected
> >    topology shape (needs further thought).
>
> That would be interesting. But maybe not a 1st version thing :)

This is an area I also need to think more deeply about. I will not
include it in the initial version, but will consider implementing it in
the future.

Once again, I deeply appreciate your time, thorough review, and for
reaching out to Intel for further clarification. It is a huge help.

Rakie Kim
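For readers following the parser discussion above: the validity rule Jonathan describes can be sketched in a few lines. This is a user-space Python illustration only, not the kernel's hmat.c; `resolve_initiator` is a hypothetical helper, though the constant names mirror the kernel's. Entries whose Processor Proximity Domain Valid flag is clear should never contribute an initiator, which is why node1 appearing for node2/node3 in sysfs suggests a parser bug.

```python
# Sketch (not kernel code) of the initiator-validity rule: an HMAT Memory
# Proximity Domain Attributes entry should only contribute an initiator
# when ACPI_HMAT_PROCESSOR_PD_VALID (bit 0 of Flags) is set.
PROCESSOR_PD_VALID = 0x1
PXM_INVALID = -1

def resolve_initiator(flags, initiator_pxm):
    """Return the initiator PXM, or PXM_INVALID when the entry's
    Processor Proximity Domain Valid flag says the field is not valid."""
    if flags & PROCESSOR_PD_VALID:
        return initiator_pxm
    return PXM_INVALID

# Entries from the HMAT dump in this thread: mem PXM -> (flags, initiator PXM)
entries = {
    0x00: (0x0001, 0x00),  # local memory, valid initiator (node0)
    0x01: (0x0001, 0x01),  # local memory, valid initiator (node1)
    0x0A: (0x0000, 0x00),  # CXL memory, flags=0 -> initiator must be ignored
    0x0B: (0x0000, 0x00),  # CXL memory, flags=0 -> initiator must be ignored
}

for mem_pxm, (flags, init_pxm) in entries.items():
    print(f"memory PXM {mem_pxm:#04x}: initiator "
          f"{resolve_initiator(flags, init_pxm)}")
```

Under this rule both CXL entries resolve to PXM_INVALID, so sysfs should have kept the default (no registered initiator) rather than node1.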
On 3/26/26 1:54 AM, Rakie Kim wrote:
> On Wed, 25 Mar 2026 12:33:50 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
>> On Tue, 24 Mar 2026 14:35:45 +0900
>> Rakie Kim <rakie.kim@sk.com> wrote:
>>
>>> On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:

<--snip-->

> Hello Jonathan,
>
> Thank you for the deep insight into the HMAT parser code. As you
> mentioned, considering the current state where node1 is still
> registered as the initiator in sysfs despite the flag being 0, it
> seems highly likely that the kernel parser logic is not handling
> this specific situation gracefully.
>
>>> Because both HMAT and sysfs are exposing abnormal values, it was
>>> impossible for me to determine the true socket connections for CXL
>>> using this data.
>>>
>>>>> Even though the distance map shows node2 is physically closer to
>>>>> Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
>>>>> routing path strictly through Socket 1. Because the HMAT alone made it
>>>>> difficult to determine the exact physical socket connections on these
>>>>> systems, I ended up using the current CXL driver-based approach.
>>>>
>>>> Are the HMAT latencies and bandwidths all there? Or are some missing
>>>> and you have to use SLIT (which generally is garbage for historical
>>>> reasons of tuning SLIT to particular OS behaviour).
>>>
>>> The HMAT latencies and bandwidths are present, but the values seem
>>> broken. Here is the latency table:
>>>
>>>     Init->Target | node0 | node1 | node2  | node3
>>>     node0        | 0x38B | 0x89F | 0x9C4  | 0x3AFC
>>>     node1        | 0x89F | 0x38B | 0x3AFC | 0x4268
>>
>> Yeah. That would do it... Looks like that final value is garbage.

Hi Rakie,

So I talked to the Intel BIOS folks, and apparently for devices that are
not hot-plugged (with memory ranges provided in SRAT), those HMAT values
are end-to-end values and not just CPU to Generic Port. That's why they
look so much bigger.
So there are a couple of things we'll have to consider:

1. Make sure that Intel, AMD, and ARM HMATs are all created the same way
   and that this is the agreed-on way to do this. Hopefully someone from
   the AMD and ARM vendors can comment. We all should get on the same
   page for the CXL kernel code to work properly.

2. Add code in the CXL driver to detect whether the range is in SRAT and
   then skip the end-to-end perf calculation if that is the case.

DJ

<--snip-->
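Dave's second point can be sketched as follows. This is a hedged Python illustration only; `range_in_srat` and `effective_perf` are hypothetical names, the SRAT ranges are made up, and the real logic would live in the CXL driver's C code.

```python
# Sketch of point 2 above: if a CXL region's address range is already
# described by an SRAT Memory Affinity entry, the firmware's HMAT numbers
# are end-to-end, so the driver should use them as-is instead of adding
# its own CPU -> Generic Port -> device calculation.
def range_in_srat(start, size, srat_ranges):
    """True if [start, start+size) is fully covered by one SRAT entry."""
    end = start + size
    return any(s <= start and end <= s + sz for (s, sz) in srat_ranges)

# Hypothetical SRAT memory ranges: (base, size)
SRAT = [(0x0, 0x8000_0000), (0x10_0000_0000, 0x8_0000_0000)]

def effective_perf(start, size, hmat_value, gen_port_penalty):
    # End-to-end value already reported by firmware: take it unchanged.
    if range_in_srat(start, size, SRAT):
        return hmat_value
    # Hot-plugged range: combine the Generic Port value with the
    # downstream link penalty computed by the driver.
    return hmat_value + gen_port_penalty
```

The design point is simply that the same HMAT field means two different things depending on whether the range was present at boot, so the driver must branch on that before doing any arithmetic.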
On Thu, 26 Mar 2026 14:41:32 -0700 Dave Jiang <dave.jiang@intel.com> wrote:

> On 3/26/26 1:54 AM, Rakie Kim wrote:
> > On Wed, 25 Mar 2026 12:33:50 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:

<--snip-->

> >>>> The HMAT latencies and bandwidths are present, but the values seem
> >>>> broken. Here is the latency table:
> >>>>
> >>>>     Init->Target | node0 | node1 | node2  | node3
> >>>>     node0        | 0x38B | 0x89F | 0x9C4  | 0x3AFC
> >>>>     node1        | 0x89F | 0x38B | 0x3AFC | 0x4268
> >>>
> >>> Yeah. That would do it... Looks like that final value is garbage.
>
> Hi Rakie,
> So I talked to the Intel BIOS folks, and apparently for devices that
> are not hot-plugged (with memory ranges provided in SRAT), those HMAT
> values are end-to-end values and not just CPU to Generic Port. That's
> why they look so much bigger. So there are a couple of things we'll
> have to consider:
>
> 1. Make sure that Intel, AMD, and ARM HMATs are all created the same
>    way and that this is the agreed-on way to do this. Hopefully someone
>    from the AMD and ARM vendors can comment. We all should get on the
>    same page for the CXL kernel code to work properly.
>
> 2. Add code in the CXL driver to detect whether the range is in SRAT
>    and then skip the end-to-end perf calculation if that is the case.
>
> DJ

Hello Dave,

Thank you for reaching out to the Intel BIOS folks directly. Knowing
that the HMAT values represent end-to-end latency completely explains
why the numbers seemed so disproportionately large.

I strongly agree with your two points. Establishing a consensus across
all architecture vendors (Intel, AMD, ARM) on how these values are
interpreted is crucial. Also, adding logic to the CXL driver to detect
the SRAT presence and skip redundant calculations sounds like the exact
right direction.

I have posted the detailed SRAT and HMAT information at the link below:
https://lore.kernel.org/all/20260330025914.361-1-rakie.kim@sk.com/

Rakie Kim

> <--snip-->
On 3/26/26 2:41 PM, Dave Jiang wrote:

> On 3/26/26 1:54 AM, Rakie Kim wrote:
>> On Wed, 25 Mar 2026 12:33:50 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:

<--snip-->

>>>>> The HMAT latencies and bandwidths are present, but the values seem
>>>>> broken. Here is the latency table:
>>>>>
>>>>>     Init->Target | node0 | node1 | node2  | node3
>>>>>     node0        | 0x38B | 0x89F | 0x9C4  | 0x3AFC
>>>>>     node1        | 0x89F | 0x38B | 0x3AFC | 0x4268
>>>>
>>>> Yeah. That would do it... Looks like that final value is garbage.
>
> Hi Rakie,
> So I talked to the Intel BIOS folks, and apparently for devices that
> are not hot-plugged (with memory ranges provided in SRAT), those HMAT
> values are end-to-end values and not just CPU to Generic Port. That's
> why they look so much bigger. So there are a couple of things we'll
> have to consider:
>
> 1. Make sure that Intel, AMD, and ARM HMATs are all created the same
>    way and that this is the agreed-on way to do this. Hopefully someone
>    from the AMD and ARM vendors can comment. We all should get on the
>    same page for the CXL kernel code to work properly.
>
> 2. Add code in the CXL driver to detect whether the range is in SRAT
>    and then skip the end-to-end perf calculation if that is the case.

After further talking to Jonathan, I don't think at least this part is
an issue. The devices that are attached at boot do not have Generic
Ports in the SRAT.

> DJ

<--snip-->
On Thu, 26 Mar 2026 15:19:49 -0700 Dave Jiang <dave.jiang@intel.com> wrote:

> On 3/26/26 2:41 PM, Dave Jiang wrote:
> > On 3/26/26 1:54 AM, Rakie Kim wrote:

<--snip-->
> > Hi Rakie,
> > So I talked to the Intel BIOS folks, and apparently for devices that
> > are not hot-plugged (with memory ranges provided in SRAT), those HMAT
> > values are end-to-end values and not just CPU to Generic Port. That's
> > why they look so much bigger. So there are a couple of things we'll
> > have to consider:
> >
> > 1. Make sure that Intel, AMD, and ARM HMATs are all created the same
> >    way and that this is the agreed-on way to do this. Hopefully
> >    someone from the AMD and ARM vendors can comment. We all should get
> >    on the same page for the CXL kernel code to work properly.
> >
> > 2. Add code in the CXL driver to detect whether the range is in SRAT
> >    and then skip the end-to-end perf calculation if that is the case.
>
> After further talking to Jonathan, I don't think at least this part is
> an issue. The devices that are attached at boot do not have Generic
> Ports in the SRAT.

Hello Dave,

Understood. I'm glad that the discussion with Jonathan helped clarify
the situation regarding the Generic Ports in the SRAT. Thank you for the
quick update on this.

Thanks again for looking into this so thoroughly and keeping me in the
loop.

Rakie Kim
Hello Rakie! I hope you have been doing well. Thank you for this RFC, I
think it is a very interesting idea.

[...snip...]

> Consider a dual-socket system:
>
>       node0               node1
>     +-------+           +-------+
>     | CPU 0 |-----------| CPU 1 |
>     +-------+           +-------+
>     | DRAM0 |           | DRAM1 |
>     +---+---+           +---+---+
>         |                   |
>     +---+---+           +---+---+
>     | CXL 0 |           | CXL 1 |
>     +-------+           +-------+
>       node2               node3
>
> Assuming local DRAM provides 300 GB/s and local CXL provides 100 GB/s,
> the effective bandwidth varies significantly from the perspective of
> each CPU due to inter-socket interconnect penalties.
>
> Local device capabilities (GB/s) vs. cross-socket effective bandwidth:
>
>            0    1    2    3
>     CPU 0  300  150  100   50
>     CPU 1  150  300   50  100
>
> A reasonable global weight vector reflecting the base capabilities is:
>
>     node0=3 node1=3 node2=1 node3=1
>
> However, because these configured node weights do not account for
> interconnect degradation between sockets, applying them flatly to all
> sources yields the following effective map from each CPU's perspective:
>
>            0  1  2  3
>     CPU 0  3  3  1  1
>     CPU 1  3  3  1  1
>
> This does not account for the interconnect penalty (e.g., node0->node1
> drops 300->150, node0->node3 drops 100->50) and thus forces allocations
> that cause a mismatch with actual performance.
>
> This patch makes weighted interleave socket-aware. Before weighting is
> applied, the candidate nodes are restricted to the current socket; only
> if no eligible local nodes remain does the policy fall back to the
> wider set.

So when I saw this, I thought the idea was that we would attempt an
allocation with these socket-aware weights, and upon failure, fall back
to the global weights that are set so that we can try to fulfill the
allocation from cross-socket nodes.
However, reading the implementation in 4/4, it seems like what is meant
by "fallback" here is not in the sense of a fallback allocation, but in
the sense of "if there is a misconfiguration and the intersection
between policy nodes and the CPU's package is empty, use the global
nodes instead".

Am I understanding this correctly?

And, it seems like what this also means is that under sane
configurations, there is no more cross-socket memory allocation, since
it will always try to fulfill it from the local node.

> Even if the configured global weights remain identically set:
>
>     node0=3 node1=3 node2=1 node3=1
>
> The resulting effective map from the perspective of each CPU becomes:
>
>            0  1  2  3
>     CPU 0  3  0  1  0
>     CPU 1  0  3  0  1
>
> Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
> node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
> effective bandwidth, preserves NUMA locality, and reduces cross-socket
> traffic.

In that sense I thought the word "prefer" was a bit confusing, since I
thought it would mean that it would try to fulfill the allocations from
within a package first, then fall back to remote packages if that
failed. (Or maybe I am just misunderstanding your explanation. Please do
let me know if that is the case :-) )

If what I understand is the case, I think this is the same thing as just
restricting allocations to be socket-local. I also wonder if this idea
applies to other mempolicies as well (i.e. unweighted interleave).

I think we should consider what the expected and desirable behavior is
when one socket is fully saturated but the other socket is empty. In my
mind this is no different from considering within-package remote NUMA
allocations; the tradeoff becomes between reclaiming locally and keeping
allocations local, vs. skipping reclaim and consuming free memory while
eating the remote access latency, similar to zone_reclaim_mode
(package_reclaim_mode?
;-) )

In my mind (without doing any benchmarking myself or looking at the
numbers) I imagine that there are some scenarios where we actually do
want cross-socket allocations, like in the example above when we have
very asymmetric saturation across sockets. Is this something that could
be worth benchmarking as well?

I will end by saying that in the normal case (sockets have similar
saturation) I think this series is a definite win and improvement to
weighted interleave. I just was curious whether we can handle the
worst-case scenarios.

Thank you again for the series. Have a great day!
Joshua
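The effective maps discussed in this exchange follow mechanically from masking the global weights with each CPU's package nodes. A minimal sketch, assuming the example topology from the cover letter (node0+node2 on socket 0, node1+node3 on socket 1); the dict-based representation is an illustration, not the kernel's nodemask code:

```python
# Socket-aware view of the global weighted-interleave vector: keep the
# configured weight for nodes in the caller's package, zero the rest.
GLOBAL_WEIGHTS = {0: 3, 1: 3, 2: 1, 3: 1}
PACKAGE_NODES = {0: {0, 2}, 1: {1, 3}}  # cpu node -> nodes on same socket

def effective_weights(cpu_node):
    local = PACKAGE_NODES[cpu_node]
    return {n: (w if n in local else 0) for n, w in GLOBAL_WEIGHTS.items()}

print(effective_weights(0))  # CPU 0: {0: 3, 1: 0, 2: 1, 3: 0}
print(effective_weights(1))  # CPU 1: {0: 0, 1: 3, 2: 0, 3: 1}
```

This reproduces the per-CPU tables quoted above, and also makes Joshua's point concrete: with the remote entries zeroed, no allocation is ever spread cross-socket under a sane configuration.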
On Mon, 16 Mar 2026 08:19:32 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> Hello Rakie! I hope you have been doing well. Thank you for this
> RFC, I think it is a very interesting idea.

Hello Joshua,

I hope you are doing well. Thanks for your review and feedback on this
RFC.

> [...snip...]
>
> > This patch makes weighted interleave socket-aware. Before weighting is
> > applied, the candidate nodes are restricted to the current socket; only
> > if no eligible local nodes remain does the policy fall back to the
> > wider set.
> So when I saw this, I thought the idea was that we would attempt an
> allocation with these socket-aware weights, and upon failure, fall back
> to the global weights that are set so that we can try to fulfill the
> allocation from cross-socket nodes.
>
> However, reading the implementation in 4/4, it seems like what is meant
> by "fallback" here is not in the sense of a fallback allocation, but
> in the sense of "if there is a misconfiguration and the intersection
> between policy nodes and the CPU's package is empty, use the global
> nodes instead".
>
> Am I understanding this correctly?
>
> And, it seems like what this also means is that under sane
> configurations, there is no more cross-socket memory allocation, since
> it will always try to fulfill it from the local node.

Your analysis of the code in patch 4/4 is exactly correct. I apologize
for using the term "fallback" in the cover letter, which caused some
confusion. As you understood, the current implementation strictly
restricts allocations to the local socket to avoid cross-socket traffic.

> > Even if the configured global weights remain identically set:
> >
> >     node0=3 node1=3 node2=1 node3=1
> >
> > The resulting effective map from the perspective of each CPU becomes:
> >
> >            0  1  2  3
> >     CPU 0  3  0  1  0
> >     CPU 1  0  3  0  1
> >
> > Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
> > node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
> > effective bandwidth, preserves NUMA locality, and reduces cross-socket
> > traffic.
>
> In that sense I thought the word "prefer" was a bit confusing, since I
> thought it would mean that it would try to fulfill the allocations
> from within a package first, then fall back to remote packages if that
> failed. (Or maybe I am just misunderstanding your explanation.
> Please do let me know if that is the case :-) )
>
> If what I understand is the case, I think this is the same thing as
> just restricting allocations to be socket-local. I also wonder if
> this idea applies to other mempolicies as well (i.e. unweighted
> interleave).

Again, I apologize for the confusion caused by words like "prefer" and
"fallback" in the commit message. Your understanding is correct; the
current code strictly restricts allocations to the socket-local nodes.

To determine where memory may be allocated within a socket, the code
uses a function named policy_resolve_package_nodes(). As described in
the comments, the logic works as follows:

1. Success case: it takes the intersection of the current CPU's package
   nodes and the user's preselected policy nodes. If the intersection is
   not empty, it uses these local nodes.

2. Failure case: if the intersection is empty (e.g. the user opted out
   of the current package), it finds the package of another node in the
   policy nodes and takes the intersection again. If this also yields an
   empty set, it falls back completely to the original global policy
   nodes.

In this early version, the handling of various detailed cases is still
insufficient. Also, as you pointed out, applying this strict local
restriction directly to other policies like unweighted interleave might
be difficult, as it could conflict with the original purpose of
interleaving. I plan to consider these aspects further and prepare a
more complete design.

> I think we should consider what the expected and desirable behavior is
> when one socket is fully saturated but the other socket is empty. In my
> mind this is no different from considering within-package remote NUMA
> allocations; the tradeoff becomes between reclaiming locally and
> keeping allocations local, vs. skipping reclaim and consuming free
> memory while eating the remote access latency, similar to
> zone_reclaim_mode (package_reclaim_mode?
> ;-) )

This is an issue I have been thinking about since the early design
phase, and it must be resolved to improve this patch series. The
trade-off between forcing local memory reclaim to stay local versus
accepting the latency penalty of using a remote socket is a point we
need to address. I will continue to think about how to handle this
properly.

> In my mind (without doing any benchmarking myself or looking at the
> numbers) I imagine that there are some scenarios where we actually do
> want cross-socket allocations, like in the example above when we have
> very asymmetric saturation across sockets. Is this something that could
> be worth benchmarking as well?

Your suggestion is valid and worth considering. I am currently analyzing
the behavior of this feature under various workloads. I will also
consider the asymmetric saturation scenarios you suggested.

> I will end by saying that in the normal case (sockets have similar
> saturation) I think this series is a definite win and improvement to
> weighted interleave. I just was curious whether we can handle the
> worst-case scenarios.
>
> Thank you again for the series. Have a great day!
> Joshua

Thanks again for the review. I will prepare a more considered design for
the next version based on these points.

Rakie Kim
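The resolution steps Rakie describes for policy_resolve_package_nodes() can be sketched as follows. This is illustrative Python only: the real function operates on kernel nodemasks in mm/mempolicy.c, and the set-based helper, along with the example package tables, are assumptions for demonstration.

```python
# Sketch of the intersection/fallback resolution for socket-aware
# weighted interleave, using the dual-socket example topology.
PACKAGE_OF = {0: 0, 1: 1, 2: 0, 3: 1}      # node -> package id
PACKAGE_NODES = {0: {0, 2}, 1: {1, 3}}     # package id -> member nodes

def resolve_package_nodes(current_cpu_node, policy_nodes):
    # 1. Success case: intersect the current package with the policy nodes.
    local = PACKAGE_NODES[PACKAGE_OF[current_cpu_node]] & policy_nodes
    if local:
        return local
    # 2. Failure case: the user opted out of the current package, so try
    #    the package of another node found in the policy nodes.
    for node in sorted(policy_nodes):
        other = PACKAGE_NODES[PACKAGE_OF[node]] & policy_nodes
        if other:
            return other
    # If nothing intersects, fall back to the original global policy nodes.
    return set(policy_nodes)

# Running on socket 0 with all nodes allowed -> socket-local subset only.
assert resolve_package_nodes(0, {0, 1, 2, 3}) == {0, 2}
# User excluded socket 0 entirely -> use the other package's nodes.
assert resolve_package_nodes(0, {1, 3}) == {1, 3}
```

Written this way, Joshua's observation is easy to see: the first branch always wins under a sane configuration, so the later branches only trigger on misconfiguration, never as an allocation-pressure fallback.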
On Mon, Mar 16, 2026 at 08:19:32AM -0700, Joshua Hahn wrote:
>
> In that sense I thought the word "prefer" was a bit confusing, since I
> thought it would mean that it would try to fulfill the allocations
> from within a package first, then fall back to remote packages if that
> failed. (Or maybe I am just misunderstanding your explanation. Please
> do let me know if that is the case :-) )
>
> If what I understand is the case, I think this is the same thing as
> just restricting allocations to be socket-local. I also wonder if
> this idea applies to other mempolicies as well (i.e. unweighted
> interleave).

I was thinking about this as well, and in my head I think you have to
consider a 2x2 situation:

    cpuset            | multi-socket-cpu    single-socket-cpu
    ==============================================================
    single-socket-mem | mem-package         mem-package
    --------------------------------------------------------------
    multi-socket-mem  | global              global
    --------------------------------------------------------------

But I think this reduces to "the cpuset nodes dictate the weights used",
which should already be the case with the existing code.

I think you are right that we need to be very explicit about the
fallback semantics here, but that may just be a matter of dictating
whether the allocation falls back or prefers direct reclaim to push
pages out of their requested nodes.

~Gregory
On Mon, 16 Mar 2026 15:45:24 -0400 Gregory Price <gourry@gourry.net> wrote:

> On Mon, Mar 16, 2026 at 08:19:32AM -0700, Joshua Hahn wrote:

<--snip-->

> I was thinking about this as well, and in my head I think you have to
> consider a 2x2 situation:
>
>     cpuset            | multi-socket-cpu    single-socket-cpu
>     ==============================================================
>     single-socket-mem | mem-package         mem-package
>     --------------------------------------------------------------
>     multi-socket-mem  | global              global
>     --------------------------------------------------------------
>
> But I think this reduces to "the cpuset nodes dictate the weights
> used", which should already be the case with the existing code.

Hello Gregory,

Thanks for your additional feedback. I agree with your analysis. The
final behavior should follow the nodes dictated by the cpuset or
mempolicy configurations.

> I think you are right that we need to be very explicit about the
> fallback semantics here, but that may just be a matter of dictating
> whether the allocation falls back or prefers direct reclaim to push
> pages out of their requested nodes.
>
> ~Gregory

As you and Joshua pointed out, making the fallback semantics explicit is
the most critical issue for this patch series. We need a clear policy to
decide whether the allocation should fall back to a remote node or force
direct reclaim to keep the allocation local.
I will explicitly define these fallback semantics and address this
trade-off in the design for the next version.

Thanks again for your time and review.

Rakie Kim
On Mon, Mar 16, 2026 at 02:12:48PM +0900, Rakie Kim wrote:
> This patch series is an RFC to propose and discuss the overall design
> and concept of a socket-aware weighted interleave mechanism. As there
> are areas requiring further refinement, the primary goal at this stage
> is to gather feedback on the architectural approach rather than focusing
> on fine-grained implementation details.
>

I gave this a brief browse this morning, and I rather like this
approach, more so than the original proposals for socket-awareness
that encoded the weights in a 2-dimensional array.

I think this would be a great discussion at LSF, and I wonder if
something like memory-package could be used for more purposes than just
weighted interleave.

~Gregory
On Mon, 16 Mar 2026 10:01:48 -0400 Gregory Price <gourry@gourry.net> wrote:

> On Mon, Mar 16, 2026 at 02:12:48PM +0900, Rakie Kim wrote:
> > This patch series is an RFC to propose and discuss the overall design
> > and concept of a socket-aware weighted interleave mechanism. As there
> > are areas requiring further refinement, the primary goal at this stage
> > is to gather feedback on the architectural approach rather than focusing
> > on fine-grained implementation details.
> >
>
> I gave this a brief browse this morning, and I rather like this
> approach, more so than the original proposals for socket-awareness
> that encoded the weights in a 2-dimensional array.

Hello Gregory,

Thanks for your review and feedback. I also think this approach is much
better than the previous 2-dimensional array idea. Since this is still
an early draft, I hope this code will be developed into a better design
through community discussions.

>
> I think this would be a great discussion at LSF, and I wonder if
> something like memory-package could be used for more purposes than just
> weighted interleave.
>
> ~Gregory

Honggyu Kim and I are actually preparing to propose this exact topic for
the upcoming LSF/MM/BPF summit. However, I accidentally missed adding
lsf-pc@lists.linux-foundation.org to the CC list. I will re-post or
forward this to the LSF PC list soon.

You are exactly right about memory-package. When I first designed it, I
wanted to use it for memory tiering and other areas, not just for
weighted interleave. For now, weighted interleave is the only
implemented use case, but I hope to keep improving it so it can be used
in other subsystems as well.

Thanks again for your time and review.

Rakie Kim