drivers/acpi/numa/srat.c | 45 +++++++++++++++++++++++++--------------- 1 file changed, 28 insertions(+), 17 deletions(-)
On CXL platforms, the Static Resource Affinity Table (SRAT) may not
cover memory affinity information for all the CXL memory regions. Since
each CXL memory region is enumerated via a CXL Fixed Memory Window
Structure (CFMWS), during early boot the kernel parses the CFMWS tables
to find all CXL memory regions and sets a NUMA node for each of them.
This memory affinity information of CXL memory regions is later used by
the CXL ACPI driver.
The CFMWS table doesn't provide the memory affinity information either.
Currently the kernel assigns a 'faked' NUMA node for each CXL memory
region, starting from the next node of the highest node that is
enumerated via the SRAT. This can potentially increase the maximum NUMA
node ID of the platform ('nr_node_ids') a lot. E.g., on a GNR platform
with 4 NUMA nodes and 18 CFMWS tables, this bumps the 'nr_node_ids' to
22.
Increasing the 'nr_node_ids' has side effects. For instance, it is
widely used by the kernel for "highest possible NUMA node" based memory
allocations. It also impacts userspace ABIs, e.g., some NUMA memory
related system calls such as 'get_mempolicy' which requires 'maxnode'
not being smaller than the 'nr_node_ids'.
Currently parsing CFMWS tables and assigning faked NUMA node at boot is
done unconditionally. However, if the CXL ACPI driver is not enabled,
there will be no user of such memory affinity information of CXL memory
regions.
Change to only parsing the CFMWS tables at boot when CXL_ACPI is enabled
in Kconfig to avoid the unnecessary cost of bumping up 'nr_node_ids'.
E.g., on the aforementioned GNR platform, the "Slab" in /proc/meminfo is
reduced with this change (when CXL_ACPI is off):
w/ this change w/o
Slab 900488 kB 923660 kB
Signed-off-by: Kai Huang <kai.huang@intel.com>
---
drivers/acpi/numa/srat.c | 45 +++++++++++++++++++++++++---------------
1 file changed, 28 insertions(+), 17 deletions(-)
diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index aa87ee1583a4..c008fe1f15f7 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -425,6 +425,7 @@ acpi_parse_memory_affinity(union acpi_subtable_headers *header,
return 0;
}
+#if IS_ENABLED(CONFIG_CXL_ACPI)
static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
void *arg, const unsigned long table_end)
{
@@ -476,6 +477,31 @@ static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
return 0;
}
+static void __init acpi_numa_init_cxl(void)
+{
+ int i, fake_pxm;
+
+ /*
+ * CXL Fixed Memory Window Structures (CFMWS) must be parsed
+ * after the SRAT. Create NUMA Nodes for CXL memory ranges that
+ * are defined in the CFMWS and not already defined in the SRAT.
+ * Initialize a fake_pxm as the first available PXM to emulate.
+ */
+
+ /* fake_pxm is the next unused PXM value after SRAT parsing */
+ for (i = 0, fake_pxm = -1; i < MAX_NUMNODES; i++) {
+ if (node_to_pxm_map[i] > fake_pxm)
+ fake_pxm = node_to_pxm_map[i];
+ }
+ last_real_pxm = fake_pxm;
+ fake_pxm++;
+ acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
+ &fake_pxm);
+}
+#else
+static void __init acpi_numa_init_cxl(void) {}
+#endif
+
void __init __weak
acpi_numa_x2apic_affinity_init(struct acpi_srat_x2apic_cpu_affinity *pa)
{
@@ -602,7 +628,7 @@ acpi_table_parse_srat(enum acpi_srat_type id,
int __init acpi_numa_init(void)
{
- int i, fake_pxm, cnt = 0;
+ int cnt = 0;
if (acpi_disabled)
return -EINVAL;
@@ -640,22 +666,7 @@ int __init acpi_numa_init(void)
/* SLIT: System Locality Information Table */
acpi_table_parse(ACPI_SIG_SLIT, acpi_parse_slit);
- /*
- * CXL Fixed Memory Window Structures (CFMWS) must be parsed
- * after the SRAT. Create NUMA Nodes for CXL memory ranges that
- * are defined in the CFMWS and not already defined in the SRAT.
- * Initialize a fake_pxm as the first available PXM to emulate.
- */
-
- /* fake_pxm is the next unused PXM value after SRAT parsing */
- for (i = 0, fake_pxm = -1; i < MAX_NUMNODES; i++) {
- if (node_to_pxm_map[i] > fake_pxm)
- fake_pxm = node_to_pxm_map[i];
- }
- last_real_pxm = fake_pxm;
- fake_pxm++;
- acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
- &fake_pxm);
+ acpi_numa_init_cxl();
if (cnt < 0)
return cnt;
base-commit: af4e9ef3d78420feb8fe58cd9a1ab80c501b3c08
--
2.53.0
On Thu, Mar 05, 2026 at 10:33:42AM +1300, Kai Huang wrote: > Increasing the 'nr_node_ids' has side effects. For instance, it is > widely used by the kernel for "highest possible NUMA node" based memory > allocations. It also impacts userspace ABIs, e.g., some NUMA memory > related system calls such as 'get_mempolicy' which requires 'maxnode' > not being smaller than the 'nr_node_ids'. > Is this a Linux issue or a Firmware issue? Is GNR exporting more CFMWS than it should? Is your SRAT missing entries for CFMWS that are otherwise present? Are the CFMWS empty? (is that even valid) > E.g., on the aforementioned GNR platform, the "Slab" in /proc/meminfo is > reduced with this change (when CXL_ACPI is off): > > w/ this change w/o > > Slab 900488 kB 923660 kB > This is a good effect, but I still question the premise. We don't usually want #ifdef's inside of .c files if we can avoid it. ~Gregory
On Wed, Mar 04, 2026 at 05:33:26PM -0500, Gregory Price wrote: > On Thu, Mar 05, 2026 at 10:33:42AM +1300, Kai Huang wrote: > > Increasing the 'nr_node_ids' has side effects. For instance, it is > > widely used by the kernel for "highest possible NUMA node" based memory > > allocations. It also impacts userspace ABIs, e.g., some NUMA memory > > related system calls such as 'get_mempolicy' which requires 'maxnode' > > not being smaller than the 'nr_node_ids'. > > > > Is this a Linux issue or a Firmware issue? IIUC BIOS creates the CEDT based on the hardware it 'sees' as present. This patch is describing the case (weird as it seems to me) where we then boot a system with ACPI and NUMA enabled but CXL_ACPI disabled. So, I don't think we can blame BIOS. > > Is GNR exporting more CFMWS than it should? Not sure of any limits on flavors of CFMWS's a BIOS can offer. If BIOS can carve out a window, it can create a CFMWS. Not clear how that matters to the issue. > > Is your SRAT missing entries for CFMWS that are otherwise present? > > Are the CFMWS empty? (is that even valid) Why this line of questioning ;) I see the problem as a bit simpler. We have other code that tells us if the CFMWS's are valid, etc, but the point here is, we are not going to use these CFMWS's so stop the parse as early as possible, like right here as Kai has done. > > > E.g., on the aforementioned GNR platform, the "Slab" in /proc/meminfo is > > reduced with this change (when CXL_ACPI is off): > > > > w/ this change w/o > > > > Slab 900488 kB 923660 kB > > > > This is a good effect, but I still question the premise. > > We don't usually want #ifdef's inside of .c files if we can avoid it. I thought similar, but for early init and static helpers, this #if IS_ENABLE(..) wrapper is common. Reviewed-by: Alison Schofield <alison.schofield@intel.com> > > ~Gregory
On Wed, 2026-03-04 at 15:20 -0800, Alison Schofield wrote: > On Wed, Mar 04, 2026 at 05:33:26PM -0500, Gregory Price wrote: > > On Thu, Mar 05, 2026 at 10:33:42AM +1300, Kai Huang wrote: > > > Increasing the 'nr_node_ids' has side effects. For instance, it is > > > widely used by the kernel for "highest possible NUMA node" based memory > > > allocations. It also impacts userspace ABIs, e.g., some NUMA memory > > > related system calls such as 'get_mempolicy' which requires 'maxnode' > > > not being smaller than the 'nr_node_ids'. > > > > > > > > Is this a Linux issue or a Firmware issue? There are two issues actually: 1) on my testing platform, we can see CFMWS tables but there's no actual CXL memory inserted AFAICT; 2) The kernel unconditionally bumps the 'nr_node_ids' even CXL_ACPI driver is not able to detect any real CXL memory (or it is CXL port/mem driver cannot detect, for which I am not very familiar with). We can argue the 1) is FW issue. But this patch aims to improve the 2), which is an improvement regardless the firmware has issue or not: If CXL_ACPI isn't even enabled in Kconfig, there's no need to detect CFMWS tables and assign a 'faked NUMA node' during boot. IIUC this is even true on a perfect fine machine with actual CXL memory installed, since the kernel still have an option to disable CXL_ACPI in Kconfig. > > IIUC BIOS creates the CEDT based on the hardware it 'sees' as present. > > This patch is describing the case (weird as it seems to me) where we > then boot a system with ACPI and NUMA enabled but CXL_ACPI disabled. When I found this issue the CXL_ACPI is enabled in Kconfig, but there's no actual CXL memory being discovered AFAICT. #cxl list tells me nothing. > > So, I don't think we can blame BIOS. As said above, I think it's an improvement in the kernel even on a good machine with CXL memory inserted. > > > > > Is GNR exporting more CFMWS than it should? > Not sure of any limits on flavors of CFMWS's a BIOS can offer. > If BIOS can carve out a window, it can create a CFMWS. > Not clear how that matters to the issue. > > > > > Is your SRAT missing entries for CFMWS that are otherwise present? > > > > Are the CFMWS empty? (is that even valid) > > Why this line of questioning ;) I see the problem as a bit simpler. > We have other code that tells us if the CFMWS's are valid, etc, but > the point here is, we are not going to use these CFMWS's so stop > the parse as early as possible, like right here as Kai has done. Yes. :-) > > > > > > E.g., on the aforementioned GNR platform, the "Slab" in /proc/meminfo is > > > reduced with this change (when CXL_ACPI is off): > > > > > > w/ this change w/o > > > > > > Slab 900488 kB 923660 kB > > > > > > > This is a good effect, but I still question the premise. > > > > We don't usually want #ifdef's inside of .c files if we can avoid it. > > I thought similar, but for early init and static helpers, this > #if IS_ENABLE(..) wrapper is common. Yes. They are early init and static helpers, so no header file is involved here. I also see #if defined(CONFIG_X86) || defined(CONFIG_ARM64) is used for acpi_parse_gi_affinity() in this file. > > > Reviewed-by: Alison Schofield <alison.schofield@intel.com> Thanks!
On Wed, Mar 04, 2026 at 04:04:57PM -0800, Huang, Kai wrote: > On Wed, 2026-03-04 at 15:20 -0800, Alison Schofield wrote: > > On Wed, Mar 04, 2026 at 05:33:26PM -0500, Gregory Price wrote: > > > On Thu, Mar 05, 2026 at 10:33:42AM +1300, Kai Huang wrote: > > > > Increasing the 'nr_node_ids' has side effects. For instance, it is > > > > widely used by the kernel for "highest possible NUMA node" based memory > > > > allocations. It also impacts userspace ABIs, e.g., some NUMA memory > > > > related system calls such as 'get_mempolicy' which requires 'maxnode' > > > > not being smaller than the 'nr_node_ids'. > > > > > > > > > > > > Is this a Linux issue or a Firmware issue? > > There are two issues actually: > > 1) on my testing platform, we can see CFMWS tables but there's no actual CXL > memory inserted AFAICT; > > 2) The kernel unconditionally bumps the 'nr_node_ids' even CXL_ACPI driver > is not able to detect any real CXL memory (or it is CXL port/mem driver > cannot detect, for which I am not very familiar with). > > We can argue the 1) is FW issue. But this patch aims to improve the 2), > which is an improvement regardless the firmware has issue or not: > > If CXL_ACPI isn't even enabled in Kconfig, there's no need to detect CFMWS > tables and assign a 'faked NUMA node' during boot. > > IIUC this is even true on a perfect fine machine with actual CXL memory > installed, since the kernel still have an option to disable CXL_ACPI in > Kconfig. > > > > > IIUC BIOS creates the CEDT based on the hardware it 'sees' as present. > > > > This patch is describing the case (weird as it seems to me) where we > > then boot a system with ACPI and NUMA enabled but CXL_ACPI disabled. > > When I found this issue the CXL_ACPI is enabled in Kconfig, but there's no > actual CXL memory being discovered AFAICT. So this patch will not fix that case, because CXL_ACPI will be enabled. I'd like to take a look at that system Kai. I'll ping you off list. > > #cxl list > > tells me nothing. > > > > > So, I don't think we can blame BIOS. > > As said above, I think it's an improvement in the kernel even on a good > machine with CXL memory inserted. > > > > > > > > > Is GNR exporting more CFMWS than it should? > > Not sure of any limits on flavors of CFMWS's a BIOS can offer. > > If BIOS can carve out a window, it can create a CFMWS. > > Not clear how that matters to the issue. > > > > > > > > Is your SRAT missing entries for CFMWS that are otherwise present? > > > > > > Are the CFMWS empty? (is that even valid) > > > > Why this line of questioning ;) I see the problem as a bit simpler. > > We have other code that tells us if the CFMWS's are valid, etc, but > > the point here is, we are not going to use these CFMWS's so stop > > the parse as early as possible, like right here as Kai has done. > > Yes. :-) > > > > > > > > > > E.g., on the aforementioned GNR platform, the "Slab" in /proc/meminfo is > > > > reduced with this change (when CXL_ACPI is off): > > > > > > > > w/ this change w/o > > > > > > > > Slab 900488 kB 923660 kB > > > > > > > > > > This is a good effect, but I still question the premise. > > > > > > We don't usually want #ifdef's inside of .c files if we can avoid it. > > > > I thought similar, but for early init and static helpers, this > > #if IS_ENABLE(..) wrapper is common. > > Yes. They are early init and static helpers, so no header file is involved > here. > > I also see #if defined(CONFIG_X86) || defined(CONFIG_ARM64) is used for > acpi_parse_gi_affinity() in this file. > > > > > > > Reviewed-by: Alison Schofield <alison.schofield@intel.com> > > Thanks!
On Wed, 2026-03-04 at 16:44 -0800, Alison Schofield wrote: > > When I found this issue the CXL_ACPI is enabled in Kconfig, but there's no > > actual CXL memory being discovered AFAICT. > > So this patch will not fix that case, because CXL_ACPI will be enabled. > > I'd like to take a look at that system Kai. I'll ping you off list. Sent the machine login to you. I would say this patch is not aiming to fix any issue that is due to broken firmware(e.g., I didn't mention anything about that in the changelog). This is to improve the kernel even on a fine CXL-memory-capable machine when CXL_ACPI is off.
On Wed, Mar 04, 2026 at 03:20:11PM -0800, Alison Schofield wrote: > On Wed, Mar 04, 2026 at 05:33:26PM -0500, Gregory Price wrote: > > On Thu, Mar 05, 2026 at 10:33:42AM +1300, Kai Huang wrote: > > > Increasing the 'nr_node_ids' has side effects. For instance, it is > > > widely used by the kernel for "highest possible NUMA node" based memory > > > allocations. It also impacts userspace ABIs, e.g., some NUMA memory > > > related system calls such as 'get_mempolicy' which requires 'maxnode' > > > not being smaller than the 'nr_node_ids'. > > > > > > > > Is this a Linux issue or a Firmware issue? > > IIUC BIOS creates the CEDT based on the hardware it 'sees' as present. > > This patch is describing the case (weird as it seems to me) where we > then boot a system with ACPI and NUMA enabled but CXL_ACPI disabled. > > So, I don't think we can blame BIOS. > > > > > Is GNR exporting more CFMWS than it should? > Not sure of any limits on flavors of CFMWS's a BIOS can offer. > If BIOS can carve out a window, it can create a CFMWS. > Not clear how that matters to the issue. > > > > > Is your SRAT missing entries for CFMWS that are otherwise present? > > > > Are the CFMWS empty? (is that even valid) > > Why this line of questioning ;) I see the problem as a bit simpler. > We have other code that tells us if the CFMWS's are valid, etc, but > the point here is, we are not going to use these CFMWS's so stop > the parse as early as possible, like right here as Kai has done. > Mostly i'm wondering if this issue should be dealt with in the acpi code or if the issue is that we just don't want to figure out how to lazy-create these things instead of always creating them at __init. it does seem rational to build out support for CEDT entries if CXL_ACPI is built out, but this also means you can't otherwise load modules that would have made use of this information. This basically says if specifically CXL_ACPI is built out, the NUMA structure is forever lost - even though it's accurately described by BIOS. Maybe that's a rational decision, just kind of prodding a bit. ~Gregory
On Wed, 2026-03-04 at 18:56 -0500, Gregory Price wrote: > On Wed, Mar 04, 2026 at 03:20:11PM -0800, Alison Schofield wrote: > > On Wed, Mar 04, 2026 at 05:33:26PM -0500, Gregory Price wrote: > > > On Thu, Mar 05, 2026 at 10:33:42AM +1300, Kai Huang wrote: > > > > Increasing the 'nr_node_ids' has side effects. For instance, it is > > > > widely used by the kernel for "highest possible NUMA node" based memory > > > > allocations. It also impacts userspace ABIs, e.g., some NUMA memory > > > > related system calls such as 'get_mempolicy' which requires 'maxnode' > > > > not being smaller than the 'nr_node_ids'. > > > > > > > > > > > > Is this a Linux issue or a Firmware issue? > > > > IIUC BIOS creates the CEDT based on the hardware it 'sees' as present. > > > > This patch is describing the case (weird as it seems to me) where we > > then boot a system with ACPI and NUMA enabled but CXL_ACPI disabled. > > > > So, I don't think we can blame BIOS. > > > > > > > > Is GNR exporting more CFMWS than it should? > > Not sure of any limits on flavors of CFMWS's a BIOS can offer. > > If BIOS can carve out a window, it can create a CFMWS. > > Not clear how that matters to the issue. > > > > > > > > Is your SRAT missing entries for CFMWS that are otherwise present? > > > > > > Are the CFMWS empty? (is that even valid) > > > > Why this line of questioning ;) I see the problem as a bit simpler. > > We have other code that tells us if the CFMWS's are valid, etc, but > > the point here is, we are not going to use these CFMWS's so stop > > the parse as early as possible, like right here as Kai has done. > > > > Mostly i'm wondering if this issue should be dealt with in the acpi code > or if the issue is that we just don't want to figure out how to > lazy-create these things instead of always creating them at __init. > > it does seem rational to build out support for CEDT entries if CXL_ACPI > is built out, but this also means you can't otherwise load modules that > would have made use of this information. Besides CXL_ACPI, is there any other module(s) that uses this information for these CXL memory regions? > > This basically says if specifically CXL_ACPI is built out, the NUMA > structure is forever lost - even though it's accurately described by > BIOS. > The normal NUMA info described in SRAT is still there. It only avoids detecting CFMWS, which doesn't provide any NUMA info actually -- that's why kernel assigns a 'faked' NUMA node for each of them. So we are not losing anything AFAICT.
On 2026/3/5 08:14, Huang, Kai wrote: > On Wed, 2026-03-04 at 18:56 -0500, Gregory Price wrote: >> On Wed, Mar 04, 2026 at 03:20:11PM -0800, Alison Schofield wrote: >>> On Wed, Mar 04, 2026 at 05:33:26PM -0500, Gregory Price wrote: >>>> On Thu, Mar 05, 2026 at 10:33:42AM +1300, Kai Huang wrote: >>>>> Increasing the 'nr_node_ids' has side effects. For instance, it is >>>>> widely used by the kernel for "highest possible NUMA node" based memory >>>>> allocations. It also impacts userspace ABIs, e.g., some NUMA memory >>>>> related system calls such as 'get_mempolicy' which requires 'maxnode' >>>>> not being smaller than the 'nr_node_ids'. >>>>> >>> >>>> >>>> Is this a Linux issue or a Firmware issue? >>> >>> IIUC BIOS creates the CEDT based on the hardware it 'sees' as present. >>> >>> This patch is describing the case (weird as it seems to me) where we >>> then boot a system with ACPI and NUMA enabled but CXL_ACPI disabled. >>> >>> So, I don't think we can blame BIOS. >>> >>>> >>>> Is GNR exporting more CFMWS than it should? >>> Not sure of any limits on flavors of CFMWS's a BIOS can offer. >>> If BIOS can carve out a window, it can create a CFMWS. >>> Not clear how that matters to the issue. >>> >>>> >>>> Is your SRAT missing entries for CFMWS that are otherwise present? >>>> >>>> Are the CFMWS empty? (is that even valid) >>> >>> Why this line of questioning ;) I see the problem as a bit simpler. >>> We have other code that tells us if the CFMWS's are valid, etc, but >>> the point here is, we are not going to use these CFMWS's so stop >>> the parse as early as possible, like right here as Kai has done. >>> >> >> Mostly i'm wondering if this issue should be dealt with in the acpi code >> or if the issue is that we just don't want to figure out how to >> lazy-create these things instead of always creating them at __init. >> >> it does seem rational to build out support for CEDT entries if CXL_ACPI >> is built out, but this also means you can't otherwise load modules that >> would have made use of this information. > > Besides CXL_ACPI, is there any other module(s) that uses this information > for these CXL memory regions? > >> >> This basically says if specifically CXL_ACPI is built out, the NUMA >> structure is forever lost - even though it's accurately described by >> BIOS. >> > > The normal NUMA info described in SRAT is still there. It only avoids > detecting CFMWS, which doesn't provide any NUMA info actually -- that's why > kernel assigns a 'faked' NUMA node for each of them. Hi Kai, we met some performance issues when creating pods because of too much possible node without cxl memory inserted in our machine (machine Intel(R) Xeon(R) 6746E). This patch can workaround our issue. By the way, If insert real cxl memory but disable CXL_ACPI, how the kernel initilaizes the cxl memory and assigns the numa node? can you provide the related kernel souce code? Thanks! > > So we are not losing anything AFAICT.
> > By the way, If insert real cxl memory but disable CXL_ACPI, how the kernel > initilaizes the cxl memory and assigns the numa node? can you provide the > related kernel souce code? Sorry I am a bit confused what you want to do. If you have actual CXL memory installed, that means you want to use that? But if you want to use CXL memory, you cannot disable CXL_ACPI IIUC. If you don't want to use CXL memory in your deployment then you can either disable CXL memory in your BIOS or, with this patch, disable CXL_ACPI in your kernel.
On 2026/3/5 18:25, Huang, Kai wrote: > >> >> By the way, If insert real cxl memory but disable CXL_ACPI, how the kernel >> initilaizes the cxl memory and assigns the numa node? can you provide the >> related kernel souce code? > > Sorry I am a bit confused what you want to do. > " It only avoids detecting CFMWS, which doesn't provide any NUMA info actually -- that's why kernel assigns a 'faked' NUMA node for each of them." Sorry I misunderstand what you said above. I originally thought that the inserted cxl memory was assigned to a 'faked' NUMA node if disable CXL_ACPI. > If you have actual CXL memory installed, that means you want to use that? > > But if you want to use CXL memory, you cannot disable CXL_ACPI IIUC. > > If you don't want to use CXL memory in your deployment then you can either > disable CXL memory in your BIOS or, with this patch, disable CXL_ACPI in > your kernel. > I see. Thanks!
On Thu, Mar 05, 2026 at 12:14:52AM +0000, Huang, Kai wrote: > On Wed, 2026-03-04 at 18:56 -0500, Gregory Price wrote: > > > > This basically says if specifically CXL_ACPI is built out, the NUMA > > structure is forever lost - even though it's accurately described by > > BIOS. > > > > The normal NUMA info described in SRAT is still there. It only avoids > detecting CFMWS, which doesn't provide any NUMA info actually -- that's why > kernel assigns a 'faked' NUMA node for each of them. > > So we are not losing anything AFAICT. Well, I'm mostly confused why there are CEDT entries for hardware that presumably isn't even there - unless this platform is reserving space for future hotplug. Just want to make sure we're not adjusting for strange firmware behavior. The only platform we've seen this behavior on previously was QEMU, but that was because it never emitted SRAT, so i was wondering if there's odd firmware behavior going on (emitting CEDTs when it shouldn't) before we jump to dropping nodes that would have otherwise have been present on existing systems who might have compiled CXL_ACPI out. You are taking something away by nature of compiling something out by default that was previously not compiled out by default. ~Gregory
On Wed, 2026-03-04 at 19:29 -0500, Gregory Price wrote: > On Thu, Mar 05, 2026 at 12:14:52AM +0000, Huang, Kai wrote: > > On Wed, 2026-03-04 at 18:56 -0500, Gregory Price wrote: > > > > > > This basically says if specifically CXL_ACPI is built out, the NUMA > > > structure is forever lost - even though it's accurately described by > > > BIOS. > > > > > > > The normal NUMA info described in SRAT is still there. It only avoids > > detecting CFMWS, which doesn't provide any NUMA info actually -- that's why > > kernel assigns a 'faked' NUMA node for each of them. > > > > So we are not losing anything AFAICT. > > Well, I'm mostly confused why there are CEDT entries for hardware that > presumably isn't even there - unless this platform is reserving space > for future hotplug. > I think this should be the case. > Just want to make sure we're not adjusting for > strange firmware behavior. How to check whether it is "strange"? I dumped the ACPI table on that platform, and what I can see is there's CEDT table which contains couple of "[CXL Host Bridge Structure]" and "[CXL Fixed Memory Window Structure]". I pasted one for each of them here FYI: [024h 0036 001h] Subtable Type : 00 [CXL Host Bridge Structure] [025h 0037 001h] Reserved : 00 [026h 0038 002h] Length : 0020 [028h 0040 004h] Associated host bridge : 00000003 [02Ch 0044 004h] Specification version : 00000001 [030h 0048 004h] Reserved : 00000000 [034h 0052 008h] Register base : 00000000997F0000 [03Ch 0060 008h] Register length : 0000000000010000 [124h 0292 001h] Subtable Type : 01 [CXL Fixed Memory Window Structure] [125h 0293 001h] Reserved : 00 [126h 0294 002h] Length : 0034 [128h 0296 004h] Reserved : 00000000 [12Ch 0300 008h] Window base address : 0000002080000000 [134h 0308 008h] Window size : 0000017B00000000 [13Ch 0316 001h] Interleave Members (2^n) : 02 [13Dh 0317 001h] Interleave Arithmetic : 01 [13Eh 0318 002h] Reserved : 0000 [140h 0320 004h] Granularity : 00000000 [144h 0324 002h] Restrictions : 0006 [146h 0326 002h] QtgId : 0000 [148h 0328 004h] First Target : 00000003 [14Ch 0332 004h] Next Target : 00000023 [150h 0336 004h] Next Target : 00000043 [154h 0340 004h] Next Target : 00000053 > > The only platform we've seen this behavior on previously was QEMU, but > that was because it never emitted SRAT, > My machine has a legitimate SRAT. The kernel detects it has 4 nodes with 256 CPUs + 256G memory. > so i was wondering if there's > odd firmware behavior going on (emitting CEDTs when it shouldn't) before > we jump to dropping nodes that would have otherwise have been present > on existing systems who might have compiled CXL_ACPI out. I don't know how to check whether the BIOS "should" or "should not" generate CEDT. I can check my machine if you let me know what to check specifically? > > You are taking something away by nature of compiling something out by > default that was previously not compiled out by default. Yeah, and it is due to "there's a cost" if we don't compile out by default. Hope that justifies?
Huang, Kai wrote: > On Wed, 2026-03-04 at 19:29 -0500, Gregory Price wrote: > > On Thu, Mar 05, 2026 at 12:14:52AM +0000, Huang, Kai wrote: > > > On Wed, 2026-03-04 at 18:56 -0500, Gregory Price wrote: > > > > > > > > This basically says if specifically CXL_ACPI is built out, the NUMA > > > > structure is forever lost - even though it's accurately described by > > > > BIOS. > > > > > > > > > > The normal NUMA info described in SRAT is still there. It only avoids > > > detecting CFMWS, which doesn't provide any NUMA info actually -- that's why > > > kernel assigns a 'faked' NUMA node for each of them. > > > > > > So we are not losing anything AFAICT. > > > > Well, I'm mostly confused why there are CEDT entries for hardware that > > presumably isn't even there - unless this platform is reserving space > > for future hotplug. > > > > I think this should be the case. > > > Just want to make sure we're not adjusting for > > strange firmware behavior. > > How to check whether it is "strange"? > So these are fine. These are CXL hotplug windows and the expectation is that they *might* be populated in the future. SRAT can not make claims about future CXL hotplug (see NOTE). So the expecation for CXL hotplug is reserve some numa nodes that Linux can determine the affinity of dynamically with HMAT Generic Port and device CDAT information. NOTE: SRAT *does* make claims about the affinity of future *ACPI* Hotplug, but in that case the platform statically knows something about the configuration of memory that can possibly be plugged in the future. > > You are taking something away by nature of compiling something out by > > default that was previously not compiled out by default. > > Yeah, and it is due to "there's a cost" if we don't compile out by default. > > Hope that justifies? I think it makes sense that if you disable CXL hotplug by setting CONFIG_CXL_ACPI=n then no need to reserve numa ids. However, just do something like this rather than add ifdefs to the code: diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c index aa87ee1583a4..62d4a8df0b8c 100644 --- a/drivers/acpi/numa/srat.c +++ b/drivers/acpi/numa/srat.c @@ -654,8 +654,11 @@ int __init acpi_numa_init(void) } last_real_pxm = fake_pxm; fake_pxm++; - acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws, - &fake_pxm); + + /* No need to expand numa nodes if CXL is disabled */ + if (IS_ENABLED(CONFIG_CXL_ACPI)) + acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws, + &fake_pxm); if (cnt < 0) return cnt; The call to acpi_table_parse_cedt() will get skipped and the code for acpi_parse_cfmws() will get automatically compiled out of the file. At the same time I doubt this patch provides end users much value in practice as most distro kernels have CONFIG_CXL_ACPI, and the few end users that have CXL are not going blink at the overhead to support the full feature set. Can you not just disable CXL support in the BIOS for your system and avoid complicating this code path for a small win?
On Wed, 2026-03-04 at 17:29 -0800, dan.j.williams@intel.com wrote: > Huang, Kai wrote: > > On Wed, 2026-03-04 at 19:29 -0500, Gregory Price wrote: > > > On Thu, Mar 05, 2026 at 12:14:52AM +0000, Huang, Kai wrote: > > > > On Wed, 2026-03-04 at 18:56 -0500, Gregory Price wrote: > > > > > > > > > > This basically says if specifically CXL_ACPI is built out, the NUMA > > > > > structure is forever lost - even though it's accurately described by > > > > > BIOS. > > > > > > > > > > > > > The normal NUMA info described in SRAT is still there. It only avoids > > > > detecting CFMWS, which doesn't provide any NUMA info actually -- that's why > > > > kernel assigns a 'faked' NUMA node for each of them. > > > > > > > > So we are not losing anything AFAICT. > > > > > > Well, I'm mostly confused why there are CEDT entries for hardware that > > > presumably isn't even there - unless this platform is reserving space > > > for future hotplug. > > > > > > > I think this should be the case. > > > > > Just want to make sure we're not adjusting for > > > strange firmware behavior. > > > > How to check whether it is "strange"? > > > > So these are fine. These are CXL hotplug windows and the expectation is > that they *might* be populated in the future. SRAT can not make claims > about future CXL hotplug (see NOTE). So the expecation for CXL hotplug > is reserve some numa nodes that Linux can determine the affinity of > dynamically with HMAT Generic Port and device CDAT information. > > NOTE: SRAT *does* make claims about the affinity of future *ACPI* > Hotplug, but in that case the platform statically knows something about > the configuration of memory that can possibly be plugged in the future. Thanks for the info! > > > > You are taking something away by nature of compiling something out by > > > default that was previously not compiled out by default. > > > > Yeah, and it is due to "there's a cost" if we don't compile out by default. > > > > Hope that justifies? > > I think it makes sense that if you disable CXL hotplug by setting > CONFIG_CXL_ACPI=n then no need to reserve numa ids. However, just do > something like this rather than add ifdefs to the code: > > diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c > index aa87ee1583a4..62d4a8df0b8c 100644 > --- a/drivers/acpi/numa/srat.c > +++ b/drivers/acpi/numa/srat.c > @@ -654,8 +654,11 @@ int __init acpi_numa_init(void) > } > last_real_pxm = fake_pxm; > fake_pxm++; > - acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws, > - &fake_pxm); > + > + /* No need to expand numa nodes if CXL is disabled */ > + if (IS_ENABLED(CONFIG_CXL_ACPI)) > + acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws, > + &fake_pxm); > > if (cnt < 0) > return cnt; > > The call to acpi_table_parse_cedt() will get skipped and the code for > acpi_parse_cfmws() will get automatically compiled out of the file. Yeah agreed this is simpler. Thanks for the suggestion. I would like to honor Gregory's suggestion about the comment since it provides more info if you agree? (I also added "impacts certain NUMA userspace ABIs") /* * Some platforms report CEDT CFMWS for hotplug device support. These * windows are unusable without CXL drivers, so don't reserve fake nodes * if they're compiled out - it wastes memory on per-node structures and * impacts certain NUMA userspace ABIs. */ > > At the same time I doubt this patch provides end users much value in > practice as most distro kernels have CONFIG_CXL_ACPI, and the few end > users that have CXL are not going blink at the overhead to support the > full feature set. > > Can you not just disable CXL support in the BIOS for your system and > avoid complicating this code path for a small win? Sure I can (I need to figure out how, though). But I think it's still useful for developers because some of them will tend to only enable Kconfig options that they need to save build time. And sometimes it's not quite easy to turn off CXL in the BIOS since the machine would be mostly remote. So I think this patch is still worthy to do?
Huang, Kai wrote:
[..]
> > I think it makes sense that if you disable CXL hotplug by setting
> > CONFIG_CXL_ACPI=n then no need to reserve numa ids. However, just do
> > something like this rather than add ifdefs to the code:
> >
> > diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
> > index aa87ee1583a4..62d4a8df0b8c 100644
> > --- a/drivers/acpi/numa/srat.c
> > +++ b/drivers/acpi/numa/srat.c
> > @@ -654,8 +654,11 @@ int __init acpi_numa_init(void)
> > }
> > last_real_pxm = fake_pxm;
> > fake_pxm++;
> > - acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
> > - &fake_pxm);
> > +
> > + /* No need to expand numa nodes if CXL is disabled */
> > + if (IS_ENABLED(CONFIG_CXL_ACPI))
> > + acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
> > + &fake_pxm);
> >
> > if (cnt < 0)
> > return cnt;
> >
> > The call to acpi_table_parse_cedt() will get skipped and the code for
> > acpi_parse_cfmws() will get automatically compiled out of the file.
>
> Yeah agreed this is simpler. Thanks for the suggestion.
>
> I would like to honor Gregory's suggestion about the comment since it
> provides more info if you agree?
>
> (I also added "impacts certain NUMA userspace ABIs")
>
> /*
> * Some platforms report CEDT CFMWS for hotplug device support. These
> * windows are unusable without CXL drivers, so don't reserve fake nodes
> * if they're compiled out - it wastes memory on per-node structures and
> * impacts certain NUMA userspace ABIs.
> */
I feel like this is just too much scary detail for a tiny corner case
that maybe only effects a handful of kernel developers at hardware
companies with advanced server platforms.
/* No need to expand numa nodes if CXL is disabled */
...is more than sufficient.
> > At the same time I doubt this patch provides end users much value in
> > practice as most distro kernels have CONFIG_CXL_ACPI, and the few end
> > users that have CXL are not going blink at the overhead to support the
> > full feature set.
> >
> > Can you not just disable CXL support in the BIOS for your system and
> > avoid complicating this code path for a small win?
>
> Sure I can (I need to figure out how, though). But I think it's still
> useful for developers because some of them will tend to only enable Kconfig
> options that they need to save build time.
>
> And sometimes it's not quite easy to turn off CXL in the BIOS since the
> machine would be mostly remote.
>
> So I think this patch is still worthy to do?
A small 2 line patch to save a few megabytes of memory, sure. A
paragraph of context that almost nobody will ever care about on a go
forward basis? No, thank you.
On Wed, Mar 04, 2026 at 06:08:36PM -0800, dan.j.williams@intel.com wrote: > Huang, Kai wrote: > > A small 2 line patch to save a few megabytes of memory, sure. A > paragraph of context that almost nobody will ever care about on a go > forward basis? No, thank you. eh throw it in the changelog *shrug*, i think it's useful context. ~Gregory
Gregory Price wrote: > On Wed, Mar 04, 2026 at 06:08:36PM -0800, dan.j.williams@intel.com wrote: > > Huang, Kai wrote: > > > > A small 2 line patch to save a few megabytes of memory, sure. A > > paragraph of context that almost nobody will ever care about on a go > > forward basis? No, thank you. > > eh throw it in the changelog *shrug*, i think it's useful context. Definitely great changelog fodder and yes, useful context.
On Wed, 2026-03-04 at 19:01 -0800, dan.j.williams@intel.com wrote: > Gregory Price wrote: > > On Wed, Mar 04, 2026 at 06:08:36PM -0800, dan.j.williams@intel.com wrote: > > > Huang, Kai wrote: > > > > > > A small 2 line patch to save a few megabytes of memory, sure. A > > > paragraph of context that almost nobody will ever care about on a go > > > forward basis? No, thank you. > > > > eh throw it in the changelog *shrug*, i think it's useful context. > > Definitely great changelog fodder and yes, useful context. Thanks for all the feedback. So if I follow you guys correctly, the context I provided in the changelog is still useful. And the "save a few megabytes of memory" is also covered since I provided some data about "Slab" in /proc/meminfo. So to me I'll keep the changelog as-is in v2. Btw, I tried to build with W=1 when CXL_ACPI is off to verify what Dan said about the acpi_parse_cfmws() will get compiled out (otherwise we will see '- Wunused-function' warning): The call to acpi_table_parse_cedt() will get skipped and the code for acpi_parse_cfmws() will get automatically compiled out of the file. I tried both -O2 (CC_OPTIMIZE_FOR_PERFORMANCE) and -Os (CC_OPTIMIZE_FOR_SIZE) with W=1 and no warning was found. I also tried both gcc and clang and no warning was found either. So I'll follow Dan's suggestion to make the diff smaller. Hi Alison, Gregory, thanks for your review, please let me know if you have any concern. Thanks.
On Thu, Mar 05, 2026 at 12:45:39AM +0000, Huang, Kai wrote: > > > > You are taking something away by nature of compiling something out by > > default that was previously not compiled out by default. > > Yeah, and it is due to "there's a cost" if we don't compile out by default. > > Hope that justifies? Sorry, i don't mean to come off rough - I don't even necessarily disagree with the change here, but it is worth taking a moment to understand whether this is actually a Linux issue or an platform issue. Up until now there's been some assumption that if a device is attached there will be CEDT windows - but this would be the first time i've seen a platform throw up CEDT windows for devices that don't exist. Anyway, this is good information for Linux to know and worth a comment. We don't know what other people have built that depend on the current behavior. > > > > Well, I'm mostly confused why there are CEDT entries for hardware that > > presumably isn't even there - unless this platform is reserving space > > for future hotplug. > > > > I think this should be the case. > If that's the case then cool :] May be worth a comment in acpi_numa_init_cxl - something like: /* * Some platforms report CEDT CFMWS for hotplug device support. These * windows are unusable without CXL drivers, so don't reserve fake nodes * if they're compiled out - it wastes memory on per-node structures. */ Reviewed-by: Gregory Price <gourry@gourry.net> ~Gregory
© 2016 - 2026 Red Hat, Inc.