[PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on

Kai Huang posted 1 patch 1 month ago
There is a newer version of this series
drivers/acpi/numa/srat.c | 45 +++++++++++++++++++++++++---------------
1 file changed, 28 insertions(+), 17 deletions(-)
[PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on
Posted by Kai Huang 1 month ago
On CXL platforms, the Static Resource Affinity Table (SRAT) may not
cover memory affinity information for all the CXL memory regions.  Since
each CXL memory region is enumerated via a CXL Fixed Memory Window
Structure (CFMWS), during early boot the kernel parses the CFMWS tables
to find all CXL memory regions and sets a NUMA node for each of them.
This memory affinity information of CXL memory regions is later used by
the CXL ACPI driver.

The CFMWS table doesn't provide the memory affinity information either.
Currently the kernel assigns a 'faked' NUMA node for each CXL memory
region, starting from the next node of the highest node that is
enumerated via the SRAT.  This can potentially increase the maximum NUMA
node ID of the platform ('nr_node_ids') a lot.  E.g., on a GNR platform
with 4 NUMA nodes and 18 CFMWS tables, this bumps the 'nr_node_ids' to
22.

Increasing the 'nr_node_ids' has side effects.  For instance, it is
widely used by the kernel for "highest possible NUMA node" based memory
allocations.  It also impacts userspace ABIs, e.g., some NUMA memory
related system calls such as 'get_mempolicy' which requires 'maxnode'
not being smaller than the 'nr_node_ids'.

Currently parsing CFMWS tables and assigning faked NUMA node at boot is
done unconditionally.  However, if the CXL ACPI driver is not enabled,
there will be no user of such memory affinity information of CXL memory
regions.

Change to only parsing the CFMWS tables at boot when CXL_ACPI is enabled
in Kconfig to avoid the unnecessary cost of bumping up 'nr_node_ids'.

E.g., on the aforementioned GNR platform, the "Slab" in /proc/meminfo is
reduced with this change (when CXL_ACPI is off):

	w/ this change		w/o

Slab	900488 kB		923660 kB

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 drivers/acpi/numa/srat.c | 45 +++++++++++++++++++++++++---------------
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index aa87ee1583a4..c008fe1f15f7 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -425,6 +425,7 @@ acpi_parse_memory_affinity(union acpi_subtable_headers *header,
 	return 0;
 }
 
+#if IS_ENABLED(CONFIG_CXL_ACPI)
 static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
 				   void *arg, const unsigned long table_end)
 {
@@ -476,6 +477,31 @@ static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
 	return 0;
 }
 
+static void __init acpi_numa_init_cxl(void)
+{
+	int i, fake_pxm;
+
+	/*
+	 * CXL Fixed Memory Window Structures (CFMWS) must be parsed
+	 * after the SRAT. Create NUMA Nodes for CXL memory ranges that
+	 * are defined in the CFMWS and not already defined in the SRAT.
+	 * Initialize a fake_pxm as the first available PXM to emulate.
+	 */
+
+	/* fake_pxm is the next unused PXM value after SRAT parsing */
+	for (i = 0, fake_pxm = -1; i < MAX_NUMNODES; i++) {
+		if (node_to_pxm_map[i] > fake_pxm)
+			fake_pxm = node_to_pxm_map[i];
+	}
+	last_real_pxm = fake_pxm;
+	fake_pxm++;
+	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
+			      &fake_pxm);
+}
+#else
+static void __init acpi_numa_init_cxl(void) {}
+#endif
+
 void __init __weak
 acpi_numa_x2apic_affinity_init(struct acpi_srat_x2apic_cpu_affinity *pa)
 {
@@ -602,7 +628,7 @@ acpi_table_parse_srat(enum acpi_srat_type id,
 
 int __init acpi_numa_init(void)
 {
-	int i, fake_pxm, cnt = 0;
+	int cnt = 0;
 
 	if (acpi_disabled)
 		return -EINVAL;
@@ -640,22 +666,7 @@ int __init acpi_numa_init(void)
 	/* SLIT: System Locality Information Table */
 	acpi_table_parse(ACPI_SIG_SLIT, acpi_parse_slit);
 
-	/*
-	 * CXL Fixed Memory Window Structures (CFMWS) must be parsed
-	 * after the SRAT. Create NUMA Nodes for CXL memory ranges that
-	 * are defined in the CFMWS and not already defined in the SRAT.
-	 * Initialize a fake_pxm as the first available PXM to emulate.
-	 */
-
-	/* fake_pxm is the next unused PXM value after SRAT parsing */
-	for (i = 0, fake_pxm = -1; i < MAX_NUMNODES; i++) {
-		if (node_to_pxm_map[i] > fake_pxm)
-			fake_pxm = node_to_pxm_map[i];
-	}
-	last_real_pxm = fake_pxm;
-	fake_pxm++;
-	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
-			      &fake_pxm);
+	acpi_numa_init_cxl();
 
 	if (cnt < 0)
 		return cnt;

base-commit: af4e9ef3d78420feb8fe58cd9a1ab80c501b3c08
-- 
2.53.0
Re: [PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on
Posted by Gregory Price 1 month ago
On Thu, Mar 05, 2026 at 10:33:42AM +1300, Kai Huang wrote:
> Increasing the 'nr_node_ids' has side effects.  For instance, it is
> widely used by the kernel for "highest possible NUMA node" based memory
> allocations.  It also impacts userspace ABIs, e.g., some NUMA memory
> related system calls such as 'get_mempolicy' which requires 'maxnode'
> not being smaller than the 'nr_node_ids'.
> 

Is this a Linux issue or a Firmware issue?

Is GNR exporting more CFMWS than it should?

Is your SRAT missing entries for CFMWS that are otherwise present?

Are the CFMWS empty? (is that even valid)

> E.g., on the aforementioned GNR platform, the "Slab" in /proc/meminfo is
> reduced with this change (when CXL_ACPI is off):
> 
> 	w/ this change		w/o
> 
> Slab	900488 kB		923660 kB
> 

This is a good effect, but I still question the premise.

We don't usually want #ifdef's inside of .c files if we can avoid it.

~Gregory
Re: [PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on
Posted by Alison Schofield 1 month ago
On Wed, Mar 04, 2026 at 05:33:26PM -0500, Gregory Price wrote:
> On Thu, Mar 05, 2026 at 10:33:42AM +1300, Kai Huang wrote:
> > Increasing the 'nr_node_ids' has side effects.  For instance, it is
> > widely used by the kernel for "highest possible NUMA node" based memory
> > allocations.  It also impacts userspace ABIs, e.g., some NUMA memory
> > related system calls such as 'get_mempolicy' which requires 'maxnode'
> > not being smaller than the 'nr_node_ids'.
> > 

> 
> Is this a Linux issue or a Firmware issue?

IIUC BIOS creates the CEDT based on the hardware it 'sees' as present.

This patch is describing the case (weird as it seems to me) where we
then boot a system with ACPI and NUMA enabled but CXL_ACPI disabled.

So, I don't think we can blame BIOS.

> 
> Is GNR exporting more CFMWS than it should?
Not sure of any limits on flavors of CFMWS's a BIOS can offer.
If BIOS can carve out a window, it can create a CFMWS.
Not clear how that matters to the issue.

> 
> Is your SRAT missing entries for CFMWS that are otherwise present?
> 
> Are the CFMWS empty? (is that even valid)

Why this line of questioning ;)  I see the problem as a bit simpler.
We have other code that tells us if the CFMWS's are valid, etc, but
the point here is, we are not going to use these CFMWS's so stop
the parse as early as possible, like right here as Kai has done.

> 
> > E.g., on the aforementioned GNR platform, the "Slab" in /proc/meminfo is
> > reduced with this change (when CXL_ACPI is off):
> > 
> > 	w/ this change		w/o
> > 
> > Slab	900488 kB		923660 kB
> > 
> 
> This is a good effect, but I still question the premise.
> 
> We don't usually want #ifdef's inside of .c files if we can avoid it.

I thought similar, but for early init and static helpers, this 
#if IS_ENABLE(..) wrapper is common.


Reviewed-by: Alison Schofield <alison.schofield@intel.com>


> 
> ~Gregory
Re: [PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on
Posted by Huang, Kai 1 month ago
On Wed, 2026-03-04 at 15:20 -0800, Alison Schofield wrote:
> On Wed, Mar 04, 2026 at 05:33:26PM -0500, Gregory Price wrote:
> > On Thu, Mar 05, 2026 at 10:33:42AM +1300, Kai Huang wrote:
> > > Increasing the 'nr_node_ids' has side effects.  For instance, it is
> > > widely used by the kernel for "highest possible NUMA node" based memory
> > > allocations.  It also impacts userspace ABIs, e.g., some NUMA memory
> > > related system calls such as 'get_mempolicy' which requires 'maxnode'
> > > not being smaller than the 'nr_node_ids'.
> > > 
> 
> > 
> > Is this a Linux issue or a Firmware issue?

There are two issues actually: 

1) on my testing platform, we can see CFMWS tables but there's no actual CXL
memory inserted AFAICT; 

2) The kernel unconditionally bumps the 'nr_node_ids' even CXL_ACPI driver
is not able to detect any real CXL memory (or it is CXL port/mem driver
cannot detect, for which I am not very familiar with).

We can argue the 1) is FW issue.  But this patch aims to improve the 2),
which is an improvement regardless the firmware has issue or not:

If CXL_ACPI isn't even enabled in Kconfig, there's no need to detect CFMWS
tables and assign a 'faked NUMA node' during boot.

IIUC this is even true on a perfect fine machine with actual CXL memory
installed, since the kernel still have an option to disable CXL_ACPI in
Kconfig.

> 
> IIUC BIOS creates the CEDT based on the hardware it 'sees' as present.
> 
> This patch is describing the case (weird as it seems to me) where we
> then boot a system with ACPI and NUMA enabled but CXL_ACPI disabled.

When I found this issue the CXL_ACPI is enabled in Kconfig, but there's no
actual CXL memory being discovered AFAICT.

 #cxl list

tells me nothing.

> 
> So, I don't think we can blame BIOS.

As said above, I think it's an improvement in the kernel even on a good
machine with CXL memory inserted.

> 
> > 
> > Is GNR exporting more CFMWS than it should?
> Not sure of any limits on flavors of CFMWS's a BIOS can offer.
> If BIOS can carve out a window, it can create a CFMWS.
> Not clear how that matters to the issue.
> 
> > 
> > Is your SRAT missing entries for CFMWS that are otherwise present?
> > 
> > Are the CFMWS empty? (is that even valid)
> 
> Why this line of questioning ;)  I see the problem as a bit simpler.
> We have other code that tells us if the CFMWS's are valid, etc, but
> the point here is, we are not going to use these CFMWS's so stop
> the parse as early as possible, like right here as Kai has done.

Yes. :-)

> 
> > 
> > > E.g., on the aforementioned GNR platform, the "Slab" in /proc/meminfo is
> > > reduced with this change (when CXL_ACPI is off):
> > > 
> > > 	w/ this change		w/o
> > > 
> > > Slab	900488 kB		923660 kB
> > > 
> > 
> > This is a good effect, but I still question the premise.
> > 
> > We don't usually want #ifdef's inside of .c files if we can avoid it.
> 
> I thought similar, but for early init and static helpers, this 
> #if IS_ENABLE(..) wrapper is common.

Yes.  They are early init and static helpers, so no header file is involved
here.

I also see #if defined(CONFIG_X86) || defined(CONFIG_ARM64) is used for
acpi_parse_gi_affinity() in this file.

> 
> 
> Reviewed-by: Alison Schofield <alison.schofield@intel.com>

Thanks!
Re: [PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on
Posted by Alison Schofield 1 month ago
On Wed, Mar 04, 2026 at 04:04:57PM -0800, Huang, Kai wrote:
> On Wed, 2026-03-04 at 15:20 -0800, Alison Schofield wrote:
> > On Wed, Mar 04, 2026 at 05:33:26PM -0500, Gregory Price wrote:
> > > On Thu, Mar 05, 2026 at 10:33:42AM +1300, Kai Huang wrote:
> > > > Increasing the 'nr_node_ids' has side effects.  For instance, it is
> > > > widely used by the kernel for "highest possible NUMA node" based memory
> > > > allocations.  It also impacts userspace ABIs, e.g., some NUMA memory
> > > > related system calls such as 'get_mempolicy' which requires 'maxnode'
> > > > not being smaller than the 'nr_node_ids'.
> > > > 
> > 
> > > 
> > > Is this a Linux issue or a Firmware issue?
> 
> There are two issues actually: 
> 
> 1) on my testing platform, we can see CFMWS tables but there's no actual CXL
> memory inserted AFAICT; 
> 
> 2) The kernel unconditionally bumps the 'nr_node_ids' even CXL_ACPI driver
> is not able to detect any real CXL memory (or it is CXL port/mem driver
> cannot detect, for which I am not very familiar with).
> 
> We can argue the 1) is FW issue.  But this patch aims to improve the 2),
> which is an improvement regardless the firmware has issue or not:
> 
> If CXL_ACPI isn't even enabled in Kconfig, there's no need to detect CFMWS
> tables and assign a 'faked NUMA node' during boot.
> 
> IIUC this is even true on a perfect fine machine with actual CXL memory
> installed, since the kernel still have an option to disable CXL_ACPI in
> Kconfig.
> 
> > 
> > IIUC BIOS creates the CEDT based on the hardware it 'sees' as present.
> > 
> > This patch is describing the case (weird as it seems to me) where we
> > then boot a system with ACPI and NUMA enabled but CXL_ACPI disabled.
> 
> When I found this issue the CXL_ACPI is enabled in Kconfig, but there's no
> actual CXL memory being discovered AFAICT.

So this patch will not fix that case, because CXL_ACPI will be enabled.

I'd like to take a look at that system Kai. I'll ping you off list.

> 
>  #cxl list
> 
> tells me nothing.
> 
> > 
> > So, I don't think we can blame BIOS.
> 
> As said above, I think it's an improvement in the kernel even on a good
> machine with CXL memory inserted.
> 
> > 
> > > 
> > > Is GNR exporting more CFMWS than it should?
> > Not sure of any limits on flavors of CFMWS's a BIOS can offer.
> > If BIOS can carve out a window, it can create a CFMWS.
> > Not clear how that matters to the issue.
> > 
> > > 
> > > Is your SRAT missing entries for CFMWS that are otherwise present?
> > > 
> > > Are the CFMWS empty? (is that even valid)
> > 
> > Why this line of questioning ;)  I see the problem as a bit simpler.
> > We have other code that tells us if the CFMWS's are valid, etc, but
> > the point here is, we are not going to use these CFMWS's so stop
> > the parse as early as possible, like right here as Kai has done.
> 
> Yes. :-)
> 
> > 
> > > 
> > > > E.g., on the aforementioned GNR platform, the "Slab" in /proc/meminfo is
> > > > reduced with this change (when CXL_ACPI is off):
> > > > 
> > > > 	w/ this change		w/o
> > > > 
> > > > Slab	900488 kB		923660 kB
> > > > 
> > > 
> > > This is a good effect, but I still question the premise.
> > > 
> > > We don't usually want #ifdef's inside of .c files if we can avoid it.
> > 
> > I thought similar, but for early init and static helpers, this 
> > #if IS_ENABLE(..) wrapper is common.
> 
> Yes.  They are early init and static helpers, so no header file is involved
> here.
> 
> I also see #if defined(CONFIG_X86) || defined(CONFIG_ARM64) is used for
> acpi_parse_gi_affinity() in this file.
> 
> > 
> > 
> > Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> 
> Thanks!
Re: [PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on
Posted by Huang, Kai 1 month ago
On Wed, 2026-03-04 at 16:44 -0800, Alison Schofield wrote:
> > When I found this issue the CXL_ACPI is enabled in Kconfig, but there's no
> > actual CXL memory being discovered AFAICT.
> 
> So this patch will not fix that case, because CXL_ACPI will be enabled.
> 
> I'd like to take a look at that system Kai. I'll ping you off list.

Sent the machine login to you.

I would say this patch is not aiming to fix any issue that is due to broken
firmware(e.g., I didn't mention anything about that in the changelog).

This is to improve the kernel even on a fine CXL-memory-capable machine when
CXL_ACPI is off.
Re: [PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on
Posted by Gregory Price 1 month ago
On Wed, Mar 04, 2026 at 03:20:11PM -0800, Alison Schofield wrote:
> On Wed, Mar 04, 2026 at 05:33:26PM -0500, Gregory Price wrote:
> > On Thu, Mar 05, 2026 at 10:33:42AM +1300, Kai Huang wrote:
> > > Increasing the 'nr_node_ids' has side effects.  For instance, it is
> > > widely used by the kernel for "highest possible NUMA node" based memory
> > > allocations.  It also impacts userspace ABIs, e.g., some NUMA memory
> > > related system calls such as 'get_mempolicy' which requires 'maxnode'
> > > not being smaller than the 'nr_node_ids'.
> > > 
> 
> > 
> > Is this a Linux issue or a Firmware issue?
> 
> IIUC BIOS creates the CEDT based on the hardware it 'sees' as present.
> 
> This patch is describing the case (weird as it seems to me) where we
> then boot a system with ACPI and NUMA enabled but CXL_ACPI disabled.
> 
> So, I don't think we can blame BIOS.
> 
> > 
> > Is GNR exporting more CFMWS than it should?
> Not sure of any limits on flavors of CFMWS's a BIOS can offer.
> If BIOS can carve out a window, it can create a CFMWS.
> Not clear how that matters to the issue.
> 
> > 
> > Is your SRAT missing entries for CFMWS that are otherwise present?
> > 
> > Are the CFMWS empty? (is that even valid)
> 
> Why this line of questioning ;)  I see the problem as a bit simpler.
> We have other code that tells us if the CFMWS's are valid, etc, but
> the point here is, we are not going to use these CFMWS's so stop
> the parse as early as possible, like right here as Kai has done.
> 

Mostly i'm wondering if this issue should be dealt with in the acpi code
or if the issue is that we just don't want to figure out how to
lazy-create these things instead of always creating them at __init.

it does seem rational to build out support for CEDT entries if CXL_ACPI
is built out, but this also means you can't otherwise load modules that
would have made use of this information.

This basically says if specifically CXL_ACPI is built out, the NUMA
structure is forever lost - even though it's accurately described by
BIOS.  Maybe that's a rational decision, just kind of prodding a bit.

~Gregory
Re: [PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on
Posted by Huang, Kai 1 month ago
On Wed, 2026-03-04 at 18:56 -0500, Gregory Price wrote:
> On Wed, Mar 04, 2026 at 03:20:11PM -0800, Alison Schofield wrote:
> > On Wed, Mar 04, 2026 at 05:33:26PM -0500, Gregory Price wrote:
> > > On Thu, Mar 05, 2026 at 10:33:42AM +1300, Kai Huang wrote:
> > > > Increasing the 'nr_node_ids' has side effects.  For instance, it is
> > > > widely used by the kernel for "highest possible NUMA node" based memory
> > > > allocations.  It also impacts userspace ABIs, e.g., some NUMA memory
> > > > related system calls such as 'get_mempolicy' which requires 'maxnode'
> > > > not being smaller than the 'nr_node_ids'.
> > > > 
> > 
> > > 
> > > Is this a Linux issue or a Firmware issue?
> > 
> > IIUC BIOS creates the CEDT based on the hardware it 'sees' as present.
> > 
> > This patch is describing the case (weird as it seems to me) where we
> > then boot a system with ACPI and NUMA enabled but CXL_ACPI disabled.
> > 
> > So, I don't think we can blame BIOS.
> > 
> > > 
> > > Is GNR exporting more CFMWS than it should?
> > Not sure of any limits on flavors of CFMWS's a BIOS can offer.
> > If BIOS can carve out a window, it can create a CFMWS.
> > Not clear how that matters to the issue.
> > 
> > > 
> > > Is your SRAT missing entries for CFMWS that are otherwise present?
> > > 
> > > Are the CFMWS empty? (is that even valid)
> > 
> > Why this line of questioning ;)  I see the problem as a bit simpler.
> > We have other code that tells us if the CFMWS's are valid, etc, but
> > the point here is, we are not going to use these CFMWS's so stop
> > the parse as early as possible, like right here as Kai has done.
> > 
> 
> Mostly i'm wondering if this issue should be dealt with in the acpi code
> or if the issue is that we just don't want to figure out how to
> lazy-create these things instead of always creating them at __init.
> 
> it does seem rational to build out support for CEDT entries if CXL_ACPI
> is built out, but this also means you can't otherwise load modules that
> would have made use of this information.

Besides CXL_ACPI, is there any other module(s) that uses this information
for these CXL memory regions?

> 
> This basically says if specifically CXL_ACPI is built out, the NUMA
> structure is forever lost - even though it's accurately described by
> BIOS.  
> 

The normal NUMA info described in SRAT is still there.  It only avoids
detecting CFMWS, which doesn't provide any NUMA info actually -- that's why
kernel assigns a 'faked' NUMA node for each of them.

So we are not losing anything AFAICT.
Re: [PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on
Posted by Haifeng Xu 1 month ago

On 2026/3/5 08:14, Huang, Kai wrote:
> On Wed, 2026-03-04 at 18:56 -0500, Gregory Price wrote:
>> On Wed, Mar 04, 2026 at 03:20:11PM -0800, Alison Schofield wrote:
>>> On Wed, Mar 04, 2026 at 05:33:26PM -0500, Gregory Price wrote:
>>>> On Thu, Mar 05, 2026 at 10:33:42AM +1300, Kai Huang wrote:
>>>>> Increasing the 'nr_node_ids' has side effects.  For instance, it is
>>>>> widely used by the kernel for "highest possible NUMA node" based memory
>>>>> allocations.  It also impacts userspace ABIs, e.g., some NUMA memory
>>>>> related system calls such as 'get_mempolicy' which requires 'maxnode'
>>>>> not being smaller than the 'nr_node_ids'.
>>>>>
>>>
>>>>
>>>> Is this a Linux issue or a Firmware issue?
>>>
>>> IIUC BIOS creates the CEDT based on the hardware it 'sees' as present.
>>>
>>> This patch is describing the case (weird as it seems to me) where we
>>> then boot a system with ACPI and NUMA enabled but CXL_ACPI disabled.
>>>
>>> So, I don't think we can blame BIOS.
>>>
>>>>
>>>> Is GNR exporting more CFMWS than it should?
>>> Not sure of any limits on flavors of CFMWS's a BIOS can offer.
>>> If BIOS can carve out a window, it can create a CFMWS.
>>> Not clear how that matters to the issue.
>>>
>>>>
>>>> Is your SRAT missing entries for CFMWS that are otherwise present?
>>>>
>>>> Are the CFMWS empty? (is that even valid)
>>>
>>> Why this line of questioning ;)  I see the problem as a bit simpler.
>>> We have other code that tells us if the CFMWS's are valid, etc, but
>>> the point here is, we are not going to use these CFMWS's so stop
>>> the parse as early as possible, like right here as Kai has done.
>>>
>>
>> Mostly i'm wondering if this issue should be dealt with in the acpi code
>> or if the issue is that we just don't want to figure out how to
>> lazy-create these things instead of always creating them at __init.
>>
>> it does seem rational to build out support for CEDT entries if CXL_ACPI
>> is built out, but this also means you can't otherwise load modules that
>> would have made use of this information.
> 
> Besides CXL_ACPI, is there any other module(s) that uses this information
> for these CXL memory regions?
> 
>>
>> This basically says if specifically CXL_ACPI is built out, the NUMA
>> structure is forever lost - even though it's accurately described by
>> BIOS.  
>>
> 
> The normal NUMA info described in SRAT is still there.  It only avoids
> detecting CFMWS, which doesn't provide any NUMA info actually -- that's why
> kernel assigns a 'faked' NUMA node for each of them.

Hi Kai, we met some performance issues when creating pods because of too much
possible node without cxl memory inserted in our machine (machine Intel(R) Xeon(R) 6746E).
This patch can workaround our issue.

By the way, If insert real cxl memory but disable CXL_ACPI, how the kernel
initilaizes the cxl memory and assigns the numa node? can you provide the
related kernel souce code?

Thanks!


> 
> So we are not losing anything AFAICT.

Re: [PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on
Posted by Huang, Kai 1 month ago
> 
> By the way, If insert real cxl memory but disable CXL_ACPI, how the kernel
> initilaizes the cxl memory and assigns the numa node? can you provide the
> related kernel souce code?

Sorry I am a bit confused what you want to do.

If you have actual CXL memory installed, that means you want to use that?

But if you want to use CXL memory, you cannot disable CXL_ACPI IIUC.

If you don't want to use CXL memory in your deployment then you can either
disable CXL memory in your BIOS or, with this patch, disable CXL_ACPI in
your kernel.

Re: [PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on
Posted by Haifeng Xu 1 month ago

On 2026/3/5 18:25, Huang, Kai wrote:
> 
>>
>> By the way, If insert real cxl memory but disable CXL_ACPI, how the kernel
>> initilaizes the cxl memory and assigns the numa node? can you provide the
>> related kernel souce code?
> 
> Sorry I am a bit confused what you want to do.
> 

" It only avoids detecting CFMWS, which doesn't provide any NUMA info actually -- that's why
kernel assigns a 'faked' NUMA node for each of them."

Sorry I misunderstand what you said above. I originally thought that the inserted cxl memory
was assigned to a 'faked' NUMA node if disable CXL_ACPI. 


> If you have actual CXL memory installed, that means you want to use that?
> 
> But if you want to use CXL memory, you cannot disable CXL_ACPI IIUC.
> 
> If you don't want to use CXL memory in your deployment then you can either
> disable CXL memory in your BIOS or, with this patch, disable CXL_ACPI in
> your kernel.
> 

I see.

Thanks!
Re: [PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on
Posted by Gregory Price 1 month ago
On Thu, Mar 05, 2026 at 12:14:52AM +0000, Huang, Kai wrote:
> On Wed, 2026-03-04 at 18:56 -0500, Gregory Price wrote:
> > 
> > This basically says if specifically CXL_ACPI is built out, the NUMA
> > structure is forever lost - even though it's accurately described by
> > BIOS.  
> > 
> 
> The normal NUMA info described in SRAT is still there.  It only avoids
> detecting CFMWS, which doesn't provide any NUMA info actually -- that's why
> kernel assigns a 'faked' NUMA node for each of them.
> 
> So we are not losing anything AFAICT.

Well, I'm mostly confused why there are CEDT entries for hardware that
presumably isn't even there - unless this platform is reserving space
for future hotplug.  Just want to make sure we're not adjusting for
strange firmware behavior.

The only platform we've seen this behavior on previously was QEMU, but
that was because it never emitted SRAT, so i was wondering if there's
odd firmware behavior going on (emitting CEDTs when it shouldn't) before
we jump to dropping nodes that would have otherwise have been present
on existing systems who might have compiled CXL_ACPI out.

You are taking something away by nature of compiling something out by
default that was previously not compiled out by default.

~Gregory
Re: [PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on
Posted by Huang, Kai 1 month ago
On Wed, 2026-03-04 at 19:29 -0500, Gregory Price wrote:
> On Thu, Mar 05, 2026 at 12:14:52AM +0000, Huang, Kai wrote:
> > On Wed, 2026-03-04 at 18:56 -0500, Gregory Price wrote:
> > > 
> > > This basically says if specifically CXL_ACPI is built out, the NUMA
> > > structure is forever lost - even though it's accurately described by
> > > BIOS.  
> > > 
> > 
> > The normal NUMA info described in SRAT is still there.  It only avoids
> > detecting CFMWS, which doesn't provide any NUMA info actually -- that's why
> > kernel assigns a 'faked' NUMA node for each of them.
> > 
> > So we are not losing anything AFAICT.
> 
> Well, I'm mostly confused why there are CEDT entries for hardware that
> presumably isn't even there - unless this platform is reserving space
> for future hotplug.  
> 

I think this should be the case.

> Just want to make sure we're not adjusting for
> strange firmware behavior.

How to check whether it is "strange"?

I dumped the ACPI table on that platform, and what I can see is there's CEDT
table which contains couple of "[CXL Host Bridge Structure]" and "[CXL Fixed
Memory Window Structure]".

I pasted one for each of them here FYI:

[024h 0036 001h]               Subtable Type : 00 [CXL Host Bridge
Structure]
[025h 0037 001h]                    Reserved : 00
[026h 0038 002h]                      Length : 0020
[028h 0040 004h]      Associated host bridge : 00000003
[02Ch 0044 004h]       Specification version : 00000001
[030h 0048 004h]                    Reserved : 00000000
[034h 0052 008h]               Register base : 00000000997F0000
[03Ch 0060 008h]             Register length : 0000000000010000

[124h 0292 001h]               Subtable Type : 01 [CXL Fixed Memory Window
Structure]
[125h 0293 001h]                    Reserved : 00
[126h 0294 002h]                      Length : 0034
[128h 0296 004h]                    Reserved : 00000000
[12Ch 0300 008h]         Window base address : 0000002080000000
[134h 0308 008h]                 Window size : 0000017B00000000
[13Ch 0316 001h]    Interleave Members (2^n) : 02
[13Dh 0317 001h]       Interleave Arithmetic : 01
[13Eh 0318 002h]                    Reserved : 0000
[140h 0320 004h]                 Granularity : 00000000
[144h 0324 002h]                Restrictions : 0006
[146h 0326 002h]                       QtgId : 0000
[148h 0328 004h]                First Target : 00000003
[14Ch 0332 004h]                 Next Target : 00000023
[150h 0336 004h]                 Next Target : 00000043
[154h 0340 004h]                 Next Target : 00000053


> 
> The only platform we've seen this behavior on previously was QEMU, but
> that was because it never emitted SRAT, 
> 

My machine has a legitimate SRAT.  The kernel detects it has 4 nodes with
256 CPUs + 256G memory.

> so i was wondering if there's
> odd firmware behavior going on (emitting CEDTs when it shouldn't) before
> we jump to dropping nodes that would have otherwise have been present
> on existing systems who might have compiled CXL_ACPI out.

I don't know how to check whether the BIOS "should" or "should not" generate
CEDT.  I can check my machine if you let me know what to check specifically?

> 
> You are taking something away by nature of compiling something out by
> default that was previously not compiled out by default.

Yeah, and it is due to "there's a cost" if we don't compile out by default.

Hope that justifies?
Re: [PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on
Posted by dan.j.williams@intel.com 1 month ago
Huang, Kai wrote:
> On Wed, 2026-03-04 at 19:29 -0500, Gregory Price wrote:
> > On Thu, Mar 05, 2026 at 12:14:52AM +0000, Huang, Kai wrote:
> > > On Wed, 2026-03-04 at 18:56 -0500, Gregory Price wrote:
> > > > 
> > > > This basically says if specifically CXL_ACPI is built out, the NUMA
> > > > structure is forever lost - even though it's accurately described by
> > > > BIOS.  
> > > > 
> > > 
> > > The normal NUMA info described in SRAT is still there.  It only avoids
> > > detecting CFMWS, which doesn't provide any NUMA info actually -- that's why
> > > kernel assigns a 'faked' NUMA node for each of them.
> > > 
> > > So we are not losing anything AFAICT.
> > 
> > Well, I'm mostly confused why there are CEDT entries for hardware that
> > presumably isn't even there - unless this platform is reserving space
> > for future hotplug.  
> > 
> 
> I think this should be the case.
> 
> > Just want to make sure we're not adjusting for
> > strange firmware behavior.
> 
> How to check whether it is "strange"?
>

So these are fine. These are CXL hotplug windows and the expectation is
that they *might* be populated in the future. SRAT can not make claims
about future CXL hotplug (see NOTE). So the expecation for CXL hotplug
is reserve some numa nodes that Linux can determine the affinity of
dynamically with HMAT Generic Port and device CDAT information.

NOTE: SRAT *does* make claims about the affinity of future *ACPI*
Hotplug, but in that case the platform statically knows something about
the configuration of memory that can possibly be plugged in the future.

> > You are taking something away by nature of compiling something out by
> > default that was previously not compiled out by default.
> 
> Yeah, and it is due to "there's a cost" if we don't compile out by default.
> 
> Hope that justifies?

I think it makes sense that if you disable CXL hotplug by setting
CONFIG_CXL_ACPI=n then no need to reserve numa ids. However, just do
something like this rather than add ifdefs to the code:

diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index aa87ee1583a4..62d4a8df0b8c 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -654,8 +654,11 @@ int __init acpi_numa_init(void)
 	}
 	last_real_pxm = fake_pxm;
 	fake_pxm++;
-	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
-			      &fake_pxm);
+
+	/* No need to expand numa nodes if CXL is disabled */
+	if (IS_ENABLED(CONFIG_CXL_ACPI))
+		acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
+				      &fake_pxm);
 
 	if (cnt < 0)
 		return cnt;

The call to acpi_table_parse_cedt() will get skipped and the code for
acpi_parse_cfmws() will get automatically compiled out of the file.

At the same time I doubt this patch provides end users much value in
practice as most distro kernels have CONFIG_CXL_ACPI, and the few end
users that have CXL are not going blink at the overhead to support the
full feature set.

Can you not just disable CXL support in the BIOS for your system and
avoid complicating this code path for a small win?
Re: [PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on
Posted by Huang, Kai 1 month ago
On Wed, 2026-03-04 at 17:29 -0800, dan.j.williams@intel.com wrote:
> Huang, Kai wrote:
> > On Wed, 2026-03-04 at 19:29 -0500, Gregory Price wrote:
> > > On Thu, Mar 05, 2026 at 12:14:52AM +0000, Huang, Kai wrote:
> > > > On Wed, 2026-03-04 at 18:56 -0500, Gregory Price wrote:
> > > > > 
> > > > > This basically says if specifically CXL_ACPI is built out, the NUMA
> > > > > structure is forever lost - even though it's accurately described by
> > > > > BIOS.  
> > > > > 
> > > > 
> > > > The normal NUMA info described in SRAT is still there.  It only avoids
> > > > detecting CFMWS, which doesn't provide any NUMA info actually -- that's why
> > > > kernel assigns a 'faked' NUMA node for each of them.
> > > > 
> > > > So we are not losing anything AFAICT.
> > > 
> > > Well, I'm mostly confused why there are CEDT entries for hardware that
> > > presumably isn't even there - unless this platform is reserving space
> > > for future hotplug.  
> > > 
> > 
> > I think this should be the case.
> > 
> > > Just want to make sure we're not adjusting for
> > > strange firmware behavior.
> > 
> > How to check whether it is "strange"?
> > 
> 
> So these are fine. These are CXL hotplug windows and the expectation is
> that they *might* be populated in the future. SRAT can not make claims
> about future CXL hotplug (see NOTE). So the expecation for CXL hotplug
> is reserve some numa nodes that Linux can determine the affinity of
> dynamically with HMAT Generic Port and device CDAT information.
> 
> NOTE: SRAT *does* make claims about the affinity of future *ACPI*
> Hotplug, but in that case the platform statically knows something about
> the configuration of memory that can possibly be plugged in the future.

Thanks for the info!

> 
> > > You are taking something away by nature of compiling something out by
> > > default that was previously not compiled out by default.
> > 
> > Yeah, and it is due to "there's a cost" if we don't compile out by default.
> > 
> > Hope that justifies?
> 
> I think it makes sense that if you disable CXL hotplug by setting
> CONFIG_CXL_ACPI=n then no need to reserve numa ids. However, just do
> something like this rather than add ifdefs to the code:
> 
> diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
> index aa87ee1583a4..62d4a8df0b8c 100644
> --- a/drivers/acpi/numa/srat.c
> +++ b/drivers/acpi/numa/srat.c
> @@ -654,8 +654,11 @@ int __init acpi_numa_init(void)
>  	}
>  	last_real_pxm = fake_pxm;
>  	fake_pxm++;
> -	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
> -			      &fake_pxm);
> +
> +	/* No need to expand numa nodes if CXL is disabled */
> +	if (IS_ENABLED(CONFIG_CXL_ACPI))
> +		acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
> +				      &fake_pxm);
>  
>  	if (cnt < 0)
>  		return cnt;
> 
> The call to acpi_table_parse_cedt() will get skipped and the code for
> acpi_parse_cfmws() will get automatically compiled out of the file.

Yeah agreed this is simpler.  Thanks for the suggestion.

I would like to honor Gregory's suggestion about the comment since it
provides more info if you agree?

(I also added "impacts certain NUMA userspace ABIs")

/*
 * Some platforms report CEDT CFMWS for hotplug device support. These
 * windows are unusable without CXL drivers, so don't reserve fake nodes
 * if they're compiled out - it wastes memory on per-node structures and
 * impacts certain NUMA userspace ABIs.
 */

> 
> At the same time I doubt this patch provides end users much value in
> practice as most distro kernels have CONFIG_CXL_ACPI, and the few end
> users that have CXL are not going blink at the overhead to support the
> full feature set.
> 
> Can you not just disable CXL support in the BIOS for your system and
> avoid complicating this code path for a small win?

Sure I can (I need to figure out how, though).  But I think it's still
useful for developers because some of them will tend to only enable Kconfig
options that they need to save build time.

And sometimes it's not quite easy to turn off CXL in the BIOS since the
machine would be mostly remote.

So I think this patch is still worthy to do?


Re: [PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on
Posted by dan.j.williams@intel.com 1 month ago
Huang, Kai wrote:
[..]
> > I think it makes sense that if you disable CXL hotplug by setting
> > CONFIG_CXL_ACPI=n then no need to reserve numa ids. However, just do
> > something like this rather than add ifdefs to the code:
> > 
> > diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
> > index aa87ee1583a4..62d4a8df0b8c 100644
> > --- a/drivers/acpi/numa/srat.c
> > +++ b/drivers/acpi/numa/srat.c
> > @@ -654,8 +654,11 @@ int __init acpi_numa_init(void)
> >  	}
> >  	last_real_pxm = fake_pxm;
> >  	fake_pxm++;
> > -	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
> > -			      &fake_pxm);
> > +
> > +	/* No need to expand numa nodes if CXL is disabled */
> > +	if (IS_ENABLED(CONFIG_CXL_ACPI))
> > +		acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
> > +				      &fake_pxm);
> >  
> >  	if (cnt < 0)
> >  		return cnt;
> > 
> > The call to acpi_table_parse_cedt() will get skipped and the code for
> > acpi_parse_cfmws() will get automatically compiled out of the file.
> 
> Yeah agreed this is simpler.  Thanks for the suggestion.
> 
> I would like to honor Gregory's suggestion about the comment since it
> provides more info if you agree?
> 
> (I also added "impacts certain NUMA userspace ABIs")
> 
> /*
>  * Some platforms report CEDT CFMWS for hotplug device support. These
>  * windows are unusable without CXL drivers, so don't reserve fake nodes
>  * if they're compiled out - it wastes memory on per-node structures and
>  * impacts certain NUMA userspace ABIs.
>  */

I feel like this is just too much scary detail for a tiny corner case
that maybe only effects a handful of kernel developers at hardware
companies with advanced server platforms.

    /* No need to expand numa nodes if CXL is disabled */

...is more than sufficient.

> > At the same time I doubt this patch provides end users much value in
> > practice as most distro kernels have CONFIG_CXL_ACPI, and the few end
> > users that have CXL are not going blink at the overhead to support the
> > full feature set.
> > 
> > Can you not just disable CXL support in the BIOS for your system and
> > avoid complicating this code path for a small win?
> 
> Sure I can (I need to figure out how, though).  But I think it's still
> useful for developers because some of them will tend to only enable Kconfig
> options that they need to save build time.
> 
> And sometimes it's not quite easy to turn off CXL in the BIOS since the
> machine would be mostly remote.
> 
> So I think this patch is still worthy to do?

A small 2 line patch to save a few megabytes of memory, sure. A
paragraph of context that almost nobody will ever care about on a go
forward basis? No, thank you.
Re: [PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on
Posted by Gregory Price 1 month ago
On Wed, Mar 04, 2026 at 06:08:36PM -0800, dan.j.williams@intel.com wrote:
> Huang, Kai wrote:
> 
> A small 2 line patch to save a few megabytes of memory, sure. A
> paragraph of context that almost nobody will ever care about on a go
> forward basis? No, thank you.

eh throw it in the changelog *shrug*, i think it's useful context.

~Gregory
Re: [PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on
Posted by dan.j.williams@intel.com 1 month ago
Gregory Price wrote:
> On Wed, Mar 04, 2026 at 06:08:36PM -0800, dan.j.williams@intel.com wrote:
> > Huang, Kai wrote:
> > 
> > A small 2 line patch to save a few megabytes of memory, sure. A
> > paragraph of context that almost nobody will ever care about on a go
> > forward basis? No, thank you.
> 
> eh throw it in the changelog *shrug*, i think it's useful context.

Definitely great changelog fodder and yes, useful context.
Re: [PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on
Posted by Huang, Kai 1 month ago
On Wed, 2026-03-04 at 19:01 -0800, dan.j.williams@intel.com wrote:
> Gregory Price wrote:
> > On Wed, Mar 04, 2026 at 06:08:36PM -0800, dan.j.williams@intel.com wrote:
> > > Huang, Kai wrote:
> > > 
> > > A small 2 line patch to save a few megabytes of memory, sure. A
> > > paragraph of context that almost nobody will ever care about on a go
> > > forward basis? No, thank you.
> > 
> > eh throw it in the changelog *shrug*, i think it's useful context.
> 
> Definitely great changelog fodder and yes, useful context.

Thanks for all the feedback.

So if I follow you guys correctly, the context I provided in the changelog
is still useful.  And the "save a few megabytes of memory" is also covered
since I provided some data about "Slab" in /proc/meminfo.  So to me I'll
keep the changelog as-is in v2.

Btw, I tried to build with W=1 when CXL_ACPI is off to verify what Dan said
about the acpi_parse_cfmws() will get compiled out (otherwise we will see '-
Wunused-function' warning):

  The call to acpi_table_parse_cedt() will get skipped and the code for
  acpi_parse_cfmws() will get automatically compiled out of the file.

I tried both -O2 (CC_OPTIMIZE_FOR_PERFORMANCE) and -Os
(CC_OPTIMIZE_FOR_SIZE) with W=1 and no warning was found.  I also tried both
gcc and clang and no warning was found either.

So I'll follow Dan's suggestion to make the diff smaller.

Hi Alison, Gregory, thanks for your review, please let me know if you have
any concern.

Thanks.
Re: [PATCH] ACPI: NUMA: Only parse CFMWS at boot when CXL_ACPI is on
Posted by Gregory Price 1 month ago
On Thu, Mar 05, 2026 at 12:45:39AM +0000, Huang, Kai wrote:
> > 
> > You are taking something away by nature of compiling something out by
> > default that was previously not compiled out by default.
> 
> Yeah, and it is due to "there's a cost" if we don't compile out by default.
> 
> Hope that justifies?

Sorry, i don't mean to come off rough - I don't even necessarily
disagree with the change here, but it is worth taking a moment to
understand whether this is actually a Linux issue or an platform issue.

Up until now there's been some assumption that if a device is attached
there will be CEDT windows - but this would be the first time i've seen
a platform throw up CEDT windows for devices that don't exist.

Anyway, this is good information for Linux to know and worth a comment.
We don't know what other people have built that depend on the current
behavior.

> > 
> > Well, I'm mostly confused why there are CEDT entries for hardware that
> > presumably isn't even there - unless this platform is reserving space
> > for future hotplug.  
> > 
> 
> I think this should be the case.
>

If that's the case then cool :]

May be worth a comment in acpi_numa_init_cxl - something like:

/*
 * Some platforms report CEDT CFMWS for hotplug device support. These
 * windows are unusable without CXL drivers, so don't reserve fake nodes
 * if they're compiled out - it wastes memory on per-node structures.
 */

Reviewed-by: Gregory Price <gourry@gourry.net>

~Gregory