[PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to reduce max_distance

Posted by Jia He 2 months, 2 weeks ago
pcpu_embed_first_chunk() allocates the first percpu chunk via
pcpu_fc_alloc() and uses it as-is, without remapping it into the vmalloc
area. On NUMA systems, this can lead to a sparse CPU->unit mapping,
resulting in a large physical address span (max_distance) and excessive
vmalloc space requirements.

For example, on an arm64 N2 server with 256 CPUs, the memory layout
includes:
[    0.000000] NUMA: NODE_DATA [mem 0x100fffff0b00-0x100fffffffff]
[    0.000000] NUMA: NODE_DATA [mem 0x500fffff0b00-0x500fffffffff]
[    0.000000] NUMA: NODE_DATA [mem 0x600fffff0b00-0x600fffffffff]
[    0.000000] NUMA: NODE_DATA [mem 0x700ffffbcb00-0x700ffffcbfff]

With the following NUMA distance matrix:
node distances:
node   0   1   2   3
  0:  10  16  22  22
  1:  16  10  22  22
  2:  22  22  10  16
  3:  22  22  16  10

In this configuration, pcpu_embed_first_chunk() computes a large
max_distance:
percpu: max_distance=0x5fffbfac0000 too large for vmalloc space 0x7bff70000000
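
Roughly, max_distance is the physical span between the lowest and the
highest per-group allocation. With the matrix above, only a distance of
10 (LOCAL_DISTANCE) keeps two CPUs in the same group, so each node ends
up as its own group and allocates its units from local memory.
Approximating the group bases with the NODE_DATA ranges above:

  node 3 base - node 0 base ~= 0x700ffffbcb00 - 0x100fffff0b00
                            ~= 0x600000000000

which is in line with the reported max_distance of 0x5fffbfac0000.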

As a result, the allocator falls back to pcpu_page_first_chunk(), which
uses page-by-page allocation with nr_groups = 1, leading to degraded
performance.
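
(pcpu_page_first_chunk() maps each unit into the vmalloc area page by
page, so percpu accesses lose the linear map's large page mappings and
put more pressure on the TLB.)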

This patch introduces a normalized CPU-to-NUMA node mapping to mitigate
the issue. Distances of 10 and 16 are treated as local (LOCAL_DISTANCE),
allowing CPUs from nearby nodes to be grouped together. Consequently,
nr_groups will be 2 and pcpu_fc_alloc() uses the normalized node ID to
allocate memory from a common node.

For example:
- cpu0 belongs to node 0
- cpu64 belongs to node 1
Both CPUs are considered local and will allocate memory from node 0.
This normalization reduces max_distance:
percpu: max_distance=0x500000380000, ~64% of vmalloc space 0x7bff70000000
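
With the normalized mapping there are only two groups: CPUs on nodes
0/1 allocate from node 0 (memory near 0x100000000000) and CPUs on
nodes 2/3 allocate from node 2 (memory near 0x600000000000), so the
span shrinks to roughly:

  0x600000000000 - 0x100000000000 = 0x500000000000

which matches the reported max_distance of 0x500000380000.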

In addition, add a need_norm flag to indicate that normalization is
needed, i.e. when cpu_to_norm_node_map[] differs from cpu_to_node_map[].

Signed-off-by: Jia He <justin.he@arm.com>
---
 drivers/base/arch_numa.c | 47 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index c99f2ab105e5..f746d88239e9 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -17,6 +17,8 @@
 #include <asm/sections.h>
 
 static int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
+static int cpu_to_norm_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
+static bool need_norm;
 
 bool numa_off;
 
@@ -149,9 +151,40 @@ int early_cpu_to_node(int cpu)
 	return cpu_to_node_map[cpu];
 }
 
+int __init early_cpu_to_norm_node(int cpu)
+{
+	return cpu_to_norm_node_map[cpu];
+}
+
 static int __init pcpu_cpu_distance(unsigned int from, unsigned int to)
 {
-	return node_distance(early_cpu_to_node(from), early_cpu_to_node(to));
+	int distance = node_distance(early_cpu_to_node(from), early_cpu_to_node(to));
+
+	if (distance > LOCAL_DISTANCE && distance < REMOTE_DISTANCE && !need_norm)
+		need_norm = true;
+
+	return distance;
+}
+
+static int __init pcpu_cpu_norm_distance(unsigned int from, unsigned int to)
+{
+	int distance = pcpu_cpu_distance(from, to);
+
+	if (distance >= REMOTE_DISTANCE)
+		return REMOTE_DISTANCE;
+
+	/*
+	 * If the distance is in the range [LOCAL_DISTANCE, REMOTE_DISTANCE),
+	 * normalize the node map, choose the first local numa node id as its
+	 * normalized node id.
+	 */
+	if (cpu_to_norm_node_map[from] == NUMA_NO_NODE)
+		cpu_to_norm_node_map[from] = cpu_to_node_map[from];
+
+	if (cpu_to_norm_node_map[to] == NUMA_NO_NODE)
+		cpu_to_norm_node_map[to] = cpu_to_norm_node_map[from];
+
+	return LOCAL_DISTANCE;
 }
 
 void __init setup_per_cpu_areas(void)
@@ -169,6 +202,18 @@ void __init setup_per_cpu_areas(void)
 					    PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
 					    pcpu_cpu_distance,
 					    early_cpu_to_node);
+
+		if (rc < 0 && need_norm) {
+			/* Try the normalized node distance again */
+			pr_info("PERCPU: %s allocator, trying the normalization mode\n",
+				   pcpu_fc_names[pcpu_chosen_fc]);
+
+			rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
+						    PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
+						    pcpu_cpu_norm_distance,
+						    early_cpu_to_norm_node);
+		}
+
 #ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
 		if (rc < 0)
 			pr_warn("PERCPU: %s allocator failed (%d), falling back to page size\n",
-- 
2.34.1
Re: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to reduce max_distance
Posted by kernel test robot 2 months, 1 week ago
Hi Jia,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Jia-He/mm-percpu-Introduce-normalized-CPU-to-NUMA-node-mapping-to-reduce-max_distance/20250722-121559
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250722041418.2024870-1-justin.he%40arm.com
patch subject: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to  reduce max_distance
config: arm64-randconfig-r113-20250725 (https://download.01.org/0day-ci/archive/20250726/202507262015.sw4niVFQ-lkp@intel.com/config)
compiler: aarch64-linux-gcc (GCC) 10.5.0
reproduce: (https://download.01.org/0day-ci/archive/20250726/202507262015.sw4niVFQ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507262015.sw4niVFQ-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> drivers/base/arch_numa.c:154:12: sparse: sparse: symbol 'early_cpu_to_norm_node' was not declared. Should it be static?

vim +/early_cpu_to_norm_node +154 drivers/base/arch_numa.c

   153	
 > 154	int __init early_cpu_to_norm_node(int cpu)
   155	{
   156		return cpu_to_norm_node_map[cpu];
   157	}
   158	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to reduce max_distance
Posted by kernel test robot 2 months, 2 weeks ago
Hi Jia,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Jia-He/mm-percpu-Introduce-normalized-CPU-to-NUMA-node-mapping-to-reduce-max_distance/20250722-121559
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250722041418.2024870-1-justin.he%40arm.com
patch subject: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to  reduce max_distance
config: arm64-randconfig-001-20250722 (https://download.01.org/0day-ci/archive/20250723/202507230509.juShbryQ-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 853c343b45b3e83cc5eeef5a52fc8cc9d8a09252)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250723/202507230509.juShbryQ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507230509.juShbryQ-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/base/arch_numa.c:154:12: warning: no previous prototype for function 'early_cpu_to_norm_node' [-Wmissing-prototypes]
     154 | int __init early_cpu_to_norm_node(int cpu)
         |            ^
   drivers/base/arch_numa.c:154:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
     154 | int __init early_cpu_to_norm_node(int cpu)
         | ^
         | static 
   1 warning generated.


vim +/early_cpu_to_norm_node +154 drivers/base/arch_numa.c

   153	
 > 154	int __init early_cpu_to_norm_node(int cpu)
   155	{
   156		return cpu_to_norm_node_map[cpu];
   157	}
   158	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to reduce max_distance
Posted by Greg Kroah-Hartman 2 months, 2 weeks ago
On Tue, Jul 22, 2025 at 04:14:18AM +0000, Jia He wrote:
> pcpu_embed_first_chunk() allocates the first percpu chunk via
> pcpu_fc_alloc() and uses it as-is, without remapping it into the vmalloc
> area. On NUMA systems, this can lead to a sparse CPU->unit mapping,
> resulting in a large physical address span (max_distance) and excessive
> vmalloc space requirements.

Why is the subject line "mm: percpu:" when this is driver-core code?

And if it is mm code, please cc: the mm maintainers and list please.

> For example, on an arm64 N2 server with 256 CPUs, the memory layout
> includes:
> [    0.000000] NUMA: NODE_DATA [mem 0x100fffff0b00-0x100fffffffff]
> [    0.000000] NUMA: NODE_DATA [mem 0x500fffff0b00-0x500fffffffff]
> [    0.000000] NUMA: NODE_DATA [mem 0x600fffff0b00-0x600fffffffff]
> [    0.000000] NUMA: NODE_DATA [mem 0x700ffffbcb00-0x700ffffcbfff]
> 
> With the following NUMA distance matrix:
> node distances:
> node   0   1   2   3
>   0:  10  16  22  22
>   1:  16  10  22  22
>   2:  22  22  10  16
>   3:  22  22  16  10
> 
> In this configuration, pcpu_embed_first_chunk() computes a large
> max_distance:
> percpu: max_distance=0x5fffbfac0000 too large for vmalloc space 0x7bff70000000
> 
> As a result, the allocator falls back to pcpu_page_first_chunk(), which
> uses page-by-page allocation with nr_groups = 1, leading to degraded
> performance.

But that's intentional, you don't want to go across the nodes, right?

> This patch introduces a normalized CPU-to-NUMA node mapping to mitigate
> the issue. Distances of 10 and 16 are treated as local (LOCAL_DISTANCE),

Why?  What is this going to now break on those systems that assumed that
those were NOT local?

> allowing CPUs from nearby nodes to be grouped together. Consequently,
> nr_groups will be 2 and pcpu_fc_alloc() uses the normalized node ID to
> allocate memory from a common node.
> 
> For example:
> - cpu0 belongs to node 0
> - cpu64 belongs to node 1
> Both CPUs are considered local and will allocate memory from node 0.
> This normalization reduces max_distance:
> percpu: max_distance=0x500000380000, ~64% of vmalloc space 0x7bff70000000
> 
> In addition, add a need_norm flag to indicate that normalization is
> needed, i.e. when cpu_to_norm_node_map[] differs from cpu_to_node_map[].
> 
> Signed-off-by: Jia He <justin.he@arm.com>

I think this needs a lot of testing and verification and acks from
maintainers of other arches that can say "this also works for us" before
we can take it, as it has the potential to make major changes to
systems.

What did you test this on?


> ---
>  drivers/base/arch_numa.c | 47 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 46 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
> index c99f2ab105e5..f746d88239e9 100644
> --- a/drivers/base/arch_numa.c
> +++ b/drivers/base/arch_numa.c
> @@ -17,6 +17,8 @@
>  #include <asm/sections.h>
>  
>  static int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
> +static int cpu_to_norm_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
> +static bool need_norm;

Shouldn't these be marked __initdata as you don't touch them afterward?

thanks,

greg k-h