pcpu_embed_first_chunk() allocates the first percpu chunk via
pcpu_fc_alloc() and uses it as-is, without mapping it into the vmalloc
area. On NUMA systems, this can lead to a sparse CPU->unit mapping,
resulting in a large physical address span (max_distance) and excessive
vmalloc space requirements.
For example, on an arm64 N2 server with 256 CPUs, the memory layout
includes:
[ 0.000000] NUMA: NODE_DATA [mem 0x100fffff0b00-0x100fffffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x500fffff0b00-0x500fffffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x600fffff0b00-0x600fffffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x700ffffbcb00-0x700ffffcbfff]
With the following NUMA distance matrix:
node distances:
node 0 1 2 3
0: 10 16 22 22
1: 16 10 22 22
2: 22 22 10 16
3: 22 22 16 10
In this configuration, pcpu_embed_first_chunk() computes a large
max_distance:
percpu: max_distance=0x5fffbfac0000 too large for vmalloc space 0x7bff70000000
As a result, the allocator falls back to pcpu_page_first_chunk(), which
uses page-by-page allocation with nr_groups = 1, leading to degraded
performance.
Introduce a normalized CPU-to-NUMA node mapping to mitigate the issue.
Distances in [LOCAL_DISTANCE, REMOTE_DISTANCE) (10 and 16 here) are
treated as local, allowing CPUs from nearby nodes to be grouped
together. Consequently, nr_groups becomes 2 and pcpu_fc_alloc() uses the
normalized node ID to allocate memory from a common node.
For example:
- cpu0 belongs to node 0
- cpu64 belongs to node 1
Both CPUs are considered local and will allocate memory from node 0.
This normalization reduces max_distance:
percpu: max_distance=0x500000380000, ~64% of vmalloc space 0x7bff70000000
In addition, add a need_norm flag to indicate that normalization is
needed, i.e. when cpu_to_norm_node_map[] differs from cpu_to_node_map[].
Signed-off-by: Jia He <justin.he@arm.com>
---
drivers/base/arch_numa.c | 47 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 46 insertions(+), 1 deletion(-)
diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index c99f2ab105e5..f746d88239e9 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -17,6 +17,8 @@
 #include <asm/sections.h>
 
 static int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
+static int cpu_to_norm_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
+static bool need_norm;
 
 bool numa_off;
@@ -149,9 +151,40 @@ int early_cpu_to_node(int cpu)
 	return cpu_to_node_map[cpu];
 }
 
+int __init early_cpu_to_norm_node(int cpu)
+{
+	return cpu_to_norm_node_map[cpu];
+}
+
 static int __init pcpu_cpu_distance(unsigned int from, unsigned int to)
 {
-	return node_distance(early_cpu_to_node(from), early_cpu_to_node(to));
+	int distance = node_distance(early_cpu_to_node(from), early_cpu_to_node(to));
+
+	if (distance > LOCAL_DISTANCE && distance < REMOTE_DISTANCE && !need_norm)
+		need_norm = true;
+
+	return distance;
+}
+
+static int __init pcpu_cpu_norm_distance(unsigned int from, unsigned int to)
+{
+	int distance = pcpu_cpu_distance(from, to);
+
+	if (distance >= REMOTE_DISTANCE)
+		return REMOTE_DISTANCE;
+
+	/*
+	 * If the distance is in the range [LOCAL_DISTANCE, REMOTE_DISTANCE),
+	 * normalize the node map: choose the first local NUMA node ID as the
+	 * normalized node ID.
+	 */
+	if (cpu_to_norm_node_map[from] == NUMA_NO_NODE)
+		cpu_to_norm_node_map[from] = cpu_to_node_map[from];
+
+	if (cpu_to_norm_node_map[to] == NUMA_NO_NODE)
+		cpu_to_norm_node_map[to] = cpu_to_norm_node_map[from];
+
+	return LOCAL_DISTANCE;
 }
 
 void __init setup_per_cpu_areas(void)
@@ -169,6 +202,18 @@ void __init setup_per_cpu_areas(void)
 					    PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
 					    pcpu_cpu_distance,
 					    early_cpu_to_node);
+
+	if (rc < 0 && need_norm) {
+		/* Try the normalized node distance again */
+		pr_info("PERCPU: %s allocator, trying the normalization mode\n",
+			pcpu_fc_names[pcpu_chosen_fc]);
+
+		rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
+					    PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
+					    pcpu_cpu_norm_distance,
+					    early_cpu_to_norm_node);
+	}
+
 #ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
 	if (rc < 0)
 		pr_warn("PERCPU: %s allocator failed (%d), falling back to page size\n",
--
2.34.1
Hi Jia,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Jia-He/mm-percpu-Introduce-normalized-CPU-to-NUMA-node-mapping-to-reduce-max_distance/20250722-121559
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250722041418.2024870-1-justin.he%40arm.com
patch subject: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to reduce max_distance
config: arm64-randconfig-r113-20250725 (https://download.01.org/0day-ci/archive/20250726/202507262015.sw4niVFQ-lkp@intel.com/config)
compiler: aarch64-linux-gcc (GCC) 10.5.0
reproduce: (https://download.01.org/0day-ci/archive/20250726/202507262015.sw4niVFQ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507262015.sw4niVFQ-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> drivers/base/arch_numa.c:154:12: sparse: sparse: symbol 'early_cpu_to_norm_node' was not declared. Should it be static?

vim +/early_cpu_to_norm_node +154 drivers/base/arch_numa.c

   153	
 > 154	int __init early_cpu_to_norm_node(int cpu)
   155	{
   156		return cpu_to_norm_node_map[cpu];
   157	}
   158	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Jia,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Jia-He/mm-percpu-Introduce-normalized-CPU-to-NUMA-node-mapping-to-reduce-max_distance/20250722-121559
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250722041418.2024870-1-justin.he%40arm.com
patch subject: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to reduce max_distance
config: arm64-randconfig-001-20250722 (https://download.01.org/0day-ci/archive/20250723/202507230509.juShbryQ-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 853c343b45b3e83cc5eeef5a52fc8cc9d8a09252)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250723/202507230509.juShbryQ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507230509.juShbryQ-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/base/arch_numa.c:154:12: warning: no previous prototype for function 'early_cpu_to_norm_node' [-Wmissing-prototypes]
     154 | int __init early_cpu_to_norm_node(int cpu)
         |            ^
   drivers/base/arch_numa.c:154:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
     154 | int __init early_cpu_to_norm_node(int cpu)
         | ^
         | static
   1 warning generated.

vim +/early_cpu_to_norm_node +154 drivers/base/arch_numa.c

   153	
 > 154	int __init early_cpu_to_norm_node(int cpu)
   155	{
   156		return cpu_to_norm_node_map[cpu];
   157	}
   158	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Tue, Jul 22, 2025 at 04:14:18AM +0000, Jia He wrote:
> pcpu_embed_first_chunk() allocates the first percpu chunk via
> pcpu_fc_alloc() and used as-is without being mapped into vmalloc area. On
> NUMA systems, this can lead to a sparse CPU->unit mapping, resulting in a
> large physical address span (max_distance) and excessive vmalloc space
> requirements.

Why is the subject line "mm: percpu:" when this is driver-core code? And
if it is mm code, please cc: the mm maintainers and list please.

> For example, on an arm64 N2 server with 256 CPUs, the memory layout
> includes:
> [ 0.000000] NUMA: NODE_DATA [mem 0x100fffff0b00-0x100fffffffff]
> [ 0.000000] NUMA: NODE_DATA [mem 0x500fffff0b00-0x500fffffffff]
> [ 0.000000] NUMA: NODE_DATA [mem 0x600fffff0b00-0x600fffffffff]
> [ 0.000000] NUMA: NODE_DATA [mem 0x700ffffbcb00-0x700ffffcbfff]
>
> With the following NUMA distance matrix:
> node distances:
> node   0   1   2   3
>   0:  10  16  22  22
>   1:  16  10  22  22
>   2:  22  22  10  16
>   3:  22  22  16  10
>
> In this configuration, pcpu_embed_first_chunk() computes a large
> max_distance:
> percpu: max_distance=0x5fffbfac0000 too large for vmalloc space 0x7bff70000000
>
> As a result, the allocator falls back to pcpu_page_first_chunk(), which
> uses page-by-page allocation with nr_groups = 1, leading to degraded
> performance.

But that's intentional, you don't want to go across the nodes, right?

> This patch introduces a normalized CPU-to-NUMA node mapping to mitigate
> the issue. Distances of 10 and 16 are treated as local (LOCAL_DISTANCE),

Why? What is this going to now break on those systems that assumed that
those were NOT local?

> allowing CPUs from nearby nodes to be grouped together. Consequently,
> nr_groups will be 2 and pcpu_fc_alloc() uses the normalized node ID to
> allocate memory from a common node.
>
> For example:
> - cpu0 belongs to node 0
> - cpu64 belongs to node 1
> Both CPUs are considered local and will allocate memory from node 0.
>
> This normalization reduces max_distance:
> percpu: max_distance=0x500000380000, ~64% of vmalloc space 0x7bff70000000
>
> In addition, add a flag _need_norm_ to indicate the normalization is needed
> iff when cpu_to_norm_node_map[] is different from cpu_to_node_map[].
>
> Signed-off-by: Jia He <justin.he@arm.com>

I think this needs a lot of testing and verification and acks from
maintainers of other arches that can say "this also works for us" before
we can take it, as it has the potential to make major changes to
systems.

What did you test this on?

> ---
>  drivers/base/arch_numa.c | 47 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 46 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
> index c99f2ab105e5..f746d88239e9 100644
> --- a/drivers/base/arch_numa.c
> +++ b/drivers/base/arch_numa.c
> @@ -17,6 +17,8 @@
>  #include <asm/sections.h>
>
>  static int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
> +static int cpu_to_norm_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
> +static bool need_norm;

Shouldn't these be marked __initdata as you don't touch them afterward?

thanks,

greg k-h