From: Huang Ying
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Huang Ying, Valentin Schneider,
    Ingo Molnar, Mel Gorman, Rik van Riel, Srikar Dronamraju
Subject: [RFC PATCH -V2] NUMA balancing: fix NUMA topology for systems with CPU-less nodes
Date: Tue, 8 Feb 2022 20:23:22 +0800
Message-Id: <20220208122322.604285-1-ying.huang@intel.com>

The NUMA topology parameters (sched_numa_topology_type,
sched_domains_numa_levels, and sched_max_numa_distance, etc.)
identified by the scheduler may be wrong for systems with CPU-less
nodes.

For example, the ACPI SLIT of a system with CPU-less persistent
memory (Intel Optane DCPMM) nodes is as follows,

[000h 0000   4]                    Signature : "SLIT"    [System Locality Information Table]
[004h 0004   4]                 Table Length : 0000042C
[008h 0008   1]                     Revision : 01
[009h 0009   1]                     Checksum : 59
[00Ah 0010   6]                       Oem ID : "XXXX"
[010h 0016   8]                 Oem Table ID : "XXXXXXX"
[018h 0024   4]                 Oem Revision : 00000001
[01Ch 0028   4]              Asl Compiler ID : "INTL"
[020h 0032   4]        Asl Compiler Revision : 20091013

[024h 0036   8]                   Localities : 0000000000000004
[02Ch 0044   4]                   Locality 0 : 0A 15 11 1C
[030h 0048   4]                   Locality 1 : 15 0A 1C 11
[034h 0052   4]                   Locality 2 : 11 1C 0A 1C
[038h 0056   4]                   Locality 3 : 1C 11 1C 0A

While the `numactl -H` output is as follows,

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 64136 MB
node 0 free: 5981 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 64466 MB
node 1 free: 10415 MB
node 2 cpus:
node 2 size: 253952 MB
node 2 free: 253920 MB
node 3 cpus:
node 3 size: 253952 MB
node 3 free: 253951 MB
node distances:
node   0   1   2   3
  0:  10  21  17  28
  1:  21  10  28  17
  2:  17  28  10  28
  3:  28  17  28  10

In this system, there are only 2 sockets.  In each memory controller,
both DRAM and PMEM DIMMs are installed.  Although the physical NUMA
topology is simple, the logical NUMA topology becomes a little
complex.  Because both distance(0, 1) and distance(1, 3) are less
than distance(0, 3), it appears that node 1 sits between node 0 and
node 3, and the whole system appears to be a glueless mesh NUMA
topology type.  But it definitely is not; there is not even a CPU in
node 3.

This isn't a practical problem yet, because the PMEM nodes (node 2
and node 3 in the example system) are offline by default during
system boot.  So init_numa_topology_type(), called during system
boot, ignores them and sets sched_numa_topology_type to NUMA_DIRECT.
After boot, init_numa_topology_type() is only called when a CPU of a
never-onlined-before node gets plugged in, and there is no CPU in the
PMEM nodes.  But it appears better to fix this to make the code more
robust.

To test the potential problem, we used a debug patch to call
init_numa_topology_type() when the PMEM node is onlined (in
__set_migration_target_nodes()).  With that, the NUMA parameters
identified by the scheduler are as follows,

sched_numa_topology_type:	NUMA_GLUELESS_MESH
sched_domains_numa_levels:	4
sched_max_numa_distance:	28

To fix the issue, the CPU-less nodes are ignored when the NUMA
topology parameters are identified.  Because a node may become
CPU-less or not at run time because of CPU hotplug, the NUMA topology
parameters need to be re-initialized at runtime for CPU hotplug too.

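As a rough illustration of the effect of ignoring CPU-less nodes (not
part of the patch), the user-space C sketch below recomputes the three
parameters from the node distance table of the example system.  The
names summarize(), struct topo_summary, and enum topo_type are
hypothetical, and the classification loop only mirrors the logic of
init_numa_topology_type() in simplified form.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_NODES	4
#define MAX_DISTANCE	256	/* SLIT entries are one byte each */

/* Hypothetical stand-ins for the scheduler's topology types. */
enum topo_type { TOPO_DIRECT, TOPO_GLUELESS_MESH, TOPO_BACKPLANE };

struct topo_summary {
	int levels;		/* number of unique distance values */
	int max_distance;	/* largest distance among the nodes used */
	enum topo_type type;
};

/* Node distances of the example system (from `numactl -H`). */
static const int example_distance[NR_NODES][NR_NODES] = {
	{ 10, 21, 17, 28 },
	{ 21, 10, 28, 17 },
	{ 17, 28, 10, 28 },
	{ 28, 17, 28, 10 },
};

/* Compute the parameters over the nodes n for which use[n] is true. */
static struct topo_summary summarize(const int dist[NR_NODES][NR_NODES],
				     const bool use[NR_NODES])
{
	struct topo_summary s = { 0, 0, TOPO_DIRECT };
	bool seen[MAX_DISTANCE];
	int a, b, c;

	/* Count unique distances and track the largest one. */
	memset(seen, 0, sizeof(seen));
	for (a = 0; a < NR_NODES; a++) {
		for (b = 0; b < NR_NODES; b++) {
			int d = dist[a][b];

			if (!use[a] || !use[b])
				continue;
			assert(d >= 0 && d < MAX_DISTANCE);
			if (!seen[d]) {
				seen[d] = true;
				s.levels++;
			}
			if (d > s.max_distance)
				s.max_distance = d;
		}
	}

	/* At most one remote distance: all nodes are directly connected. */
	if (s.levels <= 2)
		return s;

	for (a = 0; a < NR_NODES; a++) {
		for (b = 0; b < NR_NODES; b++) {
			if (!use[a] || !use[b])
				continue;
			/* Find two nodes furthest removed from each other. */
			if (dist[a][b] < s.max_distance)
				continue;
			/* Is there an intermediary node between a and b? */
			for (c = 0; c < NR_NODES; c++) {
				if (use[c] &&
				    dist[a][c] < s.max_distance &&
				    dist[b][c] < s.max_distance) {
					s.type = TOPO_GLUELESS_MESH;
					return s;
				}
			}
			s.type = TOPO_BACKPLANE;
			return s;
		}
	}
	return s;
}
```

Calling summarize(example_distance, use) with use[] set for all four
nodes gives 4 levels, a maximum distance of 28, and
TOPO_GLUELESS_MESH.  With only nodes 0 and 1 set, it gives 2 levels, a
maximum distance of 21, and TOPO_DIRECT, which matches the values
reported in this message.
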
With the patch, the NUMA parameters identified for the example system
above are as follows,

sched_numa_topology_type:	NUMA_DIRECT
sched_domains_numa_levels:	2
sched_max_numa_distance:	21

Signed-off-by: "Huang, Ying"
Suggested-by: Peter Zijlstra
Cc: Valentin Schneider
Cc: Ingo Molnar
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Srikar Dronamraju
---
 kernel/sched/core.c     |   4 +-
 kernel/sched/fair.c     |   2 +-
 kernel/sched/sched.h    |   3 +-
 kernel/sched/topology.c | 204 +++++++++++++++++++++++-----------------
 4 files changed, 126 insertions(+), 87 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 848eaa0efe0e..ec97834dbc0e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9044,6 +9044,7 @@ int sched_cpu_activate(unsigned int cpu)
 	set_cpu_active(cpu, true);
 
 	if (sched_smp_initialized) {
+		sched_reinit_numa(true, cpu);
 		sched_domains_numa_masks_set(cpu);
 		cpuset_cpu_active();
 	}
@@ -9122,6 +9123,7 @@ int sched_cpu_deactivate(unsigned int cpu)
 	if (!sched_smp_initialized)
 		return 0;
 
+	sched_reinit_numa(false, cpu);
 	ret = cpuset_cpu_inactive(cpu);
 	if (ret) {
 		balance_push_set(cpu, false);
@@ -9228,7 +9230,7 @@ int sched_cpu_dying(unsigned int cpu)
 
 void __init sched_init_smp(void)
 {
-	sched_init_numa();
+	sched_init_numa(NUMA_NO_NODE);
 
 	/*
 	 * There's no userspace yet to cause hotplug operations; hence all the
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5146163bfabb..fe5450ef78e0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1283,7 +1283,7 @@ static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
 		 * The furthest away nodes in the system are not interesting
 		 * for placement; nid was already counted.
 		 */
-		if (dist == sched_max_numa_distance || node == nid)
+		if (dist >= sched_max_numa_distance || node == nid)
 			continue;
 
 		/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index de53be905739..0481a385c7a9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1662,7 +1662,8 @@ enum numa_topology_type {
 extern enum numa_topology_type sched_numa_topology_type;
 extern int sched_max_numa_distance;
 extern bool find_numa_distance(int distance);
-extern void sched_init_numa(void);
+extern void sched_init_numa(int offline_node);
+extern void sched_reinit_numa(bool online, int cpu);
 extern void sched_domains_numa_masks_set(unsigned int cpu);
 extern void sched_domains_numa_masks_clear(unsigned int cpu);
 extern int sched_numa_find_closest(const struct cpumask *cpus, int cpu);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..82107788dc3d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1492,8 +1492,6 @@ static int sched_domains_curr_level;
 int				sched_max_numa_distance;
 static int			*sched_domains_numa_distance;
 static struct cpumask		***sched_domains_numa_masks;
-
-static unsigned long __read_mostly *sched_numa_onlined_nodes;
 #endif
 
 /*
@@ -1651,6 +1649,7 @@ static struct sched_domain_topology_level default_topology[] = {
 
 static struct sched_domain_topology_level *sched_domain_topology =
 	default_topology;
+static struct sched_domain_topology_level *sched_domain_topology_saved;
 
 #define for_each_sd_topology(tl)			\
 	for (tl = sched_domain_topology; tl->mask; tl++)
@@ -1661,6 +1660,7 @@ void set_sched_topology(struct sched_domain_topology_level *tl)
 		return;
 
 	sched_domain_topology = tl;
+	sched_domain_topology_saved = NULL;
 }
 
 #ifdef CONFIG_NUMA
@@ -1684,8 +1684,12 @@ static void sched_numa_warn(const char *str)
 
 	for (i = 0; i < nr_node_ids; i++) {
 		printk(KERN_WARNING " ");
-		for (j = 0; j < nr_node_ids; j++)
-			printk(KERN_CONT "%02d ", node_distance(i,j));
+		for (j = 0; j < nr_node_ids; j++) {
+			if (!node_state(i, N_CPU) || !node_state(j, N_CPU))
+				printk(KERN_CONT "(%02d) ", node_distance(i,j));
+			else
+				printk(KERN_CONT " %02d ", node_distance(i,j));
+		}
 		printk(KERN_CONT "\n");
 	}
 	printk(KERN_WARNING "\n");
@@ -1693,19 +1697,34 @@ static void sched_numa_warn(const char *str)
 
 bool find_numa_distance(int distance)
 {
-	int i;
+	bool found = false;
+	int i, *distances;
 
 	if (distance == node_distance(0, 0))
 		return true;
 
+	rcu_read_lock();
+	distances = rcu_dereference(sched_domains_numa_distance);
+	if (!distances)
+		goto unlock;
 	for (i = 0; i < sched_domains_numa_levels; i++) {
-		if (sched_domains_numa_distance[i] == distance)
-			return true;
+		if (distances[i] == distance) {
+			found = true;
+			break;
+		}
 	}
+unlock:
+	rcu_read_unlock();
 
-	return false;
+	return found;
 }
 
+#define for_each_cpu_node_but(n, nbut)		\
+	for_each_node_state(n, N_CPU)		\
+		if (n == nbut)			\
+			continue;		\
+		else
+
 /*
  * A system can have three types of NUMA topology:
  * NUMA_DIRECT: all nodes are directly connected, or not a NUMA system
@@ -1725,7 +1744,7 @@ bool find_numa_distance(int distance)
  *   there is an intermediary node C, which is < N hops away from both
  *   nodes A and B, the system is a glueless mesh.
  */
-static void init_numa_topology_type(void)
+static void init_numa_topology_type(int offline_node)
 {
 	int a, b, c, n;
 
@@ -1736,14 +1755,14 @@ static void init_numa_topology_type(void)
 		return;
 	}
 
-	for_each_online_node(a) {
-		for_each_online_node(b) {
+	for_each_cpu_node_but(a, offline_node) {
+		for_each_cpu_node_but(b, offline_node) {
 			/* Find two nodes furthest removed from each other. */
 			if (node_distance(a, b) < n)
 				continue;
 
 			/* Is there an intermediary node between a and b? */
-			for_each_online_node(c) {
+			for_each_cpu_node_but(c, offline_node) {
 				if (node_distance(a, c) < n &&
 				    node_distance(b, c) < n) {
 					sched_numa_topology_type =
@@ -1756,17 +1775,22 @@ static void init_numa_topology_type(void)
 			return;
 		}
 	}
+
+	pr_err("Failed to find a NUMA topology type, defaulting to DIRECT\n");
+	sched_numa_topology_type = NUMA_DIRECT;
 }
 
 
 #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
 
-void sched_init_numa(void)
+void sched_init_numa(int offline_node)
 {
 	struct sched_domain_topology_level *tl;
 	unsigned long *distance_map;
 	int nr_levels = 0;
 	int i, j;
+	int *distances;
+	struct cpumask ***masks;
 
 	/*
 	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
@@ -1777,12 +1801,13 @@ void sched_init_numa(void)
 		return;
 
 	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
-	for (i = 0; i < nr_node_ids; i++) {
-		for (j = 0; j < nr_node_ids; j++) {
+	for_each_cpu_node_but(i, offline_node) {
+		for_each_cpu_node_but(j, offline_node) {
 			int distance = node_distance(i, j);
 
 			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
 				sched_numa_warn("Invalid distance value range");
+				bitmap_free(distance_map);
 				return;
 			}
 
@@ -1795,16 +1820,17 @@ void sched_init_numa(void)
 	 */
 	nr_levels = bitmap_weight(distance_map, NR_DISTANCE_VALUES);
 
-	sched_domains_numa_distance = kcalloc(nr_levels, sizeof(int), GFP_KERNEL);
-	if (!sched_domains_numa_distance) {
+	distances = kcalloc(nr_levels, sizeof(int), GFP_KERNEL);
+	if (!distances) {
 		bitmap_free(distance_map);
 		return;
 	}
 
 	for (i = 0, j = 0; i < nr_levels; i++, j++) {
 		j = find_next_bit(distance_map, NR_DISTANCE_VALUES, j);
-		sched_domains_numa_distance[i] = j;
+		distances[i] = j;
 	}
+	rcu_assign_pointer(sched_domains_numa_distance, distances);
 
 	bitmap_free(distance_map);
 
@@ -1826,8 +1852,8 @@ void sched_init_numa(void)
 	 */
 	sched_domains_numa_levels = 0;
 
-	sched_domains_numa_masks = kzalloc(sizeof(void *) * nr_levels, GFP_KERNEL);
-	if (!sched_domains_numa_masks)
+	masks = kzalloc(sizeof(void *) * nr_levels, GFP_KERNEL);
+	if (!masks)
 		return;
 
 	/*
@@ -1835,31 +1861,20 @@ void sched_init_numa(void)
 	 * CPUs of nodes that are that many hops away from us.
 	 */
 	for (i = 0; i < nr_levels; i++) {
-		sched_domains_numa_masks[i] =
-			kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL);
-		if (!sched_domains_numa_masks[i])
+		masks[i] = kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL);
+		if (!masks[i])
 			return;
 
-		for (j = 0; j < nr_node_ids; j++) {
+		for_each_cpu_node_but(j, offline_node) {
 			struct cpumask *mask = kzalloc(cpumask_size(), GFP_KERNEL);
 			int k;
 
 			if (!mask)
 				return;
 
-			sched_domains_numa_masks[i][j] = mask;
-
-			for_each_node(k) {
-				/*
-				 * Distance information can be unreliable for
-				 * offline nodes, defer building the node
-				 * masks to its bringup.
-				 * This relies on all unique distance values
-				 * still being visible at init time.
-				 */
-				if (!node_online(j))
-					continue;
+			masks[i][j] = mask;
 
+			for_each_cpu_node_but(k, offline_node) {
 				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
 					sched_numa_warn("Node-distance not symmetric");
 
@@ -1870,6 +1885,7 @@ void sched_init_numa(void)
 			}
 		}
 	}
+	rcu_assign_pointer(sched_domains_numa_masks, masks);
 
 	/* Compute default topology size */
 	for (i = 0; sched_domain_topology[i].mask; i++);
@@ -1907,59 +1923,67 @@ void sched_init_numa(void)
 		};
 	}
 
+	sched_domain_topology_saved = sched_domain_topology;
 	sched_domain_topology = tl;
 
 	sched_domains_numa_levels = nr_levels;
 	sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
 
-	init_numa_topology_type();
-
-	sched_numa_onlined_nodes = bitmap_alloc(nr_node_ids, GFP_KERNEL);
-	if (!sched_numa_onlined_nodes)
-		return;
-
-	bitmap_zero(sched_numa_onlined_nodes, nr_node_ids);
-	for_each_online_node(i)
-		bitmap_set(sched_numa_onlined_nodes, i, 1);
+	init_numa_topology_type(offline_node);
 }
 
-static void __sched_domains_numa_masks_set(unsigned int node)
-{
-	int i, j;
-
-	/*
-	 * NUMA masks are not built for offline nodes in sched_init_numa().
-	 * Thus, when a CPU of a never-onlined-before node gets plugged in,
-	 * adding that new CPU to the right NUMA masks is not sufficient: the
-	 * masks of that CPU's node must also be updated.
-	 */
-	if (test_bit(node, sched_numa_onlined_nodes))
-		return;
 
-	bitmap_set(sched_numa_onlined_nodes, node, 1);
-
-	for (i = 0; i < sched_domains_numa_levels; i++) {
-		for (j = 0; j < nr_node_ids; j++) {
-			if (!node_online(j) || node == j)
-				continue;
+void sched_reset_numa(void)
+{
+	int nr_levels, *distances;
+	struct cpumask ***masks;
 
-			if (node_distance(j, node) > sched_domains_numa_distance[i])
+	nr_levels = sched_domains_numa_levels;
+	sched_domains_numa_levels = 0;
+	sched_max_numa_distance = 0;
+	sched_numa_topology_type = NUMA_DIRECT;
+	distances = sched_domains_numa_distance;
+	rcu_assign_pointer(sched_domains_numa_distance, NULL);
+	masks = sched_domains_numa_masks;
+	rcu_assign_pointer(sched_domains_numa_masks, NULL);
+	if (distances || masks) {
+		int i, j;
+
+		synchronize_rcu();
+		kfree(distances);
+		for (i = 0; i < nr_levels && masks; i++) {
+			if (!masks[i])
 				continue;
-
-			/* Add remote nodes in our masks */
-			cpumask_or(sched_domains_numa_masks[i][node],
-				   sched_domains_numa_masks[i][node],
-				   sched_domains_numa_masks[0][j]);
+			for_each_node(j)
+				kfree(masks[i][j]);
+			kfree(masks[i]);
 		}
+		kfree(masks);
 	}
+	if (sched_domain_topology_saved) {
+		kfree(sched_domain_topology);
+		sched_domain_topology = sched_domain_topology_saved;
+		sched_domain_topology_saved = NULL;
+	}
+}
+
+/*
+ * Call with hotplug lock held
+ */
+void sched_reinit_numa(bool online, int cpu)
+{
+	int node;
 
+	node = cpu_to_node(cpu);
 	/*
-	 * A new node has been brought up, potentially changing the topology
-	 * classification.
-	 *
-	 * Note that this is racy vs any use of sched_numa_topology_type :/
+	 * Scheduler NUMA topology is updated when the first CPU of a
+	 * node is onlined or the last CPU of a node is offlined.
 	 */
-	init_numa_topology_type();
+	if (cpumask_weight(cpumask_of_node(node)) != 1)
+		return;
+
+	sched_reset_numa();
+	sched_init_numa(online ? NUMA_NO_NODE : node);
 }
 
 void sched_domains_numa_masks_set(unsigned int cpu)
@@ -1967,11 +1991,9 @@ void sched_domains_numa_masks_set(unsigned int cpu)
 	int node = cpu_to_node(cpu);
 	int i, j;
 
-	__sched_domains_numa_masks_set(node);
-
 	for (i = 0; i < sched_domains_numa_levels; i++) {
 		for (j = 0; j < nr_node_ids; j++) {
-			if (!node_online(j))
+			if (!node_state(j, N_CPU))
 				continue;
 
 			/* Set ourselves in the remote node's masks */
@@ -1986,8 +2008,10 @@ void sched_domains_numa_masks_clear(unsigned int cpu)
 	int i, j;
 
 	for (i = 0; i < sched_domains_numa_levels; i++) {
-		for (j = 0; j < nr_node_ids; j++)
-			cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
+		for (j = 0; j < nr_node_ids; j++) {
+			if (sched_domains_numa_masks[i][j])
+				cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
+		}
 	}
 }
 
@@ -2001,14 +2025,26 @@ void sched_domains_numa_masks_clear(unsigned int cpu)
  */
 int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
 {
-	int i, j = cpu_to_node(cpu);
+	int i, j = cpu_to_node(cpu), found = nr_cpu_ids;
+	struct cpumask ***masks;
 
+	rcu_read_lock();
+	masks = rcu_dereference(sched_domains_numa_masks);
+	if (!masks)
+		goto unlock;
 	for (i = 0; i < sched_domains_numa_levels; i++) {
-		cpu = cpumask_any_and(cpus, sched_domains_numa_masks[i][j]);
-		if (cpu < nr_cpu_ids)
-			return cpu;
+		if (!masks[i][j])
+			break;
+		cpu = cpumask_any_and(cpus, masks[i][j]);
+		if (cpu < nr_cpu_ids) {
+			found = cpu;
+			break;
+		}
 	}
-	return nr_cpu_ids;
+unlock:
+	rcu_read_unlock();
+
+	return found;
 }
 
 #endif /* CONFIG_NUMA */
-- 
2.30.2