From: Tim Chen
To: Peter Zijlstra, Ingo Molnar
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim Chen, Vincent Guittot, Libo Chen, Abel Wu,
	Len Brown, linux-kernel@vger.kernel.org, Chen Yu, K Prateek Nayak,
	"Gautham R . Shenoy", Zhao Liu, Vinicius Costa Gomes,
	Arjan Van De Ven
Subject: [PATCH v3 1/2] sched: Create architecture specific sched domain distances
Date: Thu, 11 Sep 2025 11:30:56 -0700
Message-Id: <1aa0ae94e95c45c8f3353f12e6494907df339632.1757614784.git.tim.c.chen@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Allow an architecture to supply its own sched domain NUMA distances,
which may differ from the NUMA node distances, for the purpose of
building NUMA sched domains.

The actual NUMA node distances are kept separately. This lets specific
architectures change the number of NUMA domain levels when sched
domains are built.

Consolidate the recording of unique NUMA distances in an array into
sched_record_numa_dist() so the function can be reused to record NUMA
distances when the NUMA distance metric is changed.

No functional change if no arch-specific NUMA distances are defined.

Co-developed-by: Vinicius Costa Gomes
Signed-off-by: Vinicius Costa Gomes
Signed-off-by: Tim Chen
Reviewed-by: Chen Yu
Reviewed-by: K Prateek Nayak
---
 include/linux/sched/topology.h |   2 +
 kernel/sched/topology.c        | 114 ++++++++++++++++++++++++++++-----
 2 files changed, 99 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 5263746b63e8..4f58e78ca52e 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -59,6 +59,8 @@ static inline int cpu_numa_flags(void)
 #endif
 
 extern int arch_asym_cpu_priority(int cpu);
+extern int arch_sched_node_distance(int from, int to);
+extern int sched_avg_remote_numa_distance;
 
 struct sched_domain_attr {
 	int relax_domain_level;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 977e133bb8a4..6c0ff62322cb 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1591,10 +1591,13 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
 enum numa_topology_type sched_numa_topology_type;
 
 static int sched_domains_numa_levels;
+static int sched_numa_node_levels;
 static int sched_domains_curr_level;
 
 int sched_max_numa_distance;
+int sched_avg_remote_numa_distance;
 static int *sched_domains_numa_distance;
+static int *sched_numa_node_distance;
 static struct cpumask ***sched_domains_numa_masks;
 #endif /* CONFIG_NUMA */
 
@@ -1808,10 +1811,10 @@ bool find_numa_distance(int distance)
 		return true;
 
 	rcu_read_lock();
-	distances = rcu_dereference(sched_domains_numa_distance);
+	distances = rcu_dereference(sched_numa_node_distance);
 	if (!distances)
 		goto unlock;
-	for (i = 0; i < sched_domains_numa_levels; i++) {
+	for (i = 0; i < sched_numa_node_levels; i++) {
 		if (distances[i] == distance) {
 			found = true;
 			break;
@@ -1887,14 +1890,32 @@ static void init_numa_topology_type(int offline_node)
 
 #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
 
-void sched_init_numa(int offline_node)
+/*
+ * Architecture could modify NUMA distance, to change
+ * grouping of NUMA nodes and number of NUMA levels when creating
+ * NUMA level sched domains.
+ *
+ * One NUMA level is created for each unique
+ * arch_sched_node_distance.
+ */
+int __weak arch_sched_node_distance(int from, int to)
+{
+	return node_distance(from, to);
+}
+
+static int numa_node_dist(int i, int j)
+{
+	return node_distance(i, j);
+}
+
+static int sched_record_numa_dist(int offline_node, int (*n_dist)(int, int),
+				  int **dist, int *levels)
+
 {
-	struct sched_domain_topology_level *tl;
 	unsigned long *distance_map;
 	int nr_levels = 0;
 	int i, j;
 	int *distances;
-	struct cpumask ***masks;
 
 	/*
 	 * O(nr_nodes^2) de-duplicating selection sort -- in order to find the
@@ -1902,17 +1923,17 @@ void sched_init_numa(int offline_node)
 	 */
 	distance_map = bitmap_alloc(NR_DISTANCE_VALUES, GFP_KERNEL);
 	if (!distance_map)
-		return;
+		return -ENOMEM;
 
 	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
 	for_each_cpu_node_but(i, offline_node) {
 		for_each_cpu_node_but(j, offline_node) {
-			int distance = node_distance(i, j);
+			int distance = n_dist(i, j);
 
 			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
 				sched_numa_warn("Invalid distance value range");
 				bitmap_free(distance_map);
-				return;
+				return -EINVAL;
 			}
 
 			bitmap_set(distance_map, distance, 1);
@@ -1927,17 +1948,66 @@ void sched_init_numa(int offline_node)
 	distances = kcalloc(nr_levels, sizeof(int), GFP_KERNEL);
 	if (!distances) {
 		bitmap_free(distance_map);
-		return;
+		return -ENOMEM;
 	}
-
 	for (i = 0, j = 0; i < nr_levels; i++, j++) {
 		j = find_next_bit(distance_map, NR_DISTANCE_VALUES, j);
 		distances[i] = j;
 	}
-	rcu_assign_pointer(sched_domains_numa_distance, distances);
+	*dist = distances;
+	*levels = nr_levels;
 
 	bitmap_free(distance_map);
 
+	return 0;
+}
+
+static int avg_remote_numa_distance(int offline_node)
+{
+	int i, j;
+	int distance, nr_remote = 0, total_distance = 0;
+
+	for_each_cpu_node_but(i, offline_node) {
+		for_each_cpu_node_but(j, offline_node) {
+			distance = node_distance(i, j);
+
+			if (distance >= REMOTE_DISTANCE) {
+				nr_remote++;
+				total_distance += distance;
+			}
+		}
+	}
+	if (nr_remote)
+		return total_distance / nr_remote;
+	else
+		return REMOTE_DISTANCE;
+}
+
+void sched_init_numa(int offline_node)
+{
+	struct sched_domain_topology_level *tl;
+	int nr_levels, nr_node_levels;
+	int i, j;
+	int *distances, *domain_distances;
+	struct cpumask ***masks;
+
+	if (sched_record_numa_dist(offline_node, numa_node_dist, &distances,
+				   &nr_node_levels))
+		return;
+
+	WRITE_ONCE(sched_avg_remote_numa_distance,
+		   avg_remote_numa_distance(offline_node));
+
+	if (sched_record_numa_dist(offline_node,
+				   arch_sched_node_distance, &domain_distances,
+				   &nr_levels)) {
+		kfree(distances);
+		return;
+	}
+	rcu_assign_pointer(sched_numa_node_distance, distances);
+	WRITE_ONCE(sched_max_numa_distance, distances[nr_node_levels - 1]);
+	WRITE_ONCE(sched_numa_node_levels, nr_node_levels);
+
 	/*
 	 * 'nr_levels' contains the number of unique distances
 	 *
@@ -1954,6 +2024,8 @@ void sched_init_numa(int offline_node)
 	 *
 	 * We reset it to 'nr_levels' at the end of this function.
 	 */
+	rcu_assign_pointer(sched_domains_numa_distance, domain_distances);
+
 	sched_domains_numa_levels = 0;
 
 	masks = kzalloc(sizeof(void *) * nr_levels, GFP_KERNEL);
@@ -1979,10 +2051,13 @@ void sched_init_numa(int offline_node)
 			masks[i][j] = mask;
 
 			for_each_cpu_node_but(k, offline_node) {
-				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
+				if (sched_debug() &&
+				    (arch_sched_node_distance(j, k) !=
+				     arch_sched_node_distance(k, j)))
 					sched_numa_warn("Node-distance not symmetric");
 
-				if (node_distance(j, k) > sched_domains_numa_distance[i])
+				if (arch_sched_node_distance(j, k) >
+				    sched_domains_numa_distance[i])
 					continue;
 
 				cpumask_or(mask, mask, cpumask_of_node(k));
@@ -2022,7 +2097,6 @@ void sched_init_numa(int offline_node)
 	sched_domain_topology = tl;
 
 	sched_domains_numa_levels = nr_levels;
-	WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
 
 	init_numa_topology_type(offline_node);
 }
@@ -2030,14 +2104,18 @@ void sched_init_numa(int offline_node)
 
 static void sched_reset_numa(void)
 {
-	int nr_levels, *distances;
+	int nr_levels, *distances, *dom_distances;
 	struct cpumask ***masks;
 
 	nr_levels = sched_domains_numa_levels;
+	sched_numa_node_levels = 0;
 	sched_domains_numa_levels = 0;
 	sched_max_numa_distance = 0;
+	sched_avg_remote_numa_distance = 0;
 	sched_numa_topology_type = NUMA_DIRECT;
-	distances = sched_domains_numa_distance;
+	distances = sched_numa_node_distance;
+	dom_distances = sched_domains_numa_distance;
+	rcu_assign_pointer(sched_numa_node_distance, NULL);
 	rcu_assign_pointer(sched_domains_numa_distance, NULL);
 	masks = sched_domains_numa_masks;
 	rcu_assign_pointer(sched_domains_numa_masks, NULL);
@@ -2054,6 +2132,7 @@ static void sched_reset_numa(void)
 			kfree(masks[i]);
 		}
 		kfree(masks);
+		kfree(dom_distances);
 	}
 	if (sched_domain_topology_saved) {
 		kfree(sched_domain_topology);
@@ -2092,7 +2171,8 @@ void sched_domains_numa_masks_set(unsigned int cpu)
 				continue;
 
 			/* Set ourselves in the remote node's masks */
-			if (node_distance(j, node) <= sched_domains_numa_distance[i])
+			if (arch_sched_node_distance(j, node) <=
+			    sched_domains_numa_distance[i])
 				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
 		}
 	}
-- 
2.32.0
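
Illustration only, not part of the patch: the userspace sketch below models why the patch records unique distances twice, once from the real node distances and once through an overridable callback. It assumes a made-up 4-node distance table and a hypothetical override that clamps every distance above 30 to 30. The names real_node_distance(), clamped_node_distance(), record_unique_distances(), count_levels() and avg_remote_distance() are invented for this example and are not kernel interfaces; only LOCAL_DISTANCE (10) and REMOTE_DISTANCE (20) mirror the kernel's values.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES		4
#define LOCAL_DISTANCE		10
#define REMOTE_DISTANCE		20
#define NR_DISTANCE_VALUES	256	/* illustrative bound, not the kernel's */

/* Made-up distance table; stands in for node_distance(). */
static const int node_table[NR_NODES][NR_NODES] = {
	{ 10, 21, 31, 32 },
	{ 21, 10, 32, 31 },
	{ 31, 32, 10, 21 },
	{ 32, 31, 21, 10 },
};

/* Real distances, kept separately from the sched domain distances. */
int real_node_distance(int from, int to)
{
	return node_table[from][to];
}

/* Hypothetical arch override: clamp every distance above 30 to 30. */
int clamped_node_distance(int from, int to)
{
	int d = node_table[from][to];

	return d > 30 ? 30 : d;
}

/*
 * Store the ascending unique distances reported by n_dist() in out[],
 * which must hold NR_DISTANCE_VALUES entries. Returns the number of
 * unique values (one per NUMA level), or -1 on an out-of-range distance.
 */
int record_unique_distances(int (*n_dist)(int, int), int *out)
{
	bool seen[NR_DISTANCE_VALUES] = { false };
	int i, j, nr = 0;

	assert(n_dist && out);

	for (i = 0; i < NR_NODES; i++) {
		for (j = 0; j < NR_NODES; j++) {
			int d = n_dist(i, j);

			if (d < LOCAL_DISTANCE || d >= NR_DISTANCE_VALUES)
				return -1;
			seen[d] = true;
		}
	}

	/* Walking the map in index order yields ascending distances. */
	for (i = 0; i < NR_DISTANCE_VALUES; i++) {
		if (seen[i])
			out[nr++] = i;
	}

	return nr;
}

/* Number of NUMA levels a given distance function would produce. */
int count_levels(int (*n_dist)(int, int))
{
	int out[NR_DISTANCE_VALUES];

	return record_unique_distances(n_dist, out);
}

/* Integer mean of all real distances that are >= REMOTE_DISTANCE. */
int avg_remote_distance(void)
{
	int i, j, nr_remote = 0, total = 0;

	for (i = 0; i < NR_NODES; i++) {
		for (j = 0; j < NR_NODES; j++) {
			int d = real_node_distance(i, j);

			if (d >= REMOTE_DISTANCE) {
				nr_remote++;
				total += d;
			}
		}
	}

	return nr_remote ? total / nr_remote : REMOTE_DISTANCE;
}
```

With this sample table, count_levels(real_node_distance) yields 4 levels (10, 21, 31, 32), while count_levels(clamped_node_distance) yields 3 (10, 21, 30), and avg_remote_distance() is 28. This is the split the patch makes: the real distances stay available for lookups such as find_numa_distance(), and only the sched domain levels follow the overridden metric.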