From nobody Sun Jun 14 19:14:31 2026
Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id DA35D3E122D;
	Wed, 20 May 2026 08:34:45 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=193.142.43.55
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779266087; cv=none;
 b=PLTfMiazxave/LyltVaw76aePjowQCDpjCVeW1MIEt5MrAoGiKuddw5GUIbbB/0H11q3VHRW5W1mmB9F9ABjGBHmMd5f34VO6bQpUb/AQVwsXtaMkke+OhAdHy8PR90TYdM7kBH9ldwydgyQQs95hHF5JDqOV9tAey6xxTMWyYw=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779266087; c=relaxed/simple;
	bh=ITOi/l4kTrCFUzld3zRZhIqj2SpbprYMxfDaHT2m98s=;
	h=Date:From:To:Subject:Cc:In-Reply-To:References:MIME-Version:
	 Message-ID:Content-Type;
 b=P/9PbpvBe1dfbovBcw5w7/0WkATlmMHqBjgOVX2ekcRu3fzS8no04XwiyeX5jXFCNt+P7U26gPgA8QiaKINwkCs05nkeHRoSAYyrfIQsLYo8cVrqsI4DCfBw5xHPCGuze3Fb6Wtt3OU/lJCT/jwjz1lJyjASh2e7H3oV/DgG8Gk=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linutronix.de;
 spf=pass smtp.mailfrom=linutronix.de;
 dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b=e48J7wIV;
 dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b=4TEBl69s; arc=none smtp.client-ip=193.142.43.55
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linutronix.de
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=linutronix.de
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b="e48J7wIV";
	dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b="4TEBl69s"
Date: Wed, 20 May 2026 08:34:43 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de;
	s=2020; t=1779266084;
	h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=dJRHmC1VfXhqnEBB1XwLf8xAe/zpbOXOf+xiXeGmfvg=;
	b=e48J7wIVCEYNAvAS+ePSuRmuB3bbvWnCln4QwOiC/aLunSHxRV2wPrAfSUXoqAGoolBPLS
	ybHi2o2kMwocOctd+z+gOTifEFYkhMqfW1CazGjiamy3m6hroqxd2BEZ/5Cc/0rTl+ORKB
	U8B+/s/nqdjZPWFEwNRQT2lAYQfgchjXA9KottqoE7csFAayP0A1iV4WhJFlkLABv4MO1X
	6eYsvo6DDx5sdtT3SMGgHY1VyU1pAAz+xYZ+WHuwVXsQaRSiJMFhH0Ut61dB2brJ104kvW
	99SiIGTpRAyONRE2a7HQo6HGhg1sHbe7cVrPHujWI6TixThf2mu8goMRphpnAg==
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de;
	s=2020e; t=1779266084;
	h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=dJRHmC1VfXhqnEBB1XwLf8xAe/zpbOXOf+xiXeGmfvg=;
	b=4TEBl69swZp9jTWW/1mcvIdvss/1RgHyiwOi7YJ4lE8yP18uusBqqR+PJGNXC8FWo1u4pz
	6XWLRf4kzUp4gECA==
From: "tip-bot2 for Chen Yu" <tip-bot2@linutronix.de>
Sender: tip-bot2@linutronix.de
Reply-to: linux-kernel@vger.kernel.org
To: linux-tip-commits@vger.kernel.org
Subject: [tip: sched/core] sched/cache: Enable cache aware scheduling for
 multi LLCs NUMA node
Cc: Libo Chen <libchen@purestorage.com>,
 Adam Li <adamli@os.amperecomputing.com>, Chen Yu <yu.c.chen@intel.com>,
 Tim Chen <tim.c.chen@linux.intel.com>,
 "Peter Zijlstra (Intel)" <peterz@infradead.org>, x86@kernel.org,
 linux-kernel@vger.kernel.org
In-Reply-To: =?utf-8?q?=3C71972e12ab4f08aff422b31e34df09bdbd94de84=2E1775065?=
 =?utf-8?q?312=2Egit=2Etim=2Ec=2Echen=40linux=2Eintel=2Ecom=3E?=
References: =?utf-8?q?=3C71972e12ab4f08aff422b31e34df09bdbd94de84=2E17750653?=
 =?utf-8?q?12=2Egit=2Etim=2Ec=2Echen=40linux=2Eintel=2Ecom=3E?=
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Message-ID: <177926608315.711.10499803798579189070.tip-bot2@tip-bot2>
Robot-ID: <tip-bot2@linutronix.de>
Robot-Unsubscribe: 
 Contact <mailto:tglx@kernel.org> to get blacklisted from these emails
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     d59f4fd1d303987f434bcf0b8191e89ca1d6a67c
Gitweb:        https://git.kernel.org/tip/d59f4fd1d303987f434bcf0b8191e89ca=
1d6a67c
Author:        Chen Yu <yu.c.chen@intel.com>
AuthorDate:    Wed, 01 Apr 2026 14:52:30 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 09 Apr 2026 15:49:51 +02:00

sched/cache: Enable cache aware scheduling for multi LLCs NUMA node

Introduce sched_cache_present to enable cache-aware scheduling on
NUMA nodes with multiple LLCs. Cache-aware load balancing should only
be enabled if there is more than one LLC within a NUMA node.
sched_cache_present indicates whether the platform has this
topology.
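
The condition above can be illustrated with a small user-space sketch.
It is not the kernel implementation (the patch derives the answer from
the sched domain hierarchy). The cpu_node[] and cpu_llc[] arrays and
the helper any_node_has_multi_llcs() are hypothetical names used only
for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not a kernel API: return true if at least one
 * NUMA node contains CPUs that belong to more than one LLC.
 * cpu_node[i] and cpu_llc[i] are assumed to hold the node id and the
 * LLC id of CPU i.
 */
static bool any_node_has_multi_llcs(const int *cpu_node, const int *cpu_llc,
				    int nr_cpus)
{
	assert(nr_cpus >= 0);

	for (int i = 0; i < nr_cpus; i++) {
		for (int j = i + 1; j < nr_cpus; j++) {
			/* Same node but different LLC: a multi-LLC node. */
			if (cpu_node[i] == cpu_node[j] &&
			    cpu_llc[i] != cpu_llc[j])
				return true;
		}
	}
	return false;
}
```

With four CPUs on one node split across two LLCs the helper returns
true. With two nodes that each have a single LLC it returns false, and
cache-aware scheduling would stay disabled.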

Test results:
The first test platform is a 2-socket Intel Sapphire Rapids with 30
cores per socket. DRAM interleaving is enabled in the BIOS, so it
essentially has one NUMA node with two last-level caches. There are 60
CPUs associated with each last-level cache.

The second test platform is an AMD Genoa. There are 4 nodes and 32 CPUs
per node. Each node has 2 CCXs and each CCX has 16 CPUs.

hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched
on these two platforms.

[TL;DR]
Sapphire Rapids:
hackbench shows significant improvement when the number of
active threads is below the capacity of an LLC.
schbench shows limited wakeup latency improvement.
ChaCha20-xiangshan (RISC-V simulator) shows good throughput
improvement. No obvious difference was observed in the
Hmean of netperf/stream/stress-ng.

Genoa:
Significant improvement is observed in hackbench when
the number of active threads is lower than the number
of CPUs within one LLC. On v2, Aaron reported improvement
in hackbench/redis when the system is underloaded.
ChaCha20-xiangshan shows huge throughput improvement.
Phoronix tested v1 and showed good improvements in 30+
cases[3]. No obvious difference was observed in the
Hmean of netperf/stream/stress-ng.

Detail:
Due to length constraints, data that does not differ much from
the baseline is omitted.

Sapphire Rapids:
[hackbench pipe]
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
case                    load            baseline(std%)  compare%( std%)
threads-pipe-10         1-groups         1.00 (  1.22)  +26.09 (  1.10)
threads-pipe-10         2-groups         1.00 (  4.90)  +22.88 (  0.18)
threads-pipe-10         4-groups         1.00 (  2.07)   +9.00 (  3.49)
threads-pipe-10         8-groups         1.00 (  8.13)   +3.45 (  3.62)
threads-pipe-16         1-groups         1.00 (  2.11)  +26.30 (  0.08)
threads-pipe-16         2-groups         1.00 ( 15.13)   -1.77 ( 11.89)
threads-pipe-16         4-groups         1.00 (  4.37)   +0.58 (  7.99)
threads-pipe-16         8-groups         1.00 (  2.88)   +2.71 (  3.50)
threads-pipe-2          1-groups         1.00 (  9.40)  +22.07 (  0.71)
threads-pipe-2          2-groups         1.00 (  9.99)  +18.01 (  0.95)
threads-pipe-2          4-groups         1.00 (  3.98)  +24.66 (  0.96)
threads-pipe-2          8-groups         1.00 (  7.00)  +21.83 (  0.23)
threads-pipe-20         1-groups         1.00 (  1.03)  +28.84 (  0.21)
threads-pipe-20         2-groups         1.00 (  4.42)  +31.90 (  3.15)
threads-pipe-20         4-groups         1.00 (  9.97)   +4.56 (  1.69)
threads-pipe-20         8-groups         1.00 (  1.87)   +1.25 (  0.74)
threads-pipe-4          1-groups         1.00 (  4.48)  +25.67 (  0.78)
threads-pipe-4          2-groups         1.00 (  9.14)   +4.91 (  2.08)
threads-pipe-4          4-groups         1.00 (  7.68)  +19.36 (  1.53)
threads-pipe-4          8-groups         1.00 ( 10.79)   +7.20 ( 12.20)
threads-pipe-8          1-groups         1.00 (  4.69)  +21.93 (  0.03)
threads-pipe-8          2-groups         1.00 (  1.16)  +25.29 (  0.65)
threads-pipe-8          4-groups         1.00 (  2.23)   -1.27 (  3.62)
threads-pipe-8          8-groups         1.00 (  4.65)   -3.08 (  2.75)

Note: The number of fds in hackbench is changed from the default of 20
to various values to ensure that threads fit within a single LLC,
especially on AMD systems. For example, in "threads-pipe-8, 2-groups"
the number of fds is 8 and 2 groups are created.
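
The sizing argument can be sketched as simple arithmetic. This
assumes hackbench's usual layout of one sender and one receiver task
per fd in each group. The helper names hackbench_tasks() and
fits_in_llc() are hypothetical and are not part of hackbench or the
kernel.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Assumed hackbench layout: each group runs nr_fds senders plus
 * nr_fds receivers, so a run creates 2 * nr_fds * nr_groups tasks.
 */
static int hackbench_tasks(int nr_fds, int nr_groups)
{
	assert(nr_fds > 0 && nr_groups > 0);
	return 2 * nr_fds * nr_groups;
}

/* True if every task of the run could get its own CPU in one LLC. */
static bool fits_in_llc(int nr_fds, int nr_groups, int cpus_per_llc)
{
	return hackbench_tasks(nr_fds, nr_groups) <= cpus_per_llc;
}
```

Under this assumption, "threads-pipe-8, 2-groups" creates 32 tasks.
That fits within a 60-CPU LLC on the Sapphire Rapids platform but not
within a 16-CPU CCX on Genoa.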

[schbench]
The 99th percentile wakeup latency improves somewhat when the
system is underloaded, but changes little as system utilization
increases.

99th Wakeup Latencies	Base (mean std)      Compare (mean std)   Change
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
thread=3D2                 9.00(0.00)           9.00(1.73)           0.00%
thread=3D4                 7.33(0.58)           6.33(0.58)           +13.64%
thread=3D8                 9.00(0.00)           7.67(1.15)           +14.78%
thread=3D16                8.67(0.58)           8.67(1.53)           0.00%
thread=3D32                9.00(0.00)           7.00(0.00)           +22.22%
thread=3D64                9.33(0.58)           9.67(0.58)           -3.64%
thread=3D128              12.00(0.00)          12.00(0.00)           0.00%

[chacha20 on simulated risc-v]
baseline:
Host time spent: 67861ms
cache aware scheduling enabled:
Host time spent: 54441ms

Time reduced by about 20% (a speedup of about 1.25x)

Genoa:
[hackbench pipe]
The default number of fds is 20, which exceeds the number of CPUs
in an LLC, so the number of fds is adjusted to 2, 4, 6, 8 and 20.
Excluding results with large run-to-run variance, a 10% to 50%
improvement is observed when the system is underloaded:

[hackbench pipe]
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
case                    load            baseline(std%)  compare%( std%)
threads-pipe-2          1-groups         1.00 (  2.89)  +47.33 (  1.20)
threads-pipe-2          2-groups         1.00 (  3.88)  +39.82 (  0.61)
threads-pipe-2          4-groups         1.00 (  8.76)   +5.57 ( 13.10)
threads-pipe-20         1-groups         1.00 (  4.61)  +11.72 (  1.06)
threads-pipe-20         2-groups         1.00 (  6.18)  +14.55 (  1.47)
threads-pipe-20         4-groups         1.00 (  2.99)  +10.16 (  4.49)
threads-pipe-4          1-groups         1.00 (  4.23)  +43.70 (  2.14)
threads-pipe-4          2-groups         1.00 (  3.68)   +8.45 (  4.04)
threads-pipe-4          4-groups         1.00 ( 17.72)   +2.42 (  1.14)
threads-pipe-6          1-groups         1.00 (  3.10)   +7.74 (  3.83)
threads-pipe-6          2-groups         1.00 (  3.42)  +14.26 (  4.53)
threads-pipe-6          4-groups         1.00 ( 10.34)  +10.94 (  7.12)
threads-pipe-8          1-groups         1.00 (  4.21)   +9.06 (  4.43)
threads-pipe-8          2-groups         1.00 (  1.88)   +3.74 (  0.58)
threads-pipe-8          4-groups         1.00 (  2.78)  +23.96 (  1.18)

[chacha20 on simulated risc-v]
baseline:
Host time spent: 54762ms
cache aware scheduling enabled:
Host time spent: 28295ms

Time reduced by 48%

Suggested-by: Libo Chen <libchen@purestorage.com>
Suggested-by: Adam Li <adamli@os.amperecomputing.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/71972e12ab4f08aff422b31e34df09bdbd94de84.177=
5065312.git.tim.c.chen@linux.intel.com
---
 kernel/sched/sched.h    |  4 +++-
 kernel/sched/topology.c | 19 +++++++++++++++++--
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a56619b..71f6077 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4069,9 +4069,11 @@ static inline void mm_cid_switch_to(struct task_stru=
ct *prev, struct task_struct
 #endif /* !CONFIG_SCHED_MM_CID */
=20
 #ifdef CONFIG_SCHED_CACHE
+DECLARE_STATIC_KEY_FALSE(sched_cache_present);
+
 static inline bool sched_cache_enabled(void)
 {
-	return false;
+	return static_branch_unlikely(&sched_cache_present);
 }
 #endif
=20
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8954bf7..6a36f8f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -821,6 +821,7 @@ enum s_alloc {
 };
=20
 #ifdef CONFIG_SCHED_CACHE
+DEFINE_STATIC_KEY_FALSE(sched_cache_present);
 static bool alloc_sd_llc(const struct cpumask *cpu_map,
 			 struct s_data *d)
 {
@@ -2777,6 +2778,7 @@ static int
 build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_att=
r *attr)
 {
 	enum s_alloc alloc_state =3D sa_none;
+	bool has_multi_llcs =3D false;
 	struct sched_domain *sd;
 	struct s_data d;
 	struct rq *rq =3D NULL;
@@ -2870,8 +2872,11 @@ build_sched_domains(const struct cpumask *cpu_map, s=
truct sched_domain_attr *att
 			 * In presence of higher domains, adjust the
 			 * NUMA imbalance stats for the hierarchy.
 			 */
-			if (IS_ENABLED(CONFIG_NUMA) && sd->parent)
-				adjust_numa_imbalance(sd);
+			if (sd->parent) {
+			    if (IS_ENABLED(CONFIG_NUMA))
+				    adjust_numa_imbalance(sd);
+			    has_multi_llcs =3D true;
+			}
 		}
 	}
=20
@@ -2912,6 +2917,16 @@ build_sched_domains(const struct cpumask *cpu_map, s=
truct sched_domain_attr *att
=20
 	ret =3D 0;
 error:
+#ifdef CONFIG_SCHED_CACHE
+	/*
+	 * TBD: check before writing to it. sched domain rebuild
+	 * is not in the critical path, leave as-is for now.
+	 */
+	if (!ret && has_multi_llcs)
+		static_branch_enable_cpuslocked(&sched_cache_present);
+	else
+		static_branch_disable_cpuslocked(&sched_cache_present);
+#endif
 	__free_domain_allocs(&d, alloc_state, cpu_map);
=20
 	return ret;