From nobody Sun Jun 14 17:52:49 2026
Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id DD5703EE1D5;
	Wed, 20 May 2026 08:35:03 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=193.142.43.55
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779266105; cv=none;
 b=qesusfzFgBeLvcphemDj1PmbeMlgN1wBtYIzjlqjzRuDMc6zUsrC7GGzd5Llw5vF8XGCXEoFdXdDjW2tmHgJj0rRFW28LYFmqNfTABHIgRDikN9CECWaJtrPcjUTU0w2FL101UniLJkHoxZwejtmu/LTDLXq/Uhn5T0TrvLqBok=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779266105; c=relaxed/simple;
	bh=gvCoq7A5YTVBskV+wHdBiDVZEFfSGHHCCdL5R992j+k=;
	h=Date:From:To:Subject:Cc:In-Reply-To:References:MIME-Version:
	 Message-ID:Content-Type;
 b=jQL1QOCFLb5q9+RounueBVgj3fij0J4/IHWDaJVVp5PB8WkYgSFfVkE2q5xvr7u30C8g9tvZjbzw7pEyeJlekWO+Kteb0ydYt0d2LhK1vTywEo6iLcUHKmqxlk0z+4+C+VlCadFX73/Em3rg9PIf3RcLRQcBIBLdkw1GeOJMrXU=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linutronix.de;
 spf=pass smtp.mailfrom=linutronix.de;
 dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b=SMNmv7Wa;
 dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b=5B3EKOqH; arc=none smtp.client-ip=193.142.43.55
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linutronix.de
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=linutronix.de
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b="SMNmv7Wa";
	dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b="5B3EKOqH"
Date: Wed, 20 May 2026 08:35:01 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de;
	s=2020; t=1779266102;
	h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=cXx0baI+bq3HjIi8NvfPawMgqk6KRLXF8ME4l3SreVM=;
	b=SMNmv7WakxJZRe8oGf2096tXMz9vcPtdygxZN1h9auv9CnUbct71gR1gJVU6Pan7f5p04e
	yI152YWCbu6RoNjo3cEexHnHKDh4hTnPVIpe4/q9n2zmPoRI33TP5i5XySawGqkpHQahdR
	yT99PJpfgvs5bAh0V8nlNiEHZhHuth6Z6vEx2EQc6I5juIrKMTJeEmgAly/tCOjMhUEift
	W+wVNhUTvP41pH0EvuAVTNuuM2sZVLNef8cMlibAcoaCzJfhis6NfeofaI/AEjMk5d5nam
	hm3YkwFibzKaRATLWDDKH3fI5gfaUuzAqrFxa1ZT1fNklh8kAj7yShdboz+hQw==
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de;
	s=2020e; t=1779266102;
	h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=cXx0baI+bq3HjIi8NvfPawMgqk6KRLXF8ME4l3SreVM=;
	b=5B3EKOqH/Gr+xmjmsgJePoXXZp2tLLE9NpbjMSel9YwE5605DsJO4uLMZlWY1p9uc1clNu
	XC+Kyc3u81xF4aDQ==
From: "tip-bot2 for Chen Yu" <tip-bot2@linutronix.de>
Sender: tip-bot2@linutronix.de
Reply-to: linux-kernel@vger.kernel.org
To: linux-tip-commits@vger.kernel.org
Subject: [tip: sched/core] sched/cache: Limit the scan number of CPUs when
 calculating task occupancy
Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
 Chen Yu <yu.c.chen@intel.com>, Tim Chen <tim.c.chen@linux.intel.com>,
 "Peter Zijlstra (Intel)" <peterz@infradead.org>, x86@kernel.org,
 linux-kernel@vger.kernel.org
In-Reply-To: =?utf-8?q?=3C57ed5fcec9b242803fe4ea2ce6e7f3de6a6efc6b=2E1775065?=
 =?utf-8?q?312=2Egit=2Etim=2Ec=2Echen=40linux=2Eintel=2Ecom=3E?=
References: =?utf-8?q?=3C57ed5fcec9b242803fe4ea2ce6e7f3de6a6efc6b=2E17750653?=
 =?utf-8?q?12=2Egit=2Etim=2Ec=2Echen=40linux=2Eintel=2Ecom=3E?=
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Message-ID: <177926610113.711.7480700058796646491.tip-bot2@tip-bot2>
Robot-ID: <tip-bot2@linutronix.de>
Robot-Unsubscribe: 
 Contact <mailto:tglx@kernel.org> to get blacklisted from these emails
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     b4606faab3188beeacc2287b8a369cca943cc8eb
Gitweb:        https://git.kernel.org/tip/b4606faab3188beeacc2287b8a369cca9=
43cc8eb
Author:        Chen Yu <yu.c.chen@intel.com>
AuthorDate:    Wed, 01 Apr 2026 14:52:14 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 09 Apr 2026 15:49:47 +02:00

sched/cache: Limit the scan number of CPUs when calculating task occupancy

The kernel currently iterates over all online CPUs to aggregate
process-wide occupancy data. On large systems, this global scan
introduces significant overhead.

When NUMA balancing is enabled and the task has a preferred node, reduce
scan latency by limiting the scan to the CPUs of the following nodes:
1. The task's preferred NUMA node.
2. The node where the task is currently running.
3. The node that contains the task's current preferred LLC.

While scanning only the preferred NUMA node would be ideal, the
preferred node is a per-task attribute. Different threads within the
same process may have different preferred nodes, so the process-wide
preferred LLC can bounce between nodes. Extending the mask to also cover
the node of the current preferred LLC and the node the task is running
on mitigates that bouncing while still significantly reducing the number
of CPUs inspected.
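
The node selection described above can be sketched in userspace as
follows. This is only an illustration, not the kernel code: the 16-CPU,
4-node topology and every toy_* name, NO_NODE and NO_CPU are
hypothetical stand-ins for cpu_to_node(), cpumask_of_node(),
cpu_online_mask, NUMA_NO_NODE and the -1 "no preferred LLC CPU" value.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed toy topology: 16 CPUs, 4 per node, CPU n sits on node n / 4. */
#define TOY_NR_CPUS        16
#define TOY_CPUS_PER_NODE  4
#define NO_NODE            (-1)	/* stand-in for NUMA_NO_NODE */
#define NO_CPU             (-1)	/* stand-in for "no preferred LLC CPU" */

static int toy_cpu_to_node(int cpu)
{
	return cpu / TOY_CPUS_PER_NODE;
}

/* Bitmask of the CPUs that belong to node @nid. */
static uint32_t toy_node_mask(int nid)
{
	uint32_t per_node = (1u << TOY_CPUS_PER_NODE) - 1;

	return per_node << (nid * TOY_CPUS_PER_NODE);
}

static uint32_t toy_online_mask(void)
{
	return (1u << TOY_NR_CPUS) - 1;
}

/*
 * Build the scan mask from the three inputs named in the changelog.
 * Without a preferred node, fall back to scanning every online CPU.
 * OR-ing a node that is already covered is a no-op, so the "is it
 * already in the mask?" checks of the real patch are omitted here.
 */
static uint32_t toy_scan_mask(int pref_nid, int llc_cpu, int curr_cpu)
{
	uint32_t mask;

	if (pref_nid == NO_NODE)
		return toy_online_mask();

	/* 1. the task's preferred NUMA node */
	mask = toy_node_mask(pref_nid);

	/* 2. the node containing the current preferred LLC, if any */
	if (llc_cpu != NO_CPU)
		mask |= toy_node_mask(toy_cpu_to_node(llc_cpu));

	/* 3. the node the task is currently running on */
	mask |= toy_node_mask(toy_cpu_to_node(curr_cpu));

	return mask;
}
```

In this toy model, a task that prefers node 1, has no preferred LLC CPU
and runs on CPU 5 scans only the four CPUs of node 1 instead of all 16.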

Future work may integrate numa_group to further refine task aggregation.

Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/57ed5fcec9b242803fe4ea2ce6e7f3de6a6efc6b.177=
5065312.git.tim.c.chen@linux.intel.com
---
 kernel/sched/fair.c | 47 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c9cd064..a55ada2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1522,6 +1522,51 @@ static void task_tick_cache(struct rq *rq, struct ta=
sk_struct *p)
 	}
 }
=20
+static void get_scan_cpumasks(cpumask_var_t cpus, struct task_struct *p)
+{
+#ifdef CONFIG_NUMA_BALANCING
+	int cpu, curr_cpu, nid, pref_nid;
+
+	if (!static_branch_likely(&sched_numa_balancing))
+		goto out;
+
+	cpu =3D p->mm->sc_stat.cpu;
+	if (cpu !=3D -1)
+		nid =3D cpu_to_node(cpu);
+	curr_cpu =3D task_cpu(p);
+
+	/*
+	 * Scanning in the preferred NUMA node is ideal. However, the NUMA
+	 * preferred node is per-task rather than per-process. It is possible
+	 * for different threads of the process to have distinct preferred
+	 * nodes; consequently, the process-wide preferred LLC may bounce
+	 * between different nodes. As a workaround, maintain the scan
+	 * CPU mask to also cover the process's current preferred LLC and the
+	 * current running node to mitigate the bouncing risk.
+	 * TBD: numa_group should be considered during task aggregation.
+	 */
+	pref_nid =3D p->numa_preferred_nid;
+	/* honor the task's preferred node */
+	if (pref_nid =3D=3D NUMA_NO_NODE)
+		goto out;
+
+	cpumask_or(cpus, cpus, cpumask_of_node(pref_nid));
+
+	/* honor the task's preferred LLC CPU */
+	if (cpu !=3D -1 && !cpumask_test_cpu(cpu, cpus) && nid !=3D NUMA_NO_NODE)
+		cpumask_or(cpus, cpus, cpumask_of_node(nid));
+
+	/* make sure the task's current running node is included */
+	if (!cpumask_test_cpu(curr_cpu, cpus))
+		cpumask_or(cpus, cpus, cpumask_of_node(cpu_to_node(curr_cpu)));
+
+	return;
+
+out:
+#endif
+	cpumask_copy(cpus, cpu_online_mask);
+}
+
 static void task_cache_work(struct callback_head *work)
 {
 	struct task_struct *p =3D current;
@@ -1544,7 +1589,7 @@ static void task_cache_work(struct callback_head *wor=
k)
 	scoped_guard (cpus_read_lock) {
 		guard(rcu)();
=20
-		cpumask_copy(cpus, cpu_online_mask);
+		get_scan_cpumasks(cpus, p);
=20
 		for_each_cpu(cpu, cpus) {
 			/* XXX sched_cluster_active */