From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R. Shenoy, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [Patch v4 02/22] sched/cache: Limit the number of CPUs scanned when calculating task occupancy
Date: Wed, 1 Apr 2026 14:52:14 -0700
Message-Id: <57ed5fcec9b242803fe4ea2ce6e7f3de6a6efc6b.1775065312.git.tim.c.chen@linux.intel.com>

From: Chen Yu

When NUMA balancing is enabled, the kernel currently iterates over all
online CPUs to aggregate process-wide occupancy data. On large systems,
this global scan introduces significant overhead.

To reduce scan latency, limit the search to a subset of relevant CPUs:

1. The task's preferred NUMA node.
2. The node where the task is currently running.
3. The node that contains the task's current preferred LLC.

While focusing solely on the preferred NUMA node would be ideal, a
process-wide scan must remain flexible because the "preferred node" is a
per-task attribute. Different threads within the same process may have
different preferred nodes, causing the process-wide preference to
migrate. Maintaining a mask that covers both the preferred node and the
nodes the task actually runs on preserves accuracy while significantly
reducing the number of CPUs inspected. Future work may integrate
numa_group to further refine task aggregation.

Suggested-by: Madadi Vineeth Reddy
Signed-off-by: Chen Yu
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
---

Notes:
    v3->v4: New patch.
 kernel/sched/fair.c | 46 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 45 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eb3cfb852a93..20a33900f4ea 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1431,6 +1431,50 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
 	}
 }
 
+static void get_scan_cpumasks(cpumask_var_t cpus, struct task_struct *p)
+{
+#ifdef CONFIG_NUMA_BALANCING
+	int cpu, curr_cpu, pref_nid;
+
+	if (!static_branch_likely(&sched_numa_balancing))
+		goto out;
+
+	cpu = p->mm->sc_stat.cpu;
+	curr_cpu = task_cpu(p);
+
+	/*
+	 * Scanning in the preferred NUMA node is ideal. However, the NUMA
+	 * preferred node is per-task rather than per-process. It is possible
+	 * for different threads of the process to have distinct preferred
+	 * nodes; consequently, the process-wide preferred LLC may bounce
+	 * between different nodes. As a workaround, maintain the scan
+	 * CPU mask to also cover the process's current preferred LLC and the
+	 * current running node to mitigate the bouncing risk.
+	 * TBD: numa_group should be considered during task aggregation.
+	 */
+	pref_nid = p->numa_preferred_nid;
+	/* honor the task's preferred node */
+	if (pref_nid == NUMA_NO_NODE)
+		goto out;
+
+	cpumask_or(cpus, cpus, cpumask_of_node(pref_nid));
+
+	/* honor the task's preferred LLC CPU */
+	if (cpu != -1 && !cpumask_test_cpu(cpu, cpus))
+		cpumask_or(cpus, cpus,
+			   cpumask_of_node(cpu_to_node(cpu)));
+
+	/* make sure the task's current running node is included */
+	if (!cpumask_test_cpu(curr_cpu, cpus))
+		cpumask_or(cpus, cpus, cpumask_of_node(cpu_to_node(curr_cpu)));
+
+	return;
+
+out:
+#endif
+	cpumask_copy(cpus, cpu_online_mask);
+}
+
 static void task_cache_work(struct callback_head *work)
 {
 	struct task_struct *p = current;
@@ -1451,7 +1495,7 @@ static void task_cache_work(struct callback_head *work)
 		return;
 
 	scoped_guard (cpus_read_lock) {
-		cpumask_copy(cpus, cpu_online_mask);
+		get_scan_cpumasks(cpus, p);
 
 		for_each_cpu(cpu, cpus) {
 			/* XXX sched_cluster_active */
-- 
2.32.0