From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R. Shenoy, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [Patch v4 02/22] sched/cache: Limit the number of CPUs scanned when calculating task occupancy
Date: Wed, 1 Apr 2026 14:52:14 -0700
Message-Id: <57ed5fcec9b242803fe4ea2ce6e7f3de6a6efc6b.1775065312.git.tim.c.chen@linux.intel.com>

From: Chen Yu

When NUMA balancing is enabled, the kernel currently iterates over all
online CPUs to aggregate process-wide occupancy data. On large systems,
this global scan introduces significant overhead.

To reduce scan latency, limit the search to a subset of relevant CPUs:

1. The task's preferred NUMA node.
2. The node where the task is currently running.
3. The node that contains the task's current preferred LLC.

While focusing solely on the preferred NUMA node would be ideal, a
process-wide scan must remain flexible because the "preferred node" is a
per-task attribute. Different threads within the same process may have
different preferred nodes, causing the process-wide preference to
migrate. Maintaining a mask that covers both the preferred node and the
nodes the task actually runs on preserves accuracy while significantly
reducing the number of CPUs inspected. Future work may integrate
numa_group to further refine task aggregation.

Suggested-by: Madadi Vineeth Reddy
Signed-off-by: Chen Yu
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
---

Notes:
    v3->v4: New patch.
 kernel/sched/fair.c | 46 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 45 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eb3cfb852a93..20a33900f4ea 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1431,6 +1431,50 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
 	}
 }
 
+static void get_scan_cpumasks(cpumask_var_t cpus, struct task_struct *p)
+{
+#ifdef CONFIG_NUMA_BALANCING
+	int cpu, curr_cpu, pref_nid;
+
+	if (!static_branch_likely(&sched_numa_balancing))
+		goto out;
+
+	cpu = p->mm->sc_stat.cpu;
+	curr_cpu = task_cpu(p);
+
+	/*
+	 * Scanning in the preferred NUMA node is ideal. However, the NUMA
+	 * preferred node is per-task rather than per-process. It is possible
+	 * for different threads of the process to have distinct preferred
+	 * nodes; consequently, the process-wide preferred LLC may bounce
+	 * between different nodes. As a workaround, maintain the scan
+	 * CPU mask to also cover the process's current preferred LLC and the
+	 * current running node to mitigate the bouncing risk.
+	 * TBD: numa_group should be considered during task aggregation.
+	 */
+	pref_nid = p->numa_preferred_nid;
+	/* honor the task's preferred node */
+	if (pref_nid == NUMA_NO_NODE)
+		goto out;
+
+	cpumask_or(cpus, cpus, cpumask_of_node(pref_nid));
+
+	/* honor the task's preferred LLC CPU */
+	if (cpu != -1 && !cpumask_test_cpu(cpu, cpus))
+		cpumask_or(cpus, cpus,
+			   cpumask_of_node(cpu_to_node(cpu)));
+
+	/* make sure the task's current running node is included */
+	if (!cpumask_test_cpu(curr_cpu, cpus))
+		cpumask_or(cpus, cpus, cpumask_of_node(cpu_to_node(curr_cpu)));
+
+	return;
+
+out:
+#endif
+	cpumask_copy(cpus, cpu_online_mask);
+}
+
 static void task_cache_work(struct callback_head *work)
 {
 	struct task_struct *p = current;
@@ -1451,7 +1495,7 @@ static void task_cache_work(struct callback_head *work)
 		return;
 
 	scoped_guard (cpus_read_lock) {
-		cpumask_copy(cpus, cpu_online_mask);
+		get_scan_cpumasks(cpus, p);
 
 		for_each_cpu(cpu, cpus) {
 			/* XXX sched_cluster_active */
-- 
2.32.0