From nobody Fri Jun 12 15:46:56 2026
Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id A592F25B0B6
	for <linux-kernel@vger.kernel.org>; Wed, 13 May 2026 20:33:33 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=198.175.65.17
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1778704415; cv=none;
 b=RbyuV76HA0Pp4brRK2DNoWoXHpU2ScD7aBMla60KJQSou6VNJRYOtPrFF3hctkg6gxVY8RLMqleprjkJnJWd4rCOQoNUPY5/Mvy2e/fcYibDL+7lNp6VFT6ze/vipd0tYKJG1ihHdWj2OP/jObEKldiC6GymRrQQff/4C90HwIc=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1778704415; c=relaxed/simple;
	bh=9dRtMFEnE3sO2bgwlBXd25CqjvGaW8hJO8hC9cFO1v4=;
	h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References:
	 MIME-Version;
 b=MCrZToyTXSGdEmzebuGMCckYIADKpUaobeqDJpdjgQX6kZVOQiGLqx4lLVwnaabiZ64XWvj452F1lFA/Z37vAn1lCetCDJYhvUfXJUS+HC2p8Zss4pTBBBS5bLmZFPYD7jWA8GQI59gNi/CtMPVvdm+V3iPV2C6bSqUUB2x9vnY=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linux.intel.com;
 spf=pass smtp.mailfrom=linux.intel.com;
 dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b=DTQ7q5O8; arc=none smtp.client-ip=198.175.65.17
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linux.intel.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=linux.intel.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b="DTQ7q5O8"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1778704414; x=1810240414;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=9dRtMFEnE3sO2bgwlBXd25CqjvGaW8hJO8hC9cFO1v4=;
  b=DTQ7q5O8z9N0MITEoysSiUb75BiQcKrdlno02Xo/kHVdAGdn1TZC1d/t
   +DkgHJPXgvwtNyfbL/l3M3p48P2fEVnTkt46AmA8gDrOCjQhQB3AW+K0B
   2PZ4gLygmoUsdGqQ5uV29EHvGiQCWMTq1/O73GyrBebehnORmF3lEi+Z5
   Ot5p9tW2o27avxGsk4W9CAzlkGwloIfu1u44zxutfKeqPlcLrgDP0gYx8
   K+3D3syno0dlqMyGoE8TiiY8lPCRgX03kFME22RecO14q6Y+4UZJD5kyX
   HxH+pXBKVyurxjbGxwx9AeXkS1QWQ+eYXYJVRuF22uOrqHJ+zZWvuoSri
   w==;
X-CSE-ConnectionGUID: /hNjS7tiT8a3DisGQHNXhQ==
X-CSE-MsgGUID: I8Drt1ZeSRy5I38KZBdzrw==
X-IronPort-AV: E=McAfee;i="6800,10657,11785"; a="79622968"
X-IronPort-AV: E=Sophos;i="6.23,233,1770624000";
   d="scan'208";a="79622968"
Received: from orviesa008.jf.intel.com ([10.64.159.148])
  by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 13 May 2026 13:33:33 -0700
X-CSE-ConnectionGUID: WFVFDNOkSNC756h9ak1WyA==
X-CSE-MsgGUID: vyvJQmtlRcG+Vk5HobAMBg==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.23,233,1770624000";
   d="scan'208";a="238076313"
Received: from b04f130c83f2.jf.intel.com ([10.165.154.98])
  by orviesa008.jf.intel.com with ESMTP; 13 May 2026 13:33:32 -0700
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Jianyong Wu <wujianyong@hygon.cn>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Hillf Danton <hdanton@sina.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Jianyong Wu <jianyong.wu@outlook.com>,
	Yangyu Chen <cyy@cyyself.name>,
	Tingyin Duan <tingyin.duan@gmail.com>,
	Vern Hao <vernhao@tencent.com>,
	Vern Hao <haoxing990@gmail.com>,
	Len Brown <len.brown@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aubrey Li <aubrey.li@intel.com>,
	Zhao Liu <zhao1.liu@intel.com>,
	Chen Yu <yu.chen.surf@gmail.com>,
	Chen Yu <yu.c.chen@intel.com>,
	Adam Li <adamli@os.amperecomputing.com>,
	Aaron Lu <ziqianlu@bytedance.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Josh Don <joshdon@google.com>,
	Gavin Guo <gavinguo@igalia.com>,
	Qais Yousef <qyousef@layalina.io>,
	Libo Chen <libchen@purestorage.com>,
	Luo Gengkun <luogengkun2@huawei.com>,
	linux-kernel@vger.kernel.org
Subject: [Patch v4 01/16] sched/cache: Allow only 1 thread of the process to
 calculate the LLC occupancy
Date: Wed, 13 May 2026 13:39:12 -0700
Message-Id: 
 <5672b52e588b855b01e5a1a17822f7c6c7237a3d.1778703694.git.tim.c.chen@linux.intel.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <cover.1778703694.git.tim.c.chen@linux.intel.com>
References: <cover.1778703694.git.tim.c.chen@linux.intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Jianyong Wu <wujianyong@hygon.cn>

Scanning online CPUs to calculate the occupancy can be
time-consuming. Allow only one thread of the process to scan
the CPUs at a time, similar to what NUMA balancing does in
task_numa_work().

Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
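
Note (illustration only, not part of the patch): the gate added to
task_cache_work() can be sketched in userspace with C11 atomics. This is
a minimal sketch under stated assumptions: SCAN_PERIOD, struct scan_state,
tick_before() and try_claim_scan() are hypothetical names standing in for
EPOCH_PERIOD, mm->sc_stat, time_before() and the try_cmpxchg() sequence.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for EPOCH_PERIOD, expressed in ticks. */
#define SCAN_PERIOD 10UL

/* Hypothetical stand-in for the next_scan field in mm->sc_stat. */
struct scan_state {
	_Atomic unsigned long next_scan;
};

/* Wraparound-safe "a is before b", the same idea as time_before(). */
static bool tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Return true only for the one caller that wins the right to scan.
 * Callers that arrive too early, or that lose the compare-and-swap
 * race against another thread, return false and skip the scan.
 */
static bool try_claim_scan(struct scan_state *s, unsigned long now)
{
	unsigned long next;

	assert(s != NULL);
	next = atomic_load_explicit(&s->next_scan, memory_order_relaxed);

	if (tick_before(now, next))
		return false;

	/* Succeeds only if nobody else advanced next_scan in between. */
	return atomic_compare_exchange_strong(&s->next_scan, &next,
					      now + SCAN_PERIOD);
}
```

With next_scan starting at 100, a call at tick 100 claims the scan and
advances next_scan to 110. Calls at ticks 100 and 109 then return false,
and a call at tick 110 succeeds again.
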
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2010483cd77..6d883f109ba3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2423,6 +2423,7 @@ struct sched_cache_stat {
 	struct sched_cache_time __percpu *pcpu_sched;
 	raw_spinlock_t lock;
 	unsigned long epoch;
+	unsigned long next_scan;
 	int cpu;
 } ____cacheline_aligned_in_smp;
=20
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5f22e5a097cf..a759ea669d74 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1451,6 +1451,7 @@ void mm_init_sched(struct mm_struct *mm,
 	raw_spin_lock_init(&mm->sc_stat.lock);
 	mm->sc_stat.epoch =3D epoch;
 	mm->sc_stat.cpu =3D -1;
+	mm->sc_stat.next_scan =3D jiffies;
=20
 	/*
 	 * The update to mm->sc_stat should not be reordered
@@ -1661,6 +1662,7 @@ static void get_scan_cpumasks(cpumask_var_t cpus, str=
uct task_struct *p)
=20
 static void task_cache_work(struct callback_head *work)
 {
+	unsigned long next_scan, now =3D jiffies;
 	struct task_struct *p =3D current;
 	struct mm_struct *mm =3D p->mm;
 	unsigned long m_a_occ =3D 0;
@@ -1675,6 +1677,15 @@ static void task_cache_work(struct callback_head *wo=
rk)
 	if (p->flags & PF_EXITING)
 		return;
=20
+	next_scan =3D READ_ONCE(mm->sc_stat.next_scan);
+	if (time_before(now, next_scan))
+		return;
+
+	/* only 1 thread is allowed to scan */
+	if (!try_cmpxchg(&mm->sc_stat.next_scan, &next_scan,
+			 now + EPOCH_PERIOD))
+		return;
+
 	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
 		return;
=20
--=20
2.32.0
From nobody Fri Jun 12 15:46:56 2026
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chen Yu <yu.c.chen@intel.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Hillf Danton <hdanton@sina.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Jianyong Wu <jianyong.wu@outlook.com>,
	Yangyu Chen <cyy@cyyself.name>,
	Tingyin Duan <tingyin.duan@gmail.com>,
	Vern Hao <vernhao@tencent.com>,
	Vern Hao <haoxing990@gmail.com>,
	Len Brown <len.brown@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aubrey Li <aubrey.li@intel.com>,
	Zhao Liu <zhao1.liu@intel.com>,
	Chen Yu <yu.chen.surf@gmail.com>,
	Adam Li <adamli@os.amperecomputing.com>,
	Aaron Lu <ziqianlu@bytedance.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Josh Don <joshdon@google.com>,
	Gavin Guo <gavinguo@igalia.com>,
	Qais Yousef <qyousef@layalina.io>,
	Libo Chen <libchen@purestorage.com>,
	Luo Gengkun <luogengkun2@huawei.com>,
	linux-kernel@vger.kernel.org
Subject: [Patch v4 02/16] sched/cache: Disable cache aware scheduling for
 processes with high thread counts
Date: Wed, 13 May 2026 13:39:13 -0700
Message-Id: 
 <d076cd21a8e6c6341d1e2d927e118db770ebb650.1778703694.git.tim.c.chen@linux.intel.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <cover.1778703694.git.tim.c.chen@linux.intel.com>
References: <cover.1778703694.git.tim.c.chen@linux.intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Chen Yu <yu.c.chen@intel.com>

A performance regression was observed by Prateek when running hackbench
with many threads per process (high fd count). To avoid this, processes
with a large number of active threads are excluded from cache-aware
scheduling.

With sched_cache enabled, record the number of active threads in each
process during the periodic task_cache_work(). While iterating over
CPUs, if the currently running task belongs to the same process as the
task that launched task_cache_work(), increment the active thread count.

If the number of active threads within the process exceeds the number
of cores in the LLC (the number of CPUs in the LLC divided by the number
of SMT threads per core), do not enable cache-aware scheduling. However,
on systems with few CPUs per LLC, such as Power10/Power11 with SMT4 and
an LLC size of 4, this check effectively disables cache-aware scheduling
for every process. One possible solution, suggested by Peter, is to use
an LLC mask instead of a single LLC value for the preference. With a
'few' preferred LLCs, this constraint becomes easier to satisfy. This
could be a future enhancement.

For users who wish to perform task aggregation regardless, a debugfs knob
is provided for tuning in a subsequent change.

Tested-by: Tingyin Duan <tingyin.duan@gmail.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
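
Note (illustration only, not part of the patch): the averaging and the
thread-count check can be sketched in userspace as below. This is a
sketch under stated assumptions: clamp_u32(), update_avg_scaled(),
fits_with_margin() and too_many_threads() are hypothetical names, the
LLC CPU count is passed as a parameter instead of being read from
sd_llc_size, and fits_with_margin() assumes fits_capacity() keeps its
usual 1280/1024 form (roughly a 20% margin).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper; the patch uses the kernel's clamp_t(). */
static uint32_t clamp_u32(uint32_t v, uint32_t lo, uint32_t hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Move *avg toward sample by 1/divisor of the gap. The divisor is a
 * quarter of the LLC CPU count, clamped to [2, 8], so small LLCs
 * react faster to changes in the number of running threads.
 */
static void update_avg_scaled(uint64_t *avg, uint64_t sample,
			      unsigned int llc_cpus)
{
	int64_t diff = (int64_t)(sample - *avg);
	uint32_t divisor = clamp_u32(llc_cpus >> 2, 2, 8);

	assert(divisor >= 2 && divisor <= 8);
	/* Unsigned wraparound makes adding a negative step a subtraction. */
	*avg += (uint64_t)(diff / (int64_t)divisor);
}

/* Assumed shape of fits_capacity(): need must stay below ~80% of max. */
static bool fits_with_margin(uint64_t need, uint64_t max)
{
	return need * 1280 < max * 1024;
}

/* True when the averaged thread count is too large for one LLC. */
static bool too_many_threads(uint64_t nr_running_avg, unsigned int smt,
			     unsigned int llc_cpus)
{
	return !fits_with_margin(nr_running_avg * smt, llc_cpus);
}
```

Under these assumptions, an SMT4 system with 4 CPUs per LLC already
fails the check with an average of one running thread, which matches
the Power10/Power11 limitation described in the changelog above.
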
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 48 ++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d883f109ba3..6701911eaaf7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2423,6 +2423,7 @@ struct sched_cache_stat {
 	struct sched_cache_time __percpu *pcpu_sched;
 	raw_spinlock_t lock;
 	unsigned long epoch;
+	u64 nr_running_avg;
 	unsigned long next_scan;
 	int cpu;
 } ____cacheline_aligned_in_smp;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a759ea669d74..808f614fc2d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1384,6 +1384,12 @@ static int llc_id(int cpu)
 	return per_cpu(sd_llc_id, cpu);
 }
=20
+static bool invalid_llc_nr(struct mm_struct *mm, int cpu)
+{
+	return !fits_capacity((mm->sc_stat.nr_running_avg * cpu_smt_num_threads),
+			per_cpu(sd_llc_size, cpu));
+}
+
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
 {
 	struct sched_domain *sd;
@@ -1452,7 +1458,7 @@ void mm_init_sched(struct mm_struct *mm,
 	mm->sc_stat.epoch =3D epoch;
 	mm->sc_stat.cpu =3D -1;
 	mm->sc_stat.next_scan =3D jiffies;
-
+	mm->sc_stat.nr_running_avg =3D 0;
 	/*
 	 * The update to mm->sc_stat should not be reordered
 	 * before initialization to mm's other fields, in case
@@ -1574,7 +1580,8 @@ void account_mm_sched(struct rq *rq, struct task_stru=
ct *p, s64 delta_exec)
 	 * If this process hasn't hit task_cache_work() for a while invalidate
 	 * its preferred state.
 	 */
-	if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT) {
+	if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
+	    invalid_llc_nr(mm, cpu_of(rq))) {
 		if (mm->sc_stat.cpu !=3D -1)
 			mm->sc_stat.cpu =3D -1;
 	}
@@ -1660,14 +1667,32 @@ static void get_scan_cpumasks(cpumask_var_t cpus, s=
truct task_struct *p)
 	cpumask_copy(cpus, cpu_online_mask);
 }
=20
+static inline void update_avg_scale(u64 *avg, u64 sample)
+{
+	int factor =3D per_cpu(sd_llc_size, raw_smp_processor_id());
+	s64 diff =3D sample - *avg;
+	u32 divisor;
+
+	/*
+	 * Scale the divisor based on the number of CPUs contained
+	 * in the LLC. This scaling ensures smaller LLC domains use
+	 * a smaller divisor to achieve more precise sensitivity to
+	 * changes in nr_running, while larger LLC domains are capped
+	 * at a maximum divisor of 8 which is the default smoothing
+	 * factor of EWMA in update_avg().
+	 */
+	divisor =3D clamp_t(u32, (factor >> 2), 2, 8);
+	*avg +=3D div64_s64(diff, divisor);
+}
+
 static void task_cache_work(struct callback_head *work)
 {
 	unsigned long next_scan, now =3D jiffies;
-	struct task_struct *p =3D current;
+	struct task_struct *p =3D current, *cur;
+	int cpu, m_a_cpu =3D -1, nr_running =3D 0;
+	unsigned long curr_m_a_occ =3D 0;
 	struct mm_struct *mm =3D p->mm;
 	unsigned long m_a_occ =3D 0;
-	unsigned long curr_m_a_occ =3D 0;
-	int cpu, m_a_cpu =3D -1;
 	cpumask_var_t cpus;
=20
 	WARN_ON_ONCE(work !=3D &p->cache_work);
@@ -1711,6 +1736,11 @@ static void task_cache_work(struct callback_head *wo=
rk)
 					m_occ =3D occ;
 					m_cpu =3D i;
 				}
+
+				cur =3D rcu_dereference_all(cpu_rq(i)->curr);
+				if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
+				    cur->mm =3D=3D mm)
+					nr_running++;
 			}
=20
 			/*
@@ -1754,6 +1784,7 @@ static void task_cache_work(struct callback_head *wor=
k)
 		mm->sc_stat.cpu =3D m_a_cpu;
 	}
=20
+	update_avg_scale(&mm->sc_stat.nr_running_avg, nr_running);
 	free_cpumask_var(cpus);
 }
=20
@@ -10294,6 +10325,13 @@ static enum llc_mig can_migrate_llc_task(int src_c=
pu, int dst_cpu,
 	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
 		return mig_unrestricted;
=20
+	/* skip cache aware load balance for too many threads */
+	if (invalid_llc_nr(mm, dst_cpu)) {
+		if (mm->sc_stat.cpu !=3D -1)
+			mm->sc_stat.cpu =3D -1;
+		return mig_unrestricted;
+	}
+
 	if (cpus_share_cache(dst_cpu, cpu))
 		to_pref =3D true;
 	else if (cpus_share_cache(src_cpu, cpu))
--=20
2.32.0
From nobody Fri Jun 12 15:46:56 2026
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chen Yu <yu.c.chen@intel.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Hillf Danton <hdanton@sina.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Jianyong Wu <jianyong.wu@outlook.com>,
	Yangyu Chen <cyy@cyyself.name>,
	Tingyin Duan <tingyin.duan@gmail.com>,
	Vern Hao <vernhao@tencent.com>,
	Vern Hao <haoxing990@gmail.com>,
	Len Brown <len.brown@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aubrey Li <aubrey.li@intel.com>,
	Zhao Liu <zhao1.liu@intel.com>,
	Chen Yu <yu.chen.surf@gmail.com>,
	Adam Li <adamli@os.amperecomputing.com>,
	Aaron Lu <ziqianlu@bytedance.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Josh Don <joshdon@google.com>,
	Gavin Guo <gavinguo@igalia.com>,
	Qais Yousef <qyousef@layalina.io>,
	Libo Chen <libchen@purestorage.com>,
	Luo Gengkun <luogengkun2@huawei.com>,
	linux-kernel@vger.kernel.org
Subject: [Patch v4 03/16] sched/cache: Skip cache-aware scheduling for
 single-threaded processes
Date: Wed, 13 May 2026 13:39:14 -0700
Message-Id: 
 <8a59a13aa58fdb48e410ecb2aabd97fe3ea5d256.1778703694.git.tim.c.chen@linux.intel.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <cover.1778703694.git.tim.c.chen@linux.intel.com>
References: <cover.1778703694.git.tim.c.chen@linux.intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Chen Yu <yu.c.chen@intel.com>

For a single thread, the current wakeup path tends to place it
on the same LLC where it was previously running with cache-hot
data. There is no need to enable cache-aware scheduling for
single-threaded processes for the following reasons:

1. Cache-aware scheduling primarily benefits multi-threaded
   processes where threads share data. Single-threaded processes
   typically have no inter-thread data sharing and thus gain little.

2. Enabling it incurs the additional overhead of tracking the
   thread's residency in the LLCs.

3. Bypassing single-threaded processes avoids excessive
   concentration of such tasks on a single LLC.

Nevertheless, this check could be skipped if users explicitly
hint that single-threaded processes share memory with each
other, e.g., via prctl() or other interfaces to be added in
the future.

Tested-by: Tingyin Duan <tingyin.duan@gmail.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 808f614fc2d2..df21366ba1ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1384,8 +1384,12 @@ static int llc_id(int cpu)
 	return per_cpu(sd_llc_id, cpu);
 }
=20
-static bool invalid_llc_nr(struct mm_struct *mm, int cpu)
+static bool invalid_llc_nr(struct mm_struct *mm, struct task_struct *p,
+			   int cpu)
 {
+	if (get_nr_threads(p) <=3D 1)
+		return true;
+
 	return !fits_capacity((mm->sc_stat.nr_running_avg * cpu_smt_num_threads),
 			per_cpu(sd_llc_size, cpu));
 }
@@ -1581,7 +1585,7 @@ void account_mm_sched(struct rq *rq, struct task_stru=
ct *p, s64 delta_exec)
 	 * its preferred state.
 	 */
 	if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
-	    invalid_llc_nr(mm, cpu_of(rq))) {
+	    invalid_llc_nr(mm, p, cpu_of(rq))) {
 		if (mm->sc_stat.cpu !=3D -1)
 			mm->sc_stat.cpu =3D -1;
 	}
@@ -1687,9 +1691,9 @@ static inline void update_avg_scale(u64 *avg, u64 sam=
ple)
=20
 static void task_cache_work(struct callback_head *work)
 {
+	int cpu, m_a_cpu =3D -1, nr_running =3D 0, curr_cpu;
 	unsigned long next_scan, now =3D jiffies;
 	struct task_struct *p =3D current, *cur;
-	int cpu, m_a_cpu =3D -1, nr_running =3D 0;
 	unsigned long curr_m_a_occ =3D 0;
 	struct mm_struct *mm =3D p->mm;
 	unsigned long m_a_occ =3D 0;
@@ -1711,6 +1715,14 @@ static void task_cache_work(struct callback_head *wo=
rk)
 			 now + EPOCH_PERIOD))
 		return;
=20
+	curr_cpu =3D task_cpu(p);
+	if (invalid_llc_nr(mm, p, curr_cpu)) {
+		if (mm->sc_stat.cpu !=3D -1)
+			mm->sc_stat.cpu =3D -1;
+
+		return;
+	}
+
 	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
 		return;
=20
@@ -10326,7 +10338,7 @@ static enum llc_mig can_migrate_llc_task(int src_cp=
u, int dst_cpu,
 		return mig_unrestricted;
=20
 	/* skip cache aware load balance for too many threads */
-	if (invalid_llc_nr(mm, dst_cpu)) {
+	if (invalid_llc_nr(mm, p, dst_cpu)) {
 		if (mm->sc_stat.cpu !=3D -1)
 			mm->sc_stat.cpu =3D -1;
 		return mig_unrestricted;
--=20
2.32.0
From nobody Fri Jun 12 15:46:56 2026
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chen Yu <yu.c.chen@intel.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Hillf Danton <hdanton@sina.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Jianyong Wu <jianyong.wu@outlook.com>,
	Yangyu Chen <cyy@cyyself.name>,
	Tingyin Duan <tingyin.duan@gmail.com>,
	Vern Hao <vernhao@tencent.com>,
	Vern Hao <haoxing990@gmail.com>,
	Len Brown <len.brown@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aubrey Li <aubrey.li@intel.com>,
	Zhao Liu <zhao1.liu@intel.com>,
	Chen Yu <yu.chen.surf@gmail.com>,
	Adam Li <adamli@os.amperecomputing.com>,
	Aaron Lu <ziqianlu@bytedance.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Josh Don <joshdon@google.com>,
	Gavin Guo <gavinguo@igalia.com>,
	Qais Yousef <qyousef@layalina.io>,
	Libo Chen <libchen@purestorage.com>,
	Luo Gengkun <luogengkun2@huawei.com>,
	linux-kernel@vger.kernel.org
Subject: [Patch v4 04/16] sched/cache: Calculate the LLC size and store it in
 sched_domain
Date: Wed, 13 May 2026 13:39:15 -0700
Message-Id: 
 <37afee09ff608034da0ce149e72d33b6f4698edf.1778703694.git.tim.c.chen@linux.intel.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <cover.1778703694.git.tim.c.chen@linux.intel.com>
References: <cover.1778703694.git.tim.c.chen@linux.intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Chen Yu <yu.c.chen@intel.com>

Cache-aware scheduling needs to know the LLC size that a process
can use, to prevent memory-intensive tasks from being
over-aggregated on a single LLC.

As a preparation patch, add get_effective_llc_bytes() to get the
LLC size that a CPU can use. The function can be further enhanced
by subtracting the LLC cache ways reserved by resctrl (CAT in
Intel RDT, etc).

Tested-by: Tingyin Duan <tingyin.duan@gmail.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 drivers/base/cacheinfo.c       | 23 ++++++++
 include/linux/cacheinfo.h      |  1 +
 include/linux/sched/topology.h |  7 +++
 kernel/sched/topology.c        | 98 ++++++++++++++++++++++++++++++++--
 4 files changed, 126 insertions(+), 3 deletions(-)

diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index 391ac5e3d2f5..70701d3bc81c 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -17,6 +17,7 @@
 #include <linux/init.h>
 #include <linux/of.h>
 #include <linux/sched.h>
+#include <linux/sched/topology.h>
 #include <linux/slab.h>
 #include <linux/smp.h>
 #include <linux/sysfs.h>
@@ -68,6 +69,24 @@ bool last_level_cache_is_valid(unsigned int cpu)
=20
 }
=20
+/*
+ * Get the cacheinfo of the LLC associated with @cpu.
+ * Derived from update_per_cpu_data_slice_size_cpu().
+ */
+struct cacheinfo *get_cpu_cacheinfo_llc(unsigned int cpu)
+{
+	struct cacheinfo *llc;
+
+	if (!last_level_cache_is_valid(cpu))
+		return NULL;
+
+	llc =3D per_cpu_cacheinfo_idx(cpu, cache_leaves(cpu) - 1);
+	if (llc->type !=3D CACHE_TYPE_DATA && llc->type !=3D CACHE_TYPE_UNIFIED)
+		return NULL;
+
+	return llc;
+}
+
 bool last_level_cache_is_shared(unsigned int cpu_x, unsigned int cpu_y)
 {
 	struct cacheinfo *llc_x, *llc_y;
@@ -1018,6 +1037,7 @@ static int cacheinfo_cpu_online(unsigned int cpu)
 		goto err;
 	if (cpu_map_shared_cache(true, cpu, &cpu_map))
 		update_per_cpu_data_slice_size(true, cpu, cpu_map);
+	sched_update_llc_bytes(cpu);
 	return 0;
 err:
 	free_cache_attributes(cpu);
@@ -1036,6 +1056,9 @@ static int cacheinfo_cpu_pre_down(unsigned int cpu)
 	free_cache_attributes(cpu);
 	if (nr_shared > 1)
 		update_per_cpu_data_slice_size(false, cpu, cpu_map);
+
+	sched_update_llc_bytes(cpu);
+
 	return 0;
 }
=20
diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index c8f4f0a0b874..fc879ac4cc4f 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -89,6 +89,7 @@ int populate_cache_leaves(unsigned int cpu);
 int cache_setup_acpi(unsigned int cpu);
 bool last_level_cache_is_valid(unsigned int cpu);
 bool last_level_cache_is_shared(unsigned int cpu_x, unsigned int cpu_y);
+struct cacheinfo *get_cpu_cacheinfo_llc(unsigned int cpu);
 int fetch_cache_info(unsigned int cpu);
 int detect_cache_attributes(unsigned int cpu);
 #ifndef CONFIG_ACPI_PPTT
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 0036d6b4bd67..fe09d3268bc9 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -106,6 +106,7 @@ struct sched_domain {
 #ifdef CONFIG_SCHED_CACHE
 	unsigned int llc_max;
 	unsigned int *llc_counts __counted_by_ptr(llc_max);
+	unsigned long llc_bytes;
 #endif
=20
 #ifdef CONFIG_SCHEDSTATS
@@ -265,4 +266,10 @@ static inline int task_node(const struct task_struct *=
p)
 	return cpu_to_node(task_cpu(p));
 }
=20
+#ifdef CONFIG_SCHED_CACHE
+extern void sched_update_llc_bytes(unsigned int cpu);
+#else
+static inline void sched_update_llc_bytes(unsigned int cpu) { }
+#endif
+
 #endif /* _LINUX_SCHED_TOPOLOGY_H */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9fc99346ef4f..7248a7279abe 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -776,9 +776,11 @@ cpu_attach_domain(struct sched_domain *sd, struct root=
_domain *rd, int cpu)
 			/* move buffer to parent as child is being destroyed */
 			sd->llc_counts =3D tmp->llc_counts;
 			sd->llc_max =3D tmp->llc_max;
+			sd->llc_bytes =3D tmp->llc_bytes;
 			/* make sure destroy_sched_domain() does not free it */
 			tmp->llc_counts =3D NULL;
 			tmp->llc_max =3D 0;
+			tmp->llc_bytes =3D 0;
 #endif
 			/*
 			 * sched groups hold the flags of the child sched
@@ -831,10 +833,42 @@ DEFINE_STATIC_KEY_FALSE(sched_cache_active);
 /* user wants cache aware scheduling [0 or 1] */
 int sysctl_sched_cache_user =3D 1;
=20
+/*
+ * Get the effective LLC size in bytes that @cpu's bottom sched_domain
+ * can use. A CPU within a cpuset partition can only use a proportion
+ * of the physical LLC, scaled by the ratio of the partition's span
+ * weight to the hardware LLC sharing weight. @sd should be the
+ * topmost domain with SD_SHARE_LLC.
+ *
+ * Returns 0 if cacheinfo is not yet populated. This happens during
+ * early boot when build_sched_domains() runs before the generic
+ * cacheinfo framework has been initialized (cacheinfo_cpu_online()
+ * is a device_initcall cpuhp callback). In that case,
+ * cacheinfo_cpu_online() will later call sched_update_llc_bytes()
+ * to fill in the bottom domain's llc_bytes once the cache attributes
+ * are available.
+ */
+static unsigned long get_effective_llc_bytes(int cpu,
+					     struct sched_domain *sd)
+{
+	struct cacheinfo *ci;
+	unsigned int hw_weight;
+
+	ci =3D get_cpu_cacheinfo_llc(cpu);
+	if (!ci)
+		return 0;
+
+	hw_weight =3D cpumask_weight(&ci->shared_cpu_map);
+	if (!hw_weight)
+		return 0;
+
+	return div_u64((u64)ci->size * sd->span_weight, hw_weight);
+}
+
 static bool alloc_sd_llc(const struct cpumask *cpu_map,
 			 struct s_data *d)
 {
-	struct sched_domain *sd;
+	struct sched_domain *sd, *top_llc, *parent;
 	unsigned int *p;
 	int i;
=20
@@ -848,8 +882,24 @@ static bool alloc_sd_llc(const struct cpumask *cpu_map,
 		if (!p)
 			goto err;
=20
-		sd->llc_max =3D max_lid + 1;
-		sd->llc_counts =3D p;
+		top_llc =3D sd;
+		/*
+		 * Find the topmost SD_SHARE_LLC domain.
+		 * Not yet attached to the CPU, so per_cpu(sd_llc, i)
+		 * can not be used.
+		 */
+		while ((parent =3D rcu_dereference_protected(top_llc->parent, true)) &&
+		       (parent->flags & SD_SHARE_LLC))
+			top_llc =3D parent;
+
+		if (top_llc->flags & SD_SHARE_LLC) {
+			sd->llc_max =3D max_lid + 1;
+			sd->llc_counts =3D p;
+			sd->llc_bytes =3D get_effective_llc_bytes(i, top_llc);
+		} else {
+			/* avoid memory leak */
+			kfree(p);
+		}
 	}
=20
 	return true;
@@ -860,6 +910,7 @@ static bool alloc_sd_llc(const struct cpumask *cpu_map,
 			kfree(sd->llc_counts);
 			sd->llc_counts =3D NULL;
 			sd->llc_max =3D 0;
+			sd->llc_bytes =3D 0;
 		}
 	}
=20
@@ -919,6 +970,47 @@ void sched_cache_active_set_unlocked(void)
 {
 	return sched_cache_active_set(false);
 }
+
+/*
+ * Update the bottom sched_domain's llc_bytes for @cpu and all its
+ * LLC siblings. Called from cacheinfo_cpu_online() or
+ * cacheinfo_cpu_pre_down() with cpu hotplug lock held.
+ *
+ * Note: get_effective_llc_bytes() returns 0 on PowerPC, so
+ * cache aware scheduling is disabled on PowerPC for now.
+ * PowerPC does not use the generic cacheinfo framework: it
+ * has its own cacheinfo with a separate struct cache hierarchy
+ * and does not populate the per-CPU struct cpu_cacheinfo array
+ * that get_cpu_cacheinfo_llc() reads.
+ */
+void sched_update_llc_bytes(unsigned int cpu)
+{
+	struct sched_domain *sd, *sdp;
+	unsigned int i;
+
+	sched_domains_mutex_lock();
+
+	sdp =3D rcu_dereference_sched_domain(per_cpu(sd_llc, cpu));
+	if (!sdp)
+		goto unlock;
+
+	/*
+	 * ci->shared_cpu_map is built incrementally as CPUs come
+	 * online, so the first CPU in an LLC initially sees
+	 * hw_weight =3D=3D 1 and computes an inflated llc_bytes in
+	 * get_effective_llc_bytes().  Re-evaluating every LLC
+	 * sibling on each online event corrects this once the full
+	 * shared_cpu_map is known.
+	 */
+	for_each_cpu(i, sched_domain_span(sdp)) {
+		sd =3D rcu_dereference_sched_domain(cpu_rq(i)->sd);
+		if (sd)
+			sd->llc_bytes =3D get_effective_llc_bytes(i, sdp);
+	}
+
+unlock:
+	sched_domains_mutex_unlock();
+}
 #else
 static bool alloc_sd_llc(const struct cpumask *cpu_map,
 			 struct s_data *d)
--=20
2.32.0
From nobody Fri Jun 12 15:46:56 2026
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chen Yu <yu.c.chen@intel.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Hillf Danton <hdanton@sina.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Jianyong Wu <jianyong.wu@outlook.com>,
	Yangyu Chen <cyy@cyyself.name>,
	Tingyin Duan <tingyin.duan@gmail.com>,
	Vern Hao <vernhao@tencent.com>,
	Vern Hao <haoxing990@gmail.com>,
	Len Brown <len.brown@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aubrey Li <aubrey.li@intel.com>,
	Zhao Liu <zhao1.liu@intel.com>,
	Chen Yu <yu.chen.surf@gmail.com>,
	Adam Li <adamli@os.amperecomputing.com>,
	Aaron Lu <ziqianlu@bytedance.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Josh Don <joshdon@google.com>,
	Gavin Guo <gavinguo@igalia.com>,
	Qais Yousef <qyousef@layalina.io>,
	Libo Chen <libchen@purestorage.com>,
	Luo Gengkun <luogengkun2@huawei.com>,
	linux-kernel@vger.kernel.org
Subject: [Patch v4 05/16] sched/cache: Avoid cache-aware scheduling for
 memory-heavy processes
Date: Wed, 13 May 2026 13:39:16 -0700
Message-Id: 
 <95cf64a385bcc12f18dcebe9d59e8d3ba8bb318f.1778703694.git.tim.c.chen@linux.intel.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <cover.1778703694.git.tim.c.chen@linux.intel.com>
References: <cover.1778703694.git.tim.c.chen@linux.intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Chen Yu <yu.c.chen@intel.com>

Prateek and Tingyin reported that memory-intensive workloads (such as
stream) can saturate memory bandwidth and caches on the preferred LLC
when sched_cache aggregates too many threads.

To mitigate this, estimate a process's memory footprint from its
NUMA balancing fault statistics and compare it to the size of the
LLC. If the footprint exceeds the LLC size, skip cache-aware
scheduling.

Note that this footprint is only an approximation, since the kernel
lacks suitable metrics to estimate the real working set. A
user-provided hint would be more accurate; a later patch will allow
users to provide a hint to adjust this threshold.
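
The check and the clamped footprint update can be modelled in
userspace roughly as follows. This is a sketch under assumptions, not
the kernel code: the function names and SKETCH_PAGE_SIZE are
hypothetical, 4KB pages are assumed, and the footprint is treated as
a page count because the diff multiplies it by PAGE_SIZE.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumed 4KB pages */

/*
 * Hypothetical model of exceed_llc_capacity(): the footprint is a
 * page count derived from NUMA faults, llc_bytes is the usable LLC
 * size. Returns true when cache-aware aggregation should be skipped.
 */
static bool footprint_exceeds_llc(uint64_t footprint_pages,
				  uint64_t llc_bytes)
{
	return llc_bytes < footprint_pages * SKETCH_PAGE_SIZE;
}

/*
 * Hypothetical model of the update in task_numa_placement(): apply
 * a signed fault delta and clamp at zero so that the unsigned
 * footprint never wraps.
 */
static uint64_t footprint_apply(uint64_t footprint, int64_t diff)
{
	int64_t next = (int64_t)footprint + diff;

	return next > 0 ? (uint64_t)next : 0;
}
```

With a 32MB LLC, a footprint of 8192 pages (exactly 32MB) still
fits, and one more page tips the check over.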

Tested-by: Tingyin Duan <tingyin.duan@gmail.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Vern Hao <vernhao@tencent.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/sched.h |  1 +
 kernel/exit.c         | 29 ++++++++++++++++++++
 kernel/sched/fair.c   | 62 ++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 89 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6701911eaaf7..95729670929c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2425,6 +2425,7 @@ struct sched_cache_stat {
 	unsigned long epoch;
 	u64 nr_running_avg;
 	unsigned long next_scan;
+	unsigned long footprint;
 	int cpu;
 } ____cacheline_aligned_in_smp;
=20
diff --git a/kernel/exit.c b/kernel/exit.c
index ede3117fa7d4..77275c26a2a1 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -543,6 +543,32 @@ void mm_update_next_owner(struct mm_struct *mm)
 }
 #endif /* CONFIG_MEMCG */
=20
+#if defined(CONFIG_SCHED_CACHE) && defined(CONFIG_NUMA_BALANCING)
+/*
+ * Subtract the current task's share (its NUMA fault count)
+ * from mm->sc_stat.footprint.
+ */
+static void exit_mm_sched_cache(struct mm_struct *mm)
+{
+	unsigned long fp, sub;
+
+	if (!current->total_numa_faults)
+		return;
+	/*
+	 * No lock protection due to performance considerations.
+	 * Make sure mm->sc_stat.footprint does not become
+	 * negative.
+	 */
+	fp =3D READ_ONCE(mm->sc_stat.footprint);
+	sub =3D min(fp, current->total_numa_faults);
+	WRITE_ONCE(mm->sc_stat.footprint, fp - sub);
+}
+#else
+static inline void exit_mm_sched_cache(struct mm_struct *mm)
+{
+}
+#endif /* CONFIG_SCHED_CACHE && CONFIG_NUMA_BALANCING */
+
 /*
  * Turn us into a lazy TLB process if we
  * aren't already..
@@ -554,6 +580,9 @@ static void exit_mm(void)
 	exit_mm_release(current, mm);
 	if (!mm)
 		return;
+
+	exit_mm_sched_cache(mm);
+
 	mmap_read_lock(mm);
 	mmgrab_lazy_tlb(mm);
 	BUG_ON(mm !=3D current->active_mm);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df21366ba1ca..a10116ffe0d1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1384,6 +1384,32 @@ static int llc_id(int cpu)
 	return per_cpu(sd_llc_id, cpu);
 }
=20
+static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
+{
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned long llc, footprint;
+	struct sched_domain *sd;
+
+	guard(rcu)();
+
+	sd =3D rcu_dereference_sched_domain(cpu_rq(cpu)->sd);
+	if (!sd)
+		return true;
+
+	if (static_branch_likely(&sched_numa_balancing)) {
+		/*
+		 * TBD: RDT exclusive LLC ways reserved should be
+		 * excluded.
+		 */
+		llc =3D sd->llc_bytes;
+		footprint =3D READ_ONCE(mm->sc_stat.footprint);
+
+		return (llc < (footprint * PAGE_SIZE));
+	}
+#endif
+	return false;
+}
+
 static bool invalid_llc_nr(struct mm_struct *mm, struct task_struct *p,
 			   int cpu)
 {
@@ -1463,6 +1489,7 @@ void mm_init_sched(struct mm_struct *mm,
 	mm->sc_stat.cpu =3D -1;
 	mm->sc_stat.next_scan =3D jiffies;
 	mm->sc_stat.nr_running_avg =3D 0;
+	mm->sc_stat.footprint =3D 0;
 	/*
 	 * The update to mm->sc_stat should not be reordered
 	 * before initialization to mm's other fields, in case
@@ -1585,7 +1612,8 @@ void account_mm_sched(struct rq *rq, struct task_stru=
ct *p, s64 delta_exec)
 	 * its preferred state.
 	 */
 	if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
-	    invalid_llc_nr(mm, p, cpu_of(rq))) {
+	    invalid_llc_nr(mm, p, cpu_of(rq)) ||
+	    exceed_llc_capacity(mm, cpu_of(rq))) {
 		if (mm->sc_stat.cpu !=3D -1)
 			mm->sc_stat.cpu =3D -1;
 	}
@@ -1716,7 +1744,8 @@ static void task_cache_work(struct callback_head *wor=
k)
 		return;
=20
 	curr_cpu =3D task_cpu(p);
-	if (invalid_llc_nr(mm, p, curr_cpu)) {
+	if (invalid_llc_nr(mm, p, curr_cpu) ||
+	    exceed_llc_capacity(mm, curr_cpu)) {
 		if (mm->sc_stat.cpu !=3D -1)
 			mm->sc_stat.cpu =3D -1;
=20
@@ -3515,6 +3544,7 @@ static void task_numa_placement(struct task_struct *p)
 	unsigned long total_faults;
 	u64 runtime, period;
 	spinlock_t *group_lock =3D NULL;
+	long __maybe_unused new_fp;
 	struct numa_group *ng;
=20
 	/*
@@ -3589,6 +3619,31 @@ static void task_numa_placement(struct task_struct *=
p)
 				ng->total_faults +=3D diff;
 				group_faults +=3D ng->faults[mem_idx];
 			}
+#ifdef CONFIG_SCHED_CACHE
+			/*
+			 * Per task p->numa_faults[mem_idx] converges,
+			 * so the accumulation of each task's faults
+			 * converges too. Given the number of threads,
+			 * it cannot overflow an unsigned long.
+			 * This races with concurrent updates from other
+			 * threads sharing this mm, which is acceptable:
+			 * footprint is only a heuristic and can lose updates.
+			 *
+			 * If a task exits, its corresponding footprint must
+			 * be subtracted from the mm->sc_stat.footprint, otherwise
+			 * the mm->sc_stat.footprint will not converge:
+			 * the exiting thread's footprint remains unchanged/undecayed
+			 * in mm->sc_stat.footprint. See exit_mm().
+			 *
+			 * Lost updates and unsynchronized subtraction
+			 * in exit_mm() can cause footprint + diff to
+			 * go negative. Clamp to zero to prevent the
+			 * unsigned footprint from wrapping.
+			 */
+			new_fp =3D (long)READ_ONCE(p->mm->sc_stat.footprint) + diff;
+			WRITE_ONCE(p->mm->sc_stat.footprint,
+				   max(new_fp, 0L));
+#endif
 		}
=20
 		if (!ng) {
@@ -10338,7 +10393,8 @@ static enum llc_mig can_migrate_llc_task(int src_cp=
u, int dst_cpu,
 		return mig_unrestricted;
=20
 	/* skip cache aware load balance for too many threads */
-	if (invalid_llc_nr(mm, p, dst_cpu)) {
+	if (invalid_llc_nr(mm, p, dst_cpu) ||
+	    exceed_llc_capacity(mm, dst_cpu)) {
 		if (mm->sc_stat.cpu !=3D -1)
 			mm->sc_stat.cpu =3D -1;
 		return mig_unrestricted;
--=20
2.32.0
From nobody Fri Jun 12 15:46:56 2026
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chen Yu <yu.c.chen@intel.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Hillf Danton <hdanton@sina.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Jianyong Wu <jianyong.wu@outlook.com>,
	Yangyu Chen <cyy@cyyself.name>,
	Tingyin Duan <tingyin.duan@gmail.com>,
	Vern Hao <vernhao@tencent.com>,
	Vern Hao <haoxing990@gmail.com>,
	Len Brown <len.brown@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aubrey Li <aubrey.li@intel.com>,
	Zhao Liu <zhao1.liu@intel.com>,
	Chen Yu <yu.chen.surf@gmail.com>,
	Adam Li <adamli@os.amperecomputing.com>,
	Aaron Lu <ziqianlu@bytedance.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Josh Don <joshdon@google.com>,
	Gavin Guo <gavinguo@igalia.com>,
	Qais Yousef <qyousef@layalina.io>,
	Libo Chen <libchen@purestorage.com>,
	Luo Gengkun <luogengkun2@huawei.com>,
	linux-kernel@vger.kernel.org
Subject: [Patch v4 06/16] sched/cache: Add user control to adjust the
 aggressiveness of cache-aware scheduling
Date: Wed, 13 May 2026 13:39:17 -0700
Message-Id: 
 <1c62cc060ba2b33d7b1f0ed98b3390128edbae93.1778703694.git.tim.c.chen@linux.intel.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <cover.1778703694.git.tim.c.chen@linux.intel.com>
References: <cover.1778703694.git.tim.c.chen@linux.intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Chen Yu <yu.c.chen@intel.com>

Introduce a set of debugfs knobs to control how aggressively
cache-aware scheduling aggregates tasks.

(1) aggr_tolerance
With sched_cache enabled, the scheduler uses a process's memory
footprint as a proxy for its LLC occupancy to determine whether
aggregating tasks on the preferred LLC could cause cache contention.
If the footprint exceeds the LLC size, aggregation is skipped. Since
the kernel cannot efficiently track per-task cache usage (resctrl is
user-space only), userspace can provide a more accurate hint.

Introduce /sys/kernel/debug/sched/llc_balancing/aggr_tolerance to
let users control how strictly footprint limits aggregation. The
meaningful range is 0 to 100; larger values behave like 100:
  - 0: Cache-aware scheduling is disabled.
  - 1: Strict; tasks with footprint larger than LLC size are skipped.
  - >=3D100: Aggressive; tasks are aggregated regardless of footprint.

For example, with a 32MB L3 cache:
  - aggr_tolerance=3D1 -> tasks with footprint > 32MB are skipped.
  - aggr_tolerance=3D99 -> tasks with footprint > 784GB are skipped
    (784GB =3D (1 + (99 - 1) * 256) * 32MB).

Similarly, /sys/kernel/debug/sched/llc_balancing/aggr_tolerance also
controls how strictly the number of active threads is considered when
doing cache-aware load balancing. The number of SMT threads per core
is taken into account as well: a high SMT count reduces the
aggregation capacity, preventing excessive task aggregation on
SMT-heavy systems like Power10/Power11.
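
The arithmetic behind the examples above can be checked with a small
userspace sketch. cache_scale() follows get_sched_cache_scale() from
the diff below; footprint_limit_bytes() and the use of UINT64_MAX to
mean "no limit" are assumptions made for this illustration only.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Follows get_sched_cache_scale() in the diff: 0 disables
 * aggregation, 100 and above means unlimited, anything else
 * scales as 1 + (tol - 1) * mul.
 */
static int cache_scale(unsigned int tol, int mul)
{
	if (!tol)
		return 0;

	if (tol >= 100)
		return INT_MAX;

	return 1 + (int)(tol - 1) * mul;
}

/*
 * Hypothetical helper: the largest footprint, in bytes, that is
 * still aggregated for a given tolerance. UINT64_MAX stands for
 * "footprint is ignored".
 */
static uint64_t footprint_limit_bytes(uint64_t llc_bytes,
				      unsigned int tol)
{
	int scale = cache_scale(tol, 256);

	if (scale == INT_MAX)
		return UINT64_MAX;

	return llc_bytes * (uint64_t)scale;
}
```

For a 32MB LLC this gives a limit of 32MB at tolerance 1 and
25089 * 32MB (about 784GB) at tolerance 99.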

Yangyu suggested introducing separate aggregation controls for the
number of active threads and memory footprint checks. Since there are
plans to add per-process/task group controls, fine-grained tunables are
deferred to that implementation.

(2) epoch_period, epoch_affinity_timeout, imb_pct and overaggr_pct
    are also turned into tunables.

Tested-by: Tingyin Duan <tingyin.duan@gmail.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Suggested-by: Tingyin Duan <tingyin.duan@gmail.com>
Suggested-by: Jianyong Wu <jianyong.wu@outlook.com>
Suggested-by: Yangyu Chen <cyy@cyyself.name>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/debug.c | 10 +++++++
 kernel/sched/fair.c  | 68 ++++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |  5 ++++
 3 files changed, 75 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2eae67cd2ba2..fe569539e888 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -670,6 +670,16 @@ static __init int sched_init_debug(void)
 	llc =3D debugfs_create_dir("llc_balancing", debugfs_sched);
 	debugfs_create_file("enabled", 0644, llc, NULL,
 			    &sched_cache_enable_fops);
+	debugfs_create_u32("aggr_tolerance", 0644, llc,
+			   &llc_aggr_tolerance);
+	debugfs_create_u32("epoch_period", 0644, llc,
+			   &llc_epoch_period);
+	debugfs_create_u32("epoch_affinity_timeout", 0644, llc,
+			   &llc_epoch_affinity_timeout);
+	debugfs_create_u32("overaggr_pct", 0644, llc,
+			   &llc_overaggr_pct);
+	debugfs_create_u32("imb_pct", 0644, llc,
+			   &llc_imb_pct);
 #endif
=20
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops=
);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a10116ffe0d1..01ce646792ff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1375,6 +1375,11 @@ static void set_next_buddy(struct sched_entity *se);
  */
 #define EPOCH_PERIOD	(HZ / 100)	/* 10 ms */
 #define EPOCH_LLC_AFFINITY_TIMEOUT	5	/* 50 ms */
+__read_mostly unsigned int llc_aggr_tolerance	=3D 1;
+__read_mostly unsigned int llc_epoch_period	=3D EPOCH_PERIOD;
+__read_mostly unsigned int llc_epoch_affinity_timeout =3D EPOCH_LLC_AFFINI=
TY_TIMEOUT;
+__read_mostly unsigned int llc_imb_pct		=3D 20;
+__read_mostly unsigned int llc_overaggr_pct	=3D 50;
=20
 static int llc_id(int cpu)
 {
@@ -1384,11 +1389,25 @@ static int llc_id(int cpu)
 	return per_cpu(sd_llc_id, cpu);
 }
=20
+static inline int get_sched_cache_scale(int mul)
+{
+	unsigned int tol =3D READ_ONCE(llc_aggr_tolerance);
+
+	if (!tol)
+		return 0;
+
+	if (tol >=3D 100)
+		return INT_MAX;
+
+	return (1 + (tol - 1) * mul);
+}
+
 static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 {
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned long llc, footprint;
 	struct sched_domain *sd;
+	int scale;
=20
 	guard(rcu)();
=20
@@ -1404,7 +1423,28 @@ static bool exceed_llc_capacity(struct mm_struct *mm=
, int cpu)
 		llc =3D sd->llc_bytes;
 		footprint =3D READ_ONCE(mm->sc_stat.footprint);
=20
-		return (llc < (footprint * PAGE_SIZE));
+		/*
+		 * Scale the LLC size by 1 + (llc_aggr_tolerance - 1) * 256
+		 * and compare it to the process's footprint.
+		 *
+		 * Suppose the L3 size is 32MB. If the
+		 * llc_aggr_tolerance is 1:
+		 * When the footprint is larger than 32MB, the
+		 * process is regarded as exceeding the LLC
+		 * capacity. If the llc_aggr_tolerance is 99:
+		 * When the footprint is larger than 784GB, the
+		 * process is regarded as exceeding the LLC
+		 * capacity:
+		 * 784GB =3D (1 + (99 - 1) * 256) * 32MB
+		 * If the llc_aggr_tolerance is 100:
+		 * ignore the footprint and do the aggregation
+		 * anyway.
+		 */
+		scale =3D get_sched_cache_scale(256);
+		if (scale =3D=3D INT_MAX)
+			return false;
+
+		return ((llc * (u64)scale) < (footprint * PAGE_SIZE));
 	}
 #endif
 	return false;
@@ -1413,11 +1453,21 @@ static bool exceed_llc_capacity(struct mm_struct *m=
m, int cpu)
 static bool invalid_llc_nr(struct mm_struct *mm, struct task_struct *p,
 			   int cpu)
 {
+	int scale;
+
 	if (get_nr_threads(p) <=3D 1)
 		return true;
=20
+	/*
+	 * Scale the number of 'cores' in an LLC by llc_aggr_tolerance
+	 * and compare it to the task's active threads.
+	 */
+	scale =3D get_sched_cache_scale(1);
+	if (scale =3D=3D INT_MAX)
+		return false;
+
 	return !fits_capacity((mm->sc_stat.nr_running_avg * cpu_smt_num_threads),
-			per_cpu(sd_llc_size, cpu));
+			(scale * per_cpu(sd_llc_size, cpu)));
 }
=20
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
@@ -1513,13 +1563,14 @@ static inline void __update_mm_sched(struct rq *rq,
 {
 	lockdep_assert_held(&rq->cpu_epoch_lock);
=20
+	unsigned int period =3D max(READ_ONCE(llc_epoch_period), 1U);
 	unsigned long n, now =3D jiffies;
 	long delta =3D now - rq->cpu_epoch_next;
=20
 	if (delta > 0) {
-		n =3D (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
+		n =3D (delta + period - 1) / period;
 		rq->cpu_epoch +=3D n;
-		rq->cpu_epoch_next +=3D n * EPOCH_PERIOD;
+		rq->cpu_epoch_next +=3D n * period;
 		__shr_u64(&rq->cpu_runtime, n);
 	}
=20
@@ -1611,7 +1662,7 @@ void account_mm_sched(struct rq *rq, struct task_stru=
ct *p, s64 delta_exec)
 	 * If this process hasn't hit task_cache_work() for a while invalidate
 	 * its preferred state.
 	 */
-	if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
+	if (epoch - READ_ONCE(mm->sc_stat.epoch) > llc_epoch_affinity_timeout ||
 	    invalid_llc_nr(mm, p, cpu_of(rq)) ||
 	    exceed_llc_capacity(mm, cpu_of(rq))) {
 		if (mm->sc_stat.cpu !=3D -1)
@@ -1740,7 +1791,8 @@ static void task_cache_work(struct callback_head *wor=
k)
=20
 	/* only 1 thread is allowed to scan */
 	if (!try_cmpxchg(&mm->sc_stat.next_scan, &next_scan,
-			 now + EPOCH_PERIOD))
+			 now + max_t(unsigned long,
+				     READ_ONCE(llc_epoch_period), 1)))
 		return;
=20
 	curr_cpu =3D task_cpu(p);
@@ -10232,7 +10284,7 @@ static inline int task_is_ineligible_on_dst_cpu(str=
uct task_struct *p, int dest_
  */
 static bool fits_llc_capacity(unsigned long util, unsigned long max)
 {
-	u32 aggr_pct =3D 50;
+	u32 aggr_pct =3D llc_overaggr_pct;
=20
 	/*
 	 * For single core systems, raise the aggregation
@@ -10252,7 +10304,7 @@ static bool fits_llc_capacity(unsigned long util, u=
nsigned long max)
  */
 /* Allows dst util to be bigger than src util by up to bias percent */
 #define util_greater(util1, util2) \
-	((util1) * 100 > (util2) * 120)
+	((util1) * 100 > (util2) * (100 + llc_imb_pct))
=20
 static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
 					 unsigned long *cap)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f499d5dd1130..27409399137c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4072,6 +4072,11 @@ static inline void mm_cid_switch_to(struct task_stru=
ct *prev, struct task_struct
 DECLARE_STATIC_KEY_FALSE(sched_cache_present);
 DECLARE_STATIC_KEY_FALSE(sched_cache_active);
 extern int sysctl_sched_cache_user;
+extern unsigned int llc_aggr_tolerance;
+extern unsigned int llc_epoch_period;
+extern unsigned int llc_epoch_affinity_timeout;
+extern unsigned int llc_imb_pct;
+extern unsigned int llc_overaggr_pct;
=20
 static inline bool sched_cache_enabled(void)
 {
--=20
2.32.0
From nobody Fri Jun 12 15:46:56 2026
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chen Yu <yu.c.chen@intel.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Hillf Danton <hdanton@sina.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Jianyong Wu <jianyong.wu@outlook.com>,
	Yangyu Chen <cyy@cyyself.name>,
	Tingyin Duan <tingyin.duan@gmail.com>,
	Vern Hao <vernhao@tencent.com>,
	Vern Hao <haoxing990@gmail.com>,
	Len Brown <len.brown@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aubrey Li <aubrey.li@intel.com>,
	Zhao Liu <zhao1.liu@intel.com>,
	Chen Yu <yu.chen.surf@gmail.com>,
	Adam Li <adamli@os.amperecomputing.com>,
	Aaron Lu <ziqianlu@bytedance.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Josh Don <joshdon@google.com>,
	Gavin Guo <gavinguo@igalia.com>,
	Qais Yousef <qyousef@layalina.io>,
	Libo Chen <libchen@purestorage.com>,
	Luo Gengkun <luogengkun2@huawei.com>,
	linux-kernel@vger.kernel.org
Subject: [Patch v4 07/16] sched/cache: Fix rcu warning when accessing sd_llc
 domain
Date: Wed, 13 May 2026 13:39:18 -0700
Message-Id: 
 <2dc49455e861215d8059a1c877953f0b95990038.1778703694.git.tim.c.chen@linux.intel.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <cover.1778703694.git.tim.c.chen@linux.intel.com>
References: <cover.1778703694.git.tim.c.chen@linux.intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Chen Yu <yu.c.chen@intel.com>

task_cache_work() reads the per-CPU sd_llc pointer with a plain
per_cpu() access. sd_llc is RCU-protected, so use
rcu_dereference_all() to access it and silence the RCU warning.

This bug was reported by sashiko.

Fixes: df0d98475954 ("sched/cache: Introduce infrastructure for cache-aware=
 load balancing")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01ce646792ff..be96d80c9310 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1814,7 +1814,7 @@ static void task_cache_work(struct callback_head *wor=
k)
=20
 		for_each_cpu(cpu, cpus) {
 			/* XXX sched_cluster_active */
-			struct sched_domain *sd =3D per_cpu(sd_llc, cpu);
+			struct sched_domain *sd =3D rcu_dereference_all(per_cpu(sd_llc, cpu));
 			unsigned long occ, m_occ =3D 0, a_occ =3D 0;
 			int m_cpu =3D -1, i;
=20
--=20
2.32.0
From nobody Fri Jun 12 15:46:56 2026
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chen Yu <yu.c.chen@intel.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Hillf Danton <hdanton@sina.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Jianyong Wu <jianyong.wu@outlook.com>,
	Yangyu Chen <cyy@cyyself.name>,
	Tingyin Duan <tingyin.duan@gmail.com>,
	Vern Hao <vernhao@tencent.com>,
	Vern Hao <haoxing990@gmail.com>,
	Len Brown <len.brown@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aubrey Li <aubrey.li@intel.com>,
	Zhao Liu <zhao1.liu@intel.com>,
	Chen Yu <yu.chen.surf@gmail.com>,
	Adam Li <adamli@os.amperecomputing.com>,
	Aaron Lu <ziqianlu@bytedance.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Josh Don <joshdon@google.com>,
	Gavin Guo <gavinguo@igalia.com>,
	Qais Yousef <qyousef@layalina.io>,
	Libo Chen <libchen@purestorage.com>,
	Luo Gengkun <luogengkun2@huawei.com>,
	linux-kernel@vger.kernel.org
Subject: [Patch v4 08/16] sched/cache: Fix potential NULL mm pointer access
Date: Wed, 13 May 2026 13:39:19 -0700
Message-Id: 
 <066d8cfa45d4822bf4367e788c50377c66bbcc82.1778703694.git.tim.c.chen@linux.intel.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <cover.1778703694.git.tim.c.chen@linux.intel.com>
References: <cover.1778703694.git.tim.c.chen@linux.intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Chen Yu <yu.c.chen@intel.com>

A concurrent task exit might set p->mm to NULL, so re-reading
p->mm in account_mm_sched() after the NULL check can dereference
a NULL pointer. Use the locally cached mm pointer instead, since
the active_mm reference guarantees the structure remains
allocated. Also skip kernel threads in task_tick_cache(), because
they have nothing to do with cache-aware scheduling.

This bug was reported by sashiko and Vern.

Fixes: df0d98475954 ("sched/cache: Introduce infrastructure for cache-aware=
 load balancing")
Reported-by: Vern Hao <haoxing990@gmail.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/all/09cf7ee3-6e27-4505-9692-4b4a4707c8b2@gmai=
l.com/
---
 kernel/sched/fair.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index be96d80c9310..913b09254732 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1649,7 +1649,7 @@ void account_mm_sched(struct rq *rq, struct task_stru=
ct *p, s64 delta_exec)
 	if (!mm || !mm->sc_stat.pcpu_sched)
 		return;
=20
-	pcpu_sched =3D per_cpu_ptr(p->mm->sc_stat.pcpu_sched, cpu_of(rq));
+	pcpu_sched =3D per_cpu_ptr(mm->sc_stat.pcpu_sched, cpu_of(rq));
=20
 	scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
 		__update_mm_sched(rq, pcpu_sched);
@@ -1689,7 +1689,8 @@ static void task_tick_cache(struct rq *rq, struct tas=
k_struct *p)
 	if (!sched_cache_enabled())
 		return;
=20
-	if (!mm || !mm->sc_stat.pcpu_sched)
+	if (!mm || p->flags & PF_KTHREAD ||
+	    !mm->sc_stat.pcpu_sched)
 		return;
=20
 	epoch =3D rq->cpu_epoch;
--=20
2.32.0
From nobody Fri Jun 12 15:46:56 2026
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chen Yu <yu.c.chen@intel.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Hillf Danton <hdanton@sina.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Jianyong Wu <jianyong.wu@outlook.com>,
	Yangyu Chen <cyy@cyyself.name>,
	Tingyin Duan <tingyin.duan@gmail.com>,
	Vern Hao <vernhao@tencent.com>,
	Vern Hao <haoxing990@gmail.com>,
	Len Brown <len.brown@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aubrey Li <aubrey.li@intel.com>,
	Zhao Liu <zhao1.liu@intel.com>,
	Chen Yu <yu.chen.surf@gmail.com>,
	Adam Li <adamli@os.amperecomputing.com>,
	Aaron Lu <ziqianlu@bytedance.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Josh Don <joshdon@google.com>,
	Gavin Guo <gavinguo@igalia.com>,
	Qais Yousef <qyousef@layalina.io>,
	Libo Chen <libchen@purestorage.com>,
	Luo Gengkun <luogengkun2@huawei.com>,
	linux-kernel@vger.kernel.org
Subject: [Patch v4 09/16] sched/cache: Annotate lockless accesses to
 mm->sc_stat.cpu
Date: Wed, 13 May 2026 13:39:20 -0700
Message-Id: 
 <63ea494f12efcf265d7134400a06cd75d7f2c310.1778703694.git.tim.c.chen@linux.intel.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <cover.1778703694.git.tim.c.chen@linux.intel.com>
References: <cover.1778703694.git.tim.c.chen@linux.intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Chen Yu <yu.c.chen@intel.com>

mm->sc_stat.cpu is written by task_cache_work() and can be read
locklessly by several functions on other CPUs. Use READ_ONCE() and
WRITE_ONCE() for every read and write of mm->sc_stat.cpu, so that
the compiler cannot split one logical access into several loads
that observe inconsistent values.

For example, in get_pref_llc(), if the writer updated the field
between two compiler-generated loads, the validation (e.g., cpu
!=3D -1) and the subsequent use (e.g., llc_id(cpu)) could operate on
different values, allowing a negative CPU ID to be used as an index.

Leave the plain write in mm_init_sched(), where the mm is not
yet visible to other CPUs.

This bug was reported by sashiko.

Fixes: 47d8696b95f7 ("sched/cache: Assign preferred LLC ID to processes")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
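Note (not part of the commit): the single-snapshot pattern described
above can be sketched in userspace C as follows. This is a minimal
illustration under stated assumptions, not the kernel implementation:
READ_ONCE_INT(), WRITE_ONCE_INT(), struct sc_stat_demo,
demo_llc_of_cpu[] and the demo_*() helpers are hypothetical stand-ins
for READ_ONCE(), WRITE_ONCE(), mm->sc_stat and llc_id().

```c
#include <assert.h>

/*
 * Userspace stand-ins for the kernel's READ_ONCE()/WRITE_ONCE(),
 * restricted to int (assumption made for this sketch only).
 */
#define READ_ONCE_INT(x)	(*(const volatile int *)&(x))
#define WRITE_ONCE_INT(x, v)	(*(volatile int *)&(x) = (v))

#define DEMO_NR_CPUS	4

/* Hypothetical miniature of mm->sc_stat; only cpu matters here. */
struct sc_stat_demo {
	int cpu;	/* preferred CPU, or -1 when there is none */
};

/* Hypothetical CPU-to-LLC table, indexed by a valid CPU number. */
static const int demo_llc_of_cpu[DEMO_NR_CPUS] = { 0, 0, 1, 1 };

/* Load the field once, then validate and use that same value. */
static int demo_get_pref_llc(struct sc_stat_demo *st)
{
	int cpu = READ_ONCE_INT(st->cpu);

	if (cpu < 0 || cpu >= DEMO_NR_CPUS)
		return -1;

	return demo_llc_of_cpu[cpu];
}

/* Invalidate the preference, writing only when it is still set. */
static void demo_clear_pref(struct sc_stat_demo *st)
{
	if (READ_ONCE_INT(st->cpu) != -1)
		WRITE_ONCE_INT(st->cpu, -1);
}

/* Small self-check of the two helpers above. */
static void demo_selfcheck(void)
{
	struct sc_stat_demo st = { .cpu = 2 };

	assert(demo_get_pref_llc(&st) == 1);
	demo_clear_pref(&st);
	assert(demo_get_pref_llc(&st) == -1);
}
```

Because the value is copied into a local variable before the range
check, a concurrent writer cannot make the check and the table lookup
see different CPU numbers.
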
 kernel/sched/fair.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 913b09254732..73f185ba6e48 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1598,13 +1598,14 @@ static unsigned long fraction_mm_sched(struct rq *r=
q,
=20
 static int get_pref_llc(struct task_struct *p, struct mm_struct *mm)
 {
-	int mm_sched_llc =3D -1;
+	int mm_sched_llc =3D -1, mm_sched_cpu;
=20
 	if (!mm)
 		return -1;
=20
-	if (mm->sc_stat.cpu !=3D -1) {
-		mm_sched_llc =3D llc_id(mm->sc_stat.cpu);
+	mm_sched_cpu =3D READ_ONCE(mm->sc_stat.cpu);
+	if (mm_sched_cpu !=3D -1) {
+		mm_sched_llc =3D llc_id(mm_sched_cpu);
=20
 #ifdef CONFIG_NUMA_BALANCING
 		/*
@@ -1619,7 +1620,7 @@ static int get_pref_llc(struct task_struct *p, struct=
 mm_struct *mm)
 		 */
 		if (static_branch_likely(&sched_numa_balancing) &&
 		    p->numa_preferred_nid >=3D 0 &&
-		    cpu_to_node(mm->sc_stat.cpu) !=3D p->numa_preferred_nid)
+		    cpu_to_node(mm_sched_cpu) !=3D p->numa_preferred_nid)
 			mm_sched_llc =3D -1;
 #endif
 	}
@@ -1665,8 +1666,8 @@ void account_mm_sched(struct rq *rq, struct task_stru=
ct *p, s64 delta_exec)
 	if (epoch - READ_ONCE(mm->sc_stat.epoch) > llc_epoch_affinity_timeout ||
 	    invalid_llc_nr(mm, p, cpu_of(rq)) ||
 	    exceed_llc_capacity(mm, cpu_of(rq))) {
-		if (mm->sc_stat.cpu !=3D -1)
-			mm->sc_stat.cpu =3D -1;
+		if (READ_ONCE(mm->sc_stat.cpu) !=3D -1)
+			WRITE_ONCE(mm->sc_stat.cpu, -1);
 	}
=20
 	mm_sched_llc =3D get_pref_llc(p, mm);
@@ -1714,7 +1715,7 @@ static void get_scan_cpumasks(cpumask_var_t cpus, str=
uct task_struct *p)
 	if (!static_branch_likely(&sched_numa_balancing))
 		goto out;
=20
-	cpu =3D p->mm->sc_stat.cpu;
+	cpu =3D READ_ONCE(p->mm->sc_stat.cpu);
 	if (cpu !=3D -1)
 		nid =3D cpu_to_node(cpu);
 	curr_cpu =3D task_cpu(p);
@@ -1799,8 +1800,8 @@ static void task_cache_work(struct callback_head *wor=
k)
 	curr_cpu =3D task_cpu(p);
 	if (invalid_llc_nr(mm, p, curr_cpu) ||
 	    exceed_llc_capacity(mm, curr_cpu)) {
-		if (mm->sc_stat.cpu !=3D -1)
-			mm->sc_stat.cpu =3D -1;
+		if (READ_ONCE(mm->sc_stat.cpu) !=3D -1)
+			WRITE_ONCE(mm->sc_stat.cpu, -1);
=20
 		return;
 	}
@@ -1857,7 +1858,7 @@ static void task_cache_work(struct callback_head *wor=
k)
 				m_a_cpu =3D m_cpu;
 			}
=20
-			if (llc_id(cpu) =3D=3D llc_id(mm->sc_stat.cpu))
+			if (llc_id(cpu) =3D=3D llc_id(READ_ONCE(mm->sc_stat.cpu)))
 				curr_m_a_occ =3D a_occ;
=20
 			cpumask_andnot(cpus, cpus, sched_domain_span(sd));
@@ -1875,7 +1876,7 @@ static void task_cache_work(struct callback_head *wor=
k)
 		 * 3. 2X is chosen based on test results, as it delivers
 		 *    the optimal performance gain so far.
 		 */
-		mm->sc_stat.cpu =3D m_a_cpu;
+		WRITE_ONCE(mm->sc_stat.cpu, m_a_cpu);
 	}
=20
 	update_avg_scale(&mm->sc_stat.nr_running_avg, nr_running);
@@ -10441,15 +10442,15 @@ static enum llc_mig can_migrate_llc_task(int src_=
cpu, int dst_cpu,
 	if (!mm)
 		return mig_unrestricted;
=20
-	cpu =3D mm->sc_stat.cpu;
+	cpu =3D READ_ONCE(mm->sc_stat.cpu);
 	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
 		return mig_unrestricted;
=20
 	/* skip cache aware load balance for too many threads */
 	if (invalid_llc_nr(mm, p, dst_cpu) ||
 	    exceed_llc_capacity(mm, dst_cpu)) {
-		if (mm->sc_stat.cpu !=3D -1)
-			mm->sc_stat.cpu =3D -1;
+		if (READ_ONCE(mm->sc_stat.cpu) !=3D -1)
+			WRITE_ONCE(mm->sc_stat.cpu, -1);
 		return mig_unrestricted;
 	}
=20
--=20
2.32.0
From nobody Fri Jun 12 15:46:56 2026
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chen Yu <yu.c.chen@intel.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Hillf Danton <hdanton@sina.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Jianyong Wu <jianyong.wu@outlook.com>,
	Yangyu Chen <cyy@cyyself.name>,
	Tingyin Duan <tingyin.duan@gmail.com>,
	Vern Hao <vernhao@tencent.com>,
	Vern Hao <haoxing990@gmail.com>,
	Len Brown <len.brown@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aubrey Li <aubrey.li@intel.com>,
	Zhao Liu <zhao1.liu@intel.com>,
	Chen Yu <yu.chen.surf@gmail.com>,
	Adam Li <adamli@os.amperecomputing.com>,
	Aaron Lu <ziqianlu@bytedance.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Josh Don <joshdon@google.com>,
	Gavin Guo <gavinguo@igalia.com>,
	Qais Yousef <qyousef@layalina.io>,
	Libo Chen <libchen@purestorage.com>,
	Luo Gengkun <luogengkun2@huawei.com>,
	linux-kernel@vger.kernel.org
Subject: [Patch v4 10/16] sched/cache: Fix unpaired
 account_llc_enqueue/dequeue
Date: Wed, 13 May 2026 13:39:21 -0700
Message-Id: 
 <0c8c6a1571d66792a4d2ff0103ba3cc13e059046.1778703694.git.tim.c.chen@linux.intel.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <cover.1778703694.git.tim.c.chen@linux.intel.com>
References: <cover.1778703694.git.tim.c.chen@linux.intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Chen Yu <yu.c.chen@intel.com>

There is a race condition: after a task is enqueued on a
runqueue, task_llc(p) may change due to CPU hotplug, because
llc_id is dynamically allocated and adjusted at runtime.
Checking task_llc(p) at dequeue time to decide whether the
task is leaving its preferred LLC is therefore unreliable
and can leave nr_pref_llc_running inconsistent.

To fix this, record in p->pref_llc_queued whether p was
enqueued on its preferred LLC, and use that flag in
account_llc_dequeue() to keep nr_pref_llc_running
consistent per runqueue.

This bug was reported by sashiko, and the solution was
suggested earlier by Prateek.

Fixes: 46afe3af7ead ("sched/cache: Track LLC-preferred tasks per runqueue")
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
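Note (not part of the commit): the enqueue/dequeue pairing described
above can be sketched in userspace C as follows. This is a simplified
model under stated assumptions, not the kernel code: struct demo_rq,
struct demo_task, the cur_llc field (standing in for task_llc(p)) and
the demo_*() helpers are hypothetical names.

```c
#include <assert.h>

/* Hypothetical miniature of the per-runqueue counters. */
struct demo_rq {
	int nr_llc_running;
	int nr_pref_llc_running;
};

/* Hypothetical miniature of the per-task state. */
struct demo_task {
	int preferred_llc;	/* -1 when the task has no preference */
	int cur_llc;		/* stands in for task_llc(p); may change */
	int pref_llc_queued;	/* 1 if counted in nr_pref_llc_running */
};

/* Count the task and remember whether it matched its preferred LLC. */
static void demo_enqueue(struct demo_rq *rq, struct demo_task *p)
{
	int queued;

	if (p->preferred_llc < 0)
		return;

	queued = (p->preferred_llc == p->cur_llc);
	rq->nr_llc_running++;
	rq->nr_pref_llc_running += queued;
	p->pref_llc_queued = queued;
}

/* Undo exactly what enqueue did, without re-reading cur_llc. */
static void demo_dequeue(struct demo_rq *rq, struct demo_task *p)
{
	if (p->preferred_llc < 0)
		return;

	rq->nr_llc_running--;
	if (p->pref_llc_queued) {
		rq->nr_pref_llc_running--;
		p->pref_llc_queued = 0;
	}
}

/* Self-check: the LLC id changes while the task is queued. */
static void demo_pairing_selfcheck(void)
{
	struct demo_rq rq = { 0, 0 };
	struct demo_task p = { .preferred_llc = 3, .cur_llc = 3 };

	demo_enqueue(&rq, &p);
	assert(rq.nr_pref_llc_running == 1);

	p.cur_llc = 5;	/* models an llc_id change caused by hotplug */
	demo_dequeue(&rq, &p);
	assert(rq.nr_llc_running == 0);
	assert(rq.nr_pref_llc_running == 0);
}
```

Had the dequeue side compared preferred_llc with cur_llc again, the
counter in this scenario would have stayed at 1 after the task left.
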
 include/linux/sched.h |  2 ++
 init/init_task.c      |  1 +
 kernel/sched/fair.c   | 31 ++++++++++++++++++++++++++++---
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 95729670929c..2c9e8e2edde1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1410,6 +1410,8 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CACHE
 	struct callback_head		cache_work;
 	int				preferred_llc;
+	/* 1: task was enqueued to its preferred LLC, 0 otherwise */
+	int				pref_llc_queued;
 #endif
=20
 	struct rseq_data		rseq;
diff --git a/init/init_task.c b/init/init_task.c
index 5d90db4ff1f8..3ecd66fbd563 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -217,6 +217,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) =
=3D {
 #endif
 #ifdef CONFIG_SCHED_CACHE
 	.preferred_llc  =3D -1,
+	.pref_llc_queued  =3D 0,
 #endif
 #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
 	.kasan_depth	=3D 1,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 73f185ba6e48..9e6edd40cd80 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1472,15 +1472,32 @@ static bool invalid_llc_nr(struct mm_struct *mm, st=
ruct task_struct *p,
=20
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
 {
+	int pref_llc, pref_llc_queued;
 	struct sched_domain *sd;
-	int pref_llc;
=20
 	pref_llc =3D p->preferred_llc;
 	if (pref_llc < 0)
 		return;
=20
+	pref_llc_queued =3D (pref_llc =3D=3D task_llc(p));
 	rq->nr_llc_running++;
-	rq->nr_pref_llc_running +=3D (pref_llc =3D=3D task_llc(p));
+	rq->nr_pref_llc_running +=3D pref_llc_queued;
+
+	/*
+	 * Record whether p is enqueued on its preferred
+	 * LLC, in order to pair with account_llc_dequeue()
+	 * to maintain a consistent nr_pref_llc_running per
+	 * runqueue.
+	 * This is necessary because a race condition exists:
+	 * after a task is enqueued on a runqueue, task_llc(p)
+	 * may change due to CPU hotplug. Therefore, checking
+	 * task_llc(p) to determine whether the task is being
+	 * dequeued from its preferred LLC is unreliable and
+	 * can cause inconsistent values - checking the
+	 * p->pref_llc_queued in account_llc_dequeue() would
+	 * be reliable.
+	 */
+	p->pref_llc_queued =3D pref_llc_queued;
=20
 	sd =3D rcu_dereference_all(rq->sd);
 	if (sd && (unsigned int)pref_llc < sd->llc_max)
@@ -1497,7 +1514,15 @@ static void account_llc_dequeue(struct rq *rq, struc=
t task_struct *p)
 		return;
=20
 	rq->nr_llc_running--;
-	rq->nr_pref_llc_running -=3D (pref_llc =3D=3D task_llc(p));
+	if (p->pref_llc_queued) {
+		rq->nr_pref_llc_running--;
+		/*
+		 * Update the status in case
+		 * other logic might query
+		 * this.
+		 */
+		p->pref_llc_queued =3D 0;
+	}
=20
 	sd =3D rcu_dereference_all(rq->sd);
 	if (sd && (unsigned int)pref_llc < sd->llc_max) {
--=20
2.32.0
From nobody Fri Jun 12 15:46:56 2026
Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id B5FEF368D66
	for <linux-kernel@vger.kernel.org>; Wed, 13 May 2026 20:33:50 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=198.175.65.17
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1778704432; cv=none;
 b=RZdfwgkAMd0J0wtb4p3LDpp/B/KJ8nbHlqaB5wilcHRjN4rnK2LJ5Q9mC9Y25TlIR5JOtmwGHjyl8ibGcX2BsIztVR6atI597XuWQRyI/BLqnkYpKfAcWp+Wprgnf3yny7VoLptVVcAETfSl7udeq/wuG6FYG1k8ey246cP8NE0=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1778704432; c=relaxed/simple;
	bh=UGo6DN50Uv/tRUerMamkZ/Y7eGfcGZqAsqtg5pBXdUo=;
	h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References:
	 MIME-Version;
 b=tWaC3iIZ3bWg52+XSoyq+uR9J7v5btGiCpDW+Cg2eu8mplC71kK081orr4kP35Lgqb8+TG8C0cY+tNZ697GOVTbP4d1Vt+Z57KUb10dir8QSA+lsuSHXI69dVYh7V3c2U+TyuVmEeRcL4mA/bnhGikjWsQY31/4OPeC0rEee+pE=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linux.intel.com;
 spf=pass smtp.mailfrom=linux.intel.com;
 dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b=SXyVODbl; arc=none smtp.client-ip=198.175.65.17
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linux.intel.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=linux.intel.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b="SXyVODbl"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1778704431; x=1810240431;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=UGo6DN50Uv/tRUerMamkZ/Y7eGfcGZqAsqtg5pBXdUo=;
  b=SXyVODblH1seiWkaEWmDeGSwisRQpwjwuounnu/NxGcwUFkptEIiw3EL
   ERSzljRetjiuM5eVVRFs5qN7S1Wa0bqXcFpicqYbmiyzudDfnSH7pVb63
   GmyeJ/lM63WOiXSPDmQFXfE5JCseekmuLAIB5ALFcUeDv8lfwOddm5NBG
   GhjYct8t9lp7Umx44a9KUH9leNxspYaMJcYw1fG1HYk0yi0d8b4mfIeIa
   mqkM1noQM8q4N4Q2R43oLNzQ79lIyvdwXePjFdtan3CEezzz7wkoYFG9T
   tWMakTLyjcsTrBFABg/gbJgg14ba6HkY4+I6Q5RifHWV/JrKyTvKpft2o
   A==;
X-CSE-ConnectionGUID: Y4K4TgKESPSYposRryIQOw==
X-CSE-MsgGUID: Iy41SqAWTruo1d1RFJCang==
X-IronPort-AV: E=McAfee;i="6800,10657,11785"; a="79623199"
X-IronPort-AV: E=Sophos;i="6.23,233,1770624000";
   d="scan'208";a="79623199"
Received: from orviesa008.jf.intel.com ([10.64.159.148])
  by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 13 May 2026 13:33:44 -0700
X-CSE-ConnectionGUID: g2LzkM4+RWySZsJCkY5Tbg==
X-CSE-MsgGUID: UIXm4gRrQjWfkBmKQJpgDw==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.23,233,1770624000";
   d="scan'208";a="238076392"
Received: from b04f130c83f2.jf.intel.com ([10.165.154.98])
  by orviesa008.jf.intel.com with ESMTP; 13 May 2026 13:33:44 -0700
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chen Yu <yu.c.chen@intel.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Hillf Danton <hdanton@sina.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Jianyong Wu <jianyong.wu@outlook.com>,
	Yangyu Chen <cyy@cyyself.name>,
	Tingyin Duan <tingyin.duan@gmail.com>,
	Vern Hao <vernhao@tencent.com>,
	Vern Hao <haoxing990@gmail.com>,
	Len Brown <len.brown@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aubrey Li <aubrey.li@intel.com>,
	Zhao Liu <zhao1.liu@intel.com>,
	Chen Yu <yu.chen.surf@gmail.com>,
	Adam Li <adamli@os.amperecomputing.com>,
	Aaron Lu <ziqianlu@bytedance.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Josh Don <joshdon@google.com>,
	Gavin Guo <gavinguo@igalia.com>,
	Qais Yousef <qyousef@layalina.io>,
	Libo Chen <libchen@purestorage.com>,
	Luo Gengkun <luogengkun2@huawei.com>,
	linux-kernel@vger.kernel.org
Subject: [Patch v4 11/16] sched/cache: Fix checking active load balance by
 only considering the CFS task
Date: Wed, 13 May 2026 13:39:22 -0700
Message-Id: 
 <f9161133cf040d286dca11344a112c5ef2a5253d.1778703694.git.tim.c.chen@linux.intel.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <cover.1778703694.git.tim.c.chen@linux.intel.com>
References: <cover.1778703694.git.tim.c.chen@linux.intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Chen Yu <yu.c.chen@intel.com>

The currently running task cur may not be a CFS task; it could be,
for example, an RT or Deadline task. The utilization average read
by task_util(cur) is not maintained for non-CFS tasks, so a stale
or meaningless value might be passed to can_migrate_llc().

Check that the task is a CFS task before reading its task_util().

This bug was reported by sashiko.

Fixes: 714059f79ff0 ("sched/cache: Handle moving single tasks to/from their=
 preferred LLC")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
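Reviewer note: the check described above can be modeled in user space
as follows. This is only a sketch; struct model_task, enum model_kind
and model_curr_util() are hypothetical names, not kernel symbols, and
the fallback value of 0 is an assumption about how util is initialized.

```c
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; not the real definitions. */
enum model_kind { MODEL_FAIR, MODEL_RT, MODEL_DL };

struct model_task {
	enum model_kind kind;
	unsigned long util_avg;	/* only maintained for MODEL_FAIR tasks */
};

/*
 * Return the utilization to pass on, or 0 (assumed default) when there
 * is no current task or the current task is not in the fair class.
 */
static unsigned long model_curr_util(const struct model_task *cur)
{
	unsigned long util = 0;

	if (cur && cur->kind == MODEL_FAIR)
		util = cur->util_avg;

	return util;
}
```

With this shape, an RT or Deadline task never contributes its
unmaintained util_avg to the migration decision.
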
 kernel/sched/fair.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9e6edd40cd80..8617cd3642c7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10509,7 +10509,8 @@ alb_break_llc(struct lb_env *env)
 	/*
 	 * All tasks prefer to stay on their current CPU.
 	 * Do not pull a task from its preferred CPU if:
-	 * 1. It is the only task running there(not too imbalance); OR
+	 * 1. It is the only task running there and the imbalance
+	 *    is within the allowed limit; OR
 	 * 2. Migrating it away from its preferred LLC would violate
 	 *    the cache-aware scheduling policy.
 	 */
@@ -10522,7 +10523,7 @@ alb_break_llc(struct lb_env *env)
 			return true;
=20
 		cur =3D rcu_dereference_all(env->src_rq->curr);
-		if (cur)
+		if (cur && cur->sched_class =3D=3D &fair_sched_class)
 			util =3D task_util(cur);
=20
 		if (can_migrate_llc(env->src_cpu, env->dst_cpu,
--=20
2.32.0
From nobody Fri Jun 12 15:46:56 2026
Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id B14992E0901
	for <linux-kernel@vger.kernel.org>; Wed, 13 May 2026 20:33:51 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=198.175.65.17
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1778704433; cv=none;
 b=Su66mD6x8WCs99MgCcfVMOfZBASSVMTIl4B8JwJbuOEBUB4tTrLH1vOIbeYqzHaX/0NM0l5fqVmQAv+RbDOd9txljQVCoGpQ9GqelFDzz60b0JY83Vz7ueB2vMy6owlx0ylzuRceaCkUDHS8+D1xzG2td00TYphDKViPODL53ZA=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1778704433; c=relaxed/simple;
	bh=XbuGDa7JJ7FVqD5xSNQStR6gz6ySjrTGgL8YslZ2MOY=;
	h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References:
	 MIME-Version;
 b=UZQAKbN8XqLNn1ndZiyTc4LaleobIPyv4bshd2iLlTSeoVh5Rxl6p1wUd61SJqyVlCY2U6mBQUXow0fhFb/dBKafV9mMy0WpmFuJDYs/gKY5qrD6rGx81ocxPJHFBRyF/ngeIm5wHk79FOIb/6lScuxqYHxO34xBo1NF3w700Kw=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linux.intel.com;
 spf=pass smtp.mailfrom=linux.intel.com;
 dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b=oHS+MsNU; arc=none smtp.client-ip=198.175.65.17
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linux.intel.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=linux.intel.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b="oHS+MsNU"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1778704432; x=1810240432;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=XbuGDa7JJ7FVqD5xSNQStR6gz6ySjrTGgL8YslZ2MOY=;
  b=oHS+MsNUjJ9M+r1YZv72Ts0PFqtf4nvzd/3kp1mA60atqXtFcZAZQuwb
   zDnoFa2Tsq62xNHX6Sm1HMSrILzMoHZEeY8cWPvwXpOe5Kf75LU5NSfMv
   T3J5X4I6lnlrQP9/6oF/kDCNQpcitb3Mhd+J+LacRdplDlCZ7Yn9LBiAF
   pkCJIakapVjp4z/EiMuyaeT7a0Emv1MrWJOBZ5rfyXd1aD658UrJciB3g
   rzWoUN7NptTDwYFu3QnwC7SeDH2Hr/tlk0Z7zTRObX2pE/JLT6b4eKhDA
   whLQByHOSdSzDuXNnEKmp8xGeXtEanB/i208bxoBCnkBOgRHKkW1RjuUl
   Q==;
X-CSE-ConnectionGUID: 0OZa9K75SseKqH4UFhsYjA==
X-CSE-MsgGUID: XDVtHtarRHmT2f/GVmrVUw==
X-IronPort-AV: E=McAfee;i="6800,10657,11785"; a="79623220"
X-IronPort-AV: E=Sophos;i="6.23,233,1770624000";
   d="scan'208";a="79623220"
Received: from orviesa008.jf.intel.com ([10.64.159.148])
  by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 13 May 2026 13:33:46 -0700
X-CSE-ConnectionGUID: 2uFbUuxzT8qEkaBirVibpg==
X-CSE-MsgGUID: 7+U8vwYWS1qQxb+xWNKJpw==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.23,233,1770624000";
   d="scan'208";a="238076397"
Received: from b04f130c83f2.jf.intel.com ([10.165.154.98])
  by orviesa008.jf.intel.com with ESMTP; 13 May 2026 13:33:45 -0700
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chen Yu <yu.c.chen@intel.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Hillf Danton <hdanton@sina.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Jianyong Wu <jianyong.wu@outlook.com>,
	Yangyu Chen <cyy@cyyself.name>,
	Tingyin Duan <tingyin.duan@gmail.com>,
	Vern Hao <vernhao@tencent.com>,
	Vern Hao <haoxing990@gmail.com>,
	Len Brown <len.brown@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aubrey Li <aubrey.li@intel.com>,
	Zhao Liu <zhao1.liu@intel.com>,
	Chen Yu <yu.chen.surf@gmail.com>,
	Adam Li <adamli@os.amperecomputing.com>,
	Aaron Lu <ziqianlu@bytedance.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Josh Don <joshdon@google.com>,
	Gavin Guo <gavinguo@igalia.com>,
	Qais Yousef <qyousef@layalina.io>,
	Libo Chen <libchen@purestorage.com>,
	Luo Gengkun <luogengkun2@huawei.com>,
	linux-kernel@vger.kernel.org
Subject: [Patch v4 12/16] sched/cache: Fix race condition during sched domain
 rebuild
Date: Wed, 13 May 2026 13:39:23 -0700
Message-Id: 
 <9afddf439687f04bb56b46625bd9f153eb8abad5.1778703694.git.tim.c.chen@linux.intel.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <cover.1778703694.git.tim.c.chen@linux.intel.com>
References: <cover.1778703694.git.tim.c.chen@linux.intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Chen Yu <yu.c.chen@intel.com>

sched_cache_active_set_unlocked() calls sched_cache_active_set(false),
which checks hardware support without holding any lock:
static void sched_cache_active_set(bool locked)
{
        /* hardware does not support */
        if (!static_branch_likely(&sched_cache_present)) {
                _sched_cache_active_set(false, locked);
                return;
        }
    ...
If build_sched_domains() runs concurrently during CPU hotplug, it can
disable sched_cache_present while holding sched_domains_mutex and the
CPU hotplug lock. If a debugfs writer sees sched_cache_present as true
just before that and is then blocked or preempted, it can go on to
enable sched_cache_active after hardware support has been marked as
absent. Close the race by taking cpus_read_lock() and
sched_domains_mutex_lock() when the user changes sched_cache_active
via debugfs.

This bug was reported by sashiko.

Fixes: 067a31358143 ("sched/cache: Allow the user space to turn on and off =
cache aware scheduling")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
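Reviewer note: a user-space model of the locking rule described above,
using a pthread mutex in place of cpus_read_lock() plus
sched_domains_mutex_lock(). All names (model_lock, model_present,
model_user_write(), model_rebuild()) are hypothetical. Unlike the
patch, this sketch also stores the user value under the lock, and the
rebuild path takes the lock itself rather than relying on its caller.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model state; these are not kernel symbols. */
static pthread_mutex_t model_lock = PTHREAD_MUTEX_INITIALIZER;
static bool model_present;	/* "hardware support" flag */
static bool model_user = true;	/* what the user asked for */
static bool model_active;	/* resulting state */

/* Caller must hold model_lock. */
static void model_active_set_locked(void)
{
	model_active = model_present && model_user;
}

/* User-facing writer: check and update happen under the lock. */
static void model_user_write(bool val)
{
	pthread_mutex_lock(&model_lock);
	model_user = val;
	model_active_set_locked();
	pthread_mutex_unlock(&model_lock);
}

/* Rebuild path: changes "present" under the same lock. */
static void model_rebuild(bool present)
{
	pthread_mutex_lock(&model_lock);
	model_present = present;
	model_active_set_locked();
	pthread_mutex_unlock(&model_lock);
}
```

Because both paths serialize on one lock, the writer can no longer
act on a "present" value that the rebuild path has since cleared.
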
 kernel/sched/debug.c    |  4 +++-
 kernel/sched/sched.h    |  2 +-
 kernel/sched/topology.c | 42 +++++++++++++++--------------------------
 3 files changed, 19 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index fe569539e888..ed3a0d65da0c 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -224,7 +224,9 @@ sched_cache_enable_write(struct file *filp, const char =
__user *ubuf,
=20
 	sysctl_sched_cache_user =3D val;
=20
-	sched_cache_active_set_unlocked();
+	sched_cache_active_set();
+
+	*ppos +=3D cnt;
=20
 	return cnt;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 27409399137c..45a3b77f46aa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4083,7 +4083,7 @@ static inline bool sched_cache_enabled(void)
 	return static_branch_unlikely(&sched_cache_active);
 }
=20
-extern void sched_cache_active_set_unlocked(void);
+extern void sched_cache_active_set(void);
=20
 #endif
=20
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 7248a7279abe..cff5a0ecd64d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -917,30 +917,19 @@ static bool alloc_sd_llc(const struct cpumask *cpu_ma=
p,
 	return false;
 }
=20
-static void _sched_cache_active_set(bool enable, bool locked)
-{
-	if (enable) {
-		if (locked)
-			static_branch_enable_cpuslocked(&sched_cache_active);
-		else
-			static_branch_enable(&sched_cache_active);
-	} else {
-		if (locked)
-			static_branch_disable_cpuslocked(&sched_cache_active);
-		else
-			static_branch_disable(&sched_cache_active);
-	}
-}
-
 /*
  * Enable/disable cache aware scheduling according to
  * user input and the presence of hardware support.
+ * Expected to be protected by cpus_read_lock() and
+ * sched_domains_mutex_lock()
  */
-static void sched_cache_active_set(bool locked)
+static void _sched_cache_active_set(void)
 {
 	/* hardware does not support */
 	if (!static_branch_likely(&sched_cache_present)) {
-		_sched_cache_active_set(false, locked);
+		static_branch_disable_cpuslocked(&sched_cache_active);
+		if (sched_debug())
+			pr_info("%s: cache aware scheduling not supported on this platform\n", =
__func__);
 		return;
 	}
=20
@@ -951,24 +940,23 @@ static void sched_cache_active_set(bool locked)
 	 * for now.
 	 */
 	if (sysctl_sched_cache_user) {
-		_sched_cache_active_set(true, locked);
+		static_branch_enable_cpuslocked(&sched_cache_active);
 		if (sched_debug())
 			pr_info("%s: enabling cache aware scheduling\n", __func__);
 	} else {
-		_sched_cache_active_set(false, locked);
+		static_branch_disable_cpuslocked(&sched_cache_active);
 		if (sched_debug())
 			pr_info("%s: disabling cache aware scheduling\n", __func__);
 	}
 }
=20
-static void sched_cache_active_set_locked(void)
-{
-	return sched_cache_active_set(true);
-}
-
-void sched_cache_active_set_unlocked(void)
+void sched_cache_active_set(void)
 {
-	return sched_cache_active_set(false);
+	cpus_read_lock();
+	sched_domains_mutex_lock();
+	_sched_cache_active_set();
+	sched_domains_mutex_unlock();
+	cpus_read_unlock();
 }
=20
 /*
@@ -3082,7 +3070,7 @@ build_sched_domains(const struct cpumask *cpu_map, st=
ruct sched_domain_attr *att
 	else
 		static_branch_disable_cpuslocked(&sched_cache_present);
=20
-	sched_cache_active_set_locked();
+	_sched_cache_active_set();
 #endif
 	__free_domain_allocs(&d, alloc_state, cpu_map);
=20
--=20
2.32.0
From nobody Fri Jun 12 15:46:56 2026
Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 33D23374E71
	for <linux-kernel@vger.kernel.org>; Wed, 13 May 2026 20:33:52 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=198.175.65.17
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1778704435; cv=none;
 b=ghgDAI9ryHDy5oqElfCu9DtSHTOi+QJEi40GlGxRyZWcQ3BOauuqGXfcBkpiyPXWrr92BqCrSTMtgU78fWMnjq147YUhAWCsUh9+fN/vWuCQ6yD9KuVdkBihtCrz3ajW6ixoe1d6iVFz5OeBHkh+s7JI2j5ISoPDNgWHTCws7PI=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1778704435; c=relaxed/simple;
	bh=/VmgYErXZx6IcFJgtwwENbrpk1uKBU0gJ+0pKw7HLsI=;
	h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References:
	 MIME-Version;
 b=cMHdsaZBk6xd+lKdzRL5Tm6hoCkY+LAsY/a8dKEjVOeLssngxmShjzHJKuOL7fH1FTqt95huJNwWuQSV4n4H0epAyRmRnxSIG1iYPugASDiX7kmFyCZZmmABT3Ynrp/FRiwv4elyQ0jErm1aZ53dzIlkkHBjlbp3x5J7eUTV3dQ=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linux.intel.com;
 spf=pass smtp.mailfrom=linux.intel.com;
 dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b=HxtzzN6y; arc=none smtp.client-ip=198.175.65.17
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linux.intel.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=linux.intel.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b="HxtzzN6y"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1778704432; x=1810240432;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=/VmgYErXZx6IcFJgtwwENbrpk1uKBU0gJ+0pKw7HLsI=;
  b=HxtzzN6yYMv4gJahnVnBCqIyny1GqUc6DsH6j7hsPtUV2f+SXqpdgqUn
   8FTVgTpv6UM3Kre7LsoyFzw1rLWJuhZmvBoCfimafTGx8u4mVL1W/oovX
   4fSqDGLfHy4+nmcSIGgZ0QfgkzNCQXQYA9FlUW58TandMVwN+E9fmj+Bv
   mzxvoMWw+Nd3oLi/HvvABSJnBAYma2T16cG+1lrTQ/FO3hRchx7suiUPt
   vCZMByTtixFxvyBKu0QTm/BfIr0gB4jvzUaI0+PVUU+EXMXQoRVkykWBW
   jJ1FbYz2kFqz89k5fQ8iPDNTRvpwuuUuypDG1vmLRSRsni+0K/ugGpwqi
   Q==;
X-CSE-ConnectionGUID: fPIq7/RCT7C1BrdnU/01Fw==
X-CSE-MsgGUID: U0rh/NKoQq6JD0ccyTCpRg==
X-IronPort-AV: E=McAfee;i="6800,10657,11785"; a="79623246"
X-IronPort-AV: E=Sophos;i="6.23,233,1770624000";
   d="scan'208";a="79623246"
Received: from orviesa008.jf.intel.com ([10.64.159.148])
  by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 13 May 2026 13:33:47 -0700
X-CSE-ConnectionGUID: dfLlWYxpQoK1Dh6kQ0pFrg==
X-CSE-MsgGUID: 2+WaR7QYR+iL6KLdW11vBA==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.23,233,1770624000";
   d="scan'208";a="238076400"
Received: from b04f130c83f2.jf.intel.com ([10.165.154.98])
  by orviesa008.jf.intel.com with ESMTP; 13 May 2026 13:33:46 -0700
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chen Yu <yu.c.chen@intel.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Hillf Danton <hdanton@sina.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Jianyong Wu <jianyong.wu@outlook.com>,
	Yangyu Chen <cyy@cyyself.name>,
	Tingyin Duan <tingyin.duan@gmail.com>,
	Vern Hao <vernhao@tencent.com>,
	Vern Hao <haoxing990@gmail.com>,
	Len Brown <len.brown@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aubrey Li <aubrey.li@intel.com>,
	Zhao Liu <zhao1.liu@intel.com>,
	Chen Yu <yu.chen.surf@gmail.com>,
	Adam Li <adamli@os.amperecomputing.com>,
	Aaron Lu <ziqianlu@bytedance.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Josh Don <joshdon@google.com>,
	Gavin Guo <gavinguo@igalia.com>,
	Qais Yousef <qyousef@layalina.io>,
	Libo Chen <libchen@purestorage.com>,
	Luo Gengkun <luogengkun2@huawei.com>,
	linux-kernel@vger.kernel.org
Subject: [Patch v4 13/16] sched/cache: Fix cache aware scheduling enabling for
 multi LLCs system
Date: Wed, 13 May 2026 13:39:24 -0700
Message-Id: 
 <6328a8a7f40925cec2a712d81ee58128a4c4444a.1778703694.git.tim.c.chen@linux.intel.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <cover.1778703694.git.tim.c.chen@linux.intel.com>
References: <cover.1778703694.git.tim.c.chen@linux.intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Chen Yu <yu.c.chen@intel.com>

Cache aware scheduling should be enabled when the system has multiple
LLCs. However, in the corner case of a single NUMA node with a single
LLC, the current implementation still turns cache aware scheduling on.
This happens because, when the check runs, the parent domain has not
yet been degenerated, so the current domain can have the same CPU span
as its parent. There is no need to turn cache aware scheduling on in
this scenario.

Fix it by walking up the parent domains to find one that spans more
CPUs than the current sd_llc. Cache aware scheduling is enabled only
if such a domain exists, which matches the hierarchy that remains
after the duplicated parent domains have been degenerated.

For example, the expected behavior would be:
2 sockets, 1 LLC per socket: MC span=3D0-3, PKG span=3D0-7, has_multi_llcs=
=3Dtrue
1 socket, 2 LLCs per socket: MC span=3D0-3, PKG span=3D0-7, has_multi_llcs=
=3Dtrue
2 sockets, 2 LLCs per socket: MC span=3D0-3, PKG span=3D0-7, has_multi_llcs=
=3Dtrue
1 socket, 1 LLC per socket: MC span=3D0-3, PKG span=3D0-3, has_multi_llcs=
=3Dfalse

This bug was reported by sashiko.

Fixes: d59f4fd1d303 ("sched/cache: Enable cache aware scheduling for=
 multi LLCs NUMA node")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
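Reviewer note: the parent walk can be tried out in user space with the
sketch below. struct model_sd is a hypothetical, heavily simplified
stand-in for struct sched_domain that keeps only the parent pointer
and the span weight; the examples in the commit message map to a
child of weight 4 under a parent of weight 8 (true) or 4 (false).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct sched_domain. */
struct model_sd {
	struct model_sd *parent;
	unsigned int span_weight;	/* number of CPUs spanned */
};

static bool model_sd_in_multi_llcs(const struct model_sd *sd)
{
	const struct model_sd *sdp = sd->parent;

	/* A single-CPU LLC gives nothing to aggregate onto. */
	if (sd->span_weight == 1)
		return false;

	/* Skip parents that span exactly the same number of CPUs. */
	while (sdp && sdp->span_weight == sd->span_weight)
		sdp = sdp->parent;

	/* A remaining parent spans more CPUs, hence more than one LLC. */
	return sdp != NULL;
}
```

The walk gives the same answer before and after duplicated parents are
removed, since it ignores them either way.
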
 kernel/sched/topology.c | 39 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 36 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index cff5a0ecd64d..07f0a3d28253 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1007,6 +1007,37 @@ static bool alloc_sd_llc(const struct cpumask *cpu_m=
ap,
 }
 #endif
=20
+/*
+ * Return true if @sd belongs to an LLC group whose enclosing
+ * partition spans more than one LLC. @sd must be the topmost
+ * SD_SHARE_LLC domain.
+ *
+ * Any duplicated parent domains with the same span as @sd are
+ * skipped: before cpu_attach_domain() degeneration these still
+ * exist, after degeneration the loop is a no-op. This makes the
+ * helper usable both during sched domain build and against an
+ * already-attached domain tree.
+ *
+ * Note: For systems with a single LLC per node, cache-aware
+ * scheduling is still enabled when multiple nodes exist.
+ * However, NUMA balancing decisions take precedence over
+ * cache-aware scheduling. Conversely, if there is only one
+ * LLC per partition, cache-aware scheduling should be disabled.
+ */
+static bool sd_in_multi_llcs(struct sched_domain *sd)
+{
+	struct sched_domain *sdp =3D sd->parent;
+
+	/* it does not make sense to aggregate to 1 CPU */
+	if (sd->span_weight =3D=3D 1)
+		return false;
+
+	while (sdp && sdp->span_weight =3D=3D sd->span_weight)
+		sdp =3D sdp->parent;
+
+	return !!sdp;
+}
+
 /*
  * Return the canonical balance CPU for this group, this is the first CPU
  * of this group that's also in the balance mask.
@@ -3016,9 +3047,11 @@ build_sched_domains(const struct cpumask *cpu_map, s=
truct sched_domain_attr *att
 			 * NUMA imbalance stats for the hierarchy.
 			 */
 			if (sd->parent) {
-			    if (IS_ENABLED(CONFIG_NUMA))
-				    adjust_numa_imbalance(sd);
-			    has_multi_llcs =3D true;
+				if (IS_ENABLED(CONFIG_NUMA))
+					adjust_numa_imbalance(sd);
+
+				if (sd_in_multi_llcs(sd))
+					has_multi_llcs =3D true;
 			}
 		}
 	}
--=20
2.32.0
From nobody Fri Jun 12 15:46:56 2026
Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7A5803815FE
	for <linux-kernel@vger.kernel.org>; Wed, 13 May 2026 20:33:53 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=198.175.65.17
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1778704435; cv=none;
 b=AcATnOnej1q4kulz77PdgRHGEXSYduHsCsYQHDLbLHbaoaFtJQOFbG0kI+Dcsob81koT55Gu10CnH1MieNw502QvAAWoC+nN50PK1OTaH0SYqO3AXNBSgdTOkuscC03F9cEA2AuemBlLazUMitNwnEKuzoM3PPf418UNplMz/JI=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1778704435; c=relaxed/simple;
	bh=QlxgnIL3wX/PxcACVKd44HmjipBp71zolANDvZekoIE=;
	h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References:
	 MIME-Version;
 b=RRmDUhaeb+jVz4din2znAuOM9lT8XQKea3WvkRs136eF2v+ReJRI6wisVngFyAJWEmxAT1hOKHLnJqSgHLLHa1OhDqBbMXwW3s8k183ufS2ts2raefPO7giVE+09FANleSK/Yn/KgSyTmDgHnBbau1fkrw+o4PjSTZmy74LFPH4=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linux.intel.com;
 spf=pass smtp.mailfrom=linux.intel.com;
 dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b=n31hyawj; arc=none smtp.client-ip=198.175.65.17
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linux.intel.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=linux.intel.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b="n31hyawj"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1778704434; x=1810240434;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=QlxgnIL3wX/PxcACVKd44HmjipBp71zolANDvZekoIE=;
  b=n31hyawjCEyjnQeSIdB3zMylOziu/2D0ivrATt57OcbTmafOPJ/xLF1E
   E77gRxXkkIeZdbBo/zQ38RdLnHxiVT6KzESTVfZcy5aWuVPVOW7WpXeNL
   ti4blsyTfNSCdXJYCoSleyqB0+M/rfdUF5+VmCfojiVGjn4uEoIfEQi/c
   ObYj75OoVg5m2kYUrqe/pqJ+kCgYWo3nBjodLbmNWxxw5+oqlg+x7X7CI
   pK65ZbF/AcqBrGXJp0IhZIV/+fi/aXeqibNzHfGrmvxn5C3nwcQunJzeL
   ZHyzpt+MELUfpwWn58Z88tqkqru0+z0/nahEOj46zUah5W0qkyQaIRwlX
   g==;
X-CSE-ConnectionGUID: 8Hn54XJyQW+XYt7Q68vdWA==
X-CSE-MsgGUID: rIT/2kM6R2uYuG2Lo9GmtQ==
X-IronPort-AV: E=McAfee;i="6800,10657,11785"; a="79623268"
X-IronPort-AV: E=Sophos;i="6.23,233,1770624000";
   d="scan'208";a="79623268"
Received: from orviesa008.jf.intel.com ([10.64.159.148])
  by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 13 May 2026 13:33:48 -0700
X-CSE-ConnectionGUID: UIhc3xJlSWSRdmR/Hmdb5w==
X-CSE-MsgGUID: IbITKe/+TgC93+xn0le14Q==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.23,233,1770624000";
   d="scan'208";a="238076404"
Received: from b04f130c83f2.jf.intel.com ([10.165.154.98])
  by orviesa008.jf.intel.com with ESMTP; 13 May 2026 13:33:47 -0700
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chen Yu <yu.c.chen@intel.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Hillf Danton <hdanton@sina.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Jianyong Wu <jianyong.wu@outlook.com>,
	Yangyu Chen <cyy@cyyself.name>,
	Tingyin Duan <tingyin.duan@gmail.com>,
	Vern Hao <vernhao@tencent.com>,
	Vern Hao <haoxing990@gmail.com>,
	Len Brown <len.brown@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aubrey Li <aubrey.li@intel.com>,
	Zhao Liu <zhao1.liu@intel.com>,
	Chen Yu <yu.chen.surf@gmail.com>,
	Adam Li <adamli@os.amperecomputing.com>,
	Aaron Lu <ziqianlu@bytedance.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Josh Don <joshdon@google.com>,
	Gavin Guo <gavinguo@igalia.com>,
	Qais Yousef <qyousef@layalina.io>,
	Libo Chen <libchen@purestorage.com>,
	Luo Gengkun <luogengkun2@huawei.com>,
	linux-kernel@vger.kernel.org
Subject: [Patch v4 14/16] sched/cache: Fix has_multi_llcs iff at least one
 partition has multiple LLCs
Date: Wed, 13 May 2026 13:39:25 -0700
Message-Id: 
 <c541af2547d54509fbfd3b3a1e8072e2e5c7ff68.1778703694.git.tim.c.chen@linux.intel.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <cover.1778703694.git.tim.c.chen@linux.intel.com>
References: <cover.1778703694.git.tim.c.chen@linux.intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Chen Yu <yu.c.chen@intel.com>

sched_cache_present is a global static key, but build_sched_domains()
is called per partition from the "Build new domains" loop in
partition_sched_domains_locked(). Each call unconditionally sets the
key based solely on the has_multi_llcs local variable for that partition.

As a result, the call for the last partition determines the final
value of the key, even when earlier partitions have multiple LLCs.

If partition A (multi-LLC) is built first, the key is enabled. Then
when partition B (single-LLC) is built, the key is disabled. The
multi-LLC partition A is still active but the key is now off.

Fix it by following the approach used for sched_energy_present:
accumulate the multi-LLC result while iterating over all the
partitions, rather than deciding from a single partition.

This bug was reported by sashiko.

Fixes: d59f4fd1d303 ("sched/cache: Enable cache aware scheduling for=
 multi LLCs NUMA node")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
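Reviewer note: the difference between the old and the intended
behavior can be shown with a small user-space sketch. The array of
per-partition results and both function names are hypothetical, and
the OR-accumulation is an assumption about how the per-partition
values are combined.

```c
#include <stdbool.h>
#include <stddef.h>

/* Intended behavior: true if any partition reported multiple LLCs. */
static bool model_any_multi_llcs(const bool *part_multi_llcs, size_t nparts)
{
	bool any = false;
	size_t i;

	for (i = 0; i < nparts; i++)
		any |= part_multi_llcs[i];

	return any;
}

/* Pre-fix behavior, for comparison: the last partition built wins. */
static bool model_last_wins(const bool *part_multi_llcs, size_t nparts)
{
	return nparts ? part_multi_llcs[nparts - 1] : false;
}
```

For partitions A (multi-LLC) and B (single-LLC) built in that order,
the first function keeps the key on while the second turns it off.
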
 kernel/sched/topology.c | 69 +++++++++++++++++++++++++++++++----------
 1 file changed, 53 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 07f0a3d28253..4c5ea369d835 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -950,6 +950,7 @@ static void _sched_cache_active_set(void)
 	}
 }
=20
+/* used by debugfs */
 void sched_cache_active_set(void)
 {
 	cpus_read_lock();
@@ -999,12 +1000,27 @@ void sched_update_llc_bytes(unsigned int cpu)
 unlock:
 	sched_domains_mutex_unlock();
 }
+
+static void sched_cache_set(bool has_multi_llcs)
+{
+	/*
+	 * TBD: check before writing to it. sched domain rebuild
+	 * is not in the critical path, leave as-is for now.
+	 */
+	if (has_multi_llcs)
+		static_branch_enable_cpuslocked(&sched_cache_present);
+	else
+		static_branch_disable_cpuslocked(&sched_cache_present);
+
+	_sched_cache_active_set();
+}
 #else
 static bool alloc_sd_llc(const struct cpumask *cpu_map,
 			 struct s_data *d)
 {
 	return false;
 }
+static inline void sched_cache_set(bool has_multi_llcs) { }
 #endif
=20
 /*
@@ -2949,7 +2965,8 @@ void sched_domains_free_llc_id(int cpu)
  * to the individual CPUs
  */
 static int
-build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_att=
r *attr)
+build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_att=
r *attr,
+		    bool *multi_llcs)
 {
 	enum s_alloc alloc_state =3D sa_none;
 	bool has_multi_llcs =3D false;
@@ -3093,18 +3110,7 @@ build_sched_domains(const struct cpumask *cpu_map, s=
truct sched_domain_attr *att
=20
 	ret =3D 0;
 error:
-#ifdef CONFIG_SCHED_CACHE
-	/*
-	 * TBD: check before writing to it. sched domain rebuild
-	 * is not in the critical path, leave as-is for now.
-	 */
-	if (!ret && has_multi_llcs)
-		static_branch_enable_cpuslocked(&sched_cache_present);
-	else
-		static_branch_disable_cpuslocked(&sched_cache_present);
-
-	_sched_cache_active_set();
-#endif
+	*multi_llcs =3D !ret && has_multi_llcs;
 	__free_domain_allocs(&d, alloc_state, cpu_map);
=20
 	return ret;
@@ -3167,6 +3173,7 @@ void free_sched_domains(cpumask_var_t doms[], unsigne=
d int ndoms)
  */
 int __init sched_init_domains(const struct cpumask *cpu_map)
 {
+	bool multi_llcs;
 	int err;
=20
 	zalloc_cpumask_var(&sched_domains_llc_id_allocmask, GFP_KERNEL);
@@ -3181,7 +3188,9 @@ int __init sched_init_domains(const struct cpumask *c=
pu_map)
 	if (!doms_cur)
 		doms_cur =3D &fallback_doms;
 	cpumask_and(doms_cur[0], cpu_map, housekeeping_cpumask(HK_TYPE_DOMAIN));
-	err =3D build_sched_domains(doms_cur[0], NULL);
+	err =3D build_sched_domains(doms_cur[0], NULL, &multi_llcs);
+	if (!err)
+		sched_cache_set(multi_llcs);
=20
 	return err;
 }
@@ -3254,6 +3263,7 @@ static void partition_sched_domains_locked(int ndoms_=
new, cpumask_var_t doms_new
 				    struct sched_domain_attr *dattr_new)
 {
 	bool __maybe_unused has_eas =3D false;
+	bool has_multi_llcs =3D false, multi_llcs;
 	int i, j, n;
 	int new_topology;
=20
@@ -3303,14 +3313,41 @@ static void partition_sched_domains_locked(int ndom=
s_new, cpumask_var_t doms_new
 	for (i =3D 0; i < ndoms_new; i++) {
 		for (j =3D 0; j < n && !new_topology; j++) {
 			if (cpumask_equal(doms_new[i], doms_cur[j]) &&
-			    dattrs_equal(dattr_new, i, dattr_cur, j))
+			    dattrs_equal(dattr_new, i, dattr_cur, j)) {
+				/*
+				 * A reused partition must be accounted
+				 * for here. If reused partitions were
+				 * skipped and only newly built ones were
+				 * considered, has_multi_llcs could end up
+				 * wrong. For example:
+				 * If the only multi-LLC partition is reused
+				 * and a new single-LLC partition is built,
+				 * sched_cache_set(false) would disable
+				 * cache-aware scheduling globally even
+				 * though the reused multi-LLC partition
+				 * is still active.
+				 */
+				struct sched_domain *sd;
+				int cpu =3D cpumask_first(doms_cur[j]);
+
+				guard(rcu)();
+				sd =3D rcu_dereference(cpu_rq(cpu)->sd);
+				while (sd && sd->parent && (sd->parent->flags & SD_SHARE_LLC))
+					sd =3D sd->parent;
+				if (sd && (sd->flags & SD_SHARE_LLC) && sd->parent &&
+				    sd_in_multi_llcs(sd))
+					has_multi_llcs =3D true;
 				goto match2;
+			}
 		}
 		/* No match - add a new doms_new */
-		build_sched_domains(doms_new[i], dattr_new ? dattr_new + i : NULL);
+		build_sched_domains(doms_new[i], dattr_new ? dattr_new + i : NULL,
+				    &multi_llcs);
+		has_multi_llcs |=3D multi_llcs;
 match2:
 		;
 	}
+	sched_cache_set(has_multi_llcs);
=20
 #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
 	/* Build perf domains: */
--=20
2.32.0
From nobody Fri Jun 12 15:46:56 2026
Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id C9077360EED
	for <linux-kernel@vger.kernel.org>; Wed, 13 May 2026 20:33:53 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=198.175.65.17
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1778704435; cv=none;
 b=O5FNa5hCMA6Kwglh7mWRdiVljBWi+1VPZakMOodIz5vPHbFIsKl5OVR4g9NDrBieWfsmu57yunMMni0UDISLI1/T7hdm6jKJdg0/qwByrJiNw6LzkGNmmopP9TgmotsA9oFqLmolcAESMMbWBaFPi9b55TCcvoeeomaZ7zm9+JA=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1778704435; c=relaxed/simple;
	bh=t1FwK5H8eA5XceraosnpeVkGplWnliav65lNNIbDDfU=;
	h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References:
	 MIME-Version;
 b=qll82RvDi6W+m6eZBwh+Jw/k+zZZFZx4CHmwexgD9MvflV+bSViGh3Bo2qS+Pto41JI3yGuHJkiHjq2jMYcByxPm0ANWmP8tcbWnnING202ruEeOk0XQFe7qvY8wysDEoC4e8ebmc8P/BXutfVXFGuLhvgUQi9Wuc22EzGbm+p4=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linux.intel.com;
 spf=pass smtp.mailfrom=linux.intel.com;
 dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b=i0LZbcXM; arc=none smtp.client-ip=198.175.65.17
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linux.intel.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=linux.intel.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b="i0LZbcXM"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1778704434; x=1810240434;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=t1FwK5H8eA5XceraosnpeVkGplWnliav65lNNIbDDfU=;
  b=i0LZbcXMHI8g/H89E89jOLAe2N5P3uMI7eZV+dXc06+TfeHo+fdjN4i5
   GyjEJUr9XC7ZzRRRJc4wfSb8jA6qmm1Ptwx219yDHjs41fTv/8+EwbB4m
   ewbIsAmWSLEj6iLRDKOaL2PjWILspb//uIeqRoyBSuE1UlB3UmrC12osI
   Pa0oNGDxzUTe0Q9IqsEvMad/Jl30GxtJmj/vrxmUTVOh4MtvbitY9YulT
   NSG9kBkbYGHU0usp5mY+U2WoQ9O8xMQk761dI+EYlGe/POnS8UqH2CI53
   fcOOUvUvHxKmSEWEClgNKUHe5/BOo4YMoZLhfITMGrBev49mGahVHSIGR
   w==;
X-CSE-ConnectionGUID: nImrx8VQSEyknvWWV0vYhw==
X-CSE-MsgGUID: L7hL1qc7SWuUGXZdmoqH6g==
X-IronPort-AV: E=McAfee;i="6800,10657,11785"; a="79623292"
X-IronPort-AV: E=Sophos;i="6.23,233,1770624000";
   d="scan'208";a="79623292"
Received: from orviesa008.jf.intel.com ([10.64.159.148])
  by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 13 May 2026 13:33:49 -0700
X-CSE-ConnectionGUID: ww2XjdZVQwGP/xMBgK9hIw==
X-CSE-MsgGUID: 7sZq6untS8OBizL5JkqDVg==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.23,233,1770624000";
   d="scan'208";a="238076408"
Received: from b04f130c83f2.jf.intel.com ([10.165.154.98])
  by orviesa008.jf.intel.com with ESMTP; 13 May 2026 13:33:48 -0700
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chen Yu <yu.c.chen@intel.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Hillf Danton <hdanton@sina.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Jianyong Wu <jianyong.wu@outlook.com>,
	Yangyu Chen <cyy@cyyself.name>,
	Tingyin Duan <tingyin.duan@gmail.com>,
	Vern Hao <vernhao@tencent.com>,
	Vern Hao <haoxing990@gmail.com>,
	Len Brown <len.brown@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aubrey Li <aubrey.li@intel.com>,
	Zhao Liu <zhao1.liu@intel.com>,
	Chen Yu <yu.chen.surf@gmail.com>,
	Adam Li <adamli@os.amperecomputing.com>,
	Aaron Lu <ziqianlu@bytedance.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Josh Don <joshdon@google.com>,
	Gavin Guo <gavinguo@igalia.com>,
	Qais Yousef <qyousef@layalina.io>,
	Libo Chen <libchen@purestorage.com>,
	Luo Gengkun <luogengkun2@huawei.com>,
	linux-kernel@vger.kernel.org
Subject: [Patch v4 15/16] sched/cache: Fix possible overflow when invalidating
 the preferred CPU
Date: Wed, 13 May 2026 13:39:26 -0700
Message-Id: 
 <e5c5cd010da06387f40ab731a4d95389b92aeb82.1778703694.git.tim.c.chen@linux.intel.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <cover.1778703694.git.tim.c.chen@linux.intel.com>
References: <cover.1778703694.git.tim.c.chen@linux.intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Chen Yu <yu.c.chen@intel.com>

In account_mm_sched(), epoch comes from the local rq->cpu_epoch, but
mm->sc_stat.epoch is written by task_tick_cache() running on any CPU,
potentially a different CPU whose rq->cpu_epoch is further ahead. In
that case the unsigned subtraction underflows and wraps to a huge
number, so the timeout condition fires incorrectly.

Fix this by casting both the difference and the timeout to long, so that
a negative difference no longer exceeds the timeout.

Fixes: df0d98475954 ("sched/cache: Introduce infrastructure for cache-aware=
 load balancing")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8617cd3642c7..7e64cd18727e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1688,7 +1688,7 @@ void account_mm_sched(struct rq *rq, struct task_stru=
ct *p, s64 delta_exec)
 	 * If this process hasn't hit task_cache_work() for a while invalidate
 	 * its preferred state.
 	 */
-	if (epoch - READ_ONCE(mm->sc_stat.epoch) > llc_epoch_affinity_timeout ||
+	if ((long)(epoch - READ_ONCE(mm->sc_stat.epoch)) > (long)llc_epoch_affini=
ty_timeout ||
 	    invalid_llc_nr(mm, p, cpu_of(rq)) ||
 	    exceed_llc_capacity(mm, cpu_of(rq))) {
 		if (READ_ONCE(mm->sc_stat.cpu) !=3D -1)
--=20
2.32.0
From nobody Fri Jun 12 15:46:56 2026
Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 74A8E3955F3
	for <linux-kernel@vger.kernel.org>; Wed, 13 May 2026 20:33:55 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=198.175.65.17
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1778704443; cv=none;
 b=BJrRr0Xky1aNGzxswGos20DZEEHt281hTIYg9J2DRcJbOLoLZGwW+K9Yr/IYbYlZNegXqtpUZkFvcvMLtVP8lKCQ3qg2tgAO7sZMhiG2wIuq3grdTfmD/jJBgkiL/Ig2j+WazeI3V6SBWnhhU6zgvjPKTfDv/UOjfwPD0WOFE1Y=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1778704443; c=relaxed/simple;
	bh=LeKY5OSGAFQHBTheCebjoiSuXh/qiznyE2rafuiDcYw=;
	h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References:
	 MIME-Version;
 b=QhxHfofduPQAkoaV9niUxesWLiiUWdMLFVYMz7EKCKPcpbpxLR/3du0oHpgwXbZWTDQUVvNaGuVrhGfHBwajOXT/dM4+bnXg1lV9KQvVz46X9iCz1XbKQlJ74c3tk9nn7NJdhw4Zr7i6Uw4LzNxidT5nUvto8d2PKZM8TyaxTMM=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linux.intel.com;
 spf=pass smtp.mailfrom=linux.intel.com;
 dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b=csQXKvMv; arc=none smtp.client-ip=198.175.65.17
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linux.intel.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=linux.intel.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b="csQXKvMv"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1778704436; x=1810240436;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=LeKY5OSGAFQHBTheCebjoiSuXh/qiznyE2rafuiDcYw=;
  b=csQXKvMvmMuUz8TLTse9EX4mJtBJauaFn/g9ugXIwzCHdktO3dxSduVi
   hpSsvnpNd1LZEC3HDwa+iuMydO8BuQ6vMWrCF+XnFoKfQ8iDj/nl+1iFw
   /rquPd9Ywmy0JvBuq+U0E1bm24Ql76gENuxOzaKYl+mO1JuvQvjqMjpUz
   H3wSpfVnQJ14Hat7nmiGWt0TyNnkL2kKjWQGCLJizAxXL/jtXBMS4jzt5
   n/ou9f8sa8egayIgyUmohFpsofMr5hjSZCIiGoE6u5p0YqcaGZm+krebl
   WVoOi7R7G6jOajV3POEcNtX7ootZPYWRorIjCY+VCFoFvI+qg35tS/fpe
   g==;
X-CSE-ConnectionGUID: QK1cemjYRhOV+auh2VIN9w==
X-CSE-MsgGUID: aQCTKQ6rRsyk/RkU3jUmqA==
X-IronPort-AV: E=McAfee;i="6800,10657,11785"; a="79623313"
X-IronPort-AV: E=Sophos;i="6.23,233,1770624000";
   d="scan'208";a="79623313"
Received: from orviesa008.jf.intel.com ([10.64.159.148])
  by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 13 May 2026 13:33:50 -0700
X-CSE-ConnectionGUID: /TZAYCgTSkmniKeQefjO/Q==
X-CSE-MsgGUID: LdA4D9VaQPai4TkFq4gShQ==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.23,233,1770624000";
   d="scan'208";a="238076412"
Received: from b04f130c83f2.jf.intel.com ([10.165.154.98])
  by orviesa008.jf.intel.com with ESMTP; 13 May 2026 13:33:50 -0700
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chen Yu <yu.c.chen@intel.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Hillf Danton <hdanton@sina.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Jianyong Wu <jianyong.wu@outlook.com>,
	Yangyu Chen <cyy@cyyself.name>,
	Tingyin Duan <tingyin.duan@gmail.com>,
	Vern Hao <vernhao@tencent.com>,
	Vern Hao <haoxing990@gmail.com>,
	Len Brown <len.brown@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aubrey Li <aubrey.li@intel.com>,
	Zhao Liu <zhao1.liu@intel.com>,
	Chen Yu <yu.chen.surf@gmail.com>,
	Adam Li <adamli@os.amperecomputing.com>,
	Aaron Lu <ziqianlu@bytedance.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Josh Don <joshdon@google.com>,
	Gavin Guo <gavinguo@igalia.com>,
	Qais Yousef <qyousef@layalina.io>,
	Libo Chen <libchen@purestorage.com>,
	Luo Gengkun <luogengkun2@huawei.com>,
	linux-kernel@vger.kernel.org
Subject: [Patch v4 16/16] sched/cache: Fix stale preferred_llc for a new task
Date: Wed, 13 May 2026 13:39:27 -0700
Message-Id: 
 <0ec7309d0e24ede97656754d1505b7490403d966.1778703694.git.tim.c.chen@linux.intel.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <cover.1778703694.git.tim.c.chen@linux.intel.com>
References: <cover.1778703694.git.tim.c.chen@linux.intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Chen Yu <yu.c.chen@intel.com>

On fork without CLONE_VM, the child gets a new mm, so the preferred_llc
value it inherits from the parent is stale.

Fix this by resetting the new task's preferred_llc to -1 in
init_sched_mm().

This bug was reported by sashiko.

Fixes: 47d8696b95f7 ("sched/cache: Assign preferred LLC ID to processes")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e64cd18727e..73da6f8fc9ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1914,6 +1914,11 @@ void init_sched_mm(struct task_struct *p)
=20
 	init_task_work(work, task_cache_work);
 	work->next =3D work;
+	/*
+	 * Reset new task's preference to avoid
+	 * polluting account_llc_enqueue().
+	 */
+	p->preferred_llc =3D -1;
 }
=20
 #else /* CONFIG_SCHED_CACHE */
--=20
2.32.0