From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy", Vincent Guittot
Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Tim Chen, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 01/21] sched/cache: Introduce infrastructure for cache-aware load balancing
Date: Tue, 10 Feb 2026 14:18:41 -0800
Message-Id: <6ec6eee6e1c620c0cfb9f56923f8bfbb71c31a75.1770760558.git.tim.c.chen@linux.intel.com>

From: "Peter Zijlstra (Intel)"

Add infrastructure to enable cache-aware load balancing, which improves
cache locality by grouping tasks that share resources within the same
cache domain. This reduces cache misses and improves overall data
access efficiency.

In this initial implementation, threads belonging to the same process
are treated as entities that likely share working sets. The mechanism
tracks per-process CPU occupancy across cache domains and attempts to
migrate threads toward cache-hot domains where their process already
has active threads, thereby enhancing locality. This provides a basic
model for cache affinity. While the current code targets the last-level
cache (LLC), the approach could be extended to other domain types such
as clusters (L2) or node-internal groupings.

At present, the mechanism selects the CPU within an LLC that has the
highest recent runtime. Subsequent patches in this series will use this
information in the load-balancing path to guide task placement toward
preferred LLCs.
In the future, more advanced policies could be integrated through NUMA
balancing, for example migrating a task to its preferred LLC when spare
capacity exists, or swapping tasks across LLCs to improve cache
affinity. The grouping of tasks could also be generalized from a
process to a NUMA group, or be made user configurable.

Originally-by: Peter Zijlstra (Intel)
Signed-off-by: Tim Chen
Signed-off-by: Chen Yu
---
Notes:
    v2->v3:
    Fix the wrap in epoch for time comparison of mm->mm_sched_epoch.
    (Peter Zijlstra)

    Remove __no_profile tag. (Peter Zijlstra)

    Introduce a new structure named sched_cache_stat to save the
    statistics of cache aware scheduling, similar to mm_mm_cid.
    (Peter Zijlstra)

 include/linux/mm_types.h |  32 +++++
 include/linux/sched.h    |  24 ++++
 init/Kconfig             |  11 ++
 kernel/fork.c            |   6 +
 kernel/sched/core.c      |   6 +
 kernel/sched/fair.c      | 265 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h     |  14 +++
 7 files changed, 358 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 42af2292951d..777a48523aa6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1125,6 +1125,8 @@ struct mm_struct {
 		/* MM CID related storage */
 		struct mm_mm_cid mm_cid;

+		/* sched_cache related statistics */
+		struct sched_cache_stat sc_stat;
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* size of all page tables */
 #endif
@@ -1519,6 +1521,36 @@ static inline unsigned int mm_cid_size(void)
 }
 #endif /* CONFIG_SCHED_MM_CID */

+#ifdef CONFIG_SCHED_CACHE
+void mm_init_sched(struct mm_struct *mm,
+		   struct sched_cache_time __percpu *pcpu_sched);
+
+static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
+{
+	struct sched_cache_time __percpu *pcpu_sched =
+		alloc_percpu_noprof(struct sched_cache_time);
+
+	if (!pcpu_sched)
+		return -ENOMEM;
+
+	mm_init_sched(mm, pcpu_sched);
+	return 0;
+}
+
+#define mm_alloc_sched(...)	\
	alloc_hooks(mm_alloc_sched_noprof(__VA_ARGS__))
+
+static inline void mm_destroy_sched(struct mm_struct *mm)
+{
+	free_percpu(mm->sc_stat.pcpu_sched);
+	mm->sc_stat.pcpu_sched = NULL;
+}
+#else /* !CONFIG_SCHED_CACHE */
+
+static inline int mm_alloc_sched(struct mm_struct *mm) { return 0; }
+static inline void mm_destroy_sched(struct mm_struct *mm) { }
+
+#endif /* CONFIG_SCHED_CACHE */
+
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d395f2810fac..2817a21ee055 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1409,6 +1409,10 @@ struct task_struct {
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */

+#ifdef CONFIG_SCHED_CACHE
+	struct callback_head		cache_work;
+#endif
+
 	struct rseq_data		rseq;
 	struct sched_mm_cid		mm_cid;

@@ -2330,6 +2334,26 @@ static __always_inline int task_mm_cid(struct task_struct *t)
 }
 #endif

+#ifdef CONFIG_SCHED_CACHE
+
+struct sched_cache_time {
+	u64		runtime;
+	unsigned long	epoch;
+};
+
+struct sched_cache_stat {
+	struct sched_cache_time __percpu *pcpu_sched;
+	raw_spinlock_t	lock;
+	unsigned long	epoch;
+	int		cpu;
+} ____cacheline_aligned_in_smp;
+
+#else
+
+struct sched_cache_stat { };
+
+#endif
+
 #ifndef MODULE
 #ifndef COMPILE_OFFSETS

diff --git a/init/Kconfig b/init/Kconfig
index fa79feb8fe57..f4b2649f8401 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -990,6 +990,17 @@ config NUMA_BALANCING

 	  This system will be inactive on UMA systems.

+config SCHED_CACHE
+	bool "Cache aware load balance"
+	default y
+	depends on SMP
+	help
+	  When enabled, the scheduler will attempt to aggregate tasks from
+	  the same process onto a single Last Level Cache (LLC) domain when
+	  possible.
+	  This improves cache locality by keeping tasks that share
+	  resources within the same cache domain, reducing cache misses and
+	  lowering data access latency.
+
 config NUMA_BALANCING_DEFAULT_ENABLED
 	bool "Automatically enable NUMA aware memory/task placement"
 	default y
diff --git a/kernel/fork.c b/kernel/fork.c
index b1f3915d5f8e..2a49c49f29f9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -723,6 +723,7 @@ void __mmdrop(struct mm_struct *mm)
 	cleanup_lazy_tlbs(mm);

 	WARN_ON_ONCE(mm == current->active_mm);
+	mm_destroy_sched(mm);
 	mm_free_pgd(mm);
 	mm_free_id(mm);
 	destroy_context(mm);
@@ -1123,6 +1124,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (mm_alloc_cid(mm, p))
 		goto fail_cid;

+	if (mm_alloc_sched(mm))
+		goto fail_sched;
+
 	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
 				     NR_MM_COUNTERS))
 		goto fail_pcpu;
@@ -1132,6 +1136,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	return mm;

 fail_pcpu:
+	mm_destroy_sched(mm);
+fail_sched:
 	mm_destroy_cid(mm);
 fail_cid:
 	destroy_context(mm);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 41ba0be16911..c6efa71cf500 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4412,6 +4412,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
 	init_numa_balancing(clone_flags, p);
 	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 	p->migration_pending = NULL;
+	init_sched_mm(p);
 }

 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -8691,6 +8692,11 @@ void __init sched_init(void)

 		rq->core_cookie = 0UL;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+		raw_spin_lock_init(&rq->cpu_epoch_lock);
+		rq->cpu_epoch_next = jiffies;
+#endif
+
 		zalloc_cpumask_var_node(&rq->scratch_mask, GFP_KERNEL, cpu_to_node(i));
 	}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..58286275e166 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1136,6 +1136,8 @@ void
post_init_entity_util_avg(struct task_struct *p)
 	sa->runnable_avg = sa->util_avg;
 }

+static inline void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec);
+
 static s64 update_se(struct rq *rq, struct sched_entity *se)
 {
 	u64 now = rq_clock_task(rq);
@@ -1158,6 +1160,7 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)

 	trace_sched_stat_runtime(running, delta_exec);
 	account_group_exec_runtime(running, delta_exec);
+	account_mm_sched(rq, running, delta_exec);

 	/* cgroup time is always accounted against the donor */
 	cgroup_account_cputime(donor, delta_exec);
@@ -1179,6 +1182,266 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)

 static void set_next_buddy(struct sched_entity *se);

+#ifdef CONFIG_SCHED_CACHE
+
+/*
+ * XXX numbers come from a place the sun don't shine -- probably wants to be SD
+ * tunable or so.
+ */
+#define EPOCH_PERIOD			(HZ / 100)	/* 10 ms */
+#define EPOCH_LLC_AFFINITY_TIMEOUT	5		/* 50 ms */
+
+static int llc_id(int cpu)
+{
+	if (cpu < 0)
+		return -1;
+
+	return per_cpu(sd_llc_id, cpu);
+}
+
+void mm_init_sched(struct mm_struct *mm,
+		   struct sched_cache_time __percpu *_pcpu_sched)
+{
+	unsigned long epoch;
+	int i;
+
+	for_each_possible_cpu(i) {
+		struct sched_cache_time *pcpu_sched = per_cpu_ptr(_pcpu_sched, i);
+		struct rq *rq = cpu_rq(i);
+
+		pcpu_sched->runtime = 0;
+		pcpu_sched->epoch = rq->cpu_epoch;
+		epoch = rq->cpu_epoch;
+	}
+
+	raw_spin_lock_init(&mm->sc_stat.lock);
+	mm->sc_stat.epoch = epoch;
+	mm->sc_stat.cpu = -1;
+
+	/*
+	 * The update to mm->sc_stat should not be reordered
+	 * before the initialization of mm's other fields, in case
+	 * the readers may get invalid mm_sched_epoch, etc.
+	 */
+	smp_store_release(&mm->sc_stat.pcpu_sched, _pcpu_sched);
+}
+
+/* because why would C be fully specified */
+static __always_inline void __shr_u64(u64 *val, unsigned int n)
+{
+	if (n >= 64) {
+		*val = 0;
+		return;
+	}
+	*val >>= n;
+}
+
+static inline void __update_mm_sched(struct rq *rq,
+				     struct sched_cache_time *pcpu_sched)
+{
+	lockdep_assert_held(&rq->cpu_epoch_lock);
+
+	unsigned long n, now = jiffies;
+	long delta = now - rq->cpu_epoch_next;
+
+	if (delta > 0) {
+		n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
+		rq->cpu_epoch += n;
+		rq->cpu_epoch_next += n * EPOCH_PERIOD;
+		__shr_u64(&rq->cpu_runtime, n);
+	}
+
+	n = rq->cpu_epoch - pcpu_sched->epoch;
+	if (n) {
+		pcpu_sched->epoch += n;
+		__shr_u64(&pcpu_sched->runtime, n);
+	}
+}
+
+static unsigned long fraction_mm_sched(struct rq *rq,
+				       struct sched_cache_time *pcpu_sched)
+{
+	guard(raw_spinlock_irqsave)(&rq->cpu_epoch_lock);
+
+	__update_mm_sched(rq, pcpu_sched);
+
+	/*
+	 * Runtime is a geometric series (r=0.5) and as such will sum to twice
+	 * the accumulation period; this means the multiplication here should
+	 * not overflow.
+	 */
+	return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1);
+}
+
+static inline
+void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
+{
+	struct sched_cache_time *pcpu_sched;
+	struct mm_struct *mm = p->mm;
+	unsigned long epoch;
+
+	if (!sched_cache_enabled())
+		return;
+
+	if (p->sched_class != &fair_sched_class)
+		return;
+	/*
+	 * init_task, kthreads and user threads created
+	 * by user_mode_thread() don't have an mm.
+	 */
+	if (!mm || !mm->sc_stat.pcpu_sched)
+		return;
+
+	pcpu_sched = per_cpu_ptr(p->mm->sc_stat.pcpu_sched, cpu_of(rq));
+
+	scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
+		__update_mm_sched(rq, pcpu_sched);
+		pcpu_sched->runtime += delta_exec;
+		rq->cpu_runtime += delta_exec;
+		epoch = rq->cpu_epoch;
+	}
+
+	/*
+	 * If this process hasn't hit task_cache_work() for a while, or it
+	 * has only 1 thread, invalidate its preferred state.
+	 */
+	if (time_after(epoch,
+		       READ_ONCE(mm->sc_stat.epoch) + EPOCH_LLC_AFFINITY_TIMEOUT) ||
+	    get_nr_threads(p) <= 1) {
+		if (mm->sc_stat.cpu != -1)
+			mm->sc_stat.cpu = -1;
+	}
+}
+
+static void task_tick_cache(struct rq *rq, struct task_struct *p)
+{
+	struct callback_head *work = &p->cache_work;
+	struct mm_struct *mm = p->mm;
+	unsigned long epoch;
+
+	if (!sched_cache_enabled())
+		return;
+
+	if (!mm || !mm->sc_stat.pcpu_sched)
+		return;
+
+	epoch = rq->cpu_epoch;
+	/* avoid moving backwards */
+	if (time_after_eq(mm->sc_stat.epoch, epoch))
+		return;
+
+	guard(raw_spinlock)(&mm->sc_stat.lock);
+
+	if (work->next == work) {
+		task_work_add(p, work, TWA_RESUME);
+		WRITE_ONCE(mm->sc_stat.epoch, epoch);
+	}
+}
+
+static void task_cache_work(struct callback_head *work)
+{
+	struct task_struct *p = current;
+	struct mm_struct *mm = p->mm;
+	unsigned long m_a_occ = 0;
+	unsigned long curr_m_a_occ = 0;
+	int cpu, m_a_cpu = -1;
+	cpumask_var_t cpus;
+
+	WARN_ON_ONCE(work != &p->cache_work);
+
+	work->next = work;
+
+	if (p->flags & PF_EXITING)
+		return;
+
+	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
+		return;
+
+	scoped_guard (cpus_read_lock) {
+		cpumask_copy(cpus, cpu_online_mask);
+
+		for_each_cpu(cpu, cpus) {
+			/* XXX sched_cluster_active */
+			struct sched_domain *sd = per_cpu(sd_llc, cpu);
+			unsigned long occ, m_occ = 0, a_occ = 0;
+			int m_cpu = -1, i;
+
+			if (!sd)
+				continue;
+
+			for_each_cpu(i, sched_domain_span(sd)) {
+				occ = fraction_mm_sched(cpu_rq(i),
							per_cpu_ptr(mm->sc_stat.pcpu_sched, i));
+				a_occ += occ;
+				if (occ > m_occ) {
+					m_occ = occ;
+					m_cpu = i;
+				}
+			}
+
+			/*
+			 * Compare the accumulated occupancy of each LLC. The
+			 * reason for using accumulated occupancy rather than
+			 * average per-CPU occupancy is that it works better in
+			 * asymmetric LLC scenarios.
+			 * For example, if there are 2 threads in a 4-CPU LLC
+			 * and 3 threads in an 8-CPU LLC, it might be better to
+			 * choose the one with 3 threads. However, this would
+			 * not be the case if the occupancy were divided by the
+			 * number of CPUs in an LLC (i.e., if average per-CPU
+			 * occupancy were used).
+			 * Besides, NUMA balancing fault statistics behave
+			 * similarly: the total number of faults per node is
+			 * compared rather than the average number of faults
+			 * per CPU. The same strategy is followed here.
+			 */
+			if (a_occ > m_a_occ) {
+				m_a_occ = a_occ;
+				m_a_cpu = m_cpu;
+			}
+
+			if (llc_id(cpu) == llc_id(mm->sc_stat.cpu))
+				curr_m_a_occ = a_occ;
+
+			cpumask_andnot(cpus, cpus, sched_domain_span(sd));
+		}
+	}
+
+	if (m_a_occ > (2 * curr_m_a_occ)) {
+		/*
+		 * Avoid switching sc_stat.cpu too fast.
+		 * 2X was chosen because:
+		 * 1. It is better to keep the preferred LLC stable, rather
+		 *    than changing it frequently and causing migrations.
+		 * 2. 2X means the new preferred LLC has at least one more
+		 *    busy CPU than the old one (e.g. 200% vs 100%).
+		 * 3. 2X is chosen based on test results, as it delivers
+		 *    the optimal performance gain so far.
+		 */
+		mm->sc_stat.cpu = m_a_cpu;
+	}
+
+	free_cpumask_var(cpus);
+}
+
+void init_sched_mm(struct task_struct *p)
+{
+	struct callback_head *work = &p->cache_work;
+
+	init_task_work(work, task_cache_work);
+	work->next = work;
+}
+
+#else
+
+static inline void account_mm_sched(struct rq *rq, struct task_struct *p,
+				    s64 delta_exec) { }
+
+void init_sched_mm(struct task_struct *p) { }
+
+static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
+
+#endif
+
 /*
  * Used by other classes to account runtime.
  */
@@ -13377,6 +13640,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);

+	task_tick_cache(rq, curr);
+
 	update_misfit_status(curr, rq);
 	check_update_overutilized_status(task_rq(curr));

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d30cca6870f5..de5b701c3950 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1196,6 +1196,12 @@ struct rq {
 	u64			clock_pelt_idle_copy;
 	u64			clock_idle_copy;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	raw_spinlock_t		cpu_epoch_lock ____cacheline_aligned;
+	u64			cpu_runtime;
+	unsigned long		cpu_epoch;
+	unsigned long		cpu_epoch_next;
+#endif

 	atomic_t		nr_iowait;

@@ -3890,6 +3896,14 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct *next)
 static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct *next) { }
 #endif /* !CONFIG_SCHED_MM_CID */

+#ifdef CONFIG_SCHED_CACHE
+static inline bool sched_cache_enabled(void)
+{
+	return false;
+}
+#endif
+extern void init_sched_mm(struct task_struct *p);
+
 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
 extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
 static inline
--
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy", Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Tim Chen, Aubrey Li, Zhao Liu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 02/21] sched/cache: Record per LLC utilization to guide cache aware scheduling decisions
Date: Tue, 10 Feb 2026 14:18:42 -0800
Message-Id: <93f0a3958e2398e8b4a05c15cb89f0fd759c5ac9.1770760558.git.tim.c.chen@linux.intel.com>

From: Chen Yu

When a system becomes busy and a process's preferred LLC is saturated
with too many threads, tasks within that LLC
migrate frequently. These in-LLC migrations introduce latency and
degrade performance. To avoid this, task aggregation should be
suppressed when the preferred LLC is overloaded, which requires a
metric to indicate LLC utilization.

Record per-LLC utilization and CPU capacity during periodic load
balancing. These statistics will be used in later patches to decide
whether tasks should be aggregated into their preferred LLC.

Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
Signed-off-by: Chen Yu
---
Notes:
    v2->v3:
    Remove ____cacheline_aligned_in_smp attribute in struct
    sched_domain_shared to avoid premature optimization. (Peter Zijlstra)

 include/linux/sched/topology.h |  4 ++
 kernel/sched/fair.c            | 70 ++++++++++++++++++++++++++++++++++
 2 files changed, 74 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 45c0022b91ce..a4e2fb31f2fd 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -68,6 +68,10 @@ struct sched_domain_shared {
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
 	int		nr_idle_scan;
+#ifdef CONFIG_SCHED_CACHE
+	unsigned long	util_avg;
+	unsigned long	capacity;
+#endif
 };

 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 58286275e166..dfeb107f2cfd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9688,6 +9688,29 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_cpu)
 	return 0;
 }

+#ifdef CONFIG_SCHED_CACHE
+/* Called from load balancing paths with rcu_read_lock held */
+static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
+					 unsigned long *cap)
+{
+	struct sched_domain_shared *sd_share;
+
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (!sd_share)
+		return false;
+
+	*util = READ_ONCE(sd_share->util_avg);
+	*cap = READ_ONCE(sd_share->capacity);
+
+	return true;
+}
+#else
+static inline bool get_llc_stats(int cpu, unsigned long *util,
+				 unsigned long *cap)
+{
+	return false;
+}
+#endif
+
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
@@ -10658,6 +10681,52 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 	return check_cpu_capacity(rq, sd);
 }

+#ifdef CONFIG_SCHED_CACHE
+/*
+ * Record the statistics of this scheduler group for later
+ * use. These values guide load balancing on aggregating tasks
+ * to an LLC.
+ */
+static void record_sg_llc_stats(struct lb_env *env,
+				struct sg_lb_stats *sgs,
+				struct sched_group *group)
+{
+	struct sched_domain_shared *sd_share;
+
+	if (!sched_cache_enabled() || env->idle == CPU_NEWLY_IDLE)
+		return;
+
+	/* Only care about sched domains spanning multiple LLCs */
+	if (env->sd->child != rcu_dereference(per_cpu(sd_llc, env->dst_cpu)))
+		return;
+
+	/*
+	 * At this point we know this group spans an LLC domain.
+	 * Record the statistics of this group in its corresponding
+	 * shared LLC domain.
+	 * Note: sd_share cannot be obtained via sd->child->shared,
+	 * because the latter refers to the domain that covers the
+	 * local group. Instead, sd_share should be located using
+	 * the first CPU of the LLC group.
+	 */
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared,
+				   cpumask_first(sched_group_span(group))));
+	if (!sd_share)
+		return;
+
+	if (READ_ONCE(sd_share->util_avg) != sgs->group_util)
+		WRITE_ONCE(sd_share->util_avg, sgs->group_util);
+
+	if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
+		WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
+}
+#else
+static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_stats *sgs,
+				       struct sched_group *group)
+{
+}
+#endif
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -10747,6 +10816,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,

 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);

+	record_sg_llc_stats(env, sgs, group);
 	/* Computing avg_load makes sense only when group is overloaded */
 	if (sgs->group_type == group_overloaded)
 		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
--
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy", Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Tim Chen, Aubrey Li, Zhao Liu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 03/21] sched/cache: Introduce helper functions to enforce LLC migration policy
Date: Tue, 10 Feb 2026 14:18:43 -0800
Message-Id: <7475922f6020abe5d458a136b0c88fe24e823091.1770760558.git.tim.c.chen@linux.intel.com>

From: Chen Yu

Cache-aware scheduling aggregates threads onto their preferred LLC,
mainly through load balancing. When the preferred LLC becomes
saturated, more threads are still placed there, increasing latency. A
mechanism is needed to limit aggregation so that the preferred LLC does
not become overloaded.

Introduce the helper functions can_migrate_llc() and
can_migrate_llc_task() to enforce the LLC migration policy:

1. Aggregate a task to its preferred LLC if both the source and
   destination LLCs are not too busy, or if doing so will not leave the
   preferred LLC much more imbalanced than the non-preferred one (>20%
   utilization difference, a little higher than the imbalance_pct (17%)
   of the LLC domain, as hysteresis).

2. Allow moving a task from an overloaded preferred LLC to a
   non-preferred LLC if this will not make the non-preferred LLC
   imbalanced enough to cause a later migration back.

3. If both LLCs are too busy, let generic load balancing spread the
   tasks.
Further (hysteresis) action could be taken in the future to prevent tasks
from being migrated into and out of the preferred LLC frequently (back
and forth): the threshold for migrating a task out of its preferred LLC
should be higher than that for migrating it into the LLC.

Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
Signed-off-by: Chen Yu
---
Notes:
    v2->v3: No change.

 kernel/sched/fair.c | 153 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 153 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dfeb107f2cfd..bf5f39a01017 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9689,6 +9689,27 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
 }
 
 #ifdef CONFIG_SCHED_CACHE
+/*
+ * The margin used when comparing LLC utilization with CPU capacity.
+ * It determines the LLC load level where active LLC aggregation is
+ * done.
+ * Derived from fits_capacity().
+ *
+ * (default: ~50%)
+ */
+#define fits_llc_capacity(util, max)	\
+	((util) * 2 < (max))
+
+/*
+ * The margin used when comparing utilization:
+ * is 'util1' noticeably greater than 'util2'?
+ * Derived from capacity_greater().
+ * Bias is in percentage.
+ */
+/* Allows dst util to be bigger than src util by up to bias percent */
+#define util_greater(util1, util2)	\
+	((util1) * 100 > (util2) * 120)
+
 /* Called from load balancing paths with rcu_read_lock held */
 static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
 					 unsigned long *cap)
@@ -9704,6 +9725,138 @@ static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
 
 	return true;
 }
+
+/*
+ * Decision matrix according to the LLC utilization, used to
+ * decide whether we can do task aggregation across LLCs.
+ *
+ * By default, 50% is the threshold for treating the LLC
+ * as busy. The reason for choosing 50% is to avoid saturation
+ * of SMT-2, and it is also a safe cutoff for other SMT-n
+ * platforms.
+ *
+ * 20% is the utilization imbalance percentage used to decide
+ * if the preferred LLC is busier than the non-preferred LLC.
+ * 20 is a little higher than the LLC domain's imbalance_pct of
+ * 17. The hysteresis is used to avoid tasks bouncing between the
+ * preferred LLC and the non-preferred LLC.
+ *
+ * 1. moving towards the preferred LLC, dst is the preferred
+ *    LLC, src is not:
+ *
+ *    src \ dst    30%    40%    50%    60%
+ *    30%           Y      Y      Y      N
+ *    40%           Y      Y      Y      Y
+ *    50%           Y      Y      G      G
+ *    60%           Y      Y      G      G
+ *
+ * 2. moving out of the preferred LLC, src is the preferred
+ *    LLC, dst is not:
+ *
+ *    src \ dst    30%    40%    50%    60%
+ *    30%           N      N      N      N
+ *    40%           N      N      N      N
+ *    50%           N      N      G      G
+ *    60%           Y      N      G      G
+ *
+ * src : src_util
+ * dst : dst_util
+ * Y   : Yes, migrate
+ * N   : No, do not migrate
+ * G   : let the Generic load balancer even out the load
+ *
+ * The intention is that if both LLCs are quite busy, cache-aware
+ * load balancing should not be performed, and generic load balancing
+ * should take effect. However, if one is busy and the other is not,
+ * the preferred LLC capacity (50%) and imbalance criteria (20%) are
+ * considered to determine whether LLC aggregation should be
+ * performed to bias the load towards the preferred LLC.
+ */
+
+/* migration decision, the 3 states are orthogonal. */
+enum llc_mig {
+	mig_forbid = 0,		/* N: Don't migrate task, respect LLC preference */
+	mig_llc,		/* Y: Do LLC preference based migration */
+	mig_unrestricted	/* G: Don't restrict generic load balance migration */
+};
+
+/*
+ * Check if a task can be moved from the source LLC to the
+ * destination LLC without breaking cache-aware preference.
+ * src_cpu and dst_cpu are arbitrary CPUs within the source
+ * and destination LLCs, respectively.
+ */
+static enum llc_mig can_migrate_llc(int src_cpu, int dst_cpu,
+				    unsigned long tsk_util,
+				    bool to_pref)
+{
+	unsigned long src_util, dst_util, src_cap, dst_cap;
+
+	if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
+	    !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
+		return mig_unrestricted;
+
+	if (!fits_llc_capacity(dst_util, dst_cap) &&
+	    !fits_llc_capacity(src_util, src_cap))
+		return mig_unrestricted;
+
+	src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
+	dst_util = dst_util + tsk_util;
+	if (to_pref) {
+		/*
+		 * Don't migrate if we would get the preferred LLC
+		 * too heavily loaded and the dest is much busier
+		 * than the src, in which case migration would
+		 * increase the imbalance too much.
+		 */
+		if (!fits_llc_capacity(dst_util, dst_cap) &&
+		    util_greater(dst_util, src_util))
+			return mig_forbid;
+	} else {
+		/*
+		 * Don't migrate if we would leave the preferred LLC
+		 * too idle, or if this migration would leave the
+		 * non-preferred LLC within sysctl_aggr_imb percent
+		 * of the preferred LLC, leading to a migration back
+		 * to the preferred LLC again.
+		 */
+		if (fits_llc_capacity(src_util, src_cap) ||
+		    !util_greater(src_util, dst_util))
+			return mig_forbid;
+	}
+	return mig_llc;
+}
+
+/*
+ * Check if task p can migrate from the source LLC to the
+ * destination LLC in terms of cache-aware load balancing.
+ */
+static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
+							struct task_struct *p)
+{
+	struct mm_struct *mm;
+	bool to_pref;
+	int cpu;
+
+	mm = p->mm;
+	if (!mm)
+		return mig_unrestricted;
+
+	cpu = mm->sc_stat.cpu;
+	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
+		return mig_unrestricted;
+
+	if (cpus_share_cache(dst_cpu, cpu))
+		to_pref = true;
+	else if (cpus_share_cache(src_cpu, cpu))
+		to_pref = false;
+	else
+		return mig_unrestricted;
+
+	return can_migrate_llc(src_cpu, dst_cpu,
+			       task_util(p), to_pref);
+}
+
 #else
 static inline bool get_llc_stats(int cpu, unsigned long *util,
 				 unsigned long *cap)
-- 
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R .
Shenoy" , Vincent Guittot
Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , linux-kernel@vger.kernel.org
Subject: [PATCH v3 04/21] sched/cache: Make LLC id continuous
Date: Tue, 10 Feb 2026 14:18:44 -0800
Message-Id: <60a05a3f50d14a7bf3b968f62cca87893c5c552c.1770760558.git.tim.c.chen@linux.intel.com>

From: Chen Yu

Introduce an index mapping between CPUs and their LLCs. This provides
a continuous per-LLC index needed for cache-aware load balancing in
later patches.

The existing per_cpu llc_id usually points to the first CPU of the LLC
domain, which is sparse and unsuitable as an array index; using llc_id
directly would waste memory. With the new mapping, CPUs in the same LLC
share a continuous id:

    per_cpu(llc_id, CPU=0...15)  = 0
    per_cpu(llc_id, CPU=16...31) = 1
    per_cpu(llc_id, CPU=32...47) = 2
    ...

Once a CPU has been assigned an llc_id, this ID persists even when the
CPU is taken offline and brought back online, which simplifies
management of the IDs.

Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
Co-developed-by: K Prateek Nayak
Signed-off-by: K Prateek Nayak
Signed-off-by: Chen Yu
---
Notes:
    v2->v3: Allocate the LLC id according to the topology level data
    directly, rather than calculating it from the sched domain. This
    simplifies the code.
    (Peter Zijlstra, K Prateek Nayak)

 kernel/sched/topology.c | 47 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 44 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index cf643a5ddedd..ca46b5cf7f78 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -20,6 +20,7 @@ void sched_domains_mutex_unlock(void)
 /* Protected by sched_domains_mutex: */
 static cpumask_var_t sched_domains_tmpmask;
 static cpumask_var_t sched_domains_tmpmask2;
+static int tl_max_llcs;
 
 static int __init sched_debug_setup(char *str)
 {
@@ -658,7 +659,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
  */
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
-DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU(int, sd_llc_id) = -1;
 DEFINE_PER_CPU(int, sd_share_id);
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
@@ -684,7 +685,6 @@ static void update_top_cache_domain(int cpu)
 
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_size, cpu) = size;
-	per_cpu(sd_llc_id, cpu) = id;
 	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
 	sd = lowest_flag_domain(cpu, SD_CLUSTER);
@@ -2567,10 +2567,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 
 	/* Set up domains for CPUs specified by the cpu_map: */
 	for_each_cpu(i, cpu_map) {
-		struct sched_domain_topology_level *tl;
+		struct sched_domain_topology_level *tl, *tl_llc = NULL;
+		int lid;
 
 		sd = NULL;
 		for_each_sd_topology(tl) {
+			int flags = 0;
+
+			if (tl->sd_flags)
+				flags = (*tl->sd_flags)();
+
+			if (flags & SD_SHARE_LLC)
+				tl_llc = tl;
 
 			sd = build_sched_domain(tl, cpu_map, attr, sd, i);
 
@@ -2581,6 +2589,39 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 			if (cpumask_equal(cpu_map, sched_domain_span(sd)))
 				break;
 		}
+
+		lid = per_cpu(sd_llc_id, i);
+
+		if (lid == -1) {
+			int j;
+
+			/*
+			 * Assign the llc_id to the CPUs that do not
+			 * have an LLC.
+			 */
+			if (!tl_llc) {
+				per_cpu(sd_llc_id, i) = tl_max_llcs++;
+
+				continue;
+			}
+
+			/* try to reuse the llc_id of its siblings */
+			for_each_cpu(j, tl_llc->mask(tl_llc, i)) {
+				if (i == j)
+					continue;
+
+				lid = per_cpu(sd_llc_id, j);
+
+				if (lid != -1) {
+					per_cpu(sd_llc_id, i) = lid;
+
+					break;
+				}
+			}
+
+			/* a new LLC is detected */
+			if (lid == -1)
+				per_cpu(sd_llc_id, i) = tl_max_llcs++;
+		}
 	}
 
 	if (WARN_ON(!topology_span_sane(cpu_map)))
-- 
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R .
Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , linux-kernel@vger.kernel.org Subject: [PATCH v3 05/21] sched/cache: Assign preferred LLC ID to processes Date: Tue, 10 Feb 2026 14:18:45 -0800 Message-Id: <4a92b93edb669845e3bdca24c3ae3354b317c3eb.1770760558.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" With cache-aware scheduling enabled, each task is assigned a preferred LLC ID. This allows quick identification of the LLC domain where the task prefers to run, similar to numa_preferred_nid in NUMA balancing. Signed-off-by: Tim Chen --- Notes: v2->v3: Add comments around code handling NUMA balance conflict with cache aware scheduling. 
    (Peter Zijlstra)

    Check if NUMA balancing is disabled before checking numa_preferred_nid
    (Jianyong Wu)

 include/linux/sched.h |  1 +
 init/init_task.c      |  3 +++
 kernel/sched/fair.c   | 42 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2817a21ee055..c98bd1c46088 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1411,6 +1411,7 @@ struct task_struct {
 
 #ifdef CONFIG_SCHED_CACHE
 	struct callback_head		cache_work;
+	int				preferred_llc;
 #endif
 
 	struct rseq_data		rseq;
diff --git a/init/init_task.c b/init/init_task.c
index 49b13d7c3985..baa420de2644 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -218,6 +218,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.numa_group	= NULL,
 	.numa_faults	= NULL,
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	.preferred_llc	= -1,
+#endif
 #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
 	.kasan_depth	= 1,
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf5f39a01017..0b4ed0f2809d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1273,11 +1273,43 @@ static unsigned long fraction_mm_sched(struct rq *rq,
 	return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1);
 }
 
+static int get_pref_llc(struct task_struct *p, struct mm_struct *mm)
+{
+	int mm_sched_llc = -1;
+
+	if (!mm)
+		return -1;
+
+	if (mm->sc_stat.cpu != -1) {
+		mm_sched_llc = llc_id(mm->sc_stat.cpu);
+
+#ifdef CONFIG_NUMA_BALANCING
+		/*
+		 * Don't assign preferred LLC if it
+		 * conflicts with NUMA balancing.
+		 * This can happen when sched_setnuma() gets
+		 * called, however it is not much of an issue
+		 * because we expect account_mm_sched() to get
+		 * called fairly regularly -- at a higher rate
+		 * than sched_setnuma() at least -- and thus the
+		 * conflict only exists for a short period of time.
+		 */
+		if (static_branch_likely(&sched_numa_balancing) &&
+		    p->numa_preferred_nid >= 0 &&
+		    cpu_to_node(mm->sc_stat.cpu) != p->numa_preferred_nid)
+			mm_sched_llc = -1;
+#endif
+	}
+
+	return mm_sched_llc;
+}
+
 static inline
 void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 {
 	struct sched_cache_time *pcpu_sched;
 	struct mm_struct *mm = p->mm;
+	int mm_sched_llc = -1;
 	unsigned long epoch;
 
 	if (!sched_cache_enabled())
@@ -1311,6 +1343,11 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;
 	}
+
+	mm_sched_llc = get_pref_llc(p, mm);
+
+	if (p->preferred_llc != mm_sched_llc)
+		p->preferred_llc = mm_sched_llc;
 }
 
 static void task_tick_cache(struct rq *rq, struct task_struct *p)
@@ -1440,6 +1477,11 @@ void init_sched_mm(struct task_struct *p) { }
 
 static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
 
+static inline int get_pref_llc(struct task_struct *p,
+			       struct mm_struct *mm)
+{
+	return -1;
+}
 #endif
 
 /*
-- 
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen
To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot
Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , linux-kernel@vger.kernel.org
Subject: [PATCH v3 06/21] sched/cache: Track LLC-preferred tasks per runqueue
Date: Tue, 10 Feb 2026 14:18:46 -0800

For each runqueue, track the number of tasks with an LLC preference and
how many of them are running on their preferred LLC. This mirrors
nr_numa_running and nr_preferred_running for NUMA balancing, and will
be used by cache-aware load balancing in later patches.

Co-developed-by: Chen Yu
Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v2->v3: Remove the sched_cache_enabled() check and make
    account_llc_{en,de}queue() depend on CONFIG_SCHED_CACHE, so
    sched_llc_active in v2 can be removed.
    (Peter Zijlstra)

 kernel/sched/core.c  |  5 +++++
 kernel/sched/fair.c  | 48 +++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h |  6 ++++++
 3 files changed, 56 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c6efa71cf500..c464e370576f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -532,6 +532,11 @@ void __trace_set_current_state(int state_value)
 }
 EXPORT_SYMBOL(__trace_set_current_state);
 
+int task_llc(const struct task_struct *p)
+{
+	return per_cpu(sd_llc_id, task_cpu(p));
+}
+
 /*
  * Serialization rules:
  *
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b4ed0f2809d..6ad9ad2f918f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1199,6 +1199,30 @@ static int llc_id(int cpu)
 	return per_cpu(sd_llc_id, cpu);
 }
 
+static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
+{
+	int pref_llc;
+
+	pref_llc = p->preferred_llc;
+	if (pref_llc < 0)
+		return;
+
+	rq->nr_llc_running++;
+	rq->nr_pref_llc_running += (pref_llc == task_llc(p));
+}
+
+static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
+{
+	int pref_llc;
+
+	pref_llc = p->preferred_llc;
+	if (pref_llc < 0)
+		return;
+
+	rq->nr_llc_running--;
+	rq->nr_pref_llc_running -= (pref_llc == task_llc(p));
+}
+
 void mm_init_sched(struct mm_struct *mm, struct sched_cache_time __percpu *_pcpu_sched)
 {
@@ -1304,6 +1328,8 @@ static int get_pref_llc(struct task_struct *p, struct mm_struct *mm)
 	return mm_sched_llc;
 }
 
+static unsigned int task_running_on_cpu(int cpu, struct task_struct *p);
+
 static inline
 void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 {
@@ -1346,8 +1372,13 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 
 	mm_sched_llc = get_pref_llc(p, mm);
 
-	if (p->preferred_llc != mm_sched_llc)
+	/* task not on rq accounted later in account_entity_enqueue() */
+	if (task_running_on_cpu(rq->cpu, p) &&
+	    p->preferred_llc != mm_sched_llc) {
+		account_llc_dequeue(rq, p);
 		p->preferred_llc = mm_sched_llc;
+		account_llc_enqueue(rq, p);
+	}
 }
 
 static void task_tick_cache(struct rq *rq, struct task_struct *p)
@@ -1482,6 +1513,11 @@ static inline int get_pref_llc(struct task_struct *p,
 {
 	return -1;
 }
+
+static void account_llc_enqueue(struct rq *rq, struct task_struct *p) {}
+
+static void account_llc_dequeue(struct rq *rq, struct task_struct *p) {}
+
 #endif
 
 /*
@@ -3970,9 +4006,11 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	update_load_add(&cfs_rq->load, se->load.weight);
 	if (entity_is_task(se)) {
+		struct task_struct *p = task_of(se);
 		struct rq *rq = rq_of(cfs_rq);
 
-		account_numa_enqueue(rq, task_of(se));
+		account_numa_enqueue(rq, p);
+		account_llc_enqueue(rq, p);
 		list_add(&se->group_node, &rq->cfs_tasks);
 	}
 	cfs_rq->nr_queued++;
@@ -3983,7 +4021,11 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (entity_is_task(se)) {
-		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
+		struct task_struct *p = task_of(se);
+		struct rq *rq = rq_of(cfs_rq);
+
+		account_numa_dequeue(rq, p);
+		account_llc_dequeue(rq, p);
 		list_del_init(&se->group_node);
 	}
 	cfs_rq->nr_queued--;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index de5b701c3950..35cea6aa32a4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1128,6 +1128,10 @@ struct rq {
 	unsigned int		nr_preferred_running;
 	unsigned int		numa_migrate_on;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	unsigned int		nr_pref_llc_running;
+	unsigned int		nr_llc_running;
+#endif
 #ifdef CONFIG_NO_HZ_COMMON
 	unsigned long		last_blocked_load_update_tick;
 	unsigned int		has_blocked_load;
@@ -1996,6 +2000,8 @@ init_numa_balancing(u64 clone_flags, struct task_struct *p)
 
 #endif /* !CONFIG_NUMA_BALANCING */
 
+int task_llc(const struct task_struct *p);
+
 static inline
 void queue_balance_callback(struct rq *rq,
			    struct balance_callback *head,
-- 
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R .
Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , linux-kernel@vger.kernel.org Subject: [PATCH v3 07/21] sched/cache: Introduce per CPU's tasks LLC preference counter Date: Tue, 10 Feb 2026 14:18:47 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The lowest level of sched domain for each CPU is assigned an array where each element tracks the number of tasks preferring a given LLC, indexed from 0 to max_llcs - 1. Since each CPU has its dedicated sd, this implies that each CPU will have a dedicated task LLC preference counter. For example, sd->pf[3] =3D 2 signifies that there are 2 tasks on this runqueue which prefer to run within LLC3. The load balancer can use this information to identify busy runqueues and migrate tasks to their preferred LLC domains. This array will be reallocated at runtime during sched domain rebuild. Introduce the buffer allocation mechanism, and the statistics will be calculated in the subsequent patch. Note: the LLC preference statistics of each CPU are reset on sched domain rebuild and may under count temporarily, until the CPU becomes idle and the count is cleared. This is a trade off to avoid complex data synchronization across sched domain builds. Suggested-by: Peter Zijlstra (Intel) Suggested-by: K Prateek Nayak Co-developed-by: Chen Yu Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v2->v3: Allocate preferred LLC buffer in rq->sd rather than the rq. 
    That way it automagically gets reallocated and the old buffer gets
    recycled during sched domain rebuild. (Peter Zijlstra)

 include/linux/sched/topology.h |  4 +++
 kernel/sched/sched.h           |  2 ++
 kernel/sched/topology.c        | 64 +++++++++++++++++++++++++++++++++-
 3 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index a4e2fb31f2fd..3aa6c101b2e4 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -102,6 +102,10 @@ struct sched_domain {
 	u64 max_newidle_lb_cost;
 	unsigned long last_decay_max_lb_cost;
 
+#ifdef CONFIG_SCHED_CACHE
+	unsigned int *pf;
+#endif
+
 #ifdef CONFIG_SCHEDSTATS
 	/* sched_balance_rq() stats */
 	unsigned int lb_count[CPU_MAX_IDLE_TYPES];
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 35cea6aa32a4..ac8c7ac1ac0d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3903,6 +3903,8 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct
 #endif /* !CONFIG_SCHED_MM_CID */
 
 #ifdef CONFIG_SCHED_CACHE
+extern int max_llcs;
+
 static inline bool sched_cache_enabled(void)
 {
 	return false;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ca46b5cf7f78..dae78b5915a7 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -21,6 +21,7 @@ void sched_domains_mutex_unlock(void)
 static cpumask_var_t sched_domains_tmpmask;
 static cpumask_var_t sched_domains_tmpmask2;
 static int tl_max_llcs;
+int max_llcs;
 
 static int __init sched_debug_setup(char *str)
 {
@@ -628,6 +629,11 @@ static void destroy_sched_domain(struct sched_domain *sd)
 
 	if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
 		kfree(sd->shared);
+
+#ifdef CONFIG_SCHED_CACHE
+	/* only the bottom sd has pref_llc array */
+	kfree(sd->pf);
+#endif
 	kfree(sd);
 }
 
@@ -747,10 +753,15 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	if (sd && sd_degenerate(sd)) {
 		tmp = sd;
 		sd = sd->parent;
-		destroy_sched_domain(tmp);
+
 		if (sd) {
 			struct sched_group *sg = sd->groups;
 
+#ifdef CONFIG_SCHED_CACHE
+			/* move pf to parent as child is being destroyed */
+			sd->pf = tmp->pf;
+			tmp->pf = NULL;
+#endif
 			/*
 			 * sched groups hold the flags of the child sched
 			 * domain for convenience. Clear such flags since
@@ -762,6 +773,8 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 
 			sd->child = NULL;
 		}
+
+		destroy_sched_domain(tmp);
 	}
 
 	sched_domain_debug(sd, cpu);
@@ -787,6 +800,46 @@ enum s_alloc {
 	sa_none,
 };
 
+#ifdef CONFIG_SCHED_CACHE
+static bool alloc_sd_pref(const struct cpumask *cpu_map,
+			  struct s_data *d)
+{
+	struct sched_domain *sd;
+	unsigned int *pf;
+	int i;
+
+	for_each_cpu(i, cpu_map) {
+		sd = *per_cpu_ptr(d->sd, i);
+		if (!sd)
+			goto err;
+
+		pf = kcalloc(tl_max_llcs, sizeof(unsigned int), GFP_KERNEL);
+		if (!pf)
+			goto err;
+
+		sd->pf = pf;
+	}
+
+	return true;
+err:
+	for_each_cpu(i, cpu_map) {
+		sd = *per_cpu_ptr(d->sd, i);
+		if (sd) {
+			kfree(sd->pf);
+			sd->pf = NULL;
+		}
+	}
+
+	return false;
+}
+#else
+static bool alloc_sd_pref(const struct cpumask *cpu_map,
+			  struct s_data *d)
+{
+	return false;
+}
+#endif
+
 /*
  * Return the canonical balance CPU for this group, this is the first CPU
  * of this group that's also in the balance mask.
@@ -2710,6 +2763,8 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
 		}
 	}
 
+	alloc_sd_pref(cpu_map, &d);
+
 	/* Attach the domains */
 	rcu_read_lock();
 	for_each_cpu(i, cpu_map) {
@@ -2723,6 +2778,13 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
 	}
 	rcu_read_unlock();
 
+	/*
+	 * Ensure we see enlarged sd->pf when we use new llc_ids and
+	 * bigger max_llcs.
+	 */
+	smp_mb();
+	max_llcs = tl_max_llcs;
+
 	if (has_asym)
 		static_branch_inc_cpuslocked(&sched_asym_cpucapacity);
 
-- 
2.32.0

From: Tim Chen
Subject: [PATCH v3 08/21] sched/cache: Calculate the percpu sd task LLC preference
Date: Tue, 10 Feb 2026 14:18:48 -0800
Message-Id: <41f8e91b70060e7697840163b80c3dc097aabb34.1770760558.git.tim.c.chen@linux.intel.com>

Calculate the number of tasks' LLC preferences for each runqueue. This
statistic is computed during task enqueue and dequeue operations, and is
used by cache-aware load balancing.

Co-developed-by: Chen Yu
Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v2->v3: Move the max_llcs check from patch 4 to this patch. This
    clarifies the rationale for the max_llcs check and makes review
    easier (Peter Zijlstra).
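    The enqueue/dequeue bookkeeping can be modeled in plain user-space C.
    The sketch below is illustrative only: `struct llc_rq`, `MAX_LLCS`,
    and the helper names are invented for the example and are not the
    kernel's symbols, and the real code additionally guards the counters
    with RCU and checks that sd->pf has been allocated.

```c
#include <assert.h>

/* One slot per LLC; a made-up bound standing in for max_llcs. */
#define MAX_LLCS 4

/* Toy stand-in for the per-CPU runqueue's preference counters. */
struct llc_rq {
	unsigned int nr_llc_running;
	unsigned int pf[MAX_LLCS];	/* tasks preferring each LLC */
};

static int valid_llc(int id)
{
	return id >= 0 && id < MAX_LLCS;
}

static void account_enqueue(struct llc_rq *rq, int pref_llc)
{
	if (!valid_llc(pref_llc))
		return;
	rq->nr_llc_running++;
	rq->pf[pref_llc]++;
}

static void account_dequeue(struct llc_rq *rq, int pref_llc)
{
	if (!valid_llc(pref_llc))
		return;
	rq->nr_llc_running--;
	/* Guard against underflow after a counter reset, as the patch does. */
	if (rq->pf[pref_llc])
		rq->pf[pref_llc]--;
}
```

    The underflow guard in the dequeue path mirrors the
    `if (sd->pf[pref_llc])` check in the patch: after a sched domain
    rebuild zeroes the counters, a later dequeue may observe zero.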
 kernel/sched/fair.c | 56 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 54 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6ad9ad2f918f..4a98aa866d65 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1199,28 +1199,80 @@ static int llc_id(int cpu)
 	return per_cpu(sd_llc_id, cpu);
 }
 
+static inline bool valid_llc_id(int id)
+{
+	if (unlikely(id < 0 || id >= max_llcs))
+		return false;
+
+	return true;
+}
+
+static inline bool valid_llc_buf(struct sched_domain *sd,
+				 int id)
+{
+	/*
+	 * The check for sd and its corresponding pf is to
+	 * confirm that the sd->pf[] has been allocated in
+	 * build_sched_domains() after the assignment of
+	 * per_cpu(sd_llc_id, i). This is used to avoid
+	 * the race condition.
+	 */
+	if (unlikely(!sd || !sd->pf))
+		return false;
+
+	return valid_llc_id(id);
+}
+
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
 {
+	struct sched_domain *sd;
 	int pref_llc;
 
 	pref_llc = p->preferred_llc;
-	if (pref_llc < 0)
+	if (!valid_llc_id(pref_llc))
 		return;
 
 	rq->nr_llc_running++;
 	rq->nr_pref_llc_running += (pref_llc == task_llc(p));
+
+	scoped_guard (rcu) {
+		sd = rcu_dereference(rq->sd);
+		if (valid_llc_buf(sd, pref_llc))
+			sd->pf[pref_llc]++;
+	}
 }
 
 static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
 {
+	struct sched_domain *sd;
 	int pref_llc;
 
	pref_llc = p->preferred_llc;
-	if (pref_llc < 0)
+	if (!valid_llc_id(pref_llc))
 		return;
 
 	rq->nr_llc_running--;
 	rq->nr_pref_llc_running -= (pref_llc == task_llc(p));
+
+	scoped_guard (rcu) {
+		sd = rcu_dereference(rq->sd);
+		if (valid_llc_buf(sd, pref_llc)) {
+			/*
+			 * There is a race condition between dequeue
+			 * and CPU hotplug. After a task has been enqueued
+			 * on CPUx, a CPU hotplug event occurs, and all online
+			 * CPUs (including CPUx) rebuild their sched_domains
+			 * and reset statistics to zero (including sd->pf).
+			 * This can cause a temporary undercount, so we have
+			 * to check for such underflow in sd->pf.
+			 *
+			 * The undercount is temporary and accurate accounting
+			 * will resume once the rq has a chance to be idle.
+			 */
+			if (sd->pf[pref_llc])
+				sd->pf[pref_llc]--;
+		}
+	}
 }
 
 void mm_init_sched(struct mm_struct *mm,
-- 
2.32.0

From: Tim Chen
Subject: [PATCH v3 09/21] sched/cache: Count tasks preferring destination LLC in a sched group
Date: Tue, 10 Feb 2026 14:18:49 -0800

During LLC load balancing, tabulate the number of tasks on each runqueue
that prefer the LLC containing env->dst_cpu in a sched group.

For example, consider a system with 4 LLC sched groups (LLC0 to LLC3)
balancing towards LLC3. LLC0 has 3 tasks preferring LLC3, LLC1 has 2,
and LLC2 has 1. LLC0, having the most tasks preferring LLC3, is selected
as the busiest source to pick tasks from.

Within a source LLC, the total number of tasks preferring a destination
LLC is computed by summing the counts across all CPUs in that LLC. For
instance, if LLC0 has CPU0 with 2 tasks and CPU1 with 1 task preferring
LLC3, the total for LLC0 is 3.

These statistics allow the load balancer to choose tasks from source
sched groups that best match their preferred LLCs.

Co-developed-by: Chen Yu
Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v2->v3: Rename nr_pref_llc to nr_pref_dst_llc for clarification.
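    The per-group summation described above (CPU0 contributing 2 tasks
    and CPU1 contributing 1, giving LLC0 a total of 3) can be sketched as
    a plain C loop. The array shape and names below are invented for the
    example; the kernel instead walks sched_group_span() and reads each
    CPU's sd->pf under RCU.

```c
#include <assert.h>

#define NR_CPUS  4
#define MAX_LLCS 4

/*
 * pf[cpu][llc]: number of tasks queued on 'cpu' that prefer 'llc'
 * (a flattened stand-in for each CPU's per-sd pf[] array).
 * Sum, over the CPUs of one source group, the tasks preferring dst_llc.
 */
static unsigned int sum_pref_dst_llc(unsigned int pf[][MAX_LLCS],
				     const int *group_cpus, int nr,
				     int dst_llc)
{
	unsigned int total = 0;

	for (int i = 0; i < nr; i++)
		total += pf[group_cpus[i]][dst_llc];
	return total;
}
```

    This is the quantity accumulated into sgs->nr_pref_dst_llc while
    update_sg_lb_stats() iterates the group's CPUs.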
 kernel/sched/fair.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4a98aa866d65..bb93cc046d73 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10566,6 +10566,9 @@ struct sg_lb_stats {
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	unsigned int nr_pref_dst_llc;
+#endif
 };
 
 /*
@@ -11034,6 +11037,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 {
 	int i, nr_running, local_group, sd_flags = env->sd->flags;
 	bool balancing_at_rd = !env->sd->parent;
+#ifdef CONFIG_SCHED_CACHE
+	int dst_llc = llc_id(env->dst_cpu);
+#endif
 
 	memset(sgs, 0, sizeof(*sgs));
 
@@ -11054,6 +11060,15 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (cpu_overutilized(i))
 			*sg_overutilized = 1;
 
+#ifdef CONFIG_SCHED_CACHE
+		if (sched_cache_enabled() && llc_id(i) != dst_llc) {
+			struct sched_domain *sd_tmp = rcu_dereference(rq->sd);
+
+			if (valid_llc_buf(sd_tmp, dst_llc))
+				sgs->nr_pref_dst_llc += sd_tmp->pf[dst_llc];
+		}
+#endif
+
 		/*
 		 * No need to call idle_cpu() if nr_running is not 0
 		 */
-- 
2.32.0

From: Tim Chen
Subject: [PATCH v3 10/21] sched/cache: Check local_group only once in update_sg_lb_stats()
Date: Tue, 10 Feb 2026 14:18:50 -0800
Message-Id: <9b77a144811f5c11217a0e6a4e6c2b5cfe9dffb9.1770760558.git.tim.c.chen@linux.intel.com>

There is no need to check the local group twice for both
group_asym_packing and group_smt_balance. Adjust the code to facilitate
future checks for group types (cache-aware load balancing) as well.

No functional changes are expected.

Suggested-by: Peter Zijlstra (Intel)
Co-developed-by: Chen Yu
Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v2->v3: No change.
 kernel/sched/fair.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bb93cc046d73..b0cf4424d198 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11109,14 +11109,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 	sgs->group_weight = group->group_weight;
 
-	/* Check if dst CPU is idle and preferred to this group */
-	if (!local_group && env->idle && sgs->sum_h_nr_running &&
-	    sched_group_asym(env, sgs, group))
-		sgs->group_asym_packing = 1;
-
-	/* Check for loaded SMT group to be balanced to dst CPU */
-	if (!local_group && smt_balance(env, sgs, group))
-		sgs->group_smt_balance = 1;
+	if (!local_group) {
+		/* Check if dst CPU is idle and preferred to this group */
+		if (env->idle && sgs->sum_h_nr_running &&
+		    sched_group_asym(env, sgs, group))
+			sgs->group_asym_packing = 1;
+
+		/* Check for loaded SMT group to be balanced to dst CPU */
+		if (smt_balance(env, sgs, group))
+			sgs->group_smt_balance = 1;
+	}
 
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
-- 
2.32.0

From: Tim Chen
Subject: [PATCH v3 11/21] sched/cache: Prioritize tasks preferring destination LLC during balancing
Date: Tue, 10 Feb 2026 14:18:51 -0800
Message-Id: <4754991218da7da039a0891b0b9647f6eabd5716.1770760558.git.tim.c.chen@linux.intel.com>

During LLC load balancing, first check for tasks that prefer the
destination LLC and balance them to it before others. Mark source sched
groups containing tasks preferring non-local LLCs with the
group_llc_balance flag. This ensures the load balancer later pulls or
pushes these tasks toward their preferred LLCs.

The load balancer selects the busiest sched_group and migrates tasks to
less busy groups to distribute load across CPUs. With cache-aware
scheduling enabled, the busiest sched_group is the one with the most
tasks preferring the destination LLC. If the group has the llc_balance
flag set, cache-aware load balancing is triggered.

Introduce the helper function update_llc_busiest() to identify the
sched_group with the most tasks preferring the destination LLC.
Suggested-by: K Prateek Nayak
Co-developed-by: Chen Yu
Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v2->v3: Consider sd->nr_balance_failed when deciding whether LLC
    load balance should be used. (Peter Zijlstra)

 kernel/sched/fair.c | 77 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 76 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b0cf4424d198..43dcf2827298 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9649,6 +9649,11 @@ enum group_type {
 	 * from balancing the load across the system.
 	 */
 	group_imbalanced,
+	/*
+	 * There are tasks running on non-preferred LLC, possible to move
+	 * them to their preferred LLC without creating too much imbalance.
+	 */
+	group_llc_balance,
 	/*
 	 * The CPU is overloaded and can't provide expected CPU cycles to all
 	 * tasks.
@@ -10561,6 +10566,7 @@ struct sg_lb_stats {
 	enum group_type group_type;
 	unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
 	unsigned int group_smt_balance; /* Task on busy SMT be moved */
+	unsigned int group_llc_balance; /* Tasks should be moved to preferred LLC */
 	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
@@ -10819,6 +10825,9 @@ group_type group_classify(unsigned int imbalance_pct,
 	if (group_is_overloaded(imbalance_pct, sgs))
 		return group_overloaded;
 
+	if (sgs->group_llc_balance)
+		return group_llc_balance;
+
 	if (sg_imbalanced(group))
 		return group_imbalanced;
 
@@ -11012,11 +11021,66 @@ static void record_sg_llc_stats(struct lb_env *env,
 	if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
 		WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
 }
+
+/*
+ * Do LLC balance on sched group that contains LLC, and have tasks preferring
+ * to run on LLC in idle dst_cpu.
+ */
+static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
+			       struct sched_group *group)
+{
+	if (!sched_cache_enabled())
+		return false;
+
+	if (env->sd->flags & SD_SHARE_LLC)
+		return false;
+
+	/*
+	 * Don't do cache aware balancing if there
+	 * are too many balance failures.
+	 *
+	 * Should fall back to regular load balancing
+	 * after repeated cache aware balance failures.
+	 */
+	if (env->sd->nr_balance_failed >=
+	    env->sd->cache_nice_tries + 1)
+		return false;
+
+	if (sgs->nr_pref_dst_llc &&
+	    can_migrate_llc(cpumask_first(sched_group_span(group)),
+			    env->dst_cpu, 0, true) == mig_llc)
+		return true;
+
+	return false;
+}
+
+static bool update_llc_busiest(struct lb_env *env,
+			       struct sg_lb_stats *busiest,
+			       struct sg_lb_stats *sgs)
+{
+	/*
+	 * There are more tasks that want to run on dst_cpu's LLC.
+	 */
+	return sgs->nr_pref_dst_llc > busiest->nr_pref_dst_llc;
+}
 #else
 static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_stats *sgs,
 				       struct sched_group *group)
 {
 }
+
+static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
+			       struct sched_group *group)
+{
+	return false;
+}
+
+static bool update_llc_busiest(struct lb_env *env,
+			       struct sg_lb_stats *busiest,
+			       struct sg_lb_stats *sgs)
+{
+	return false;
+}
 #endif
 
 /**
@@ -11118,6 +11182,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		/* Check for loaded SMT group to be balanced to dst CPU */
 		if (smt_balance(env, sgs, group))
 			sgs->group_smt_balance = 1;
+
+		/* Check for tasks in this group can be moved to their preferred LLC */
+		if (llc_balance(env, sgs, group))
+			sgs->group_llc_balance = 1;
 	}
 
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
@@ -11181,6 +11249,10 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		/* Select the overloaded group with highest avg_load.
		 */
		return sgs->avg_load > busiest->avg_load;

+	case group_llc_balance:
+		/* Select the group with the most tasks preferring the dst LLC */
+		return update_llc_busiest(env, busiest, sgs);
+
	case group_imbalanced:
		/*
		 * Select the 1st imbalanced group as we don't have any way to
@@ -11443,6 +11515,7 @@ static bool update_pick_idlest(struct sched_group *idlest,
			return false;
		break;

+	case group_llc_balance:
	case group_imbalanced:
	case group_asym_packing:
	case group_smt_balance:
@@ -11575,6 +11648,7 @@ sched_balance_find_dst_group(struct sched_domain *sd, struct task_struct *p, int
			return NULL;
		break;

+	case group_llc_balance:
	case group_imbalanced:
	case group_asym_packing:
	case group_smt_balance:
@@ -12074,7 +12148,8 @@ static struct sched_group *sched_balance_find_src_group(struct lb_env *env)
	 * group's child domain.
	 */
	if (sds.prefer_sibling && local->group_type == group_has_spare &&
-	    sibling_imbalance(env, &sds, busiest, local) > 1)
+	    (busiest->group_type == group_llc_balance ||
+	     sibling_imbalance(env, &sds, busiest, local) > 1))
		goto force_balance;

	if (busiest->group_type != group_overloaded) {
--
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy", Vincent Guittot
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 12/21] sched/cache: Add migrate_llc_task migration type for cache-aware balancing
Date: Tue, 10 Feb 2026 14:18:52 -0800
Message-Id: <9038c2e0d40b744d5db19138c384819717eb03e6.1770760558.git.tim.c.chen@linux.intel.com>

Introduce a new migration type, migrate_llc_task, to support cache-aware
load balancing.

After identifying the busiest sched_group (the one with the most tasks
preferring the destination LLC), mark migrations with this type. During
load balancing, each runqueue in the busiest sched_group is examined,
and the runqueue with the highest number of tasks preferring the
destination CPU is selected as the busiest runqueue.

Signed-off-by: Tim Chen
---
Notes:
    v2->v3: Let the enum and switch statements have the same order.
        (Peter Zijlstra)

 kernel/sched/fair.c | 38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43dcf2827298..1697791ef11c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9665,7 +9665,8 @@ enum migration_type {
	migrate_load = 0,
	migrate_util,
	migrate_task,
-	migrate_misfit
+	migrate_misfit,
+	migrate_llc_task
 };

 #define LBF_ALL_PINNED	0x01
@@ -10266,6 +10267,10 @@ static int detach_tasks(struct lb_env *env)

			env->imbalance = 0;
			break;
+
+		case migrate_llc_task:
+			env->imbalance--;
+			break;
		}

		detach_task(p, env);
@@ -11902,6 +11907,15 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
		return;
	}

+#ifdef CONFIG_SCHED_CACHE
+	if (busiest->group_type == group_llc_balance) {
+		/* Move a task that prefers the local LLC */
+		env->migration_type = migrate_llc_task;
+		env->imbalance = 1;
+		return;
+	}
+#endif
+
	if (busiest->group_type == group_imbalanced) {
		/*
		 * In the group_imb case we cannot rely on group-wide averages
@@ -12209,6 +12223,11 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
	struct rq *busiest = NULL, *rq;
	unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1;
	unsigned int busiest_nr = 0;
+#ifdef CONFIG_SCHED_CACHE
+	unsigned int busiest_pref_llc = 0;
+	struct sched_domain *sd_tmp;
+	int dst_llc;
+#endif
	int i;

	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
@@ -12336,6 +12355,21 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,

			break;

+		case migrate_llc_task:
+#ifdef CONFIG_SCHED_CACHE
+			sd_tmp = rcu_dereference(rq->sd);
+			dst_llc = llc_id(env->dst_cpu);
+			if (valid_llc_buf(sd_tmp, dst_llc)) {
+				unsigned int this_pref_llc = sd_tmp->pf[dst_llc];
+
+				if (busiest_pref_llc < this_pref_llc) {
+					busiest_pref_llc = this_pref_llc;
+					busiest = rq;
+				}
+			}
+#endif
+			break;
		}
	}

@@ -12499,6 +12533,8 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
	case migrate_misfit:
		__schedstat_add(sd->lb_imbalance_misfit[idle], env->imbalance);
		break;
+	case migrate_llc_task:
+		break;
	}
 }
--
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy", Vincent Guittot
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 13/21] sched/cache: Handle moving single tasks to/from their preferred LLC
Date: Tue, 10 Feb 2026 14:18:53 -0800
Message-Id: <92fa33fc26f069d8044bac3b0efc3598f53131de.1770760558.git.tim.c.chen@linux.intel.com>

In generic (non-cache-aware) load balancing, if the busiest runqueue
has only one task, active balancing may be invoked to move it. However,
this migration might break LLC locality. Before migrating, check
whether the task is running on its preferred LLC: do not move a lone
task to another LLC if doing so would take the task away from its
preferred LLC or cause excessive imbalance between LLCs.

On the other hand, if the migration type is migrate_llc_task, there are
tasks on env->src_cpu that want to be migrated to their preferred LLC,
so launch the active load balance anyway.

Co-developed-by: Chen Yu
Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v2->v3: Remove redundant rcu read lock in break_llc_locality().
 kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1697791ef11c..03959a701514 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9999,12 +9999,60 @@ static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu
				    task_util(p), to_pref);
 }

+/*
+ * Check if active load balance would break LLC locality in
+ * terms of cache aware load balance.
+ */
+static inline bool
+alb_break_llc(struct lb_env *env)
+{
+	if (!sched_cache_enabled())
+		return false;
+
+	if (cpus_share_cache(env->src_cpu, env->dst_cpu))
+		return false;
+
+	/*
+	 * All tasks prefer to stay on their current CPU.
+	 * Do not pull a task from its preferred CPU if:
+	 * 1. It is the only task running there; OR
+	 * 2. Migrating it away from its preferred LLC would violate
+	 *    the cache-aware scheduling policy.
+	 */
+	if (env->src_rq->nr_pref_llc_running &&
+	    env->src_rq->nr_pref_llc_running == env->src_rq->cfs.h_nr_runnable) {
+		unsigned long util = 0;
+		struct task_struct *cur;
+
+		if (env->src_rq->nr_running <= 1)
+			return true;
+
+		/*
+		 * We reach here from load balance with
+		 * rcu_read_lock() held.
+		 */
+		cur = rcu_dereference(env->src_rq->curr);
+		if (cur)
+			util = task_util(cur);
+
+		if (can_migrate_llc(env->src_cpu, env->dst_cpu,
+				    util, false) == mig_forbid)
+			return true;
+	}
+
+	return false;
+}
 #else
 static inline bool get_llc_stats(int cpu, unsigned long *util,
				 unsigned long *cap)
 {
	return false;
 }
+
+static inline bool
+alb_break_llc(struct lb_env *env)
+{
+	return false;
+}
 #endif
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
@@ -12421,6 +12469,9 @@ static int need_active_balance(struct lb_env *env)
 {
	struct sched_domain *sd = env->sd;

+	if (alb_break_llc(env))
+		return 0;
+
	if (asym_active_balance(env))
		return 1;

@@ -12440,7 +12491,8 @@ static int need_active_balance(struct lb_env *env)
		return 1;
	}

-	if (env->migration_type == migrate_misfit)
+	if (env->migration_type == migrate_misfit ||
+	    env->migration_type == migrate_llc_task)
		return 1;

	return 0;
--
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy", Vincent Guittot
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 14/21] sched/cache: Respect LLC preference in task migration and detach
Date: Tue, 10 Feb 2026 14:18:54 -0800
Message-Id: <82aeb78bbfb80cb6861b85e4db9d398f6c8e331b.1770760558.git.tim.c.chen@linux.intel.com>

During the final step of load balancing, can_migrate_task() now
considers a task's LLC preference before moving it out of its
preferred LLC.

Suggested-by: Peter Zijlstra (Intel)
Suggested-by: K Prateek Nayak
Co-developed-by: Chen Yu
Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v2->v3: Use a similar mechanism to NUMA balancing, which skips over
        tasks that would degrade locality in can_migrate_task(); only if
        nr_balance_failed is high enough do we ignore that. (Peter Zijlstra)

    Let migrate_degrades_locality() take precedence over
    migrate_degrades_llc(), which aims to migrate towards the
    preferred NUMA node.
        (Peter Zijlstra)

 kernel/sched/fair.c  | 64 +++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h | 13 +++++++++
 2 files changed, 73 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03959a701514..d1145997b88d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9973,8 +9973,8 @@ static enum llc_mig can_migrate_llc(int src_cpu, int dst_cpu,
  * Check if task p can migrate from source LLC to
  * destination LLC in terms of cache aware load balance.
  */
-static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
-							struct task_struct *p)
+static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
+					 struct task_struct *p)
 {
	struct mm_struct *mm;
	bool to_pref;
@@ -10041,6 +10041,47 @@ alb_break_llc(struct lb_env *env)

	return false;
 }
+
+/*
+ * Check if migrating task p from env->src_cpu to
+ * env->dst_cpu would break LLC locality.
+ */
+static bool migrate_degrades_llc(struct task_struct *p, struct lb_env *env)
+{
+	if (!sched_cache_enabled())
+		return false;
+
+	if (task_has_sched_core(p))
+		return false;
+
+	/*
+	 * Skip over tasks that would degrade LLC locality;
+	 * only when nr_balance_failed is sufficiently high do we
+	 * ignore this constraint.
+	 *
+	 * The threshold of cache_nice_tries is set 1 higher than
+	 * nr_balance_failed to avoid excessive task migration at
+	 * the same time. Refer to the comments around llc_balance().
+	 */
+	if (env->sd->nr_balance_failed >= env->sd->cache_nice_tries + 1)
+		return false;
+
+	/*
+	 * We know env->src_cpu has some tasks that prefer to run on
+	 * env->dst_cpu; skip the tasks that do not prefer
+	 * env->dst_cpu, and find one that does.
+	 */
+	if (env->migration_type == migrate_llc_task &&
+	    task_llc(p) != llc_id(env->dst_cpu))
+		return true;
+
+	if (can_migrate_llc_task(env->src_cpu,
+				 env->dst_cpu, p) != mig_forbid)
+		return false;
+
+	return true;
+}
+
 #else
 static inline bool get_llc_stats(int cpu, unsigned long *util,
				 unsigned long *cap)
@@ -10053,6 +10094,12 @@ alb_break_llc(struct lb_env *env)
 {
	return false;
 }
+
+static inline bool
+migrate_degrades_llc(struct task_struct *p, struct lb_env *env)
+{
+	return false;
+}
 #endif
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
@@ -10150,10 +10197,19 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
		return 1;

	degrades = migrate_degrades_locality(p, env);
-	if (!degrades)
+	if (!degrades) {
+		/*
+		 * If NUMA locality is not broken, further check
+		 * whether migration would hurt LLC locality.
+		 */
+		if (migrate_degrades_llc(p, env))
+			return 0;
+
		hot = task_hot(p, env);
-	else
+	} else {
		hot = degrades > 0;
+	}

	if (!hot || env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
		if (hot)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ac8c7ac1ac0d..c18e59f320a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1495,6 +1495,14 @@ extern void sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags);
 extern void sched_core_get(void);
 extern void sched_core_put(void);

+static inline bool task_has_sched_core(struct task_struct *p)
+{
+	if (sched_core_disabled())
+		return false;
+
+	return !!p->core_cookie;
+}
+
 #else /* !CONFIG_SCHED_CORE: */

 static inline bool sched_core_enabled(struct rq *rq)
@@ -1534,6 +1542,11 @@ static inline bool sched_group_cookie_match(struct rq *rq,
	return true;
 }

+static inline bool task_has_sched_core(struct task_struct *p)
+{
+	return false;
+}
+
 #endif /* !CONFIG_SCHED_CORE */

 #ifdef CONFIG_RT_GROUP_SCHED
--
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy", Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Tim Chen, Aubrey Li, Zhao Liu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts
Date: Tue, 10 Feb 2026 14:18:55 -0800

From: Chen Yu

A performance regression was observed by Prateek when running hackbench
with many threads per process (high fd count). To avoid this, processes
with a large number of active threads are excluded from cache-aware
scheduling.

With sched_cache enabled, record the number of active threads in each
process during the periodic task_cache_work(). While iterating over
CPUs, if the currently running task belongs to the same process as the
task that launched task_cache_work(), increment the active thread
count. If the number of active threads in the process exceeds the
number of cores (CPUs divided by the number of SMT siblings) in the
LLC, do not enable cache-aware scheduling.

For users who wish to perform task aggregation regardless, a debugfs
knob is provided for tuning in a subsequent patch.

Suggested-by: K Prateek Nayak
Suggested-by: Aaron Lu
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
Signed-off-by: Chen Yu
---
Notes:
    v2->v3: Put the calculation of nr_running_avg and its use into one
        patch. (Peter Zijlstra)

    Use guard(rcu)() when calculating the number of active threads
    of the process.
        (Peter Zijlstra)

    Introduce update_avg_scale() rather than using update_avg() to
    fit systems with a small LLC. (Aaron Lu)

 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 59 ++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c98bd1c46088..511c9b263386 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2346,6 +2346,7 @@ struct sched_cache_stat {
	struct sched_cache_time __percpu *pcpu_sched;
	raw_spinlock_t lock;
	unsigned long epoch;
+	u64 nr_running_avg;
	int cpu;
 } ____cacheline_aligned_in_smp;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d1145997b88d..86b6b08e7e1e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1223,6 +1223,19 @@ static inline bool valid_llc_buf(struct sched_domain *sd,
	return valid_llc_id(id);
 }

+static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
+{
+	int smt_nr = 1;
+
+#ifdef CONFIG_SCHED_SMT
+	if (sched_smt_active())
+		smt_nr = cpumask_weight(cpu_smt_mask(cpu));
+#endif
+
+	return !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
+			      per_cpu(sd_llc_size, cpu));
+}
+
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
 {
	struct sched_domain *sd;
@@ -1417,7 +1430,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
	 */
	if (time_after(epoch, READ_ONCE(mm->sc_stat.epoch) +
		       EPOCH_LLC_AFFINITY_TIMEOUT) ||
-	    get_nr_threads(p) <= 1) {
+	    get_nr_threads(p) <= 1 ||
+	    exceed_llc_nr(mm, cpu_of(rq))) {
		if (mm->sc_stat.cpu != -1)
			mm->sc_stat.cpu = -1;
	}
@@ -1458,13 +1472,31 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
	}
 }

+static inline void update_avg_scale(u64 *avg, u64 sample)
+{
+	int factor = per_cpu(sd_llc_size, raw_smp_processor_id());
+	s64 diff = sample - *avg;
+	u32 divisor;
+
+	/*
+	 * Scale the divisor based on the number of CPUs contained
+	 * in the LLC. This scaling ensures smaller LLC domains use
+	 * a smaller divisor to achieve more precise sensitivity to
+	 * changes in nr_running, while larger LLC domains are capped
+	 * at a maximum divisor of 8, which is the default EWMA
+	 * smoothing factor in update_avg().
+	 */
+	divisor = clamp_t(u32, (factor >> 2), 2, 8);
+	*avg += div64_s64(diff, divisor);
+}
+
 static void task_cache_work(struct callback_head *work)
 {
-	struct task_struct *p = current;
+	struct task_struct *p = current, *cur;
	struct mm_struct *mm = p->mm;
	unsigned long m_a_occ = 0;
	unsigned long curr_m_a_occ = 0;
-	int cpu, m_a_cpu = -1;
+	int cpu, m_a_cpu = -1, nr_running = 0;
	cpumask_var_t cpus;

	WARN_ON_ONCE(work != &p->cache_work);
@@ -1474,6 +1506,13 @@ static void task_cache_work(struct callback_head *work)
	if (p->flags & PF_EXITING)
		return;

+	if (get_nr_threads(p) <= 1) {
+		if (mm->sc_stat.cpu != -1)
+			mm->sc_stat.cpu = -1;
+
+		return;
+	}
+
	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
		return;

@@ -1497,6 +1536,12 @@ static void task_cache_work(struct callback_head *work)
				m_occ = occ;
				m_cpu = i;
			}
+			scoped_guard (rcu) {
+				cur = rcu_dereference(cpu_rq(i)->curr);
+				if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
+				    cur->mm == mm)
+					nr_running++;
+			}
		}

		/*
@@ -1540,6 +1585,7 @@ static void task_cache_work(struct callback_head *work)
		mm->sc_stat.cpu = m_a_cpu;
	}

+	update_avg_scale(&mm->sc_stat.nr_running_avg, nr_running);
	free_cpumask_var(cpus);
 }

@@ -9988,6 +10034,13 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
		return mig_unrestricted;

+	/* Skip cache aware load balance for single/too many threads */
+	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu)) {
+		if (mm->sc_stat.cpu != -1)
+			mm->sc_stat.cpu = -1;
+		return mig_unrestricted;
+	}
+
	if (cpus_share_cache(dst_cpu, cpu))
		to_pref = true;
	else if (cpus_share_cache(src_cpu, cpu))
-- 
2.32.0

From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R .
Shenoy", Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 16/21] sched/cache: Avoid cache-aware scheduling for memory-heavy processes
Date: Tue, 10 Feb 2026 14:18:56 -0800

From: Chen Yu

Prateek and Tingyin reported that memory-intensive workloads (such as
stream) can saturate memory bandwidth and caches on the preferred LLC
when sched_cache aggregates too many threads.

To mitigate this, estimate a process's memory footprint by comparing
its RSS (anonymous and shared pages) to the size of the LLC. If RSS
exceeds the LLC size, skip cache-aware scheduling. Note that RSS is
only an approximation of the memory footprint. By default, the
comparison is strict, but a later patch will allow users to provide a
hint to adjust this threshold.

According to the test from Adam, some systems do not have a shared L3
but do have shared L2 clusters. In that case, the L2 becomes the
LLC[1].

Link[1]: https://lore.kernel.org/all/3cb6ebc7-a2fd-42b3-8739-b00e28a09cb6@os.amperecomputing.com/
Suggested-by: K Prateek Nayak
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
Signed-off-by: Chen Yu
---
Notes:
    v2->v3:
    Fix overflow issue in exceed_llc_capacity() by changing
    the type of llc from int to u64.
    (Jianyong Wu, Yangyu Chen)

 include/linux/cacheinfo.h | 21 ++++++++++-------
 kernel/sched/fair.c       | 48 +++++++++++++++++++++++++++++++++++----
 2 files changed, 56 insertions(+), 13 deletions(-)

diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index c8f4f0a0b874..82d0d59ca0e1 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -113,18 +113,11 @@ int acpi_get_cache_info(unsigned int cpu,

 const struct attribute_group *cache_get_priv_group(struct cacheinfo *this_leaf);

-/*
- * Get the cacheinfo structure for the cache associated with @cpu at
- * level @level.
- * cpuhp lock must be held.
- */
-static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
+static inline struct cacheinfo *_get_cpu_cacheinfo_level(int cpu, int level)
 {
 	struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
 	int i;

-	lockdep_assert_cpus_held();
-
 	for (i = 0; i < ci->num_leaves; i++) {
 		if (ci->info_list[i].level == level) {
 			if (ci->info_list[i].attributes & CACHE_ID)
@@ -136,6 +129,18 @@ static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
 	return NULL;
 }

+/*
+ * Get the cacheinfo structure for the cache associated with @cpu at
+ * level @level.
+ * cpuhp lock must be held.
+ */
+static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
+{
+	lockdep_assert_cpus_held();
+
+	return _get_cpu_cacheinfo_level(cpu, level);
+}
+
 /*
  * Get the id of the cache associated with @cpu at level @level.
  * cpuhp lock must be held.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 86b6b08e7e1e..ee4982af2bdd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1223,6 +1223,37 @@ static inline bool valid_llc_buf(struct sched_domain *sd,
 	return valid_llc_id(id);
 }

+static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
+{
+	struct cacheinfo *ci;
+	u64 rss, llc;
+
+	/*
+	 * get_cpu_cacheinfo_level() can not be used
+	 * because it requires the cpu_hotplug_lock
+	 * to be held. Use _get_cpu_cacheinfo_level()
+	 * directly because the 'cpu' can not be
+	 * offlined at the moment.
+	 */
+	ci = _get_cpu_cacheinfo_level(cpu, 3);
+	if (!ci) {
+		/*
+		 * On system without L3 but with shared L2,
+		 * L2 becomes the LLC.
+		 */
+		ci = _get_cpu_cacheinfo_level(cpu, 2);
+		if (!ci)
+			return true;
+	}
+
+	llc = ci->size;
+
+	rss = get_mm_counter(mm, MM_ANONPAGES) +
+		get_mm_counter(mm, MM_SHMEMPAGES);
+
+	return (llc <= (rss * PAGE_SIZE));
+}
+
 static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
 {
 	int smt_nr = 1;
@@ -1431,7 +1462,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	if (time_after(epoch, READ_ONCE(mm->sc_stat.epoch) +
 		       EPOCH_LLC_AFFINITY_TIMEOUT) ||
 	    get_nr_threads(p) <= 1 ||
-	    exceed_llc_nr(mm, cpu_of(rq))) {
+	    exceed_llc_nr(mm, cpu_of(rq)) ||
+	    exceed_llc_capacity(mm, cpu_of(rq))) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;
 	}
@@ -1496,7 +1528,7 @@ static void task_cache_work(struct callback_head *work)
 	struct mm_struct *mm = p->mm;
 	unsigned long m_a_occ = 0;
 	unsigned long curr_m_a_occ = 0;
-	int cpu, m_a_cpu = -1, nr_running = 0;
+	int cpu, m_a_cpu = -1, nr_running = 0, curr_cpu;
 	cpumask_var_t cpus;

 	WARN_ON_ONCE(work != &p->cache_work);
@@ -1506,7 +1538,9 @@ static void task_cache_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;

-	if (get_nr_threads(p) <= 1) {
+	curr_cpu = task_cpu(p);
+	if (get_nr_threads(p) <= 1 ||
+	    exceed_llc_capacity(mm, curr_cpu)) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;

@@ -10034,8 +10068,12 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
 	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
 		return mig_unrestricted;

-	/* skip cache aware load balance for single/too many threads */
-	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu)) {
+	/*
+	 * Skip cache aware load balance for single/too many threads
+	 * or large memory RSS.
+	 */
+	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu) ||
+	    exceed_llc_capacity(mm, dst_cpu)) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;
 		return mig_unrestricted;
-- 
2.32.0
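The strict RSS-versus-LLC comparison this patch introduces can be sketched in user space as follows (a sketch under the assumption of 4 KiB pages; the kernel compares `rss * PAGE_SIZE` against the `ci->size` reported by cacheinfo):

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096ULL /* assumption: 4 KiB pages */

/*
 * Strict form of the check in this patch: the process "exceeds" the
 * LLC when its RSS (anon + shmem pages) covers the whole cache, in
 * which case packing all of its threads onto one LLC would thrash it.
 */
static bool exceed_llc_capacity(uint64_t rss_pages, uint64_t llc_bytes)
{
	return llc_bytes <= rss_pages * PAGE_SIZE_BYTES;
}
```

With a 32 MB LLC (33554432 bytes), a process holding 8192 four-KiB pages (exactly 32 MB) is already considered to exceed capacity, since the comparison is `<=`.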
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R .
Shenoy", Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 17/21] sched/cache: Enable cache aware scheduling for multi LLCs NUMA node
Date: Tue, 10 Feb 2026 14:18:57 -0800

From: Chen Yu

Introduce sched_cache_present to enable cache-aware scheduling on
NUMA nodes with multiple LLCs.

Cache-aware load balancing should only be enabled if there is more
than one LLC within a NUMA node. sched_cache_present indicates
whether the platform has this topology.

Suggested-by: Libo Chen
Suggested-by: Adam Li
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
Signed-off-by: Chen Yu
---
Notes:
    v2->v3:
    No change.
 kernel/sched/sched.h    |  3 ++-
 kernel/sched/topology.c | 18 ++++++++++++++++--
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c18e59f320a6..59ac04625842 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3916,11 +3916,12 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct
 #endif /* !CONFIG_SCHED_MM_CID */

 #ifdef CONFIG_SCHED_CACHE
+DECLARE_STATIC_KEY_FALSE(sched_cache_present);
 extern int max_llcs;

 static inline bool sched_cache_enabled(void)
 {
-	return false;
+	return static_branch_unlikely(&sched_cache_present);
 }
 #endif
 extern void init_sched_mm(struct task_struct *p);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index dae78b5915a7..9104fed25351 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -801,6 +801,7 @@ enum s_alloc {
 };

 #ifdef CONFIG_SCHED_CACHE
+DEFINE_STATIC_KEY_FALSE(sched_cache_present);
 static bool alloc_sd_pref(const struct cpumask *cpu_map,
 			  struct s_data *d)
 {
@@ -2604,6 +2605,7 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 			       struct sched_domain_attr *attr)
 {
 	enum s_alloc alloc_state = sa_none;
+	bool has_multi_llcs = false;
 	struct sched_domain *sd;
 	struct s_data d;
 	struct rq *rq = NULL;
@@ -2731,10 +2733,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	 * between LLCs and memory channels.
 	 */
 	nr_llcs = sd->span_weight / child->span_weight;
-	if (nr_llcs == 1)
+	if (nr_llcs == 1) {
 		imb = sd->span_weight >> 3;
-	else
+	} else {
 		imb = nr_llcs;
+		has_multi_llcs = true;
+	}
 	imb = max(1U, imb);
 	sd->imb_numa_nr = imb;

@@ -2796,6 +2800,16 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att

 	ret = 0;
 error:
+#ifdef CONFIG_SCHED_CACHE
+	/*
+	 * TBD: check before writing to it. sched domain rebuild
+	 * is not in the critical path, leave as-is for now.
+	 */
+	if (!ret && has_multi_llcs)
+		static_branch_enable_cpuslocked(&sched_cache_present);
+	else
+		static_branch_disable_cpuslocked(&sched_cache_present);
+#endif
 	__free_domain_allocs(&d, alloc_state, cpu_map);

 	return ret;
-- 
2.32.0
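The `has_multi_llcs` detection above hinges on one ratio of sched-domain span weights. A toy illustration (a sketch, not the kernel code — span weights are passed in as plain integers instead of being read from `sd->span_weight`):

```c
#include <stdbool.h>

/*
 * A NUMA-level domain spans nr_llcs child (LLC) domains:
 * nr_llcs = sd->span_weight / child->span_weight. The
 * sched_cache_present static key is only worth enabling when some
 * node holds more than one LLC.
 */
static bool node_has_multi_llcs(unsigned int node_span_weight,
				unsigned int llc_span_weight)
{
	return node_span_weight / llc_span_weight > 1;
}
```

For example, a 128-CPU node built from 16-CPU LLCs yields nr_llcs = 8 and enables the key; a node whose single LLC covers all of its CPUs yields nr_llcs = 1 and leaves it disabled.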
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R .
Shenoy", Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 18/21] sched/cache: Allow the user space to turn on and off cache aware scheduling
Date: Tue, 10 Feb 2026 14:18:58 -0800

From: Chen Yu

Provide a debugfs knob to let the user turn cache-aware scheduling
on and off at runtime.

Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v2->v3:
    Split into a new patch for better review, use
    kstrtobool_from_user() to get the user input.
    (Peter Zijlstra)

 kernel/sched/debug.c    | 45 ++++++++++++++++++++++++++++
 kernel/sched/sched.h    |  7 +++--
 kernel/sched/topology.c | 65 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 115 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 41caa22e0680..bae747eddc59 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -215,6 +215,46 @@ static const struct file_operations sched_scaling_fops = {
 	.release	= single_release,
 };

+#ifdef CONFIG_SCHED_CACHE
+static ssize_t
+sched_cache_enable_write(struct file *filp, const char __user *ubuf,
+			 size_t cnt, loff_t *ppos)
+{
+	bool val;
+	int ret;
+
+	ret = kstrtobool_from_user(ubuf, cnt, &val);
+	if (ret)
+		return ret;
+
+	sysctl_sched_cache_user = val;
+
+	sched_cache_active_set_unlocked();
+
+	return cnt;
+}
+
+static int sched_cache_enable_show(struct seq_file *m, void *v)
+{
+	seq_printf(m, "%d\n", sysctl_sched_cache_user);
+	return 0;
+}
+
+static int sched_cache_enable_open(struct inode *inode,
+				   struct file *filp)
+{
+	return single_open(filp, sched_cache_enable_show, NULL);
+}
+
+static const struct file_operations sched_cache_enable_fops = {
+	.open		= sched_cache_enable_open,
+	.write		= sched_cache_enable_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+#endif
+
 #ifdef CONFIG_PREEMPT_DYNAMIC

 static ssize_t sched_dynamic_write(struct file *filp, const char __user *ubuf,
@@ -523,6 +563,11 @@ static __init int sched_init_debug(void)
 	debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
 #endif /* CONFIG_NUMA_BALANCING */

+#ifdef CONFIG_SCHED_CACHE
+	debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
+			    &sched_cache_enable_fops);
+#endif
+
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);

 	debugfs_fair_server_init();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 59ac04625842..adf3428745dd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3917,12 +3917,15 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct

 #ifdef CONFIG_SCHED_CACHE
 DECLARE_STATIC_KEY_FALSE(sched_cache_present);
-extern int max_llcs;
+DECLARE_STATIC_KEY_FALSE(sched_cache_active);
+extern int max_llcs, sysctl_sched_cache_user;

 static inline bool sched_cache_enabled(void)
 {
-	return static_branch_unlikely(&sched_cache_present);
+	return static_branch_unlikely(&sched_cache_active);
 }
+
+extern void sched_cache_active_set_unlocked(void);
 #endif
 extern void init_sched_mm(struct task_struct *p);

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9104fed25351..e86dea1b9e86 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -801,7 +801,16 @@ enum s_alloc {
 };

 #ifdef CONFIG_SCHED_CACHE
+/* hardware support for cache aware scheduling */
 DEFINE_STATIC_KEY_FALSE(sched_cache_present);
+/*
+ * Indicator of whether cache aware scheduling
+ * is active, used by the scheduler.
+ */
+DEFINE_STATIC_KEY_FALSE(sched_cache_active);
+/* user wants cache aware scheduling [0 or 1] */
+int sysctl_sched_cache_user = 1;
+
 static bool alloc_sd_pref(const struct cpumask *cpu_map,
 			  struct s_data *d)
 {
@@ -833,6 +842,60 @@ static bool alloc_sd_pref(const struct cpumask *cpu_map,

 	return false;
 }
+
+static void _sched_cache_active_set(bool enable, bool locked)
+{
+	if (enable) {
+		if (locked)
+			static_branch_enable_cpuslocked(&sched_cache_active);
+		else
+			static_branch_enable(&sched_cache_active);
+	} else {
+		if (locked)
+			static_branch_disable_cpuslocked(&sched_cache_active);
+		else
+			static_branch_disable(&sched_cache_active);
+	}
+}
+
+/*
+ * Enable/disable cache aware scheduling according to
+ * user input and the presence of hardware support.
+ */
+static void sched_cache_active_set(bool locked)
+{
+	/* hardware does not support */
+	if (!static_branch_likely(&sched_cache_present)) {
+		_sched_cache_active_set(false, locked);
+		return;
+	}
+
+	/*
+	 * user wants it or not ?
+	 * TBD: read before writing the static key.
+	 * It is not in the critical path, leave as-is
+	 * for now.
+	 */
+	if (sysctl_sched_cache_user) {
+		_sched_cache_active_set(true, locked);
+		if (sched_debug())
+			pr_info("%s: enabling cache aware scheduling\n", __func__);
+	} else {
+		_sched_cache_active_set(false, locked);
+		if (sched_debug())
+			pr_info("%s: disabling cache aware scheduling\n", __func__);
+	}
+}
+
+static void sched_cache_active_set_locked(void)
+{
+	return sched_cache_active_set(true);
+}
+
+void sched_cache_active_set_unlocked(void)
+{
+	return sched_cache_active_set(false);
+}
 #else
 static bool alloc_sd_pref(const struct cpumask *cpu_map,
 			  struct s_data *d)
@@ -2809,6 +2872,8 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		static_branch_enable_cpuslocked(&sched_cache_present);
 	else
 		static_branch_disable_cpuslocked(&sched_cache_present);
+
+	sched_cache_active_set_locked();
 #endif
 	__free_domain_allocs(&d, alloc_state, cpu_map);

-- 
2.32.0
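The `llc_enabled` write path above relies on `kstrtobool_from_user()` to interpret the user's input. A rough user-space model of the accepted first characters (a sketch — the real kernel helper also handles "on"/"off", which is omitted here):

```c
#include <stdbool.h>

/*
 * Rough model of the kernel's kstrtobool(): the decision is made from
 * the first character of the input ('1'/'y'/'Y' -> true,
 * '0'/'n'/'N' -> false), so trailing newlines from `echo` are fine.
 */
static int parse_bool(const char *s, bool *val)
{
	switch (s[0]) {
	case '1': case 'y': case 'Y':
		*val = true;
		return 0;
	case '0': case 'n': case 'N':
		*val = false;
		return 0;
	default:
		return -1; /* the kernel helper returns -EINVAL */
	}
}
```

This is why `echo 1 > /sys/kernel/debug/sched/llc_enabled` works despite the trailing newline the shell appends.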
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R . Shenoy", Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 19/21] sched/cache: Add user control to adjust the aggressiveness of cache-aware scheduling
Date: Tue, 10 Feb 2026 14:18:59 -0800

From: Chen Yu

Introduce a set of debugfs knobs to control how aggressively
cache-aware scheduling aggregates tasks.

(1) llc_aggr_tolerance

With sched_cache enabled, the scheduler uses a process's RSS as a
proxy for its LLC footprint to determine if aggregating tasks on the
preferred LLC could cause cache contention. If RSS exceeds the LLC
size, aggregation is skipped.

Some workloads with large RSS but small actual memory footprints may
still benefit from aggregation. Since the kernel cannot efficiently
track per-task cache usage (resctrl is user-space only), userspace
can provide a more accurate hint.

Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let users
control how strictly RSS limits aggregation. Values range from 0 to
100:

- 0: Cache-aware scheduling is disabled.
- 1: Strict; tasks with RSS larger than LLC size are skipped.
- >=100: Aggressive; tasks are aggregated regardless of RSS.

For example, with a 32MB L3 cache:
- llc_aggr_tolerance=1  -> tasks with RSS > 32MB are skipped.
- llc_aggr_tolerance=99 -> tasks with RSS > 784GB are skipped
  (784GB = (1 + (99 - 1) * 256) * 32MB).

Similarly, llc_aggr_tolerance also controls how strictly the number
of active threads is considered when doing cache-aware load balance.
The number of SMTs is also considered: high SMT counts reduce the
aggregation capacity, preventing excessive task aggregation on
SMT-heavy systems like Power10/Power11.

Yangyu suggested introducing separate aggregation controls for the
number of active threads and the memory RSS check. Since there are
plans to add per-process/task-group controls, fine-grained tunables
are deferred to that implementation.

(2) llc_epoch_period, llc_epoch_affinity_timeout, llc_imb_pct and
llc_overaggr_pct are also turned into tunables.

Suggested-by: K Prateek Nayak
Suggested-by: Madadi Vineeth Reddy
Suggested-by: Shrikanth Hegde
Suggested-by: Tingyin Duan
Suggested-by: Jianyong Wu
Suggested-by: Yangyu Chen
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
Signed-off-by: Chen Yu
---
Notes:
    v2->v3:
    Simplify the implementation by using debugfs_create_u32()
    for all tunable parameters.
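The tolerance arithmetic above can be checked with a small sketch (the function name is illustrative, not from the patch; the kernel applies the scale inside exceed_llc_capacity()):

```c
#include <stdint.h>

/*
 * Effective RSS cutoff implied by llc_aggr_tolerance: the LLC size
 * is scaled by (1 + (tolerance - 1) * 256). Tolerance 0 disables
 * cache-aware scheduling (cutoff 0); tolerance >= 100 ignores RSS
 * entirely (modelled here as UINT64_MAX).
 */
static uint64_t rss_cutoff_bytes(unsigned int tolerance, uint64_t llc_bytes)
{
	if (tolerance == 0)
		return 0;
	if (tolerance >= 100)
		return UINT64_MAX;
	return (1 + (uint64_t)(tolerance - 1) * 256) * llc_bytes;
}
```

For a 32 MB LLC, tolerance 1 gives a 32 MB cutoff, and tolerance 99 gives (1 + 98*256) * 32MB = 25089 * 32MB = 841847144448 bytes, about 784 GB as the commit message states.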
 kernel/sched/debug.c | 10 ++++++++
 kernel/sched/fair.c  | 59 ++++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |  5 ++++
 3 files changed, 67 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index bae747eddc59..dc4b7de6569f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -566,6 +566,16 @@ static __init int sched_init_debug(void)
 #ifdef CONFIG_SCHED_CACHE
 	debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
 			    &sched_cache_enable_fops);
+	debugfs_create_u32("llc_aggr_tolerance", 0644, debugfs_sched,
+			   &llc_aggr_tolerance);
+	debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched,
+			   &llc_epoch_period);
+	debugfs_create_u32("llc_epoch_affinity_timeout", 0644, debugfs_sched,
+			   &llc_epoch_affinity_timeout);
+	debugfs_create_u32("llc_overaggr_pct", 0644, debugfs_sched,
+			   &llc_overaggr_pct);
+	debugfs_create_u32("llc_imb_pct", 0644, debugfs_sched,
+			   &llc_imb_pct);
 #endif

 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee4982af2bdd..da4291ace24c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1191,6 +1191,12 @@ static void set_next_buddy(struct sched_entity *se);
 #define EPOCH_PERIOD	(HZ / 100)	/* 10 ms */
 #define EPOCH_LLC_AFFINITY_TIMEOUT	5	/* 50 ms */

+__read_mostly unsigned int llc_aggr_tolerance = 1;
+__read_mostly unsigned int llc_epoch_period = EPOCH_PERIOD;
+__read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
+__read_mostly unsigned int llc_imb_pct = 20;
+__read_mostly unsigned int llc_overaggr_pct = 50;
+
 static int llc_id(int cpu)
 {
 	if (cpu < 0)
@@ -1223,10 +1229,22 @@ static inline bool valid_llc_buf(struct sched_domain *sd,
 	return valid_llc_id(id);
 }

+static inline int get_sched_cache_scale(int mul)
+{
+	if (!llc_aggr_tolerance)
+		return 0;
+
+	if (llc_aggr_tolerance >= 100)
+		return INT_MAX;
+
+	return (1 + (llc_aggr_tolerance - 1) * mul);
+}
+
 static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 {
 	struct cacheinfo *ci;
 	u64 rss, llc;
+	int scale;

 	/*
	 * get_cpu_cacheinfo_level() can not be used
@@ -1251,20 +1269,47 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 	rss = get_mm_counter(mm, MM_ANONPAGES) +
	      get_mm_counter(mm, MM_SHMEMPAGES);

-	return (llc <= (rss * PAGE_SIZE));
+	/*
+	 * Scale the LLC size by 256*llc_aggr_tolerance
+	 * and compare it to the task's RSS size.
+	 *
+	 * Suppose the L3 size is 32MB. If the
+	 * llc_aggr_tolerance is 1:
+	 * When the RSS is larger than 32MB, the process
+	 * is regarded as exceeding the LLC capacity. If
+	 * the llc_aggr_tolerance is 99:
+	 * When the RSS is larger than 784GB, the process
+	 * is regarded as exceeding the LLC capacity:
+	 * 784GB = (1 + (99 - 1) * 256) * 32MB
+	 * If the llc_aggr_tolerance is 100:
+	 * ignore the RSS.
+	 */
+	scale = get_sched_cache_scale(256);
+	if (scale == INT_MAX)
+		return false;
+
+	return ((llc * scale) <= (rss * PAGE_SIZE));
 }

 static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
 {
-	int smt_nr = 1;
+	int smt_nr = 1, scale;

 #ifdef CONFIG_SCHED_SMT
 	if (sched_smt_active())
 		smt_nr = cpumask_weight(cpu_smt_mask(cpu));
 #endif

+	/*
+	 * Scale the number of 'cores' in a LLC by llc_aggr_tolerance
+	 * and compare it to the task's active threads.
+	 */
+	scale = get_sched_cache_scale(1);
+	if (scale == INT_MAX)
+		return false;
+
 	return !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
-			      per_cpu(sd_llc_size, cpu));
+			      (scale * per_cpu(sd_llc_size, cpu)));
 }

 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
@@ -1365,7 +1410,7 @@ static inline void __update_mm_sched(struct rq *rq,
 	long delta = now - rq->cpu_epoch_next;

 	if (delta > 0) {
-		n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
+		n = (delta + llc_epoch_period - 1) / llc_epoch_period;
 		rq->cpu_epoch += n;
 		rq->cpu_epoch_next += n * EPOCH_PERIOD;
 		__shr_u64(&rq->cpu_runtime, n);
@@ -1460,7 +1505,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 * has only 1 thread, invalidate its preferred state.
 	 */
 	if (time_after(epoch,
-		       READ_ONCE(mm->sc_stat.epoch) + EPOCH_LLC_AFFINITY_TIMEOUT) ||
+		       READ_ONCE(mm->sc_stat.epoch) + llc_epoch_affinity_timeout) ||
	    get_nr_threads(p) <= 1 ||
	    exceed_llc_nr(mm, cpu_of(rq)) ||
	    exceed_llc_capacity(mm, cpu_of(rq))) {
@@ -9920,7 +9965,7 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
  * (default: ~50%)
  */
 #define fits_llc_capacity(util, max)	\
-	((util) * 2 < (max))
+	((util) * 100 < (max) * llc_overaggr_pct)

 /*
  * The margin used when comparing utilization.
@@ -9930,7 +9975,7 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
  */
 /* Allows dst util to be bigger than src util by up to bias percent */
 #define util_greater(util1, util2)	\
-	((util1) * 100 > (util2) * 120)
+	((util1) * 100 > (util2) * (100 + llc_imb_pct))

 /* Called from load balancing paths with rcu_read_lock held */
 static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index adf3428745dd..f4785f84b1f1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3919,6 +3919,11 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct
 DECLARE_STATIC_KEY_FALSE(sched_cache_present);
 DECLARE_STATIC_KEY_FALSE(sched_cache_active);
 extern int max_llcs, sysctl_sched_cache_user;
+extern unsigned int llc_aggr_tolerance;
+extern unsigned int llc_epoch_period;
+extern unsigned int llc_epoch_affinity_timeout;
+extern unsigned int llc_imb_pct;
+extern unsigned int llc_overaggr_pct;

 static inline bool sched_cache_enabled(void)
 {
-- 
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy", Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 20/21] -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy for each process via proc fs
Date: Tue, 10 Feb 2026 14:19:00 -0800
Message-Id: <09c48847deeb9d2c1c7de1f2799cc128cd2e866e.1770760558.git.tim.c.chen@linux.intel.com>

From: Chen Yu

Debug patch only.

Show the per-LLC occupancy in /proc/{PID}/schedstat, with each column
corresponding to one LLC. This can be used to verify whether the
cache-aware load balancer works as expected by aggregating threads onto
dedicated LLCs.

Suppose there are 2 LLCs and the sampling duration is 10 seconds.

With cache-aware load balancing enabled:
0 12281    <--- LLC0 residency delta is 0, LLC1 is 12 seconds
0 18881
0 16217

With cache-aware load balancing disabled:
6497 15802
9299 5435
17811 8278

Co-developed-by: Aaron Lu
Signed-off-by: Aaron Lu
Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v2->v3: Enhance the informational output by printing the task's
    preferred LLC.
    (Aaron Lu)

 fs/proc/base.c           | 31 +++++++++++++++++++++++++
 include/linux/mm_types.h | 17 +++++++++++---
 include/linux/sched.h    |  6 +++++
 kernel/sched/fair.c      | 50 ++++++++++++++++++++++++++++++++++++----
 4 files changed, 97 insertions(+), 7 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 4eec684baca9..76b49e80af1a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -518,6 +518,37 @@ static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns,
 		   (unsigned long long)task->se.sum_exec_runtime,
 		   (unsigned long long)task->sched_info.run_delay,
 		   task->sched_info.pcount);
+#ifdef CONFIG_SCHED_CACHE
+	if (sched_cache_inuse()) {
+		struct mm_struct *mm = task->mm;
+		u64 *llc_runtime;
+		int mm_sched_llc;
+
+		if (!mm)
+			return 0;
+
+		llc_runtime = kcalloc(max_llcs, sizeof(u64), GFP_KERNEL);
+		if (!llc_runtime)
+			return 0;
+
+		if (get_mm_per_llc_runtime(task, llc_runtime))
+			goto out;
+
+		if (mm->sc_stat.cpu == -1)
+			mm_sched_llc = -1;
+		else
+			mm_sched_llc = llc_id(mm->sc_stat.cpu);
+
+		for (int i = 0; i < max_llcs; i++)
+			seq_printf(m, "%s%s%llu ",
+				   i == task->preferred_llc ? "*" : "",
+				   i == mm_sched_llc ? "?" : "",
+				   llc_runtime[i]);
+		seq_puts(m, "\n");
+out:
+		kfree(llc_runtime);
+	}
+#endif

 	return 0;
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 777a48523aa6..2b8d0ec032e8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1523,17 +1523,26 @@ static inline unsigned int mm_cid_size(void)

 #ifdef CONFIG_SCHED_CACHE
 void mm_init_sched(struct mm_struct *mm,
-		   struct sched_cache_time __percpu *pcpu_sched);
+		   struct sched_cache_time __percpu *pcpu_sched,
+		   struct sched_cache_time __percpu *pcpu_time);

 static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
 {
 	struct sched_cache_time __percpu *pcpu_sched =
-		alloc_percpu_noprof(struct sched_cache_time);
+		alloc_percpu_noprof(struct sched_cache_time),
+		*pcpu_time;

 	if (!pcpu_sched)
 		return -ENOMEM;

-	mm_init_sched(mm, pcpu_sched);
+	pcpu_time = alloc_percpu_noprof(struct sched_cache_time);
+	if (!pcpu_time) {
+		free_percpu(pcpu_sched);
+		return -ENOMEM;
+	}
+
+	mm_init_sched(mm, pcpu_sched, pcpu_time);
+
 	return 0;
 }

@@ -1542,7 +1551,9 @@ static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
 static inline void mm_destroy_sched(struct mm_struct *mm)
 {
 	free_percpu(mm->sc_stat.pcpu_sched);
+	free_percpu(mm->sc_stat.pcpu_time);
 	mm->sc_stat.pcpu_sched = NULL;
+	mm->sc_stat.pcpu_time = NULL;
 }
 #else /* !CONFIG_SCHED_CACHE */

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 511c9b263386..4236cacbb409 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2344,12 +2344,18 @@ struct sched_cache_time {

 struct sched_cache_stat {
 	struct sched_cache_time __percpu *pcpu_sched;
+	struct sched_cache_time __percpu *pcpu_time;
 	raw_spinlock_t lock;
 	unsigned long epoch;
 	u64 nr_running_avg;
 	int cpu;
 } ____cacheline_aligned_in_smp;

+int get_mm_per_llc_runtime(struct task_struct *p, u64 *buf);
+bool sched_cache_inuse(void);
+extern int max_llcs;
+int llc_id(int cpu);
+
 #else

 struct sched_cache_stat { };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da4291ace24c..25cee3dd767c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1197,7 +1197,12 @@ __read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEO
 __read_mostly unsigned int llc_imb_pct = 20;
 __read_mostly unsigned int llc_overaggr_pct = 50;

-static int llc_id(int cpu)
+bool sched_cache_inuse(void)
+{
+	return sched_cache_enabled();
+}
+
+int llc_id(int cpu)
 {
 	if (cpu < 0)
 		return -1;
@@ -1365,17 +1370,20 @@ static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
 }

 void mm_init_sched(struct mm_struct *mm,
-		   struct sched_cache_time __percpu *_pcpu_sched)
+		   struct sched_cache_time __percpu *_pcpu_sched,
+		   struct sched_cache_time __percpu *_pcpu_time)
 {
 	unsigned long epoch;
 	int i;

 	for_each_possible_cpu(i) {
 		struct sched_cache_time *pcpu_sched = per_cpu_ptr(_pcpu_sched, i);
+		struct sched_cache_time *pcpu_time = per_cpu_ptr(_pcpu_time, i);
 		struct rq *rq = cpu_rq(i);

 		pcpu_sched->runtime = 0;
 		pcpu_sched->epoch = rq->cpu_epoch;
+		pcpu_time->runtime = 0;
 		epoch = rq->cpu_epoch;
 	}

@@ -1389,6 +1397,8 @@ void mm_init_sched(struct mm_struct *mm,
 	 * the readers may get invalid mm_sched_epoch, etc.
 	 */
 	smp_store_release(&mm->sc_stat.pcpu_sched, _pcpu_sched);
+	/* barrier */
+	smp_store_release(&mm->sc_stat.pcpu_time, _pcpu_time);
 }

 /* because why would C be fully specified */
@@ -1474,7 +1484,8 @@ static unsigned int task_running_on_cpu(int cpu, struct task_struct *p);
 static inline
 void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 {
-	struct sched_cache_time *pcpu_sched;
+	struct sched_cache_time *pcpu_sched,
+				*pcpu_time;
 	struct mm_struct *mm = p->mm;
 	int mm_sched_llc = -1;
 	unsigned long epoch;
@@ -1488,14 +1499,18 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 * init_task, kthreads and user thread created
 	 * by user_mode_thread() don't have mm.
	 */
-	if (!mm || !mm->sc_stat.pcpu_sched)
+	if (!mm || !mm->sc_stat.pcpu_sched ||
+	    !mm->sc_stat.pcpu_time)
 		return;

 	pcpu_sched = per_cpu_ptr(p->mm->sc_stat.pcpu_sched, cpu_of(rq));
+	pcpu_time = per_cpu_ptr(p->mm->sc_stat.pcpu_time, cpu_of(rq));

 	scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
 		__update_mm_sched(rq, pcpu_sched);
 		pcpu_sched->runtime += delta_exec;
+		/* pure runtime without decay */
+		pcpu_time->runtime += delta_exec;
 		rq->cpu_runtime += delta_exec;
 		epoch = rq->cpu_epoch;
 	}
@@ -1676,6 +1691,33 @@ void init_sched_mm(struct task_struct *p)
 	work->next = work;
 }

+/* p->pi_lock is held */
+int get_mm_per_llc_runtime(struct task_struct *p, u64 *buf)
+{
+	struct sched_cache_time *pcpu_time;
+	struct mm_struct *mm = p->mm;
+	int cpu;
+
+	if (!mm)
+		return -EINVAL;
+
+	rcu_read_lock();
+	for_each_online_cpu(cpu) {
+		int llc = llc_id(cpu);
+		u64 runtime_ms;
+
+		if (!valid_llc_id(llc))
+			continue;
+
+		pcpu_time = per_cpu_ptr(mm->sc_stat.pcpu_time, cpu);
+		runtime_ms = div_u64(pcpu_time->runtime, NSEC_PER_MSEC);
+		buf[llc] += runtime_ms;
+	}
+	rcu_read_unlock();
+
+	return 0;
+}
+
 #else

 static inline void account_mm_sched(struct rq *rq, struct task_struct *p,
-- 
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy", Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 21/21] -- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load balance statistics
Date: Tue, 10 Feb 2026 14:19:01 -0800
Message-Id: <5d663caaed7ebe93ab9b272235675b2400b3ed8b.1770760558.git.tim.c.chen@linux.intel.com>

From: Chen Yu

Debug patch only.

These trace events can be consumed (via bpftrace, etc.) to monitor
cache-aware load-balancing activity - specifically, whether tasks are
moved to their preferred LLC, moved out of their preferred LLC, or
whether cache-aware load balancing is skipped because the memory
footprint limit is exceeded or there are too many active tasks.
Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v2->v3: Add more trace events when the process exceeds the limitation
    of LLC size or number of active threads (moved from schedstat to
    trace events for better bpf tracking).

 include/trace/events/sched.h | 79 ++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          | 40 ++++++++++++++----
 2 files changed, 110 insertions(+), 9 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 7b2645b50e78..b73327653e4b 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -10,6 +10,85 @@
 #include
 #include

+#ifdef CONFIG_SCHED_CACHE
+TRACE_EVENT(sched_exceed_llc_cap,
+
+	TP_PROTO(struct task_struct *t, int exceeded),
+
+	TP_ARGS(t, exceeded),
+
+	TP_STRUCT__entry(
+		__array( char,	comm,	TASK_COMM_LEN	)
+		__field( pid_t,	pid			)
+		__field( int,	exceeded		)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		__entry->pid		= t->pid;
+		__entry->exceeded	= exceeded;
+	),
+
+	TP_printk("comm=%s pid=%d exceed_cap=%d",
+		  __entry->comm, __entry->pid,
+		  __entry->exceeded)
+);
+
+TRACE_EVENT(sched_exceed_llc_nr,
+
+	TP_PROTO(struct task_struct *t, int exceeded),
+
+	TP_ARGS(t, exceeded),
+
+	TP_STRUCT__entry(
+		__array( char,	comm,	TASK_COMM_LEN	)
+		__field( pid_t,	pid			)
+		__field( int,	exceeded		)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		__entry->pid		= t->pid;
+		__entry->exceeded	= exceeded;
+	),
+
+	TP_printk("comm=%s pid=%d exceed_nr=%d",
+		  __entry->comm, __entry->pid,
+		  __entry->exceeded)
+);
+
+TRACE_EVENT(sched_attach_task,
+
+	TP_PROTO(struct task_struct *t, int pref_cpu, int pref_llc,
+		 int attach_cpu, int attach_llc),
+
+	TP_ARGS(t, pref_cpu, pref_llc, attach_cpu, attach_llc),
+
+	TP_STRUCT__entry(
+		__array( char,	comm,	TASK_COMM_LEN	)
+		__field( pid_t,	pid		)
+		__field( int,	pref_cpu	)
+		__field( int,	pref_llc	)
+		__field( int,	attach_cpu	)
+		__field( int,	attach_llc	)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		__entry->pid		= t->pid;
+		__entry->pref_cpu	= pref_cpu;
+		__entry->pref_llc	= pref_llc;
+		__entry->attach_cpu	= attach_cpu;
+		__entry->attach_llc	= attach_llc;
+	),
+
+	TP_printk("comm=%s pid=%d pref_cpu=%d pref_llc=%d attach_cpu=%d attach_llc=%d",
+		  __entry->comm, __entry->pid,
+		  __entry->pref_cpu, __entry->pref_llc,
+		  __entry->attach_cpu, __entry->attach_llc)
+);
+#endif
+
 /*
  * Tracepoint for calling kthread_stop, performed to end a kthread:
  */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25cee3dd767c..977091fd0e49 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1245,9 +1245,11 @@ static inline int get_sched_cache_scale(int mul)
 	return (1 + (llc_aggr_tolerance - 1) * mul);
 }

-static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
+static bool exceed_llc_capacity(struct mm_struct *mm, int cpu,
+				struct task_struct *p)
 {
 	struct cacheinfo *ci;
+	bool exceeded;
 	u64 rss, llc;
 	int scale;

@@ -1293,12 +1295,18 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 	if (scale == INT_MAX)
 		return false;

-	return ((llc * scale) <= (rss * PAGE_SIZE));
+	exceeded = ((llc * scale) <= (rss * PAGE_SIZE));
+
+	trace_sched_exceed_llc_cap(p, exceeded);
+
+	return exceeded;
 }

-static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
+static bool exceed_llc_nr(struct mm_struct *mm, int cpu,
+			  struct task_struct *p)
 {
 	int smt_nr = 1, scale;
+	bool exceeded;

 #ifdef CONFIG_SCHED_SMT
 	if (sched_smt_active())
@@ -1313,8 +1321,12 @@ static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
 	if (scale == INT_MAX)
 		return false;

-	return !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
+	exceeded = !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
 			      (scale * per_cpu(sd_llc_size, cpu)));
+
+	trace_sched_exceed_llc_nr(p, exceeded);
+
+	return exceeded;
 }

 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
@@ -1522,8 +1534,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	if (time_after(epoch,
		       READ_ONCE(mm->sc_stat.epoch) + llc_epoch_affinity_timeout) ||
	    get_nr_threads(p) <= 1 ||
-	    exceed_llc_nr(mm, cpu_of(rq)) ||
-	    exceed_llc_capacity(mm, cpu_of(rq))) {
+	    exceed_llc_nr(mm, cpu_of(rq), p) ||
+	    exceed_llc_capacity(mm, cpu_of(rq), p)) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;
 	}
@@ -1600,7 +1612,7 @@ static void task_cache_work(struct callback_head *work)

 	curr_cpu = task_cpu(p);
 	if (get_nr_threads(p) <= 1 ||
-	    exceed_llc_capacity(mm, curr_cpu)) {
+	    exceed_llc_capacity(mm, curr_cpu, p)) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;

@@ -10159,8 +10171,8 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
 	 * Skip cache aware load balance for single/too many threads
 	 * or large memory RSS.
 	 */
-	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu) ||
-	    exceed_llc_capacity(mm, dst_cpu)) {
+	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu, p) ||
+	    exceed_llc_capacity(mm, dst_cpu, p)) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;
 		return mig_unrestricted;
@@ -10602,6 +10614,16 @@ static void attach_task(struct rq *rq, struct task_struct *p)
 {
 	lockdep_assert_rq_held(rq);

+#ifdef CONFIG_SCHED_CACHE
+	if (p->mm) {
+		int pref_cpu = p->mm->sc_stat.cpu;
+
+		trace_sched_attach_task(p,
+			pref_cpu,
+			pref_cpu != -1 ? llc_id(pref_cpu) : -1,
+			cpu_of(rq), llc_id(cpu_of(rq)));
+	}
+#endif
 	WARN_ON_ONCE(task_rq(p) != rq);
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
 	wakeup_preempt(rq, p, 0);
-- 
2.32.0