From nobody Fri Dec 19 19:37:42 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3E1062EA481 for ; Wed, 3 Dec 2025 23:01:19 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802883; cv=none; b=UKLk6Rg4Ag2RrVZTM6q83e57jrtOabhFLy87jTKdCORkErT5oscdmGvQFuZ8uzk4JddS6cPh1pfkZIjrorb34GjVrTfhTnjF3Ev1eA9P3f9SHm6a8HG5wxWf/yS25iz0NQWmXUw8INvgj0a9A56o6dRBuDjYNgK/XPE8bAKiBUg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802883; c=relaxed/simple; bh=6xbRUXX8feoSk8bOjg/vcAGiqy4i78lNKWOOyysMsTg=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=ezTAzjYx2Rp52iZO2WWYVcoqrFo5k7CxRy+shLmmCt9X8OAnGBmN2eYuhkz/I7t0LW4rAjnmLXBSt4s5lKDI7cjNxUO/rV3B0EWqv13ojuB5QKkGvUXb3YGE9U0EUSc8TdruI55O35k40Uh0lNID1k89G7Dxb8VJ6Ckm0RWpbqE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=hXZ7RTSy; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="hXZ7RTSy" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802880; x=1796338880; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=6xbRUXX8feoSk8bOjg/vcAGiqy4i78lNKWOOyysMsTg=; b=hXZ7RTSym9fS1Xvrd9iY6zdRxiZpXzgeaEnDkbWt4E9kikaWOOGcUivi QbmpWan09GqoanGn0S6Vft9B7BxCpebF9EW9KXpkUelSttyWWDfdj3/y5 FTK7BCv2Ykd5RjEGqBmouxnoYSthhh0M052SACkie+UXmvYxcT/sOQCCX HOsATO8B6T2nuON/L4dyuLl54HqVuf+JcbMOZ0ABnQ6ZFHGM/cCwqCXcJ AmUI07y2Khz2g6thC1D3WG4YXreJSp+sT28iidXrCmaZBan6+WI286Msl K0/hGg9Y68V2FBcOV+wIiAuy+MY5XGtKxf7nZIp0LSDOwP7fiuTJEB8U9 g==; X-CSE-ConnectionGUID: cpmnVUlITmyoLapwFs/Now== X-CSE-MsgGUID: 1fVK363gQBOq9Aw6XgwrKg== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136182" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136182" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:18 -0800 X-CSE-ConnectionGUID: d8cQS9oyRh+diLyesP0AjA== X-CSE-MsgGUID: JRVFNz3/S1eHusPWEwHUNA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763734" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:18 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . 
Shenoy" , Vincent Guittot Cc: Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing Date: Wed, 3 Dec 2025 15:07:20 -0800 Message-Id: <06f0d7edbc3185ec730b50b3b00d87ace44169b3.1764801860.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: "Peter Zijlstra (Intel)" Adds infrastructure to enable cache-aware load balancing, which improves cache locality by grouping tasks that share resources within the same cache domain. This reduces cache misses and improves overall data access efficiency. In this initial implementation, threads belonging to the same process are treated as entities that likely share working sets. The mechanism tracks per-process CPU occupancy across cache domains and attempts to migrate threads toward cache-hot domains where their process already has active threads, thereby enhancing locality. This provides a basic model for cache affinity. While the current code targets the last-level cache (LLC), the approach could be extended to other domain types such as clusters (L2) or node-internal groupings. At present, the mechanism selects the CPU within an LLC that has the highest recent runtime. Subsequent patches in this series will use this information in the load-balancing path to guide task placement toward preferred LLCs. In the future, more advanced policies could be integrated through NUMA balancing-for example, migrating a task to its preferred LLC when spare capacity exists, or swapping tasks across LLCs to improve cache affinity. Grouping of tasks could also be generalized from that of a process to be that of a NUMA group, or be user configurable. Originally-by: Peter Zijlstra (Intel) Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: Restore the original CPU scan to cover all online CPUs, rather than scanning within the preferred NUMA node. (Peter Zijlstra) =20 Use rq->curr instead of rq->donor. (K Prateek Nayak) =20 Minor fix in task_tick_cache() to use if (mm->mm_sched_epoch >=3D rq->cpu_epoch) to avoid mm_sched_epoch going backwards. include/linux/mm_types.h | 44 +++++++ include/linux/sched.h | 11 ++ init/Kconfig | 11 ++ kernel/fork.c | 6 + kernel/sched/core.c | 6 + kernel/sched/fair.c | 258 +++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 8 ++ 7 files changed, 344 insertions(+) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 90e5790c318f..1ea16ef90566 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -939,6 +939,11 @@ typedef struct { DECLARE_BITMAP(__mm_flags, NUM_MM_FLAG_BITS); } __private mm_flags_t; =20 +struct mm_sched { + u64 runtime; + unsigned long epoch; +}; + struct kioctx_table; struct iommu_mm_data; struct mm_struct { @@ -1029,6 +1034,17 @@ struct mm_struct { */ raw_spinlock_t cpus_allowed_lock; #endif +#ifdef CONFIG_SCHED_CACHE + /* + * Track per-cpu-per-process occupancy as a proxy for cache residency. + * See account_mm_sched() and ... 
+ */ + struct mm_sched __percpu *pcpu_sched; + raw_spinlock_t mm_sched_lock; + unsigned long mm_sched_epoch; + int mm_sched_cpu; +#endif + #ifdef CONFIG_MMU atomic_long_t pgtables_bytes; /* size of all page tables */ #endif @@ -1487,6 +1503,34 @@ static inline unsigned int mm_cid_size(void) static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct = cpumask *cpumask) { } #endif /* CONFIG_SCHED_MM_CID */ =20 +#ifdef CONFIG_SCHED_CACHE +void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *pcpu_sc= hed); + +static inline int mm_alloc_sched_noprof(struct mm_struct *mm) +{ + struct mm_sched __percpu *pcpu_sched =3D alloc_percpu_noprof(struct mm_sc= hed); + + if (!pcpu_sched) + return -ENOMEM; + + mm_init_sched(mm, pcpu_sched); + return 0; +} + +#define mm_alloc_sched(...) alloc_hooks(mm_alloc_sched_noprof(__VA_ARGS__)) + +static inline void mm_destroy_sched(struct mm_struct *mm) +{ + free_percpu(mm->pcpu_sched); + mm->pcpu_sched =3D NULL; +} +#else /* !CONFIG_SCHED_CACHE */ + +static inline int mm_alloc_sched(struct mm_struct *mm) { return 0; } +static inline void mm_destroy_sched(struct mm_struct *mm) { } + +#endif /* CONFIG_SCHED_CACHE */ + struct mmu_gather; extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm); extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct= *mm); diff --git a/include/linux/sched.h b/include/linux/sched.h index b469878de25c..278b529c91df 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1406,6 +1406,10 @@ struct task_struct { unsigned long numa_pages_migrated; #endif /* CONFIG_NUMA_BALANCING */ =20 +#ifdef CONFIG_SCHED_CACHE + struct callback_head cache_work; +#endif + #ifdef CONFIG_RSEQ struct rseq __user *rseq; u32 rseq_len; @@ -2428,4 +2432,11 @@ extern void migrate_enable(void); =20 DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable()) =20 +#ifdef CONFIG_SCHED_CACHE +static inline bool sched_cache_enabled(void) +{ + return false; +} +#endif + #endif diff --git a/init/Kconfig b/init/Kconfig index cab3ad28ca49..88556ef8cfd1 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -983,6 +983,17 @@ config NUMA_BALANCING =20 This system will be inactive on UMA systems. =20 +config SCHED_CACHE + bool "Cache aware load balance" + default y + depends on SMP + help + When enabled, the scheduler will attempt to aggregate tasks from + the same process onto a single Last Level Cache (LLC) domain when + possible. This improves cache locality by keeping tasks that share + resources within the same cache domain, reducing cache misses and + lowering data access latency. 
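
[ Editor's illustration, not part of the patch: the kernel/sched/fair.c
  hunks further down decay each process's per-CPU runtime by half every
  10 ms epoch and prefer the LLC with the largest accumulated occupancy.
  The standalone model below shows that idea with made-up numbers; it
  uses raw decayed runtime instead of the runtime fraction that
  fraction_mm_sched() computes, and every name and value in it is
  hypothetical. ]

#include <stdio.h>
#include <stdint.h>

#define NR_CPUS		8
#define CPUS_PER_LLC	4	/* two LLCs of four CPUs each */

/* Halve the accumulated runtime once per elapsed epoch, like __shr_u64(). */
static void decay(uint64_t *runtime, unsigned int epochs)
{
	*runtime = epochs >= 64 ? 0 : *runtime >> epochs;
}

/* Pick the LLC whose CPUs hold the most decayed runtime of this process. */
static int preferred_llc(const uint64_t occ[NR_CPUS])
{
	uint64_t best = 0;
	int best_llc = -1;

	for (int llc = 0; llc < NR_CPUS / CPUS_PER_LLC; llc++) {
		uint64_t sum = 0;

		for (int i = 0; i < CPUS_PER_LLC; i++)
			sum += occ[llc * CPUS_PER_LLC + i];
		if (sum > best) {
			best = sum;
			best_llc = llc;
		}
	}
	return best_llc;
}

int main(void)
{
	uint64_t occ[NR_CPUS] = {
		4000, 3000, 0, 0,	/* LLC 0 */
		0, 0, 9000, 0,		/* LLC 1 */
	};

	for (int i = 0; i < NR_CPUS; i++)
		decay(&occ[i], 1);	/* one 10 ms epoch elapses */

	/* LLC 1 wins: 4500 vs 3500 after the decay. */
	printf("preferred LLC: %d\n", preferred_llc(occ));
	return 0;
}
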
+ config NUMA_BALANCING_DEFAULT_ENABLED bool "Automatically enable NUMA aware memory/task placement" default y diff --git a/kernel/fork.c b/kernel/fork.c index 3da0f08615a9..aae5053d1e30 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -680,6 +680,7 @@ void __mmdrop(struct mm_struct *mm) cleanup_lazy_tlbs(mm); =20 WARN_ON_ONCE(mm =3D=3D current->active_mm); + mm_destroy_sched(mm); mm_free_pgd(mm); mm_free_id(mm); destroy_context(mm); @@ -1083,6 +1084,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm= , struct task_struct *p, if (mm_alloc_cid(mm, p)) goto fail_cid; =20 + if (mm_alloc_sched(mm)) + goto fail_sched; + if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT, NR_MM_COUNTERS)) goto fail_pcpu; @@ -1092,6 +1096,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm= , struct task_struct *p, return mm; =20 fail_pcpu: + mm_destroy_sched(mm); +fail_sched: mm_destroy_cid(mm); fail_cid: destroy_context(mm); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f754a60de848..e8bdf03a4b7f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4488,6 +4488,7 @@ static void __sched_fork(u64 clone_flags, struct task= _struct *p) p->wake_entry.u_flags =3D CSD_TYPE_TTWU; p->migration_pending =3D NULL; init_sched_mm_cid(p); + init_sched_mm(p); } =20 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing); @@ -8791,6 +8792,11 @@ void __init sched_init(void) =20 rq->core_cookie =3D 0UL; #endif +#ifdef CONFIG_SCHED_CACHE + raw_spin_lock_init(&rq->cpu_epoch_lock); + rq->cpu_epoch_next =3D jiffies; +#endif + zalloc_cpumask_var_node(&rq->scratch_mask, GFP_KERNEL, cpu_to_node(i)); } =20 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 5b752324270b..cb82f558dc5b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1152,6 +1152,8 @@ void post_init_entity_util_avg(struct task_struct *p) sa->runnable_avg =3D sa->util_avg; } =20 +static inline void account_mm_sched(struct rq *rq, struct task_struct *p, = s64 delta_exec); + static s64 update_se(struct rq *rq, struct sched_entity *se) { u64 now =3D rq_clock_task(rq); @@ -1174,6 +1176,7 @@ static s64 update_se(struct rq *rq, struct sched_enti= ty *se) =20 trace_sched_stat_runtime(running, delta_exec); account_group_exec_runtime(running, delta_exec); + account_mm_sched(rq, running, delta_exec); =20 /* cgroup time is always accounted against the donor */ cgroup_account_cputime(donor, delta_exec); @@ -1193,6 +1196,259 @@ static s64 update_se(struct rq *rq, struct sched_en= tity *se) return delta_exec; } =20 +#ifdef CONFIG_SCHED_CACHE + +/* + * XXX numbers come from a place the sun don't shine -- probably wants to = be SD + * tunable or so. + */ +#define EPOCH_PERIOD (HZ / 100) /* 10 ms */ +#define EPOCH_LLC_AFFINITY_TIMEOUT 5 /* 50 ms */ + +static int llc_id(int cpu) +{ + if (cpu < 0) + return -1; + + return per_cpu(sd_llc_id, cpu); +} + +void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_s= ched) +{ + unsigned long epoch; + int i; + + for_each_possible_cpu(i) { + struct mm_sched *pcpu_sched =3D per_cpu_ptr(_pcpu_sched, i); + struct rq *rq =3D cpu_rq(i); + + pcpu_sched->runtime =3D 0; + pcpu_sched->epoch =3D rq->cpu_epoch; + epoch =3D rq->cpu_epoch; + } + + raw_spin_lock_init(&mm->mm_sched_lock); + mm->mm_sched_epoch =3D epoch; + mm->mm_sched_cpu =3D -1; + + /* + * The update to mm->pcpu_sched should not be reordered + * before initialization to mm's other fields, in case + * the readers may get invalid mm_sched_epoch, etc. 
+ */ + smp_store_release(&mm->pcpu_sched, _pcpu_sched); +} + +/* because why would C be fully specified */ +static __always_inline void __shr_u64(u64 *val, unsigned int n) +{ + if (n >=3D 64) { + *val =3D 0; + return; + } + *val >>=3D n; +} + +static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_= sched) +{ + lockdep_assert_held(&rq->cpu_epoch_lock); + + unsigned long n, now =3D jiffies; + long delta =3D now - rq->cpu_epoch_next; + + if (delta > 0) { + n =3D (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD; + rq->cpu_epoch +=3D n; + rq->cpu_epoch_next +=3D n * EPOCH_PERIOD; + __shr_u64(&rq->cpu_runtime, n); + } + + n =3D rq->cpu_epoch - pcpu_sched->epoch; + if (n) { + pcpu_sched->epoch +=3D n; + __shr_u64(&pcpu_sched->runtime, n); + } +} + +static unsigned long __no_profile fraction_mm_sched(struct rq *rq, struct = mm_sched *pcpu_sched) +{ + guard(raw_spinlock_irqsave)(&rq->cpu_epoch_lock); + + __update_mm_sched(rq, pcpu_sched); + + /* + * Runtime is a geometric series (r=3D0.5) and as such will sum to twice + * the accumulation period, this means the multiplcation here should + * not overflow. + */ + return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1); +} + +static inline +void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec) +{ + struct mm_struct *mm =3D p->mm; + struct mm_sched *pcpu_sched; + unsigned long epoch; + + if (!sched_cache_enabled()) + return; + + if (p->sched_class !=3D &fair_sched_class) + return; + /* + * init_task and kthreads don't having mm + */ + if (!mm || !mm->pcpu_sched) + return; + + pcpu_sched =3D per_cpu_ptr(p->mm->pcpu_sched, cpu_of(rq)); + + scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) { + __update_mm_sched(rq, pcpu_sched); + pcpu_sched->runtime +=3D delta_exec; + rq->cpu_runtime +=3D delta_exec; + epoch =3D rq->cpu_epoch; + } + + /* + * If this task hasn't hit task_cache_work() for a while, or it + * has only 1 thread, invalidate its preferred state. 
+ */ + if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT || + get_nr_threads(p) <=3D 1) { + if (mm->mm_sched_cpu !=3D -1) + mm->mm_sched_cpu =3D -1; + } +} + +static void task_tick_cache(struct rq *rq, struct task_struct *p) +{ + struct callback_head *work =3D &p->cache_work; + struct mm_struct *mm =3D p->mm; + + if (!sched_cache_enabled()) + return; + + if (!mm || !mm->pcpu_sched) + return; + + /* avoid moving backwards */ + if (mm->mm_sched_epoch >=3D rq->cpu_epoch) + return; + + guard(raw_spinlock)(&mm->mm_sched_lock); + + if (work->next =3D=3D work) { + task_work_add(p, work, TWA_RESUME); + WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch); + } +} + +static void __no_profile task_cache_work(struct callback_head *work) +{ + struct task_struct *p =3D current; + struct mm_struct *mm =3D p->mm; + unsigned long m_a_occ =3D 0; + unsigned long curr_m_a_occ =3D 0; + int cpu, m_a_cpu =3D -1; + cpumask_var_t cpus; + + WARN_ON_ONCE(work !=3D &p->cache_work); + + work->next =3D work; + + if (p->flags & PF_EXITING) + return; + + if (!zalloc_cpumask_var(&cpus, GFP_KERNEL)) + return; + + scoped_guard (cpus_read_lock) { + cpumask_copy(cpus, cpu_online_mask); + + for_each_cpu(cpu, cpus) { + /* XXX sched_cluster_active */ + struct sched_domain *sd =3D per_cpu(sd_llc, cpu); + unsigned long occ, m_occ =3D 0, a_occ =3D 0; + int m_cpu =3D -1, i; + + if (!sd) + continue; + + for_each_cpu(i, sched_domain_span(sd)) { + occ =3D fraction_mm_sched(cpu_rq(i), + per_cpu_ptr(mm->pcpu_sched, i)); + a_occ +=3D occ; + if (occ > m_occ) { + m_occ =3D occ; + m_cpu =3D i; + } + } + + /* + * Compare the accumulated occupancy of each LLC. The + * reason for using accumulated occupancy rather than average + * per CPU occupancy is that it works better in asymmetric LLC + * scenarios. + * For example, if there are 2 threads in a 4CPU LLC and 3 + * threads in an 8CPU LLC, it might be better to choose the one + * with 3 threads. However, this would not be the case if the + * occupancy is divided by the number of CPUs in an LLC (i.e., + * if average per CPU occupancy is used). + * Besides, NUMA balancing fault statistics behave similarly: + * the total number of faults per node is compared rather than + * the average number of faults per CPU. This strategy is also + * followed here. + */ + if (a_occ > m_a_occ) { + m_a_occ =3D a_occ; + m_a_cpu =3D m_cpu; + } + + if (llc_id(cpu) =3D=3D llc_id(mm->mm_sched_cpu)) + curr_m_a_occ =3D a_occ; + + cpumask_andnot(cpus, cpus, sched_domain_span(sd)); + } + } + + if (m_a_occ > (2 * curr_m_a_occ)) { + /* + * Avoid switching mm_sched_cpu too fast. + * The reason to choose 2X is because: + * 1. It is better to keep the preferred LLC stable, + * rather than changing it frequently and cause migrations + * 2. 2X means the new preferred LLC has at least 1 more + * busy CPU than the old one(200% vs 100%, eg) + * 3. 2X is chosen based on test results, as it delivers + * the optimal performance gain so far. + */ + mm->mm_sched_cpu =3D m_a_cpu; + } + + free_cpumask_var(cpus); +} + +void init_sched_mm(struct task_struct *p) +{ + struct callback_head *work =3D &p->cache_work; + + init_task_work(work, task_cache_work); + work->next =3D work; +} + +#else + +static inline void account_mm_sched(struct rq *rq, struct task_struct *p, + s64 delta_exec) { } + +void init_sched_mm(struct task_struct *p) { } + +static void task_tick_cache(struct rq *rq, struct task_struct *p) { } + +#endif + /* * Used by other classes to account runtime. 
*/ @@ -13124,6 +13380,8 @@ static void task_tick_fair(struct rq *rq, struct ta= sk_struct *curr, int queued) if (static_branch_unlikely(&sched_numa_balancing)) task_tick_numa(rq, curr); =20 + task_tick_cache(rq, curr); + update_misfit_status(curr, rq); check_update_overutilized_status(task_rq(curr)); =20 diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index adfb6e3409d7..84118b522f22 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1194,6 +1194,12 @@ struct rq { u64 clock_pelt_idle_copy; u64 clock_idle_copy; #endif +#ifdef CONFIG_SCHED_CACHE + raw_spinlock_t cpu_epoch_lock ____cacheline_aligned; + u64 cpu_runtime; + unsigned long cpu_epoch; + unsigned long cpu_epoch_next; +#endif =20 atomic_t nr_iowait; =20 @@ -3819,6 +3825,8 @@ static inline void task_tick_mm_cid(struct rq *rq, st= ruct task_struct *curr) { } static inline void init_sched_mm_cid(struct task_struct *t) { } #endif /* !CONFIG_SCHED_MM_CID */ =20 +extern void init_sched_mm(struct task_struct *p); + extern u64 avg_vruntime(struct cfs_rq *cfs_rq); extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se); static inline --=20 2.32.0 From nobody Fri Dec 19 19:37:42 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8872B2EC0A3 for ; Wed, 3 Dec 2025 23:01:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802884; cv=none; b=PIVYWfHNGhpYcL5pUf5pbJV6z5GC4MufyMLaT00/IZT2eIAKxBzqzRglsyVDKa18ZuvGOOBF6720BmFO1QjbQTlm++JQNaJ2Li4EQo87RGn9XE96gbHXFQW46Ye00LdP+tH7Hh5mDSD6E7sACuXB9wl4PappMcJ/np+rPkSv+fk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802884; c=relaxed/simple; bh=1tsEZhdWTsEDcQ9RmMyka/N/6UwyydH6Z8nvicoX744=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version:Content-Type; b=HIfoZU9H/SZm0t6eE4dquYqikhNAFvY4+BXlcqSIZ3CtUZjOzIUSOC63YZp9YVMZHXi1YQfdjTLmXM4JflgdOMpsYGcmIdM9y97XnpuLltYZndJJ3UMie+BQAS7WTzwavGBbWlwvukQFWzaAt18tTAj+n7TvfZUbdaq3Hd/PwnQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=gY2DTyL8; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="gY2DTyL8" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802881; x=1796338881; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=1tsEZhdWTsEDcQ9RmMyka/N/6UwyydH6Z8nvicoX744=; b=gY2DTyL8yWnM8kjFl12irX509n24BDz4iFKCqM6WCNUCLXRN5a5IlNUP CfYI2+/YpAT4bu6uPNEPMLhPBFM2XD4LK26owQYwXoYEFxXYOPyRzMCCr rISEhzC11YficDTuxwWe3QvPX3HaXsnsqXtK9HLG/hiT6NfkxrHYuu33P 2QVChiY0MqYwc1nvL417RDFrqZbCy7kRQLG02T5nK00USUuGMRvgZv+U3 gt7oM5XlbDtNyyU+5sVU7KIViaRsZSfklkuYRaOOMQ39LYUdIFQ+Ue6G0 EAocEYO+P59FhkDZmjjHTJ9I3dlRH+Fcb/w/MBdqObwG/r+XHEjXGxmQZ g==; X-CSE-ConnectionGUID: leGPfNk6R8KUwcSASjrrFg== X-CSE-MsgGUID: 
gXlnURSyTm+Cie+BIy/27w== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136204" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136204" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:21 -0800 X-CSE-ConnectionGUID: R/PsDZqXSOeZVekLwfPG7Q== X-CSE-MsgGUID: gEHdD8uJSdmMqZ2X+NMi/w== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763741" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:20 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 02/23] sched/cache: Record per-LLC utilization to guide cache-aware scheduling decisions Date: Wed, 3 Dec 2025 15:07:21 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable From: Chen Yu When a system becomes busy and a process=E2=80=99s preferred LLC is saturated with too many threads, tasks within that LLC migrate frequently. These in LLC migrations introduce latency and degrade performance. To avoid this, task aggregation should be suppressed when the preferred LLC is overloaded, which requires a metric to indicate LLC utilization. Record per LLC utilization/cpu capacity during periodic load balancing. These statistics will be used in later patches to decide whether tasks should be aggregated into their preferred LLC. Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: Refine the comments in record_sg_llc_stats().(Peter Zijlstra). 
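
[ Editor's sketch, not part of this patch: the util_avg/capacity values
  recorded here are only written in this patch; the policy that reads
  them comes later in the series. A hypothetical consumer sitting next
  to get_llc_stats() in kernel/sched/fair.c might look roughly like the
  snippet below. dst_llc_has_headroom() and the hard-coded 50% threshold
  (matching the llc_overload_pct default introduced in a later patch)
  are assumptions for illustration only. ]

static bool dst_llc_has_headroom(int dst_cpu)
{
	unsigned long util, cap;

	/*
	 * get_llc_stats() dereferences sd_llc_shared, so the caller must
	 * hold rcu_read_lock(), as the load-balancing paths do. The values
	 * are filled in by record_sg_llc_stats() during periodic balance.
	 */
	if (!get_llc_stats(dst_cpu, &util, &cap))
		return true;	/* no stats yet: don't restrict placement */

	/* Treat the destination LLC as busy once it passes ~50% utilization. */
	return util * 100 < cap * 50;
}
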
include/linux/sched/topology.h | 4 ++ kernel/sched/fair.c | 69 ++++++++++++++++++++++++++++++++++ 2 files changed, 73 insertions(+) diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index bbcfdf12aa6e..0ba4697d74ba 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -68,6 +68,10 @@ struct sched_domain_shared { atomic_t nr_busy_cpus; int has_idle_cores; int nr_idle_scan; +#ifdef CONFIG_SCHED_CACHE + unsigned long util_avg; + unsigned long capacity ____cacheline_aligned_in_smp; +#endif }; =20 struct sched_domain { diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index cb82f558dc5b..b9f336300f14 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9622,6 +9622,29 @@ static inline int task_is_ineligible_on_dst_cpu(stru= ct task_struct *p, int dest_ return 0; } =20 +#ifdef CONFIG_SCHED_CACHE +/* Called from load balancing paths with rcu_read_lock held */ +static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util, + unsigned long *cap) +{ + struct sched_domain_shared *sd_share; + + sd_share =3D rcu_dereference(per_cpu(sd_llc_shared, cpu)); + if (!sd_share) + return false; + + *util =3D READ_ONCE(sd_share->util_avg); + *cap =3D READ_ONCE(sd_share->capacity); + + return true; +} +#else +static inline bool get_llc_stats(int cpu, unsigned long *util, + unsigned long *cap) +{ + return false; +} +#endif /* * can_migrate_task - may task p from runqueue rq be migrated to this_cpu? */ @@ -10592,6 +10615,51 @@ sched_reduced_capacity(struct rq *rq, struct sched= _domain *sd) return check_cpu_capacity(rq, sd); } =20 +#ifdef CONFIG_SCHED_CACHE +/* + * Record the statistics for this scheduler group for later + * use. These values guide load balancing on aggregating tasks + * to a LLC. + */ +static void record_sg_llc_stats(struct lb_env *env, + struct sg_lb_stats *sgs, + struct sched_group *group) +{ + struct sched_domain_shared *sd_share; + + if (!sched_cache_enabled() || env->idle =3D=3D CPU_NEWLY_IDLE) + return; + + /* Only care about sched domain spanning multiple LLCs */ + if (env->sd->child !=3D rcu_dereference(per_cpu(sd_llc, env->dst_cpu))) + return; + + /* + * At this point we know this group spans a LLC domain. + * Record the statistic of this group in its corresponding + * shared LLC domain. + * Note: sd_share cannot be obtained via sd->child->shared, because + * it refers to the domain that covers the local group, while + * sd_share could represent any of the LLC group. + */ + sd_share =3D rcu_dereference(per_cpu(sd_llc_shared, + cpumask_first(sched_group_span(group)))); + if (!sd_share) + return; + + if (READ_ONCE(sd_share->util_avg) !=3D sgs->group_util) + WRITE_ONCE(sd_share->util_avg, sgs->group_util); + + if (unlikely(READ_ONCE(sd_share->capacity) !=3D sgs->group_capacity)) + WRITE_ONCE(sd_share->capacity, sgs->group_capacity); +} +#else +static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_st= ats *sgs, + struct sched_group *group) +{ +} +#endif + /** * update_sg_lb_stats - Update sched_group's statistics for load balancing. * @env: The load balancing environment. 
@@ -10681,6 +10749,7 @@ static inline void update_sg_lb_stats(struct lb_env= *env, =20 sgs->group_type =3D group_classify(env->sd->imbalance_pct, group, sgs); =20 + record_sg_llc_stats(env, sgs, group); /* Computing avg_load makes sense only when group is overloaded */ if (sgs->group_type =3D=3D group_overloaded) sgs->avg_load =3D (sgs->group_load * SCHED_CAPACITY_SCALE) / --=20 2.32.0 From nobody Fri Dec 19 19:37:42 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2F2692EBDDE for ; Wed, 3 Dec 2025 23:01:23 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802885; cv=none; b=fboZZkFKPl6gqpHDoF2b6zbyblNamhVu+FcjT54t3oU8vxsb1XXezAqbDtyJgvQY5nilFQH3AKBGOohsQ/SQ3tX2mRk+BSCtjeqUEVqOw4w0dDc2wtmgFtlHa6V/L30IDsIjeiViMUZM4y4AiA82fvOBsu4+NJQNRAWoaUu83no= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802885; c=relaxed/simple; bh=TakEXE1LpDhxRe/Kb7GWIrlVFYabIDFNwz7qIMJeAzA=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=lWQIStOFn4iy99stFlHV/qSBEi3k7WL/GF8q0g3QeYxyAInDLMtgRyHdyj4lgpwV4+hcrGelSaLn9GQ314YsxP62kdg4igNnwsJ5I/UGLtE/m0W5/zOTgeJYpf5nNjxi042Eu8UJR3sDuMQmXljn/+2COvTOKDQkes+q8dJg4fs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=ZJw0lk7W; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="ZJw0lk7W" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802883; x=1796338883; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=TakEXE1LpDhxRe/Kb7GWIrlVFYabIDFNwz7qIMJeAzA=; b=ZJw0lk7W6RTQQr7pUyzeAA7+tFRn5rwcdkgzS49IJ6otxSXwAzwDWZIh 72+xVH8b/09ZAgA4A4sjEOCcav+jAPzfD2L3N7AxSkmW/F8BHhBoUD3JQ QbRstLbqNMnMwfrcQ+qBeU1Q3VwTeXm0rmxciTrI2u6z3GCHX79/Bxc9Y tid45au2Oifch9e3/2xq9ljpUEYKZAVIVVPqiF3n86ssLv/OdDy75IUHo 67RTdQeGc20OckklfmpRjpvC7cCT1mZKRlid3w67UBs6EEbQgCGzqXjOi NdatFPNJvaFIWKoBtqpyQd9yFecmVzXENUGCr745w3Jqa3QUeXJGyO4fH g==; X-CSE-ConnectionGUID: /fs42l2aRamlkF2vGhwf9A== X-CSE-MsgGUID: 2SoBy6EYTSqmAsupZBGzKg== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136230" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136230" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:22 -0800 X-CSE-ConnectionGUID: O845bRyGQ8Wd6MpeSkE96g== X-CSE-MsgGUID: Cz70j2EQQ0GHWMf5pEgBBw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763752" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:22 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . 
Shenoy" , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 03/23] sched/cache: Introduce helper functions to enforce LLC migration policy Date: Wed, 3 Dec 2025 15:07:22 -0800 Message-Id: <12e90c8c26c690b40e48cc1e03c785f2f99fafa8.1764801860.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu Cache-aware scheduling aggregates threads onto their preferred LLC, mainly through load balancing. When the preferred LLC becomes saturated, more threads are still placed there, increasing latency. A mechanism is needed to limit aggregation so that the preferred LLC does not become overloaded. Introduce helper functions can_migrate_llc() and can_migrate_llc_task() to enforce the LLC migration policy: 1. Aggregate a task to its preferred LLC if both source and destination LLCs are not too busy (<50% utilization), or if doing so will not leave the preferred LLC much more imbalanced than the non-preferred one (>20% utilization difference, similar to imbalance_pct of the LLC domain). 2. Allow moving a task from overloaded preferred LLC to a non preferred LLC if this will not cause the non preferred LLC to become too imbalanced to cause a later migration back. 3. If both LLCs are too busy, let the generic load balance to spread the tasks. Further (hysteresis)action could be taken in the future to prevent tasks from being migrated into and out of the preferred LLC frequently (back and forth): the threshold for migrating a task out of its preferred LLC should be higher than that for migrating it into the LLC. Since aggregation tends to make the preferred LLC busier than others, the imbalance tolerance is controlled by llc_imb_pct. If set to 0, tasks may still aggregate to the preferred LLC as long as it is not more utilized than the source LLC, preserving the preference. Co-developed-by: Tim Chen Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: No change. kernel/sched/fair.c | 153 +++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 5 ++ 2 files changed, 158 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b9f336300f14..710ed9943d27 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1205,6 +1205,9 @@ static s64 update_se(struct rq *rq, struct sched_enti= ty *se) #define EPOCH_PERIOD (HZ / 100) /* 10 ms */ #define EPOCH_LLC_AFFINITY_TIMEOUT 5 /* 50 ms */ =20 +__read_mostly unsigned int llc_overload_pct =3D 50; +__read_mostly unsigned int llc_imb_pct =3D 20; + static int llc_id(int cpu) { if (cpu < 0) @@ -9623,6 +9626,27 @@ static inline int task_is_ineligible_on_dst_cpu(stru= ct task_struct *p, int dest_ } =20 #ifdef CONFIG_SCHED_CACHE +/* + * The margin used when comparing LLC utilization with CPU capacity. + * Parameter llc_overload_pct determines the LLC load level where + * active LLC aggregation is done. + * Derived from fits_capacity(). 
+ * + * (default: ~50%) + */ +#define fits_llc_capacity(util, max) \ + ((util) * 100 < (max) * llc_overload_pct) + +/* + * The margin used when comparing utilization. + * is 'util1' noticeably greater than 'util2' + * Derived from capacity_greater(). + * Bias is in perentage. + */ +/* Allows dst util to be bigger than src util by up to bias percent */ +#define util_greater(util1, util2) \ + ((util1) * 100 > (util2) * (100 + llc_imb_pct)) + /* Called from load balancing paths with rcu_read_lock held */ static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util, unsigned long *cap) @@ -9638,6 +9662,135 @@ static __maybe_unused bool get_llc_stats(int cpu, u= nsigned long *util, =20 return true; } + +/* + * Decision matrix according to the LLC utilization. To + * decide whether we can do task aggregation across LLC. + * + * By default, 50% is the threshold to treat the LLC as busy, + * and 20% is the utilization imbalance percentage to decide + * if the preferred LLC is busier than the non-preferred LLC. + * The hysteresis is used to avoid task bouncing between the + * preferred LLC and the non-preferred LLC. + * + * 1. moving towards the preferred LLC, dst is the preferred + * LLC, src is not. + * + * src \ dst 30% 40% 50% 60% + * 30% Y Y Y N + * 40% Y Y Y Y + * 50% Y Y G G + * 60% Y Y G G + * + * 2. moving out of the preferred LLC, src is the preferred + * LLC, dst is not: + * + * src \ dst 30% 40% 50% 60% + * 30% N N N N + * 40% N N N N + * 50% N N G G + * 60% Y N G G + * + * src : src_util + * dst : dst_util + * Y : Yes, migrate + * N : No, do not migrate + * G : let the Generic load balance to even the load. + * + * The intention is that if both LLCs are quite busy, cache aware + * load balance should not be performed, and generic load balance + * should take effect. However, if one is busy and the other is not, + * the preferred LLC capacity(50%) and imbalance criteria(20%) should + * be considered to determine whether LLC aggregation should be + * performed to bias the load towards the preferred LLC. + */ + +/* migration decision, 3 states are orthogonal. */ +enum llc_mig { + mig_forbid =3D 0, /* N: Don't migrate task, respect LLC preference */ + mig_llc, /* Y: Do LLC preference based migration */ + mig_unrestricted /* G: Don't restrict generic load balance migration */ +}; + +/* + * Check if task can be moved from the source LLC to the + * destination LLC without breaking cache aware preferrence. + * src_cpu and dst_cpu are arbitrary CPUs within the source + * and destination LLCs, respectively. + */ +static enum llc_mig can_migrate_llc(int src_cpu, int dst_cpu, + unsigned long tsk_util, + bool to_pref) +{ + unsigned long src_util, dst_util, src_cap, dst_cap; + + if (!get_llc_stats(src_cpu, &src_util, &src_cap) || + !get_llc_stats(dst_cpu, &dst_util, &dst_cap)) + return mig_unrestricted; + + if (!fits_llc_capacity(dst_util, dst_cap) && + !fits_llc_capacity(src_util, src_cap)) + return mig_unrestricted; + + src_util =3D src_util < tsk_util ? 0 : src_util - tsk_util; + dst_util =3D dst_util + tsk_util; + if (to_pref) { + /* + * llc_imb_pct is the imbalance allowed between + * preferred LLC and non-preferred LLC. + * Don't migrate if we will get preferred LLC too + * heavily loaded and if the dest is much busier + * than the src, in which case migration will + * increase the imbalance too much. 
+ */ + if (!fits_llc_capacity(dst_util, dst_cap) && + util_greater(dst_util, src_util)) + return mig_forbid; + } else { + /* + * Don't migrate if we will leave preferred LLC + * too idle, or if this migration leads to the + * non-preferred LLC falls within sysctl_aggr_imb percent + * of preferred LLC, leading to migration again + * back to preferred LLC. + */ + if (fits_llc_capacity(src_util, src_cap) || + !util_greater(src_util, dst_util)) + return mig_forbid; + } + return mig_llc; +} + +/* + * Check if task p can migrate from source LLC to + * destination LLC in terms of cache aware load balance. + */ +static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int d= st_cpu, + struct task_struct *p) +{ + struct mm_struct *mm; + bool to_pref; + int cpu; + + mm =3D p->mm; + if (!mm) + return mig_unrestricted; + + cpu =3D mm->mm_sched_cpu; + if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu)) + return mig_unrestricted; + + if (cpus_share_cache(dst_cpu, cpu)) + to_pref =3D true; + else if (cpus_share_cache(src_cpu, cpu)) + to_pref =3D false; + else + return mig_unrestricted; + + return can_migrate_llc(src_cpu, dst_cpu, + task_util(p), to_pref); +} + #else static inline bool get_llc_stats(int cpu, unsigned long *util, unsigned long *cap) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 84118b522f22..bf72c5bab506 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2828,6 +2828,11 @@ extern unsigned int sysctl_numa_balancing_scan_perio= d_max; extern unsigned int sysctl_numa_balancing_scan_size; extern unsigned int sysctl_numa_balancing_hot_threshold; =20 +#ifdef CONFIG_SCHED_CACHE +extern unsigned int llc_overload_pct; +extern unsigned int llc_imb_pct; +#endif + #ifdef CONFIG_SCHED_HRTICK =20 /* --=20 2.32.0 From nobody Fri Dec 19 19:37:42 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 55E9C2EC08D for ; Wed, 3 Dec 2025 23:01:24 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802886; cv=none; b=d3VrPdnnjHo1v15INzZi2Be9GCCRZHIzY8RvdjoDE/lVfQN7C6RgefM63jeAgMs+Ej4xBAgNM48bikZgcfBK97s516BGyLXX1Rbvhsn/lxdjOTLJb7/BzUSsXmqizKiXSV4Q40vVu+4KUJUTuTrw0EcRJX7axQAupxl66/Njl7g= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802886; c=relaxed/simple; bh=/WXShEpYiDAFPDra61vUPdbNcgE+VqMlav+UUM59jU0=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=sp6UO1OW6Q3DioPA4TyMAxm2w7jZEWfXn+BecCi+DY63bhyHNOAdo2gxE9qPcZ4H/AG5K6vG0sVgNdh5TPmn2YDZ1M3oPRXJYAPeKE66XGC3smKX35V4ctG4LeLd8SIPZYPGBwl8SDEjENvTH1Cw9AGh2YoAZb6Q6CfS4bRt+vY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=mJ6yn4qm; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="mJ6yn4qm" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; 
s=Intel; t=1764802884; x=1796338884; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=/WXShEpYiDAFPDra61vUPdbNcgE+VqMlav+UUM59jU0=; b=mJ6yn4qmNCzGpuUPMZdx+lsUqY/Y8q397TD5tze5hB735PCmFim3TtR3 Eh74z+kUDoOPtNaJnMct+g67IgKwnq6+WYRbc+f3oEEw9Wg1Gcg9yN7oU vI5Oubm8s7zVFVo1CwCylUT7AAgUyeA+NaPz/BoikrttCBobaJqnnubeC HmGkKxv21UFMqlb7bdh2Dv1ZUBuQd/5iPTCr2He8Z4My1BxTJHc0KlROt IrrMfarEIQ6kjL275GsASGznmrL05FEBJGY2at3hHLlbpnBR+lPPkEK0Y B/H+e/fK9u8hElcLfWPp6Axh3PPWmX2TiXZI/s6f1Be/ZF/FgJXPpYRSc Q==; X-CSE-ConnectionGUID: G2tkFvPIT6SY1+ZRXOxXBw== X-CSE-MsgGUID: uYGT49/IQCatA5I80r2gog== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136249" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136249" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:23 -0800 X-CSE-ConnectionGUID: uiDsRffYTdabgXsA6tZowg== X-CSE-MsgGUID: ym+MS4XuQPSAYB5T1Atjpw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763756" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:23 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 04/23] sched/cache: Make LLC id continuous Date: Wed, 3 Dec 2025 15:07:23 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Introduce an index mapping between CPUs and their LLCs. This provides a continuous per LLC index needed for cache-aware load balancing in later patches. The existing per_cpu llc_id usually points to the first CPU of the LLC domain, which is sparse and unsuitable as an array index. Using llc_id directly would waste memory. With the new mapping, CPUs in the same LLC share a continuous id: per_cpu(llc_id, CPU=3D0...15) =3D 0 per_cpu(llc_id, CPU=3D16...31) =3D 1 per_cpu(llc_id, CPU=3D32...47) =3D 2 ... Co-developed-by: Chen Yu Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: Convert the static LLC id to be allocated sequentially as LLCs are discovered, and replace the old sd_llc_id. 
(Peter Zijlstra) kernel/sched/fair.c | 9 ++++++- kernel/sched/sched.h | 1 + kernel/sched/topology.c | 60 +++++++++++++++++++++++++++++++++++++++-- 3 files changed, 67 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 710ed9943d27..0a3918269906 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1210,10 +1210,17 @@ __read_mostly unsigned int llc_imb_pct = =3D 20; =20 static int llc_id(int cpu) { + int llc; + if (cpu < 0) return -1; =20 - return per_cpu(sd_llc_id, cpu); + llc =3D per_cpu(sd_llc_id, cpu); + /* avoid race with cpu hotplug */ + if (unlikely(llc >=3D max_llcs)) + return -1; + + return llc; } =20 void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_s= ched) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index bf72c5bab506..728737641847 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2075,6 +2075,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_= cpucapacity); =20 extern struct static_key_false sched_asym_cpucapacity; extern struct static_key_false sched_cluster_active; +extern int max_llcs; =20 static __always_inline bool sched_asym_cpucap_active(void) { diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 444bdfdab731..f25d950ab015 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -17,6 +17,8 @@ void sched_domains_mutex_unlock(void) mutex_unlock(&sched_domains_mutex); } =20 +int max_llcs; + /* Protected by sched_domains_mutex: */ static cpumask_var_t sched_domains_tmpmask; static cpumask_var_t sched_domains_tmpmask2; @@ -668,6 +670,55 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cp= ucapacity); DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity); DEFINE_STATIC_KEY_FALSE(sched_cluster_active); =20 +/* + * Assign continuous llc id for the CPU, and return + * the assigned llc id. + */ +static int update_llc_id(struct sched_domain *sd, + int cpu) +{ + int id =3D per_cpu(sd_llc_id, cpu), i; + + if (id >=3D 0) + return id; + + if (sd) { + /* Look for any assigned id and reuse it.*/ + for_each_cpu(i, sched_domain_span(sd)) { + id =3D per_cpu(sd_llc_id, i); + + if (id >=3D 0) { + per_cpu(sd_llc_id, cpu) =3D id; + return id; + } + } + } + + /* + * When 1. there is no id assigned to this LLC domain, + * or 2. the sd is NULL, we reach here. + * Consider the following scenario, + * CPU0~CPU95 are in the node0, CPU96~CPU191 are + * in the node1. During bootup, maxcpus=3D96 is + * appended. + * case 1: When running cpu_attach_domain(CPU24) + * during boot up, CPU24 is the first CPU in its + * non-NULL LLC domain. However, + * its corresponding llc id has not been assigned yet. + * + * case 2: After boot up, the CPU100 is brought up + * via sysfs manually. As a result, CPU100 has only a + * Numa domain attached, because CPU100 is the only CPU + * of a sched domain, all its bottom domains are degenerated. + * The LLC domain pointer sd is NULL for CPU100. + * + * For both cases, we want to increase the number of LLCs. 
+ */ + per_cpu(sd_llc_id, cpu) =3D max_llcs++; + + return per_cpu(sd_llc_id, cpu); +} + static void update_top_cache_domain(int cpu) { struct sched_domain_shared *sds =3D NULL; @@ -677,14 +728,13 @@ static void update_top_cache_domain(int cpu) =20 sd =3D highest_flag_domain(cpu, SD_SHARE_LLC); if (sd) { - id =3D cpumask_first(sched_domain_span(sd)); size =3D cpumask_weight(sched_domain_span(sd)); sds =3D sd->shared; } =20 rcu_assign_pointer(per_cpu(sd_llc, cpu), sd); per_cpu(sd_llc_size, cpu) =3D size; - per_cpu(sd_llc_id, cpu) =3D id; + id =3D update_llc_id(sd, cpu); rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds); =20 sd =3D lowest_flag_domain(cpu, SD_CLUSTER); @@ -2488,6 +2538,12 @@ build_sched_domains(const struct cpumask *cpu_map, s= truct sched_domain_attr *att bool has_asym =3D false; bool has_cluster =3D false; =20 + /* first scan of LLCs */ + if (!max_llcs) { + for_each_possible_cpu(i) + per_cpu(sd_llc_id, i) =3D -1; + } + if (WARN_ON(cpumask_empty(cpu_map))) goto error; =20 --=20 2.32.0 From nobody Fri Dec 19 19:37:42 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 679562ECEBB for ; Wed, 3 Dec 2025 23:01:26 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802888; cv=none; b=ETXMSycIjg3hW2uD7ktvuDRCwlm80jzWlfuybxMLSJjuPv1gOLZC1i6pxE62EG9+cDFAU1hLySS0z9EjoSW7h+IC9WTpkMIZz2geJs1QP3R/eObNqU3OG+yETt/G54TGksleKQ7hmlJH6AIkTyDQ9XdCc+AMJOzQkCsvN6AteuA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802888; c=relaxed/simple; bh=0AOJ8UhIDlWuve34OwSELAi4hyIDL68J1uZ46Rj5j/U=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=oT6l76w5OE/CgwV2buKuyAjl0MI2Q/KFcNiA5tSmBm5YfGauRJZvP4km+gtrjR5EEwXVgaCsan/LhKN6+lL1MozMs4acvCaZOIR7MI0TH1a6DN/iL60iGgK73IOwTgFjrIfIZLKuBBoFD14Z4gbqwWYyV8VrRWfEVNe6RZksId4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=MYgWTb60; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="MYgWTb60" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802886; x=1796338886; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=0AOJ8UhIDlWuve34OwSELAi4hyIDL68J1uZ46Rj5j/U=; b=MYgWTb60T6yG49rQ3nLnjAfGEf6N3B3x0R1ujoF4MP+f6thBTMMmV5A2 6gtXPzCButviIBBCpY7AZSw3brie2XhnzEv9X/ke/XBPmw9iwTMQXM9o0 iuW5LJjdLixT+ECza7WcFjH4T9QTfvwhG/w9TZhOFFXAm15dszIkONvBa SXqv+2sjbXByYYFdX59mzr/UJBdZJP29/Qsoq52Bq39LKfBUjAIOaxdni O3Dd1ftGoYiiVuFKxIPrD6KHkaSzbffy0qzla2yFfiBHwoJt7cDfE6IuV V+N5yhbYcGH4NZwhO7yAb7il3S4WiOKkWeUjmgInRdyyz/X833IZzWB7j A==; X-CSE-ConnectionGUID: kjwyz+xTQ4WR9aHYObBogw== X-CSE-MsgGUID: zaCpH7h5Qxu4sNwXX6Krdg== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136266" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; 
d="scan'208";a="77136266" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:25 -0800 X-CSE-ConnectionGUID: alNeGVnjSJOwjLHIJ4OuIg== X-CSE-MsgGUID: 5cfoA8m6Su+MIcnd+RQoGg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763763" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:25 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 05/23] sched/cache: Assign preferred LLC ID to processes Date: Wed, 3 Dec 2025 15:07:24 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" With cache-aware scheduling enabled, each task is assigned a preferred LLC ID. This allows quick identification of the LLC domain where the task prefers to run, similar to numa_preferred_nid in NUMA balancing. Signed-off-by: Tim Chen --- Notes: v1->v2: Align preferred LLC with NUMA balancing's preferred node. include/linux/sched.h | 1 + init/init_task.c | 3 +++ kernel/sched/fair.c | 18 ++++++++++++++++++ 3 files changed, 22 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 278b529c91df..1ad46220cd04 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1408,6 +1408,7 @@ struct task_struct { =20 #ifdef CONFIG_SCHED_CACHE struct callback_head cache_work; + int preferred_llc; #endif =20 #ifdef CONFIG_RSEQ diff --git a/init/init_task.c b/init/init_task.c index a55e2189206f..44bae72b5b7d 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -191,6 +191,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = =3D { .numa_group =3D NULL, .numa_faults =3D NULL, #endif +#ifdef CONFIG_SCHED_CACHE + .preferred_llc =3D -1, +#endif #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS) .kasan_depth =3D 1, #endif diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0a3918269906..10cec83f65d5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1300,6 +1300,7 @@ void account_mm_sched(struct rq *rq, struct task_stru= ct *p, s64 delta_exec) struct mm_struct *mm =3D p->mm; struct mm_sched *pcpu_sched; unsigned long epoch; + int mm_sched_llc =3D -1; =20 if (!sched_cache_enabled()) return; @@ -1330,6 +1331,23 @@ void account_mm_sched(struct rq *rq, struct task_str= uct *p, s64 delta_exec) if (mm->mm_sched_cpu !=3D -1) mm->mm_sched_cpu =3D -1; } + + if (mm->mm_sched_cpu !=3D -1) { + mm_sched_llc =3D llc_id(mm->mm_sched_cpu); + +#ifdef CONFIG_NUMA_BALANCING + /* + * Don't assign preferred LLC if it + * conflicts with NUMA balancing. 
+ */ + if (p->numa_preferred_nid >=3D 0 && + cpu_to_node(mm->mm_sched_cpu) !=3D p->numa_preferred_nid) + mm_sched_llc =3D -1; +#endif + } + + if (p->preferred_llc !=3D mm_sched_llc) + p->preferred_llc =3D mm_sched_llc; } =20 static void task_tick_cache(struct rq *rq, struct task_struct *p) --=20 2.32.0 From nobody Fri Dec 19 19:37:42 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C93962EF652 for ; Wed, 3 Dec 2025 23:01:27 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802890; cv=none; b=EDsiu7g2BtXvvoS9BKwrirW/B8ldDhmwGPx+cdJzoxBtklhxCuicf7XZFi+5IO9eicj+U0q988drhlH0OJjM+IwUt0amTGbw3mfM6d+6WZDelOH8Kc3PIbWBuITzHpbg31UVRdkj3UEviuqp+uvpMTrssPknIugATiCNu3Bm+08= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802890; c=relaxed/simple; bh=VfUUqC84e+k4dM9OCiHr0qSll3wkyw96Z2hiwhlrd+g=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=bMgyCWF3/XMpBtns9xgAQbuvJYsQoxOLy5qU1v3Ure2zyH7eaHG4ZLbKyqgBn1NINjkU2O0RPcPn7whkPdiyLRm36oluEWQ4viCDhC3YxOj/EZYMjqKw4E92UmhMBk5j0NYcW2RvXkMIEQxCZjUg4qUDiMfwP1eraXWWdJgmvkk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=HkSBtET5; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="HkSBtET5" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802887; x=1796338887; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=VfUUqC84e+k4dM9OCiHr0qSll3wkyw96Z2hiwhlrd+g=; b=HkSBtET5tJuyrYVLfwF6tgrJB3jPRTx01PEveXBF1wIqsiOJxkXhzAOm sC1smgQCW8wgJyR4E1u9VSEyU2s5OeGIeEuC988/p/oKmWX8sR4t5I1+Q tI0jgAIHPovP+AIphgRpysIDP7uveWJciGMii/zPUANlnHxP4W7VRq2eJ sBFqpGeZy1Ve8fewNRoxQswiP1fA+sTe9iwHVjtYcP+1v4kzgt4NxJNt7 wXwMA6vcMf7L8X5pDnsHkNo+K4j1B34n8SEcNJu9+4em9z3ghkY3MGzod zaVcGH6lY2mH/znHiuVlkKaau6etkJB5XXnU6Zdt6/ZSkCkDGyN6SoMYU A==; X-CSE-ConnectionGUID: k1qC8aFmROqogX9M8c8KUg== X-CSE-MsgGUID: rOmGsn1SSNWPs0ITFZygdw== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136288" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136288" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:27 -0800 X-CSE-ConnectionGUID: cvzDuwr8RH6q0DcIt04zOw== X-CSE-MsgGUID: uxGA3PlMTN6liJURHzFh/g== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763775" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:26 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . 
Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 06/23] sched/cache: Track LLC-preferred tasks per runqueue Date: Wed, 3 Dec 2025 15:07:25 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" For each runqueue, track the number of tasks with an LLC preference and how many of them are running on their preferred LLC. This mirrors nr_numa_running and nr_preferred_running for NUMA balancing, and will be used by cache-aware load balancing in later patches. Signed-off-by: Tim Chen --- Notes: v1->v2: Invoke task_of() once and reuse its result afterwards. (Peter Zijlstra) Remove hacky reset_llc_stats() and introduce sched_llc_active f= lag to properly pair enqueue/dequeue statistics update (Peter Zijls= tra, K Prateek Nayak) include/linux/sched.h | 2 ++ init/init_task.c | 1 + kernel/sched/core.c | 5 ++++ kernel/sched/fair.c | 60 ++++++++++++++++++++++++++++++++++++++++--- kernel/sched/sched.h | 6 +++++ 5 files changed, 71 insertions(+), 3 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 1ad46220cd04..466ba8b7398c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1408,6 +1408,8 @@ struct task_struct { =20 #ifdef CONFIG_SCHED_CACHE struct callback_head cache_work; + /*the p is currently refcounted in a rq's preferred llc stats*/ + bool sched_llc_active; int preferred_llc; #endif =20 diff --git a/init/init_task.c b/init/init_task.c index 44bae72b5b7d..ee78837b0aa2 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -192,6 +192,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = =3D { .numa_faults =3D NULL, #endif #ifdef CONFIG_SCHED_CACHE + .sched_llc_active =3D false, .preferred_llc =3D -1, #endif #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e8bdf03a4b7f..48626c81ba8e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -531,6 +531,11 @@ void __trace_set_current_state(int state_value) } EXPORT_SYMBOL(__trace_set_current_state); =20 +int task_llc(const struct task_struct *p) +{ + return per_cpu(sd_llc_id, task_cpu(p)); +} + /* * Serialization rules: * diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 10cec83f65d5..d46a70a9d9fb 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1223,6 +1223,43 @@ static int llc_id(int cpu) return llc; } =20 +static void account_llc_enqueue(struct rq *rq, struct task_struct *p) +{ + int pref_llc; + + if (!sched_cache_enabled()) + return; + + pref_llc =3D p->preferred_llc; + if (pref_llc < 0) + return; + + rq->nr_llc_running++; + rq->nr_pref_llc_running +=3D (pref_llc =3D=3D task_llc(p)); + p->sched_llc_active =3D true; +} + +static void account_llc_dequeue(struct rq *rq, struct task_struct *p) +{ + int pref_llc; + + /* + * Borrow the uc_se->active from uclamp_rq_inc_id(), + * uclamp_rq_dec_id() to avoid the unbalanced calculation + * of rq statistics. 
+ */ + if (unlikely(!p->sched_llc_active)) + return; + + pref_llc =3D p->preferred_llc; + if (pref_llc < 0) + return; + + rq->nr_llc_running--; + rq->nr_pref_llc_running -=3D (pref_llc =3D=3D task_llc(p)); + p->sched_llc_active =3D false; +} + void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_s= ched) { unsigned long epoch; @@ -1294,6 +1331,8 @@ static unsigned long __no_profile fraction_mm_sched(s= truct rq *rq, struct mm_sch return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1); } =20 +static unsigned int task_running_on_cpu(int cpu, struct task_struct *p); + static inline void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec) { @@ -1346,8 +1385,13 @@ void account_mm_sched(struct rq *rq, struct task_str= uct *p, s64 delta_exec) #endif } =20 - if (p->preferred_llc !=3D mm_sched_llc) + /* task not on rq accounted later in account_entity_enqueue() */ + if (task_running_on_cpu(rq->cpu, p) && + p->preferred_llc !=3D mm_sched_llc) { + account_llc_dequeue(rq, p); p->preferred_llc =3D mm_sched_llc; + account_llc_enqueue(rq, p); + } } =20 static void task_tick_cache(struct rq *rq, struct task_struct *p) @@ -1475,6 +1519,10 @@ void init_sched_mm(struct task_struct *p) { } =20 static void task_tick_cache(struct rq *rq, struct task_struct *p) { } =20 +static void account_llc_enqueue(struct rq *rq, struct task_struct *p) {} + +static void account_llc_dequeue(struct rq *rq, struct task_struct *p) {} + #endif =20 /* @@ -3965,9 +4013,11 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct= sched_entity *se) { update_load_add(&cfs_rq->load, se->load.weight); if (entity_is_task(se)) { + struct task_struct *p =3D task_of(se); struct rq *rq =3D rq_of(cfs_rq); =20 - account_numa_enqueue(rq, task_of(se)); + account_numa_enqueue(rq, p); + account_llc_enqueue(rq, p); list_add(&se->group_node, &rq->cfs_tasks); } cfs_rq->nr_queued++; @@ -3978,7 +4028,11 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct= sched_entity *se) { update_load_sub(&cfs_rq->load, se->load.weight); if (entity_is_task(se)) { - account_numa_dequeue(rq_of(cfs_rq), task_of(se)); + struct task_struct *p =3D task_of(se); + struct rq *rq =3D rq_of(cfs_rq); + + account_numa_dequeue(rq, p); + account_llc_dequeue(rq, p); list_del_init(&se->group_node); } cfs_rq->nr_queued--; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 728737641847..ee8b70647835 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1126,6 +1126,10 @@ struct rq { unsigned int nr_preferred_running; unsigned int numa_migrate_on; #endif +#ifdef CONFIG_SCHED_CACHE + unsigned int nr_pref_llc_running; + unsigned int nr_llc_running; +#endif #ifdef CONFIG_NO_HZ_COMMON unsigned long last_blocked_load_update_tick; unsigned int has_blocked_load; @@ -1980,6 +1984,8 @@ init_numa_balancing(u64 clone_flags, struct task_stru= ct *p) =20 #endif /* !CONFIG_NUMA_BALANCING */ =20 +int task_llc(const struct task_struct *p); + static inline void queue_balance_callback(struct rq *rq, struct balance_callback *head, --=20 2.32.0 From nobody Fri Dec 19 19:37:42 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7F2A72F0696 for ; Wed, 3 Dec 2025 23:01:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802892; 
cv=none; b=JuI4HP7FPjUZRRvIF57U5a+nyKFVaejSLBjOwb2o4K+dyMy+TzvS6alNai1tmhDlx/F2kpTdrbKJxXsp0ye0xTv9vWh98FuHcXDXimNg3p+EZ0AClnIocNRkMFznzOXiGUgsNO6KJzOsOmRV7MqRji4PoMn2fV9YYulhopDCdW0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802892; c=relaxed/simple; bh=Kgfm8ZrVAem+cuIFSErLp11pWO+uaVSLeCf068dctEI=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=oWgrj/+vmbD4ydPoKoPApP5RMU0UBhjF4mxsiADMVL/t5AARqr//6C8rqPkshWdzhhrhMPF1AzqYud7ZATo+YBem2D9OjWwAWcvEU+adG0BNbDeKX0F/tFC7FpYkxBtH1K1PhGVx8OIwbNowGJZ5W0OZkvMWwyvk09t3vXbHMn4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=Ff+wBHml; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="Ff+wBHml" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802889; x=1796338889; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Kgfm8ZrVAem+cuIFSErLp11pWO+uaVSLeCf068dctEI=; b=Ff+wBHmlGI9Ls+hPQ/icfiRSQpZE9xFA2dMFUkAvN4HoLDq9rsPxZPeZ 8VRONCVnKKzfdp0/tx6ByohayUgQnukEiUM/5FG80edcOUwn8pLvcV6CD rsakyGnOPLHSStQkG1+f0q6DnjhqobEUdJaywwMsE54fftDticAbLprId 3bhB2AwAPJQjK37rs0/N96in+m4FjW7qil9FvPJrQKe2CXx6Vw8vc05XH UOnoKjT+4VoaXotKSh3uNxjPZTKFSxLyHcD1a3z71R7y9pyahaHenJnCZ 3UkyBEcsW2m1c1Cx8k4IAc/bj/uxMr+zGfxYNNEZL+3nmX/2zLcKYH7UG w==; X-CSE-ConnectionGUID: XVdRsMs0TMKO/Xjz8IoNBA== X-CSE-MsgGUID: 8lt8Jb1nTZqY7huRsHsSQw== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136318" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136318" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:28 -0800 X-CSE-ConnectionGUID: HuS/ZH/YT/Cjm+dSF3UiRw== X-CSE-MsgGUID: YDDbEJCdQwCGXEo7YEaNTQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763787" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:28 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . 
Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 07/23] sched/cache: Introduce per runqueue task LLC preference counter Date: Wed, 3 Dec 2025 15:07:26 -0800 Message-Id: <63091f7ca7bb473fbc176af86a87d27a07a6e149.1764801860.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Each runqueue is assigned an array where each element tracks the number of tasks preferring a given LLC, indexed from 0 to max_llcs - 1. For example, rq->nr_pref_llc[3] =3D 2 signifies that there are 2 tasks on this runqueue which prefer to run within LLC3. The load balancer can use this information to identify busy runqueues and migrate tasks to their preferred LLC domains. This array will be reallocated at runtime if the number of LLCs increases due to CPU hotplug. Only extending the buffer(rather than shrinking it) is supported to simplify the implementation. Introduce the buffer allocation mechanism, and the statistics will be calculated in the subsequent patch. Co-developed-by: Chen Yu Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: Remove static allocation of per runqueue LLC preference arrays. Allocate array size to the actual number of LLCs online. (Peter Zij= lstra, Madadi Vineeth Reddy) kernel/sched/core.c | 1 + kernel/sched/sched.h | 1 + kernel/sched/topology.c | 117 +++++++++++++++++++++++++++++++++++++++- 3 files changed, 118 insertions(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 48626c81ba8e..ce533dc485f5 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -8800,6 +8800,7 @@ void __init sched_init(void) #ifdef CONFIG_SCHED_CACHE raw_spin_lock_init(&rq->cpu_epoch_lock); rq->cpu_epoch_next =3D jiffies; + rq->nr_pref_llc =3D NULL; #endif =20 zalloc_cpumask_var_node(&rq->scratch_mask, GFP_KERNEL, cpu_to_node(i)); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index ee8b70647835..8f2a779825e4 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1129,6 +1129,7 @@ struct rq { #ifdef CONFIG_SCHED_CACHE unsigned int nr_pref_llc_running; unsigned int nr_llc_running; + unsigned int *nr_pref_llc; #endif #ifdef CONFIG_NO_HZ_COMMON unsigned long last_blocked_load_update_tick; diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index f25d950ab015..d583399fc6a1 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -17,8 +17,121 @@ void sched_domains_mutex_unlock(void) mutex_unlock(&sched_domains_mutex); } =20 +/* the number of max LLCs being detected */ +static int new_max_llcs; +/* the current number of max LLCs */ int max_llcs; =20 +#ifdef CONFIG_SCHED_CACHE + +static unsigned int *alloc_new_pref_llcs(unsigned int *old, unsigned int *= *gc) +{ + unsigned int *new =3D NULL; + + new =3D kcalloc(new_max_llcs, sizeof(unsigned int), + GFP_KERNEL | __GFP_NOWARN); + + if (!new) { + *gc =3D NULL; + } else { + /* + * Place old entry in garbage collector + * for later disposal. 
+ */ + *gc =3D old; + } + return new; +} + +static void populate_new_pref_llcs(unsigned int *old, unsigned int *new) +{ + int i; + + if (!old) + return; + + for (i =3D 0; i < max_llcs; i++) + new[i] =3D old[i]; +} + +static int resize_llc_pref(void) +{ + unsigned int *__percpu *tmp_llc_pref; + int i, ret =3D 0; + + if (new_max_llcs <=3D max_llcs) + return 0; + + /* + * Allocate temp percpu pointer for old llc_pref, + * which will be released after switching to the + * new buffer. + */ + tmp_llc_pref =3D alloc_percpu_noprof(unsigned int *); + if (!tmp_llc_pref) + return -ENOMEM; + + for_each_present_cpu(i) + *per_cpu_ptr(tmp_llc_pref, i) =3D NULL; + + /* + * Resize the per rq nr_pref_llc buffer and + * switch to this new buffer. + */ + for_each_present_cpu(i) { + struct rq_flags rf; + unsigned int *new; + struct rq *rq; + + rq =3D cpu_rq(i); + new =3D alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i= )); + if (!new) { + ret =3D -ENOMEM; + + goto release_old; + } + + /* + * Locking rq ensures that rq->nr_pref_llc values + * don't change with new task enqueue/dequeue + * when we repopulate the newly enlarged array. + */ + rq_lock_irqsave(rq, &rf); + populate_new_pref_llcs(rq->nr_pref_llc, new); + rq->nr_pref_llc =3D new; + rq_unlock_irqrestore(rq, &rf); + } + +release_old: + /* + * Load balance is done under rcu_lock. + * Wait for load balance before and during resizing to + * be done. They may refer to old nr_pref_llc[] + * that hasn't been resized. + */ + synchronize_rcu(); + for_each_present_cpu(i) + kfree(*per_cpu_ptr(tmp_llc_pref, i)); + + free_percpu(tmp_llc_pref); + + /* succeed and update */ + if (!ret) + max_llcs =3D new_max_llcs; + + return ret; +} + +#else + +static int resize_llc_pref(void) +{ + max_llcs =3D new_max_llcs; + return 0; +} + +#endif + /* Protected by sched_domains_mutex: */ static cpumask_var_t sched_domains_tmpmask; static cpumask_var_t sched_domains_tmpmask2; @@ -714,7 +827,7 @@ static int update_llc_id(struct sched_domain *sd, * * For both cases, we want to increase the number of LLCs. 
*/ - per_cpu(sd_llc_id, cpu) =3D max_llcs++; + per_cpu(sd_llc_id, cpu) =3D new_max_llcs++; =20 return per_cpu(sd_llc_id, cpu); } @@ -2674,6 +2787,8 @@ build_sched_domains(const struct cpumask *cpu_map, st= ruct sched_domain_attr *att if (has_cluster) static_branch_inc_cpuslocked(&sched_cluster_active); =20 + resize_llc_pref(); + if (rq && sched_debug_verbose) pr_info("root domain span: %*pbl\n", cpumask_pr_args(cpu_map)); =20 --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D24952F0C5B for ; Wed, 3 Dec 2025 23:01:30 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802893; cv=none; b=oh6ql8wRtQTKo8nnK9dUK9t3JsNVUN1SrqTTOLrpZpUDsIKZ+qt9qst5oOs9c5FDd2R9eecOFriCSP4q8iJw0WZIClfw/A2n3lz9QanZX0TndqedBRildmD/ptw2VXSsbXzzCrUFl3ehtEIBnQQqE0gyq5YyFY1waemEa1gZMq0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802893; c=relaxed/simple; bh=ubwbCrnLe+FpFs84fmQJ8NDFPPh85CKovnWcqS4HszM=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=Q4hmeOs7hnwqOE8JDGvxpGVeABvVS45aiDvLk6ZpSrPGuTfn+4YcfZc0AFuBMnvnutRPD41rCA1to3LTp3U/rg4Ky2sVe8bcd4xUTzxW+ljCc0tBYewYHhc60QRARoN5k0NGQJalWwDG5Ur5+u4g9f7uSgwIhh8HrXiFwlORSOs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=ngl+EBZ5; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="ngl+EBZ5" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802890; x=1796338890; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=ubwbCrnLe+FpFs84fmQJ8NDFPPh85CKovnWcqS4HszM=; b=ngl+EBZ5jKSYuF1GoScWtzUvUawQCvSqX6vXeypzCig51al5M6EFhEW6 6ZPkta/KDGc5tm3cLZAn+Q0r4sAGXevBcvNbEeEF94NWh0Q5o4Qi40yoE 6fENyQt6WsIYC5Biv3AXCHk/Ns+vA3D+5k8K971vxD5ci0G6jwAhua/Ip V4EYKsxzhnY36WL45Wqmck026Nhmf3XpLNt/wYGNgwSMFF7INI6pnMGxW qdO3IW9AZPldmpFj84igpzlIJMlsU2GHA5/5/K1uwnar4bbN3Va12Jz5l CXyXS2But8o6/1q/DIrjmb1ErBv9PahFCMwFzVlsm1m+7SCCYHQiWGEv4 A==; X-CSE-ConnectionGUID: 1JiC3BwvQxSVrDu/qUQ3kA== X-CSE-MsgGUID: lKtqqExhQ5+Xp1L40Y0AnQ== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136340" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136340" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:30 -0800 X-CSE-ConnectionGUID: lyR4BYigTY+QEoG74KEnKQ== X-CSE-MsgGUID: lE96dcq2TgepzkLPnZNnrg== X-Ironport-Invalid-End-Of-Message: True X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763795" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:30 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , 
"Gautham R . Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 08/23] sched/cache: Calculate the per runqueue task LLC preference Date: Wed, 3 Dec 2025 15:07:27 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Calculate the number of tasks' LLC preferences for each runqueue. This statistic is computed during task enqueue and dequeue operations, and is used by the cache-aware load balancing. Co-developed-by: Chen Yu Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: Split from previous patch for easier review. kernel/sched/fair.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d46a70a9d9fb..b0e87616e377 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1231,11 +1231,12 @@ static void account_llc_enqueue(struct rq *rq, stru= ct task_struct *p) return; =20 pref_llc =3D p->preferred_llc; - if (pref_llc < 0) + if (pref_llc < 0 || pref_llc >=3D max_llcs) return; =20 rq->nr_llc_running++; rq->nr_pref_llc_running +=3D (pref_llc =3D=3D task_llc(p)); + rq->nr_pref_llc[pref_llc]++; p->sched_llc_active =3D true; } =20 @@ -1252,11 +1253,12 @@ static void account_llc_dequeue(struct rq *rq, stru= ct task_struct *p) return; =20 pref_llc =3D p->preferred_llc; - if (pref_llc < 0) + if (pref_llc < 0 || pref_llc >=3D max_llcs) return; =20 rq->nr_llc_running--; rq->nr_pref_llc_running -=3D (pref_llc =3D=3D task_llc(p)); + rq->nr_pref_llc[pref_llc]--; p->sched_llc_active =3D false; } =20 --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 761A92F12BE for ; Wed, 3 Dec 2025 23:01:32 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802894; cv=none; b=VOvR4Yo5MT+v4vvHJHnJrL04tUMLwfYbb4+GQbWJ3QO13hC1zjlHArO6dzcuGLllayHXLBw43BKllYMjOKohjC7Fzd9T9m3hYmCRq3WLpZzHqcCQuO2JcQTdEeD/rjnDRhN1lGZeCfQEi5WHKdPb8iHSUPG9WfZsKEu6JozCWHQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802894; c=relaxed/simple; bh=Hwjod13ydyBeyAl1Bc0MaWee5egwZS7IehFiRUr+3EU=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=Qm3SflXuxBKuYopJgqhcvipXf7FPYSYSF15V5hLWMr9nUpsAfdv+d2spbB0P7Tw1LmX/zkoTpJ7guZJ5VbPuMzy9Baf9HL/h+ZfC7oU8NJtxgafnNNwl0O1u1CDaxlhc7yoqMW17JyUgVXekWAPj30g3bMDCDrz5uBQLCvlVneA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=cSsts8rq; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com 
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="cSsts8rq" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802892; x=1796338892; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Hwjod13ydyBeyAl1Bc0MaWee5egwZS7IehFiRUr+3EU=; b=cSsts8rq9lESYUplMXqyaf7fQNdZgkgjFqCazxZIqivu0ulnrxBxtLfr 2q49FeXJtEzQZUFodeAzsWSFeSbbR0eNrEPCzAiJg3hLVd3plskFuoc8R LSKLX41Wp9fMgp9Ou54k2TxPn+ZJpABPQDMRZBxyysFrDh3CB41EwtGEs RrfwNP72MRObV0Rpqk7QGgKlk2FmXjIY1nC71X0MFH6YEKKSRhWDNHOyK 9xcJGzOrMyQT5S0kQJJP+Yjr1dE5itsHoR0sqlWiS8N54X7izsEc5kZbZ a2UxxHPNluXsMUFiW8C3sWBY39nJzoHIE5rPFYFCFz7BLdiv2vnTIfuTx g==; X-CSE-ConnectionGUID: +jOJhU2XTqKlvSAAbvSNZg== X-CSE-MsgGUID: bWKw4Hx3R3mw3p0kqP8M9w== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136361" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136361" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:31 -0800 X-CSE-ConnectionGUID: fNxL8O0TTpG29riKu1HOzA== X-CSE-MsgGUID: yN1bkFJBSRe3C9XnLxsEJA== X-Ironport-Invalid-End-Of-Message: True X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763802" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:31 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 09/23] sched/cache: Count tasks prefering destination LLC in a sched group Date: Wed, 3 Dec 2025 15:07:28 -0800 Message-Id: <1eb6a231ec82b37483208983f0cf10eec823ec9d.1764801860.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" During LLC load balancing, tabulate the number of tasks on each runqueue that prefer the LLC contains the env->dst_cpu in a sched group. For example, consider a system with 4 LLC sched groups (LLC0 to LLC3) balancing towards LLC3. LLC0 has 3 tasks preferring LLC3, LLC1 has 2, and LLC2 has 1. LLC0, having the most tasks preferring LLC3, is selected as the busiest source to pick tasks from. Within a source LLC, the total number of tasks preferring a destination LLC is computed by summing counts across all CPUs in that LLC. For instance, if LLC0 has CPU0 with 2 tasks and CPU1 with 1 task preferring LLC3, the total for LLC0 is 3. These statistics allow the load balancer to choose tasks from source sched groups that best match their preferred LLCs. Signed-off-by: Tim Chen --- Notes: v1->v2: Convert nr_pref_llc array in sg_lb_stats to a single variable as only the dst LLC stat is needed. 
(K Prateek Nayak) kernel/sched/fair.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b0e87616e377..4d7803f69a74 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10445,6 +10445,9 @@ struct sg_lb_stats { unsigned int nr_numa_running; unsigned int nr_preferred_running; #endif +#ifdef CONFIG_SCHED_CACHE + unsigned int nr_pref_llc; +#endif }; =20 /* @@ -10912,6 +10915,9 @@ static inline void update_sg_lb_stats(struct lb_env= *env, { int i, nr_running, local_group, sd_flags =3D env->sd->flags; bool balancing_at_rd =3D !env->sd->parent; +#ifdef CONFIG_SCHED_CACHE + int dst_llc =3D llc_id(env->dst_cpu); +#endif =20 memset(sgs, 0, sizeof(*sgs)); =20 @@ -10932,6 +10938,12 @@ static inline void update_sg_lb_stats(struct lb_en= v *env, if (cpu_overutilized(i)) *sg_overutilized =3D 1; =20 +#ifdef CONFIG_SCHED_CACHE + if (sched_cache_enabled() && llc_id(i) !=3D dst_llc && + dst_llc >=3D 0) + sgs->nr_pref_llc +=3D rq->nr_pref_llc[dst_llc]; +#endif + /* * No need to call idle_cpu() if nr_running is not 0 */ --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3209E2EDD45 for ; Wed, 3 Dec 2025 23:01:34 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802897; cv=none; b=AN9K8aiWJQG7HDbeaWXpGDetIW2icpqGbDr6zs/psxf+4ZLm2ceitwFSdlkxUNnHO69aqE5S3Lgw8UXlsXoedmM4Pr7i5RbMpn7L1KrlbpjXV6xeAEYh8XRvFtihZU5ev2z3gpc9wUtfTNoORHKd7LfpH7/RywEIWMBBa/DRGKQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802897; c=relaxed/simple; bh=p+3h65+/r+G8M/UVKx3C3o18pTa5Qaadr44RFr//JJM=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=MmMfghNG3eQEQnrI1wgmAlkBPcwScfTCOYIB2L9oD0PhxTEQvycV+raEGlUU7tq/cOm1m41tgx1zgYVTnsY1VCpNGnM6slJtSvukwWoNbVbq6sVz9SyOM9hVO35VnfPEJ/kFPYJD7nSsZDAVCSBbwe4MWGUKumJjlC3jPA1Gp5w= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=oK3XGSFi; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="oK3XGSFi" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802894; x=1796338894; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=p+3h65+/r+G8M/UVKx3C3o18pTa5Qaadr44RFr//JJM=; b=oK3XGSFi2bGDDnHY3Lou8C7HjUQfAlxc1xp5Jsb4tWssOTetEyKk8VhS xWt++svfjbe9DJCu7kK8NB54Iyuv23cDcsruzAVgtKiHf34SlRWKEmzrW D+oCFG7YN+VzH5prFgSppmI032uc/cJAJ/qAKAOk+5EqFUqWcIySUNujp dnKCK0NZsBYY0rnhzU9NpLtzRd0sgBD+P+q/gVsngGR9F8P7Ojt0z+4k+ FNbn0vTsTTr/tR3CHEUKYnt1XKHxIQth0oKpXgg30ClUCUHrWShO5n1wq sHaXMI4sp88m3bKftZXPxnzsOaTk5Sy2iUOBeydtIg4kqCpHbvNeeio00 A==; X-CSE-ConnectionGUID: eUWbdnCGTbS8UdiOjQqeaw== X-CSE-MsgGUID: M8ATD04uQSWlmkQNqjCJ+A== X-IronPort-AV: 
E=McAfee;i="6800,10657,11631"; a="77136382" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136382" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:33 -0800 X-CSE-ConnectionGUID: +B4a0CVGS5aDMise1kmgcw== X-CSE-MsgGUID: mYFfuf8aQyCTLl73SGkFmQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763810" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:33 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 10/23] sched/cache: Check local_group only once in update_sg_lb_stats() Date: Wed, 3 Dec 2025 15:07:29 -0800 Message-Id: <2581fa14a0083bbd22b50837cd86003e59192c00.1764801860.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" There is no need to check the local group twice for both group_asym_packing and group_smt_balance. Adjust the code to facilitate future checks for group types (cache-aware load balancing) as well. No functional changes are expected. Suggested-by: Peter Zijlstra (Intel) Co-developed-by: Chen Yu Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: New code cleanup patch. 
(Peter Zijlstra) kernel/sched/fair.c | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4d7803f69a74..6e4c1ae1bdda 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10984,14 +10984,16 @@ static inline void update_sg_lb_stats(struct lb_e= nv *env, =20 sgs->group_weight =3D group->group_weight; =20 - /* Check if dst CPU is idle and preferred to this group */ - if (!local_group && env->idle && sgs->sum_h_nr_running && - sched_group_asym(env, sgs, group)) - sgs->group_asym_packing =3D 1; - - /* Check for loaded SMT group to be balanced to dst CPU */ - if (!local_group && smt_balance(env, sgs, group)) - sgs->group_smt_balance =3D 1; + if (!local_group) { + /* Check if dst CPU is idle and preferred to this group */ + if (env->idle && sgs->sum_h_nr_running && + sched_group_asym(env, sgs, group)) + sgs->group_asym_packing =3D 1; + + /* Check for loaded SMT group to be balanced to dst CPU */ + if (smt_balance(env, sgs, group)) + sgs->group_smt_balance =3D 1; + } =20 sgs->group_type =3D group_classify(env->sd->imbalance_pct, group, sgs); =20 --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C1A762F6160 for ; Wed, 3 Dec 2025 23:01:36 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802899; cv=none; b=Y+KSvOIGNo37S4ppd6Zqb+qeXMYg9H7oOVVSwDUONcSmPmmNo+OfFtkUVVLhQy9Kszncjru9WbcIa9UEetZqhMPsmMY2k5fVZ6RAWQZpLFm3o5ZOTcH4gt2vkBWUME5YgLQA3NYdBf+3LQy/lgsvGtAErx6vO+QUxr5PuBX7rAE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802899; c=relaxed/simple; bh=CwrhaA/K9sEcx5ifxeMnRiF7w0oKVkh5kmhIRkZCI08=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=oku/UCNYCKxtHDLcs7jWCa04/T613otu/fvMOx46pM5Fk461C8jmF88SnvfkaEKbY/tPKG6ssSj+6jJ5qq4aFqOkczxx9qajmomVw1d15n0Nxc/H0Jxmj7YmItsjsTy0cRx3h5fJ6U2M4vg8NuLnjq+H/GqT2czhMHBhUawwwc0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=NQQGq9b9; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="NQQGq9b9" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802896; x=1796338896; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=CwrhaA/K9sEcx5ifxeMnRiF7w0oKVkh5kmhIRkZCI08=; b=NQQGq9b9vgYmeBpCZdLnkhTURcvp8LDZjz79tucf4QAOjRH6WMoJ7DIc VCEpH4PZk5+dZi9trvIpapAwsuwYkQegVq+/LDqHzSrIt129SaHxgL94Y 5nrvAHUr0MUD5UNXllanE0V0Fykum1uE2UTQDl3LnIDioTcTzOYpAO1X1 4qycYWShsJLluL7efSyQ+/SgISKYo/HIyxL8OBYx1D4XH6mSLaqEpIaiX g8GbNG2ofsWe9Fe2YAYpsC9b78PtUUg4W2Vm4/GWu3tuk8/oeCtghHVCm rv/mHq9+NoDA+NgB2cghgRnsU5NYvBkjZ9v38NvuhidP8frlEkqZR1gb1 Q==; X-CSE-ConnectionGUID: p32r4lRGQkiQ0lJdqe2vYQ== X-CSE-MsgGUID: 
Aoa+xDJNReuNdKo3ZUU7Lw== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136420" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136420" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:36 -0800 X-CSE-ConnectionGUID: bwFbXM5NTD2aXs2HmVfWRA== X-CSE-MsgGUID: oV/d19IyQLihLj6NBNyYIA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763827" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:35 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 11/23] sched/cache: Prioritize tasks preferring destination LLC during balancing Date: Wed, 3 Dec 2025 15:07:30 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" During LLC load balancing, first check for tasks that prefer the destination LLC and balance them to it before others. Mark source sched groups containing tasks preferring non local LLCs with the group_llc_balance flag. This ensures the load balancer later pulls or pushes these tasks toward their preferred LLCs. The load balancer selects the busiest sched_group and migrates tasks to less busy groups to distribute load across CPUs. With cache-aware scheduling enabled, the busiest sched_group is the one with most tasks preferring the destination LLC. If the group has the llc_balance flag set, cache aware load balancing is triggered. Introduce the helper function update_llc_busiest() to identify the sched_group with the most tasks preferring the destination LLC. Suggested-by: K Prateek Nayak Co-developed-by: Chen Yu Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: Fix comparison in can_migrate_llc(), which uses an uninitialized env->src_cpu. Use the candidate group's first CPU instead. (Aaron Lu) =20 Fix a race condition during bootup with build_sched_domains(), where the per-cpu(sd_llc_id) is reset to -1. (lkp/0day) Put the set of group_llc_balance and the usage of it into 1 patch. (Peter Zijlstra) =20 Change group_llc_balance priority to be lower than group_overloaded and embed it into normal load balance path. (Peter Zijlstra) =20 Remove the sched group's SD_SHARE_LLC check in llc_balance(), because we should allow tasks migration across NUMA nodes to their preferred= LLC, where the domain does not have SD_SHARE_LLC flag. kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 65 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6e4c1ae1bdda..db555c11b5b8 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9531,6 +9531,11 @@ enum group_type { * from balancing the load across the system. 
*/ group_imbalanced, + /* + * There are tasks running on non-preferred LLC, possible to move + * them to their preferred LLC without creating too much imbalance. + */ + group_llc_balance, /* * The CPU is overloaded and can't provide expected CPU cycles to all * tasks. @@ -10440,6 +10445,7 @@ struct sg_lb_stats { enum group_type group_type; unsigned int group_asym_packing; /* Tasks should be moved to preferred CP= U */ unsigned int group_smt_balance; /* Task on busy SMT be moved */ + unsigned int group_llc_balance; /* Tasks should be moved to preferred LL= C */ unsigned long group_misfit_task_load; /* A CPU has a task too big for its= capacity */ #ifdef CONFIG_NUMA_BALANCING unsigned int nr_numa_running; @@ -10698,6 +10704,9 @@ group_type group_classify(unsigned int imbalance_pc= t, if (group_is_overloaded(imbalance_pct, sgs)) return group_overloaded; =20 + if (sgs->group_llc_balance) + return group_llc_balance; + if (sg_imbalanced(group)) return group_imbalanced; =20 @@ -10890,11 +10899,55 @@ static void record_sg_llc_stats(struct lb_env *en= v, if (unlikely(READ_ONCE(sd_share->capacity) !=3D sgs->group_capacity)) WRITE_ONCE(sd_share->capacity, sgs->group_capacity); } + +/* + * Do LLC balance on sched group that contains LLC, and have tasks preferr= ing + * to run on LLC in idle dst_cpu. + */ +static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs, + struct sched_group *group) +{ + if (!sched_cache_enabled()) + return false; + + if (env->sd->flags & SD_SHARE_LLC) + return false; + + if (sgs->nr_pref_llc && + can_migrate_llc(cpumask_first(sched_group_span(group)), + env->dst_cpu, 0, true) =3D=3D mig_llc) + return true; + + return false; +} + +static bool update_llc_busiest(struct lb_env *env, + struct sg_lb_stats *busiest, + struct sg_lb_stats *sgs) +{ + /* + * There are more tasks that want to run on dst_cpu's LLC. + */ + return sgs->nr_pref_llc > busiest->nr_pref_llc; +} #else static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_st= ats *sgs, struct sched_group *group) { } + +static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs, + struct sched_group *group) +{ + return false; +} + +static bool update_llc_busiest(struct lb_env *env, + struct sg_lb_stats *busiest, + struct sg_lb_stats *sgs) +{ + return false; +} #endif =20 /** @@ -10993,6 +11046,10 @@ static inline void update_sg_lb_stats(struct lb_en= v *env, /* Check for loaded SMT group to be balanced to dst CPU */ if (smt_balance(env, sgs, group)) sgs->group_smt_balance =3D 1; + + /* Check for tasks in this group can be moved to their preferred LLC */ + if (llc_balance(env, sgs, group)) + sgs->group_llc_balance =3D 1; } =20 sgs->group_type =3D group_classify(env->sd->imbalance_pct, group, sgs); @@ -11056,6 +11113,10 @@ static bool update_sd_pick_busiest(struct lb_env *= env, /* Select the overloaded group with highest avg_load. 
*/ return sgs->avg_load > busiest->avg_load; =20 + case group_llc_balance: + /* Select the group with most tasks preferring dst LLC */ + return update_llc_busiest(env, busiest, sgs); + case group_imbalanced: /* * Select the 1st imbalanced group as we don't have any way to @@ -11318,6 +11379,7 @@ static bool update_pick_idlest(struct sched_group *= idlest, return false; break; =20 + case group_llc_balance: case group_imbalanced: case group_asym_packing: case group_smt_balance: @@ -11450,6 +11512,7 @@ sched_balance_find_dst_group(struct sched_domain *s= d, struct task_struct *p, int return NULL; break; =20 + case group_llc_balance: case group_imbalanced: case group_asym_packing: case group_smt_balance: @@ -11949,7 +12012,8 @@ static struct sched_group *sched_balance_find_src_g= roup(struct lb_env *env) * group's child domain. */ if (sds.prefer_sibling && local->group_type =3D=3D group_has_spare && - sibling_imbalance(env, &sds, busiest, local) > 1) + (busiest->group_type =3D=3D group_llc_balance || + sibling_imbalance(env, &sds, busiest, local) > 1)) goto force_balance; =20 if (busiest->group_type !=3D group_overloaded) { --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8446B2FCBE3 for ; Wed, 3 Dec 2025 23:01:38 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802901; cv=none; b=AczUsF+ErJIRlmzmMhdwLmi7ZupDah78/dCkfXKoZGQ3XVlhu9qwGaFYSDg3FFQU9754xRJEORkGrcVZU1ssicX++R+V0FXfSTdSUEZWfvt980XcoUhlWnK7J8un6y7YNQXxJBfZVrhj31WyccQPJJevDmK67sgqqF6PsKk29mc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802901; c=relaxed/simple; bh=da90OiAHbhR9NPA8Ratl9FUXidYv15t1ql0bzkXvmzA=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=hCA0ZezNljYOVjtzlpNDPYqpoGKoW7yU4ihuYN4DdplXI8ZjqyOysntDUcfzbne+6CzBonX2R+LOUwUNh5V4ZvlW0NEG+WGaT266Gr89t7EmmUAyb0SQ4i4NDSbCHrELFwlVL45n3XsDuBwIKNxjYMRKZj90lzt9XJuGVK0hJpE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=E/oDxO7e; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="E/oDxO7e" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802898; x=1796338898; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=da90OiAHbhR9NPA8Ratl9FUXidYv15t1ql0bzkXvmzA=; b=E/oDxO7ef4nI4G5J3jOvjR+X/vFua+P9e3AZXKLJcxFJriNr7Ua944xG AxkcNTluTudW0fa7LiL2oLSyXQGNm4wxTedztXy+Kb3GNW3m1xItQPgjY yaKpw+/5zQcwTUlI7cSSe2yq6pGi70PjZnOQeUYqx+6LdidqnzQeT9x0d oKfUVrBxLwV+bxjJ5X7pfb+amTWF/9P1/Z2cwQnN4MgR4+xZfJ/oQETi0 OhZkv30WMo989iIGaDW9QOVZENXrnIYuSR0poLGwGoz4vGxEA6oadIK33 rSOZLBiBoM9ORQbnZoVJ4AxudF9GCXu3fDkCd/li1EhJxcKamQHTatJeP g==; X-CSE-ConnectionGUID: Ktrog9qIS3GVBMh0FikIKg== X-CSE-MsgGUID: 
wFdH+7CRS02fjEaxSZiyog== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136444" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136444" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:37 -0800 X-CSE-ConnectionGUID: 6HowfZdBQD20KHd0gzJbtg== X-CSE-MsgGUID: 6WhOzrMuS8+5P3U6JQUdyA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763835" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:37 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 12/23] sched/cache: Add migrate_llc_task migration type for cache-aware balancing Date: Wed, 3 Dec 2025 15:07:31 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Introduce a new migration type, migrate_llc_task, to support cache-aware load balancing. After identifying the busiest sched_group (having the most tasks preferring the destination LLC), mark migrations with this type. During load balancing, each runqueue in the busiest sched_group is examined, and the runqueue with the highest number of tasks preferring the destination CPU is selected as the busiest runqueue. 
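
As an illustration (not part of the patch), the source-runqueue choice under migrate_llc_task boils down to picking the runqueue in the busiest group that holds the most tasks preferring the destination CPU's LLC. Below is a minimal sketch assuming the llc_id() helper and the per-rq nr_pref_llc[] array introduced earlier in this series; busiest_rq_for_dst_llc() is a hypothetical helper used only for illustration, as the actual change folds this logic into the migrate_llc_task case of sched_balance_find_src_rq():

	static struct rq *busiest_rq_for_dst_llc(struct lb_env *env,
						 struct sched_group *group)
	{
		unsigned int best_cnt = 0;
		struct rq *best = NULL;
		int dst_llc = llc_id(env->dst_cpu);
		int i;

		if (dst_llc < 0)
			return NULL;

		for_each_cpu_and(i, sched_group_span(group), env->cpus) {
			struct rq *rq = cpu_rq(i);

			/* the rq with most tasks wanting dst's LLC wins */
			if (rq->nr_pref_llc[dst_llc] > best_cnt) {
				best_cnt = rq->nr_pref_llc[dst_llc];
				best = rq;
			}
		}

		return best;
	}

In the patch itself this selection sits alongside the existing migrate_task and migrate_load cases rather than in a separate helper.
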
Signed-off-by: Tim Chen --- Notes: v1->v2: Remove unnecessary cpus_share_cache() check in sched_balance_find_src_rq() (K Prateek Nayak) kernel/sched/fair.c | 32 +++++++++++++++++++++++++++++++- 1 file changed, 31 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index db555c11b5b8..529adf342ce0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9547,7 +9547,8 @@ enum migration_type { migrate_load =3D 0, migrate_util, migrate_task, - migrate_misfit + migrate_misfit, + migrate_llc_task }; =20 #define LBF_ALL_PINNED 0x01 @@ -10134,6 +10135,10 @@ static int detach_tasks(struct lb_env *env) env->imbalance -=3D util; break; =20 + case migrate_llc_task: + env->imbalance--; + break; + case migrate_task: env->imbalance--; break; @@ -11766,6 +11771,15 @@ static inline void calculate_imbalance(struct lb_e= nv *env, struct sd_lb_stats *s return; } =20 +#ifdef CONFIG_SCHED_CACHE + if (busiest->group_type =3D=3D group_llc_balance) { + /* Move a task that prefer local LLC */ + env->migration_type =3D migrate_llc_task; + env->imbalance =3D 1; + return; + } +#endif + if (busiest->group_type =3D=3D group_imbalanced) { /* * In the group_imb case we cannot rely on group-wide averages @@ -12073,6 +12087,10 @@ static struct rq *sched_balance_find_src_rq(struct= lb_env *env, struct rq *busiest =3D NULL, *rq; unsigned long busiest_util =3D 0, busiest_load =3D 0, busiest_capacity = =3D 1; unsigned int busiest_nr =3D 0; +#ifdef CONFIG_SCHED_CACHE + unsigned int busiest_pref_llc =3D 0; + int dst_llc; +#endif int i; =20 for_each_cpu_and(i, sched_group_span(group), env->cpus) { @@ -12181,6 +12199,16 @@ static struct rq *sched_balance_find_src_rq(struct= lb_env *env, } break; =20 + case migrate_llc_task: +#ifdef CONFIG_SCHED_CACHE + dst_llc =3D llc_id(env->dst_cpu); + if (dst_llc >=3D 0 && + busiest_pref_llc < rq->nr_pref_llc[dst_llc]) { + busiest_pref_llc =3D rq->nr_pref_llc[dst_llc]; + busiest =3D rq; + } +#endif + break; case migrate_task: if (busiest_nr < nr_running) { busiest_nr =3D nr_running; @@ -12363,6 +12391,8 @@ static void update_lb_imbalance_stat(struct lb_env = *env, struct sched_domain *sd case migrate_misfit: __schedstat_add(sd->lb_imbalance_misfit[idle], env->imbalance); break; + case migrate_llc_task: + break; } } =20 --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 97BB42FFF98 for ; Wed, 3 Dec 2025 23:01:39 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802903; cv=none; b=KDFpHcGAhKnHRBZFMFMtHMoRnhc4icrwIxIA8u+Vif5oz7Z18LHjkzu1IOV8tRYJFy4lXDjG6wYe22JV6BPtT9JAf2mUHKRyigHv1MkoPNBeRIKSEJ51iH0zebfyiiIhyx46QCps5MkfKG9xVMGg3N7ENza6Vv2+y6dsL+Zp0lE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802903; c=relaxed/simple; bh=8SM2jHHpi12dQS+zJornGRPQxkuowwvNXMVhwIeDBGA=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=U/JdvWoJw/IU3s2ub70NWLePIaQBRwHwPYibO+bbRJhw5I3xBFJgWmgkN/HfBIb1ABZRWNcUN5ladx9wdRE4q84V9sG4/k/92/pAoHRgP60/SkA1N0lBh+0oNDDaOMmcaJymNEYAB4Y+PlNTanSz07u82e6zrOmPcftrMVq0eQg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; 
dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=QFrFmdLP; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="QFrFmdLP" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802900; x=1796338900; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=8SM2jHHpi12dQS+zJornGRPQxkuowwvNXMVhwIeDBGA=; b=QFrFmdLPstihmD8vLzP896hsrOed6TFf664ZbLxgCKDyVP1ElFu/KxlL cWka8HAx7lSbtKJIRs2zDLb662V+u3vSkOL/+GmAmBZOGy6YahHgzdZ+w Cm8JPiAUQ0kzPS2n/rAw++vW0A14d5QX1S2PZ0RvAxgtjOMIEQght4vtw NlNGyMxSykwrfzzHo/Khc6YFVxKydWs7zQdFb7hjDddawl3rivgSTQ4lM rXsDbUmw/L0HUCnUtshRY/GabXqs3gMSK3t3UfCRyfscjIhW5T7A4/xG6 Ul+07Ph3CpTYgJ6hsHVxiRy1rZKIhjL1V7FiHZTJQ8OxeBn2eVIqDAWXn g==; X-CSE-ConnectionGUID: Ew56K6WcQq6rz+G+xWa07Q== X-CSE-MsgGUID: UdKIolArSn+zoD5IfTBGAQ== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136469" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136469" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:39 -0800 X-CSE-ConnectionGUID: 2oDVFmE5Rvu2YRtXhtrPTg== X-CSE-MsgGUID: oS68+IQVQH2bxeYLezigJQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763850" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:38 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 13/23] sched/cache: Handle moving single tasks to/from their preferred LLC Date: Wed, 3 Dec 2025 15:07:32 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" If the busiest runqueue has only one task, active balancing may be invoked to move it. However, before migration, check whether the task is running on its preferred LLC. Do not move a lone task to another LLC if it would move the task away from its preferred LLC or cause excessive imbalance between LLCs. Co-developed-by: Chen Yu Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: Remove uneeded preferred LLC migration check from active_load_balance_cpu_stop(). 
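
For readers skimming the quoted-printable diff below, the check added to need_active_balance() can be summarised by this condensed sketch (same names as the patch; can_migrate_llc() and nr_pref_llc_running come from earlier patches in the series, and the RCU access to src_rq->curr is elided for brevity, so this is not the exact code):

	static bool break_llc_locality(struct lb_env *env)
	{
		if (!sched_cache_enabled())
			return false;

		/* moving within one LLC never breaks locality */
		if (cpus_share_cache(env->src_cpu, env->dst_cpu))
			return false;

		/* every runnable task on the source rq prefers the LLC it is on */
		if (env->src_rq->nr_pref_llc_running == env->src_rq->cfs.h_nr_runnable) {
			/* a lone task is left where it is */
			if (env->src_rq->nr_running <= 1)
				return true;

			/* otherwise move it only if the LLC policy allows it */
			if (can_migrate_llc(env->src_cpu, env->dst_cpu,
					    task_util(env->src_rq->curr), false) == mig_forbid)
				return true;
		}

		return false;
	}

need_active_balance() then bails out early when this returns true, so active balancing never pulls a lone task off its preferred LLC.
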
kernel/sched/fair.c | 51 ++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 50 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 529adf342ce0..aed3fab98d7c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9878,12 +9878,57 @@ static __maybe_unused enum llc_mig can_migrate_llc_= task(int src_cpu, int dst_cpu task_util(p), to_pref); } =20 +/* + * Check if active load balance breaks LLC locality in + * terms of cache aware load balance. + */ +static inline bool +break_llc_locality(struct lb_env *env) +{ + if (!sched_cache_enabled()) + return false; + + if (cpus_share_cache(env->src_cpu, env->dst_cpu)) + return false; + /* + * All tasks prefer to stay on their current CPU. + * Do not pull a task from its preferred CPU if: + * 1. It is the only task running there; OR + * 2. Migrating it away from its preferred LLC would violate + * the cache-aware scheduling policy. + */ + if (env->src_rq->nr_pref_llc_running =3D=3D env->src_rq->cfs.h_nr_runnabl= e) { + unsigned long util =3D 0; + struct task_struct *cur; + + if (env->src_rq->nr_running <=3D 1) + return true; + + rcu_read_lock(); + cur =3D rcu_dereference(env->src_rq->curr); + if (cur) + util =3D task_util(cur); + rcu_read_unlock(); + + if (can_migrate_llc(env->src_cpu, env->dst_cpu, + util, false) =3D=3D mig_forbid) + return true; + } + + return false; +} #else static inline bool get_llc_stats(int cpu, unsigned long *util, unsigned long *cap) { return false; } + +static inline bool +break_llc_locality(struct lb_env *env) +{ + return false; +} #endif /* * can_migrate_task - may task p from runqueue rq be migrated to this_cpu? @@ -12279,6 +12324,9 @@ static int need_active_balance(struct lb_env *env) { struct sched_domain *sd =3D env->sd; =20 + if (break_llc_locality(env)) + return 0; + if (asym_active_balance(env)) return 1; =20 @@ -12298,7 +12346,8 @@ static int need_active_balance(struct lb_env *env) return 1; } =20 - if (env->migration_type =3D=3D migrate_misfit) + if (env->migration_type =3D=3D migrate_misfit || + env->migration_type =3D=3D migrate_llc_task) return 1; =20 return 0; --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C1938F513 for ; Wed, 3 Dec 2025 23:01:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802905; cv=none; b=sPmV7aM8SfneES++JSxoAMTpkJxsxkIaVzLucunnA9mKqP6A+4Tm600kyT9VTXTzXq34T39lXTUp9sHWoERIl8w+bTu7J1HC+rfyTlXxwEVQV8C99GFpkkbN1BPFHILnrVb4xczJGDnWK5dD50Ye9FIBTMyihvIerGvjfEsmqNE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802905; c=relaxed/simple; bh=lHL2pgABc7GHr6ACmg9H32RJUswizn6AHQFobrHrbFw=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=G7rXkqjakmupf9n++e5JGAkMIXq3jqgQc6G6Gw5IYyY/VhHNnlVMfdVNOcDomPtYPBMavf9m7Y2bsSMUvQExqTt6CASUZ8aGZ8iX+XoR/Ej28b5EwCnggenbKxXL4Xj0/E38v+KIJD/T8MnOLbFEeGjSREtAQxxgu/2prdjZMw8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=FiUlG+0K; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; 
dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="FiUlG+0K" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802901; x=1796338901; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=lHL2pgABc7GHr6ACmg9H32RJUswizn6AHQFobrHrbFw=; b=FiUlG+0K/UC9vVMh/oPWl1WUBZdhy5MrB44PaaHkXUAA4jYHkLTFSSsi qocTAQQFuheK8JLYpFg2R7aU2iv4GZRGXge93BEc9kS9nTpx4oQOMWekm +vXMxJj28JhCGkxAcIYAkVQvbks0I4+snX/or9+O6+kLtJoq4VW98lvHt gsZRnKPvbTAbfB8BLT4mfbZqijYwb7I27I0TW2bqZx35wIeRxh9EeBFyi ROuei6K/cuomwGMaKK20uTZT8/nP1CIoBiGImBAQQNhK7Hgo6jMMsFX23 lLTcZHF+7w8PBbIBEKU+iwv08wqwC5Czno4lf4DE3GutioUzRJHIw2uIq A==; X-CSE-ConnectionGUID: LC9AWrvJQPSiwhGuWRZBoA== X-CSE-MsgGUID: vAm1J3bzStCyct17U9yt9Q== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136497" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136497" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:40 -0800 X-CSE-ConnectionGUID: pVhkYbuLQ6qhXibAglfMuQ== X-CSE-MsgGUID: 3iy0SCYQQeWYaEaxUD8biw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763859" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:40 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 14/23] sched/cache: Consider LLC preference when selecting tasks for load balancing Date: Wed, 3 Dec 2025 15:07:33 -0800 Message-Id: <048601436d24f19e84c0a002e1c5897f95853276.1764801860.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Currently, task selection from the busiest runqueue ignores LLC preferences. Reorder tasks in the busiest queue to prioritize selection as follows: 1. Tasks preferring the destination CPU's LLC 2. Tasks with no LLC preference 3. Tasks preferring an LLC different from their current one 4. Tasks preferring the LLC they are currently on This improves the likelihood that tasks are migrated to their preferred LLC. Signed-off-by: Tim Chen --- Notes: v1->v2: No change. kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 65 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index aed3fab98d7c..dd09a816670e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10092,6 +10092,68 @@ static struct task_struct *detach_one_task(struct = lb_env *env) return NULL; } =20 +#ifdef CONFIG_SCHED_CACHE +/* + * Prepare lists to detach tasks in the following order: + * 1. tasks that prefer dst cpu's LLC + * 2. 
tasks that have no preference in LLC + * 3. tasks that prefer LLC other than the ones they are on + * 4. tasks that prefer the LLC that they are currently on. + */ +static struct list_head +*order_tasks_by_llc(struct lb_env *env, struct list_head *tasks) +{ + struct task_struct *p; + LIST_HEAD(pref_old_llc); + LIST_HEAD(pref_new_llc); + LIST_HEAD(no_pref_llc); + LIST_HEAD(pref_other_llc); + + if (!sched_cache_enabled()) + return tasks; + + if (cpus_share_cache(env->dst_cpu, env->src_cpu)) + return tasks; + + while (!list_empty(tasks)) { + p =3D list_last_entry(tasks, struct task_struct, se.group_node); + + if (p->preferred_llc =3D=3D llc_id(env->dst_cpu)) { + list_move(&p->se.group_node, &pref_new_llc); + continue; + } + + if (p->preferred_llc =3D=3D llc_id(env->src_cpu)) { + list_move(&p->se.group_node, &pref_old_llc); + continue; + } + + if (p->preferred_llc =3D=3D -1) { + list_move(&p->se.group_node, &no_pref_llc); + continue; + } + + list_move(&p->se.group_node, &pref_other_llc); + } + + /* + * We detach tasks from list tail in detach tasks. Put tasks + * to be chosen first at end of list. + */ + list_splice(&pref_new_llc, tasks); + list_splice(&no_pref_llc, tasks); + list_splice(&pref_other_llc, tasks); + list_splice(&pref_old_llc, tasks); + return tasks; +} +#else +static inline struct list_head +*order_tasks_by_llc(struct lb_env *env, struct list_head *tasks) +{ + return tasks; +} +#endif + /* * detach_tasks() -- tries to detach up to imbalance load/util/tasks from * busiest_rq, as part of a balancing operation within domain "sd". @@ -10100,7 +10162,7 @@ static struct task_struct *detach_one_task(struct l= b_env *env) */ static int detach_tasks(struct lb_env *env) { - struct list_head *tasks =3D &env->src_rq->cfs_tasks; + struct list_head *tasks; unsigned long util, load; struct task_struct *p; int detached =3D 0; @@ -10119,6 +10181,8 @@ static int detach_tasks(struct lb_env *env) if (env->imbalance <=3D 0) return 0; =20 + tasks =3D order_tasks_by_llc(env, &env->src_rq->cfs_tasks); + while (!list_empty(tasks)) { /* * We don't want to steal all, otherwise we may be treated likewise, --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F15862EC0B3 for ; Wed, 3 Dec 2025 23:01:43 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802907; cv=none; b=gaatxX9hyfNCQNZuo8e4RU3vaqRhxVWET62DnEKpixJNU5xDEVuougssJt9/6wdKqXoIUOBKaKYsQEEI9+soes2dovmZhy3fGDXwD4VJshA6aArNO/9BRtmRmrSUH+Qeb4uxqCx6TiODM+aPCVtCEwIA755BalFPfrmj7+qULOI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802907; c=relaxed/simple; bh=Qq+bxGUfP5y5uzFrPweEIf2ig+fLfO0Fva+8tsaaHnM=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=PMemmZdPDG2ErK7Z6ebePwKSI9cabjQRZi7fOaAynPsVbH0TYAxCQkgG7kmEu1N/+0Kmoqb2iEytzk5b6Y83O56eTuw4wsJTpcQbn5OA5nrv8fwKgYRvMuPqwTWStSC5o/clmWh6Un/rG7VXFCAXnoxf+tadmloUwr1ceD4Iuek= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=K1c6F2rL; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; 
dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="K1c6F2rL" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802904; x=1796338904; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Qq+bxGUfP5y5uzFrPweEIf2ig+fLfO0Fva+8tsaaHnM=; b=K1c6F2rLIDMugmXFGo0VPRa3CkwpTWx9IJrRa/hsq4UrL7DnV0pw8ajG BaGeCuW4iC0q3KpRjUrb5Gjs2+rOB74bBmgvjzvP0Bgae0TPuFdvMjX23 z6+gGGgG19Wv4ve1vRjEwTT08BRcUINH2YNXiTUVgX6ibcCJComlk0Y6n quNDMVfwdU0hQZhwOtrSHXPRqMojx8I7m9WQ/PmD1woe8uT6yci0V4u2u jfnFFUMEbPvj3J6FUSZjuQwGSGo/EqXqp0xk/5KRyXKafHJF8xEhV/udJ e4v9JDT09EYShziT4Bzd1zuoH2hhzYHA7OeJFLCwdgppCCBWwVA2w3KmJ w==; X-CSE-ConnectionGUID: 9y9zHDIITpm9FBwYNTu03A== X-CSE-MsgGUID: Qrhpktr7Tg2wRc89JHsg/g== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136537" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136537" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:42 -0800 X-CSE-ConnectionGUID: JansXFpeT5WbVZuKS0jHBA== X-CSE-MsgGUID: NjtVIxeZSgSD5Yg4dtwXow== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763888" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:42 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 15/23] sched/cache: Respect LLC preference in task migration and detach Date: Wed, 3 Dec 2025 15:07:34 -0800 Message-Id: <1c75f54a2e259737eb9b15c98a5c1d1f142fdef6.1764801860.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" During the final step of load balancing, can_migrate_task() now considers a task's LLC preference before moving it out of its preferred LLC. Additionally, add checks in detach_tasks() to prevent selecting tasks that prefer their current LLC. Co-developed-by: Chen Yu Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: Leave out tasks under core scheduling from the cache aware load balance. (K Prateek Nayak) =20 Reduce the degree of honoring preferred_llc in detach_tasks(). If certain conditions are met, stop migrating tasks that prefer their current LLC and instead continue load balancing from other busiest runqueues. 
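As a reviewer aid, the following userspace sketch (illustrative only, not
part of this patch; the LLC ids and the preferred_llc values are made up)
mirrors the detach order that order_tasks_by_llc() from the previous patch
establishes, which is what the stop check added below relies on:

/*
 * Illustrative userspace sketch -- not kernel code.  Mirrors the bucket
 * ordering of order_tasks_by_llc(): tasks preferring the destination LLC
 * are detached first, tasks preferring their current (source) LLC last.
 */
#include <stdio.h>

enum bucket { PREF_NEW_LLC, NO_PREF_LLC, PREF_OTHER_LLC, PREF_OLD_LLC };

static const char *bucket_name[] = {
	"pref_new_llc (detached first)",
	"no_pref_llc (detached second)",
	"pref_other_llc (detached third)",
	"pref_old_llc (detached last)",
};

static enum bucket classify(int preferred_llc, int src_llc, int dst_llc)
{
	if (preferred_llc == dst_llc)
		return PREF_NEW_LLC;
	if (preferred_llc == -1)
		return NO_PREF_LLC;
	if (preferred_llc == src_llc)
		return PREF_OLD_LLC;
	return PREF_OTHER_LLC;
}

int main(void)
{
	int src_llc = 1, dst_llc = 2;		/* hypothetical LLC ids */
	int prefs[] = { 2, -1, 3, 1 };		/* hypothetical preferred_llc values */

	for (unsigned int i = 0; i < sizeof(prefs) / sizeof(prefs[0]); i++)
		printf("preferred_llc=%2d -> %s\n", prefs[i],
		       bucket_name[classify(prefs[i], src_llc, dst_llc)]);
	return 0;
}

Because tasks that prefer their current LLC sit in the last bucket, once the
detach loop reaches such a task it can assume the remaining tasks all prefer
to stay, which is why detaching from this runqueue can stop there.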
(K Prateek Nayak) kernel/sched/fair.c | 63 ++++++++++++++++++++++++++++++++++++++++++-- kernel/sched/sched.h | 13 +++++++++ 2 files changed, 74 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index dd09a816670e..580a967efdac 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9852,8 +9852,8 @@ static enum llc_mig can_migrate_llc(int src_cpu, int = dst_cpu, * Check if task p can migrate from source LLC to * destination LLC in terms of cache aware load balance. */ -static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int d= st_cpu, - struct task_struct *p) +static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu, + struct task_struct *p) { struct mm_struct *mm; bool to_pref; @@ -10025,6 +10025,13 @@ int can_migrate_task(struct task_struct *p, struct= lb_env *env) if (env->flags & LBF_ACTIVE_LB) return 1; =20 +#ifdef CONFIG_SCHED_CACHE + if (sched_cache_enabled() && + can_migrate_llc_task(env->src_cpu, env->dst_cpu, p) =3D=3D mig_forbid= && + !task_has_sched_core(p)) + return 0; +#endif + degrades =3D migrate_degrades_locality(p, env); if (!degrades) hot =3D task_hot(p, env); @@ -10146,12 +10153,55 @@ static struct list_head list_splice(&pref_old_llc, tasks); return tasks; } + +static bool stop_migrate_src_rq(struct task_struct *p, + struct lb_env *env, + int detached) +{ + if (!sched_cache_enabled() || p->preferred_llc =3D=3D -1 || + cpus_share_cache(env->src_cpu, env->dst_cpu) || + env->sd->nr_balance_failed) + return false; + + /* + * Stop migration for the src_rq and pull from a + * different busy runqueue in the following cases: + * + * 1. Trying to migrate task to its preferred + * LLC, but the chosen task does not prefer dest + * LLC - case 3 in order_tasks_by_llc(). This violates + * the goal of migrate_llc_task. However, we should + * stop detaching only if some tasks have been detached + * and the imbalance has been mitigated. + * + * 2. Don't detach more tasks if the remaining tasks want + * to stay. We know the remaining tasks all prefer the + * current LLC, because after order_tasks_by_llc(), the + * tasks that prefer the current LLC are the least favored + * candidates to be migrated out. + */ + if (env->migration_type =3D=3D migrate_llc_task && + detached && llc_id(env->dst_cpu) !=3D p->preferred_llc) + return true; + + if (llc_id(env->src_cpu) =3D=3D p->preferred_llc) + return true; + + return false; +} #else static inline struct list_head *order_tasks_by_llc(struct lb_env *env, struct list_head *tasks) { return tasks; } + +static bool stop_migrate_src_rq(struct task_struct *p, + struct lb_env *env, + int detached) +{ + return false; +} #endif =20 /* @@ -10205,6 +10255,15 @@ static int detach_tasks(struct lb_env *env) =20 p =3D list_last_entry(tasks, struct task_struct, se.group_node); =20 + /* + * Check if detaching current src_rq should be stopped, because + * doing so would break cache aware load balance. If we stop + * here, the env->flags has LBF_ALL_PINNED, which would cause + * the load balance to pull from another busy runqueue. 
+ */ + if (stop_migrate_src_rq(p, env, detached)) + break; + if (!can_migrate_task(p, env)) goto next; =20 diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 8f2a779825e4..40798a06e058 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1485,6 +1485,14 @@ extern void sched_core_dequeue(struct rq *rq, struct= task_struct *p, int flags); extern void sched_core_get(void); extern void sched_core_put(void); =20 +static inline bool task_has_sched_core(struct task_struct *p) +{ + if (sched_core_disabled()) + return false; + + return !!p->core_cookie; +} + #else /* !CONFIG_SCHED_CORE: */ =20 static inline bool sched_core_enabled(struct rq *rq) @@ -1524,6 +1532,11 @@ static inline bool sched_group_cookie_match(struct r= q *rq, return true; } =20 +static inline bool task_has_sched_core(struct task_struct *p) +{ + return false; +} + #endif /* !CONFIG_SCHED_CORE */ =20 #ifdef CONFIG_RT_GROUP_SCHED --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6E3AB2EFDAD for ; Wed, 3 Dec 2025 23:01:44 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802907; cv=none; b=DqEqfXaSW0ZZxydgpHnr//9Y+r8Kz4ipcj+CchWbORZ48RCt17FQ2DquLW8sfqca/x+abOrEYIPaq71/GVkzdhR5YktmlcdFPno7ta7IuxETAlghruG+YXcsfmrH3WvfypIFBRxcIK9G7zQ7Meao90BbtEmbg2ZH1AORZqaQMHw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802907; c=relaxed/simple; bh=UY2I5n5Zb5eoLU5mFytvpnggFlTCSd5WOZCBICo1NK0=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=ss76k4YY8rB/Z6uAGDFyQbUZ7bARhHHFMR8yOxKyMTjDj6HDUJk3fTrjyBpd8eZwWLWJd6uE+i5j5z2Y9c/kkgK7AnD0FSS5RcyHMwddwez0X8IBpyAwZBkh9Vkri2qy0caEGEQrs66nsLD9/pRtuqh/ensvo0F7AVsRu2xo2+M= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=b/WFlJ1d; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="b/WFlJ1d" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802904; x=1796338904; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=UY2I5n5Zb5eoLU5mFytvpnggFlTCSd5WOZCBICo1NK0=; b=b/WFlJ1dlftZ7EAiu5bb8CTSjdtBeseHX8isQ4Wht5vD1dxWm6RURFOT R1B3Vg98GKNKQd2LzX3IPnNH9KdzkcCltvIyuRjvzvHEAhFOFxsI/nNCA UEadn+0Fte3u19UFuKUeR+zfOfQY/nrc24OBpPT4wpQKXE96Ne4Zzhez9 CGKthr3Nhi0su6EqgFcgXSic3+e2vAZwxOJETpVdCkTcXOxPoH3AQRibc 89EqfPOQ7c13HxarJn7Y8fuv5oRcK9m2z4cMXZ93jLuPQkW6wM0YzTFzA la772T94DglzvBNsM6aU73BVVoFLW1MUMY65Xa6wwGE8bwa6iEUdQtZCN w==; X-CSE-ConnectionGUID: hV5QNWsDRNeWT6DD4+0r9w== X-CSE-MsgGUID: zyp/cB/OQI6PxkSBENXT7w== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136566" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136566" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by 
orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:44 -0800 X-CSE-ConnectionGUID: iafqWAoBQZGMBBV/tLdIww== X-CSE-MsgGUID: S+YoPmfDSRiUGxMt3Qjmgw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763904" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:43 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org, Libo Chen Subject: [PATCH v2 16/23] sched/cache: Introduce sched_cache_present to enable cache aware scheduling for multi LLCs NUMA node Date: Wed, 3 Dec 2025 15:07:35 -0800 Message-Id: <7453e3f901878608959f23dacaa36dfc0432c05b.1764801860.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu Cache-aware load balancing should only be enabled if there are more than 1 LLCs within 1 NUMA node. sched_cache_present is introduced to indicate whether this platform supports this topology. Suggested-by: Libo Chen Suggested-by: Adam Li Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: Use flag sched_cache_present to indicate whether a platform supports cache aware scheduling. Change this flag from staic key. There should be only 1 static key to control the cache aware scheduling. (Peter Zijlstra) kernel/sched/topology.c | 20 +++++++++++++++----- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index d583399fc6a1..9799e3a9a609 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -24,6 +24,8 @@ int max_llcs; =20 #ifdef CONFIG_SCHED_CACHE =20 +static bool sched_cache_present; + static unsigned int *alloc_new_pref_llcs(unsigned int *old, unsigned int *= *gc) { unsigned int *new =3D NULL; @@ -54,7 +56,7 @@ static void populate_new_pref_llcs(unsigned int *old, uns= igned int *new) new[i] =3D old[i]; } =20 -static int resize_llc_pref(void) +static int resize_llc_pref(bool has_multi_llcs) { unsigned int *__percpu *tmp_llc_pref; int i, ret =3D 0; @@ -102,6 +104,11 @@ static int resize_llc_pref(void) rq_unlock_irqrestore(rq, &rf); } =20 + if (has_multi_llcs) { + sched_cache_present =3D true; + pr_info_once("Cache aware load balance is enabled on the platform.\n"); + } + release_old: /* * Load balance is done under rcu_lock. @@ -124,7 +131,7 @@ static int resize_llc_pref(void) =20 #else =20 -static int resize_llc_pref(void) +static int resize_llc_pref(bool has_multi_llcs) { max_llcs =3D new_max_llcs; return 0; @@ -2644,6 +2651,7 @@ static int build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_att= r *attr) { enum s_alloc alloc_state =3D sa_none; + bool has_multi_llcs =3D false; struct sched_domain *sd; struct s_data d; struct rq *rq =3D NULL; @@ -2736,10 +2744,12 @@ build_sched_domains(const struct cpumask *cpu_map, = struct sched_domain_attr *att * between LLCs and memory channels. 
*/ nr_llcs =3D sd->span_weight / child->span_weight; - if (nr_llcs =3D=3D 1) + if (nr_llcs =3D=3D 1) { imb =3D sd->span_weight >> 3; - else + } else { imb =3D nr_llcs; + has_multi_llcs =3D true; + } imb =3D max(1U, imb); sd->imb_numa_nr =3D imb; =20 @@ -2787,7 +2797,7 @@ build_sched_domains(const struct cpumask *cpu_map, st= ruct sched_domain_attr *att if (has_cluster) static_branch_inc_cpuslocked(&sched_cluster_active); =20 - resize_llc_pref(); + resize_llc_pref(has_multi_llcs); =20 if (rq && sched_debug_verbose) pr_info("root domain span: %*pbl\n", cpumask_pr_args(cpu_map)); --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2CA91309EF4 for ; Wed, 3 Dec 2025 23:01:46 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802908; cv=none; b=nzlhLORShQGH6z2OKPCwgPj3fFYQBq0S4kjlB8PdpAMAbRvUDKx69/o9oLg1lRga1/7uLzN7ZJmwClhqm7REccEFVBXjMxnF8O6F1qeXlUxSc5j6wsPAdvgE25W54gtIVxKBjQRnZDVLeIGtXbaxk29EoCqp7pm1fCpS1IY7jQo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802908; c=relaxed/simple; bh=7G8GAR73tqFcdrEyXVcfBaeUwRwA82VAe47pEbdUV2w=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=Jh1NMZniFEQvMeyAac4yMWESOURMqAUIKW5GcomnPyFPuACvinoSr0dUF9HnUWSFLODn+/4wiWm4ySl8YKMzKSgIL7OQSmo169aanmL/sbmdbfeduyjfscZaBGqL5cQYK99GiDZLKPt44QcYP3KC0gclEaC+Rkd8OiTRxeMU500= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=E0yq8JMN; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="E0yq8JMN" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802906; x=1796338906; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=7G8GAR73tqFcdrEyXVcfBaeUwRwA82VAe47pEbdUV2w=; b=E0yq8JMN3sNhZ58s1b5iZ/cpqNuM9N0pDevJEvrPce0R2mUndVkmGScN McHDjEQAdkFny/+9qg6ANdvlFmYlDA/4TibC4Yz5kBPZKGiM/VEgmSwNx Wv+0fExbPAqEqTORsnJ61vyIc7KAkoB0P/ug+G27y1gOBAwA36EGLI/OA /yCpUK6WyND+MO1j8Jd+Z6+AKRhUgaidNDGg0GWIIit5s7o17SsHVlDsV qRWNYanMa3En1ALugyelInfcAx8tLNFNwwlqUz9ZCh6D2uuGRuoBR5fLH VziKp+AH5f2oXxMZP43VD+u7hWt+ni9sCpFuAa1/qPyus5y+HPClviJWH w==; X-CSE-ConnectionGUID: oDqO/ga6T/+BT+4b/VYbEw== X-CSE-MsgGUID: BsO2ZD3WSAih53lppi2XZQ== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136597" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136597" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:45 -0800 X-CSE-ConnectionGUID: vKt+yECETT+2Z5MJs0mW1A== X-CSE-MsgGUID: XpexGbaTSRGCCth9FMIgbg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763921" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by 
fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:45 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 17/23] sched/cache: Record the number of active threads per process for cache-aware scheduling Date: Wed, 3 Dec 2025 15:07:36 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu A performance regression was observed by Prateek when running hackbench with many threads per process (high fd count). To avoid this, processes with a large number of active threads are excluded from cache-aware scheduling. With sched_cache enabled, record the number of active threads in each process during the periodic task_cache_work(). While iterating over CPUs, if the currently running task belongs to the same process as the task that launched task_cache_work(), increment the active thread count. This number will be used by subsequent patch to inhibit cache aware load balance. Suggested-by: K Prateek Nayak Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: No change. include/linux/mm_types.h | 1 + kernel/sched/fair.c | 11 +++++++++-- 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 1ea16ef90566..04743983de4d 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1043,6 +1043,7 @@ struct mm_struct { raw_spinlock_t mm_sched_lock; unsigned long mm_sched_epoch; int mm_sched_cpu; + u64 nr_running_avg ____cacheline_aligned_in_smp; #endif =20 #ifdef CONFIG_MMU diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 580a967efdac..2f38ad82688f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1421,11 +1421,11 @@ static void task_tick_cache(struct rq *rq, struct t= ask_struct *p) =20 static void __no_profile task_cache_work(struct callback_head *work) { - struct task_struct *p =3D current; + struct task_struct *p =3D current, *cur; struct mm_struct *mm =3D p->mm; unsigned long m_a_occ =3D 0; unsigned long curr_m_a_occ =3D 0; - int cpu, m_a_cpu =3D -1; + int cpu, m_a_cpu =3D -1, nr_running =3D 0; cpumask_var_t cpus; =20 WARN_ON_ONCE(work !=3D &p->cache_work); @@ -1458,6 +1458,12 @@ static void __no_profile task_cache_work(struct call= back_head *work) m_occ =3D occ; m_cpu =3D i; } + rcu_read_lock(); + cur =3D rcu_dereference(cpu_rq(i)->curr); + if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) && + cur->mm =3D=3D mm) + nr_running++; + rcu_read_unlock(); } =20 /* @@ -1501,6 +1507,7 @@ static void __no_profile task_cache_work(struct callb= ack_head *work) mm->mm_sched_cpu =3D m_a_cpu; } =20 + update_avg(&mm->nr_running_avg, nr_running); free_cpumask_var(cpus); } =20 --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with 
ESMTPS id 3BC5B30C376 for ; Wed, 3 Dec 2025 23:01:48 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802910; cv=none; b=onpA+M+D8g+bB7DNpp5zLpepvUh9w8T9C2/oqeKTUWUlV9lpl/W31aZarTCR7uvwI9r/kkm/FD7MwcDDnX7hNWvSaLIvFHtht8DxsLrUWb3j5NtWoxy2IAV7VHzxT0RxTQbEVmk6ub/tCK+n4V2wt8/jU8sGCZYABu8xUNFmQzE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802910; c=relaxed/simple; bh=BCBRwLmdA+4IVzADPAWhC/3F5wk90mYr0XsPdVDldug=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=P22IAZf3pO0DwcaeGaXfPF45reu5KwrXd9udmOhkXnd4XQpVPzlUupze8eBT005FfxLXJRNYY4JgHS7VRdg5qBGX8VhBoX9G0rOKgnTr7U9RHG4jdp1TU4xtGdenBrAxzksuJ/5c09oa/Ni6O8HCwsplWWOi+6exHbX7OKSFqwo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=Ty7FUw1A; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="Ty7FUw1A" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802908; x=1796338908; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=BCBRwLmdA+4IVzADPAWhC/3F5wk90mYr0XsPdVDldug=; b=Ty7FUw1AorJFrTn1pShKiLwJJ/bjWAtb7y1krTlw9/SRwaxzgqmmczqo u3N/1SifTNffuhxC1c0FAisXDHgXvvqPgSL0eykN2kILgw5XGJw02WLu5 DTsTU9YL6pY9pb/nL5ZARaF9QKCpSpfipEIM2etVGVvo5Q7kFSTOXs+H8 iIxOD/4oSuYwezAxsdbkRhhzIdd7YfjUSvB9o0XWfU4YnsJl/heMOcJ7B H3ZduMD5RF+5BphEK1nTa5CXhVJ0S2nzOaIUo5QipmWAbfGExiFD7Dfvc B8hxG4haeF2aHk7F8TdO+F6bVlL/xt/ae41Mu5pc0GlLavso3K0AzD+Xh Q==; X-CSE-ConnectionGUID: 93dV145yReO721FOecTa9w== X-CSE-MsgGUID: cCQ1dcHHRZCkHxfc9efEaQ== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136621" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136621" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:47 -0800 X-CSE-ConnectionGUID: gYbWyA1jQSuPW79ZakwQKg== X-CSE-MsgGUID: TU95ucBJS6iZ5dz55kzZsQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763946" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:47 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . 
Shenoy" , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 18/23] sched/cache: Disable cache aware scheduling for processes with high thread counts Date: Wed, 3 Dec 2025 15:07:37 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu If the number of active threads within the process exceeds the number of Cores(divided by SMTs number) in the LLC, do not enable cache-aware scheduling. This is because there is a risk of cache contention within the preferred LLC when too many threads are present. Suggested-by: K Prateek Nayak Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: No change. kernel/sched/fair.c | 29 +++++++++++++++++++++++++++-- 1 file changed, 27 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2f38ad82688f..6afa3f9a4e9b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1223,6 +1223,18 @@ static int llc_id(int cpu) return llc; } =20 +static bool exceed_llc_nr(struct mm_struct *mm, int cpu) +{ + int smt_nr =3D 1; + +#ifdef CONFIG_SCHED_SMT + if (sched_smt_active()) + smt_nr =3D cpumask_weight(cpu_smt_mask(cpu)); +#endif + + return ((mm->nr_running_avg * smt_nr) > per_cpu(sd_llc_size, cpu)); +} + static void account_llc_enqueue(struct rq *rq, struct task_struct *p) { int pref_llc; @@ -1365,10 +1377,12 @@ void account_mm_sched(struct rq *rq, struct task_st= ruct *p, s64 delta_exec) =20 /* * If this task hasn't hit task_cache_work() for a while, or it - * has only 1 thread, invalidate its preferred state. + * has only 1 thread, or has too many active threads, invalidate + * its preferred state. 
*/ if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT || - get_nr_threads(p) <=3D 1) { + get_nr_threads(p) <=3D 1 || + exceed_llc_nr(mm, cpu_of(rq))) { if (mm->mm_sched_cpu !=3D -1) mm->mm_sched_cpu =3D -1; } @@ -1435,6 +1449,13 @@ static void __no_profile task_cache_work(struct call= back_head *work) if (p->flags & PF_EXITING) return; =20 + if (get_nr_threads(p) <=3D 1) { + if (mm->mm_sched_cpu !=3D -1) + mm->mm_sched_cpu =3D -1; + + return; + } + if (!zalloc_cpumask_var(&cpus, GFP_KERNEL)) return; =20 @@ -9874,6 +9895,10 @@ static enum llc_mig can_migrate_llc_task(int src_cpu= , int dst_cpu, if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu)) return mig_unrestricted; =20 + /* skip cache aware load balance for single/too many threads */ + if (get_nr_threads(p) <=3D 1 || exceed_llc_nr(mm, dst_cpu)) + return mig_unrestricted; + if (cpus_share_cache(dst_cpu, cpu)) to_pref =3D true; else if (cpus_share_cache(src_cpu, cpu)) --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9101E2EC54D for ; Wed, 3 Dec 2025 23:01:50 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802913; cv=none; b=mZ6zgozB73YTe2Q60NzNJeXcrA6dwd6hmTIv0PKyoFj0ekz5KBJkRG1qM2/BURh0aF7CFHE0sYQDT25Sh/ho6UmSGiIRzP3Vlf26ErGeRZYynNy7Hu4jA7k4JybnWrC09LDy8qEGxsIyAxdcr/3QTceL1Zxm0kxxCEBV46nlDEI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802913; c=relaxed/simple; bh=ty+thnKFxG9+3T4ifTVEX04pmBe/l14iXANMioAm72I=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=bowCsa1//bbyzKU9WSJiWQsUHsXrqBQlvs/cAKgMyk/m4Bld010TDYg5UwVzdHKRvlpaid+xFoVz12quGwWlGa5F6HadDbBqKTBPP6/p1CNg91urhPN3p32qxubeGCoBIbuMM7MCO6I/YdFGB6u4/f5TpvPg3YmLnLcjC8/C7Xc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=bxWe6OeK; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="bxWe6OeK" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802910; x=1796338910; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=ty+thnKFxG9+3T4ifTVEX04pmBe/l14iXANMioAm72I=; b=bxWe6OeKUH0dPxqgW1jI5HE2e1z6OmOiyR4hMvqwqKai+AqvYcbOCYwu JOlPn9ZWYosHECHx5UGnkdTGEzkOmDWCRC2K3ypKwePUhIyD1337RCjJ3 uixa8Z2lYSQS2J5GJVC48B2f/yhUzBFPqFV4CEHvCoMLsK1cOf7W1aP4l eQBVHvIxVJB4mpBt3ae1f/13ipHHAFwfwmFLo4k5SToBHKxSAT6nyvK8a Vm37u8PzhAmKBcxxBJlGGGzpwc2T4MC/PWSin17i5/r/Xk+DaSUzLnxaF ZlP2B1+lT/NuonQU/h16sWvSe3/WRw4AeV5gKIbsttEfaewPOisfGEd7j g==; X-CSE-ConnectionGUID: Jgmht7L1SaW2ul5kAUA6dw== X-CSE-MsgGUID: 8CE3l3r/SEaFaHk/6vdVRg== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136653" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136653" Received: from fmviesa004.fm.intel.com 
([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:49 -0800 X-CSE-ConnectionGUID: 88MGzjBCTmOWjRxLdU7vUw== X-CSE-MsgGUID: Bi68ivGaS76IdMdbGxb19w== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763965" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:49 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes Date: Wed, 3 Dec 2025 15:07:38 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu Prateek and Tingyin reported that memory-intensive workloads (such as stream) can saturate memory bandwidth and caches on the preferred LLC when sched_cache aggregates too many threads. To mitigate this, estimate a process's memory footprint by comparing its RSS (anonymous and shared pages) to the size of the LLC. If RSS exceeds the LLC size, skip cache-aware scheduling. Note that RSS is only an approximation of the memory footprint. By default, the comparison is strict, but a later patch will allow users to provide a hint to adjust this threshold. According to the test from Adam, some systems do not have shared L3 but with shared L2 as clusters. In this case, the L2 becomes the LLC[1]. Link[1]: https://lore.kernel.org/all/3cb6ebc7-a2fd-42b3-8739-b00e28a09cb6@o= s.amperecomputing.com/ Co-developed-by: Tim Chen Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: Assigned curr_cpu in task_cache_work() before checking exceed_llc_capacity(mm, curr_cpu) to avoid out-of-bound access.(lkp/0day) include/linux/cacheinfo.h | 21 ++++++++++------- kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++++---- 2 files changed, 57 insertions(+), 13 deletions(-) diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h index c8f4f0a0b874..82d0d59ca0e1 100644 --- a/include/linux/cacheinfo.h +++ b/include/linux/cacheinfo.h @@ -113,18 +113,11 @@ int acpi_get_cache_info(unsigned int cpu, =20 const struct attribute_group *cache_get_priv_group(struct cacheinfo *this_= leaf); =20 -/* - * Get the cacheinfo structure for the cache associated with @cpu at - * level @level. - * cpuhp lock must be held. - */ -static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level) +static inline struct cacheinfo *_get_cpu_cacheinfo_level(int cpu, int leve= l) { struct cpu_cacheinfo *ci =3D get_cpu_cacheinfo(cpu); int i; =20 - lockdep_assert_cpus_held(); - for (i =3D 0; i < ci->num_leaves; i++) { if (ci->info_list[i].level =3D=3D level) { if (ci->info_list[i].attributes & CACHE_ID) @@ -136,6 +129,18 @@ static inline struct cacheinfo *get_cpu_cacheinfo_leve= l(int cpu, int level) return NULL; } =20 +/* + * Get the cacheinfo structure for the cache associated with @cpu at + * level @level. + * cpuhp lock must be held. 
+ */ +static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level) +{ + lockdep_assert_cpus_held(); + + return _get_cpu_cacheinfo_level(cpu, level); +} + /* * Get the id of the cache associated with @cpu at level @level. * cpuhp lock must be held. diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6afa3f9a4e9b..424ec601cfdf 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1223,6 +1223,38 @@ static int llc_id(int cpu) return llc; } =20 +static bool exceed_llc_capacity(struct mm_struct *mm, int cpu) +{ + struct cacheinfo *ci; + unsigned long rss; + unsigned int llc; + + /* + * get_cpu_cacheinfo_level() can not be used + * because it requires the cpu_hotplug_lock + * to be held. Use _get_cpu_cacheinfo_level() + * directly because the 'cpu' can not be + * offlined at the moment. + */ + ci =3D _get_cpu_cacheinfo_level(cpu, 3); + if (!ci) { + /* + * On system without L3 but with shared L2, + * L2 becomes the LLC. + */ + ci =3D _get_cpu_cacheinfo_level(cpu, 2); + if (!ci) + return true; + } + + llc =3D ci->size; + + rss =3D get_mm_counter(mm, MM_ANONPAGES) + + get_mm_counter(mm, MM_SHMEMPAGES); + + return (llc <=3D (rss * PAGE_SIZE)); +} + static bool exceed_llc_nr(struct mm_struct *mm, int cpu) { int smt_nr =3D 1; @@ -1382,7 +1414,8 @@ void account_mm_sched(struct rq *rq, struct task_stru= ct *p, s64 delta_exec) */ if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT || get_nr_threads(p) <=3D 1 || - exceed_llc_nr(mm, cpu_of(rq))) { + exceed_llc_nr(mm, cpu_of(rq)) || + exceed_llc_capacity(mm, cpu_of(rq))) { if (mm->mm_sched_cpu !=3D -1) mm->mm_sched_cpu =3D -1; } @@ -1439,7 +1472,7 @@ static void __no_profile task_cache_work(struct callb= ack_head *work) struct mm_struct *mm =3D p->mm; unsigned long m_a_occ =3D 0; unsigned long curr_m_a_occ =3D 0; - int cpu, m_a_cpu =3D -1, nr_running =3D 0; + int cpu, m_a_cpu =3D -1, nr_running =3D 0, curr_cpu; cpumask_var_t cpus; =20 WARN_ON_ONCE(work !=3D &p->cache_work); @@ -1449,7 +1482,9 @@ static void __no_profile task_cache_work(struct callb= ack_head *work) if (p->flags & PF_EXITING) return; =20 - if (get_nr_threads(p) <=3D 1) { + curr_cpu =3D task_cpu(p); + if (get_nr_threads(p) <=3D 1 || + exceed_llc_capacity(mm, curr_cpu)) { if (mm->mm_sched_cpu !=3D -1) mm->mm_sched_cpu =3D -1; =20 @@ -9895,8 +9930,12 @@ static enum llc_mig can_migrate_llc_task(int src_cpu= , int dst_cpu, if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu)) return mig_unrestricted; =20 - /* skip cache aware load balance for single/too many threads */ - if (get_nr_threads(p) <=3D 1 || exceed_llc_nr(mm, dst_cpu)) + /* + * Skip cache aware load balance for single/too many threads + * or large footprint. 
+ */ + if (get_nr_threads(p) <=3D 1 || exceed_llc_nr(mm, dst_cpu) || + exceed_llc_capacity(mm, dst_cpu)) return mig_unrestricted; =20 if (cpus_share_cache(dst_cpu, cpu)) --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E85E42F0C6F for ; Wed, 3 Dec 2025 23:01:52 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802915; cv=none; b=NKB81c5nkJMF1m/c1AQra8pCalQ/VATWqz8ZHIWg0eoz6hnNECnbqY6IjBOdnDBFvVl/b9HVmkECeNM1mHW2uEI8K209dQ6+mwy42BNPEeHaX20qEOS7RazcHKvkjiS5SxHlmYAv1Sx5K4HGlnkZ+3m/wG0/DRyA26pbDpUaoF0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802915; c=relaxed/simple; bh=j5hfiRZ2EYaCTsQGDmAvNRTgCCnUI1j/ItMFRbl9uzY=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=V2hqbFyqQGneKfxIcpO2Kc5dagTB+TDzJUq23BN2DeHLv/PgsNga9e2rv+hmluwZMbEcHv9RyyZKJ8F8TwCiuK0Z3yMm4l1RIXSG3p6TYCnyj/3zsuh7jcDOrc/cJgzZvLgpTBDOt79ulEa8r4q4GzHG4PsV4tL2S7Y8MOiS1eo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=aHHISq0g; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="aHHISq0g" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802913; x=1796338913; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=j5hfiRZ2EYaCTsQGDmAvNRTgCCnUI1j/ItMFRbl9uzY=; b=aHHISq0gwB38J2pv7w+1lfXdj3ALD4Re5eBGYwuwYbgrSTS87mzWr9d9 6z8UE8JAD8ovVTi9HPH2Dj4nm47BQyJFWTB7aSIByFBZvHQDMif8JcxQo YN44mNhAEn4CrrZXow3MjME9dhVbGveKvuIPn5IfCupOo2V/UomJWHR8v dtkYFqLnVw3S3bkna5BsUdpRh9ZBimaMuGq/+WwGF2nx4rrzpNdxn0j5U 3rhoVYZ01bV7elVPmaWw/ckqsd0iILZe0x+W0mSMx9qrnSVEtbw4rvo6z M5hLadE9a+KUPXiCE/w4A03eCnExBDNTMSqLbTk/r37NYHjbU70zyE3SM g==; X-CSE-ConnectionGUID: EZWPyiB6S9KiFT6DKfxjxA== X-CSE-MsgGUID: 07XoWa+5TBOCIV3mZenWgw== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136682" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136682" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:51 -0800 X-CSE-ConnectionGUID: MiEptcrPQgi3rw/P5nNNDA== X-CSE-MsgGUID: DrdTMc52RGuwpeHC+Js9gg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763975" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:51 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . 
Shenoy" , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling Date: Wed, 3 Dec 2025 15:07:39 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu Introduce a set of debugfs knobs to control the enabling of and parameters for cache-aware load balancing. (1) llc_enabled llc_enabled acts as the primary switch - users can toggle it to enable or disable cache aware load balancing. (2) llc_aggr_tolerance With sched_cache enabled, the scheduler uses a process's RSS as a proxy for its LLC footprint to determine if aggregating tasks on the preferred LLC could cause cache contention. If RSS exceeds the LLC size, aggregation is skipped. Some workloads with large RSS but small actual memory footprints may still benefit from aggregation. Since the kernel cannot efficiently track per-task cache usage (resctrl is user-space only), userspace can provide a more accurate hint. Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let users control how strictly RSS limits aggregation. Values range from 0 to 100: - 0: Cache-aware scheduling is disabled. - 1: Strict; tasks with RSS larger than LLC size are skipped. - 100: Aggressive; tasks are aggregated regardless of RSS. For example, with a 32MB L3 cache: - llc_aggr_tolerance=3D1 -> tasks with RSS > 32MB are skipped. - llc_aggr_tolerance=3D99 -> tasks with RSS > 784GB are skipped (784GB =3D (1 + (99 - 1) * 256) * 32MB). Similarly, /sys/kernel/debug/sched/llc_aggr_tolerance also controls how strictly the number of active threads is considered when doing cache aware load balance. The number of SMTs is also considered. High SMT counts reduce the aggregation capacity, preventing excessive task aggregation on SMT-heavy systems like Power10/Power11. For example, with 8 Cores/16 CPUs in a L3: - llc_aggr_tolerance=3D1 -> tasks with nr_running > 8 are skipped. - llc_aggr_tolerance=3D99 -> tasks with nr_running > 785 are skipped 785 =3D (1 + (99 - 1) * 8). (3) llc_epoch_period/llc_epoch_affinity_timeout Besides, llc_epoch_period and llc_epoch_affinity_timeout are also turned into tunable. Suggested-by: K Prateek Nayak Suggested-by: Madadi Vineeth Reddy Suggested-by: Shrikanth Hegde Suggested-by: Tingyin Duan Co-developed-by: Tim Chen Signed-off-by: Tim Chen Signed-off-by: Chen Yu --- Notes: v1->v2: Remove the smt_nr check in fits_llc_capacity(). 
(Aaron Lu) include/linux/sched.h | 4 ++- kernel/sched/debug.c | 62 ++++++++++++++++++++++++++++++++++++++++ kernel/sched/fair.c | 63 ++++++++++++++++++++++++++++++++++++----- kernel/sched/sched.h | 5 ++++ kernel/sched/topology.c | 54 +++++++++++++++++++++++++++++++++-- 5 files changed, 178 insertions(+), 10 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 466ba8b7398c..95bf080bbbf0 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2436,9 +2436,11 @@ extern void migrate_enable(void); DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable()) =20 #ifdef CONFIG_SCHED_CACHE +DECLARE_STATIC_KEY_FALSE(sched_cache_on); + static inline bool sched_cache_enabled(void) { - return false; + return static_branch_unlikely(&sched_cache_on); } #endif =20 diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 02e16b70a790..cde324672103 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -169,6 +169,53 @@ static const struct file_operations sched_feat_fops = =3D { .release =3D single_release, }; =20 +#ifdef CONFIG_SCHED_CACHE +#define SCHED_CACHE_CREATE_CONTROL(name, max) \ +static ssize_t sched_cache_write_##name(struct file *filp, \ + const char __user *ubuf, \ + size_t cnt, loff_t *ppos) \ +{ \ + char buf[16]; \ + unsigned int val; \ + if (cnt > 15) \ + cnt =3D 15; \ + if (copy_from_user(&buf, ubuf, cnt)) \ + return -EFAULT; \ + buf[cnt] =3D '\0'; \ + if (kstrtouint(buf, 10, &val)) \ + return -EINVAL; \ + if (val > (max)) \ + return -EINVAL; \ + llc_##name =3D val; \ + if (!strcmp(#name, "enabled")) \ + sched_cache_set(false); \ + *ppos +=3D cnt; \ + return cnt; \ +} \ +static int sched_cache_show_##name(struct seq_file *m, void *v) \ +{ \ + seq_printf(m, "%d\n", llc_##name); \ + return 0; \ +} \ +static int sched_cache_open_##name(struct inode *inode, \ + struct file *filp) \ +{ \ + return single_open(filp, sched_cache_show_##name, NULL); \ +} \ +static const struct file_operations sched_cache_fops_##name =3D { \ + .open =3D sched_cache_open_##name, \ + .write =3D sched_cache_write_##name, \ + .read =3D seq_read, \ + .llseek =3D seq_lseek, \ + .release =3D single_release, \ +} + +SCHED_CACHE_CREATE_CONTROL(overload_pct, 100); +SCHED_CACHE_CREATE_CONTROL(imb_pct, 100); +SCHED_CACHE_CREATE_CONTROL(aggr_tolerance, 100); +SCHED_CACHE_CREATE_CONTROL(enabled, 1); +#endif /* SCHED_CACHE */ + static ssize_t sched_scaling_write(struct file *filp, const char __user *u= buf, size_t cnt, loff_t *ppos) { @@ -523,6 +570,21 @@ static __init int sched_init_debug(void) debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing= _hot_threshold); #endif /* CONFIG_NUMA_BALANCING */ =20 +#ifdef CONFIG_SCHED_CACHE + debugfs_create_file("llc_overload_pct", 0644, debugfs_sched, NULL, + &sched_cache_fops_overload_pct); + debugfs_create_file("llc_imb_pct", 0644, debugfs_sched, NULL, + &sched_cache_fops_imb_pct); + debugfs_create_file("llc_aggr_tolerance", 0644, debugfs_sched, NULL, + &sched_cache_fops_aggr_tolerance); + debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL, + &sched_cache_fops_enabled); + debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched, + &llc_epoch_period); + debugfs_create_u32("llc_epoch_affinity_timeout", 0644, debugfs_sched, + &llc_epoch_affinity_timeout); +#endif + debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops= ); =20 debugfs_fair_server_init(); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 424ec601cfdf..a2e2d6742481 100644 --- a/kernel/sched/fair.c +++ 
b/kernel/sched/fair.c @@ -1207,6 +1207,9 @@ static s64 update_se(struct rq *rq, struct sched_enti= ty *se) =20 __read_mostly unsigned int llc_overload_pct =3D 50; __read_mostly unsigned int llc_imb_pct =3D 20; +__read_mostly unsigned int llc_aggr_tolerance =3D 1; +__read_mostly unsigned int llc_epoch_period =3D EPOCH_PERIOD; +__read_mostly unsigned int llc_epoch_affinity_timeout =3D EPOCH_LLC_AFFINI= TY_TIMEOUT; =20 static int llc_id(int cpu) { @@ -1223,11 +1226,22 @@ static int llc_id(int cpu) return llc; } =20 +static inline int get_sched_cache_scale(int mul) +{ + if (!llc_aggr_tolerance) + return 0; + + if (llc_aggr_tolerance =3D=3D 100) + return INT_MAX; + + return (1 + (llc_aggr_tolerance - 1) * mul); +} + static bool exceed_llc_capacity(struct mm_struct *mm, int cpu) { + unsigned int llc, scale; struct cacheinfo *ci; unsigned long rss; - unsigned int llc; =20 /* * get_cpu_cacheinfo_level() can not be used @@ -1252,19 +1266,54 @@ static bool exceed_llc_capacity(struct mm_struct *m= m, int cpu) rss =3D get_mm_counter(mm, MM_ANONPAGES) + get_mm_counter(mm, MM_SHMEMPAGES); =20 - return (llc <=3D (rss * PAGE_SIZE)); + /* + * Scale the LLC size by 256*llc_aggr_tolerance + * and compare it to the task's RSS size. + * + * Suppose the L3 size is 32MB. If the + * llc_aggr_tolerance is 1: + * When the RSS is larger than 32MB, the process + * is regarded as exceeding the LLC capacity. If + * the llc_aggr_tolerance is 99: + * When the RSS is larger than 784GB, the process + * is regarded as exceeding the LLC capacity because: + * 784GB =3D (1 + (99 - 1) * 256) * 32MB + */ + scale =3D get_sched_cache_scale(256); + if (scale =3D=3D INT_MAX) + return false; + + return ((llc * scale) <=3D (rss * PAGE_SIZE)); } =20 static bool exceed_llc_nr(struct mm_struct *mm, int cpu) { - int smt_nr =3D 1; + int smt_nr =3D 1, scale; =20 #ifdef CONFIG_SCHED_SMT if (sched_smt_active()) smt_nr =3D cpumask_weight(cpu_smt_mask(cpu)); #endif + /* + * Scale the Core number in a LLC by llc_aggr_tolerance + * and compare it to the task's active threads. + * + * Suppose the number of Cores in LLC is 8. + * Every core has 2 SMTs. + * If the llc_aggr_tolerance is 1: When the + * nr_running is larger than 8, the process + * is regarded as exceeding the LLC capacity. + * If the llc_aggr_tolerance is 99: + * When the nr_running is larger than 785, + * the process is regarded as exceeding + * the LLC capacity: + * 785 =3D 1 + (99 - 1) * 8 + */ + scale =3D get_sched_cache_scale(1); + if (scale =3D=3D INT_MAX) + return false; =20 - return ((mm->nr_running_avg * smt_nr) > per_cpu(sd_llc_size, cpu)); + return ((mm->nr_running_avg * smt_nr) > (scale * per_cpu(sd_llc_size, cpu= ))); } =20 static void account_llc_enqueue(struct rq *rq, struct task_struct *p) @@ -1350,9 +1399,9 @@ static inline void __update_mm_sched(struct rq *rq, s= truct mm_sched *pcpu_sched) long delta =3D now - rq->cpu_epoch_next; =20 if (delta > 0) { - n =3D (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD; + n =3D (delta + llc_epoch_period - 1) / llc_epoch_period; rq->cpu_epoch +=3D n; - rq->cpu_epoch_next +=3D n * EPOCH_PERIOD; + rq->cpu_epoch_next +=3D n * llc_epoch_period; __shr_u64(&rq->cpu_runtime, n); } =20 @@ -1412,7 +1461,7 @@ void account_mm_sched(struct rq *rq, struct task_stru= ct *p, s64 delta_exec) * has only 1 thread, or has too many active threads, invalidate * its preferred state. 
*/ - if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT || + if (epoch - READ_ONCE(mm->mm_sched_epoch) > llc_epoch_affinity_timeout || get_nr_threads(p) <=3D 1 || exceed_llc_nr(mm, cpu_of(rq)) || exceed_llc_capacity(mm, cpu_of(rq))) { diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 40798a06e058..15d126bd3728 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2852,6 +2852,11 @@ extern unsigned int sysctl_numa_balancing_hot_thresh= old; #ifdef CONFIG_SCHED_CACHE extern unsigned int llc_overload_pct; extern unsigned int llc_imb_pct; +extern unsigned int llc_aggr_tolerance; +extern unsigned int llc_epoch_period; +extern unsigned int llc_epoch_affinity_timeout; +extern unsigned int llc_enabled; +void sched_cache_set(bool locked); #endif =20 #ifdef CONFIG_SCHED_HRTICK diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 9799e3a9a609..818599ddaaef 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -26,6 +26,49 @@ int max_llcs; =20 static bool sched_cache_present; =20 +unsigned int llc_enabled =3D 1; +DEFINE_STATIC_KEY_FALSE(sched_cache_on); + +/* + * Enable/disable cache aware scheduling according to + * user input and the presence of hardware support. + */ +static void _sched_cache_set(bool enable, bool locked) +{ + if (enable) { + if (locked) + static_branch_enable_cpuslocked(&sched_cache_on); + else + static_branch_enable(&sched_cache_on); + } else { + if (locked) + static_branch_disable_cpuslocked(&sched_cache_on); + else + static_branch_disable(&sched_cache_on); + } +} + +void sched_cache_set(bool locked) +{ + /* hardware does not support */ + if (!sched_cache_present) { + if (static_branch_likely(&sched_cache_on)) + _sched_cache_set(false, locked); + + return; + } + + /* user wants it or not ?*/ + if (llc_enabled) { + if (!static_branch_likely(&sched_cache_on)) + _sched_cache_set(true, locked); + + } else { + if (static_branch_likely(&sched_cache_on)) + _sched_cache_set(false, locked); + } +} + static unsigned int *alloc_new_pref_llcs(unsigned int *old, unsigned int *= *gc) { unsigned int *new =3D NULL; @@ -70,8 +113,12 @@ static int resize_llc_pref(bool has_multi_llcs) * new buffer. 
*/ tmp_llc_pref =3D alloc_percpu_noprof(unsigned int *); - if (!tmp_llc_pref) - return -ENOMEM; + if (!tmp_llc_pref) { + sched_cache_present =3D false; + ret =3D -ENOMEM; + + goto out; + } =20 for_each_present_cpu(i) *per_cpu_ptr(tmp_llc_pref, i) =3D NULL; @@ -89,6 +136,7 @@ static int resize_llc_pref(bool has_multi_llcs) new =3D alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i= )); if (!new) { ret =3D -ENOMEM; + sched_cache_present =3D false; =20 goto release_old; } @@ -126,6 +174,8 @@ static int resize_llc_pref(bool has_multi_llcs) if (!ret) max_llcs =3D new_max_llcs; =20 +out: + sched_cache_set(true); return ret; } =20 --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D10542F12DD for ; Wed, 3 Dec 2025 23:01:53 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802915; cv=none; b=AmWzQXbFY2sN5heLcp4s9rWoLO7pjURsg464nsA8jjoqA5nJagwpJv9G+UJULof1tTaFgz2GmAr0hHkABofj6ydnfXE2fd4hRRYb7GE+M+4gERnZr5wAJOQw/zTEmxBeWSSE5iNgbAWmM054GBUn6MCdpITYzuKbb1BP7b3L3sk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802915; c=relaxed/simple; bh=heewblv8+VUSifHzkX3W2P+i26TuBbpse5E1oodIHu8=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=Xpw0JLbgiphYf3Sab645eHcm9Luo+Mx2FuFXrjcPsXJxnYfglU5zHbY1C3nGcYUTlQht3caQEhhC7tRceDrXIkNZHUg5zn5pvhgic99RbM9RtmxCAUWRJKHEvQHILxmwPExiCxVB0m/pqwl8+stVV67Gqhqd6Lhw1hT41ldDB9s= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=Cxg4oTl4; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="Cxg4oTl4" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802913; x=1796338913; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=heewblv8+VUSifHzkX3W2P+i26TuBbpse5E1oodIHu8=; b=Cxg4oTl4FXmQXHOKywDD1PXh0TwFbaiKduxzegiGnyiEGbaHQGeStB45 heDXhCr5sdgqIhbxUFp1vM0glTwn0l4/6ZiEL/dgHN9LNlGjaYsII9jc1 2qGZ9JRhqrUWqdc8Jm6fWF0Wuz16A6ncwR05z1/osHOGjbKNCnVNF9Y0l 4FSdn5Pg7wz/0mo5Tfd9kz21TLqYSS8tlCVsn5MnhfbvMVKYOtZOb0WKR 3KiZKcH2I7DsvpgO/euP9zAwOTpRdP8eIGES5K1LCg7I6oiUiavAKbHWR nP3xATAIJENhZb+rdETusA0Fs1MIUcnKK88Vr8NJIw3yCIQUWh4CdT9qz A==; X-CSE-ConnectionGUID: 32GrqILbQRmayJvwXZJ8Bg== X-CSE-MsgGUID: F77VqvbYTl+X7/J43cXgzw== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136713" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136713" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:53 -0800 X-CSE-ConnectionGUID: U2qPsSdSRnej2tO9wNUi2g== X-CSE-MsgGUID: YiZKZfnpSMaNZy2O6pI9Xg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763990" Received: from 
b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:52 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 21/23] -- DO NOT APPLY!!! -- sched/cache/stats: Add schedstat for cache aware load balancing Date: Wed, 3 Dec 2025 15:07:40 -0800 Message-Id: <71b94a7547f7843230270e20b84ecb0a540ab604.1764801860.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu Debug patch only. With cache-aware load balancing enabled, statistics related to its activity are exposed via /proc/schedstat and debugfs. For instance, if users want to verify metrics like the number of exceeding RSS and nr_running limits, they can filter the output of /sys/kernel/debug/sched/debug and compute the requ= ired statistics manually: llc_exceed_cap SUM: 6 llc_exceed_nr SUM: 4531 Furthermore, these statistics exposed in /proc/schedstats can be queried ma= nually or via perf sched stats[1] with minor modifications. Link: https://lore.kernel.org/all/20250909114227.58802-1-swapnil.sapkal@amd= .com #1 Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- include/linux/sched/topology.h | 1 + kernel/sched/fair.c | 1 + kernel/sched/stats.c | 5 +++-- 3 files changed, 5 insertions(+), 2 deletions(-) diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index 0ba4697d74ba..8702c1e731a0 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -108,6 +108,7 @@ struct sched_domain { unsigned int lb_imbalance_util[CPU_MAX_IDLE_TYPES]; unsigned int lb_imbalance_task[CPU_MAX_IDLE_TYPES]; unsigned int lb_imbalance_misfit[CPU_MAX_IDLE_TYPES]; + unsigned int lb_imbalance_llc[CPU_MAX_IDLE_TYPES]; unsigned int lb_gained[CPU_MAX_IDLE_TYPES]; unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES]; unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES]; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a2e2d6742481..742e455b093e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -12684,6 +12684,7 @@ static void update_lb_imbalance_stat(struct lb_env = *env, struct sched_domain *sd __schedstat_add(sd->lb_imbalance_misfit[idle], env->imbalance); break; case migrate_llc_task: + __schedstat_add(sd->lb_imbalance_llc[idle], env->imbalance); break; } } diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c index d1c9429a4ac5..3736f6102261 100644 --- a/kernel/sched/stats.c +++ b/kernel/sched/stats.c @@ -104,7 +104,7 @@ void __update_stats_enqueue_sleeper(struct rq *rq, stru= ct task_struct *p, * Bump this up when changing the output format or the meaning of an exist= ing * format, so that tools can adapt (or abort) */ -#define SCHEDSTAT_VERSION 17 +#define SCHEDSTAT_VERSION 18 =20 static int show_schedstat(struct seq_file *seq, void *v) { @@ -139,7 +139,7 @@ static int show_schedstat(struct seq_file *seq, void *v) seq_printf(seq, "domain%d %s %*pb", dcount++, sd->name, 
 			   cpumask_pr_args(sched_domain_span(sd)));
 		for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) {
-			seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u",
+			seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u %u",
 				   sd->lb_count[itype],
 				   sd->lb_balanced[itype],
 				   sd->lb_failed[itype],
@@ -147,6 +147,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
 				   sd->lb_imbalance_util[itype],
 				   sd->lb_imbalance_task[itype],
 				   sd->lb_imbalance_misfit[itype],
+				   sd->lb_imbalance_llc[itype],
 				   sd->lb_gained[itype],
 				   sd->lb_hot_gained[itype],
 				   sd->lb_nobusyq[itype],
-- 
2.32.0
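As an illustration only (not part of this series), the new per-idle-type
lb_imbalance_llc counter can be totalled from /proc/schedstat with a small
user-space helper such as the sketch below. It assumes the SCHEDSTAT_VERSION 18
layout produced by show_schedstat() above: on each "domainN" line, after the
domain number, name and cpumask, every idle type contributes 12 load-balance
counters, of which lb_imbalance_llc is the 8th. IDLE_TYPES mirrors
CPU_MAX_IDLE_TYPES on current kernels and is an assumption here.

/*
 * Illustrative helper, not part of this series: total the lb_imbalance_llc
 * counters from /proc/schedstat, assuming the version-18 field order above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define IDLE_TYPES	3	/* assumed value of CPU_MAX_IDLE_TYPES */
#define LB_FIELDS	12	/* per-idle-type load-balance counters in v18 */

int main(void)
{
	char line[4096];
	unsigned long long sum = 0;
	FILE *fp = fopen("/proc/schedstat", "r");

	if (!fp) {
		perror("/proc/schedstat");
		return 1;
	}

	while (fgets(line, sizeof(line), fp)) {
		char *tok;
		int field = 0;

		if (strncmp(line, "domain", 6))
			continue;	/* only domain lines carry these counters */

		for (tok = strtok(line, " \n"); tok;
		     tok = strtok(NULL, " \n"), field++) {
			if (field < 3)	/* skip "domainN", name, cpumask */
				continue;
			if (field >= 3 + IDLE_TYPES * LB_FIELDS)
				break;	/* remaining domain-wide stats */
			if ((field - 3) % LB_FIELDS == 7)
				sum += strtoull(tok, NULL, 10);
		}
	}
	fclose(fp);

	printf("lb_imbalance_llc SUM: %llu\n", sum);
	return 0;
}

The printed total is the sum across all CPUs, domains and idle types, in the
same "SUM" style as the changelog above.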
From nobody Fri Dec 19 19:37:43 2025
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy", Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde,
    Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Vern Hao, Len Brown,
    Tim Chen, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen,
    linux-kernel@vger.kernel.org
Subject: [PATCH v2 22/23] -- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load balance statistics
Date: Wed, 3 Dec 2025 15:07:41 -0800
Message-Id: <445303c70d8d464c35c97f33d4be7b752e8db5ae.1764801860.git.tim.c.chen@linux.intel.com>

From: Chen Yu

Debug patch only.

This trace event can be used (via bpftrace, etc.) to monitor cache-aware
load-balancing activity: whether tasks are moved to their preferred LLC
or moved out of it.

Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
 include/trace/events/sched.h | 31 +++++++++++++++++++++++++++++++
 kernel/sched/fair.c          | 10 ++++++++++
 2 files changed, 41 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 7b2645b50e78..bd03f49f7e3c 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -10,6 +10,37 @@
 #include 
 #include 
 
+TRACE_EVENT(sched_attach_task,
+
+	TP_PROTO(struct task_struct *t, int pref_cpu, int pref_llc,
+		 int attach_cpu, int attach_llc),
+
+	TP_ARGS(t, pref_cpu, pref_llc, attach_cpu, attach_llc),
+
+	TP_STRUCT__entry(
+		__array( char,  comm,       TASK_COMM_LEN )
+		__field( pid_t, pid )
+		__field( int,   pref_cpu )
+		__field( int,   pref_llc )
+		__field( int,   attach_cpu )
+		__field( int,   attach_llc )
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		__entry->pid        = t->pid;
+		__entry->pref_cpu   = pref_cpu;
+		__entry->pref_llc   = pref_llc;
+		__entry->attach_cpu = attach_cpu;
+		__entry->attach_llc = attach_llc;
+	),
+
+	TP_printk("comm=%s pid=%d pref_cpu=%d pref_llc=%d attach_cpu=%d attach_llc=%d",
+		  __entry->comm, __entry->pid,
+		  __entry->pref_cpu, __entry->pref_llc,
+		  __entry->attach_cpu, __entry->attach_llc)
+);
+
 /*
  * Tracepoint for calling kthread_stop, performed to end a kthread:
  */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 742e455b093e..e47b4096f0a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10487,6 +10487,16 @@ static void attach_task(struct rq *rq, struct task_struct *p)
 {
 	lockdep_assert_rq_held(rq);
 
+#ifdef CONFIG_SCHED_CACHE
+	if (p->mm) {
+		int pref_cpu = p->mm->mm_sched_cpu;
+
+		trace_sched_attach_task(p,
+					pref_cpu,
+					pref_cpu != -1 ? llc_id(pref_cpu) : -1,
+					cpu_of(rq), llc_id(cpu_of(rq)));
+	}
+#endif
 	WARN_ON_ONCE(task_rq(p) != rq);
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
 	wakeup_preempt(rq, p, 0);
-- 
2.32.0
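For reference, a record emitted by this event would look roughly like the
(made-up) line below, following the TP_printk() format above; an attach_llc
that differs from pref_llc indicates the task was attached outside its
preferred LLC:

  sched_attach_task: comm=schbench pid=1234 pref_cpu=2 pref_llc=0 attach_cpu=8 attach_llc=1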
Shenoy" , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 23/23] -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy for each process via proc fs Date: Wed, 3 Dec 2025 15:07:42 -0800 Message-Id: <0eaf9b9f89f0d97dbf46b760421f65aee3ffe063.1764801860.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu Debug patch only. Show the per-LLC occupancy in /proc/{PID}/schedstat, with each column corresponding to one LLC. This can be used to verify if the cache-aware load balancer works as expected by aggregating threads onto dedicated LLCs. Suppose there are 2 LLCs and the sampling duration is 10 seconds: Enable the cache aware load balance: 0 12281 <--- LLC0 residency delta is 0, LLC1 is 12 seconds 0 18881 0 16217 disable the cache aware load balance: 6497 15802 9299 5435 17811 8278 Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- fs/proc/base.c | 22 ++++++++++++++++++++++ include/linux/mm_types.h | 19 +++++++++++++++++-- include/linux/sched.h | 3 +++ kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++++++++-- 4 files changed, 80 insertions(+), 4 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 6299878e3d97..f4be96f4bd01 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -518,6 +518,28 @@ static int proc_pid_schedstat(struct seq_file *m, stru= ct pid_namespace *ns, (unsigned long long)task->se.sum_exec_runtime, (unsigned long long)task->sched_info.run_delay, task->sched_info.pcount); +#ifdef CONFIG_SCHED_CACHE + if (sched_cache_enabled()) { + struct mm_struct *mm =3D task->mm; + u64 *llc_runtime; + + if (!mm) + return 0; + + llc_runtime =3D kcalloc(max_llcs, sizeof(u64), GFP_KERNEL); + if (!llc_runtime) + return 0; + + if (get_mm_per_llc_runtime(task, llc_runtime)) + goto out; + + for (int i =3D 0; i < max_llcs; i++) + seq_printf(m, "%llu ", llc_runtime[i]); + seq_puts(m, "\n"); +out: + kfree(llc_runtime); + } +#endif =20 return 0; } diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 04743983de4d..255c22be7312 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -944,6 +944,10 @@ struct mm_sched { unsigned long epoch; }; =20 +struct mm_time { + u64 runtime_ns; +}; + struct kioctx_table; struct iommu_mm_data; struct mm_struct { @@ -1040,6 +1044,7 @@ struct mm_struct { * See account_mm_sched() and ... 
 	 */
 	struct mm_sched __percpu *pcpu_sched;
+	struct mm_time __percpu *pcpu_time;
 	raw_spinlock_t mm_sched_lock;
 	unsigned long mm_sched_epoch;
 	int mm_sched_cpu;
@@ -1505,16 +1510,24 @@ static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumas
 #endif /* CONFIG_SCHED_MM_CID */
 
 #ifdef CONFIG_SCHED_CACHE
-void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *pcpu_sched);
+void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *pcpu_sched,
+		   struct mm_time __percpu *pcpu_time);
 
 static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
 {
 	struct mm_sched __percpu *pcpu_sched = alloc_percpu_noprof(struct mm_sched);
+	struct mm_time __percpu *pcpu_time;
 
 	if (!pcpu_sched)
 		return -ENOMEM;
 
-	mm_init_sched(mm, pcpu_sched);
+	pcpu_time = alloc_percpu_noprof(struct mm_time);
+	if (!pcpu_time) {
+		free_percpu(mm->pcpu_sched);
+		return -ENOMEM;
+	}
+
+	mm_init_sched(mm, pcpu_sched, pcpu_time);
 	return 0;
 }
 
@@ -1523,7 +1536,9 @@ static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
 static inline void mm_destroy_sched(struct mm_struct *mm)
 {
 	free_percpu(mm->pcpu_sched);
+	free_percpu(mm->pcpu_time);
 	mm->pcpu_sched = NULL;
+	mm->pcpu_time = NULL;
 }
 #else /* !CONFIG_SCHED_CACHE */
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 95bf080bbbf0..875ac3f4208b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2442,6 +2442,9 @@ static inline bool sched_cache_enabled(void)
 {
 	return static_branch_unlikely(&sched_cache_on);
 }
+
+int get_mm_per_llc_runtime(struct task_struct *p, u64 *buf);
+extern int max_llcs;
 #endif
 
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e47b4096f0a6..205208f061bb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1355,16 +1355,19 @@ static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
 	p->sched_llc_active = false;
 }
 
-void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
+void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched,
+		   struct mm_time __percpu *_pcpu_time)
 {
 	unsigned long epoch;
 	int i;
 
 	for_each_possible_cpu(i) {
 		struct mm_sched *pcpu_sched = per_cpu_ptr(_pcpu_sched, i);
+		struct mm_time *pcpu_time = per_cpu_ptr(_pcpu_time, i);
 		struct rq *rq = cpu_rq(i);
 
 		pcpu_sched->runtime = 0;
+		pcpu_time->runtime_ns = 0;
 		pcpu_sched->epoch = rq->cpu_epoch;
 		epoch = rq->cpu_epoch;
 	}
@@ -1379,6 +1382,8 @@ void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
 	 * the readers may get invalid mm_sched_epoch, etc.
 	 */
 	smp_store_release(&mm->pcpu_sched, _pcpu_sched);
+	/* same as above */
+	smp_store_release(&mm->pcpu_time, _pcpu_time);
 }
 
 /* because why would C be fully specified */
@@ -1428,11 +1433,39 @@ static unsigned long __no_profile fraction_mm_sched(struct rq *rq, struct mm_sch
 
 static unsigned int task_running_on_cpu(int cpu, struct task_struct *p);
 
+/* p->pi_lock is held */
+int get_mm_per_llc_runtime(struct task_struct *p, u64 *buf)
+{
+	struct mm_struct *mm = p->mm;
+	struct mm_time *pcpu_time;
+	int cpu;
+
+	if (!mm)
+		return -EINVAL;
+
+	rcu_read_lock();
+	for_each_online_cpu(cpu) {
+		int llc = llc_id(cpu);
+		u64 runtime_ms;
+
+		if (llc < 0)
+			continue;
+
+		pcpu_time = per_cpu_ptr(mm->pcpu_time, cpu);
+		runtime_ms = div_u64(pcpu_time->runtime_ns, NSEC_PER_MSEC);
+		buf[llc] += runtime_ms;
+	}
+	rcu_read_unlock();
+
+	return 0;
+}
+
 static inline
 void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 {
 	struct mm_struct *mm = p->mm;
 	struct mm_sched *pcpu_sched;
+	struct mm_time *pcpu_time;
 	unsigned long epoch;
 	int mm_sched_llc = -1;
 
@@ -1444,14 +1477,17 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	/*
 	 * init_task and kthreads don't having mm
 	 */
-	if (!mm || !mm->pcpu_sched)
+	if (!mm || !mm->pcpu_sched || !mm->pcpu_time)
 		return;
 
 	pcpu_sched = per_cpu_ptr(p->mm->pcpu_sched, cpu_of(rq));
+	pcpu_time  = per_cpu_ptr(p->mm->pcpu_time, cpu_of(rq));
 
 	scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
 		__update_mm_sched(rq, pcpu_sched);
 		pcpu_sched->runtime += delta_exec;
+		/* pure runtime without decay */
+		pcpu_time->runtime_ns += delta_exec;
 		rq->cpu_runtime += delta_exec;
 		epoch = rq->cpu_epoch;
 	}
-- 
2.32.0
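As an illustration only (not part of this series), the per-LLC residency
deltas quoted in the changelog above can be sampled with a small user-space
helper such as the sketch below. It assumes the per-LLC runtime values appear
as a second line of /proc/<pid>/schedstat, as added by this patch, and that
cache-aware scheduling is enabled; MAX_LLCS is an arbitrary bound chosen for
the example.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_LLCS 64	/* arbitrary upper bound for this example */

/* Read the per-LLC runtime line (second line of /proc/<pid>/schedstat). */
static int read_llc_ms(const char *path, unsigned long long *v, int max)
{
	char buf[1024];
	int n = 0;
	FILE *fp = fopen(path, "r");

	if (!fp)
		return -1;
	/* Line 1: sum_exec_runtime, run_delay, pcount.  Line 2: per-LLC ms. */
	if (!fgets(buf, sizeof(buf), fp) || !fgets(buf, sizeof(buf), fp)) {
		fclose(fp);
		return -1;
	}
	fclose(fp);

	for (char *p = strtok(buf, " \n"); p && n < max; p = strtok(NULL, " \n"))
		v[n++] = strtoull(p, NULL, 10);
	return n;
}

int main(int argc, char **argv)
{
	unsigned long long a[MAX_LLCS] = { 0 }, b[MAX_LLCS] = { 0 };
	char path[64];
	int n1, n2;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s/schedstat", argv[1]);

	n1 = read_llc_ms(path, a, MAX_LLCS);
	sleep(10);		/* sampling duration, as in the changelog */
	n2 = read_llc_ms(path, b, MAX_LLCS);

	if (n1 < 0 || n2 < 0) {
		fprintf(stderr, "no per-LLC line in %s\n", path);
		return 1;
	}

	for (int i = 0; i < n2; i++)
		printf("LLC%d: %llu ms\n", i, b[i] - a[i]);
	return 0;
}

Invoked with a PID, it prints one runtime delta per LLC; with cache-aware
load balancing enabled the deltas should concentrate on a single LLC,
matching the enabled case quoted in the changelog.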