From nobody Sun Feb 8 17:30:11 2026
Received: from mgamail.intel.com by smtp.subspace.kernel.org (Postfix) with ESMTPS; Mon, 21 Apr 2025 03:30:08 +0000 (UTC)
From: Chen Yu 
To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . 
Shenoy" Cc: Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Tim Chen , Vincent Guittot , Libo Chen , Abel Wu , Madadi Vineeth Reddy , Hillf Danton , linux-kernel@vger.kernel.org Subject: [RFC PATCH 1/5] sched: Cache aware load-balancing Date: Mon, 21 Apr 2025 11:24:26 +0800 Message-Id: <391c48836585786ed32d66df9534366459684383.1745199017.git.yu.c.chen@intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Peter Zijlstra Hi all, One of the many things on the eternal todo list has been finishing the below hackery. It is an attempt at modelling cache affinity -- and while the patch really only targets LLC, it could very well be extended to also apply to clusters (L2). Specifically any case of multiple cache domains inside a node. Anyway, I wrote this about a year ago, and I mentioned this at the recent OSPM conf where Gautham and Prateek expressed interest in playing with this code. So here goes, very rough and largely unproven code ahead :-) It applies to current tip/master, but I know it will fail the __percpu validation that sits in -next, although that shouldn't be terribly hard to fix up. As is, it only computes a CPU inside the LLC that has the highest recent runtime, this CPU is then used in the wake-up path to steer towards this LLC and in task_hot() to limit migrations away from it. More elaborate things could be done, notably there is an XXX in there somewhere about finding the best LLC inside a NODE (interaction with NUMA_BALANCING). Signed-off-by: Peter Zijlstra (Intel) --- include/linux/mm_types.h | 44 ++++++ include/linux/sched.h | 4 + init/Kconfig | 4 + kernel/fork.c | 5 + kernel/sched/core.c | 13 +- kernel/sched/fair.c | 330 +++++++++++++++++++++++++++++++++++++-- kernel/sched/sched.h | 8 + 7 files changed, 388 insertions(+), 20 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 56d07edd01f9..013291c6aaa2 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -893,6 +893,12 @@ struct mm_cid { }; #endif =20 +struct mm_sched { + u64 runtime; + unsigned long epoch; + unsigned long occ; +}; + struct kioctx_table; struct iommu_mm_data; struct mm_struct { @@ -983,6 +989,17 @@ struct mm_struct { */ raw_spinlock_t cpus_allowed_lock; #endif +#ifdef CONFIG_SCHED_CACHE + /* + * Track per-cpu-per-process occupancy as a proxy for cache residency. + * See account_mm_sched() and ... + */ + struct mm_sched __percpu *pcpu_sched; + raw_spinlock_t mm_sched_lock; + unsigned long mm_sched_epoch; + int mm_sched_cpu; +#endif + #ifdef CONFIG_MMU atomic_long_t pgtables_bytes; /* size of all page tables */ #endif @@ -1393,6 +1410,33 @@ static inline unsigned int mm_cid_size(void) static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct = cpumask *cpumask) { } #endif /* CONFIG_SCHED_MM_CID */ =20 +#ifdef CONFIG_SCHED_CACHE +extern void mm_init_sched(struct mm_struct *mm, struct mm_sched *pcpu_sche= d); + +static inline int mm_alloc_sched_noprof(struct mm_struct *mm) +{ + struct mm_sched *pcpu_sched =3D alloc_percpu_noprof(struct mm_sched); + if (!pcpu_sched) + return -ENOMEM; + + mm_init_sched(mm, pcpu_sched); + return 0; +} + +#define mm_alloc_sched(...) 
alloc_hooks(mm_alloc_sched_noprof(__VA_ARGS__)) + +static inline void mm_destroy_sched(struct mm_struct *mm) +{ + free_percpu(mm->pcpu_sched); + mm->pcpu_sched =3D NULL; +} +#else /* !CONFIG_SCHED_CACHE */ + +static inline int mm_alloc_sched(struct mm_struct *mm) { return 0; } +static inline void mm_destroy_sched(struct mm_struct *mm) { } + +#endif /* CONFIG_SCHED_CACHE */ + struct mmu_gather; extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm); extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct= *mm); diff --git a/include/linux/sched.h b/include/linux/sched.h index f96ac1982893..d0e4cda2b3cd 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1399,6 +1399,10 @@ struct task_struct { unsigned long numa_pages_migrated; #endif /* CONFIG_NUMA_BALANCING */ =20 +#ifdef CONFIG_SCHED_CACHE + struct callback_head cache_work; +#endif + #ifdef CONFIG_RSEQ struct rseq __user *rseq; u32 rseq_len; diff --git a/init/Kconfig b/init/Kconfig index b2c045c71d7f..7e0104efd138 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -950,6 +950,10 @@ config NUMA_BALANCING =20 This system will be inactive on UMA systems. =20 +config SCHED_CACHE + bool "Cache aware scheduler" + default y + config NUMA_BALANCING_DEFAULT_ENABLED bool "Automatically enable NUMA aware memory/task placement" default y diff --git a/kernel/fork.c b/kernel/fork.c index c4b26cd8998b..974869841e62 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1331,6 +1331,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm= , struct task_struct *p, if (mm_alloc_cid(mm, p)) goto fail_cid; =20 + if (mm_alloc_sched(mm)) + goto fail_sched; + if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT, NR_MM_COUNTERS)) goto fail_pcpu; @@ -1340,6 +1343,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm= , struct task_struct *p, return mm; =20 fail_pcpu: + mm_destroy_sched(mm); +fail_sched: mm_destroy_cid(mm); fail_cid: destroy_context(mm); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 79692f85643f..5a92c02df97b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4524,6 +4524,7 @@ static void __sched_fork(unsigned long clone_flags, s= truct task_struct *p) p->migration_pending =3D NULL; #endif init_sched_mm_cid(p); + init_sched_mm(p); } =20 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing); @@ -8528,6 +8529,7 @@ static struct kmem_cache *task_group_cache __ro_after= _init; =20 void __init sched_init(void) { + unsigned long now =3D jiffies; unsigned long ptr =3D 0; int i; =20 @@ -8602,7 +8604,7 @@ void __init sched_init(void) raw_spin_lock_init(&rq->__lock); rq->nr_running =3D 0; rq->calc_load_active =3D 0; - rq->calc_load_update =3D jiffies + LOAD_FREQ; + rq->calc_load_update =3D now + LOAD_FREQ; init_cfs_rq(&rq->cfs); init_rt_rq(&rq->rt); init_dl_rq(&rq->dl); @@ -8646,7 +8648,7 @@ void __init sched_init(void) rq->cpu_capacity =3D SCHED_CAPACITY_SCALE; rq->balance_callback =3D &balance_push_callback; rq->active_balance =3D 0; - rq->next_balance =3D jiffies; + rq->next_balance =3D now; rq->push_cpu =3D 0; rq->cpu =3D i; rq->online =3D 0; @@ -8658,7 +8660,7 @@ void __init sched_init(void) =20 rq_attach_root(rq, &def_root_domain); #ifdef CONFIG_NO_HZ_COMMON - rq->last_blocked_load_update_tick =3D jiffies; + rq->last_blocked_load_update_tick =3D now; atomic_set(&rq->nohz_flags, 0); =20 INIT_CSD(&rq->nohz_csd, nohz_csd_func, rq); @@ -8683,6 +8685,11 @@ void __init sched_init(void) =20 rq->core_cookie =3D 0UL; #endif +#ifdef CONFIG_SCHED_CACHE + 
raw_spin_lock_init(&rq->cpu_epoch_lock); + rq->cpu_epoch_next =3D now; +#endif + zalloc_cpumask_var_node(&rq->scratch_mask, GFP_KERNEL, cpu_to_node(i)); } =20 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 5e1bd9e8464c..23ea35dbd381 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1166,10 +1166,229 @@ static s64 update_curr_se(struct rq *rq, struct sc= hed_entity *curr) return delta_exec; } =20 -static inline void update_curr_task(struct task_struct *p, s64 delta_exec) +#ifdef CONFIG_SCHED_CACHE + +/* + * XXX numbers come from a place the sun don't shine -- probably wants to = be SD + * tunable or so. + */ +#define EPOCH_PERIOD (HZ/100) /* 10 ms */ +#define EPOCH_OLD 5 /* 50 ms */ + +void mm_init_sched(struct mm_struct *mm, struct mm_sched *_pcpu_sched) +{ + unsigned long epoch; + int i; + + for_each_possible_cpu(i) { + struct mm_sched *pcpu_sched =3D per_cpu_ptr(_pcpu_sched, i); + struct rq *rq =3D cpu_rq(i); + + pcpu_sched->runtime =3D 0; + pcpu_sched->epoch =3D epoch =3D rq->cpu_epoch; + pcpu_sched->occ =3D -1; + } + + raw_spin_lock_init(&mm->mm_sched_lock); + mm->mm_sched_epoch =3D epoch; + mm->mm_sched_cpu =3D -1; + + smp_store_release(&mm->pcpu_sched, _pcpu_sched); +} + +/* because why would C be fully specified */ +static __always_inline void __shr_u64(u64 *val, unsigned int n) +{ + if (n >=3D 64) { + *val =3D 0; + return; + } + *val >>=3D n; +} + +static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_= sched) +{ + lockdep_assert_held(&rq->cpu_epoch_lock); + + unsigned long n, now =3D jiffies; + long delta =3D now - rq->cpu_epoch_next; + + if (delta > 0) { + n =3D (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD; + rq->cpu_epoch +=3D n; + rq->cpu_epoch_next +=3D n * EPOCH_PERIOD; + __shr_u64(&rq->cpu_runtime, n); + } + + n =3D rq->cpu_epoch - pcpu_sched->epoch; + if (n) { + pcpu_sched->epoch +=3D n; + __shr_u64(&pcpu_sched->runtime, n); + } +} + +static unsigned long fraction_mm_sched(struct rq *rq, struct mm_sched *pcp= u_sched) +{ + guard(raw_spinlock_irqsave)(&rq->cpu_epoch_lock); + + __update_mm_sched(rq, pcpu_sched); + + /* + * Runtime is a geometric series (r=3D0.5) and as such will sum to twice + * the accumulation period, this means the multiplcation here should + * not overflow. + */ + return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1); +} + +static inline +void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec) +{ + struct mm_struct *mm =3D p->mm; + struct mm_sched *pcpu_sched; + unsigned long epoch; + + /* + * init_task and kthreads don't be having no mm + */ + if (!mm || !mm->pcpu_sched) + return; + + pcpu_sched =3D this_cpu_ptr(p->mm->pcpu_sched); + + scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) { + __update_mm_sched(rq, pcpu_sched); + pcpu_sched->runtime +=3D delta_exec; + rq->cpu_runtime +=3D delta_exec; + epoch =3D rq->cpu_epoch; + } + + /* + * If this task hasn't hit task_cache_work() for a while, invalidate + * it's preferred state. 
+ */ + if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_OLD) { + mm->mm_sched_cpu =3D -1; + pcpu_sched->occ =3D -1; + } +} + +static void task_tick_cache(struct rq *rq, struct task_struct *p) +{ + struct callback_head *work =3D &p->cache_work; + struct mm_struct *mm =3D p->mm; + + if (!mm || !mm->pcpu_sched) + return; + + if (mm->mm_sched_epoch =3D=3D rq->cpu_epoch) + return; + + guard(raw_spinlock)(&mm->mm_sched_lock); + + if (mm->mm_sched_epoch =3D=3D rq->cpu_epoch) + return; + + if (work->next =3D=3D work) { + task_work_add(p, work, TWA_RESUME); + WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch); + } +} + +static void task_cache_work(struct callback_head *work) +{ + struct task_struct *p =3D current; + struct mm_struct *mm =3D p->mm; + unsigned long m_a_occ =3D 0; + int cpu, m_a_cpu =3D -1; + cpumask_var_t cpus; + + WARN_ON_ONCE(work !=3D &p->cache_work); + + work->next =3D work; + + if (p->flags & PF_EXITING) + return; + + if (!alloc_cpumask_var(&cpus, GFP_KERNEL)) + return; + + scoped_guard (cpus_read_lock) { + cpumask_copy(cpus, cpu_online_mask); + + for_each_cpu(cpu, cpus) { + /* XXX sched_cluster_active */ + struct sched_domain *sd =3D per_cpu(sd_llc, cpu); + unsigned long occ, m_occ =3D 0, a_occ =3D 0; + int m_cpu =3D -1, nr =3D 0, i; + + for_each_cpu(i, sched_domain_span(sd)) { + occ =3D fraction_mm_sched(cpu_rq(i), + per_cpu_ptr(mm->pcpu_sched, i)); + a_occ +=3D occ; + if (occ > m_occ) { + m_occ =3D occ; + m_cpu =3D i; + } + nr++; + trace_printk("(%d) occ: %ld m_occ: %ld m_cpu: %d nr: %d\n", + per_cpu(sd_llc_id, i), occ, m_occ, m_cpu, nr); + } + + a_occ /=3D nr; + if (a_occ > m_a_occ) { + m_a_occ =3D a_occ; + m_a_cpu =3D m_cpu; + } + + trace_printk("(%d) a_occ: %ld m_a_occ: %ld\n", + per_cpu(sd_llc_id, cpu), a_occ, m_a_occ); + + for_each_cpu(i, sched_domain_span(sd)) { + /* XXX threshold ? */ + per_cpu_ptr(mm->pcpu_sched, i)->occ =3D a_occ; + } + + cpumask_andnot(cpus, cpus, sched_domain_span(sd)); + } + } + + /* + * If the max average cache occupancy is 'small' we don't care. 
+ */ + if (m_a_occ < (NICE_0_LOAD >> EPOCH_OLD)) + m_a_cpu =3D -1; + + mm->mm_sched_cpu =3D m_a_cpu; + + free_cpumask_var(cpus); +} + +void init_sched_mm(struct task_struct *p) +{ + struct callback_head *work =3D &p->cache_work; + init_task_work(work, task_cache_work); + work->next =3D work; +} + +#else + +static inline void account_mm_sched(struct rq *rq, struct task_struct *p, + s64 delta_exec) { } + + +void init_sched_mm(struct task_struct *p) { } + +static void task_tick_cache(struct rq *rq, struct task_struct *p) { } + +#endif + +static inline +void update_curr_task(struct rq *rq, struct task_struct *p, s64 delta_exec) { trace_sched_stat_runtime(p, delta_exec); account_group_exec_runtime(p, delta_exec); + account_mm_sched(rq, p, delta_exec); cgroup_account_cputime(p, delta_exec); } =20 @@ -1215,7 +1434,7 @@ s64 update_curr_common(struct rq *rq) =20 delta_exec =3D update_curr_se(rq, &donor->se); if (likely(delta_exec > 0)) - update_curr_task(donor, delta_exec); + update_curr_task(rq, donor, delta_exec); =20 return delta_exec; } @@ -1244,7 +1463,7 @@ static void update_curr(struct cfs_rq *cfs_rq) if (entity_is_task(curr)) { struct task_struct *p =3D task_of(curr); =20 - update_curr_task(p, delta_exec); + update_curr_task(rq, p, delta_exec); =20 /* * If the fair_server is active, we need to account for the @@ -7843,7 +8062,7 @@ static int select_idle_sibling(struct task_struct *p,= int prev, int target) * per-cpu select_rq_mask usage */ lockdep_assert_irqs_disabled(); - +again: if ((available_idle_cpu(target) || sched_idle_cpu(target)) && asym_fits_cpu(task_util, util_min, util_max, target)) return target; @@ -7881,7 +8100,8 @@ static int select_idle_sibling(struct task_struct *p,= int prev, int target) /* Check a recently used CPU as a potential idle candidate: */ recent_used_cpu =3D p->recent_used_cpu; p->recent_used_cpu =3D prev; - if (recent_used_cpu !=3D prev && + if (prev =3D=3D p->wake_cpu && + recent_used_cpu !=3D prev && recent_used_cpu !=3D target && cpus_share_cache(recent_used_cpu, target) && (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cp= u)) && @@ -7934,6 +8154,18 @@ static int select_idle_sibling(struct task_struct *p= , int prev, int target) if ((unsigned)i < nr_cpumask_bits) return i; =20 + if (prev !=3D p->wake_cpu && !cpus_share_cache(prev, p->wake_cpu)) { + /* + * Most likely select_cache_cpu() will have re-directed + * the wakeup, but getting here means the preferred cache is + * too busy, so re-try with the actual previous. + * + * XXX wake_affine is lost for this pass. 
+ */ + prev =3D target =3D p->wake_cpu; + goto again; + } + /* * For cluster machines which have lower sharing cache like L2 or * LLC Tag, we tend to find an idle CPU in the target's cluster @@ -8556,6 +8788,40 @@ static int find_energy_efficient_cpu(struct task_str= uct *p, int prev_cpu) return target; } =20 +#ifdef CONFIG_SCHED_CACHE +static long __migrate_degrades_locality(struct task_struct *p, int src_cpu= , int dst_cpu, bool idle); + +static int select_cache_cpu(struct task_struct *p, int prev_cpu) +{ + struct mm_struct *mm =3D p->mm; + int cpu; + + if (!mm || p->nr_cpus_allowed =3D=3D 1) + return prev_cpu; + + cpu =3D mm->mm_sched_cpu; + if (cpu < 0) + return prev_cpu; + + + if (static_branch_likely(&sched_numa_balancing) && + __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) { + /* + * XXX look for max occupancy inside prev_cpu's node + */ + return prev_cpu; + } + + return cpu; +} +#else +static int select_cache_cpu(struct task_struct *p, int prev_cpu) +{ + return prev_cpu; +} +#endif + + /* * select_task_rq_fair: Select target runqueue for the waking task in doma= ins * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAK= E, @@ -8581,6 +8847,8 @@ select_task_rq_fair(struct task_struct *p, int prev_c= pu, int wake_flags) * required for stable ->cpus_allowed */ lockdep_assert_held(&p->pi_lock); + guard(rcu)(); + if (wake_flags & WF_TTWU) { record_wakee(p); =20 @@ -8588,6 +8856,8 @@ select_task_rq_fair(struct task_struct *p, int prev_c= pu, int wake_flags) cpumask_test_cpu(cpu, p->cpus_ptr)) return cpu; =20 + new_cpu =3D prev_cpu =3D select_cache_cpu(p, prev_cpu); + if (!is_rd_overutilized(this_rq()->rd)) { new_cpu =3D find_energy_efficient_cpu(p, prev_cpu); if (new_cpu >=3D 0) @@ -8598,7 +8868,6 @@ select_task_rq_fair(struct task_struct *p, int prev_c= pu, int wake_flags) want_affine =3D !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr); } =20 - rcu_read_lock(); for_each_domain(cpu, tmp) { /* * If both 'cpu' and 'prev_cpu' are part of this domain, @@ -8631,7 +8900,6 @@ select_task_rq_fair(struct task_struct *p, int prev_c= pu, int wake_flags) /* Fast path */ new_cpu =3D select_idle_sibling(p, prev_cpu, new_cpu); } - rcu_read_unlock(); =20 return new_cpu; } @@ -9281,6 +9549,17 @@ static int task_hot(struct task_struct *p, struct lb= _env *env) if (sysctl_sched_migration_cost =3D=3D 0) return 0; =20 +#ifdef CONFIG_SCHED_CACHE + if (p->mm && p->mm->pcpu_sched) { + /* + * XXX things like Skylake have non-inclusive L3 and might not + * like this L3 centric view. What to do about L2 stickyness ? + */ + return per_cpu_ptr(p->mm->pcpu_sched, env->src_cpu)->occ > + per_cpu_ptr(p->mm->pcpu_sched, env->dst_cpu)->occ; + } +#endif + delta =3D rq_clock_task(env->src_rq) - p->se.exec_start; =20 return delta < (s64)sysctl_sched_migration_cost; @@ -9292,27 +9571,25 @@ static int task_hot(struct task_struct *p, struct l= b_env *env) * Returns 0, if task migration is not affected by locality. * Returns a negative value, if task migration improves locality i.e migra= tion preferred. 
*/ -static long migrate_degrades_locality(struct task_struct *p, struct lb_env= *env) +static long __migrate_degrades_locality(struct task_struct *p, int src_cpu= , int dst_cpu, bool idle) { struct numa_group *numa_group =3D rcu_dereference(p->numa_group); unsigned long src_weight, dst_weight; int src_nid, dst_nid, dist; =20 - if (!static_branch_likely(&sched_numa_balancing)) - return 0; - - if (!p->numa_faults || !(env->sd->flags & SD_NUMA)) + if (!p->numa_faults) return 0; =20 - src_nid =3D cpu_to_node(env->src_cpu); - dst_nid =3D cpu_to_node(env->dst_cpu); + src_nid =3D cpu_to_node(src_cpu); + dst_nid =3D cpu_to_node(dst_cpu); =20 if (src_nid =3D=3D dst_nid) return 0; =20 /* Migrating away from the preferred node is always bad. */ if (src_nid =3D=3D p->numa_preferred_nid) { - if (env->src_rq->nr_running > env->src_rq->nr_preferred_running) + struct rq *src_rq =3D cpu_rq(src_cpu); + if (src_rq->nr_running > src_rq->nr_preferred_running) return 1; else return 0; @@ -9323,7 +9600,7 @@ static long migrate_degrades_locality(struct task_str= uct *p, struct lb_env *env) return -1; =20 /* Leaving a core idle is often worse than degrading locality. */ - if (env->idle =3D=3D CPU_IDLE) + if (idle) return 0; =20 dist =3D node_distance(src_nid, dst_nid); @@ -9338,7 +9615,24 @@ static long migrate_degrades_locality(struct task_st= ruct *p, struct lb_env *env) return src_weight - dst_weight; } =20 +static long migrate_degrades_locality(struct task_struct *p, struct lb_env= *env) +{ + if (!static_branch_likely(&sched_numa_balancing)) + return 0; + + if (!(env->sd->flags & SD_NUMA)) + return 0; + + return __migrate_degrades_locality(p, env->src_cpu, env->dst_cpu, + env->idle =3D=3D CPU_IDLE); +} + #else +static long __migrate_degrades_locality(struct task_struct *p, int src_cpu= , int dst_cpu, bool idle) +{ + return 0; +} + static inline long migrate_degrades_locality(struct task_struct *p, struct lb_env *env) { @@ -13098,8 +13392,8 @@ static inline void task_tick_core(struct rq *rq, st= ruct task_struct *curr) {} */ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int qu= eued) { - struct cfs_rq *cfs_rq; struct sched_entity *se =3D &curr->se; + struct cfs_rq *cfs_rq; =20 for_each_sched_entity(se) { cfs_rq =3D cfs_rq_of(se); @@ -13109,6 +13403,8 @@ static void task_tick_fair(struct rq *rq, struct ta= sk_struct *curr, int queued) if (static_branch_unlikely(&sched_numa_balancing)) task_tick_numa(rq, curr); =20 + task_tick_cache(rq, curr); + update_misfit_status(curr, rq); check_update_overutilized_status(task_rq(curr)); =20 diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index c5a6a503eb6d..1b6d7e374bc3 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1173,6 +1173,12 @@ struct rq { u64 clock_pelt_idle_copy; u64 clock_idle_copy; #endif +#ifdef CONFIG_SCHED_CACHE + raw_spinlock_t cpu_epoch_lock; + u64 cpu_runtime; + unsigned long cpu_epoch; + unsigned long cpu_epoch_next; +#endif =20 atomic_t nr_iowait; =20 @@ -3887,6 +3893,8 @@ static inline void task_tick_mm_cid(struct rq *rq, st= ruct task_struct *curr) { } static inline void init_sched_mm_cid(struct task_struct *t) { } #endif /* !CONFIG_SCHED_MM_CID */ =20 +extern void init_sched_mm(struct task_struct *p); + extern u64 avg_vruntime(struct cfs_rq *cfs_rq); extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se); #ifdef CONFIG_SMP --=20 2.25.1 From nobody Sun Feb 8 17:30:11 2026 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.12]) (using TLSv1.2 with cipher 
ECDHE-RSA-AES256-GCM-SHA384) by smtp.subspace.kernel.org (Postfix) with ESMTPS; Mon, 21 Apr 2025 03:30:32 +0000 (UTC)
From: Chen Yu 
To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . 
Shenoy" Cc: Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Tim Chen , Vincent Guittot , Libo Chen , Abel Wu , Madadi Vineeth Reddy , Hillf Danton , linux-kernel@vger.kernel.org, Chen Yu Subject: [RFC PATCH 2/5] sched: Several fixes for cache aware scheduling Date: Mon, 21 Apr 2025 11:24:41 +0800 Message-Id: <660bc36a8aacc6ba55fbcf8b0f9f05b6326e69ce.1745199017.git.yu.c.chen@intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" 1. Fix the compile errors on per-CPU allocation. 2. Enqueue tasks to the target CPU instead of the current CPU; otherwise, the per-CPU occupancy will be messed up. 3. Fix the NULL LLC sched domain issue(Libo Chen). 4. Avoid duplicated epoch check in task_tick_cache() 5. Introduce sched feature SCHED_CACHE to control cache aware scheduling TBD suggestion in previous version: move cache_work from per task to per mm_struct, consider the actual cpu capacity in fraction_mm_sched() (Abel Wu) Signed-off-by: Chen Yu --- include/linux/mm_types.h | 4 ++-- kernel/sched/fair.c | 15 +++++++++------ kernel/sched/features.h | 1 + 3 files changed, 12 insertions(+), 8 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 013291c6aaa2..9de4a0a13c4d 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1411,11 +1411,11 @@ static inline void mm_set_cpus_allowed(struct mm_st= ruct *mm, const struct cpumas #endif /* CONFIG_SCHED_MM_CID */ =20 #ifdef CONFIG_SCHED_CACHE -extern void mm_init_sched(struct mm_struct *mm, struct mm_sched *pcpu_sche= d); +extern void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *= pcpu_sched); =20 static inline int mm_alloc_sched_noprof(struct mm_struct *mm) { - struct mm_sched *pcpu_sched =3D alloc_percpu_noprof(struct mm_sched); + struct mm_sched __percpu *pcpu_sched =3D alloc_percpu_noprof(struct mm_sc= hed); if (!pcpu_sched) return -ENOMEM; =20 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 23ea35dbd381..22b5830e7e4e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1175,7 +1175,7 @@ static s64 update_curr_se(struct rq *rq, struct sched= _entity *curr) #define EPOCH_PERIOD (HZ/100) /* 10 ms */ #define EPOCH_OLD 5 /* 50 ms */ =20 -void mm_init_sched(struct mm_struct *mm, struct mm_sched *_pcpu_sched) +void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_s= ched) { unsigned long epoch; int i; @@ -1254,7 +1254,7 @@ void account_mm_sched(struct rq *rq, struct task_stru= ct *p, s64 delta_exec) if (!mm || !mm->pcpu_sched) return; =20 - pcpu_sched =3D this_cpu_ptr(p->mm->pcpu_sched); + pcpu_sched =3D per_cpu_ptr(p->mm->pcpu_sched, cpu_of(rq)); =20 scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) { __update_mm_sched(rq, pcpu_sched); @@ -1286,9 +1286,6 @@ static void task_tick_cache(struct rq *rq, struct tas= k_struct *p) =20 guard(raw_spinlock)(&mm->mm_sched_lock); =20 - if (mm->mm_sched_epoch =3D=3D rq->cpu_epoch) - return; - if (work->next =3D=3D work) { task_work_add(p, work, TWA_RESUME); WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch); @@ -1322,6 +1319,9 @@ static void task_cache_work(struct callback_head *wor= k) unsigned long occ, m_occ =3D 0, a_occ =3D 0; int m_cpu =3D -1, nr =3D 0, i; =20 + if (!sd) + continue; + for_each_cpu(i, sched_domain_span(sd)) { occ =3D 
fraction_mm_sched(cpu_rq(i),
 						per_cpu_ptr(mm->pcpu_sched, i));
@@ -8796,6 +8796,9 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 	struct mm_struct *mm = p->mm;
 	int cpu;
 
+	if (!sched_feat(SCHED_CACHE))
+		return prev_cpu;
+
 	if (!mm || p->nr_cpus_allowed == 1)
 		return prev_cpu;
 
@@ -9550,7 +9553,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 		return 0;
 
 #ifdef CONFIG_SCHED_CACHE
-	if (p->mm && p->mm->pcpu_sched) {
+	if (sched_feat(SCHED_CACHE) && p->mm && p->mm->pcpu_sched) {
 		/*
 		 * XXX things like Skylake have non-inclusive L3 and might not
 		 * like this L3 centric view. What to do about L2 stickyness ?
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f93331..d2af7bfd36bf 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -87,6 +87,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_UTIL, true)
 
+SCHED_FEAT(SCHED_CACHE, true)
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
  * in a single rq->lock section. Default disabled because the
-- 
2.25.1
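A note on the occupancy arithmetic used by patches 1-2: the per-CPU and per-mm runtime sums are halved once per EPOCH_PERIOD (10 ms), so each converges to roughly twice the amount accumulated per epoch, and fraction_mm_sched() reports their ratio scaled to NICE_0_LOAD. The stand-alone user-space sketch below only mirrors that math for illustration; the struct, the simulation loop and its 60%-busy workload are invented for the example and are not kernel code.

/*
 * User-space illustration only (not kernel code): mimics the epoch
 * halving of __update_mm_sched() and the occupancy fraction of
 * fraction_mm_sched().  EPOCH_NS and NICE_0_LOAD mirror the patch;
 * the workload figures below are made up.
 */
#include <stdio.h>
#include <stdint.h>

#define NICE_0_LOAD	1024ULL
#define EPOCH_NS	(10ULL * 1000 * 1000)	/* EPOCH_PERIOD: 10 ms */

struct toy_epoch {
	uint64_t cpu_runtime;	/* decayed sum of all runtime on this CPU */
	uint64_t mm_runtime;	/* decayed sum of this mm's runtime here  */
};

/* Age both sums by n epochs: each epoch halves the accumulated value. */
static void age_epochs(struct toy_epoch *t, unsigned int n)
{
	if (n >= 64) {			/* same guard as __shr_u64() */
		t->cpu_runtime = t->mm_runtime = 0;
		return;
	}
	t->cpu_runtime >>= n;
	t->mm_runtime >>= n;
}

/* Occupancy on the same 0..NICE_0_LOAD scale as fraction_mm_sched(). */
static uint64_t occupancy(const struct toy_epoch *t)
{
	return NICE_0_LOAD * t->mm_runtime / (t->cpu_runtime + 1);
}

int main(void)
{
	struct toy_epoch t = { 0, 0 };
	int epoch;

	/* The mm owns 6 ms of every fully busy 10 ms epoch on this CPU... */
	for (epoch = 0; epoch < 10; epoch++) {
		age_epochs(&t, 1);
		t.cpu_runtime += EPOCH_NS;
		t.mm_runtime  += 6 * EPOCH_NS / 10;
		printf("epoch %2d: occ %4llu / %llu\n", epoch,
		       (unsigned long long)occupancy(&t),
		       (unsigned long long)NICE_0_LOAD);
	}

	/* ...then stops running there; watch the occupancy decay away. */
	for (; epoch < 20; epoch++) {
		age_epochs(&t, 1);
		t.cpu_runtime += EPOCH_NS;
		printf("epoch %2d: occ %4llu / %llu\n", epoch,
		       (unsigned long long)occupancy(&t),
		       (unsigned long long)NICE_0_LOAD);
	}
	return 0;
}

Built with a plain cc, it shows the occupancy climbing towards about 614/1024 (roughly 0.6) while the task runs there, then halving every epoch once it stops, which is the signal the mm_sched_cpu selection keys off.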
From nobody Sun Feb 8 17:30:11 2026
Received: from mgamail.intel.com by smtp.subspace.kernel.org (Postfix) with ESMTPS; Mon, 21 Apr 2025 03:30:44 +0000 (UTC)
From: Chen Yu 
To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" 
Cc: Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Tim Chen , Vincent Guittot , Libo Chen , Abel Wu , Madadi Vineeth Reddy , Hillf Danton , linux-kernel@vger.kernel.org, Chen Yu 
Subject: [RFC PATCH 3/5] sched: Avoid task migration within its preferred LLC
Date: Mon, 21 Apr 2025 11:25:04 +0800
Message-Id: <01a54d63193fab5c819aab75321f6aa492491997.1745199017.git.yu.c.chen@intel.com>

It was found that when schbench is running, there is a significant amount
of in-LLC task migration, even if the wakee is woken up on its preferred
LLC. This leads to core-to-core latency and impairs performance. Inhibit
task migration if the wakee is already in its preferred LLC. Meanwhile,
prevent the load balancer from treating the task as cache-hot if this task
is being migrated out of its preferred LLC, instead of comparing occupancy
between CPUs directly.

With this enhancement applied, the in-LLC task migration has been reduced
significantly (use PATCH 5/5 to verify).

Signed-off-by: Chen Yu 
---
 kernel/sched/fair.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22b5830e7e4e..1733eb83042c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8806,6 +8806,12 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 	if (cpu < 0)
 		return prev_cpu;
 
+	/*
+	 * No need to migrate the task if previous and preferred CPU
+	 * are in the same LLC.
+	 */
+	if (cpus_share_cache(prev_cpu, cpu))
+		return prev_cpu;
 
 	if (static_branch_likely(&sched_numa_balancing) &&
 	    __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) {
@@ -9553,14 +9559,13 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 		return 0;
 
 #ifdef CONFIG_SCHED_CACHE
-	if (sched_feat(SCHED_CACHE) && p->mm && p->mm->pcpu_sched) {
-		/*
-		 * XXX things like Skylake have non-inclusive L3 and might not
-		 * like this L3 centric view. What to do about L2 stickyness ?
-		 */
-		return per_cpu_ptr(p->mm->pcpu_sched, env->src_cpu)->occ >
-		       per_cpu_ptr(p->mm->pcpu_sched, env->dst_cpu)->occ;
-	}
+	/*
+	 * Don't migrate task out of its preferred LLC.
+	 */
+	if (sched_feat(SCHED_CACHE) && p->mm && p->mm->mm_sched_cpu >= 0 &&
+	    cpus_share_cache(env->src_cpu, p->mm->mm_sched_cpu) &&
+	    !cpus_share_cache(env->src_cpu, env->dst_cpu))
+		return 1;
 #endif
 
 	delta = rq_clock_task(env->src_rq) - p->se.exec_start;
-- 
2.25.1
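The two checks added by this patch are easiest to see side by side. The toy program below models them against a stubbed two-LLC topology; llc_of[] and the user-space cpus_share_cache() are stand-ins invented for the example, so treat it as a sketch of the decision logic rather than the kernel implementation.

/*
 * Toy model of the two checks in this patch.  The topology, llc_of[]
 * and this user-space cpus_share_cache() are invented for the example.
 */
#include <stdio.h>
#include <stdbool.h>

static const int llc_of[8] = { 0, 0, 0, 0, 1, 1, 1, 1 };	/* 2 LLCs x 4 CPUs */

static bool cpus_share_cache(int a, int b)
{
	return llc_of[a] == llc_of[b];
}

/* Wakeup side: stay on prev_cpu if it already sits in the preferred LLC. */
static int toy_select_cache_cpu(int prev_cpu, int mm_sched_cpu)
{
	if (mm_sched_cpu < 0 || cpus_share_cache(prev_cpu, mm_sched_cpu))
		return prev_cpu;
	return mm_sched_cpu;
}

/*
 * Load-balance side: report "cache hot" when the migration would pull
 * the task out of its preferred LLC.
 */
static bool toy_task_hot(int src_cpu, int dst_cpu, int mm_sched_cpu)
{
	return mm_sched_cpu >= 0 &&
	       cpus_share_cache(src_cpu, mm_sched_cpu) &&
	       !cpus_share_cache(src_cpu, dst_cpu);
}

int main(void)
{
	int pref = 5;	/* preferred LLC is the one holding CPU 5 */

	printf("wakeup prev=6 -> cpu %d (same LLC as preferred, stays)\n",
	       toy_select_cache_cpu(6, pref));
	printf("wakeup prev=1 -> cpu %d (redirected to preferred LLC)\n",
	       toy_select_cache_cpu(1, pref));
	printf("balance 6->2: hot=%d (would leave preferred LLC)\n",
	       toy_task_hot(6, 2, pref));
	printf("balance 6->7: hot=%d (stays inside preferred LLC)\n",
	       toy_task_hot(6, 7, pref));
	return 0;
}

The wakeup side leaves a task on prev_cpu whenever prev_cpu already sits in the preferred LLC, and the load-balance side reports the task as hot exactly when a move would cross out of that LLC; together this is what suppresses the in-LLC bouncing described in the changelog.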
From nobody Sun Feb 8 17:30:11 2026
Received: from mgamail.intel.com by smtp.subspace.kernel.org (Postfix) with ESMTPS; Mon, 21 Apr 2025 03:30:58 +0000 (UTC)
From: Chen Yu 
To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . 
Shenoy" Cc: Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Tim Chen , Vincent Guittot , Libo Chen , Abel Wu , Madadi Vineeth Reddy , Hillf Danton , linux-kernel@vger.kernel.org, Chen Yu Subject: [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated Date: Mon, 21 Apr 2025 11:25:18 +0800 Message-Id: <2c45f6db1efef84c6c1ed514a8d24a9bc4a2ca4b.1745199017.git.yu.c.chen@intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" It is found that when the process's preferred LLC gets saturated by too many threads, task contention is very frequent and causes performance regression. Save the per LLC statistics calculated by periodic load balance. The statis= tics include the average utilization and the average number of runnable tasks. The task wakeup path for cache aware scheduling manipulates these statistics to inhibit cache aware scheduling to avoid performance regression. When eit= her the average utilization of the preferred LLC has reached 25%, or the average number of runnable tasks has exceeded 1/3 of the LLC weight, the cache aware wakeup is disabled. Only when the process has more threads than the LLC wei= ght will this restriction be enabled. Running schbench via mmtests on a Xeon platform, which has 2 sockets, each = socket has 60 Cores/120 CPUs. The DRAM interleave is enabled across NUMA nodes via= BIOS, so there are 2 "LLCs" in 1 NUMA node. compare-mmtests.pl --directory work/log --benchmark schbench --names baseli= ne,sched_cache baselin sched_cach baseline sched_cache Lat 50.0th-qrtle-1 6.00 ( 0.00%) 6.00 ( 0.00%) Lat 90.0th-qrtle-1 10.00 ( 0.00%) 9.00 ( 10.00%) Lat 99.0th-qrtle-1 29.00 ( 0.00%) 13.00 ( 55.17%) Lat 99.9th-qrtle-1 35.00 ( 0.00%) 21.00 ( 40.00%) Lat 20.0th-qrtle-1 266.00 ( 0.00%) 266.00 ( 0.00%) Lat 50.0th-qrtle-2 8.00 ( 0.00%) 6.00 ( 25.00%) Lat 90.0th-qrtle-2 10.00 ( 0.00%) 10.00 ( 0.00%) Lat 99.0th-qrtle-2 19.00 ( 0.00%) 18.00 ( 5.26%) Lat 99.9th-qrtle-2 27.00 ( 0.00%) 29.00 ( -7.41%) Lat 20.0th-qrtle-2 533.00 ( 0.00%) 507.00 ( 4.88%) Lat 50.0th-qrtle-4 6.00 ( 0.00%) 5.00 ( 16.67%) Lat 90.0th-qrtle-4 8.00 ( 0.00%) 5.00 ( 37.50%) Lat 99.0th-qrtle-4 14.00 ( 0.00%) 9.00 ( 35.71%) Lat 99.9th-qrtle-4 22.00 ( 0.00%) 14.00 ( 36.36%) Lat 20.0th-qrtle-4 1070.00 ( 0.00%) 995.00 ( 7.01%) Lat 50.0th-qrtle-8 5.00 ( 0.00%) 5.00 ( 0.00%) Lat 90.0th-qrtle-8 7.00 ( 0.00%) 5.00 ( 28.57%) Lat 99.0th-qrtle-8 12.00 ( 0.00%) 11.00 ( 8.33%) Lat 99.9th-qrtle-8 19.00 ( 0.00%) 16.00 ( 15.79%) Lat 20.0th-qrtle-8 2140.00 ( 0.00%) 2140.00 ( 0.00%) Lat 50.0th-qrtle-16 6.00 ( 0.00%) 5.00 ( 16.67%) Lat 90.0th-qrtle-16 7.00 ( 0.00%) 5.00 ( 28.57%) Lat 99.0th-qrtle-16 12.00 ( 0.00%) 10.00 ( 16.67%) Lat 99.9th-qrtle-16 17.00 ( 0.00%) 14.00 ( 17.65%) Lat 20.0th-qrtle-16 4296.00 ( 0.00%) 4200.00 ( 2.23%) Lat 50.0th-qrtle-32 6.00 ( 0.00%) 5.00 ( 16.67%) Lat 90.0th-qrtle-32 8.00 ( 0.00%) 6.00 ( 25.00%) Lat 99.0th-qrtle-32 12.00 ( 0.00%) 10.00 ( 16.67%) Lat 99.9th-qrtle-32 17.00 ( 0.00%) 14.00 ( 17.65%) Lat 20.0th-qrtle-32 8496.00 ( 0.00%) 8528.00 ( -0.38%) Lat 50.0th-qrtle-64 6.00 ( 0.00%) 5.00 ( 16.67%) Lat 90.0th-qrtle-64 8.00 ( 0.00%) 8.00 ( 0.00%) Lat 99.0th-qrtle-64 12.00 ( 0.00%) 12.00 ( 0.00%) Lat 99.9th-qrtle-64 17.00 ( 0.00%) 17.00 ( 0.00%) Lat 20.0th-qrtle-64 17120.00 ( 0.00%) 17120.00 ( 
0.00%) Lat 50.0th-qrtle-128 7.00 ( 0.00%) 7.00 ( 0.00%) Lat 90.0th-qrtle-128 9.00 ( 0.00%) 9.00 ( 0.00%) Lat 99.0th-qrtle-128 13.00 ( 0.00%) 14.00 ( -7.69%) Lat 99.9th-qrtle-128 20.00 ( 0.00%) 20.00 ( 0.00%) Lat 20.0th-qrtle-128 31776.00 ( 0.00%) 30496.00 ( 4.03%) Lat 50.0th-qrtle-239 9.00 ( 0.00%) 9.00 ( 0.00%) Lat 90.0th-qrtle-239 14.00 ( 0.00%) 18.00 ( -28.57%) Lat 99.0th-qrtle-239 43.00 ( 0.00%) 56.00 ( -30.23%) Lat 99.9th-qrtle-239 106.00 ( 0.00%) 483.00 (-355.66%) Lat 20.0th-qrtle-239 30176.00 ( 0.00%) 29984.00 ( 0.64%) We can see overall latency improvement and some throughput degradation when the system gets saturated. Also, we run schbench (old version) on an EPYC 7543 system, which has 4 NUMA nodes, and each node has 4 LLCs. Monitor the 99.0th latency: case load baseline(std%) compare%( std%) normal 4-mthreads-1-workers 1.00 ( 6.47) +9.02 ( 4= .68) normal 4-mthreads-2-workers 1.00 ( 3.25) +28.03 ( 8= .76) normal 4-mthreads-4-workers 1.00 ( 6.67) -4.32 ( 2= .58) normal 4-mthreads-8-workers 1.00 ( 2.38) +1.27 ( 2= .41) normal 4-mthreads-16-workers 1.00 ( 5.61) -8.48 ( 4= .39) normal 4-mthreads-31-workers 1.00 ( 9.31) -0.22 ( 9= .77) When the LLC is underloaded, the latency improvement is observed. When the = LLC gets saturated, we observe some degradation. The aggregation of tasks will move tasks towards the preferred LLC pretty quickly during wake ups. However load balance will tend to move tasks away from the aggregated LLC. The two migrations are in the opposite directions and tend to bounce tasks between LLCs. Such task migrations should be impeded in load balancing as long as the home LLC. We're working on fixing up the load balancing path to address such issues. Co-developed-by: Tim Chen Signed-off-by: Tim Chen Signed-off-by: Chen Yu --- include/linux/sched/topology.h | 4 ++ kernel/sched/fair.c | 101 ++++++++++++++++++++++++++++++++- 2 files changed, 104 insertions(+), 1 deletion(-) diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index 198bb5cc1774..9625d9d762f5 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -78,6 +78,10 @@ struct sched_domain_shared { atomic_t nr_busy_cpus; int has_idle_cores; int nr_idle_scan; +#ifdef CONFIG_SCHED_CACHE + unsigned long util_avg; + u64 nr_avg; +#endif }; =20 struct sched_domain { diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1733eb83042c..f74d8773c811 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8791,6 +8791,58 @@ static int find_energy_efficient_cpu(struct task_str= uct *p, int prev_cpu) #ifdef CONFIG_SCHED_CACHE static long __migrate_degrades_locality(struct task_struct *p, int src_cpu= , int dst_cpu, bool idle); =20 +/* expected to be protected by rcu_read_lock() */ +static bool get_llc_stats(int cpu, int *nr, int *weight, unsigned long *ut= il) +{ + struct sched_domain_shared *sd_share; + + sd_share =3D rcu_dereference(per_cpu(sd_llc_shared, cpu)); + if (!sd_share) + return false; + + *nr =3D READ_ONCE(sd_share->nr_avg); + *util =3D READ_ONCE(sd_share->util_avg); + *weight =3D per_cpu(sd_llc_size, cpu); + + return true; +} + +static bool valid_target_cpu(int cpu, struct task_struct *p) +{ + int nr_running, llc_weight; + unsigned long util, llc_cap; + + if (!get_llc_stats(cpu, &nr_running, &llc_weight, + &util)) + return false; + + llc_cap =3D llc_weight * SCHED_CAPACITY_SCALE; + + /* + * If this process has many threads, be careful to avoid + * task stacking on the preferred LLC, by checking the system's + * utilization and runnable tasks. 
Otherwise, if this + * process does not have many threads, honor the cache + * aware wakeup. + */ + if (get_nr_threads(p) < llc_weight) + return true; + + /* + * Check if it exceeded 25% of average utiliazation, + * or if it exceeded 33% of CPUs. This is a magic number + * that did not cause heavy cache contention on Xeon or + * Zen. + */ + if (util * 4 >=3D llc_cap) + return false; + + if (nr_running * 3 >=3D llc_weight) + return false; + + return true; +} + static int select_cache_cpu(struct task_struct *p, int prev_cpu) { struct mm_struct *mm =3D p->mm; @@ -8813,6 +8865,9 @@ static int select_cache_cpu(struct task_struct *p, in= t prev_cpu) if (cpus_share_cache(prev_cpu, cpu)) return prev_cpu; =20 + if (!valid_target_cpu(cpu, p)) + return prev_cpu; + if (static_branch_likely(&sched_numa_balancing) && __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) { /* @@ -9564,7 +9619,8 @@ static int task_hot(struct task_struct *p, struct lb_= env *env) */ if (sched_feat(SCHED_CACHE) && p->mm && p->mm->mm_sched_cpu >=3D 0 && cpus_share_cache(env->src_cpu, p->mm->mm_sched_cpu) && - !cpus_share_cache(env->src_cpu, env->dst_cpu)) + !cpus_share_cache(env->src_cpu, env->dst_cpu) && + !valid_target_cpu(env->dst_cpu, p)) return 1; #endif =20 @@ -10634,6 +10690,48 @@ sched_reduced_capacity(struct rq *rq, struct sched= _domain *sd) return check_cpu_capacity(rq, sd); } =20 +#ifdef CONFIG_SCHED_CACHE +/* + * Save this sched group's statistic for later use: + * The task wakeup and load balance can make better + * decision based on these statistics. + */ +static void update_sg_if_llc(struct lb_env *env, struct sg_lb_stats *sgs, + struct sched_group *group) +{ + /* Find the sched domain that spans this group. */ + struct sched_domain *sd =3D env->sd->child; + struct sched_domain_shared *sd_share; + u64 last_nr; + + if (!sched_feat(SCHED_CACHE) || env->idle =3D=3D CPU_NEWLY_IDLE) + return; + + /* only care the sched domain that spans 1 LLC */ + if (!sd || !(sd->flags & SD_SHARE_LLC) || + !sd->parent || (sd->parent->flags & SD_SHARE_LLC)) + return; + + sd_share =3D rcu_dereference(per_cpu(sd_llc_shared, + cpumask_first(sched_group_span(group)))); + if (!sd_share) + return; + + last_nr =3D READ_ONCE(sd_share->nr_avg); + update_avg(&last_nr, sgs->sum_nr_running); + + if (likely(READ_ONCE(sd_share->util_avg) !=3D sgs->group_util)) + WRITE_ONCE(sd_share->util_avg, sgs->group_util); + + WRITE_ONCE(sd_share->nr_avg, last_nr); +} +#else +static inline void update_sg_if_llc(struct lb_env *env, struct sg_lb_stats= *sgs, + struct sched_group *group) +{ +} +#endif + /** * update_sg_lb_stats - Update sched_group's statistics for load balancing. * @env: The load balancing environment. 
@@ -10723,6 +10821,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
+	update_sg_if_llc(env, sgs, group);
 	/* Computing avg_load makes sense only when group is overloaded */
 	if (sgs->group_type == group_overloaded)
 		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
-- 
2.25.1

From nobody Sun Feb 8 17:30:11 2026
Received: from mgamail.intel.com by smtp.subspace.kernel.org (Postfix) with ESMTPS; Mon, 21 Apr 2025 03:31:13 +0000 (UTC)
From: Chen Yu 
To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . 
Shenoy" Cc: Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Tim Chen , Vincent Guittot , Libo Chen , Abel Wu , Madadi Vineeth Reddy , Hillf Danton , linux-kernel@vger.kernel.org, Chen Yu Subject: [RFC PATCH 5/5] sched: Add ftrace to track task migration and load balance within and across LLC Date: Mon, 21 Apr 2025 11:25:33 +0800 Message-Id: <5d5a6e243b88d47a744f3c84d2a3a74832a6ef35.1745199017.git.yu.c.chen@intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" [Not for upstream] Introduce these ftrace events for debugging purposes. The task migration activity is an important indicator to infer the performance regression. Use the following bpftrace script to capture the task migrations: tracepoint:sched:sched_attach_task { $src_cpu =3D args->src_cpu; $dst_cpu =3D args->dst_cpu; $src_llc =3D args->src_llc; $dst_llc =3D args->dst_llc; $idle =3D args->idle; if ($src_llc =3D=3D $dst_llc) { @lb_mig_1llc[$idle] =3D count(); } else { @lb_mig_2llc[$idle] =3D count(); } } tracepoint:sched:sched_select_task_rq { $new_cpu =3D args->new_cpu; $old_cpu =3D args->old_cpu; $new_llc =3D args->new_llc; $old_llc =3D args->old_llc; if ($new_cpu !=3D $old_cpu) { if ($new_llc =3D=3D $old_llc) { @wake_mig_1llc[$new_llc] =3D count(); } else { @wake_mig_2llc =3D count(); } } } interval:s:10 { time("\n%H:%M:%S scheduler statistics: \n"); print(@lb_mig_1llc); clear(@lb_mig_1llc); print(@lb_mig_2llc); clear(@lb_mig_2llc); print(@wake_mig_1llc); clear(@wake_mig_1llc); print(@wake_mig_2llc); clear(@wake_mig_2llc); } Signed-off-by: Chen Yu --- include/trace/events/sched.h | 51 ++++++++++++++++++++++++++++++++++++ kernel/sched/fair.c | 24 ++++++++++++----- 2 files changed, 69 insertions(+), 6 deletions(-) diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index 3bec9fb73a36..9995e09525ed 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -10,6 +10,57 @@ #include #include =20 +TRACE_EVENT(sched_attach_task, + + TP_PROTO(int src_cpu, int dst_cpu, int src_llc, int dst_llc, int idle), + + TP_ARGS(src_cpu, dst_cpu, src_llc, dst_llc, idle), + + TP_STRUCT__entry( + __field( int, src_cpu ) + __field( int, dst_cpu ) + __field( int, src_llc ) + __field( int, dst_llc ) + __field( int, idle ) + ), + + TP_fast_assign( + __entry->src_cpu =3D src_cpu; + __entry->dst_cpu =3D dst_cpu; + __entry->src_llc =3D src_llc; + __entry->dst_llc =3D dst_llc; + __entry->idle =3D idle; + ), + + TP_printk("src_cpu=3D%d dst_cpu=3D%d src_llc=3D%d dst_llc=3D%d idle=3D%d", + __entry->src_cpu, __entry->dst_cpu, __entry->src_llc, + __entry->dst_llc, __entry->idle) +); + +TRACE_EVENT(sched_select_task_rq, + + TP_PROTO(int new_cpu, int old_cpu, int new_llc, int old_llc), + + TP_ARGS(new_cpu, old_cpu, new_llc, old_llc), + + TP_STRUCT__entry( + __field( int, new_cpu ) + __field( int, old_cpu ) + __field( int, new_llc ) + __field( int, old_llc ) + ), + + TP_fast_assign( + __entry->new_cpu =3D new_cpu; + __entry->old_cpu =3D old_cpu; + __entry->new_llc =3D new_llc; + __entry->old_llc =3D old_llc; + ), + + TP_printk("new_cpu=3D%d old_cpu=3D%d new_llc=3D%d old_llc=3D%d", + __entry->new_cpu, __entry->old_cpu, __entry->new_llc, __entry->old_llc) +); + /* * Tracepoint for calling kthread_stop, performed to end a kthread: */ diff --git 
a/kernel/sched/fair.c b/kernel/sched/fair.c index f74d8773c811..635fd3a6009c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8902,7 +8902,7 @@ select_task_rq_fair(struct task_struct *p, int prev_c= pu, int wake_flags) int sync =3D (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING); struct sched_domain *tmp, *sd =3D NULL; int cpu =3D smp_processor_id(); - int new_cpu =3D prev_cpu; + int new_cpu =3D prev_cpu, orig_prev_cpu =3D prev_cpu; int want_affine =3D 0; /* SD_flags and WF_flags share the first nibble */ int sd_flag =3D wake_flags & 0xF; @@ -8965,6 +8965,10 @@ select_task_rq_fair(struct task_struct *p, int prev_= cpu, int wake_flags) new_cpu =3D select_idle_sibling(p, prev_cpu, new_cpu); } =20 + trace_sched_select_task_rq(new_cpu, orig_prev_cpu, + per_cpu(sd_llc_id, new_cpu), + per_cpu(sd_llc_id, orig_prev_cpu)); + return new_cpu; } =20 @@ -10026,11 +10030,17 @@ static int detach_tasks(struct lb_env *env) /* * attach_task() -- attach the task detached by detach_task() to its new r= q. */ -static void attach_task(struct rq *rq, struct task_struct *p) +static void attach_task(struct rq *rq, struct task_struct *p, struct lb_en= v *env) { lockdep_assert_rq_held(rq); =20 WARN_ON_ONCE(task_rq(p) !=3D rq); + + if (env) + trace_sched_attach_task(env->src_cpu, env->dst_cpu, + per_cpu(sd_llc_id, env->src_cpu), + per_cpu(sd_llc_id, env->dst_cpu), + env->idle); activate_task(rq, p, ENQUEUE_NOCLOCK); wakeup_preempt(rq, p, 0); } @@ -10039,13 +10049,13 @@ static void attach_task(struct rq *rq, struct tas= k_struct *p) * attach_one_task() -- attaches the task returned from detach_one_task() = to * its new rq. */ -static void attach_one_task(struct rq *rq, struct task_struct *p) +static void attach_one_task(struct rq *rq, struct task_struct *p, struct l= b_env *env) { struct rq_flags rf; =20 rq_lock(rq, &rf); update_rq_clock(rq); - attach_task(rq, p); + attach_task(rq, p, env); rq_unlock(rq, &rf); } =20 @@ -10066,7 +10076,7 @@ static void attach_tasks(struct lb_env *env) p =3D list_first_entry(tasks, struct task_struct, se.group_node); list_del_init(&p->se.group_node); =20 - attach_task(env->dst_rq, p); + attach_task(env->dst_rq, p, env); } =20 rq_unlock(env->dst_rq, &rf); @@ -12457,6 +12467,7 @@ static int active_load_balance_cpu_stop(void *data) struct sched_domain *sd; struct task_struct *p =3D NULL; struct rq_flags rf; + struct lb_env env_tmp; =20 rq_lock_irq(busiest_rq, &rf); /* @@ -12512,6 +12523,7 @@ static int active_load_balance_cpu_stop(void *data) } else { schedstat_inc(sd->alb_failed); } + memcpy(&env_tmp, &env, sizeof(env)); } rcu_read_unlock(); out_unlock: @@ -12519,7 +12531,7 @@ static int active_load_balance_cpu_stop(void *data) rq_unlock(busiest_rq, &rf); =20 if (p) - attach_one_task(target_rq, p); + attach_one_task(target_rq, p, sd ? &env_tmp : NULL); =20 local_irq_enable(); =20 --=20 2.25.1
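Besides the bpftrace script in the changelog, the same counting can be done with a few lines of C reading trace_pipe. The sketch below is an assumption-laden helper, not part of the series: it presumes the sched_select_task_rq event added above has been enabled, that tracefs is mounted at /sys/kernel/tracing, and that the "new_cpu=..." fields appear exactly as in the TP_printk() format.

/*
 * Assumption-heavy helper: counts placements reported by the
 * sched_select_task_rq tracepoint added above.  Requires the event to
 * be enabled first, e.g.
 *   echo 1 > /sys/kernel/tracing/events/sched/sched_select_task_rq/enable
 * and tracefs mounted at /sys/kernel/tracing.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *fp = fopen("/sys/kernel/tracing/trace_pipe", "r");
	unsigned long no_mig = 0, same_llc = 0, cross_llc = 0;
	char line[512];

	if (!fp) {
		perror("trace_pipe");
		return 1;
	}

	while (fgets(line, sizeof(line), fp)) {
		int new_cpu, old_cpu, new_llc, old_llc;
		char *p = strstr(line, "new_cpu=");

		/* Field layout follows the TP_printk() format in this patch. */
		if (!p || sscanf(p, "new_cpu=%d old_cpu=%d new_llc=%d old_llc=%d",
				 &new_cpu, &old_cpu, &new_llc, &old_llc) != 4)
			continue;

		if (new_cpu == old_cpu)
			no_mig++;
		else if (new_llc == old_llc)
			same_llc++;
		else
			cross_llc++;

		if ((no_mig + same_llc + cross_llc) % 1000 == 0)
			printf("no-mig %lu  same-LLC %lu  cross-LLC %lu\n",
			       no_mig, same_llc, cross_llc);
	}
	fclose(fp);
	return 0;
}

A steadily growing cross-LLC counter under a stable workload is the behaviour the series aims to reduce.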