From nobody Sun Jun 14 19:14:31 2026
Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id D7B2C3DB334;
	Wed, 20 May 2026 08:34:41 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=193.142.43.55
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779266083; cv=none;
 b=HIsbnqHe1cnlMbrCCfrhYGymP+tb4EEL/hT9jVwrvywajd95fi9jOxp4ulKngu+bpcxpSr1q5si0fEkC7ugZs30G8FXrSKxIrBtAo6QxZ1HkGVwHmp8aWLj1tGNnIaM/RMq/k5uuCglHUpv29JuoFleAGZekLdMv1jK+tGosMyI=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779266083; c=relaxed/simple;
	bh=HINUMX8BWaCGBJNsIadiD8Mqlx6MdMTVJyJV2FUgT9w=;
	h=Date:From:To:Subject:Cc:In-Reply-To:References:MIME-Version:
	 Message-ID:Content-Type;
 b=PDAi+E0zXw/VjJcUVzDsPUzK51re4ETMcEBz+lRoJ/BHefuJBw4f40gqdirwBk67xbSLe59HxGqfhdsblrpkmhg9AWPTEAjnumebdYYJrSyaNHfqE4LwZ7uOZE6sFBV9A2pU5y/u+ojaADIJ7CuAKo1nJ5lAPkCqtgOocLyfcPM=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linutronix.de;
 spf=pass smtp.mailfrom=linutronix.de;
 dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b=N5uKyBGD;
 dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b=ZTRGvkJ5; arc=none smtp.client-ip=193.142.43.55
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linutronix.de
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=linutronix.de
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b="N5uKyBGD";
	dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b="ZTRGvkJ5"
Date: Wed, 20 May 2026 08:34:39 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de;
	s=2020; t=1779266080;
	h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=3znqxbsoxmDTtr6+blMQ9pSyKwzGGFTUgFMTpUB/X60=;
	b=N5uKyBGD9xuZ4Uk22Daywnl6QI17oVYFcTOd2/sz29zqkMwDJZ1M+ck75WkMxKzF8kcnXa
	WF86H2ql427t8oXelrd9Dwi5IT++mAp2WFbZtxBV4XEUldwpCEd1Xf3M0mA4E+mUjNT3Gh
	nOTMQ8A9kOike2uJY2sZ/1U5w6+y6l0YmHhAHouFZRit5vHXmrGTViFr7wG1Nkzs7azbw2
	esDW6zg4muAjMvDALKDv+gXHHiG2R7h1codbjD2Juh0R8wzYq8fWRxBJ5ulk0NIJcoVssf
	SlOnAab2x1MHo8oOCWOaPMBFh1eNKetNSzrsMJFwzvXyfbfJUNZCUS+mfXt4iA==
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de;
	s=2020e; t=1779266080;
	h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=3znqxbsoxmDTtr6+blMQ9pSyKwzGGFTUgFMTpUB/X60=;
	b=ZTRGvkJ59xdPqC1/sPDVQD2IGY9SzXkqx1TwmkuPnWQDIdHqZZ43QieItu4RqK1zikDhHE
	T8wOC4B15cJGteDg==
From: "tip-bot2 for Chen Yu" <tip-bot2@linutronix.de>
Sender: tip-bot2@linutronix.de
Reply-to: linux-kernel@vger.kernel.org
To: linux-tip-commits@vger.kernel.org
Subject: [tip: sched/core] sched/cache: Disable cache aware scheduling for
 processes with high thread counts
Cc: K Prateek Nayak <kprateek.nayak@amd.com>,
 Aaron Lu <ziqianlu@bytedance.com>, Chen Yu <yu.c.chen@intel.com>,
 Tim Chen <tim.c.chen@linux.intel.com>,
 "Peter Zijlstra (Intel)" <peterz@infradead.org>,
 Tingyin Duan <tingyin.duan@gmail.com>, x86@kernel.org,
 linux-kernel@vger.kernel.org
In-Reply-To: =?utf-8?q?=3Cd076cd21a8e6c6341d1e2d927e118db770ebb650=2E1778703?=
 =?utf-8?q?694=2Egit=2Etim=2Ec=2Echen=40linux=2Eintel=2Ecom=3E?=
References: =?utf-8?q?=3Cd076cd21a8e6c6341d1e2d927e118db770ebb650=2E17787036?=
 =?utf-8?q?94=2Egit=2Etim=2Ec=2Echen=40linux=2Eintel=2Ecom=3E?=
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Message-ID: <177926607920.711.3350939884328890543.tip-bot2@tip-bot2>
Robot-ID: <tip-bot2@linutronix.de>
Robot-Unsubscribe: 
 Contact <mailto:tglx@kernel.org> to get blacklisted from these emails
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     deee5e27d5b608323c04dc99979e55f944016a13
Gitweb:        https://git.kernel.org/tip/deee5e27d5b608323c04dc99979e55f94=
4016a13
Author:        Chen Yu <yu.c.chen@intel.com>
AuthorDate:    Wed, 13 May 2026 13:39:13 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 18 May 2026 21:33:14 +02:00

sched/cache: Disable cache aware scheduling for processes with high thread =
counts

A performance regression was observed by Prateek when running hackbench
with many threads per process (high fd count). To avoid this, processes
with a large number of active threads are excluded from cache-aware
scheduling.

With sched_cache enabled, record the number of active threads in each
process during the periodic task_cache_work(). While iterating over
CPUs, if the currently running task belongs to the same process as the
task that launched task_cache_work(), increment the active thread count.
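
As a rough userspace illustration of the counting step described above
(not the kernel code itself), the sketch below walks an array standing
in for each CPU's currently running task. The names demo_task,
DEMO_PF_EXITING, DEMO_PF_KTHREAD and count_active_threads() are
hypothetical, and an integer mm_id stands in for the mm pointer
comparison.

```c
#include <stddef.h>

/* Illustrative flag bits only; the kernel defines PF_EXITING/PF_KTHREAD. */
#define DEMO_PF_EXITING	0x1u
#define DEMO_PF_KTHREAD	0x2u

/* Hypothetical stand-in for the fields of task_struct used here. */
struct demo_task {
	unsigned int flags;
	int mm_id;	/* stand-in for the task's mm pointer */
};

/*
 * Count the CPUs whose current task belongs to the process identified
 * by mm_id, skipping idle slots, exiting tasks and kernel threads.
 */
static int count_active_threads(const struct demo_task *const *curr,
				size_t nr_cpus, int mm_id)
{
	int nr_running = 0;
	size_t i;

	for (i = 0; i < nr_cpus; i++) {
		const struct demo_task *cur = curr[i];

		if (cur && !(cur->flags & (DEMO_PF_EXITING | DEMO_PF_KTHREAD)) &&
		    cur->mm_id == mm_id)
			nr_running++;
	}

	return nr_running;
}
```

Only tasks that are on a CPU at scan time are counted, so the result is
a sample of the process's concurrency rather than its total thread
count.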

If the number of active threads in the process exceeds the number of
cores in the LLC (the LLC's CPU count divided by the SMT number), do
not enable cache-aware scheduling. However, on systems with few CPUs
per LLC, such as Power10/Power11 with SMT4 and an LLC size of 4, this
check effectively disables cache-aware scheduling for any process.
One possible solution, suggested by Peter, is to use an LLC mask
instead of a single LLC value for preference. With a 'few' preferred
LLCs, this constraint becomes easier to satisfy. This could be a
future enhancement.
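
To make the threshold concrete, here is a minimal userspace sketch of
the check. It assumes that fits_capacity() keeps its usual definition in
kernel/sched/fair.c, (cap) * 1280 < (max) * 1024, which leaves roughly
20% headroom. The function too_many_threads() is a hypothetical
stand-in for invalid_llc_nr() that takes the SMT thread count and the
LLC size as plain parameters.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the kernel macro: cap must stay below ~80% of max. */
static bool fits_capacity(uint64_t cap, uint64_t max)
{
	return cap * 1280 < max * 1024;
}

/*
 * Hypothetical stand-in for invalid_llc_nr(): true when the averaged
 * active thread count, scaled by the SMT thread count, no longer fits
 * in the number of CPUs sharing the LLC.
 */
static bool too_many_threads(uint64_t nr_running_avg,
			     unsigned int smt_threads,
			     unsigned int llc_size)
{
	return !fits_capacity(nr_running_avg * smt_threads, llc_size);
}
```

Under these assumptions, too_many_threads(1, 4, 4) is true, which
matches the SMT4 case with an LLC size of 4 described above. With SMT2
and 32 CPUs per LLC, an average of 12 active threads passes the check
and 13 does not.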

A subsequent change provides a debugfs knob for users who wish to
perform task aggregation regardless of this limit.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Tingyin Duan <tingyin.duan@gmail.com>
Link: https://patch.msgid.link/d076cd21a8e6c6341d1e2d927e118db770ebb650.177=
8703694.git.tim.c.chen@linux.intel.com
---
 include/linux/sched.h |  1 +-
 kernel/sched/fair.c   | 48 +++++++++++++++++++++++++++++++++++++-----
 2 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d883f1..6701911 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2423,6 +2423,7 @@ struct sched_cache_stat {
 	struct sched_cache_time __percpu *pcpu_sched;
 	raw_spinlock_t lock;
 	unsigned long epoch;
+	u64 nr_running_avg;
 	unsigned long next_scan;
 	int cpu;
 } ____cacheline_aligned_in_smp;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a759ea6..808f614 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1384,6 +1384,12 @@ static int llc_id(int cpu)
 	return per_cpu(sd_llc_id, cpu);
 }
=20
+static bool invalid_llc_nr(struct mm_struct *mm, int cpu)
+{
+	return !fits_capacity((mm->sc_stat.nr_running_avg * cpu_smt_num_threads),
+			per_cpu(sd_llc_size, cpu));
+}
+
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
 {
 	struct sched_domain *sd;
@@ -1452,7 +1458,7 @@ void mm_init_sched(struct mm_struct *mm,
 	mm->sc_stat.epoch =3D epoch;
 	mm->sc_stat.cpu =3D -1;
 	mm->sc_stat.next_scan =3D jiffies;
-
+	mm->sc_stat.nr_running_avg =3D 0;
 	/*
 	 * The update to mm->sc_stat should not be reordered
 	 * before initialization to mm's other fields, in case
@@ -1574,7 +1580,8 @@ void account_mm_sched(struct rq *rq, struct task_stru=
ct *p, s64 delta_exec)
 	 * If this process hasn't hit task_cache_work() for a while invalidate
 	 * its preferred state.
 	 */
-	if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT) {
+	if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
+	    invalid_llc_nr(mm, cpu_of(rq))) {
 		if (mm->sc_stat.cpu !=3D -1)
 			mm->sc_stat.cpu =3D -1;
 	}
@@ -1660,14 +1667,32 @@ out:
 	cpumask_copy(cpus, cpu_online_mask);
 }
=20
+static inline void update_avg_scale(u64 *avg, u64 sample)
+{
+	int factor =3D per_cpu(sd_llc_size, raw_smp_processor_id());
+	s64 diff =3D sample - *avg;
+	u32 divisor;
+
+	/*
+	 * Scale the divisor based on the number of CPUs contained
+	 * in the LLC. This scaling ensures smaller LLC domains use
+	 * a smaller divisor to achieve more precise sensitivity to
+	 * changes in nr_running, while larger LLC domains are capped
+	 * at a maximum divisor of 8 which is the default smoothing
+	 * factor of EWMA in update_avg().
+	 */
+	divisor =3D clamp_t(u32, (factor >> 2), 2, 8);
+	*avg +=3D div64_s64(diff, divisor);
+}
+
 static void task_cache_work(struct callback_head *work)
 {
 	unsigned long next_scan, now =3D jiffies;
-	struct task_struct *p =3D current;
+	struct task_struct *p =3D current, *cur;
+	int cpu, m_a_cpu =3D -1, nr_running =3D 0;
+	unsigned long curr_m_a_occ =3D 0;
 	struct mm_struct *mm =3D p->mm;
 	unsigned long m_a_occ =3D 0;
-	unsigned long curr_m_a_occ =3D 0;
-	int cpu, m_a_cpu =3D -1;
 	cpumask_var_t cpus;
=20
 	WARN_ON_ONCE(work !=3D &p->cache_work);
@@ -1711,6 +1736,11 @@ static void task_cache_work(struct callback_head *wo=
rk)
 					m_occ =3D occ;
 					m_cpu =3D i;
 				}
+
+				cur =3D rcu_dereference_all(cpu_rq(i)->curr);
+				if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
+				    cur->mm =3D=3D mm)
+					nr_running++;
 			}
=20
 			/*
@@ -1754,6 +1784,7 @@ static void task_cache_work(struct callback_head *wor=
k)
 		mm->sc_stat.cpu =3D m_a_cpu;
 	}
=20
+	update_avg_scale(&mm->sc_stat.nr_running_avg, nr_running);
 	free_cpumask_var(cpus);
 }
=20
@@ -10294,6 +10325,13 @@ static enum llc_mig can_migrate_llc_task(int src_c=
pu, int dst_cpu,
 	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
 		return mig_unrestricted;
=20
+	/* skip cache aware load balance for too many threads */
+	if (invalid_llc_nr(mm, dst_cpu)) {
+		if (mm->sc_stat.cpu !=3D -1)
+			mm->sc_stat.cpu =3D -1;
+		return mig_unrestricted;
+	}
+
 	if (cpus_share_cache(dst_cpu, cpu))
 		to_pref =3D true;
 	else if (cpus_share_cache(src_cpu, cpu))
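
The smoothing performed by update_avg_scale() in the patch above can be
tried out in userspace with the sketch below. It is an approximation
under stated assumptions: clamp_u32() and ewma_divisor() are
hypothetical helpers replacing clamp_t(), plain C division replaces
div64_s64(), and the LLC size is passed as a parameter instead of being
read from the per-CPU sd_llc_size.

```c
#include <stdint.h>

/* Userspace stand-in for clamp_t(u32, val, lo, hi). */
static uint32_t clamp_u32(uint32_t val, uint32_t lo, uint32_t hi)
{
	if (val < lo)
		return lo;
	if (val > hi)
		return hi;
	return val;
}

/* A quarter of the LLC size, kept within [2, 8]. */
static uint32_t ewma_divisor(unsigned int llc_size)
{
	return clamp_u32(llc_size >> 2, 2, 8);
}

/* Move *avg toward sample by 1/divisor of the signed difference. */
static void update_avg_scale(uint64_t *avg, uint64_t sample,
			     unsigned int llc_size)
{
	int64_t diff = (int64_t)(sample - *avg);

	*avg += (uint64_t)(diff / (int64_t)ewma_divisor(llc_size));
}
```

In this sketch an LLC of 4 CPUs gives a divisor of 2, an LLC of 16 CPUs
gives 4, and anything from 32 CPUs upward is capped at 8, so smaller
LLCs track changes in the sampled thread count faster.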