From nobody Sun Jun 14 19:17:05 2026
Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6D5E93DA7DD;
	Wed, 20 May 2026 08:34:38 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=193.142.43.55
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779266080; cv=none;
 b=EIJRwqU/fPFvSyLVB1clqg5gV+3z2uweujX75F7Y0aRcKNVjCv4cHIWT3gHBcmP7lit3vCNi5S3UQ+Acqry/K9YBx3WUsm5hvLamwJTA97sJIDC7kR4sYfAZxytZFRuVO7SShhI/ZT/LAilFG7HrzrkLGT5NMj4VvnzjiyYz0yY=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779266080; c=relaxed/simple;
	bh=y3GCmdT+350k957rtdROAD6KFtXbsD+aRuhsbMrNQIk=;
	h=Date:From:To:Subject:Cc:In-Reply-To:References:MIME-Version:
	 Message-ID:Content-Type;
 b=pLDMI48fChUzUTpMCIO7PrYYF1SZ9/kbnxqj8Wbehzglt3yH5iyzD7rXMDJ9HYzae3uNltfzt3YHxt5AvIbQKejk3/uuKbPE4ymJs1EaW+WocXwnQC68v5AvAZKZTgvfaHiICcf0WWTw1wLZ4HqYr5dfon4sCmEJuY7NLZo0qao=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linutronix.de;
 spf=pass smtp.mailfrom=linutronix.de;
 dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b=rsfSKP1u;
 dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b=FoSnIQ1T; arc=none smtp.client-ip=193.142.43.55
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linutronix.de
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=linutronix.de
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b="rsfSKP1u";
	dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b="FoSnIQ1T"
Date: Wed, 20 May 2026 08:34:35 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de;
	s=2020; t=1779266076;
	h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=ohyeymPdfKmELhSnPn49CrjOF6PA2JsQpvaKgmTIP8Q=;
	b=rsfSKP1uYNkKgM7+0/9uAqANDU73EkJRgdi5ueDhpWgcPLYgJjFLXGeobXGJmWnYzjqCiz
	61XH9WvNiuJTNf+J/UVhIgA8tXPu5dNEuf7zWZCEdM7TcHTivFMRd8goAaqEYbGe04y6PL
	a9QlDME0nQXWCYmpxGkdkDR89IPFPThlhNnBV4nuveXTO7rO8n4J/E3grOAaMFoWXN9ve0
	zXggJTQtz4IuPHxpRFd8zFTkW6NPJJEO3Rb64wTuU5XOGPDudOXrkUDMErTDhvbFqm6MXS
	sm6+1I2H/jZwpesXeqR+R186snG199MX7Ble5pS6Q7LG0fv2Jnudf4gLKczpPA==
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de;
	s=2020e; t=1779266076;
	h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=ohyeymPdfKmELhSnPn49CrjOF6PA2JsQpvaKgmTIP8Q=;
	b=FoSnIQ1T2cxIxXM7WCDy7kT8R074t0zzO0zAiYSg7YfRhFM2Ytt+tqFRlaygeqvXqm4Xpc
	nGnL6AgUhYbcj4BQ==
From: "tip-bot2 for Chen Yu" <tip-bot2@linutronix.de>
Sender: tip-bot2@linutronix.de
Reply-to: linux-kernel@vger.kernel.org
To: linux-tip-commits@vger.kernel.org
Subject: [tip: sched/core] sched/cache: Avoid cache-aware scheduling for
 memory-heavy processes
Cc: K Prateek Nayak <kprateek.nayak@amd.com>, Vern Hao <vernhao@tencent.com>,
 Chen Yu <yu.c.chen@intel.com>, Tim Chen <tim.c.chen@linux.intel.com>,
 "Peter Zijlstra (Intel)" <peterz@infradead.org>,
 Tingyin Duan <tingyin.duan@gmail.com>, x86@kernel.org,
 linux-kernel@vger.kernel.org
In-Reply-To: =?utf-8?q?=3C95cf64a385bcc12f18dcebe9d59e8d3ba8bb318f=2E1778703?=
 =?utf-8?q?694=2Egit=2Etim=2Ec=2Echen=40linux=2Eintel=2Ecom=3E?=
References: =?utf-8?q?=3C95cf64a385bcc12f18dcebe9d59e8d3ba8bb318f=2E17787036?=
 =?utf-8?q?94=2Egit=2Etim=2Ec=2Echen=40linux=2Eintel=2Ecom=3E?=
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Message-ID: <177926607555.711.13711823912594864456.tip-bot2@tip-bot2>
Robot-ID: <tip-bot2@linutronix.de>
Robot-Unsubscribe: 
 Contact <mailto:tglx@kernel.org> to get blacklisted from these emails
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     808915f982c2a52f5d148510ecfab52284de67cf
Gitweb:        https://git.kernel.org/tip/808915f982c2a52f5d148510ecfab52284de67cf
Author:        Chen Yu <yu.c.chen@intel.com>
AuthorDate:    Wed, 13 May 2026 13:39:16 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 18 May 2026 21:33:15 +02:00

sched/cache: Avoid cache-aware scheduling for memory-heavy processes

Prateek and Tingyin reported that memory-intensive workloads (such as
stream) can saturate memory bandwidth and caches on the preferred LLC
when sched_cache aggregates too many threads.

To mitigate this, estimate a process's memory footprint by comparing
its NUMA balancing fault statistics to the size of the LLC. If the
footprint exceeds the LLC size, skip cache-aware scheduling.
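As a rough user-space illustration of the comparison described above
(a sketch only, not the kernel code): the footprint is counted in
pages and scaled to bytes before being compared with the LLC size.
SKETCH_PAGE_SIZE and sketch_exceeds_llc() are hypothetical names, and
the 4 KiB page size is an assumption.

```c
#include <stdbool.h>

/* Assumed page size for illustration; the kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Hypothetical stand-in for the check in the patch:
 * footprint_pages approximates the process footprint in pages
 * (derived from NUMA balancing fault counts), llc_bytes is the
 * size of the last-level cache in bytes.
 */
static bool sketch_exceeds_llc(unsigned long footprint_pages,
			       unsigned long llc_bytes)
{
	/* Skip cache-aware scheduling only when strictly larger. */
	return llc_bytes < footprint_pages * SKETCH_PAGE_SIZE;
}
```

With a 32 MiB LLC under these assumptions, a footprint of 8192 pages
fits exactly and is not treated as exceeding the cache, while 8193
pages is.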

Note that the footprint value is only an approximation of the real
memory footprint, since the kernel lacks suitable metrics to estimate
the actual working set. A user-provided hint would be more accurate;
a later patch will allow users to provide a hint to adjust this
threshold.
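One reason the value is only approximate is that the patch updates it
without locking and clamps it at zero. A minimal sketch of that
arithmetic, in plain C with no concurrency; sketch_add_faults() and
sketch_sub_exiting() are hypothetical names, not kernel functions:

```c
/*
 * Apply a signed fault-count delta to the footprint, clamping at
 * zero so the unsigned value cannot wrap when lost updates make
 * the sum negative.
 */
static unsigned long sketch_add_faults(unsigned long footprint, long diff)
{
	long new_fp = (long)footprint + diff;

	return new_fp > 0 ? (unsigned long)new_fp : 0UL;
}

/*
 * Remove an exiting task's contribution, subtracting at most the
 * current footprint so the result never goes below zero.
 */
static unsigned long sketch_sub_exiting(unsigned long footprint,
					unsigned long task_faults)
{
	unsigned long sub = footprint < task_faults ? footprint : task_faults;

	return footprint - sub;
}
```

For example, applying a delta of -5 to a footprint of 2 yields 0
rather than a wrapped value, and an exiting task with 9 faults
against a footprint of 3 also leaves 0.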

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Vern Hao <vernhao@tencent.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Tingyin Duan <tingyin.duan@gmail.com>
Link: https://patch.msgid.link/95cf64a385bcc12f18dcebe9d59e8d3ba8bb318f.1778703694.git.tim.c.chen@linux.intel.com
---
 include/linux/sched.h |  1 +-
 kernel/exit.c         | 29 ++++++++++++++++++++-
 kernel/sched/fair.c   | 62 +++++++++++++++++++++++++++++++++++++++---
 3 files changed, 89 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6701911..9572967 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2425,6 +2425,7 @@ struct sched_cache_stat {
 	unsigned long epoch;
 	u64 nr_running_avg;
 	unsigned long next_scan;
+	unsigned long footprint;
 	int cpu;
 } ____cacheline_aligned_in_smp;
 
diff --git a/kernel/exit.c b/kernel/exit.c
index ede3117..77275c2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -543,6 +543,32 @@ void mm_update_next_owner(struct mm_struct *mm)
 }
 #endif /* CONFIG_MEMCG */
 
+#if defined(CONFIG_SCHED_CACHE) && defined(CONFIG_NUMA_BALANCING)
+/*
+ * Subtract the memory footprint of the current task from
+ * mm.
+ */
+static void exit_mm_sched_cache(struct mm_struct *mm)
+{
+	unsigned long fp, sub;
+
+	if (!current->total_numa_faults)
+		return;
+	/*
+	 * No lock protection due to performance considerations.
+	 * Make sure mm->sc_stat.footprint does not become
+	 * negative.
+	 */
+	fp = READ_ONCE(mm->sc_stat.footprint);
+	sub = min(fp, current->total_numa_faults);
+	WRITE_ONCE(mm->sc_stat.footprint, fp - sub);
+}
+#else
+static inline void exit_mm_sched_cache(struct mm_struct *mm)
+{
+}
+#endif /* CONFIG_SCHED_CACHE CONFIG_NUMA_BALANCING */
+
 /*
  * Turn us into a lazy TLB process if we
  * aren't already..
@@ -554,6 +580,9 @@ static void exit_mm(void)
 	exit_mm_release(current, mm);
 	if (!mm)
 		return;
+
+	exit_mm_sched_cache(mm);
+
 	mmap_read_lock(mm);
 	mmgrab_lazy_tlb(mm);
 	BUG_ON(mm != current->active_mm);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df21366..a10116f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1384,6 +1384,32 @@ static int llc_id(int cpu)
 	return per_cpu(sd_llc_id, cpu);
 }
 
+static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
+{
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned long llc, footprint;
+	struct sched_domain *sd;
+
+	guard(rcu)();
+
+	sd = rcu_dereference_sched_domain(cpu_rq(cpu)->sd);
+	if (!sd)
+		return true;
+
+	if (static_branch_likely(&sched_numa_balancing)) {
+		/*
+		 * TBD: RDT exclusive LLC ways reserved should be
+		 * excluded.
+		 */
+		llc = sd->llc_bytes;
+		footprint = READ_ONCE(mm->sc_stat.footprint);
+
+		return (llc < (footprint * PAGE_SIZE));
+	}
+#endif
+	return false;
+}
+
 static bool invalid_llc_nr(struct mm_struct *mm, struct task_struct *p,
 			   int cpu)
 {
@@ -1463,6 +1489,7 @@ void mm_init_sched(struct mm_struct *mm,
 	mm->sc_stat.cpu = -1;
 	mm->sc_stat.next_scan = jiffies;
 	mm->sc_stat.nr_running_avg = 0;
+	mm->sc_stat.footprint = 0;
 	/*
 	 * The update to mm->sc_stat should not be reordered
 	 * before initialization to mm's other fields, in case
@@ -1585,7 +1612,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 * its preferred state.
 	 */
 	if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
-	    invalid_llc_nr(mm, p, cpu_of(rq))) {
+	    invalid_llc_nr(mm, p, cpu_of(rq)) ||
+	    exceed_llc_capacity(mm, cpu_of(rq))) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;
 	}
@@ -1716,7 +1744,8 @@ static void task_cache_work(struct callback_head *work)
 		return;
 
 	curr_cpu = task_cpu(p);
-	if (invalid_llc_nr(mm, p, curr_cpu)) {
+	if (invalid_llc_nr(mm, p, curr_cpu) ||
+	    exceed_llc_capacity(mm, curr_cpu)) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;
 
@@ -3515,6 +3544,7 @@ static void task_numa_placement(struct task_struct *p)
 	unsigned long total_faults;
 	u64 runtime, period;
 	spinlock_t *group_lock = NULL;
+	long __maybe_unused new_fp;
 	struct numa_group *ng;
 
 	/*
@@ -3589,6 +3619,31 @@ static void task_numa_placement(struct task_struct *p)
 				ng->total_faults += diff;
 				group_faults += ng->faults[mem_idx];
 			}
+#ifdef CONFIG_SCHED_CACHE
+			/*
+			 * Per task p->numa_faults[mem_idx] converges,
+			 * so the accumulation of each task's faults
+			 * converges too - Given the number of threads,
+			 * it cannot overflow an unsigned long.
+			 * Racy with concurrent updates from other threads
+			 * sharing this mm. Acceptable since footprint is a
+			 * heuristic and occasional lost updates are tolerable.
+			 *
+			 * If a task exits, its corresponding footprint must
+			 * be subtracted from the mm->sc_stat.footprint, otherwise
+			 * the mm->sc_stat.footprint will not converge:
+			 * the exiting thread's footprint remains unchanged/undecayed
+			 * in mm->sc_stat.footprint. See exit_mm().
+			 *
+			 * Lost updates and unsynchronized subtraction
+			 * in exit_mm() can cause footprint + diff to
+			 * go negative. Clamp to zero to prevent the
+			 * unsigned footprint from wrapping.
+			 */
+			new_fp = (long)READ_ONCE(p->mm->sc_stat.footprint) + diff;
+			WRITE_ONCE(p->mm->sc_stat.footprint,
+				   max(new_fp, 0L));
+#endif
 		}
 
 		if (!ng) {
@@ -10338,7 +10393,8 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
 		return mig_unrestricted;
 
 	/* skip cache aware load balance for too many threads */
-	if (invalid_llc_nr(mm, p, dst_cpu)) {
+	if (invalid_llc_nr(mm, p, dst_cpu) ||
+	    exceed_llc_capacity(mm, dst_cpu)) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;
 		return mig_unrestricted;