Date: Wed, 20 May 2026 08:34:35 -0000
From: "tip-bot2 for Chen Yu"
Sender: tip-bot2@linutronix.de
Reply-to: linux-kernel@vger.kernel.org
To: linux-tip-commits@vger.kernel.org
Subject: [tip: sched/core] sched/cache: Avoid cache-aware scheduling for memory-heavy processes
Cc: K Prateek Nayak, Vern Hao, Chen Yu, Tim Chen, "Peter Zijlstra (Intel)", Tingyin Duan, x86@kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <95cf64a385bcc12f18dcebe9d59e8d3ba8bb318f.1778703694.git.tim.c.chen@linux.intel.com>
References: <95cf64a385bcc12f18dcebe9d59e8d3ba8bb318f.1778703694.git.tim.c.chen@linux.intel.com>
Message-ID: <177926607555.711.13711823912594864456.tip-bot2@tip-bot2>

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     808915f982c2a52f5d148510ecfab52284de67cf
Gitweb:        https://git.kernel.org/tip/808915f982c2a52f5d148510ecfab52284de67cf
Author:        Chen Yu
AuthorDate:    Wed, 13 May 2026 13:39:16 -07:00
Committer:     Peter Zijlstra
CommitterDate: Mon, 18 May 2026 21:33:15 +02:00

sched/cache: Avoid cache-aware scheduling for memory-heavy processes

Prateek and Tingyin reported that memory-intensive workloads (such as
stream) can saturate memory bandwidth and caches on the preferred LLC
when sched_cache aggregates too many threads.

To mitigate this, estimate a process's memory footprint by comparing
its NUMA balancing fault statistics to the size of the LLC. If the
footprint exceeds the LLC size, skip cache-aware scheduling.

Note that footprint is only an approximation of the memory footprint,
since the kernel lacks suitable metrics to estimate the real working
set. If a user-provided hint is available in the future, it would be
more accurate. A later patch will allow users to provide a hint to
adjust this threshold.

Suggested-by: K Prateek Nayak
Suggested-by: Vern Hao
Signed-off-by: Chen Yu
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
Signed-off-by: Peter Zijlstra (Intel)
Tested-by: Tingyin Duan
Link: https://patch.msgid.link/95cf64a385bcc12f18dcebe9d59e8d3ba8bb318f.1778703694.git.tim.c.chen@linux.intel.com
---
 include/linux/sched.h |  1 +-
 kernel/exit.c         | 29 ++++++++++++++++++++-
 kernel/sched/fair.c   | 62 +++++++++++++++++++++++++++++++++++++++---
 3 files changed, 89 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6701911..9572967 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2425,6 +2425,7 @@ struct sched_cache_stat {
 	unsigned long epoch;
 	u64 nr_running_avg;
 	unsigned long next_scan;
+	unsigned long footprint;
 	int cpu;
 } ____cacheline_aligned_in_smp;
 
diff --git a/kernel/exit.c b/kernel/exit.c
index ede3117..77275c2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -543,6 +543,32 @@ void mm_update_next_owner(struct mm_struct *mm)
 }
 #endif /* CONFIG_MEMCG */
 
+#if defined(CONFIG_SCHED_CACHE) && defined(CONFIG_NUMA_BALANCING)
+/*
+ * Subtract the memory footprint of the current task from
+ * mm.
+ */
+static void exit_mm_sched_cache(struct mm_struct *mm)
+{
+	unsigned long fp, sub;
+
+	if (!current->total_numa_faults)
+		return;
+	/*
+	 * No lock protection due to performance considerations.
+	 * Make sure mm->sc_stat.footprint does not become
+	 * negative.
+	 */
+	fp = READ_ONCE(mm->sc_stat.footprint);
+	sub = min(fp, current->total_numa_faults);
+	WRITE_ONCE(mm->sc_stat.footprint, fp - sub);
+}
+#else
+static inline void exit_mm_sched_cache(struct mm_struct *mm)
+{
+}
+#endif /* CONFIG_SCHED_CACHE CONFIG_NUMA_BALANCING */
+
 /*
  * Turn us into a lazy TLB process if we
  * aren't already..
@@ -554,6 +580,9 @@ static void exit_mm(void)
 	exit_mm_release(current, mm);
 	if (!mm)
 		return;
+
+	exit_mm_sched_cache(mm);
+
 	mmap_read_lock(mm);
 	mmgrab_lazy_tlb(mm);
 	BUG_ON(mm != current->active_mm);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df21366..a10116f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1384,6 +1384,32 @@ static int llc_id(int cpu)
 	return per_cpu(sd_llc_id, cpu);
 }
 
+static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
+{
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned long llc, footprint;
+	struct sched_domain *sd;
+
+	guard(rcu)();
+
+	sd = rcu_dereference_sched_domain(cpu_rq(cpu)->sd);
+	if (!sd)
+		return true;
+
+	if (static_branch_likely(&sched_numa_balancing)) {
+		/*
+		 * TBD: RDT exclusive LLC ways reserved should be
+		 * excluded.
+		 */
+		llc = sd->llc_bytes;
+		footprint = READ_ONCE(mm->sc_stat.footprint);
+
+		return (llc < (footprint * PAGE_SIZE));
+	}
+#endif
+	return false;
+}
+
 static bool invalid_llc_nr(struct mm_struct *mm, struct task_struct *p,
 			   int cpu)
 {
@@ -1463,6 +1489,7 @@ void mm_init_sched(struct mm_struct *mm,
 	mm->sc_stat.cpu = -1;
 	mm->sc_stat.next_scan = jiffies;
 	mm->sc_stat.nr_running_avg = 0;
+	mm->sc_stat.footprint = 0;
 	/*
 	 * The update to mm->sc_stat should not be reordered
 	 * before initialization to mm's other fields, in case
@@ -1585,7 +1612,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 * its preferred state.
 	 */
 	if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
-	    invalid_llc_nr(mm, p, cpu_of(rq))) {
+	    invalid_llc_nr(mm, p, cpu_of(rq)) ||
+	    exceed_llc_capacity(mm, cpu_of(rq))) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;
 	}
@@ -1716,7 +1744,8 @@ static void task_cache_work(struct callback_head *work)
 		return;
 
 	curr_cpu = task_cpu(p);
-	if (invalid_llc_nr(mm, p, curr_cpu)) {
+	if (invalid_llc_nr(mm, p, curr_cpu) ||
+	    exceed_llc_capacity(mm, curr_cpu)) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;
 
@@ -3515,6 +3544,7 @@ static void task_numa_placement(struct task_struct *p)
 	unsigned long total_faults;
 	u64 runtime, period;
 	spinlock_t *group_lock = NULL;
+	long __maybe_unused new_fp;
 	struct numa_group *ng;
 
 	/*
@@ -3589,6 +3619,31 @@ static void task_numa_placement(struct task_struct *p)
 				ng->total_faults += diff;
 				group_faults += ng->faults[mem_idx];
 			}
+#ifdef CONFIG_SCHED_CACHE
+			/*
+			 * Per task p->numa_faults[mem_idx] converges,
+			 * so the accumulation of each task's faults
+			 * converges too - Given the number of threads,
+			 * it cannot overflow an unsigned long.
+			 * Racy with concurrent updates from other threads
+			 * sharing this mm. Acceptable since footprint is a
+			 * heuristic and occasional lost updates are tolerable.
+			 *
+			 * If a task exits, its corresponding footprint must
+			 * be subtracted from the mm->sc_stat.footprint, otherwise
+			 * the mm->sc_stat.footprint will not converge:
+			 * the exiting thread's footprint remains unchanged/undecayed
+			 * in mm->sc_stat.footprint. See exit_mm().
+			 *
+			 * Lost updates and unsynchronized subtraction
+			 * in exit_mm() can cause footprint + diff to
+			 * go negative. Clamp to zero to prevent the
+			 * unsigned footprint from wrapping.
+			 */
+			new_fp = (long)READ_ONCE(p->mm->sc_stat.footprint) + diff;
+			WRITE_ONCE(p->mm->sc_stat.footprint,
+				   max(new_fp, 0L));
+#endif
 		}
 
 		if (!ng) {
@@ -10338,7 +10393,8 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
 		return mig_unrestricted;
 
 	/* skip cache aware load balance for too many threads */
-	if (invalid_llc_nr(mm, p, dst_cpu) {
+	if (invalid_llc_nr(mm, p, dst_cpu) ||
+	    exceed_llc_capacity(mm, dst_cpu)) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;
 		return mig_unrestricted;
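
Illustrative note, not part of the commit: the footprint heuristic in this
patch can be modeled in plain userspace C. The sketch below is a
single-threaded approximation under stated assumptions. The names model_mm,
footprint_add(), footprint_sub_on_exit(), exceeds_llc() and MODEL_PAGE_SIZE
are hypothetical and do not exist in the kernel, a 4 KiB page size is
assumed, and the READ_ONCE()/WRITE_ONCE() accesses and the deliberate
lockless races of the real code are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 4 KiB pages; the kernel uses PAGE_SIZE instead. */
#define MODEL_PAGE_SIZE 4096UL

/* Hypothetical stand-in for mm->sc_stat; footprint is counted in pages. */
struct model_mm {
	unsigned long footprint;
};

/*
 * Models the update in task_numa_placement(): add a signed fault delta
 * and clamp at zero so the unsigned counter cannot wrap.
 */
static void footprint_add(struct model_mm *mm, long diff)
{
	long new_fp = (long)mm->footprint + diff;

	mm->footprint = new_fp > 0 ? (unsigned long)new_fp : 0;
}

/*
 * Models exit_mm_sched_cache(): remove an exiting task's fault count,
 * never subtracting more than is currently accounted.
 */
static void footprint_sub_on_exit(struct model_mm *mm,
				  unsigned long task_faults)
{
	unsigned long fp = mm->footprint;
	unsigned long sub = fp < task_faults ? fp : task_faults;

	assert(sub <= fp);
	mm->footprint = fp - sub;
}

/*
 * Models the comparison in exceed_llc_capacity(): the process counts as
 * memory-heavy once footprint pages no longer fit in llc_bytes.
 */
static bool exceeds_llc(const struct model_mm *mm, unsigned long llc_bytes)
{
	return llc_bytes < mm->footprint * MODEL_PAGE_SIZE;
}
```

With a 32 MiB LLC in this model, a footprint of exactly 8192 pages still
fits and one more page tips the check, because the comparison is a strict
less-than.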