Date: Wed, 20 May 2026 08:34:34 -0000
From: "tip-bot2 for Chen Yu"
Sender: tip-bot2@linutronix.de
Reply-to: linux-kernel@vger.kernel.org
To: linux-tip-commits@vger.kernel.org
Subject: [tip: sched/core] sched/cache: Add user control to adjust the aggressiveness of cache-aware scheduling
Cc: K Prateek Nayak, Madadi Vineeth Reddy, Shrikanth Hegde, Tingyin Duan,
 Jianyong Wu, Yangyu Chen, Chen Yu, Tim Chen, "Peter Zijlstra (Intel)",
 x86@kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <1c62cc060ba2b33d7b1f0ed98b3390128edbae93.1778703694.git.tim.c.chen@linux.intel.com>
References: <1c62cc060ba2b33d7b1f0ed98b3390128edbae93.1778703694.git.tim.c.chen@linux.intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Message-ID: <177926607416.711.12699163047451368701.tip-bot2@tip-bot2>
Content-Type: text/plain; charset="utf-8"

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     c1e7fe5e75ed11fa85368e5a186472afd3858f3a
Gitweb:        https://git.kernel.org/tip/c1e7fe5e75ed11fa85368e5a186472afd3858f3a
Author:        Chen Yu
AuthorDate:    Wed, 13 May 2026 13:39:17 -07:00
Committer:     Peter Zijlstra
CommitterDate: Mon, 18 May 2026 21:33:15 +02:00

sched/cache: Add user control to adjust the aggressiveness of cache-aware scheduling

Introduce a set of debugfs knobs to control how aggressively the
cache aware scheduling does the task aggregation.

(1) aggr_tolerance
With sched_cache enabled, the scheduler uses a process's footprint
as a proxy for its LLC footprint to determine if aggregating
tasks on the preferred LLC could cause cache contention. If the
footprint exceeds the LLC size, aggregation is skipped.
Since the kernel cannot efficiently track per-task cache usage
(resctrl is user-space only), userspace can provide a more accurate
hint.

Introduce /sys/kernel/debug/sched/llc_balancing/aggr_tolerance to
let users control how strictly footprint limits aggregation.
Values range from 0 to 100:

  - 0: Cache-aware scheduling is disabled.
  - 1: Strict; tasks with footprint larger than LLC size are skipped.
  - >=100: Aggressive; tasks are aggregated regardless of footprint.

For example, with a 32MB L3 cache:

  - aggr_tolerance=1 -> tasks with footprint > 32MB are skipped.
  - aggr_tolerance=99 -> tasks with footprint > 784GB are skipped
    (784GB = (1 + (99 - 1) * 256) * 32MB).

Similarly, /sys/kernel/debug/sched/llc_balancing/aggr_tolerance also
controls how strictly the number of active threads is considered when
doing cache aware load balance. The number of SMTs is also considered.
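
The footprint arithmetic in the 32MB example above can be checked with a
small userspace sketch. It mirrors get_sched_cache_scale() from the patch
below; the names sched_cache_scale() and footprint_exceeds_llc() are
hypothetical, and the footprint is assumed to be given in bytes already
(the kernel code multiplies a page count by PAGE_SIZE instead).

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace mirror of get_sched_cache_scale() from the patch.
 * tol stands for the aggr_tolerance value, mul for the per-step
 * multiplier (256 for the footprint check, 1 for the thread check).
 */
static int sched_cache_scale(unsigned int tol, int mul)
{
	if (!tol)
		return 0;		/* 0: nothing fits, aggregation is off */
	if (tol >= 100)
		return INT_MAX;		/* >= 100: the check is ignored */
	return 1 + (tol - 1) * mul;
}

/*
 * Hypothetical helper: true when footprint_bytes is larger than the
 * LLC size scaled by the tolerance, i.e. when aggregation is skipped.
 */
static bool footprint_exceeds_llc(uint64_t llc_bytes,
				  uint64_t footprint_bytes,
				  unsigned int tol)
{
	int scale = sched_cache_scale(tol, 256);

	if (scale == INT_MAX)
		return false;
	return llc_bytes * (uint64_t)scale < footprint_bytes;
}
```

With a 32MB LLC and aggr_tolerance=99 the scale is 25089, so the cut-off
is 25089 * 32MB, roughly 784GB: a 700GB footprint still passes and an
800GB footprint does not.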
High SMT counts reduce the aggregation capacity, preventing excessive
task aggregation on SMT-heavy systems like Power10/Power11.

Yangyu suggested introducing separate aggregation controls for the
number of active threads and memory footprint checks. Since there are
plans to add per-process/task group controls, fine-grained tunables are
deferred to that implementation.

(2) epoch_period, epoch_affinity_timeout, imb_pct, overaggr_pct are
also turned into tunables.

Suggested-by: K Prateek Nayak
Suggested-by: Madadi Vineeth Reddy
Suggested-by: Shrikanth Hegde
Suggested-by: Tingyin Duan
Suggested-by: Jianyong Wu
Suggested-by: Yangyu Chen
Signed-off-by: Chen Yu
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
Signed-off-by: Peter Zijlstra (Intel)
Tested-by: Tingyin Duan
Link: https://patch.msgid.link/1c62cc060ba2b33d7b1f0ed98b3390128edbae93.1778703694.git.tim.c.chen@linux.intel.com
---
 kernel/sched/debug.c | 10 ++++++-
 kernel/sched/fair.c  | 68 +++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |  5 +++-
 3 files changed, 75 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2eae67c..fe56953 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -670,6 +670,16 @@ static __init int sched_init_debug(void)
 	llc = debugfs_create_dir("llc_balancing", debugfs_sched);
 	debugfs_create_file("enabled", 0644, llc, NULL,
 			    &sched_cache_enable_fops);
+	debugfs_create_u32("aggr_tolerance", 0644, llc,
+			   &llc_aggr_tolerance);
+	debugfs_create_u32("epoch_period", 0644, llc,
+			   &llc_epoch_period);
+	debugfs_create_u32("epoch_affinity_timeout", 0644, llc,
+			   &llc_epoch_affinity_timeout);
+	debugfs_create_u32("overaggr_pct", 0644, llc,
+			   &llc_overaggr_pct);
+	debugfs_create_u32("imb_pct", 0644, llc,
+			   &llc_imb_pct);
 #endif
 
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a10116f..76ac6a8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1375,6 +1375,11 @@ static void set_next_buddy(struct sched_entity *se);
  */
 #define EPOCH_PERIOD (HZ / 100) /* 10 ms */
 #define EPOCH_LLC_AFFINITY_TIMEOUT 5 /* 50 ms */
+__read_mostly unsigned int llc_aggr_tolerance = 1;
+__read_mostly unsigned int llc_epoch_period = EPOCH_PERIOD;
+__read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
+__read_mostly unsigned int llc_imb_pct = 20;
+__read_mostly unsigned int llc_overaggr_pct = 50;
 
 static int llc_id(int cpu)
 {
@@ -1384,11 +1389,25 @@ static int llc_id(int cpu)
 	return per_cpu(sd_llc_id, cpu);
 }
 
+static inline int get_sched_cache_scale(int mul)
+{
+	unsigned int tol = READ_ONCE(llc_aggr_tolerance);
+
+	if (!tol)
+		return 0;
+
+	if (tol >= 100)
+		return INT_MAX;
+
+	return (1 + (tol - 1) * mul);
+}
+
 static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 {
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned long llc, footprint;
 	struct sched_domain *sd;
+	int scale;
 
 	guard(rcu)();
 
@@ -1404,7 +1423,28 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 		llc = sd->llc_bytes;
 		footprint = READ_ONCE(mm->sc_stat.footprint);
 
-		return (llc < (footprint * PAGE_SIZE));
+		/*
+		 * Scale the LLC size by 256*llc_aggr_tolerance
+		 * and compare it to the task's footprint.
+		 *
+		 * Suppose the L3 size is 32MB. If the
+		 * llc_aggr_tolerance is 1:
+		 * When the footprint is larger than 32MB, the
+		 * process is regarded as exceeding the LLC
+		 * capacity. If the llc_aggr_tolerance is 99:
+		 * When the footprint is larger than 784GB, the
+		 * process is regarded as exceeding the LLC
+		 * capacity:
+		 * 784GB = (1 + (99 - 1) * 256) * 32MB
+		 * If the llc_aggr_tolerance is 100:
+		 * ignore the footprint and do the aggregation
+		 * anyway.
+		 */
+		scale = get_sched_cache_scale(256);
+		if (scale == INT_MAX)
+			return false;
+
+		return ((llc * (u64)scale) < (footprint * PAGE_SIZE));
 	}
 #endif
 	return false;
@@ -1413,11 +1453,21 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 static bool invalid_llc_nr(struct mm_struct *mm, struct task_struct *p,
 			   int cpu)
 {
+	int scale;
+
 	if (get_nr_threads(p) <= 1)
 		return true;
 
+	/*
+	 * Scale the number of 'cores' in a LLC by llc_aggr_tolerance
+	 * and compare it to the task's active threads.
+	 */
+	scale = get_sched_cache_scale(1);
+	if (scale == INT_MAX)
+		return false;
+
 	return !fits_capacity((mm->sc_stat.nr_running_avg * cpu_smt_num_threads),
-			      per_cpu(sd_llc_size, cpu));
+			      (scale * per_cpu(sd_llc_size, cpu)));
 }
 
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
@@ -1513,13 +1563,14 @@ static inline void __update_mm_sched(struct rq *rq,
 {
 	lockdep_assert_held(&rq->cpu_epoch_lock);
 
+	unsigned int period = max(READ_ONCE(llc_epoch_period), 1U);
 	unsigned long n, now = jiffies;
 	long delta = now - rq->cpu_epoch_next;
 
 	if (delta > 0) {
-		n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
+		n = (delta + period - 1) / period;
 		rq->cpu_epoch += n;
-		rq->cpu_epoch_next += n * EPOCH_PERIOD;
+		rq->cpu_epoch_next += n * period;
 		__shr_u64(&rq->cpu_runtime, n);
 	}
 
@@ -1611,7 +1662,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 * If this process hasn't hit task_cache_work() for a while invalidate
 	 * its preferred state.
 	 */
-	if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
+	if ((long)(epoch - READ_ONCE(mm->sc_stat.epoch)) > llc_epoch_affinity_timeout ||
 	    invalid_llc_nr(mm, p, cpu_of(rq)) ||
 	    exceed_llc_capacity(mm, cpu_of(rq))) {
 		if (mm->sc_stat.cpu != -1)
@@ -1740,7 +1791,8 @@ static void task_cache_work(struct callback_head *work)
 
 	/* only 1 thread is allowed to scan */
 	if (!try_cmpxchg(&mm->sc_stat.next_scan, &next_scan,
-			 now + EPOCH_PERIOD))
+			 now + max_t(unsigned long,
+				     READ_ONCE(llc_epoch_period), 1)))
 		return;
 
 	curr_cpu = task_cpu(p);
@@ -10232,7 +10284,7 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
  */
 static bool fits_llc_capacity(unsigned long util, unsigned long max)
 {
-	u32 aggr_pct = 50;
+	u32 aggr_pct = llc_overaggr_pct;
 
 	/*
 	 * For single core systems, raise the aggregation
@@ -10252,7 +10304,7 @@ static bool fits_llc_capacity(unsigned long util, unsigned long max)
  */
 /* Allows dst util to be bigger than src util by up to bias percent */
 #define util_greater(util1, util2) \
-	((util1) * 100 > (util2) * 120)
+	((util1) * 100 > (util2) * (100 + llc_imb_pct))
 
 static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
 					 unsigned long *cap)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f499d5d..2740939 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4072,6 +4072,11 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct
 DECLARE_STATIC_KEY_FALSE(sched_cache_present);
 DECLARE_STATIC_KEY_FALSE(sched_cache_active);
 extern int sysctl_sched_cache_user;
+extern unsigned int llc_aggr_tolerance;
+extern unsigned int llc_epoch_period;
+extern unsigned int llc_epoch_affinity_timeout;
+extern unsigned int llc_imb_pct;
+extern unsigned int llc_overaggr_pct;
 
 static inline bool sched_cache_enabled(void)
 {