From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R. Shenoy,
    Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
    Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton,
    Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
    Vern Hao, Len Brown, Tim Chen, Aubrey Li, Zhao Liu, Chen Yu, Adam Li,
    Aaron Lu, Tim Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling
Date: Wed, 3 Dec 2025 15:07:39 -0800

From: Chen Yu

Introduce a set of debugfs knobs to control the enabling of, and the
parameters for, cache-aware load balancing.

(1) llc_enabled

llc_enabled acts as the primary switch - users can toggle it to enable
or disable cache-aware load balancing.

(2) llc_aggr_tolerance

With sched_cache enabled, the scheduler uses a process's RSS as a proxy
for its LLC footprint to determine whether aggregating its tasks on the
preferred LLC could cause cache contention. If the RSS exceeds the LLC
size, aggregation is skipped.

Some workloads with a large RSS but a small actual memory footprint may
still benefit from aggregation. Since the kernel cannot efficiently
track per-task cache usage (resctrl is user-space only), userspace can
provide a more accurate hint.

Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let users
control how strictly RSS limits aggregation. Values range from 0 to 100:

- 0: Cache-aware scheduling is disabled.
- 1: Strict; tasks with RSS larger than the LLC size are skipped.
- 100: Aggressive; tasks are aggregated regardless of RSS.

For example, with a 32MB L3 cache:
- llc_aggr_tolerance=1  -> tasks with RSS > 32MB are skipped.
- llc_aggr_tolerance=99 -> tasks with RSS > 784GB are skipped
  (784GB = (1 + (99 - 1) * 256) * 32MB).

Similarly, llc_aggr_tolerance also controls how strictly the number of
active threads is considered during cache-aware load balancing. The SMT
count is taken into account as well: high SMT counts reduce the
aggregation capacity, preventing excessive task aggregation on
SMT-heavy systems like Power10/Power11.

For example, with 8 Cores/16 CPUs in an L3:
- llc_aggr_tolerance=1  -> tasks with nr_running > 8 are skipped.
- llc_aggr_tolerance=99 -> tasks with nr_running > 785 are skipped
  (785 = 1 + (99 - 1) * 8).

(A small userspace sketch illustrating the RSS scaling follows the
notes below.)

(3) llc_epoch_period/llc_epoch_affinity_timeout

In addition, llc_epoch_period and llc_epoch_affinity_timeout are made
tunable.

Suggested-by: K Prateek Nayak
Suggested-by: Madadi Vineeth Reddy
Suggested-by: Shrikanth Hegde
Suggested-by: Tingyin Duan
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
Signed-off-by: Chen Yu
---
Notes:
    v1->v2: Remove the smt_nr check in fits_llc_capacity().
            (Aaron Lu)
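
For illustration only (not part of the patch): below is a minimal
userspace sketch that recomputes the RSS cut-off implied by
llc_aggr_tolerance, mirroring get_sched_cache_scale() and
exceed_llc_capacity() in the diff that follows. The 32MB LLC size is an
assumed example value, not probed from the running system; the
nr_running limit follows the same pattern, with the per-LLC CPU count
taking the place of the 256 scaling factor.

#include <limits.h>
#include <stdio.h>

/* Same shape as the kernel helper: 0 disables, 100 means "no limit". */
static long long scale_for(unsigned int tolerance, unsigned int mul)
{
	if (!tolerance)
		return 0;
	if (tolerance == 100)
		return LLONG_MAX;
	return 1 + (long long)(tolerance - 1) * mul;
}

int main(void)
{
	const long long llc_mb = 32;	/* assumed 32MB L3 */
	const unsigned int tol[] = { 0, 1, 50, 99, 100 };

	for (unsigned int i = 0; i < sizeof(tol) / sizeof(tol[0]); i++) {
		long long scale = scale_for(tol[i], 256);

		if (!scale)
			printf("tolerance=%3u: cache-aware scheduling disabled\n", tol[i]);
		else if (scale == LLONG_MAX)
			printf("tolerance=%3u: aggregate regardless of RSS\n", tol[i]);
		else
			printf("tolerance=%3u: skip aggregation once RSS > %lld MB\n",
			       tol[i], scale * llc_mb);
	}
	return 0;
}

Built with a plain cc, this prints 32 MB for a tolerance of 1 and
802848 MB (~784 GB) for 99, matching the example above.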
 include/linux/sched.h   |  4 ++-
 kernel/sched/debug.c    | 62 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c     | 63 ++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h    |  5 ++++
 kernel/sched/topology.c | 54 +++++++++++++++++++++++++++++++++--
 5 files changed, 178 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 466ba8b7398c..95bf080bbbf0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2436,9 +2436,11 @@ extern void migrate_enable(void);
 DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
 
 #ifdef CONFIG_SCHED_CACHE
+DECLARE_STATIC_KEY_FALSE(sched_cache_on);
+
 static inline bool sched_cache_enabled(void)
 {
-	return false;
+	return static_branch_unlikely(&sched_cache_on);
 }
 #endif
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 02e16b70a790..cde324672103 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -169,6 +169,53 @@ static const struct file_operations sched_feat_fops = {
 	.release = single_release,
 };
 
+#ifdef CONFIG_SCHED_CACHE
+#define SCHED_CACHE_CREATE_CONTROL(name, max) \
+static ssize_t sched_cache_write_##name(struct file *filp, \
+					const char __user *ubuf, \
+					size_t cnt, loff_t *ppos) \
+{ \
+	char buf[16]; \
+	unsigned int val; \
+	if (cnt > 15) \
+		cnt = 15; \
+	if (copy_from_user(&buf, ubuf, cnt)) \
+		return -EFAULT; \
+	buf[cnt] = '\0'; \
+	if (kstrtouint(buf, 10, &val)) \
+		return -EINVAL; \
+	if (val > (max)) \
+		return -EINVAL; \
+	llc_##name = val; \
+	if (!strcmp(#name, "enabled")) \
+		sched_cache_set(false); \
+	*ppos += cnt; \
+	return cnt; \
+} \
+static int sched_cache_show_##name(struct seq_file *m, void *v) \
+{ \
+	seq_printf(m, "%d\n", llc_##name); \
+	return 0; \
+} \
+static int sched_cache_open_##name(struct inode *inode, \
+				   struct file *filp) \
+{ \
+	return single_open(filp, sched_cache_show_##name, NULL); \
+} \
+static const struct file_operations sched_cache_fops_##name = { \
+	.open = sched_cache_open_##name, \
+	.write = sched_cache_write_##name, \
+	.read = seq_read, \
+	.llseek = seq_lseek, \
+	.release = single_release, \
+}
+
+SCHED_CACHE_CREATE_CONTROL(overload_pct, 100);
+SCHED_CACHE_CREATE_CONTROL(imb_pct, 100);
+SCHED_CACHE_CREATE_CONTROL(aggr_tolerance, 100);
+SCHED_CACHE_CREATE_CONTROL(enabled, 1);
+#endif /* SCHED_CACHE */
+
 static ssize_t sched_scaling_write(struct file *filp, const char __user *ubuf,
 				   size_t cnt, loff_t *ppos)
 {
@@ -523,6 +570,21 @@ static __init int sched_init_debug(void)
 	debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_SCHED_CACHE
+	debugfs_create_file("llc_overload_pct", 0644, debugfs_sched, NULL,
+			    &sched_cache_fops_overload_pct);
+	debugfs_create_file("llc_imb_pct", 0644, debugfs_sched, NULL,
+			    &sched_cache_fops_imb_pct);
+	debugfs_create_file("llc_aggr_tolerance", 0644, debugfs_sched, NULL,
+			    &sched_cache_fops_aggr_tolerance);
+	debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
+			    &sched_cache_fops_enabled);
+	debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched,
+			   &llc_epoch_period);
+	debugfs_create_u32("llc_epoch_affinity_timeout", 0644, debugfs_sched,
+			   &llc_epoch_affinity_timeout);
+#endif
+
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
 	debugfs_fair_server_init();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 424ec601cfdf..a2e2d6742481 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1207,6 +1207,9 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
 
 __read_mostly unsigned int llc_overload_pct = 50;
 __read_mostly unsigned int llc_imb_pct = 20;
+__read_mostly unsigned int llc_aggr_tolerance = 1;
+__read_mostly unsigned int llc_epoch_period = EPOCH_PERIOD;
+__read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
 
 static int llc_id(int cpu)
 {
@@ -1223,11 +1226,22 @@ static int llc_id(int cpu)
 	return llc;
 }
 
+static inline int get_sched_cache_scale(int mul)
+{
+	if (!llc_aggr_tolerance)
+		return 0;
+
+	if (llc_aggr_tolerance == 100)
+		return INT_MAX;
+
+	return (1 + (llc_aggr_tolerance - 1) * mul);
+}
+
 static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 {
+	unsigned int llc, scale;
 	struct cacheinfo *ci;
 	unsigned long rss;
-	unsigned int llc;
 
 	/*
 	 * get_cpu_cacheinfo_level() can not be used
@@ -1252,19 +1266,54 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 	rss = get_mm_counter(mm, MM_ANONPAGES) +
 	      get_mm_counter(mm, MM_SHMEMPAGES);
 
-	return (llc <= (rss * PAGE_SIZE));
+	/*
+	 * Scale the LLC size by 256*llc_aggr_tolerance
+	 * and compare it to the task's RSS size.
+	 *
+	 * Suppose the L3 size is 32MB. If the
+	 * llc_aggr_tolerance is 1:
+	 * When the RSS is larger than 32MB, the process
+	 * is regarded as exceeding the LLC capacity. If
+	 * the llc_aggr_tolerance is 99:
+	 * When the RSS is larger than 784GB, the process
+	 * is regarded as exceeding the LLC capacity because:
+	 * 784GB = (1 + (99 - 1) * 256) * 32MB
+	 */
+	scale = get_sched_cache_scale(256);
+	if (scale == INT_MAX)
+		return false;
+
+	return ((llc * scale) <= (rss * PAGE_SIZE));
 }
 
 static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
 {
-	int smt_nr = 1;
+	int smt_nr = 1, scale;
 
 #ifdef CONFIG_SCHED_SMT
 	if (sched_smt_active())
 		smt_nr = cpumask_weight(cpu_smt_mask(cpu));
 #endif
+	/*
+	 * Scale the Core number in a LLC by llc_aggr_tolerance
+	 * and compare it to the task's active threads.
+	 *
+	 * Suppose the number of Cores in LLC is 8.
+	 * Every core has 2 SMTs.
+	 * If the llc_aggr_tolerance is 1: When the
+	 * nr_running is larger than 8, the process
+	 * is regarded as exceeding the LLC capacity.
+	 * If the llc_aggr_tolerance is 99:
+	 * When the nr_running is larger than 785,
+	 * the process is regarded as exceeding
+	 * the LLC capacity:
+	 * 785 = 1 + (99 - 1) * 8
+	 */
+	scale = get_sched_cache_scale(1);
+	if (scale == INT_MAX)
+		return false;
 
-	return ((mm->nr_running_avg * smt_nr) > per_cpu(sd_llc_size, cpu));
+	return ((mm->nr_running_avg * smt_nr) > (scale * per_cpu(sd_llc_size, cpu)));
 }
 
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
@@ -1350,9 +1399,9 @@ static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
 	long delta = now - rq->cpu_epoch_next;
 
 	if (delta > 0) {
-		n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
+		n = (delta + llc_epoch_period - 1) / llc_epoch_period;
 		rq->cpu_epoch += n;
-		rq->cpu_epoch_next += n * EPOCH_PERIOD;
+		rq->cpu_epoch_next += n * llc_epoch_period;
 		__shr_u64(&rq->cpu_runtime, n);
 	}
 
@@ -1412,7 +1461,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 * has only 1 thread, or has too many active threads, invalidate
 	 * its preferred state.
 	 */
-	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
+	if (epoch - READ_ONCE(mm->mm_sched_epoch) > llc_epoch_affinity_timeout ||
 	    get_nr_threads(p) <= 1 ||
 	    exceed_llc_nr(mm, cpu_of(rq)) ||
 	    exceed_llc_capacity(mm, cpu_of(rq))) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 40798a06e058..15d126bd3728 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2852,6 +2852,11 @@ extern unsigned int sysctl_numa_balancing_hot_threshold;
 #ifdef CONFIG_SCHED_CACHE
 extern unsigned int llc_overload_pct;
 extern unsigned int llc_imb_pct;
+extern unsigned int llc_aggr_tolerance;
+extern unsigned int llc_epoch_period;
+extern unsigned int llc_epoch_affinity_timeout;
+extern unsigned int llc_enabled;
+void sched_cache_set(bool locked);
 #endif
 
 #ifdef CONFIG_SCHED_HRTICK
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9799e3a9a609..818599ddaaef 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -26,6 +26,49 @@ int max_llcs;
 
 static bool sched_cache_present;
 
+unsigned int llc_enabled = 1;
+DEFINE_STATIC_KEY_FALSE(sched_cache_on);
+
+/*
+ * Enable/disable cache aware scheduling according to
+ * user input and the presence of hardware support.
+ */
+static void _sched_cache_set(bool enable, bool locked)
+{
+	if (enable) {
+		if (locked)
+			static_branch_enable_cpuslocked(&sched_cache_on);
+		else
+			static_branch_enable(&sched_cache_on);
+	} else {
+		if (locked)
+			static_branch_disable_cpuslocked(&sched_cache_on);
+		else
+			static_branch_disable(&sched_cache_on);
+	}
+}
+
+void sched_cache_set(bool locked)
+{
+	/* hardware does not support */
+	if (!sched_cache_present) {
+		if (static_branch_likely(&sched_cache_on))
+			_sched_cache_set(false, locked);
+
+		return;
+	}
+
+	/* user wants it or not ?*/
+	if (llc_enabled) {
+		if (!static_branch_likely(&sched_cache_on))
+			_sched_cache_set(true, locked);
+
+	} else {
+		if (static_branch_likely(&sched_cache_on))
+			_sched_cache_set(false, locked);
+	}
+}
+
 static unsigned int *alloc_new_pref_llcs(unsigned int *old, unsigned int **gc)
 {
 	unsigned int *new = NULL;
@@ -70,8 +113,12 @@ static int resize_llc_pref(bool has_multi_llcs)
 	 * new buffer.
 	 */
 	tmp_llc_pref = alloc_percpu_noprof(unsigned int *);
-	if (!tmp_llc_pref)
-		return -ENOMEM;
+	if (!tmp_llc_pref) {
+		sched_cache_present = false;
+		ret = -ENOMEM;
+
+		goto out;
+	}
 
 	for_each_present_cpu(i)
 		*per_cpu_ptr(tmp_llc_pref, i) = NULL;
@@ -89,6 +136,7 @@ static int resize_llc_pref(bool has_multi_llcs)
 		new = alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i));
 		if (!new) {
 			ret = -ENOMEM;
+			sched_cache_present = false;
 
 			goto release_old;
 		}
@@ -126,6 +174,8 @@ static int resize_llc_pref(bool has_multi_llcs)
 	if (!ret)
 		max_llcs = new_max_llcs;
 
+out:
+	sched_cache_set(true);
 	return ret;
 }
 
-- 
2.32.0
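
For completeness, a hedged usage sketch (not part of the patch) that
flips the new knobs from user space. It assumes a kernel built with
CONFIG_SCHED_CACHE, debugfs mounted at /sys/kernel/debug, and root
privileges; the file names match the ones created in sched_init_debug()
above, and the value 4 for llc_aggr_tolerance is just an arbitrary
example.

#include <stdio.h>

static int write_knob(const char *name, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/debug/sched/%s", name);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	/* all of these knobs take a plain decimal value */
	fputs(val, f);
	fclose(f);
	return 0;
}

int main(void)
{
	/* Turn cache-aware load balancing on, then loosen the RSS check. */
	if (write_knob("llc_enabled", "1"))
		return 1;
	return write_knob("llc_aggr_tolerance", "4") ? 1 : 0;
}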