From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R. Shenoy, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Tim Chen, Aubrey Li, Zhao Liu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 19/21] sched/cache: Add user control to adjust the aggressiveness of cache-aware scheduling
Date: Tue, 10 Feb 2026 14:18:59 -0800

From: Chen Yu

Introduce a set of debugfs knobs to control how aggressively cache-aware
scheduling performs task aggregation.

(1) llc_aggr_tolerance

With sched_cache enabled, the scheduler uses a process's RSS as a proxy
for its LLC footprint to determine whether aggregating tasks on the
preferred LLC could cause cache contention. If RSS exceeds the LLC size,
aggregation is skipped.

Some workloads with large RSS but small actual memory footprints may
still benefit from aggregation. Since the kernel cannot efficiently
track per-task cache usage (resctrl is user-space only), user space can
provide a more accurate hint.

Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let users
control how strictly RSS limits aggregation. Values range from 0 to 100:

- 0:     cache-aware scheduling is disabled.
- 1:     strict; tasks with RSS larger than the LLC size are skipped.
- >=100: aggressive; tasks are aggregated regardless of RSS.

For example, with a 32MB L3 cache:

- llc_aggr_tolerance=1  -> tasks with RSS > 32MB are skipped.
- llc_aggr_tolerance=99 -> tasks with RSS > 784GB are skipped
  (784GB = (1 + (99 - 1) * 256) * 32MB).
Similarly, /sys/kernel/debug/sched/llc_aggr_tolerance also controls how
strictly the number of active threads is considered when doing
cache-aware load balancing. The SMT count is also taken into account:
high SMT counts reduce the aggregation capacity, preventing excessive
task aggregation on SMT-heavy systems such as Power10/Power11.

Yangyu suggested introducing separate aggregation controls for the
active-thread and memory RSS checks. Since there are plans to add
per-process/task-group controls, fine-grained tunables are deferred to
that implementation.

(2) llc_epoch_period, llc_epoch_affinity_timeout, llc_imb_pct and
llc_overaggr_pct are also turned into tunables.

Suggested-by: K Prateek Nayak
Suggested-by: Madadi Vineeth Reddy
Suggested-by: Shrikanth Hegde
Suggested-by: Tingyin Duan
Suggested-by: Jianyong Wu
Suggested-by: Yangyu Chen
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
Signed-off-by: Chen Yu
---

Notes:
    v2->v3: Simplify the implementation by using debugfs_create_u32()
            for all tunable parameters.
 kernel/sched/debug.c | 10 ++++++++
 kernel/sched/fair.c  | 59 ++++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |  5 ++++
 3 files changed, 67 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index bae747eddc59..dc4b7de6569f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -566,6 +566,16 @@ static __init int sched_init_debug(void)
 #ifdef CONFIG_SCHED_CACHE
 	debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
 			    &sched_cache_enable_fops);
+	debugfs_create_u32("llc_aggr_tolerance", 0644, debugfs_sched,
+			   &llc_aggr_tolerance);
+	debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched,
+			   &llc_epoch_period);
+	debugfs_create_u32("llc_epoch_affinity_timeout", 0644, debugfs_sched,
+			   &llc_epoch_affinity_timeout);
+	debugfs_create_u32("llc_overaggr_pct", 0644, debugfs_sched,
+			   &llc_overaggr_pct);
+	debugfs_create_u32("llc_imb_pct", 0644, debugfs_sched,
+			   &llc_imb_pct);
 #endif
 
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee4982af2bdd..da4291ace24c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1191,6 +1191,12 @@ static void set_next_buddy(struct sched_entity *se);
 #define EPOCH_PERIOD	(HZ / 100)	/* 10 ms */
 #define EPOCH_LLC_AFFINITY_TIMEOUT	5	/* 50 ms */
 
+__read_mostly unsigned int llc_aggr_tolerance = 1;
+__read_mostly unsigned int llc_epoch_period = EPOCH_PERIOD;
+__read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
+__read_mostly unsigned int llc_imb_pct = 20;
+__read_mostly unsigned int llc_overaggr_pct = 50;
+
 static int llc_id(int cpu)
 {
 	if (cpu < 0)
@@ -1223,10 +1229,22 @@ static inline bool valid_llc_buf(struct sched_domain *sd,
 	return valid_llc_id(id);
 }
 
+static inline int get_sched_cache_scale(int mul)
+{
+	if (!llc_aggr_tolerance)
+		return 0;
+
+	if (llc_aggr_tolerance >= 100)
+		return INT_MAX;
+
+	return (1 + (llc_aggr_tolerance - 1) * mul);
+}
+
 static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 {
 	struct cacheinfo *ci;
 	u64 rss, llc;
+	int scale;
 
 	/*
 	 * get_cpu_cacheinfo_level() can not be used
@@ -1251,20 +1269,47 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 	rss = get_mm_counter(mm, MM_ANONPAGES) +
 	      get_mm_counter(mm, MM_SHMEMPAGES);
 
-	return (llc <= (rss * PAGE_SIZE));
+	/*
+	 * Scale the LLC size by 256*llc_aggr_tolerance
+	 * and compare it to the task's RSS size.
+	 *
+	 * Suppose the L3 size is 32MB. If the
+	 * llc_aggr_tolerance is 1:
+	 * When the RSS is larger than 32MB, the process
+	 * is regarded as exceeding the LLC capacity. If
+	 * the llc_aggr_tolerance is 99:
+	 * When the RSS is larger than 784GB, the process
+	 * is regarded as exceeding the LLC capacity:
+	 * 784GB = (1 + (99 - 1) * 256) * 32MB
+	 * If the llc_aggr_tolerance is 100:
+	 * ignore the RSS.
+	 */
+	scale = get_sched_cache_scale(256);
+	if (scale == INT_MAX)
+		return false;
+
+	return ((llc * scale) <= (rss * PAGE_SIZE));
 }
 
 static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
 {
-	int smt_nr = 1;
+	int smt_nr = 1, scale;
 
 #ifdef CONFIG_SCHED_SMT
 	if (sched_smt_active())
 		smt_nr = cpumask_weight(cpu_smt_mask(cpu));
 #endif
 
+	/*
+	 * Scale the number of 'cores' in a LLC by llc_aggr_tolerance
+	 * and compare it to the task's active threads.
+	 */
+	scale = get_sched_cache_scale(1);
+	if (scale == INT_MAX)
+		return false;
+
 	return !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
-			      per_cpu(sd_llc_size, cpu));
+			      (scale * per_cpu(sd_llc_size, cpu)));
 }
 
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
@@ -1365,7 +1410,7 @@ static inline void __update_mm_sched(struct rq *rq,
 	long delta = now - rq->cpu_epoch_next;
 
 	if (delta > 0) {
-		n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
+		n = (delta + llc_epoch_period - 1) / llc_epoch_period;
 		rq->cpu_epoch += n;
 		rq->cpu_epoch_next += n * EPOCH_PERIOD;
 		__shr_u64(&rq->cpu_runtime, n);
@@ -1460,7 +1505,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 * has only 1 thread, invalidate its preferred state.
 	 */
 	if (time_after(epoch,
-		       READ_ONCE(mm->sc_stat.epoch) + EPOCH_LLC_AFFINITY_TIMEOUT) ||
+		       READ_ONCE(mm->sc_stat.epoch) + llc_epoch_affinity_timeout) ||
 	    get_nr_threads(p) <= 1 ||
 	    exceed_llc_nr(mm, cpu_of(rq)) ||
 	    exceed_llc_capacity(mm, cpu_of(rq))) {
@@ -9920,7 +9965,7 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
  * (default: ~50%)
  */
 #define fits_llc_capacity(util, max)	\
-	((util) * 2 < (max))
+	((util) * 100 < (max) * llc_overaggr_pct)
 
 /*
  * The margin used when comparing utilization.
@@ -9930,7 +9975,7 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
  */
 /* Allows dst util to be bigger than src util by up to bias percent */
 #define util_greater(util1, util2)	\
-	((util1) * 100 > (util2) * 120)
+	((util1) * 100 > (util2) * (100 + llc_imb_pct))
 
 /* Called from load balancing paths with rcu_read_lock held */
 static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index adf3428745dd..f4785f84b1f1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3919,6 +3919,11 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct
 DECLARE_STATIC_KEY_FALSE(sched_cache_present);
 DECLARE_STATIC_KEY_FALSE(sched_cache_active);
 extern int max_llcs, sysctl_sched_cache_user;
+extern unsigned int llc_aggr_tolerance;
+extern unsigned int llc_epoch_period;
+extern unsigned int llc_epoch_affinity_timeout;
+extern unsigned int llc_imb_pct;
+extern unsigned int llc_overaggr_pct;
 
 static inline bool sched_cache_enabled(void)
 {
-- 
2.32.0