From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy", Vincent Guittot
Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Tim Chen, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 01/21] sched/cache: Introduce infrastructure for cache-aware load balancing
Date: Tue, 10 Feb 2026 14:18:41 -0800
Message-Id: <6ec6eee6e1c620c0cfb9f56923f8bfbb71c31a75.1770760558.git.tim.c.chen@linux.intel.com>

From: "Peter Zijlstra (Intel)"

Add infrastructure to enable cache-aware load balancing, which improves
cache locality by grouping tasks that share resources within the same
cache domain. This reduces cache misses and improves overall data
access efficiency.

In this initial implementation, threads belonging to the same process
are treated as entities that likely share working sets. The mechanism
tracks per-process CPU occupancy across cache domains and attempts to
migrate threads toward cache-hot domains where their process already
has active threads, thereby enhancing locality. This provides a basic
model for cache affinity. While the current code targets the last-level
cache (LLC), the approach could be extended to other domain types such
as clusters (L2) or node-internal groupings.

At present, the mechanism selects the CPU within an LLC that has the
highest recent runtime. Subsequent patches in this series will use this
information in the load-balancing path to guide task placement toward
preferred LLCs.
In the future, more advanced policies could be integrated through NUMA
balancing, for example migrating a task to its preferred LLC when spare
capacity exists, or swapping tasks across LLCs to improve cache
affinity. The grouping of tasks could also be generalized from a
process to a NUMA group, or be made user configurable.

Originally-by: Peter Zijlstra (Intel)
Signed-off-by: Tim Chen
Signed-off-by: Chen Yu
---
Notes:
    v2->v3:
    Fix the wrap in epoch for time comparison of mm->mm_sched_epoch.
    (Peter Zijlstra)

    Remove __no_profile tag. (Peter Zijlstra)

    Introduce a new structure named sched_cache_stat to save the
    statistics of cache aware scheduling, similar to mm_mm_cid.
    (Peter Zijlstra)

 include/linux/mm_types.h |  32 +++++
 include/linux/sched.h    |  24 ++++
 init/Kconfig             |  11 ++
 kernel/fork.c            |   6 +
 kernel/sched/core.c      |   6 +
 kernel/sched/fair.c      | 265 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h     |  14 +++
 7 files changed, 358 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 42af2292951d..777a48523aa6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1125,6 +1125,8 @@ struct mm_struct {
 		/* MM CID related storage */
 		struct mm_mm_cid mm_cid;

+		/* sched_cache related statistics */
+		struct sched_cache_stat sc_stat;
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* size of all page tables */
 #endif
@@ -1519,6 +1521,36 @@ static inline unsigned int mm_cid_size(void)
 }
 #endif /* CONFIG_SCHED_MM_CID */

+#ifdef CONFIG_SCHED_CACHE
+void mm_init_sched(struct mm_struct *mm,
+		   struct sched_cache_time __percpu *pcpu_sched);
+
+static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
+{
+	struct sched_cache_time __percpu *pcpu_sched =
+		alloc_percpu_noprof(struct sched_cache_time);
+
+	if (!pcpu_sched)
+		return -ENOMEM;
+
+	mm_init_sched(mm, pcpu_sched);
+	return 0;
+}
+
+#define mm_alloc_sched(...)	\
	alloc_hooks(mm_alloc_sched_noprof(__VA_ARGS__))
+
+static inline void mm_destroy_sched(struct mm_struct *mm)
+{
+	free_percpu(mm->sc_stat.pcpu_sched);
+	mm->sc_stat.pcpu_sched = NULL;
+}
+#else /* !CONFIG_SCHED_CACHE */
+
+static inline int mm_alloc_sched(struct mm_struct *mm) { return 0; }
+static inline void mm_destroy_sched(struct mm_struct *mm) { }
+
+#endif /* CONFIG_SCHED_CACHE */
+
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d395f2810fac..2817a21ee055 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1409,6 +1409,10 @@ struct task_struct {
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */

+#ifdef CONFIG_SCHED_CACHE
+	struct callback_head		cache_work;
+#endif
+
 	struct rseq_data		rseq;
 	struct sched_mm_cid		mm_cid;

@@ -2330,6 +2334,26 @@ static __always_inline int task_mm_cid(struct task_struct *t)
 }
 #endif

+#ifdef CONFIG_SCHED_CACHE
+
+struct sched_cache_time {
+	u64		runtime;
+	unsigned long	epoch;
+};
+
+struct sched_cache_stat {
+	struct sched_cache_time __percpu *pcpu_sched;
+	raw_spinlock_t	lock;
+	unsigned long	epoch;
+	int		cpu;
+} ____cacheline_aligned_in_smp;
+
+#else
+
+struct sched_cache_stat { };
+
+#endif
+
 #ifndef MODULE
 #ifndef COMPILE_OFFSETS

diff --git a/init/Kconfig b/init/Kconfig
index fa79feb8fe57..f4b2649f8401 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -990,6 +990,17 @@ config NUMA_BALANCING

 	  This system will be inactive on UMA systems.

+config SCHED_CACHE
+	bool "Cache aware load balance"
+	default y
+	depends on SMP
+	help
+	  When enabled, the scheduler will attempt to aggregate tasks from
+	  the same process onto a single Last Level Cache (LLC) domain when
+	  possible.
+	  This improves cache locality by keeping tasks that share
+	  resources within the same cache domain, reducing cache misses and
+	  lowering data access latency.
+
 config NUMA_BALANCING_DEFAULT_ENABLED
 	bool "Automatically enable NUMA aware memory/task placement"
 	default y
diff --git a/kernel/fork.c b/kernel/fork.c
index b1f3915d5f8e..2a49c49f29f9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -723,6 +723,7 @@ void __mmdrop(struct mm_struct *mm)
 	cleanup_lazy_tlbs(mm);

 	WARN_ON_ONCE(mm == current->active_mm);
+	mm_destroy_sched(mm);
 	mm_free_pgd(mm);
 	mm_free_id(mm);
 	destroy_context(mm);
@@ -1123,6 +1124,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (mm_alloc_cid(mm, p))
 		goto fail_cid;

+	if (mm_alloc_sched(mm))
+		goto fail_sched;
+
 	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
 				     NR_MM_COUNTERS))
 		goto fail_pcpu;
@@ -1132,6 +1136,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	return mm;

 fail_pcpu:
+	mm_destroy_sched(mm);
+fail_sched:
 	mm_destroy_cid(mm);
 fail_cid:
 	destroy_context(mm);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 41ba0be16911..c6efa71cf500 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4412,6 +4412,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
 	init_numa_balancing(clone_flags, p);
 	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 	p->migration_pending = NULL;
+	init_sched_mm(p);
 }

 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -8691,6 +8692,11 @@ void __init sched_init(void)

 		rq->core_cookie = 0UL;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+		raw_spin_lock_init(&rq->cpu_epoch_lock);
+		rq->cpu_epoch_next = jiffies;
+#endif
+
 		zalloc_cpumask_var_node(&rq->scratch_mask, GFP_KERNEL, cpu_to_node(i));
 	}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..58286275e166 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1136,6 +1136,8 @@ void
post_init_entity_util_avg(struct task_struct *p)
 	sa->runnable_avg = sa->util_avg;
 }

+static inline void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec);
+
 static s64 update_se(struct rq *rq, struct sched_entity *se)
 {
 	u64 now = rq_clock_task(rq);
@@ -1158,6 +1160,7 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)

 	trace_sched_stat_runtime(running, delta_exec);
 	account_group_exec_runtime(running, delta_exec);
+	account_mm_sched(rq, running, delta_exec);

 	/* cgroup time is always accounted against the donor */
 	cgroup_account_cputime(donor, delta_exec);
@@ -1179,6 +1182,266 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)

 static void set_next_buddy(struct sched_entity *se);

+#ifdef CONFIG_SCHED_CACHE
+
+/*
+ * XXX numbers come from a place the sun don't shine -- probably wants to be SD
+ * tunable or so.
+ */
+#define EPOCH_PERIOD			(HZ / 100)	/* 10 ms */
+#define EPOCH_LLC_AFFINITY_TIMEOUT	5		/* 50 ms */
+
+static int llc_id(int cpu)
+{
+	if (cpu < 0)
+		return -1;
+
+	return per_cpu(sd_llc_id, cpu);
+}
+
+void mm_init_sched(struct mm_struct *mm,
+		   struct sched_cache_time __percpu *_pcpu_sched)
+{
+	unsigned long epoch;
+	int i;
+
+	for_each_possible_cpu(i) {
+		struct sched_cache_time *pcpu_sched = per_cpu_ptr(_pcpu_sched, i);
+		struct rq *rq = cpu_rq(i);
+
+		pcpu_sched->runtime = 0;
+		pcpu_sched->epoch = rq->cpu_epoch;
+		epoch = rq->cpu_epoch;
+	}
+
+	raw_spin_lock_init(&mm->sc_stat.lock);
+	mm->sc_stat.epoch = epoch;
+	mm->sc_stat.cpu = -1;
+
+	/*
+	 * The update to mm->sc_stat should not be reordered
+	 * before the initialization of mm's other fields, in case
+	 * the readers may get invalid mm_sched_epoch, etc.
+	 */
+	smp_store_release(&mm->sc_stat.pcpu_sched, _pcpu_sched);
+}
+
+/* because why would C be fully specified */
+static __always_inline void __shr_u64(u64 *val, unsigned int n)
+{
+	if (n >= 64) {
+		*val = 0;
+		return;
+	}
+	*val >>= n;
+}
+
+static inline void __update_mm_sched(struct rq *rq,
+				     struct sched_cache_time *pcpu_sched)
+{
+	lockdep_assert_held(&rq->cpu_epoch_lock);
+
+	unsigned long n, now = jiffies;
+	long delta = now - rq->cpu_epoch_next;
+
+	if (delta > 0) {
+		n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
+		rq->cpu_epoch += n;
+		rq->cpu_epoch_next += n * EPOCH_PERIOD;
+		__shr_u64(&rq->cpu_runtime, n);
+	}
+
+	n = rq->cpu_epoch - pcpu_sched->epoch;
+	if (n) {
+		pcpu_sched->epoch += n;
+		__shr_u64(&pcpu_sched->runtime, n);
+	}
+}
+
+static unsigned long fraction_mm_sched(struct rq *rq,
+				       struct sched_cache_time *pcpu_sched)
+{
+	guard(raw_spinlock_irqsave)(&rq->cpu_epoch_lock);
+
+	__update_mm_sched(rq, pcpu_sched);
+
+	/*
+	 * Runtime is a geometric series (r=0.5) and as such will sum to twice
+	 * the accumulation period; this means the multiplication here should
+	 * not overflow.
+	 */
+	return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1);
+}
+
+static inline
+void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
+{
+	struct sched_cache_time *pcpu_sched;
+	struct mm_struct *mm = p->mm;
+	unsigned long epoch;
+
+	if (!sched_cache_enabled())
+		return;
+
+	if (p->sched_class != &fair_sched_class)
+		return;
+	/*
+	 * init_task, kthreads and user threads created
+	 * by user_mode_thread() don't have an mm.
+	 */
+	if (!mm || !mm->sc_stat.pcpu_sched)
+		return;
+
+	pcpu_sched = per_cpu_ptr(p->mm->sc_stat.pcpu_sched, cpu_of(rq));
+
+	scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
+		__update_mm_sched(rq, pcpu_sched);
+		pcpu_sched->runtime += delta_exec;
+		rq->cpu_runtime += delta_exec;
+		epoch = rq->cpu_epoch;
+	}
+
+	/*
+	 * If this process hasn't hit task_cache_work() for a while, or it
+	 * has only 1 thread, invalidate its preferred state.
+	 */
+	if (time_after(epoch,
+		       READ_ONCE(mm->sc_stat.epoch) + EPOCH_LLC_AFFINITY_TIMEOUT) ||
+	    get_nr_threads(p) <= 1) {
+		if (mm->sc_stat.cpu != -1)
+			mm->sc_stat.cpu = -1;
+	}
+}
+
+static void task_tick_cache(struct rq *rq, struct task_struct *p)
+{
+	struct callback_head *work = &p->cache_work;
+	struct mm_struct *mm = p->mm;
+	unsigned long epoch;
+
+	if (!sched_cache_enabled())
+		return;
+
+	if (!mm || !mm->sc_stat.pcpu_sched)
+		return;
+
+	epoch = rq->cpu_epoch;
+	/* avoid moving backwards */
+	if (time_after_eq(mm->sc_stat.epoch, epoch))
+		return;
+
+	guard(raw_spinlock)(&mm->sc_stat.lock);
+
+	if (work->next == work) {
+		task_work_add(p, work, TWA_RESUME);
+		WRITE_ONCE(mm->sc_stat.epoch, epoch);
+	}
+}
+
+static void task_cache_work(struct callback_head *work)
+{
+	struct task_struct *p = current;
+	struct mm_struct *mm = p->mm;
+	unsigned long m_a_occ = 0;
+	unsigned long curr_m_a_occ = 0;
+	int cpu, m_a_cpu = -1;
+	cpumask_var_t cpus;
+
+	WARN_ON_ONCE(work != &p->cache_work);
+
+	work->next = work;
+
+	if (p->flags & PF_EXITING)
+		return;
+
+	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
+		return;
+
+	scoped_guard (cpus_read_lock) {
+		cpumask_copy(cpus, cpu_online_mask);
+
+		for_each_cpu(cpu, cpus) {
+			/* XXX sched_cluster_active */
+			struct sched_domain *sd = per_cpu(sd_llc, cpu);
+			unsigned long occ, m_occ = 0, a_occ = 0;
+			int m_cpu = -1, i;
+
+			if (!sd)
+				continue;
+
+			for_each_cpu(i, sched_domain_span(sd)) {
+				occ = fraction_mm_sched(cpu_rq(i),
							per_cpu_ptr(mm->sc_stat.pcpu_sched, i));
+				a_occ += occ;
+				if (occ > m_occ) {
+					m_occ = occ;
+					m_cpu = i;
+				}
+			}
+
+			/*
+			 * Compare the accumulated occupancy of each LLC. The
+			 * reason for using accumulated occupancy rather than
+			 * average per-CPU occupancy is that it works better in
+			 * asymmetric LLC scenarios.
+			 * For example, if there are 2 threads in a 4-CPU LLC
+			 * and 3 threads in an 8-CPU LLC, it might be better to
+			 * choose the one with 3 threads. However, this would
+			 * not be the case if the occupancy were divided by the
+			 * number of CPUs in an LLC (i.e., if average per-CPU
+			 * occupancy were used).
+			 * Besides, NUMA balancing fault statistics behave
+			 * similarly: the total number of faults per node is
+			 * compared rather than the average number of faults
+			 * per CPU. The same strategy is followed here.
+			 */
+			if (a_occ > m_a_occ) {
+				m_a_occ = a_occ;
+				m_a_cpu = m_cpu;
+			}
+
+			if (llc_id(cpu) == llc_id(mm->sc_stat.cpu))
+				curr_m_a_occ = a_occ;
+
+			cpumask_andnot(cpus, cpus, sched_domain_span(sd));
+		}
+	}
+
+	if (m_a_occ > (2 * curr_m_a_occ)) {
+		/*
+		 * Avoid switching sc_stat.cpu too fast.
+		 * 2X was chosen because:
+		 * 1. It is better to keep the preferred LLC stable, rather
+		 *    than changing it frequently and causing migrations.
+		 * 2. 2X means the new preferred LLC has at least one more
+		 *    busy CPU than the old one (e.g. 200% vs 100%).
+		 * 3. 2X is chosen based on test results, as it delivers
+		 *    the optimal performance gain so far.
+		 */
+		mm->sc_stat.cpu = m_a_cpu;
+	}
+
+	free_cpumask_var(cpus);
+}
+
+void init_sched_mm(struct task_struct *p)
+{
+	struct callback_head *work = &p->cache_work;
+
+	init_task_work(work, task_cache_work);
+	work->next = work;
+}
+
+#else
+
+static inline void account_mm_sched(struct rq *rq, struct task_struct *p,
+				    s64 delta_exec) { }
+
+void init_sched_mm(struct task_struct *p) { }
+
+static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
+
+#endif
+
 /*
  * Used by other classes to account runtime.
  */
@@ -13377,6 +13640,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);

+	task_tick_cache(rq, curr);
+
 	update_misfit_status(curr, rq);
 	check_update_overutilized_status(task_rq(curr));

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d30cca6870f5..de5b701c3950 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1196,6 +1196,12 @@ struct rq {
 	u64			clock_pelt_idle_copy;
 	u64			clock_idle_copy;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	raw_spinlock_t		cpu_epoch_lock ____cacheline_aligned;
+	u64			cpu_runtime;
+	unsigned long		cpu_epoch;
+	unsigned long		cpu_epoch_next;
+#endif

 	atomic_t		nr_iowait;

@@ -3890,6 +3896,14 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct *next)
 static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct *next) { }
 #endif /* !CONFIG_SCHED_MM_CID */

+#ifdef CONFIG_SCHED_CACHE
+static inline bool sched_cache_enabled(void)
+{
+	return false;
+}
+#endif
+extern void init_sched_mm(struct task_struct *p);
+
 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
 extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
 static inline
--
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy", Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Tim Chen, Aubrey Li, Zhao Liu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 02/21] sched/cache: Record per LLC utilization to guide cache aware scheduling decisions
Date: Tue, 10 Feb 2026 14:18:42 -0800
Message-Id: <93f0a3958e2398e8b4a05c15cb89f0fd759c5ac9.1770760558.git.tim.c.chen@linux.intel.com>

From: Chen Yu

When a system becomes busy and a process's preferred LLC is saturated
with too many threads, tasks within that LLC
migrate frequently. These in-LLC migrations introduce latency and
degrade performance. To avoid this, task aggregation should be
suppressed when the preferred LLC is overloaded, which requires a
metric to indicate LLC utilization.

Record per-LLC utilization and CPU capacity during periodic load
balancing. These statistics will be used in later patches to decide
whether tasks should be aggregated into their preferred LLC.

Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
Signed-off-by: Chen Yu
---
Notes:
    v2->v3:
    Remove ____cacheline_aligned_in_smp attribute in struct
    sched_domain_shared to avoid premature optimization. (Peter Zijlstra)

 include/linux/sched/topology.h |  4 ++
 kernel/sched/fair.c            | 70 ++++++++++++++++++++++++++++++++++
 2 files changed, 74 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 45c0022b91ce..a4e2fb31f2fd 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -68,6 +68,10 @@ struct sched_domain_shared {
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
 	int		nr_idle_scan;
+#ifdef CONFIG_SCHED_CACHE
+	unsigned long	util_avg;
+	unsigned long	capacity;
+#endif
 };

 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 58286275e166..dfeb107f2cfd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9688,6 +9688,29 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_cpu)
 	return 0;
 }

+#ifdef CONFIG_SCHED_CACHE
+/* Called from load balancing paths with rcu_read_lock held */
+static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
+					 unsigned long *cap)
+{
+	struct sched_domain_shared *sd_share;
+
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (!sd_share)
+		return false;
+
+	*util = READ_ONCE(sd_share->util_avg);
+	*cap = READ_ONCE(sd_share->capacity);
+
+	return true;
+}
+#else
+static inline bool get_llc_stats(int cpu, unsigned long *util,
+				 unsigned long *cap)
+{
+	return false;
+}
+#endif
+
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
@@ -10658,6 +10681,52 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 	return check_cpu_capacity(rq, sd);
 }

+#ifdef CONFIG_SCHED_CACHE
+/*
+ * Record the statistics of this scheduler group for later
+ * use. These values guide load balancing on aggregating tasks
+ * to an LLC.
+ */
+static void record_sg_llc_stats(struct lb_env *env,
+				struct sg_lb_stats *sgs,
+				struct sched_group *group)
+{
+	struct sched_domain_shared *sd_share;
+
+	if (!sched_cache_enabled() || env->idle == CPU_NEWLY_IDLE)
+		return;
+
+	/* Only care about sched domains spanning multiple LLCs */
+	if (env->sd->child != rcu_dereference(per_cpu(sd_llc, env->dst_cpu)))
+		return;
+
+	/*
+	 * At this point we know this group spans an LLC domain.
+	 * Record the statistics of this group in its corresponding
+	 * shared LLC domain.
+	 * Note: sd_share cannot be obtained via sd->child->shared,
+	 * because the latter refers to the domain that covers the
+	 * local group. Instead, sd_share should be located using
+	 * the first CPU of the LLC group.
+	 */
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared,
+				   cpumask_first(sched_group_span(group))));
+	if (!sd_share)
+		return;
+
+	if (READ_ONCE(sd_share->util_avg) != sgs->group_util)
+		WRITE_ONCE(sd_share->util_avg, sgs->group_util);
+
+	if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
+		WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
+}
+#else
+static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_stats *sgs,
+				       struct sched_group *group)
+{
+}
+#endif
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -10747,6 +10816,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,

 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);

+	record_sg_llc_stats(env, sgs, group);
 	/* Computing avg_load makes sense only when group is overloaded */
 	if (sgs->group_type == group_overloaded)
 		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
--
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy", Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Tim Chen, Aubrey Li, Zhao Liu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 03/21] sched/cache: Introduce helper functions to enforce LLC migration policy
Date: Tue, 10 Feb 2026 14:18:43 -0800
Message-Id: <7475922f6020abe5d458a136b0c88fe24e823091.1770760558.git.tim.c.chen@linux.intel.com>

From: Chen Yu

Cache-aware scheduling aggregates threads onto their preferred LLC,
mainly through load balancing. When the preferred LLC becomes
saturated, more threads are still placed there, increasing latency. A
mechanism is needed to limit aggregation so that the preferred LLC does
not become overloaded.

Introduce the helper functions can_migrate_llc() and
can_migrate_llc_task() to enforce the LLC migration policy:

1. Aggregate a task to its preferred LLC if both the source and
   destination LLCs are not too busy, or if doing so will not leave the
   preferred LLC much more imbalanced than the non-preferred one (>20%
   utilization difference, a little higher than the imbalance_pct (17%)
   of the LLC domain, as hysteresis).

2. Allow moving a task from an overloaded preferred LLC to a
   non-preferred LLC if this will not make the non-preferred LLC
   imbalanced enough to cause a later migration back.

3. If both LLCs are too busy, let generic load balancing spread the
   tasks.
Further (hysteresis) action could be taken in the future to prevent tasks
from being migrated into and out of the preferred LLC frequently (back
and forth): the threshold for migrating a task out of its preferred LLC
should be higher than that for migrating it into the LLC.

Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
Signed-off-by: Chen Yu
---
Notes:
    v2->v3: No change.

 kernel/sched/fair.c | 153 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 153 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dfeb107f2cfd..bf5f39a01017 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9689,6 +9689,27 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
 }
 
 #ifdef CONFIG_SCHED_CACHE
+/*
+ * The margin used when comparing LLC utilization with CPU capacity.
+ * It determines the LLC load level where active LLC aggregation is
+ * done.
+ * Derived from fits_capacity().
+ *
+ * (default: ~50%)
+ */
+#define fits_llc_capacity(util, max)	\
+	((util) * 2 < (max))
+
+/*
+ * The margin used when comparing utilization:
+ * is 'util1' noticeably greater than 'util2'?
+ * Derived from capacity_greater().
+ * Bias is in percentage.
+ */
+/* Allows dst util to be bigger than src util by up to bias percent */
+#define util_greater(util1, util2)	\
+	((util1) * 100 > (util2) * 120)
+
 /* Called from load balancing paths with rcu_read_lock held */
 static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
 					 unsigned long *cap)
@@ -9704,6 +9725,138 @@ static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
 
 	return true;
 }
+
+/*
+ * Decision matrix according to the LLC utilization, used to
+ * decide whether we can do task aggregation across LLCs.
+ *
+ * By default, 50% is the threshold for treating the LLC
+ * as busy. The reason for choosing 50% is to avoid saturation
+ * of SMT-2, and it is also a safe cutoff for other SMT-n
+ * platforms.
+ *
+ * 20% is the utilization imbalance percentage used to decide
+ * if the preferred LLC is busier than the non-preferred LLC.
+ * 20 is a little higher than the LLC domain's imbalance_pct of
+ * 17. The hysteresis is used to avoid tasks bouncing between the
+ * preferred LLC and the non-preferred LLC.
+ *
+ * 1. moving towards the preferred LLC, dst is the preferred
+ *    LLC, src is not:
+ *
+ *    src \ dst    30%    40%    50%    60%
+ *    30%           Y      Y      Y      N
+ *    40%           Y      Y      Y      Y
+ *    50%           Y      Y      G      G
+ *    60%           Y      Y      G      G
+ *
+ * 2. moving out of the preferred LLC, src is the preferred
+ *    LLC, dst is not:
+ *
+ *    src \ dst    30%    40%    50%    60%
+ *    30%           N      N      N      N
+ *    40%           N      N      N      N
+ *    50%           N      N      G      G
+ *    60%           Y      N      G      G
+ *
+ * src : src_util
+ * dst : dst_util
+ * Y   : Yes, migrate
+ * N   : No, do not migrate
+ * G   : let the Generic load balancer even out the load
+ *
+ * The intention is that if both LLCs are quite busy, cache-aware
+ * load balancing should not be performed, and generic load balancing
+ * should take effect. However, if one is busy and the other is not,
+ * the preferred LLC capacity (50%) and imbalance criteria (20%) are
+ * considered to determine whether LLC aggregation should be
+ * performed to bias the load towards the preferred LLC.
+ */
+
+/* migration decision, the 3 states are orthogonal. */
+enum llc_mig {
+	mig_forbid = 0,		/* N: Don't migrate task, respect LLC preference */
+	mig_llc,		/* Y: Do LLC preference based migration */
+	mig_unrestricted	/* G: Don't restrict generic load balance migration */
+};
+
+/*
+ * Check if a task can be moved from the source LLC to the
+ * destination LLC without breaking cache-aware preference.
+ * src_cpu and dst_cpu are arbitrary CPUs within the source
+ * and destination LLCs, respectively.
+ */
+static enum llc_mig can_migrate_llc(int src_cpu, int dst_cpu,
+				    unsigned long tsk_util,
+				    bool to_pref)
+{
+	unsigned long src_util, dst_util, src_cap, dst_cap;
+
+	if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
+	    !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
+		return mig_unrestricted;
+
+	if (!fits_llc_capacity(dst_util, dst_cap) &&
+	    !fits_llc_capacity(src_util, src_cap))
+		return mig_unrestricted;
+
+	src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
+	dst_util = dst_util + tsk_util;
+	if (to_pref) {
+		/*
+		 * Don't migrate if we would get the preferred LLC
+		 * too heavily loaded and the dest is much busier
+		 * than the src, in which case migration would
+		 * increase the imbalance too much.
+		 */
+		if (!fits_llc_capacity(dst_util, dst_cap) &&
+		    util_greater(dst_util, src_util))
+			return mig_forbid;
+	} else {
+		/*
+		 * Don't migrate if we would leave the preferred LLC
+		 * too idle, or if this migration would leave the
+		 * non-preferred LLC within sysctl_aggr_imb percent
+		 * of the preferred LLC, leading to a migration back
+		 * to the preferred LLC again.
+		 */
+		if (fits_llc_capacity(src_util, src_cap) ||
+		    !util_greater(src_util, dst_util))
+			return mig_forbid;
+	}
+	return mig_llc;
+}
+
+/*
+ * Check if task p can migrate from the source LLC to the
+ * destination LLC in terms of cache-aware load balancing.
+ */
+static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
+							struct task_struct *p)
+{
+	struct mm_struct *mm;
+	bool to_pref;
+	int cpu;
+
+	mm = p->mm;
+	if (!mm)
+		return mig_unrestricted;
+
+	cpu = mm->sc_stat.cpu;
+	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
+		return mig_unrestricted;
+
+	if (cpus_share_cache(dst_cpu, cpu))
+		to_pref = true;
+	else if (cpus_share_cache(src_cpu, cpu))
+		to_pref = false;
+	else
+		return mig_unrestricted;
+
+	return can_migrate_llc(src_cpu, dst_cpu,
+			       task_util(p), to_pref);
+}
+
 #else
 static inline bool get_llc_stats(int cpu, unsigned long *util,
 				 unsigned long *cap)
-- 
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R .
Shenoy" , Vincent Guittot
Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , linux-kernel@vger.kernel.org
Subject: [PATCH v3 04/21] sched/cache: Make LLC id continuous
Date: Tue, 10 Feb 2026 14:18:44 -0800
Message-Id: <60a05a3f50d14a7bf3b968f62cca87893c5c552c.1770760558.git.tim.c.chen@linux.intel.com>

From: Chen Yu

Introduce an index mapping between CPUs and their LLCs. This provides
a continuous per-LLC index needed for cache-aware load balancing in
later patches.

The existing per_cpu llc_id usually points to the first CPU of the LLC
domain, which is sparse and unsuitable as an array index; using llc_id
directly would waste memory. With the new mapping, CPUs in the same LLC
share a continuous id:

    per_cpu(llc_id, CPU=0...15)  = 0
    per_cpu(llc_id, CPU=16...31) = 1
    per_cpu(llc_id, CPU=32...47) = 2
    ...

Once a CPU has been assigned an llc_id, this ID persists even when the
CPU is taken offline and brought back online, which simplifies
management of the IDs.

Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
Co-developed-by: K Prateek Nayak
Signed-off-by: K Prateek Nayak
Signed-off-by: Chen Yu
---
Notes:
    v2->v3: Allocate the LLC id according to the topology level data
    directly, rather than calculating it from the sched domain. This
    simplifies the code.
    (Peter Zijlstra, K Prateek Nayak)

 kernel/sched/topology.c | 47 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 44 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index cf643a5ddedd..ca46b5cf7f78 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -20,6 +20,7 @@ void sched_domains_mutex_unlock(void)
 /* Protected by sched_domains_mutex: */
 static cpumask_var_t sched_domains_tmpmask;
 static cpumask_var_t sched_domains_tmpmask2;
+static int tl_max_llcs;
 
 static int __init sched_debug_setup(char *str)
 {
@@ -658,7 +659,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
  */
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
-DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU(int, sd_llc_id) = -1;
 DEFINE_PER_CPU(int, sd_share_id);
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
@@ -684,7 +685,6 @@ static void update_top_cache_domain(int cpu)
 
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_size, cpu) = size;
-	per_cpu(sd_llc_id, cpu) = id;
 	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
 	sd = lowest_flag_domain(cpu, SD_CLUSTER);
@@ -2567,10 +2567,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 
 	/* Set up domains for CPUs specified by the cpu_map: */
 	for_each_cpu(i, cpu_map) {
-		struct sched_domain_topology_level *tl;
+		struct sched_domain_topology_level *tl, *tl_llc = NULL;
+		int lid;
 
 		sd = NULL;
 		for_each_sd_topology(tl) {
+			int flags = 0;
+
+			if (tl->sd_flags)
+				flags = (*tl->sd_flags)();
+
+			if (flags & SD_SHARE_LLC)
+				tl_llc = tl;
 
 			sd = build_sched_domain(tl, cpu_map, attr, sd, i);
 
@@ -2581,6 +2589,39 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 			if (cpumask_equal(cpu_map, sched_domain_span(sd)))
 				break;
 		}
+
+		lid = per_cpu(sd_llc_id, i);
+
+		if (lid == -1) {
+			int j;
+
+			/*
+			 * Assign the llc_id to the CPUs that do not
+			 * have an LLC.
+			 */
+			if (!tl_llc) {
+				per_cpu(sd_llc_id, i) = tl_max_llcs++;
+
+				continue;
+			}
+
+			/* try to reuse the llc_id of its siblings */
+			for_each_cpu(j, tl_llc->mask(tl_llc, i)) {
+				if (i == j)
+					continue;
+
+				lid = per_cpu(sd_llc_id, j);
+
+				if (lid != -1) {
+					per_cpu(sd_llc_id, i) = lid;
+
+					break;
+				}
+			}
+
+			/* a new LLC is detected */
+			if (lid == -1)
+				per_cpu(sd_llc_id, i) = tl_max_llcs++;
+		}
 	}
 
 	if (WARN_ON(!topology_span_sane(cpu_map)))
-- 
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R .
Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , linux-kernel@vger.kernel.org Subject: [PATCH v3 05/21] sched/cache: Assign preferred LLC ID to processes Date: Tue, 10 Feb 2026 14:18:45 -0800 Message-Id: <4a92b93edb669845e3bdca24c3ae3354b317c3eb.1770760558.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" With cache-aware scheduling enabled, each task is assigned a preferred LLC ID. This allows quick identification of the LLC domain where the task prefers to run, similar to numa_preferred_nid in NUMA balancing. Signed-off-by: Tim Chen --- Notes: v2->v3: Add comments around code handling NUMA balance conflict with cache aware scheduling. 
    (Peter Zijlstra)

    Check if NUMA balancing is disabled before checking numa_preferred_nid
    (Jianyong Wu)

 include/linux/sched.h |  1 +
 init/init_task.c      |  3 +++
 kernel/sched/fair.c   | 42 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2817a21ee055..c98bd1c46088 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1411,6 +1411,7 @@ struct task_struct {
 
 #ifdef CONFIG_SCHED_CACHE
 	struct callback_head		cache_work;
+	int				preferred_llc;
 #endif
 
 	struct rseq_data		rseq;
diff --git a/init/init_task.c b/init/init_task.c
index 49b13d7c3985..baa420de2644 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -218,6 +218,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.numa_group	= NULL,
 	.numa_faults	= NULL,
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	.preferred_llc	= -1,
+#endif
 #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
 	.kasan_depth	= 1,
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf5f39a01017..0b4ed0f2809d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1273,11 +1273,43 @@ static unsigned long fraction_mm_sched(struct rq *rq,
 	return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1);
 }
 
+static int get_pref_llc(struct task_struct *p, struct mm_struct *mm)
+{
+	int mm_sched_llc = -1;
+
+	if (!mm)
+		return -1;
+
+	if (mm->sc_stat.cpu != -1) {
+		mm_sched_llc = llc_id(mm->sc_stat.cpu);
+
+#ifdef CONFIG_NUMA_BALANCING
+		/*
+		 * Don't assign preferred LLC if it
+		 * conflicts with NUMA balancing.
+		 * This can happen when sched_setnuma() gets
+		 * called, however it is not much of an issue
+		 * because we expect account_mm_sched() to get
+		 * called fairly regularly -- at a higher rate
+		 * than sched_setnuma() at least -- and thus the
+		 * conflict only exists for a short period of time.
+		 */
+		if (static_branch_likely(&sched_numa_balancing) &&
+		    p->numa_preferred_nid >= 0 &&
+		    cpu_to_node(mm->sc_stat.cpu) != p->numa_preferred_nid)
+			mm_sched_llc = -1;
+#endif
+	}
+
+	return mm_sched_llc;
+}
+
 static inline
 void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 {
 	struct sched_cache_time *pcpu_sched;
 	struct mm_struct *mm = p->mm;
+	int mm_sched_llc = -1;
 	unsigned long epoch;
 
 	if (!sched_cache_enabled())
@@ -1311,6 +1343,11 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;
 	}
+
+	mm_sched_llc = get_pref_llc(p, mm);
+
+	if (p->preferred_llc != mm_sched_llc)
+		p->preferred_llc = mm_sched_llc;
 }
 
 static void task_tick_cache(struct rq *rq, struct task_struct *p)
@@ -1440,6 +1477,11 @@ void init_sched_mm(struct task_struct *p) { }
 
 static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
 
+static inline int get_pref_llc(struct task_struct *p,
+			       struct mm_struct *mm)
+{
+	return -1;
+}
 #endif
 
 /*
-- 
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen
To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot
Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , linux-kernel@vger.kernel.org
Subject: [PATCH v3 06/21] sched/cache: Track LLC-preferred tasks per runqueue
Date: Tue, 10 Feb 2026 14:18:46 -0800

For each runqueue, track the number of tasks with an LLC preference and
how many of them are running on their preferred LLC. This mirrors
nr_numa_running and nr_preferred_running for NUMA balancing, and will
be used by cache-aware load balancing in later patches.

Co-developed-by: Chen Yu
Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v2->v3: Remove the sched_cache_enabled() check and make
    account_llc_{en,de}queue() depend on CONFIG_SCHED_CACHE, so
    sched_llc_active in v2 can be removed.
    (Peter Zijlstra)

 kernel/sched/core.c  |  5 +++++
 kernel/sched/fair.c  | 48 +++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h |  6 ++++++
 3 files changed, 56 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c6efa71cf500..c464e370576f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -532,6 +532,11 @@ void __trace_set_current_state(int state_value)
 }
 EXPORT_SYMBOL(__trace_set_current_state);
 
+int task_llc(const struct task_struct *p)
+{
+	return per_cpu(sd_llc_id, task_cpu(p));
+}
+
 /*
  * Serialization rules:
  *
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b4ed0f2809d..6ad9ad2f918f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1199,6 +1199,30 @@ static int llc_id(int cpu)
 	return per_cpu(sd_llc_id, cpu);
 }
 
+static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
+{
+	int pref_llc;
+
+	pref_llc = p->preferred_llc;
+	if (pref_llc < 0)
+		return;
+
+	rq->nr_llc_running++;
+	rq->nr_pref_llc_running += (pref_llc == task_llc(p));
+}
+
+static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
+{
+	int pref_llc;
+
+	pref_llc = p->preferred_llc;
+	if (pref_llc < 0)
+		return;
+
+	rq->nr_llc_running--;
+	rq->nr_pref_llc_running -= (pref_llc == task_llc(p));
+}
+
 void mm_init_sched(struct mm_struct *mm, struct sched_cache_time __percpu *_pcpu_sched)
 {
@@ -1304,6 +1328,8 @@ static int get_pref_llc(struct task_struct *p, struct mm_struct *mm)
 	return mm_sched_llc;
 }
 
+static unsigned int task_running_on_cpu(int cpu, struct task_struct *p);
+
 static inline
 void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 {
@@ -1346,8 +1372,13 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 
 	mm_sched_llc = get_pref_llc(p, mm);
 
-	if (p->preferred_llc != mm_sched_llc)
+	/* task not on rq accounted later in account_entity_enqueue() */
+	if (task_running_on_cpu(rq->cpu, p) &&
+	    p->preferred_llc != mm_sched_llc) {
+		account_llc_dequeue(rq, p);
 		p->preferred_llc = mm_sched_llc;
+		account_llc_enqueue(rq, p);
+	}
 }
 
 static void task_tick_cache(struct rq *rq, struct task_struct *p)
@@ -1482,6 +1513,11 @@ static inline int get_pref_llc(struct task_struct *p,
 {
 	return -1;
 }
+
+static void account_llc_enqueue(struct rq *rq, struct task_struct *p) {}
+
+static void account_llc_dequeue(struct rq *rq, struct task_struct *p) {}
+
 #endif
 
 /*
@@ -3970,9 +4006,11 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	update_load_add(&cfs_rq->load, se->load.weight);
 	if (entity_is_task(se)) {
+		struct task_struct *p = task_of(se);
 		struct rq *rq = rq_of(cfs_rq);
 
-		account_numa_enqueue(rq, task_of(se));
+		account_numa_enqueue(rq, p);
+		account_llc_enqueue(rq, p);
 		list_add(&se->group_node, &rq->cfs_tasks);
 	}
 	cfs_rq->nr_queued++;
@@ -3983,7 +4021,11 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (entity_is_task(se)) {
-		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
+		struct task_struct *p = task_of(se);
+		struct rq *rq = rq_of(cfs_rq);
+
+		account_numa_dequeue(rq, p);
+		account_llc_dequeue(rq, p);
 		list_del_init(&se->group_node);
 	}
 	cfs_rq->nr_queued--;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index de5b701c3950..35cea6aa32a4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1128,6 +1128,10 @@ struct rq {
 	unsigned int		nr_preferred_running;
 	unsigned int		numa_migrate_on;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	unsigned int		nr_pref_llc_running;
+	unsigned int		nr_llc_running;
+#endif
 #ifdef CONFIG_NO_HZ_COMMON
 	unsigned long		last_blocked_load_update_tick;
 	unsigned int		has_blocked_load;
@@ -1996,6 +2000,8 @@ init_numa_balancing(u64 clone_flags, struct task_struct *p)
 
 #endif /* !CONFIG_NUMA_BALANCING */
 
+int task_llc(const struct task_struct *p);
+
 static inline
 void queue_balance_callback(struct rq *rq,
			    struct balance_callback *head,
-- 
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R .
Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , linux-kernel@vger.kernel.org Subject: [PATCH v3 07/21] sched/cache: Introduce per CPU's tasks LLC preference counter Date: Tue, 10 Feb 2026 14:18:47 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The lowest level of sched domain for each CPU is assigned an array where each element tracks the number of tasks preferring a given LLC, indexed from 0 to max_llcs - 1. Since each CPU has its dedicated sd, this implies that each CPU will have a dedicated task LLC preference counter. For example, sd->pf[3] =3D 2 signifies that there are 2 tasks on this runqueue which prefer to run within LLC3. The load balancer can use this information to identify busy runqueues and migrate tasks to their preferred LLC domains. This array will be reallocated at runtime during sched domain rebuild. Introduce the buffer allocation mechanism, and the statistics will be calculated in the subsequent patch. Note: the LLC preference statistics of each CPU are reset on sched domain rebuild and may under count temporarily, until the CPU becomes idle and the count is cleared. This is a trade off to avoid complex data synchronization across sched domain builds. Suggested-by: Peter Zijlstra (Intel) Suggested-by: K Prateek Nayak Co-developed-by: Chen Yu Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v2->v3: Allocate preferred LLC buffer in rq->sd rather than the rq. 
    That way it automagically gets reallocated and the old buffer gets
    recycled during sched domain rebuild. (Peter Zijlstra)

 include/linux/sched/topology.h |  4 +++
 kernel/sched/sched.h           |  2 ++
 kernel/sched/topology.c        | 64 +++++++++++++++++++++++++++++++++-
 3 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index a4e2fb31f2fd..3aa6c101b2e4 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -102,6 +102,10 @@ struct sched_domain {
 	u64 max_newidle_lb_cost;
 	unsigned long last_decay_max_lb_cost;
 
+#ifdef CONFIG_SCHED_CACHE
+	unsigned int *pf;
+#endif
+
 #ifdef CONFIG_SCHEDSTATS
 	/* sched_balance_rq() stats */
 	unsigned int lb_count[CPU_MAX_IDLE_TYPES];
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 35cea6aa32a4..ac8c7ac1ac0d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3903,6 +3903,8 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct
 #endif /* !CONFIG_SCHED_MM_CID */
 
 #ifdef CONFIG_SCHED_CACHE
+extern int max_llcs;
+
 static inline bool sched_cache_enabled(void)
 {
 	return false;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ca46b5cf7f78..dae78b5915a7 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -21,6 +21,7 @@ void sched_domains_mutex_unlock(void)
 static cpumask_var_t sched_domains_tmpmask;
 static cpumask_var_t sched_domains_tmpmask2;
 static int tl_max_llcs;
+int max_llcs;
 
 static int __init sched_debug_setup(char *str)
 {
@@ -628,6 +629,11 @@ static void destroy_sched_domain(struct sched_domain *sd)
 
 	if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
 		kfree(sd->shared);
+
+#ifdef CONFIG_SCHED_CACHE
+	/* only the bottom sd has pref_llc array */
+	kfree(sd->pf);
+#endif
 	kfree(sd);
 }
 
@@ -747,10 +753,15 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	if (sd && sd_degenerate(sd)) {
 		tmp = sd;
 		sd = sd->parent;
-		destroy_sched_domain(tmp);
+
 		if (sd) {
 			struct sched_group *sg = sd->groups;
 
+#ifdef CONFIG_SCHED_CACHE
+			/* move pf to parent as child is being destroyed */
+			sd->pf = tmp->pf;
+			tmp->pf = NULL;
+#endif
 			/*
 			 * sched groups hold the flags of the child sched
 			 * domain for convenience. Clear such flags since
@@ -762,6 +773,8 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 
 			sd->child = NULL;
 		}
+
+		destroy_sched_domain(tmp);
 	}
 
 	sched_domain_debug(sd, cpu);
@@ -787,6 +800,46 @@ enum s_alloc {
 	sa_none,
 };
 
+#ifdef CONFIG_SCHED_CACHE
+static bool alloc_sd_pref(const struct cpumask *cpu_map,
+			  struct s_data *d)
+{
+	struct sched_domain *sd;
+	unsigned int *pf;
+	int i;
+
+	for_each_cpu(i, cpu_map) {
+		sd = *per_cpu_ptr(d->sd, i);
+		if (!sd)
+			goto err;
+
+		pf = kcalloc(tl_max_llcs, sizeof(unsigned int), GFP_KERNEL);
+		if (!pf)
+			goto err;
+
+		sd->pf = pf;
+	}
+
+	return true;
+err:
+	for_each_cpu(i, cpu_map) {
+		sd = *per_cpu_ptr(d->sd, i);
+		if (sd) {
+			kfree(sd->pf);
+			sd->pf = NULL;
+		}
+	}
+
+	return false;
+}
+#else
+static bool alloc_sd_pref(const struct cpumask *cpu_map,
+			  struct s_data *d)
+{
+	return false;
+}
+#endif
+
 /*
  * Return the canonical balance CPU for this group, this is the first CPU
  * of this group that's also in the balance mask.
@@ -2710,6 +2763,8 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
 		}
 	}
 
+	alloc_sd_pref(cpu_map, &d);
+
 	/* Attach the domains */
 	rcu_read_lock();
 	for_each_cpu(i, cpu_map) {
@@ -2723,6 +2778,13 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
 	}
 	rcu_read_unlock();
 
+	/*
+	 * Ensure we see enlarged sd->pf when we use new llc_ids and
+	 * bigger max_llcs.
+	 */
+	smp_mb();
+	max_llcs = tl_max_llcs;
+
 	if (has_asym)
 		static_branch_inc_cpuslocked(&sched_asym_cpucapacity);
 
-- 
2.32.0

From: Tim Chen
Subject: [PATCH v3 08/21] sched/cache: Calculate the percpu sd task LLC preference
Date: Tue, 10 Feb 2026 14:18:48 -0800
Message-Id: <41f8e91b70060e7697840163b80c3dc097aabb34.1770760558.git.tim.c.chen@linux.intel.com>

Calculate the number of tasks' LLC preferences for each runqueue. This
statistic is computed during task enqueue and dequeue operations, and is
used by cache-aware load balancing.

Co-developed-by: Chen Yu
Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v2->v3: Move the max_llcs check from patch 4 to this patch. This
    clarifies the rationale for the max_llcs check and makes review
    easier (Peter Zijlstra).
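    The enqueue/dequeue bookkeeping can be modeled in plain user-space C.
    The sketch below is illustrative only: `struct llc_rq`, `MAX_LLCS`,
    and the helper names are invented for the example and are not the
    kernel's symbols, and the real code additionally guards the counters
    with RCU and checks that sd->pf has been allocated.

```c
#include <assert.h>

/* One slot per LLC; a made-up bound standing in for max_llcs. */
#define MAX_LLCS 4

/* Toy stand-in for the per-CPU runqueue's preference counters. */
struct llc_rq {
	unsigned int nr_llc_running;
	unsigned int pf[MAX_LLCS];	/* tasks preferring each LLC */
};

static int valid_llc(int id)
{
	return id >= 0 && id < MAX_LLCS;
}

static void account_enqueue(struct llc_rq *rq, int pref_llc)
{
	if (!valid_llc(pref_llc))
		return;
	rq->nr_llc_running++;
	rq->pf[pref_llc]++;
}

static void account_dequeue(struct llc_rq *rq, int pref_llc)
{
	if (!valid_llc(pref_llc))
		return;
	rq->nr_llc_running--;
	/* Guard against underflow after a counter reset, as the patch does. */
	if (rq->pf[pref_llc])
		rq->pf[pref_llc]--;
}
```

    The underflow guard in the dequeue path mirrors the
    `if (sd->pf[pref_llc])` check in the patch: after a sched domain
    rebuild zeroes the counters, a later dequeue may observe zero.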
 kernel/sched/fair.c | 56 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 54 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6ad9ad2f918f..4a98aa866d65 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1199,28 +1199,80 @@ static int llc_id(int cpu)
 	return per_cpu(sd_llc_id, cpu);
 }
 
+static inline bool valid_llc_id(int id)
+{
+	if (unlikely(id < 0 || id >= max_llcs))
+		return false;
+
+	return true;
+}
+
+static inline bool valid_llc_buf(struct sched_domain *sd,
+				 int id)
+{
+	/*
+	 * The check for sd and its corresponding pf is to
+	 * confirm that the sd->pf[] has been allocated in
+	 * build_sched_domains() after the assignment of
+	 * per_cpu(sd_llc_id, i). This is used to avoid
+	 * the race condition.
+	 */
+	if (unlikely(!sd || !sd->pf))
+		return false;
+
+	return valid_llc_id(id);
+}
+
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
 {
+	struct sched_domain *sd;
 	int pref_llc;
 
 	pref_llc = p->preferred_llc;
-	if (pref_llc < 0)
+	if (!valid_llc_id(pref_llc))
 		return;
 
 	rq->nr_llc_running++;
 	rq->nr_pref_llc_running += (pref_llc == task_llc(p));
+
+	scoped_guard (rcu) {
+		sd = rcu_dereference(rq->sd);
+		if (valid_llc_buf(sd, pref_llc))
+			sd->pf[pref_llc]++;
+	}
 }
 
 static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
 {
+	struct sched_domain *sd;
 	int pref_llc;
 
	pref_llc = p->preferred_llc;
-	if (pref_llc < 0)
+	if (!valid_llc_id(pref_llc))
 		return;
 
 	rq->nr_llc_running--;
 	rq->nr_pref_llc_running -= (pref_llc == task_llc(p));
+
+	scoped_guard (rcu) {
+		sd = rcu_dereference(rq->sd);
+		if (valid_llc_buf(sd, pref_llc)) {
+			/*
+			 * There is a race condition between dequeue
+			 * and CPU hotplug. After a task has been enqueued
+			 * on CPUx, a CPU hotplug event occurs, and all online
+			 * CPUs (including CPUx) rebuild their sched_domains
+			 * and reset statistics to zero (including sd->pf).
+			 * This can cause a temporary undercount, so we have
+			 * to check for such underflow in sd->pf.
+			 *
+			 * The undercount is temporary and accurate accounting
+			 * will resume once the rq has a chance to be idle.
+			 */
+			if (sd->pf[pref_llc])
+				sd->pf[pref_llc]--;
+		}
+	}
 }
 
 void mm_init_sched(struct mm_struct *mm,
-- 
2.32.0

From: Tim Chen
Subject: [PATCH v3 09/21] sched/cache: Count tasks preferring destination LLC in a sched group
Date: Tue, 10 Feb 2026 14:18:49 -0800

During LLC load balancing, tabulate the number of tasks on each runqueue
that prefer the LLC containing env->dst_cpu in a sched group.

For example, consider a system with 4 LLC sched groups (LLC0 to LLC3)
balancing towards LLC3. LLC0 has 3 tasks preferring LLC3, LLC1 has 2,
and LLC2 has 1. LLC0, having the most tasks preferring LLC3, is selected
as the busiest source to pick tasks from.

Within a source LLC, the total number of tasks preferring a destination
LLC is computed by summing the counts across all CPUs in that LLC. For
instance, if LLC0 has CPU0 with 2 tasks and CPU1 with 1 task preferring
LLC3, the total for LLC0 is 3.

These statistics allow the load balancer to choose tasks from source
sched groups that best match their preferred LLCs.

Co-developed-by: Chen Yu
Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v2->v3: Rename nr_pref_llc to nr_pref_dst_llc for clarification.
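    The per-group summation described above (CPU0 contributing 2 tasks
    and CPU1 contributing 1, giving LLC0 a total of 3) can be sketched as
    a plain C loop. The array shape and names below are invented for the
    example; the kernel instead walks sched_group_span() and reads each
    CPU's sd->pf under RCU.

```c
#include <assert.h>

#define NR_CPUS  4
#define MAX_LLCS 4

/*
 * pf[cpu][llc]: number of tasks queued on 'cpu' that prefer 'llc'
 * (a flattened stand-in for each CPU's per-sd pf[] array).
 * Sum, over the CPUs of one source group, the tasks preferring dst_llc.
 */
static unsigned int sum_pref_dst_llc(unsigned int pf[][MAX_LLCS],
				     const int *group_cpus, int nr,
				     int dst_llc)
{
	unsigned int total = 0;

	for (int i = 0; i < nr; i++)
		total += pf[group_cpus[i]][dst_llc];
	return total;
}
```

    This is the quantity accumulated into sgs->nr_pref_dst_llc while
    update_sg_lb_stats() iterates the group's CPUs.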
 kernel/sched/fair.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4a98aa866d65..bb93cc046d73 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10566,6 +10566,9 @@ struct sg_lb_stats {
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	unsigned int nr_pref_dst_llc;
+#endif
 };
 
 /*
@@ -11034,6 +11037,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 {
 	int i, nr_running, local_group, sd_flags = env->sd->flags;
 	bool balancing_at_rd = !env->sd->parent;
+#ifdef CONFIG_SCHED_CACHE
+	int dst_llc = llc_id(env->dst_cpu);
+#endif
 
 	memset(sgs, 0, sizeof(*sgs));
 
@@ -11054,6 +11060,15 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (cpu_overutilized(i))
 			*sg_overutilized = 1;
 
+#ifdef CONFIG_SCHED_CACHE
+		if (sched_cache_enabled() && llc_id(i) != dst_llc) {
+			struct sched_domain *sd_tmp = rcu_dereference(rq->sd);
+
+			if (valid_llc_buf(sd_tmp, dst_llc))
+				sgs->nr_pref_dst_llc += sd_tmp->pf[dst_llc];
+		}
+#endif
+
 		/*
 		 * No need to call idle_cpu() if nr_running is not 0
 		 */
-- 
2.32.0

From: Tim Chen
Subject: [PATCH v3 10/21] sched/cache: Check local_group only once in update_sg_lb_stats()
Date: Tue, 10 Feb 2026 14:18:50 -0800
Message-Id: <9b77a144811f5c11217a0e6a4e6c2b5cfe9dffb9.1770760558.git.tim.c.chen@linux.intel.com>

There is no need to check the local group twice for both
group_asym_packing and group_smt_balance. Adjust the code to facilitate
future checks for group types (cache-aware load balancing) as well.

No functional changes are expected.

Suggested-by: Peter Zijlstra (Intel)
Co-developed-by: Chen Yu
Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v2->v3: No change.
 kernel/sched/fair.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bb93cc046d73..b0cf4424d198 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11109,14 +11109,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 	sgs->group_weight = group->group_weight;
 
-	/* Check if dst CPU is idle and preferred to this group */
-	if (!local_group && env->idle && sgs->sum_h_nr_running &&
-	    sched_group_asym(env, sgs, group))
-		sgs->group_asym_packing = 1;
-
-	/* Check for loaded SMT group to be balanced to dst CPU */
-	if (!local_group && smt_balance(env, sgs, group))
-		sgs->group_smt_balance = 1;
+	if (!local_group) {
+		/* Check if dst CPU is idle and preferred to this group */
+		if (env->idle && sgs->sum_h_nr_running &&
+		    sched_group_asym(env, sgs, group))
+			sgs->group_asym_packing = 1;
+
+		/* Check for loaded SMT group to be balanced to dst CPU */
+		if (smt_balance(env, sgs, group))
+			sgs->group_smt_balance = 1;
+	}
 
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
-- 
2.32.0

From: Tim Chen
Subject: [PATCH v3 11/21] sched/cache: Prioritize tasks preferring destination LLC during balancing
Date: Tue, 10 Feb 2026 14:18:51 -0800
Message-Id: <4754991218da7da039a0891b0b9647f6eabd5716.1770760558.git.tim.c.chen@linux.intel.com>

During LLC load balancing, first check for tasks that prefer the
destination LLC and balance them to it before others. Mark source sched
groups containing tasks preferring non-local LLCs with the
group_llc_balance flag. This ensures the load balancer later pulls or
pushes these tasks toward their preferred LLCs.

The load balancer selects the busiest sched_group and migrates tasks to
less busy groups to distribute load across CPUs. With cache-aware
scheduling enabled, the busiest sched_group is the one with the most
tasks preferring the destination LLC. If the group has the llc_balance
flag set, cache-aware load balancing is triggered.

Introduce the helper function update_llc_busiest() to identify the
sched_group with the most tasks preferring the destination LLC.
Suggested-by: K Prateek Nayak
Co-developed-by: Chen Yu
Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v2->v3: Consider sd->nr_balance_failed when deciding whether LLC
    load balance should be used. (Peter Zijlstra)

 kernel/sched/fair.c | 77 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 76 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b0cf4424d198..43dcf2827298 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9649,6 +9649,11 @@ enum group_type {
 	 * from balancing the load across the system.
 	 */
 	group_imbalanced,
+	/*
+	 * There are tasks running on non-preferred LLC, possible to move
+	 * them to their preferred LLC without creating too much imbalance.
+	 */
+	group_llc_balance,
 	/*
 	 * The CPU is overloaded and can't provide expected CPU cycles to all
 	 * tasks.
@@ -10561,6 +10566,7 @@ struct sg_lb_stats {
 	enum group_type group_type;
 	unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
 	unsigned int group_smt_balance; /* Task on busy SMT be moved */
+	unsigned int group_llc_balance; /* Tasks should be moved to preferred LLC */
 	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
@@ -10819,6 +10825,9 @@ group_type group_classify(unsigned int imbalance_pct,
 	if (group_is_overloaded(imbalance_pct, sgs))
 		return group_overloaded;
 
+	if (sgs->group_llc_balance)
+		return group_llc_balance;
+
 	if (sg_imbalanced(group))
 		return group_imbalanced;
 
@@ -11012,11 +11021,66 @@ static void record_sg_llc_stats(struct lb_env *env,
 	if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
 		WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
 }
+
+/*
+ * Do LLC balance on sched group that contains LLC, and have tasks preferring
+ * to run on LLC in idle dst_cpu.
+ */
+static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
+			       struct sched_group *group)
+{
+	if (!sched_cache_enabled())
+		return false;
+
+	if (env->sd->flags & SD_SHARE_LLC)
+		return false;
+
+	/*
+	 * Don't do cache aware balancing if there
+	 * are too many balance failures.
+	 *
+	 * Should fall back to regular load balancing
+	 * after repeated cache aware balance failures.
+	 */
+	if (env->sd->nr_balance_failed >=
+	    env->sd->cache_nice_tries + 1)
+		return false;
+
+	if (sgs->nr_pref_dst_llc &&
+	    can_migrate_llc(cpumask_first(sched_group_span(group)),
+			    env->dst_cpu, 0, true) == mig_llc)
+		return true;
+
+	return false;
+}
+
+static bool update_llc_busiest(struct lb_env *env,
+			       struct sg_lb_stats *busiest,
+			       struct sg_lb_stats *sgs)
+{
+	/*
+	 * There are more tasks that want to run on dst_cpu's LLC.
+	 */
+	return sgs->nr_pref_dst_llc > busiest->nr_pref_dst_llc;
+}
 #else
 static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_stats *sgs,
 				       struct sched_group *group)
 {
 }
+
+static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
+			       struct sched_group *group)
+{
+	return false;
+}
+
+static bool update_llc_busiest(struct lb_env *env,
+			       struct sg_lb_stats *busiest,
+			       struct sg_lb_stats *sgs)
+{
+	return false;
+}
 #endif
 
 /**
@@ -11118,6 +11182,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		/* Check for loaded SMT group to be balanced to dst CPU */
 		if (smt_balance(env, sgs, group))
 			sgs->group_smt_balance = 1;
+
+		/* Check for tasks in this group can be moved to their preferred LLC */
+		if (llc_balance(env, sgs, group))
+			sgs->group_llc_balance = 1;
 	}
 
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
@@ -11181,6 +11249,10 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		/* Select the overloaded group with highest avg_load.
		 */
		return sgs->avg_load > busiest->avg_load;

+	case group_llc_balance:
+		/* Select the group with the most tasks preferring the dst LLC */
+		return update_llc_busiest(env, busiest, sgs);
+
	case group_imbalanced:
		/*
		 * Select the 1st imbalanced group as we don't have any way to
@@ -11443,6 +11515,7 @@ static bool update_pick_idlest(struct sched_group *idlest,
			return false;
		break;

+	case group_llc_balance:
	case group_imbalanced:
	case group_asym_packing:
	case group_smt_balance:
@@ -11575,6 +11648,7 @@ sched_balance_find_dst_group(struct sched_domain *sd, struct task_struct *p, int
			return NULL;
		break;

+	case group_llc_balance:
	case group_imbalanced:
	case group_asym_packing:
	case group_smt_balance:
@@ -12074,7 +12148,8 @@ static struct sched_group *sched_balance_find_src_group(struct lb_env *env)
	 * group's child domain.
	 */
	if (sds.prefer_sibling && local->group_type == group_has_spare &&
-	    sibling_imbalance(env, &sds, busiest, local) > 1)
+	    (busiest->group_type == group_llc_balance ||
+	     sibling_imbalance(env, &sds, busiest, local) > 1))
		goto force_balance;

	if (busiest->group_type != group_overloaded) {
--
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy", Vincent Guittot
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 12/21] sched/cache: Add migrate_llc_task migration type for cache-aware balancing
Date: Tue, 10 Feb 2026 14:18:52 -0800
Message-Id: <9038c2e0d40b744d5db19138c384819717eb03e6.1770760558.git.tim.c.chen@linux.intel.com>

Introduce a new migration type, migrate_llc_task, to support cache-aware
load balancing.

After identifying the busiest sched_group (the one with the most tasks
preferring the destination LLC), mark migrations with this type. During
load balancing, each runqueue in the busiest sched_group is examined,
and the runqueue with the highest number of tasks preferring the
destination CPU is selected as the busiest runqueue.

Signed-off-by: Tim Chen
---
Notes:
    v2->v3: Let the enum and switch statements have the same order.
        (Peter Zijlstra)

 kernel/sched/fair.c | 38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43dcf2827298..1697791ef11c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9665,7 +9665,8 @@ enum migration_type {
	migrate_load = 0,
	migrate_util,
	migrate_task,
-	migrate_misfit
+	migrate_misfit,
+	migrate_llc_task
 };

 #define LBF_ALL_PINNED	0x01
@@ -10266,6 +10267,10 @@ static int detach_tasks(struct lb_env *env)

			env->imbalance = 0;
			break;
+
+		case migrate_llc_task:
+			env->imbalance--;
+			break;
		}

		detach_task(p, env);
@@ -11902,6 +11907,15 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
		return;
	}

+#ifdef CONFIG_SCHED_CACHE
+	if (busiest->group_type == group_llc_balance) {
+		/* Move a task that prefers the local LLC */
+		env->migration_type = migrate_llc_task;
+		env->imbalance = 1;
+		return;
+	}
+#endif
+
	if (busiest->group_type == group_imbalanced) {
		/*
		 * In the group_imb case we cannot rely on group-wide averages
@@ -12209,6 +12223,11 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
	struct rq *busiest = NULL, *rq;
	unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1;
	unsigned int busiest_nr = 0;
+#ifdef CONFIG_SCHED_CACHE
+	unsigned int busiest_pref_llc = 0;
+	struct sched_domain *sd_tmp;
+	int dst_llc;
+#endif
	int i;

	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
@@ -12336,6 +12355,21 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,

			break;

+		case migrate_llc_task:
+#ifdef CONFIG_SCHED_CACHE
+			sd_tmp = rcu_dereference(rq->sd);
+			dst_llc = llc_id(env->dst_cpu);
+			if (valid_llc_buf(sd_tmp, dst_llc)) {
+				unsigned int this_pref_llc = sd_tmp->pf[dst_llc];
+
+				if (busiest_pref_llc < this_pref_llc) {
+					busiest_pref_llc = this_pref_llc;
+					busiest = rq;
+				}
+			}
+#endif
+			break;
		}
	}

@@ -12499,6 +12533,8 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
	case migrate_misfit:
		__schedstat_add(sd->lb_imbalance_misfit[idle], env->imbalance);
		break;
+	case migrate_llc_task:
+		break;
	}
 }
--
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy", Vincent Guittot
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 13/21] sched/cache: Handle moving single tasks to/from their preferred LLC
Date: Tue, 10 Feb 2026 14:18:53 -0800
Message-Id: <92fa33fc26f069d8044bac3b0efc3598f53131de.1770760558.git.tim.c.chen@linux.intel.com>

In generic (non-cache-aware) load balancing, if the busiest runqueue
has only one task, active balancing may be invoked to move it. However,
this migration might break LLC locality. Before migrating, check
whether the task is running on its preferred LLC: do not move a lone
task to another LLC if doing so would take the task away from its
preferred LLC or cause excessive imbalance between LLCs.

On the other hand, if the migration type is migrate_llc_task, there are
tasks on env->src_cpu that want to be migrated to their preferred LLC,
so launch the active load balance anyway.

Co-developed-by: Chen Yu
Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v2->v3: Remove redundant rcu read lock in break_llc_locality().
 kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1697791ef11c..03959a701514 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9999,12 +9999,60 @@ static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu
				    task_util(p), to_pref);
 }

+/*
+ * Check if active load balance would break LLC locality in
+ * terms of cache aware load balance.
+ */
+static inline bool
+alb_break_llc(struct lb_env *env)
+{
+	if (!sched_cache_enabled())
+		return false;
+
+	if (cpus_share_cache(env->src_cpu, env->dst_cpu))
+		return false;
+
+	/*
+	 * All tasks prefer to stay on their current CPU.
+	 * Do not pull a task from its preferred CPU if:
+	 * 1. It is the only task running there; OR
+	 * 2. Migrating it away from its preferred LLC would violate
+	 *    the cache-aware scheduling policy.
+	 */
+	if (env->src_rq->nr_pref_llc_running &&
+	    env->src_rq->nr_pref_llc_running == env->src_rq->cfs.h_nr_runnable) {
+		unsigned long util = 0;
+		struct task_struct *cur;
+
+		if (env->src_rq->nr_running <= 1)
+			return true;
+
+		/*
+		 * We reach here from load balance with
+		 * rcu_read_lock() held.
+		 */
+		cur = rcu_dereference(env->src_rq->curr);
+		if (cur)
+			util = task_util(cur);
+
+		if (can_migrate_llc(env->src_cpu, env->dst_cpu,
+				    util, false) == mig_forbid)
+			return true;
+	}
+
+	return false;
+}
 #else
 static inline bool get_llc_stats(int cpu, unsigned long *util,
				 unsigned long *cap)
 {
	return false;
 }
+
+static inline bool
+alb_break_llc(struct lb_env *env)
+{
+	return false;
+}
 #endif
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
@@ -12421,6 +12469,9 @@ static int need_active_balance(struct lb_env *env)
 {
	struct sched_domain *sd = env->sd;

+	if (alb_break_llc(env))
+		return 0;
+
	if (asym_active_balance(env))
		return 1;

@@ -12440,7 +12491,8 @@ static int need_active_balance(struct lb_env *env)
		return 1;
	}

-	if (env->migration_type == migrate_misfit)
+	if (env->migration_type == migrate_misfit ||
+	    env->migration_type == migrate_llc_task)
		return 1;

	return 0;
--
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy", Vincent Guittot
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 14/21] sched/cache: Respect LLC preference in task migration and detach
Date: Tue, 10 Feb 2026 14:18:54 -0800
Message-Id: <82aeb78bbfb80cb6861b85e4db9d398f6c8e331b.1770760558.git.tim.c.chen@linux.intel.com>

During the final step of load balancing, can_migrate_task() now
considers a task's LLC preference before moving it out of its
preferred LLC.

Suggested-by: Peter Zijlstra (Intel)
Suggested-by: K Prateek Nayak
Co-developed-by: Chen Yu
Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v2->v3: Use a similar mechanism to NUMA balancing, which skips over
        tasks that would degrade locality in can_migrate_task(); only if
        nr_balance_failed is high enough do we ignore that. (Peter Zijlstra)

    Let migrate_degrades_locality() take precedence over
    migrate_degrades_llc(), which aims to migrate towards the
    preferred NUMA node.
        (Peter Zijlstra)

 kernel/sched/fair.c  | 64 +++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h | 13 +++++++++
 2 files changed, 73 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03959a701514..d1145997b88d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9973,8 +9973,8 @@ static enum llc_mig can_migrate_llc(int src_cpu, int dst_cpu,
  * Check if task p can migrate from source LLC to
  * destination LLC in terms of cache aware load balance.
  */
-static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
-							struct task_struct *p)
+static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
+					 struct task_struct *p)
 {
	struct mm_struct *mm;
	bool to_pref;
@@ -10041,6 +10041,47 @@ alb_break_llc(struct lb_env *env)

	return false;
 }
+
+/*
+ * Check if migrating task p from env->src_cpu to
+ * env->dst_cpu would break LLC locality.
+ */
+static bool migrate_degrades_llc(struct task_struct *p, struct lb_env *env)
+{
+	if (!sched_cache_enabled())
+		return false;
+
+	if (task_has_sched_core(p))
+		return false;
+
+	/*
+	 * Skip over tasks that would degrade LLC locality;
+	 * only when nr_balance_failed is sufficiently high do we
+	 * ignore this constraint.
+	 *
+	 * The threshold of cache_nice_tries is set 1 higher than
+	 * nr_balance_failed to avoid excessive task migration at
+	 * the same time. Refer to the comments around llc_balance().
+	 */
+	if (env->sd->nr_balance_failed >= env->sd->cache_nice_tries + 1)
+		return false;
+
+	/*
+	 * We know env->src_cpu has some tasks that prefer to run on
+	 * env->dst_cpu; skip the tasks that do not prefer
+	 * env->dst_cpu, and find one that does.
+	 */
+	if (env->migration_type == migrate_llc_task &&
+	    task_llc(p) != llc_id(env->dst_cpu))
+		return true;
+
+	if (can_migrate_llc_task(env->src_cpu,
+				 env->dst_cpu, p) != mig_forbid)
+		return false;
+
+	return true;
+}
+
 #else
 static inline bool get_llc_stats(int cpu, unsigned long *util,
				 unsigned long *cap)
@@ -10053,6 +10094,12 @@ alb_break_llc(struct lb_env *env)
 {
	return false;
 }
+
+static inline bool
+migrate_degrades_llc(struct task_struct *p, struct lb_env *env)
+{
+	return false;
+}
 #endif
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
@@ -10150,10 +10197,19 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
		return 1;

	degrades = migrate_degrades_locality(p, env);
-	if (!degrades)
+	if (!degrades) {
+		/*
+		 * If NUMA locality is not broken, further check
+		 * whether migration would hurt LLC locality.
+		 */
+		if (migrate_degrades_llc(p, env))
+			return 0;
+
		hot = task_hot(p, env);
-	else
+	} else {
		hot = degrades > 0;
+	}

	if (!hot || env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
		if (hot)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ac8c7ac1ac0d..c18e59f320a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1495,6 +1495,14 @@ extern void sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags);
 extern void sched_core_get(void);
 extern void sched_core_put(void);

+static inline bool task_has_sched_core(struct task_struct *p)
+{
+	if (sched_core_disabled())
+		return false;
+
+	return !!p->core_cookie;
+}
+
 #else /* !CONFIG_SCHED_CORE: */

 static inline bool sched_core_enabled(struct rq *rq)
@@ -1534,6 +1542,11 @@ static inline bool sched_group_cookie_match(struct rq *rq,
	return true;
 }

+static inline bool task_has_sched_core(struct task_struct *p)
+{
+	return false;
+}
+
 #endif /* !CONFIG_SCHED_CORE */

 #ifdef CONFIG_RT_GROUP_SCHED
--
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy", Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Tim Chen, Aubrey Li, Zhao Liu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts
Date: Tue, 10 Feb 2026 14:18:55 -0800

From: Chen Yu

A performance regression was observed by Prateek when running hackbench
with many threads per process (high fd count). To avoid this, processes
with a large number of active threads are excluded from cache-aware
scheduling.

With sched_cache enabled, record the number of active threads in each
process during the periodic task_cache_work(). While iterating over
CPUs, if the currently running task belongs to the same process as the
task that launched task_cache_work(), increment the active thread
count. If the number of active threads in the process exceeds the
number of cores (CPUs divided by the number of SMT siblings) in the
LLC, do not enable cache-aware scheduling.

For users who wish to perform task aggregation regardless, a debugfs
knob is provided for tuning in a subsequent patch.

Suggested-by: K Prateek Nayak
Suggested-by: Aaron Lu
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
Signed-off-by: Chen Yu
---
Notes:
    v2->v3: Put the calculation of nr_running_avg and its use into one
        patch. (Peter Zijlstra)

    Use guard(rcu)() when calculating the number of active threads
    of the process.
        (Peter Zijlstra)

    Introduce update_avg_scale() rather than using update_avg() to
    fit systems with a small LLC. (Aaron Lu)

 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 59 ++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c98bd1c46088..511c9b263386 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2346,6 +2346,7 @@ struct sched_cache_stat {
	struct sched_cache_time __percpu *pcpu_sched;
	raw_spinlock_t lock;
	unsigned long epoch;
+	u64 nr_running_avg;
	int cpu;
 } ____cacheline_aligned_in_smp;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d1145997b88d..86b6b08e7e1e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1223,6 +1223,19 @@ static inline bool valid_llc_buf(struct sched_domain *sd,
	return valid_llc_id(id);
 }

+static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
+{
+	int smt_nr = 1;
+
+#ifdef CONFIG_SCHED_SMT
+	if (sched_smt_active())
+		smt_nr = cpumask_weight(cpu_smt_mask(cpu));
+#endif
+
+	return !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
+			      per_cpu(sd_llc_size, cpu));
+}
+
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
 {
	struct sched_domain *sd;
@@ -1417,7 +1430,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
	 */
	if (time_after(epoch, READ_ONCE(mm->sc_stat.epoch) +
		       EPOCH_LLC_AFFINITY_TIMEOUT) ||
-	    get_nr_threads(p) <= 1) {
+	    get_nr_threads(p) <= 1 ||
+	    exceed_llc_nr(mm, cpu_of(rq))) {
		if (mm->sc_stat.cpu != -1)
			mm->sc_stat.cpu = -1;
	}
@@ -1458,13 +1472,31 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
	}
 }

+static inline void update_avg_scale(u64 *avg, u64 sample)
+{
+	int factor = per_cpu(sd_llc_size, raw_smp_processor_id());
+	s64 diff = sample - *avg;
+	u32 divisor;
+
+	/*
+	 * Scale the divisor based on the number of CPUs contained
+	 * in the LLC. This scaling ensures smaller LLC domains use
+	 * a smaller divisor to achieve more precise sensitivity to
+	 * changes in nr_running, while larger LLC domains are capped
+	 * at a maximum divisor of 8, which is the default EWMA
+	 * smoothing factor in update_avg().
+	 */
+	divisor = clamp_t(u32, (factor >> 2), 2, 8);
+	*avg += div64_s64(diff, divisor);
+}
+
 static void task_cache_work(struct callback_head *work)
 {
-	struct task_struct *p = current;
+	struct task_struct *p = current, *cur;
	struct mm_struct *mm = p->mm;
	unsigned long m_a_occ = 0;
	unsigned long curr_m_a_occ = 0;
-	int cpu, m_a_cpu = -1;
+	int cpu, m_a_cpu = -1, nr_running = 0;
	cpumask_var_t cpus;

	WARN_ON_ONCE(work != &p->cache_work);
@@ -1474,6 +1506,13 @@ static void task_cache_work(struct callback_head *work)
	if (p->flags & PF_EXITING)
		return;

+	if (get_nr_threads(p) <= 1) {
+		if (mm->sc_stat.cpu != -1)
+			mm->sc_stat.cpu = -1;
+
+		return;
+	}
+
	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
		return;

@@ -1497,6 +1536,12 @@ static void task_cache_work(struct callback_head *work)
				m_occ = occ;
				m_cpu = i;
			}
+			scoped_guard (rcu) {
+				cur = rcu_dereference(cpu_rq(i)->curr);
+				if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
+				    cur->mm == mm)
+					nr_running++;
+			}
		}

		/*
@@ -1540,6 +1585,7 @@ static void task_cache_work(struct callback_head *work)
		mm->sc_stat.cpu = m_a_cpu;
	}

+	update_avg_scale(&mm->sc_stat.nr_running_avg, nr_running);
	free_cpumask_var(cpus);
 }

@@ -9988,6 +10034,13 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
		return mig_unrestricted;

+	/* Skip cache aware load balance for single/too many threads */
+	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu)) {
+		if (mm->sc_stat.cpu != -1)
+			mm->sc_stat.cpu = -1;
+		return mig_unrestricted;
+	}
+
	if (cpus_share_cache(dst_cpu, cpu))
		to_pref = true;
	else if (cpus_share_cache(src_cpu, cpu))
-- 
2.32.0

From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R .
Shenoy", Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 16/21] sched/cache: Avoid cache-aware scheduling for memory-heavy processes
Date: Tue, 10 Feb 2026 14:18:56 -0800

From: Chen Yu

Prateek and Tingyin reported that memory-intensive workloads (such as
stream) can saturate memory bandwidth and caches on the preferred LLC
when sched_cache aggregates too many threads.

To mitigate this, estimate a process's memory footprint by comparing
its RSS (anonymous and shared pages) to the size of the LLC. If RSS
exceeds the LLC size, skip cache-aware scheduling. Note that RSS is
only an approximation of the memory footprint. By default, the
comparison is strict, but a later patch will allow users to provide a
hint to adjust this threshold.

According to the test from Adam, some systems do not have a shared L3
but do have shared L2 clusters. In that case, the L2 becomes the
LLC[1].

Link[1]: https://lore.kernel.org/all/3cb6ebc7-a2fd-42b3-8739-b00e28a09cb6@os.amperecomputing.com/
Suggested-by: K Prateek Nayak
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
Signed-off-by: Chen Yu
---
Notes:
    v2->v3:
    Fix overflow issue in exceed_llc_capacity() by changing
    the type of llc from int to u64.
    (Jianyong Wu, Yangyu Chen)

 include/linux/cacheinfo.h | 21 ++++++++++-------
 kernel/sched/fair.c       | 48 +++++++++++++++++++++++++++++++++++----
 2 files changed, 56 insertions(+), 13 deletions(-)

diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index c8f4f0a0b874..82d0d59ca0e1 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -113,18 +113,11 @@ int acpi_get_cache_info(unsigned int cpu,

 const struct attribute_group *cache_get_priv_group(struct cacheinfo *this_leaf);

-/*
- * Get the cacheinfo structure for the cache associated with @cpu at
- * level @level.
- * cpuhp lock must be held.
- */
-static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
+static inline struct cacheinfo *_get_cpu_cacheinfo_level(int cpu, int level)
 {
 	struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
 	int i;

-	lockdep_assert_cpus_held();
-
 	for (i = 0; i < ci->num_leaves; i++) {
 		if (ci->info_list[i].level == level) {
 			if (ci->info_list[i].attributes & CACHE_ID)
@@ -136,6 +129,18 @@ static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
 	return NULL;
 }

+/*
+ * Get the cacheinfo structure for the cache associated with @cpu at
+ * level @level.
+ * cpuhp lock must be held.
+ */
+static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
+{
+	lockdep_assert_cpus_held();
+
+	return _get_cpu_cacheinfo_level(cpu, level);
+}
+
 /*
  * Get the id of the cache associated with @cpu at level @level.
  * cpuhp lock must be held.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 86b6b08e7e1e..ee4982af2bdd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1223,6 +1223,37 @@ static inline bool valid_llc_buf(struct sched_domain *sd,
 	return valid_llc_id(id);
 }

+static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
+{
+	struct cacheinfo *ci;
+	u64 rss, llc;
+
+	/*
+	 * get_cpu_cacheinfo_level() can not be used
+	 * because it requires the cpu_hotplug_lock
+	 * to be held. Use _get_cpu_cacheinfo_level()
+	 * directly because the 'cpu' can not be
+	 * offlined at the moment.
+	 */
+	ci = _get_cpu_cacheinfo_level(cpu, 3);
+	if (!ci) {
+		/*
+		 * On system without L3 but with shared L2,
+		 * L2 becomes the LLC.
+		 */
+		ci = _get_cpu_cacheinfo_level(cpu, 2);
+		if (!ci)
+			return true;
+	}
+
+	llc = ci->size;
+
+	rss = get_mm_counter(mm, MM_ANONPAGES) +
+		get_mm_counter(mm, MM_SHMEMPAGES);
+
+	return (llc <= (rss * PAGE_SIZE));
+}
+
 static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
 {
 	int smt_nr = 1;
@@ -1431,7 +1462,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	if (time_after(epoch, READ_ONCE(mm->sc_stat.epoch) +
 		       EPOCH_LLC_AFFINITY_TIMEOUT) ||
 	    get_nr_threads(p) <= 1 ||
-	    exceed_llc_nr(mm, cpu_of(rq))) {
+	    exceed_llc_nr(mm, cpu_of(rq)) ||
+	    exceed_llc_capacity(mm, cpu_of(rq))) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;
 	}
@@ -1496,7 +1528,7 @@ static void task_cache_work(struct callback_head *work)
 	struct mm_struct *mm = p->mm;
 	unsigned long m_a_occ = 0;
 	unsigned long curr_m_a_occ = 0;
-	int cpu, m_a_cpu = -1, nr_running = 0;
+	int cpu, m_a_cpu = -1, nr_running = 0, curr_cpu;
 	cpumask_var_t cpus;

 	WARN_ON_ONCE(work != &p->cache_work);
@@ -1506,7 +1538,9 @@ static void task_cache_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;

-	if (get_nr_threads(p) <= 1) {
+	curr_cpu = task_cpu(p);
+	if (get_nr_threads(p) <= 1 ||
+	    exceed_llc_capacity(mm, curr_cpu)) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;

@@ -10034,8 +10068,12 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
 	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
 		return mig_unrestricted;

-	/* skip cache aware load balance for single/too many threads */
-	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu)) {
+	/*
+	 * Skip cache aware load balance for single/too many threads
+	 * or large memory RSS.
+	 */
+	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu) ||
+	    exceed_llc_capacity(mm, dst_cpu)) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;
 		return mig_unrestricted;
-- 
2.32.0
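The strict RSS-versus-LLC comparison this patch introduces can be sketched in user space as follows (a sketch under the assumption of 4 KiB pages; the kernel compares `rss * PAGE_SIZE` against the `ci->size` reported by cacheinfo):

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096ULL /* assumption: 4 KiB pages */

/*
 * Strict form of the check in this patch: the process "exceeds" the
 * LLC when its RSS (anon + shmem pages) covers the whole cache, in
 * which case packing all of its threads onto one LLC would thrash it.
 */
static bool exceed_llc_capacity(uint64_t rss_pages, uint64_t llc_bytes)
{
	return llc_bytes <= rss_pages * PAGE_SIZE_BYTES;
}
```

With a 32 MB LLC (33554432 bytes), a process holding 8192 four-KiB pages (exactly 32 MB) is already considered to exceed capacity, since the comparison is `<=`.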
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R .
Shenoy", Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 17/21] sched/cache: Enable cache aware scheduling for multi LLCs NUMA node
Date: Tue, 10 Feb 2026 14:18:57 -0800

From: Chen Yu

Introduce sched_cache_present to enable cache-aware scheduling on
NUMA nodes with multiple LLCs.

Cache-aware load balancing should only be enabled if there is more
than one LLC within a NUMA node. sched_cache_present indicates
whether the platform has this topology.

Suggested-by: Libo Chen
Suggested-by: Adam Li
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
Signed-off-by: Chen Yu
---
Notes:
    v2->v3:
    No change.
 kernel/sched/sched.h    |  3 ++-
 kernel/sched/topology.c | 18 ++++++++++++++++--
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c18e59f320a6..59ac04625842 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3916,11 +3916,12 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct
 #endif /* !CONFIG_SCHED_MM_CID */

 #ifdef CONFIG_SCHED_CACHE
+DECLARE_STATIC_KEY_FALSE(sched_cache_present);
 extern int max_llcs;

 static inline bool sched_cache_enabled(void)
 {
-	return false;
+	return static_branch_unlikely(&sched_cache_present);
 }
 #endif
 extern void init_sched_mm(struct task_struct *p);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index dae78b5915a7..9104fed25351 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -801,6 +801,7 @@ enum s_alloc {
 };

 #ifdef CONFIG_SCHED_CACHE
+DEFINE_STATIC_KEY_FALSE(sched_cache_present);
 static bool alloc_sd_pref(const struct cpumask *cpu_map,
 			  struct s_data *d)
 {
@@ -2604,6 +2605,7 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 			       struct sched_domain_attr *attr)
 {
 	enum s_alloc alloc_state = sa_none;
+	bool has_multi_llcs = false;
 	struct sched_domain *sd;
 	struct s_data d;
 	struct rq *rq = NULL;
@@ -2731,10 +2733,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	 * between LLCs and memory channels.
 	 */
 	nr_llcs = sd->span_weight / child->span_weight;
-	if (nr_llcs == 1)
+	if (nr_llcs == 1) {
 		imb = sd->span_weight >> 3;
-	else
+	} else {
 		imb = nr_llcs;
+		has_multi_llcs = true;
+	}
 	imb = max(1U, imb);
 	sd->imb_numa_nr = imb;

@@ -2796,6 +2800,16 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att

 	ret = 0;
 error:
+#ifdef CONFIG_SCHED_CACHE
+	/*
+	 * TBD: check before writing to it. sched domain rebuild
+	 * is not in the critical path, leave as-is for now.
+	 */
+	if (!ret && has_multi_llcs)
+		static_branch_enable_cpuslocked(&sched_cache_present);
+	else
+		static_branch_disable_cpuslocked(&sched_cache_present);
+#endif
 	__free_domain_allocs(&d, alloc_state, cpu_map);

 	return ret;
-- 
2.32.0
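The `has_multi_llcs` detection above hinges on one ratio of sched-domain span weights. A toy illustration (a sketch, not the kernel code — span weights are passed in as plain integers instead of being read from `sd->span_weight`):

```c
#include <stdbool.h>

/*
 * A NUMA-level domain spans nr_llcs child (LLC) domains:
 * nr_llcs = sd->span_weight / child->span_weight. The
 * sched_cache_present static key is only worth enabling when some
 * node holds more than one LLC.
 */
static bool node_has_multi_llcs(unsigned int node_span_weight,
				unsigned int llc_span_weight)
{
	return node_span_weight / llc_span_weight > 1;
}
```

For example, a 128-CPU node built from 16-CPU LLCs yields nr_llcs = 8 and enables the key; a node whose single LLC covers all of its CPUs yields nr_llcs = 1 and leaves it disabled.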
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R .
Shenoy", Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 18/21] sched/cache: Allow the user space to turn on and off cache aware scheduling
Date: Tue, 10 Feb 2026 14:18:58 -0800

From: Chen Yu

Provide a debugfs knob to let the user turn cache-aware scheduling
on and off at runtime.

Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v2->v3:
    Split into a new patch for better review, use
    kstrtobool_from_user() to get the user input.
    (Peter Zijlstra)

 kernel/sched/debug.c    | 45 ++++++++++++++++++++++++++++
 kernel/sched/sched.h    |  7 +++--
 kernel/sched/topology.c | 65 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 115 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 41caa22e0680..bae747eddc59 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -215,6 +215,46 @@ static const struct file_operations sched_scaling_fops = {
 	.release	= single_release,
 };

+#ifdef CONFIG_SCHED_CACHE
+static ssize_t
+sched_cache_enable_write(struct file *filp, const char __user *ubuf,
+			 size_t cnt, loff_t *ppos)
+{
+	bool val;
+	int ret;
+
+	ret = kstrtobool_from_user(ubuf, cnt, &val);
+	if (ret)
+		return ret;
+
+	sysctl_sched_cache_user = val;
+
+	sched_cache_active_set_unlocked();
+
+	return cnt;
+}
+
+static int sched_cache_enable_show(struct seq_file *m, void *v)
+{
+	seq_printf(m, "%d\n", sysctl_sched_cache_user);
+	return 0;
+}
+
+static int sched_cache_enable_open(struct inode *inode,
+				   struct file *filp)
+{
+	return single_open(filp, sched_cache_enable_show, NULL);
+}
+
+static const struct file_operations sched_cache_enable_fops = {
+	.open		= sched_cache_enable_open,
+	.write		= sched_cache_enable_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+#endif
+
 #ifdef CONFIG_PREEMPT_DYNAMIC

 static ssize_t sched_dynamic_write(struct file *filp, const char __user *ubuf,
@@ -523,6 +563,11 @@ static __init int sched_init_debug(void)
 	debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
 #endif /* CONFIG_NUMA_BALANCING */

+#ifdef CONFIG_SCHED_CACHE
+	debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
+			    &sched_cache_enable_fops);
+#endif
+
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);

 	debugfs_fair_server_init();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 59ac04625842..adf3428745dd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3917,12 +3917,15 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct

 #ifdef CONFIG_SCHED_CACHE
 DECLARE_STATIC_KEY_FALSE(sched_cache_present);
-extern int max_llcs;
+DECLARE_STATIC_KEY_FALSE(sched_cache_active);
+extern int max_llcs, sysctl_sched_cache_user;

 static inline bool sched_cache_enabled(void)
 {
-	return static_branch_unlikely(&sched_cache_present);
+	return static_branch_unlikely(&sched_cache_active);
 }
+
+extern void sched_cache_active_set_unlocked(void);
 #endif
 extern void init_sched_mm(struct task_struct *p);

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9104fed25351..e86dea1b9e86 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -801,7 +801,16 @@ enum s_alloc {
 };

 #ifdef CONFIG_SCHED_CACHE
+/* hardware support for cache aware scheduling */
 DEFINE_STATIC_KEY_FALSE(sched_cache_present);
+/*
+ * Indicator of whether cache aware scheduling
+ * is active, used by the scheduler.
+ */
+DEFINE_STATIC_KEY_FALSE(sched_cache_active);
+/* user wants cache aware scheduling [0 or 1] */
+int sysctl_sched_cache_user = 1;
+
 static bool alloc_sd_pref(const struct cpumask *cpu_map,
 			  struct s_data *d)
 {
@@ -833,6 +842,60 @@ static bool alloc_sd_pref(const struct cpumask *cpu_map,

 	return false;
 }
+
+static void _sched_cache_active_set(bool enable, bool locked)
+{
+	if (enable) {
+		if (locked)
+			static_branch_enable_cpuslocked(&sched_cache_active);
+		else
+			static_branch_enable(&sched_cache_active);
+	} else {
+		if (locked)
+			static_branch_disable_cpuslocked(&sched_cache_active);
+		else
+			static_branch_disable(&sched_cache_active);
+	}
+}
+
+/*
+ * Enable/disable cache aware scheduling according to
+ * user input and the presence of hardware support.
+ */
+static void sched_cache_active_set(bool locked)
+{
+	/* hardware does not support */
+	if (!static_branch_likely(&sched_cache_present)) {
+		_sched_cache_active_set(false, locked);
+		return;
+	}
+
+	/*
+	 * user wants it or not ?
+	 * TBD: read before writing the static key.
+	 * It is not in the critical path, leave as-is
+	 * for now.
+	 */
+	if (sysctl_sched_cache_user) {
+		_sched_cache_active_set(true, locked);
+		if (sched_debug())
+			pr_info("%s: enabling cache aware scheduling\n", __func__);
+	} else {
+		_sched_cache_active_set(false, locked);
+		if (sched_debug())
+			pr_info("%s: disabling cache aware scheduling\n", __func__);
+	}
+}
+
+static void sched_cache_active_set_locked(void)
+{
+	return sched_cache_active_set(true);
+}
+
+void sched_cache_active_set_unlocked(void)
+{
+	return sched_cache_active_set(false);
+}
 #else
 static bool alloc_sd_pref(const struct cpumask *cpu_map,
 			  struct s_data *d)
@@ -2809,6 +2872,8 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		static_branch_enable_cpuslocked(&sched_cache_present);
 	else
 		static_branch_disable_cpuslocked(&sched_cache_present);
+
+	sched_cache_active_set_locked();
 #endif
 	__free_domain_allocs(&d, alloc_state, cpu_map);

-- 
2.32.0
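The `llc_enabled` write path above relies on `kstrtobool_from_user()` to interpret the user's input. A rough user-space model of the accepted first characters (a sketch — the real kernel helper also handles "on"/"off", which is omitted here):

```c
#include <stdbool.h>

/*
 * Rough model of the kernel's kstrtobool(): the decision is made from
 * the first character of the input ('1'/'y'/'Y' -> true,
 * '0'/'n'/'N' -> false), so trailing newlines from `echo` are fine.
 */
static int parse_bool(const char *s, bool *val)
{
	switch (s[0]) {
	case '1': case 'y': case 'Y':
		*val = true;
		return 0;
	case '0': case 'n': case 'N':
		*val = false;
		return 0;
	default:
		return -1; /* the kernel helper returns -EINVAL */
	}
}
```

This is why `echo 1 > /sys/kernel/debug/sched/llc_enabled` works despite the trailing newline the shell appends.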
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R . Shenoy", Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 19/21] sched/cache: Add user control to adjust the aggressiveness of cache-aware scheduling
Date: Tue, 10 Feb 2026 14:18:59 -0800

From: Chen Yu

Introduce a set of debugfs knobs to control how aggressively
cache-aware scheduling aggregates tasks.

(1) llc_aggr_tolerance

With sched_cache enabled, the scheduler uses a process's RSS as a
proxy for its LLC footprint to determine if aggregating tasks on the
preferred LLC could cause cache contention. If RSS exceeds the LLC
size, aggregation is skipped.

Some workloads with large RSS but small actual memory footprints may
still benefit from aggregation. Since the kernel cannot efficiently
track per-task cache usage (resctrl is user-space only), userspace
can provide a more accurate hint.

Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let users
control how strictly RSS limits aggregation. Values range from 0 to
100:

- 0: Cache-aware scheduling is disabled.
- 1: Strict; tasks with RSS larger than LLC size are skipped.
- >=100: Aggressive; tasks are aggregated regardless of RSS.

For example, with a 32MB L3 cache:
- llc_aggr_tolerance=1  -> tasks with RSS > 32MB are skipped.
- llc_aggr_tolerance=99 -> tasks with RSS > 784GB are skipped
  (784GB = (1 + (99 - 1) * 256) * 32MB).

Similarly, llc_aggr_tolerance also controls how strictly the number
of active threads is considered when doing cache-aware load balance.
The number of SMTs is also considered: high SMT counts reduce the
aggregation capacity, preventing excessive task aggregation on
SMT-heavy systems like Power10/Power11.

Yangyu suggested introducing separate aggregation controls for the
number of active threads and the memory RSS check. Since there are
plans to add per-process/task-group controls, fine-grained tunables
are deferred to that implementation.

(2) llc_epoch_period, llc_epoch_affinity_timeout, llc_imb_pct and
llc_overaggr_pct are also turned into tunables.

Suggested-by: K Prateek Nayak
Suggested-by: Madadi Vineeth Reddy
Suggested-by: Shrikanth Hegde
Suggested-by: Tingyin Duan
Suggested-by: Jianyong Wu
Suggested-by: Yangyu Chen
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
Signed-off-by: Chen Yu
---
Notes:
    v2->v3:
    Simplify the implementation by using debugfs_create_u32()
    for all tunable parameters.
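The tolerance arithmetic above can be checked with a small sketch (the function name is illustrative, not from the patch; the kernel applies the scale inside exceed_llc_capacity()):

```c
#include <stdint.h>

/*
 * Effective RSS cutoff implied by llc_aggr_tolerance: the LLC size
 * is scaled by (1 + (tolerance - 1) * 256). Tolerance 0 disables
 * cache-aware scheduling (cutoff 0); tolerance >= 100 ignores RSS
 * entirely (modelled here as UINT64_MAX).
 */
static uint64_t rss_cutoff_bytes(unsigned int tolerance, uint64_t llc_bytes)
{
	if (tolerance == 0)
		return 0;
	if (tolerance >= 100)
		return UINT64_MAX;
	return (1 + (uint64_t)(tolerance - 1) * 256) * llc_bytes;
}
```

For a 32 MB LLC, tolerance 1 gives a 32 MB cutoff, and tolerance 99 gives (1 + 98*256) * 32MB = 25089 * 32MB = 841847144448 bytes, about 784 GB as the commit message states.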
 kernel/sched/debug.c | 10 ++++++++
 kernel/sched/fair.c  | 59 ++++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |  5 ++++
 3 files changed, 67 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index bae747eddc59..dc4b7de6569f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -566,6 +566,16 @@ static __init int sched_init_debug(void)
 #ifdef CONFIG_SCHED_CACHE
 	debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
 			    &sched_cache_enable_fops);
+	debugfs_create_u32("llc_aggr_tolerance", 0644, debugfs_sched,
+			   &llc_aggr_tolerance);
+	debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched,
+			   &llc_epoch_period);
+	debugfs_create_u32("llc_epoch_affinity_timeout", 0644, debugfs_sched,
+			   &llc_epoch_affinity_timeout);
+	debugfs_create_u32("llc_overaggr_pct", 0644, debugfs_sched,
+			   &llc_overaggr_pct);
+	debugfs_create_u32("llc_imb_pct", 0644, debugfs_sched,
+			   &llc_imb_pct);
 #endif

 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee4982af2bdd..da4291ace24c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1191,6 +1191,12 @@ static void set_next_buddy(struct sched_entity *se);
 #define EPOCH_PERIOD	(HZ / 100)	/* 10 ms */
 #define EPOCH_LLC_AFFINITY_TIMEOUT	5	/* 50 ms */

+__read_mostly unsigned int llc_aggr_tolerance = 1;
+__read_mostly unsigned int llc_epoch_period = EPOCH_PERIOD;
+__read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
+__read_mostly unsigned int llc_imb_pct = 20;
+__read_mostly unsigned int llc_overaggr_pct = 50;
+
 static int llc_id(int cpu)
 {
 	if (cpu < 0)
@@ -1223,10 +1229,22 @@ static inline bool valid_llc_buf(struct sched_domain *sd,
 	return valid_llc_id(id);
 }

+static inline int get_sched_cache_scale(int mul)
+{
+	if (!llc_aggr_tolerance)
+		return 0;
+
+	if (llc_aggr_tolerance >= 100)
+		return INT_MAX;
+
+	return (1 + (llc_aggr_tolerance - 1) * mul);
+}
+
 static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 {
 	struct cacheinfo *ci;
 	u64 rss, llc;
+	int scale;

 	/*
	 * get_cpu_cacheinfo_level() can not be used
@@ -1251,20 +1269,47 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 	rss = get_mm_counter(mm, MM_ANONPAGES) +
	      get_mm_counter(mm, MM_SHMEMPAGES);

-	return (llc <= (rss * PAGE_SIZE));
+	/*
+	 * Scale the LLC size by 256*llc_aggr_tolerance
+	 * and compare it to the task's RSS size.
+	 *
+	 * Suppose the L3 size is 32MB. If the
+	 * llc_aggr_tolerance is 1:
+	 * When the RSS is larger than 32MB, the process
+	 * is regarded as exceeding the LLC capacity. If
+	 * the llc_aggr_tolerance is 99:
+	 * When the RSS is larger than 784GB, the process
+	 * is regarded as exceeding the LLC capacity:
+	 * 784GB = (1 + (99 - 1) * 256) * 32MB
+	 * If the llc_aggr_tolerance is 100:
+	 * ignore the RSS.
+	 */
+	scale = get_sched_cache_scale(256);
+	if (scale == INT_MAX)
+		return false;
+
+	return ((llc * scale) <= (rss * PAGE_SIZE));
 }

 static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
 {
-	int smt_nr = 1;
+	int smt_nr = 1, scale;

 #ifdef CONFIG_SCHED_SMT
 	if (sched_smt_active())
 		smt_nr = cpumask_weight(cpu_smt_mask(cpu));
 #endif

+	/*
+	 * Scale the number of 'cores' in a LLC by llc_aggr_tolerance
+	 * and compare it to the task's active threads.
+	 */
+	scale = get_sched_cache_scale(1);
+	if (scale == INT_MAX)
+		return false;
+
 	return !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
-			      per_cpu(sd_llc_size, cpu));
+			      (scale * per_cpu(sd_llc_size, cpu)));
 }

 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
@@ -1365,7 +1410,7 @@ static inline void __update_mm_sched(struct rq *rq,
 	long delta = now - rq->cpu_epoch_next;

 	if (delta > 0) {
-		n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
+		n = (delta + llc_epoch_period - 1) / llc_epoch_period;
 		rq->cpu_epoch += n;
 		rq->cpu_epoch_next += n * EPOCH_PERIOD;
 		__shr_u64(&rq->cpu_runtime, n);
@@ -1460,7 +1505,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 * has only 1 thread, invalidate its preferred state.
 	 */
 	if (time_after(epoch,
-		       READ_ONCE(mm->sc_stat.epoch) + EPOCH_LLC_AFFINITY_TIMEOUT) ||
+		       READ_ONCE(mm->sc_stat.epoch) + llc_epoch_affinity_timeout) ||
	    get_nr_threads(p) <= 1 ||
	    exceed_llc_nr(mm, cpu_of(rq)) ||
	    exceed_llc_capacity(mm, cpu_of(rq))) {
@@ -9920,7 +9965,7 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
  * (default: ~50%)
  */
 #define fits_llc_capacity(util, max)	\
-	((util) * 2 < (max))
+	((util) * 100 < (max) * llc_overaggr_pct)

 /*
  * The margin used when comparing utilization.
@@ -9930,7 +9975,7 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
  */
 /* Allows dst util to be bigger than src util by up to bias percent */
 #define util_greater(util1, util2)	\
-	((util1) * 100 > (util2) * 120)
+	((util1) * 100 > (util2) * (100 + llc_imb_pct))

 /* Called from load balancing paths with rcu_read_lock held */
 static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index adf3428745dd..f4785f84b1f1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3919,6 +3919,11 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct
 DECLARE_STATIC_KEY_FALSE(sched_cache_present);
 DECLARE_STATIC_KEY_FALSE(sched_cache_active);
 extern int max_llcs, sysctl_sched_cache_user;
+extern unsigned int llc_aggr_tolerance;
+extern unsigned int llc_epoch_period;
+extern unsigned int llc_epoch_affinity_timeout;
+extern unsigned int llc_imb_pct;
+extern unsigned int llc_overaggr_pct;

 static inline bool sched_cache_enabled(void)
 {
-- 
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy", Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 20/21] -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy for each process via proc fs
Date: Tue, 10 Feb 2026 14:19:00 -0800
Message-Id: <09c48847deeb9d2c1c7de1f2799cc128cd2e866e.1770760558.git.tim.c.chen@linux.intel.com>

From: Chen Yu

Debug patch only.

Show the per-LLC occupancy in /proc/{PID}/schedstat, with each column
corresponding to one LLC. This can be used to verify whether the
cache-aware load balancer works as expected by aggregating threads onto
dedicated LLCs.

Suppose there are 2 LLCs and the sampling duration is 10 seconds.

With cache-aware load balancing enabled:
0 12281    <--- LLC0 residency delta is 0, LLC1 is 12 seconds
0 18881
0 16217

With cache-aware load balancing disabled:
6497 15802
9299 5435
17811 8278

Co-developed-by: Aaron Lu
Signed-off-by: Aaron Lu
Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v2->v3: Enhance the informational output by printing the task's
    preferred LLC.
    (Aaron Lu)

 fs/proc/base.c           | 31 +++++++++++++++++++++++++
 include/linux/mm_types.h | 17 +++++++++++---
 include/linux/sched.h    |  6 +++++
 kernel/sched/fair.c      | 50 ++++++++++++++++++++++++++++++++++++----
 4 files changed, 97 insertions(+), 7 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 4eec684baca9..76b49e80af1a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -518,6 +518,37 @@ static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns,
 		   (unsigned long long)task->se.sum_exec_runtime,
 		   (unsigned long long)task->sched_info.run_delay,
 		   task->sched_info.pcount);
+#ifdef CONFIG_SCHED_CACHE
+	if (sched_cache_inuse()) {
+		struct mm_struct *mm = task->mm;
+		u64 *llc_runtime;
+		int mm_sched_llc;
+
+		if (!mm)
+			return 0;
+
+		llc_runtime = kcalloc(max_llcs, sizeof(u64), GFP_KERNEL);
+		if (!llc_runtime)
+			return 0;
+
+		if (get_mm_per_llc_runtime(task, llc_runtime))
+			goto out;
+
+		if (mm->sc_stat.cpu == -1)
+			mm_sched_llc = -1;
+		else
+			mm_sched_llc = llc_id(mm->sc_stat.cpu);
+
+		for (int i = 0; i < max_llcs; i++)
+			seq_printf(m, "%s%s%llu ",
+				   i == task->preferred_llc ? "*" : "",
+				   i == mm_sched_llc ? "?" : "",
+				   llc_runtime[i]);
+		seq_puts(m, "\n");
+out:
+		kfree(llc_runtime);
+	}
+#endif

 	return 0;
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 777a48523aa6..2b8d0ec032e8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1523,17 +1523,26 @@ static inline unsigned int mm_cid_size(void)

 #ifdef CONFIG_SCHED_CACHE
 void mm_init_sched(struct mm_struct *mm,
-		   struct sched_cache_time __percpu *pcpu_sched);
+		   struct sched_cache_time __percpu *pcpu_sched,
+		   struct sched_cache_time __percpu *pcpu_time);

 static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
 {
 	struct sched_cache_time __percpu *pcpu_sched =
-		alloc_percpu_noprof(struct sched_cache_time);
+		alloc_percpu_noprof(struct sched_cache_time),
+		*pcpu_time;

 	if (!pcpu_sched)
 		return -ENOMEM;

-	mm_init_sched(mm, pcpu_sched);
+	pcpu_time = alloc_percpu_noprof(struct sched_cache_time);
+	if (!pcpu_time) {
+		free_percpu(pcpu_sched);
+		return -ENOMEM;
+	}
+
+	mm_init_sched(mm, pcpu_sched, pcpu_time);
+
 	return 0;
 }

@@ -1542,7 +1551,9 @@ static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
 static inline void mm_destroy_sched(struct mm_struct *mm)
 {
 	free_percpu(mm->sc_stat.pcpu_sched);
+	free_percpu(mm->sc_stat.pcpu_time);
 	mm->sc_stat.pcpu_sched = NULL;
+	mm->sc_stat.pcpu_time = NULL;
 }
 #else /* !CONFIG_SCHED_CACHE */

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 511c9b263386..4236cacbb409 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2344,12 +2344,18 @@ struct sched_cache_time {

 struct sched_cache_stat {
 	struct sched_cache_time __percpu *pcpu_sched;
+	struct sched_cache_time __percpu *pcpu_time;
 	raw_spinlock_t lock;
 	unsigned long epoch;
 	u64 nr_running_avg;
 	int cpu;
 } ____cacheline_aligned_in_smp;

+int get_mm_per_llc_runtime(struct task_struct *p, u64 *buf);
+bool sched_cache_inuse(void);
+extern int max_llcs;
+int llc_id(int cpu);
+
 #else

 struct sched_cache_stat { };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da4291ace24c..25cee3dd767c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1197,7 +1197,12 @@ __read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEO
 __read_mostly unsigned int llc_imb_pct = 20;
 __read_mostly unsigned int llc_overaggr_pct = 50;

-static int llc_id(int cpu)
+bool sched_cache_inuse(void)
+{
+	return sched_cache_enabled();
+}
+
+int llc_id(int cpu)
 {
 	if (cpu < 0)
 		return -1;
@@ -1365,17 +1370,20 @@ static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
 }

 void mm_init_sched(struct mm_struct *mm,
-		   struct sched_cache_time __percpu *_pcpu_sched)
+		   struct sched_cache_time __percpu *_pcpu_sched,
+		   struct sched_cache_time __percpu *_pcpu_time)
 {
 	unsigned long epoch;
 	int i;

 	for_each_possible_cpu(i) {
 		struct sched_cache_time *pcpu_sched = per_cpu_ptr(_pcpu_sched, i);
+		struct sched_cache_time *pcpu_time = per_cpu_ptr(_pcpu_time, i);
 		struct rq *rq = cpu_rq(i);

 		pcpu_sched->runtime = 0;
 		pcpu_sched->epoch = rq->cpu_epoch;
+		pcpu_time->runtime = 0;
 		epoch = rq->cpu_epoch;
 	}

@@ -1389,6 +1397,8 @@ void mm_init_sched(struct mm_struct *mm,
 	 * the readers may get invalid mm_sched_epoch, etc.
 	 */
 	smp_store_release(&mm->sc_stat.pcpu_sched, _pcpu_sched);
+	/* barrier */
+	smp_store_release(&mm->sc_stat.pcpu_time, _pcpu_time);
 }

 /* because why would C be fully specified */
@@ -1474,7 +1484,8 @@ static unsigned int task_running_on_cpu(int cpu, struct task_struct *p);
 static inline
 void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 {
-	struct sched_cache_time *pcpu_sched;
+	struct sched_cache_time *pcpu_sched,
+				*pcpu_time;
 	struct mm_struct *mm = p->mm;
 	int mm_sched_llc = -1;
 	unsigned long epoch;
@@ -1488,14 +1499,18 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 * init_task, kthreads and user thread created
 	 * by user_mode_thread() don't have mm.
	 */
-	if (!mm || !mm->sc_stat.pcpu_sched)
+	if (!mm || !mm->sc_stat.pcpu_sched ||
+	    !mm->sc_stat.pcpu_time)
 		return;

 	pcpu_sched = per_cpu_ptr(p->mm->sc_stat.pcpu_sched, cpu_of(rq));
+	pcpu_time = per_cpu_ptr(p->mm->sc_stat.pcpu_time, cpu_of(rq));

 	scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
 		__update_mm_sched(rq, pcpu_sched);
 		pcpu_sched->runtime += delta_exec;
+		/* pure runtime without decay */
+		pcpu_time->runtime += delta_exec;
 		rq->cpu_runtime += delta_exec;
 		epoch = rq->cpu_epoch;
 	}
@@ -1676,6 +1691,33 @@ void init_sched_mm(struct task_struct *p)
 	work->next = work;
 }

+/* p->pi_lock is held */
+int get_mm_per_llc_runtime(struct task_struct *p, u64 *buf)
+{
+	struct sched_cache_time *pcpu_time;
+	struct mm_struct *mm = p->mm;
+	int cpu;
+
+	if (!mm)
+		return -EINVAL;
+
+	rcu_read_lock();
+	for_each_online_cpu(cpu) {
+		int llc = llc_id(cpu);
+		u64 runtime_ms;
+
+		if (!valid_llc_id(llc))
+			continue;
+
+		pcpu_time = per_cpu_ptr(mm->sc_stat.pcpu_time, cpu);
+		runtime_ms = div_u64(pcpu_time->runtime, NSEC_PER_MSEC);
+		buf[llc] += runtime_ms;
+	}
+	rcu_read_unlock();
+
+	return 0;
+}
+
 #else

 static inline void account_mm_sched(struct rq *rq, struct task_struct *p,
-- 
2.32.0

From nobody Thu Apr 2 15:36:01 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy", Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 21/21] -- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load balance statistics
Date: Tue, 10 Feb 2026 14:19:01 -0800
Message-Id: <5d663caaed7ebe93ab9b272235675b2400b3ed8b.1770760558.git.tim.c.chen@linux.intel.com>

From: Chen Yu

Debug patch only.

These trace events can be consumed (via bpftrace, etc.) to monitor
cache-aware load-balancing activity - specifically, whether tasks are
moved to their preferred LLC, moved out of their preferred LLC, or
whether cache-aware load balancing is skipped because the memory
footprint limit is exceeded or there are too many active tasks.
Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v2->v3: Add more trace events when the process exceeds the limitation
    of LLC size or number of active threads (moved from schedstat to
    trace events for better bpf tracking).

 include/trace/events/sched.h | 79 ++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          | 40 ++++++++++++++----
 2 files changed, 110 insertions(+), 9 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 7b2645b50e78..b73327653e4b 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -10,6 +10,85 @@
 #include
 #include

+#ifdef CONFIG_SCHED_CACHE
+TRACE_EVENT(sched_exceed_llc_cap,
+
+	TP_PROTO(struct task_struct *t, int exceeded),
+
+	TP_ARGS(t, exceeded),
+
+	TP_STRUCT__entry(
+		__array( char,	comm,	TASK_COMM_LEN	)
+		__field( pid_t,	pid			)
+		__field( int,	exceeded		)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		__entry->pid		= t->pid;
+		__entry->exceeded	= exceeded;
+	),
+
+	TP_printk("comm=%s pid=%d exceed_cap=%d",
+		  __entry->comm, __entry->pid,
+		  __entry->exceeded)
+);
+
+TRACE_EVENT(sched_exceed_llc_nr,
+
+	TP_PROTO(struct task_struct *t, int exceeded),
+
+	TP_ARGS(t, exceeded),
+
+	TP_STRUCT__entry(
+		__array( char,	comm,	TASK_COMM_LEN	)
+		__field( pid_t,	pid			)
+		__field( int,	exceeded		)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		__entry->pid		= t->pid;
+		__entry->exceeded	= exceeded;
+	),
+
+	TP_printk("comm=%s pid=%d exceed_nr=%d",
+		  __entry->comm, __entry->pid,
+		  __entry->exceeded)
+);
+
+TRACE_EVENT(sched_attach_task,
+
+	TP_PROTO(struct task_struct *t, int pref_cpu, int pref_llc,
+		 int attach_cpu, int attach_llc),
+
+	TP_ARGS(t, pref_cpu, pref_llc, attach_cpu, attach_llc),
+
+	TP_STRUCT__entry(
+		__array( char,	comm,	TASK_COMM_LEN	)
+		__field( pid_t,	pid		)
+		__field( int,	pref_cpu	)
+		__field( int,	pref_llc	)
+		__field( int,	attach_cpu	)
+		__field( int,	attach_llc	)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		__entry->pid		= t->pid;
+		__entry->pref_cpu	= pref_cpu;
+		__entry->pref_llc	= pref_llc;
+		__entry->attach_cpu	= attach_cpu;
+		__entry->attach_llc	= attach_llc;
+	),
+
+	TP_printk("comm=%s pid=%d pref_cpu=%d pref_llc=%d attach_cpu=%d attach_llc=%d",
+		  __entry->comm, __entry->pid,
+		  __entry->pref_cpu, __entry->pref_llc,
+		  __entry->attach_cpu, __entry->attach_llc)
+);
+#endif
+
 /*
  * Tracepoint for calling kthread_stop, performed to end a kthread:
  */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25cee3dd767c..977091fd0e49 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1245,9 +1245,11 @@ static inline int get_sched_cache_scale(int mul)
 	return (1 + (llc_aggr_tolerance - 1) * mul);
 }

-static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
+static bool exceed_llc_capacity(struct mm_struct *mm, int cpu,
+				struct task_struct *p)
 {
 	struct cacheinfo *ci;
+	bool exceeded;
 	u64 rss, llc;
 	int scale;

@@ -1293,12 +1295,18 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 	if (scale == INT_MAX)
 		return false;

-	return ((llc * scale) <= (rss * PAGE_SIZE));
+	exceeded = ((llc * scale) <= (rss * PAGE_SIZE));
+
+	trace_sched_exceed_llc_cap(p, exceeded);
+
+	return exceeded;
 }

-static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
+static bool exceed_llc_nr(struct mm_struct *mm, int cpu,
+			  struct task_struct *p)
 {
 	int smt_nr = 1, scale;
+	bool exceeded;

 #ifdef CONFIG_SCHED_SMT
 	if (sched_smt_active())
@@ -1313,8 +1321,12 @@ static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
 	if (scale == INT_MAX)
 		return false;

-	return !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
+	exceeded = !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
 			      (scale * per_cpu(sd_llc_size, cpu)));
+
+	trace_sched_exceed_llc_nr(p, exceeded);
+
+	return exceeded;
 }

 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
@@ -1522,8 +1534,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	if (time_after(epoch,
		       READ_ONCE(mm->sc_stat.epoch) + llc_epoch_affinity_timeout) ||
	    get_nr_threads(p) <= 1 ||
-	    exceed_llc_nr(mm, cpu_of(rq)) ||
-	    exceed_llc_capacity(mm, cpu_of(rq))) {
+	    exceed_llc_nr(mm, cpu_of(rq), p) ||
+	    exceed_llc_capacity(mm, cpu_of(rq), p)) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;
 	}
@@ -1600,7 +1612,7 @@ static void task_cache_work(struct callback_head *work)

 	curr_cpu = task_cpu(p);
 	if (get_nr_threads(p) <= 1 ||
-	    exceed_llc_capacity(mm, curr_cpu)) {
+	    exceed_llc_capacity(mm, curr_cpu, p)) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;

@@ -10159,8 +10171,8 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
 	 * Skip cache aware load balance for single/too many threads
 	 * or large memory RSS.
 	 */
-	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu) ||
-	    exceed_llc_capacity(mm, dst_cpu)) {
+	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu, p) ||
+	    exceed_llc_capacity(mm, dst_cpu, p)) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;
 		return mig_unrestricted;
@@ -10602,6 +10614,16 @@ static void attach_task(struct rq *rq, struct task_struct *p)
 {
 	lockdep_assert_rq_held(rq);

+#ifdef CONFIG_SCHED_CACHE
+	if (p->mm) {
+		int pref_cpu = p->mm->sc_stat.cpu;
+
+		trace_sched_attach_task(p,
+			pref_cpu,
+			pref_cpu != -1 ? llc_id(pref_cpu) : -1,
+			cpu_of(rq), llc_id(cpu_of(rq)));
+	}
+#endif
 	WARN_ON_ONCE(task_rq(p) != rq);
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
 	wakeup_preempt(rq, p, 0);
-- 
2.32.0