From nobody Fri Dec 19 19:37:42 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3E1062EA481 for ; Wed, 3 Dec 2025 23:01:19 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802883; cv=none; b=UKLk6Rg4Ag2RrVZTM6q83e57jrtOabhFLy87jTKdCORkErT5oscdmGvQFuZ8uzk4JddS6cPh1pfkZIjrorb34GjVrTfhTnjF3Ev1eA9P3f9SHm6a8HG5wxWf/yS25iz0NQWmXUw8INvgj0a9A56o6dRBuDjYNgK/XPE8bAKiBUg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802883; c=relaxed/simple; bh=6xbRUXX8feoSk8bOjg/vcAGiqy4i78lNKWOOyysMsTg=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=ezTAzjYx2Rp52iZO2WWYVcoqrFo5k7CxRy+shLmmCt9X8OAnGBmN2eYuhkz/I7t0LW4rAjnmLXBSt4s5lKDI7cjNxUO/rV3B0EWqv13ojuB5QKkGvUXb3YGE9U0EUSc8TdruI55O35k40Uh0lNID1k89G7Dxb8VJ6Ckm0RWpbqE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=hXZ7RTSy; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="hXZ7RTSy" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802880; x=1796338880; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=6xbRUXX8feoSk8bOjg/vcAGiqy4i78lNKWOOyysMsTg=; b=hXZ7RTSym9fS1Xvrd9iY6zdRxiZpXzgeaEnDkbWt4E9kikaWOOGcUivi QbmpWan09GqoanGn0S6Vft9B7BxCpebF9EW9KXpkUelSttyWWDfdj3/y5 FTK7BCv2Ykd5RjEGqBmouxnoYSthhh0M052SACkie+UXmvYxcT/sOQCCX HOsATO8B6T2nuON/L4dyuLl54HqVuf+JcbMOZ0ABnQ6ZFHGM/cCwqCXcJ AmUI07y2Khz2g6thC1D3WG4YXreJSp+sT28iidXrCmaZBan6+WI286Msl K0/hGg9Y68V2FBcOV+wIiAuy+MY5XGtKxf7nZIp0LSDOwP7fiuTJEB8U9 g==; X-CSE-ConnectionGUID: cpmnVUlITmyoLapwFs/Now== X-CSE-MsgGUID: 1fVK363gQBOq9Aw6XgwrKg== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136182" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136182" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:18 -0800 X-CSE-ConnectionGUID: d8cQS9oyRh+diLyesP0AjA== X-CSE-MsgGUID: JRVFNz3/S1eHusPWEwHUNA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763734" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:18 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . 
Shenoy" , Vincent Guittot Cc: Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing Date: Wed, 3 Dec 2025 15:07:20 -0800 Message-Id: <06f0d7edbc3185ec730b50b3b00d87ace44169b3.1764801860.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: "Peter Zijlstra (Intel)" Adds infrastructure to enable cache-aware load balancing, which improves cache locality by grouping tasks that share resources within the same cache domain. This reduces cache misses and improves overall data access efficiency. In this initial implementation, threads belonging to the same process are treated as entities that likely share working sets. The mechanism tracks per-process CPU occupancy across cache domains and attempts to migrate threads toward cache-hot domains where their process already has active threads, thereby enhancing locality. This provides a basic model for cache affinity. While the current code targets the last-level cache (LLC), the approach could be extended to other domain types such as clusters (L2) or node-internal groupings. At present, the mechanism selects the CPU within an LLC that has the highest recent runtime. Subsequent patches in this series will use this information in the load-balancing path to guide task placement toward preferred LLCs. In the future, more advanced policies could be integrated through NUMA balancing-for example, migrating a task to its preferred LLC when spare capacity exists, or swapping tasks across LLCs to improve cache affinity. Grouping of tasks could also be generalized from that of a process to be that of a NUMA group, or be user configurable. Originally-by: Peter Zijlstra (Intel) Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: Restore the original CPU scan to cover all online CPUs, rather than scanning within the preferred NUMA node. (Peter Zijlstra) =20 Use rq->curr instead of rq->donor. (K Prateek Nayak) =20 Minor fix in task_tick_cache() to use if (mm->mm_sched_epoch >=3D rq->cpu_epoch) to avoid mm_sched_epoch going backwards. include/linux/mm_types.h | 44 +++++++ include/linux/sched.h | 11 ++ init/Kconfig | 11 ++ kernel/fork.c | 6 + kernel/sched/core.c | 6 + kernel/sched/fair.c | 258 +++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 8 ++ 7 files changed, 344 insertions(+) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 90e5790c318f..1ea16ef90566 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -939,6 +939,11 @@ typedef struct { DECLARE_BITMAP(__mm_flags, NUM_MM_FLAG_BITS); } __private mm_flags_t; =20 +struct mm_sched { + u64 runtime; + unsigned long epoch; +}; + struct kioctx_table; struct iommu_mm_data; struct mm_struct { @@ -1029,6 +1034,17 @@ struct mm_struct { */ raw_spinlock_t cpus_allowed_lock; #endif +#ifdef CONFIG_SCHED_CACHE + /* + * Track per-cpu-per-process occupancy as a proxy for cache residency. + * See account_mm_sched() and ... 
+ */ + struct mm_sched __percpu *pcpu_sched; + raw_spinlock_t mm_sched_lock; + unsigned long mm_sched_epoch; + int mm_sched_cpu; +#endif + #ifdef CONFIG_MMU atomic_long_t pgtables_bytes; /* size of all page tables */ #endif @@ -1487,6 +1503,34 @@ static inline unsigned int mm_cid_size(void) static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct = cpumask *cpumask) { } #endif /* CONFIG_SCHED_MM_CID */ =20 +#ifdef CONFIG_SCHED_CACHE +void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *pcpu_sc= hed); + +static inline int mm_alloc_sched_noprof(struct mm_struct *mm) +{ + struct mm_sched __percpu *pcpu_sched =3D alloc_percpu_noprof(struct mm_sc= hed); + + if (!pcpu_sched) + return -ENOMEM; + + mm_init_sched(mm, pcpu_sched); + return 0; +} + +#define mm_alloc_sched(...) alloc_hooks(mm_alloc_sched_noprof(__VA_ARGS__)) + +static inline void mm_destroy_sched(struct mm_struct *mm) +{ + free_percpu(mm->pcpu_sched); + mm->pcpu_sched =3D NULL; +} +#else /* !CONFIG_SCHED_CACHE */ + +static inline int mm_alloc_sched(struct mm_struct *mm) { return 0; } +static inline void mm_destroy_sched(struct mm_struct *mm) { } + +#endif /* CONFIG_SCHED_CACHE */ + struct mmu_gather; extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm); extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct= *mm); diff --git a/include/linux/sched.h b/include/linux/sched.h index b469878de25c..278b529c91df 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1406,6 +1406,10 @@ struct task_struct { unsigned long numa_pages_migrated; #endif /* CONFIG_NUMA_BALANCING */ =20 +#ifdef CONFIG_SCHED_CACHE + struct callback_head cache_work; +#endif + #ifdef CONFIG_RSEQ struct rseq __user *rseq; u32 rseq_len; @@ -2428,4 +2432,11 @@ extern void migrate_enable(void); =20 DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable()) =20 +#ifdef CONFIG_SCHED_CACHE +static inline bool sched_cache_enabled(void) +{ + return false; +} +#endif + #endif diff --git a/init/Kconfig b/init/Kconfig index cab3ad28ca49..88556ef8cfd1 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -983,6 +983,17 @@ config NUMA_BALANCING =20 This system will be inactive on UMA systems. =20 +config SCHED_CACHE + bool "Cache aware load balance" + default y + depends on SMP + help + When enabled, the scheduler will attempt to aggregate tasks from + the same process onto a single Last Level Cache (LLC) domain when + possible. This improves cache locality by keeping tasks that share + resources within the same cache domain, reducing cache misses and + lowering data access latency. 
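
[ Editor's illustration, not part of the patch: the kernel/sched/fair.c
  hunks further down decay each process's per-CPU runtime by half every
  10 ms epoch and prefer the LLC with the largest accumulated occupancy.
  The standalone model below shows that idea with made-up numbers; it
  uses raw decayed runtime instead of the runtime fraction that
  fraction_mm_sched() computes, and every name and value in it is
  hypothetical. ]

#include <stdio.h>
#include <stdint.h>

#define NR_CPUS		8
#define CPUS_PER_LLC	4	/* two LLCs of four CPUs each */

/* Halve the accumulated runtime once per elapsed epoch, like __shr_u64(). */
static void decay(uint64_t *runtime, unsigned int epochs)
{
	*runtime = epochs >= 64 ? 0 : *runtime >> epochs;
}

/* Pick the LLC whose CPUs hold the most decayed runtime of this process. */
static int preferred_llc(const uint64_t occ[NR_CPUS])
{
	uint64_t best = 0;
	int best_llc = -1;

	for (int llc = 0; llc < NR_CPUS / CPUS_PER_LLC; llc++) {
		uint64_t sum = 0;

		for (int i = 0; i < CPUS_PER_LLC; i++)
			sum += occ[llc * CPUS_PER_LLC + i];
		if (sum > best) {
			best = sum;
			best_llc = llc;
		}
	}
	return best_llc;
}

int main(void)
{
	uint64_t occ[NR_CPUS] = {
		4000, 3000, 0, 0,	/* LLC 0 */
		0, 0, 9000, 0,		/* LLC 1 */
	};

	for (int i = 0; i < NR_CPUS; i++)
		decay(&occ[i], 1);	/* one 10 ms epoch elapses */

	/* LLC 1 wins: 4500 vs 3500 after the decay. */
	printf("preferred LLC: %d\n", preferred_llc(occ));
	return 0;
}
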
+ config NUMA_BALANCING_DEFAULT_ENABLED bool "Automatically enable NUMA aware memory/task placement" default y diff --git a/kernel/fork.c b/kernel/fork.c index 3da0f08615a9..aae5053d1e30 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -680,6 +680,7 @@ void __mmdrop(struct mm_struct *mm) cleanup_lazy_tlbs(mm); =20 WARN_ON_ONCE(mm =3D=3D current->active_mm); + mm_destroy_sched(mm); mm_free_pgd(mm); mm_free_id(mm); destroy_context(mm); @@ -1083,6 +1084,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm= , struct task_struct *p, if (mm_alloc_cid(mm, p)) goto fail_cid; =20 + if (mm_alloc_sched(mm)) + goto fail_sched; + if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT, NR_MM_COUNTERS)) goto fail_pcpu; @@ -1092,6 +1096,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm= , struct task_struct *p, return mm; =20 fail_pcpu: + mm_destroy_sched(mm); +fail_sched: mm_destroy_cid(mm); fail_cid: destroy_context(mm); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f754a60de848..e8bdf03a4b7f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4488,6 +4488,7 @@ static void __sched_fork(u64 clone_flags, struct task= _struct *p) p->wake_entry.u_flags =3D CSD_TYPE_TTWU; p->migration_pending =3D NULL; init_sched_mm_cid(p); + init_sched_mm(p); } =20 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing); @@ -8791,6 +8792,11 @@ void __init sched_init(void) =20 rq->core_cookie =3D 0UL; #endif +#ifdef CONFIG_SCHED_CACHE + raw_spin_lock_init(&rq->cpu_epoch_lock); + rq->cpu_epoch_next =3D jiffies; +#endif + zalloc_cpumask_var_node(&rq->scratch_mask, GFP_KERNEL, cpu_to_node(i)); } =20 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 5b752324270b..cb82f558dc5b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1152,6 +1152,8 @@ void post_init_entity_util_avg(struct task_struct *p) sa->runnable_avg =3D sa->util_avg; } =20 +static inline void account_mm_sched(struct rq *rq, struct task_struct *p, = s64 delta_exec); + static s64 update_se(struct rq *rq, struct sched_entity *se) { u64 now =3D rq_clock_task(rq); @@ -1174,6 +1176,7 @@ static s64 update_se(struct rq *rq, struct sched_enti= ty *se) =20 trace_sched_stat_runtime(running, delta_exec); account_group_exec_runtime(running, delta_exec); + account_mm_sched(rq, running, delta_exec); =20 /* cgroup time is always accounted against the donor */ cgroup_account_cputime(donor, delta_exec); @@ -1193,6 +1196,259 @@ static s64 update_se(struct rq *rq, struct sched_en= tity *se) return delta_exec; } =20 +#ifdef CONFIG_SCHED_CACHE + +/* + * XXX numbers come from a place the sun don't shine -- probably wants to = be SD + * tunable or so. + */ +#define EPOCH_PERIOD (HZ / 100) /* 10 ms */ +#define EPOCH_LLC_AFFINITY_TIMEOUT 5 /* 50 ms */ + +static int llc_id(int cpu) +{ + if (cpu < 0) + return -1; + + return per_cpu(sd_llc_id, cpu); +} + +void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_s= ched) +{ + unsigned long epoch; + int i; + + for_each_possible_cpu(i) { + struct mm_sched *pcpu_sched =3D per_cpu_ptr(_pcpu_sched, i); + struct rq *rq =3D cpu_rq(i); + + pcpu_sched->runtime =3D 0; + pcpu_sched->epoch =3D rq->cpu_epoch; + epoch =3D rq->cpu_epoch; + } + + raw_spin_lock_init(&mm->mm_sched_lock); + mm->mm_sched_epoch =3D epoch; + mm->mm_sched_cpu =3D -1; + + /* + * The update to mm->pcpu_sched should not be reordered + * before initialization to mm's other fields, in case + * the readers may get invalid mm_sched_epoch, etc. 
+ */ + smp_store_release(&mm->pcpu_sched, _pcpu_sched); +} + +/* because why would C be fully specified */ +static __always_inline void __shr_u64(u64 *val, unsigned int n) +{ + if (n >=3D 64) { + *val =3D 0; + return; + } + *val >>=3D n; +} + +static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_= sched) +{ + lockdep_assert_held(&rq->cpu_epoch_lock); + + unsigned long n, now =3D jiffies; + long delta =3D now - rq->cpu_epoch_next; + + if (delta > 0) { + n =3D (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD; + rq->cpu_epoch +=3D n; + rq->cpu_epoch_next +=3D n * EPOCH_PERIOD; + __shr_u64(&rq->cpu_runtime, n); + } + + n =3D rq->cpu_epoch - pcpu_sched->epoch; + if (n) { + pcpu_sched->epoch +=3D n; + __shr_u64(&pcpu_sched->runtime, n); + } +} + +static unsigned long __no_profile fraction_mm_sched(struct rq *rq, struct = mm_sched *pcpu_sched) +{ + guard(raw_spinlock_irqsave)(&rq->cpu_epoch_lock); + + __update_mm_sched(rq, pcpu_sched); + + /* + * Runtime is a geometric series (r=3D0.5) and as such will sum to twice + * the accumulation period, this means the multiplcation here should + * not overflow. + */ + return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1); +} + +static inline +void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec) +{ + struct mm_struct *mm =3D p->mm; + struct mm_sched *pcpu_sched; + unsigned long epoch; + + if (!sched_cache_enabled()) + return; + + if (p->sched_class !=3D &fair_sched_class) + return; + /* + * init_task and kthreads don't having mm + */ + if (!mm || !mm->pcpu_sched) + return; + + pcpu_sched =3D per_cpu_ptr(p->mm->pcpu_sched, cpu_of(rq)); + + scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) { + __update_mm_sched(rq, pcpu_sched); + pcpu_sched->runtime +=3D delta_exec; + rq->cpu_runtime +=3D delta_exec; + epoch =3D rq->cpu_epoch; + } + + /* + * If this task hasn't hit task_cache_work() for a while, or it + * has only 1 thread, invalidate its preferred state. 
+ */ + if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT || + get_nr_threads(p) <=3D 1) { + if (mm->mm_sched_cpu !=3D -1) + mm->mm_sched_cpu =3D -1; + } +} + +static void task_tick_cache(struct rq *rq, struct task_struct *p) +{ + struct callback_head *work =3D &p->cache_work; + struct mm_struct *mm =3D p->mm; + + if (!sched_cache_enabled()) + return; + + if (!mm || !mm->pcpu_sched) + return; + + /* avoid moving backwards */ + if (mm->mm_sched_epoch >=3D rq->cpu_epoch) + return; + + guard(raw_spinlock)(&mm->mm_sched_lock); + + if (work->next =3D=3D work) { + task_work_add(p, work, TWA_RESUME); + WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch); + } +} + +static void __no_profile task_cache_work(struct callback_head *work) +{ + struct task_struct *p =3D current; + struct mm_struct *mm =3D p->mm; + unsigned long m_a_occ =3D 0; + unsigned long curr_m_a_occ =3D 0; + int cpu, m_a_cpu =3D -1; + cpumask_var_t cpus; + + WARN_ON_ONCE(work !=3D &p->cache_work); + + work->next =3D work; + + if (p->flags & PF_EXITING) + return; + + if (!zalloc_cpumask_var(&cpus, GFP_KERNEL)) + return; + + scoped_guard (cpus_read_lock) { + cpumask_copy(cpus, cpu_online_mask); + + for_each_cpu(cpu, cpus) { + /* XXX sched_cluster_active */ + struct sched_domain *sd =3D per_cpu(sd_llc, cpu); + unsigned long occ, m_occ =3D 0, a_occ =3D 0; + int m_cpu =3D -1, i; + + if (!sd) + continue; + + for_each_cpu(i, sched_domain_span(sd)) { + occ =3D fraction_mm_sched(cpu_rq(i), + per_cpu_ptr(mm->pcpu_sched, i)); + a_occ +=3D occ; + if (occ > m_occ) { + m_occ =3D occ; + m_cpu =3D i; + } + } + + /* + * Compare the accumulated occupancy of each LLC. The + * reason for using accumulated occupancy rather than average + * per CPU occupancy is that it works better in asymmetric LLC + * scenarios. + * For example, if there are 2 threads in a 4CPU LLC and 3 + * threads in an 8CPU LLC, it might be better to choose the one + * with 3 threads. However, this would not be the case if the + * occupancy is divided by the number of CPUs in an LLC (i.e., + * if average per CPU occupancy is used). + * Besides, NUMA balancing fault statistics behave similarly: + * the total number of faults per node is compared rather than + * the average number of faults per CPU. This strategy is also + * followed here. + */ + if (a_occ > m_a_occ) { + m_a_occ =3D a_occ; + m_a_cpu =3D m_cpu; + } + + if (llc_id(cpu) =3D=3D llc_id(mm->mm_sched_cpu)) + curr_m_a_occ =3D a_occ; + + cpumask_andnot(cpus, cpus, sched_domain_span(sd)); + } + } + + if (m_a_occ > (2 * curr_m_a_occ)) { + /* + * Avoid switching mm_sched_cpu too fast. + * The reason to choose 2X is because: + * 1. It is better to keep the preferred LLC stable, + * rather than changing it frequently and cause migrations + * 2. 2X means the new preferred LLC has at least 1 more + * busy CPU than the old one(200% vs 100%, eg) + * 3. 2X is chosen based on test results, as it delivers + * the optimal performance gain so far. + */ + mm->mm_sched_cpu =3D m_a_cpu; + } + + free_cpumask_var(cpus); +} + +void init_sched_mm(struct task_struct *p) +{ + struct callback_head *work =3D &p->cache_work; + + init_task_work(work, task_cache_work); + work->next =3D work; +} + +#else + +static inline void account_mm_sched(struct rq *rq, struct task_struct *p, + s64 delta_exec) { } + +void init_sched_mm(struct task_struct *p) { } + +static void task_tick_cache(struct rq *rq, struct task_struct *p) { } + +#endif + /* * Used by other classes to account runtime. 
*/ @@ -13124,6 +13380,8 @@ static void task_tick_fair(struct rq *rq, struct ta= sk_struct *curr, int queued) if (static_branch_unlikely(&sched_numa_balancing)) task_tick_numa(rq, curr); =20 + task_tick_cache(rq, curr); + update_misfit_status(curr, rq); check_update_overutilized_status(task_rq(curr)); =20 diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index adfb6e3409d7..84118b522f22 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1194,6 +1194,12 @@ struct rq { u64 clock_pelt_idle_copy; u64 clock_idle_copy; #endif +#ifdef CONFIG_SCHED_CACHE + raw_spinlock_t cpu_epoch_lock ____cacheline_aligned; + u64 cpu_runtime; + unsigned long cpu_epoch; + unsigned long cpu_epoch_next; +#endif =20 atomic_t nr_iowait; =20 @@ -3819,6 +3825,8 @@ static inline void task_tick_mm_cid(struct rq *rq, st= ruct task_struct *curr) { } static inline void init_sched_mm_cid(struct task_struct *t) { } #endif /* !CONFIG_SCHED_MM_CID */ =20 +extern void init_sched_mm(struct task_struct *p); + extern u64 avg_vruntime(struct cfs_rq *cfs_rq); extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se); static inline --=20 2.32.0 From nobody Fri Dec 19 19:37:42 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8872B2EC0A3 for ; Wed, 3 Dec 2025 23:01:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802884; cv=none; b=PIVYWfHNGhpYcL5pUf5pbJV6z5GC4MufyMLaT00/IZT2eIAKxBzqzRglsyVDKa18ZuvGOOBF6720BmFO1QjbQTlm++JQNaJ2Li4EQo87RGn9XE96gbHXFQW46Ye00LdP+tH7Hh5mDSD6E7sACuXB9wl4PappMcJ/np+rPkSv+fk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802884; c=relaxed/simple; bh=1tsEZhdWTsEDcQ9RmMyka/N/6UwyydH6Z8nvicoX744=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version:Content-Type; b=HIfoZU9H/SZm0t6eE4dquYqikhNAFvY4+BXlcqSIZ3CtUZjOzIUSOC63YZp9YVMZHXi1YQfdjTLmXM4JflgdOMpsYGcmIdM9y97XnpuLltYZndJJ3UMie+BQAS7WTzwavGBbWlwvukQFWzaAt18tTAj+n7TvfZUbdaq3Hd/PwnQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=gY2DTyL8; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="gY2DTyL8" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802881; x=1796338881; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=1tsEZhdWTsEDcQ9RmMyka/N/6UwyydH6Z8nvicoX744=; b=gY2DTyL8yWnM8kjFl12irX509n24BDz4iFKCqM6WCNUCLXRN5a5IlNUP CfYI2+/YpAT4bu6uPNEPMLhPBFM2XD4LK26owQYwXoYEFxXYOPyRzMCCr rISEhzC11YficDTuxwWe3QvPX3HaXsnsqXtK9HLG/hiT6NfkxrHYuu33P 2QVChiY0MqYwc1nvL417RDFrqZbCy7kRQLG02T5nK00USUuGMRvgZv+U3 gt7oM5XlbDtNyyU+5sVU7KIViaRsZSfklkuYRaOOMQ39LYUdIFQ+Ue6G0 EAocEYO+P59FhkDZmjjHTJ9I3dlRH+Fcb/w/MBdqObwG/r+XHEjXGxmQZ g==; X-CSE-ConnectionGUID: leGPfNk6R8KUwcSASjrrFg== X-CSE-MsgGUID: 
gXlnURSyTm+Cie+BIy/27w== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136204" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136204" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:21 -0800 X-CSE-ConnectionGUID: R/PsDZqXSOeZVekLwfPG7Q== X-CSE-MsgGUID: gEHdD8uJSdmMqZ2X+NMi/w== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763741" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:20 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 02/23] sched/cache: Record per-LLC utilization to guide cache-aware scheduling decisions Date: Wed, 3 Dec 2025 15:07:21 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable From: Chen Yu When a system becomes busy and a process=E2=80=99s preferred LLC is saturated with too many threads, tasks within that LLC migrate frequently. These in LLC migrations introduce latency and degrade performance. To avoid this, task aggregation should be suppressed when the preferred LLC is overloaded, which requires a metric to indicate LLC utilization. Record per LLC utilization/cpu capacity during periodic load balancing. These statistics will be used in later patches to decide whether tasks should be aggregated into their preferred LLC. Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: Refine the comments in record_sg_llc_stats().(Peter Zijlstra). 
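
[ Editor's sketch, not part of this patch: the util_avg/capacity values
  recorded here are only written in this patch; the policy that reads
  them comes later in the series. A hypothetical consumer sitting next
  to get_llc_stats() in kernel/sched/fair.c might look roughly like the
  snippet below. dst_llc_has_headroom() and the hard-coded 50% threshold
  (matching the llc_overload_pct default introduced in a later patch)
  are assumptions for illustration only. ]

static bool dst_llc_has_headroom(int dst_cpu)
{
	unsigned long util, cap;

	/*
	 * get_llc_stats() dereferences sd_llc_shared, so the caller must
	 * hold rcu_read_lock(), as the load-balancing paths do. The values
	 * are filled in by record_sg_llc_stats() during periodic balance.
	 */
	if (!get_llc_stats(dst_cpu, &util, &cap))
		return true;	/* no stats yet: don't restrict placement */

	/* Treat the destination LLC as busy once it passes ~50% utilization. */
	return util * 100 < cap * 50;
}
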
include/linux/sched/topology.h | 4 ++ kernel/sched/fair.c | 69 ++++++++++++++++++++++++++++++++++ 2 files changed, 73 insertions(+) diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index bbcfdf12aa6e..0ba4697d74ba 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -68,6 +68,10 @@ struct sched_domain_shared { atomic_t nr_busy_cpus; int has_idle_cores; int nr_idle_scan; +#ifdef CONFIG_SCHED_CACHE + unsigned long util_avg; + unsigned long capacity ____cacheline_aligned_in_smp; +#endif }; =20 struct sched_domain { diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index cb82f558dc5b..b9f336300f14 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9622,6 +9622,29 @@ static inline int task_is_ineligible_on_dst_cpu(stru= ct task_struct *p, int dest_ return 0; } =20 +#ifdef CONFIG_SCHED_CACHE +/* Called from load balancing paths with rcu_read_lock held */ +static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util, + unsigned long *cap) +{ + struct sched_domain_shared *sd_share; + + sd_share =3D rcu_dereference(per_cpu(sd_llc_shared, cpu)); + if (!sd_share) + return false; + + *util =3D READ_ONCE(sd_share->util_avg); + *cap =3D READ_ONCE(sd_share->capacity); + + return true; +} +#else +static inline bool get_llc_stats(int cpu, unsigned long *util, + unsigned long *cap) +{ + return false; +} +#endif /* * can_migrate_task - may task p from runqueue rq be migrated to this_cpu? */ @@ -10592,6 +10615,51 @@ sched_reduced_capacity(struct rq *rq, struct sched= _domain *sd) return check_cpu_capacity(rq, sd); } =20 +#ifdef CONFIG_SCHED_CACHE +/* + * Record the statistics for this scheduler group for later + * use. These values guide load balancing on aggregating tasks + * to a LLC. + */ +static void record_sg_llc_stats(struct lb_env *env, + struct sg_lb_stats *sgs, + struct sched_group *group) +{ + struct sched_domain_shared *sd_share; + + if (!sched_cache_enabled() || env->idle =3D=3D CPU_NEWLY_IDLE) + return; + + /* Only care about sched domain spanning multiple LLCs */ + if (env->sd->child !=3D rcu_dereference(per_cpu(sd_llc, env->dst_cpu))) + return; + + /* + * At this point we know this group spans a LLC domain. + * Record the statistic of this group in its corresponding + * shared LLC domain. + * Note: sd_share cannot be obtained via sd->child->shared, because + * it refers to the domain that covers the local group, while + * sd_share could represent any of the LLC group. + */ + sd_share =3D rcu_dereference(per_cpu(sd_llc_shared, + cpumask_first(sched_group_span(group)))); + if (!sd_share) + return; + + if (READ_ONCE(sd_share->util_avg) !=3D sgs->group_util) + WRITE_ONCE(sd_share->util_avg, sgs->group_util); + + if (unlikely(READ_ONCE(sd_share->capacity) !=3D sgs->group_capacity)) + WRITE_ONCE(sd_share->capacity, sgs->group_capacity); +} +#else +static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_st= ats *sgs, + struct sched_group *group) +{ +} +#endif + /** * update_sg_lb_stats - Update sched_group's statistics for load balancing. * @env: The load balancing environment. 
@@ -10681,6 +10749,7 @@ static inline void update_sg_lb_stats(struct lb_env= *env, =20 sgs->group_type =3D group_classify(env->sd->imbalance_pct, group, sgs); =20 + record_sg_llc_stats(env, sgs, group); /* Computing avg_load makes sense only when group is overloaded */ if (sgs->group_type =3D=3D group_overloaded) sgs->avg_load =3D (sgs->group_load * SCHED_CAPACITY_SCALE) / --=20 2.32.0 From nobody Fri Dec 19 19:37:42 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2F2692EBDDE for ; Wed, 3 Dec 2025 23:01:23 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802885; cv=none; b=fboZZkFKPl6gqpHDoF2b6zbyblNamhVu+FcjT54t3oU8vxsb1XXezAqbDtyJgvQY5nilFQH3AKBGOohsQ/SQ3tX2mRk+BSCtjeqUEVqOw4w0dDc2wtmgFtlHa6V/L30IDsIjeiViMUZM4y4AiA82fvOBsu4+NJQNRAWoaUu83no= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802885; c=relaxed/simple; bh=TakEXE1LpDhxRe/Kb7GWIrlVFYabIDFNwz7qIMJeAzA=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=lWQIStOFn4iy99stFlHV/qSBEi3k7WL/GF8q0g3QeYxyAInDLMtgRyHdyj4lgpwV4+hcrGelSaLn9GQ314YsxP62kdg4igNnwsJ5I/UGLtE/m0W5/zOTgeJYpf5nNjxi042Eu8UJR3sDuMQmXljn/+2COvTOKDQkes+q8dJg4fs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=ZJw0lk7W; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="ZJw0lk7W" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802883; x=1796338883; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=TakEXE1LpDhxRe/Kb7GWIrlVFYabIDFNwz7qIMJeAzA=; b=ZJw0lk7W6RTQQr7pUyzeAA7+tFRn5rwcdkgzS49IJ6otxSXwAzwDWZIh 72+xVH8b/09ZAgA4A4sjEOCcav+jAPzfD2L3N7AxSkmW/F8BHhBoUD3JQ QbRstLbqNMnMwfrcQ+qBeU1Q3VwTeXm0rmxciTrI2u6z3GCHX79/Bxc9Y tid45au2Oifch9e3/2xq9ljpUEYKZAVIVVPqiF3n86ssLv/OdDy75IUHo 67RTdQeGc20OckklfmpRjpvC7cCT1mZKRlid3w67UBs6EEbQgCGzqXjOi NdatFPNJvaFIWKoBtqpyQd9yFecmVzXENUGCr745w3Jqa3QUeXJGyO4fH g==; X-CSE-ConnectionGUID: /fs42l2aRamlkF2vGhwf9A== X-CSE-MsgGUID: 2SoBy6EYTSqmAsupZBGzKg== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136230" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136230" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:22 -0800 X-CSE-ConnectionGUID: O845bRyGQ8Wd6MpeSkE96g== X-CSE-MsgGUID: Cz70j2EQQ0GHWMf5pEgBBw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763752" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:22 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . 
Shenoy" , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 03/23] sched/cache: Introduce helper functions to enforce LLC migration policy Date: Wed, 3 Dec 2025 15:07:22 -0800 Message-Id: <12e90c8c26c690b40e48cc1e03c785f2f99fafa8.1764801860.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu Cache-aware scheduling aggregates threads onto their preferred LLC, mainly through load balancing. When the preferred LLC becomes saturated, more threads are still placed there, increasing latency. A mechanism is needed to limit aggregation so that the preferred LLC does not become overloaded. Introduce helper functions can_migrate_llc() and can_migrate_llc_task() to enforce the LLC migration policy: 1. Aggregate a task to its preferred LLC if both source and destination LLCs are not too busy (<50% utilization), or if doing so will not leave the preferred LLC much more imbalanced than the non-preferred one (>20% utilization difference, similar to imbalance_pct of the LLC domain). 2. Allow moving a task from overloaded preferred LLC to a non preferred LLC if this will not cause the non preferred LLC to become too imbalanced to cause a later migration back. 3. If both LLCs are too busy, let the generic load balance to spread the tasks. Further (hysteresis)action could be taken in the future to prevent tasks from being migrated into and out of the preferred LLC frequently (back and forth): the threshold for migrating a task out of its preferred LLC should be higher than that for migrating it into the LLC. Since aggregation tends to make the preferred LLC busier than others, the imbalance tolerance is controlled by llc_imb_pct. If set to 0, tasks may still aggregate to the preferred LLC as long as it is not more utilized than the source LLC, preserving the preference. Co-developed-by: Tim Chen Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: No change. kernel/sched/fair.c | 153 +++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 5 ++ 2 files changed, 158 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b9f336300f14..710ed9943d27 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1205,6 +1205,9 @@ static s64 update_se(struct rq *rq, struct sched_enti= ty *se) #define EPOCH_PERIOD (HZ / 100) /* 10 ms */ #define EPOCH_LLC_AFFINITY_TIMEOUT 5 /* 50 ms */ =20 +__read_mostly unsigned int llc_overload_pct =3D 50; +__read_mostly unsigned int llc_imb_pct =3D 20; + static int llc_id(int cpu) { if (cpu < 0) @@ -9623,6 +9626,27 @@ static inline int task_is_ineligible_on_dst_cpu(stru= ct task_struct *p, int dest_ } =20 #ifdef CONFIG_SCHED_CACHE +/* + * The margin used when comparing LLC utilization with CPU capacity. + * Parameter llc_overload_pct determines the LLC load level where + * active LLC aggregation is done. + * Derived from fits_capacity(). 
+ * + * (default: ~50%) + */ +#define fits_llc_capacity(util, max) \ + ((util) * 100 < (max) * llc_overload_pct) + +/* + * The margin used when comparing utilization. + * is 'util1' noticeably greater than 'util2' + * Derived from capacity_greater(). + * Bias is in perentage. + */ +/* Allows dst util to be bigger than src util by up to bias percent */ +#define util_greater(util1, util2) \ + ((util1) * 100 > (util2) * (100 + llc_imb_pct)) + /* Called from load balancing paths with rcu_read_lock held */ static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util, unsigned long *cap) @@ -9638,6 +9662,135 @@ static __maybe_unused bool get_llc_stats(int cpu, u= nsigned long *util, =20 return true; } + +/* + * Decision matrix according to the LLC utilization. To + * decide whether we can do task aggregation across LLC. + * + * By default, 50% is the threshold to treat the LLC as busy, + * and 20% is the utilization imbalance percentage to decide + * if the preferred LLC is busier than the non-preferred LLC. + * The hysteresis is used to avoid task bouncing between the + * preferred LLC and the non-preferred LLC. + * + * 1. moving towards the preferred LLC, dst is the preferred + * LLC, src is not. + * + * src \ dst 30% 40% 50% 60% + * 30% Y Y Y N + * 40% Y Y Y Y + * 50% Y Y G G + * 60% Y Y G G + * + * 2. moving out of the preferred LLC, src is the preferred + * LLC, dst is not: + * + * src \ dst 30% 40% 50% 60% + * 30% N N N N + * 40% N N N N + * 50% N N G G + * 60% Y N G G + * + * src : src_util + * dst : dst_util + * Y : Yes, migrate + * N : No, do not migrate + * G : let the Generic load balance to even the load. + * + * The intention is that if both LLCs are quite busy, cache aware + * load balance should not be performed, and generic load balance + * should take effect. However, if one is busy and the other is not, + * the preferred LLC capacity(50%) and imbalance criteria(20%) should + * be considered to determine whether LLC aggregation should be + * performed to bias the load towards the preferred LLC. + */ + +/* migration decision, 3 states are orthogonal. */ +enum llc_mig { + mig_forbid =3D 0, /* N: Don't migrate task, respect LLC preference */ + mig_llc, /* Y: Do LLC preference based migration */ + mig_unrestricted /* G: Don't restrict generic load balance migration */ +}; + +/* + * Check if task can be moved from the source LLC to the + * destination LLC without breaking cache aware preferrence. + * src_cpu and dst_cpu are arbitrary CPUs within the source + * and destination LLCs, respectively. + */ +static enum llc_mig can_migrate_llc(int src_cpu, int dst_cpu, + unsigned long tsk_util, + bool to_pref) +{ + unsigned long src_util, dst_util, src_cap, dst_cap; + + if (!get_llc_stats(src_cpu, &src_util, &src_cap) || + !get_llc_stats(dst_cpu, &dst_util, &dst_cap)) + return mig_unrestricted; + + if (!fits_llc_capacity(dst_util, dst_cap) && + !fits_llc_capacity(src_util, src_cap)) + return mig_unrestricted; + + src_util =3D src_util < tsk_util ? 0 : src_util - tsk_util; + dst_util =3D dst_util + tsk_util; + if (to_pref) { + /* + * llc_imb_pct is the imbalance allowed between + * preferred LLC and non-preferred LLC. + * Don't migrate if we will get preferred LLC too + * heavily loaded and if the dest is much busier + * than the src, in which case migration will + * increase the imbalance too much. 
+ */ + if (!fits_llc_capacity(dst_util, dst_cap) && + util_greater(dst_util, src_util)) + return mig_forbid; + } else { + /* + * Don't migrate if we will leave preferred LLC + * too idle, or if this migration leads to the + * non-preferred LLC falls within sysctl_aggr_imb percent + * of preferred LLC, leading to migration again + * back to preferred LLC. + */ + if (fits_llc_capacity(src_util, src_cap) || + !util_greater(src_util, dst_util)) + return mig_forbid; + } + return mig_llc; +} + +/* + * Check if task p can migrate from source LLC to + * destination LLC in terms of cache aware load balance. + */ +static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int d= st_cpu, + struct task_struct *p) +{ + struct mm_struct *mm; + bool to_pref; + int cpu; + + mm =3D p->mm; + if (!mm) + return mig_unrestricted; + + cpu =3D mm->mm_sched_cpu; + if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu)) + return mig_unrestricted; + + if (cpus_share_cache(dst_cpu, cpu)) + to_pref =3D true; + else if (cpus_share_cache(src_cpu, cpu)) + to_pref =3D false; + else + return mig_unrestricted; + + return can_migrate_llc(src_cpu, dst_cpu, + task_util(p), to_pref); +} + #else static inline bool get_llc_stats(int cpu, unsigned long *util, unsigned long *cap) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 84118b522f22..bf72c5bab506 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2828,6 +2828,11 @@ extern unsigned int sysctl_numa_balancing_scan_perio= d_max; extern unsigned int sysctl_numa_balancing_scan_size; extern unsigned int sysctl_numa_balancing_hot_threshold; =20 +#ifdef CONFIG_SCHED_CACHE +extern unsigned int llc_overload_pct; +extern unsigned int llc_imb_pct; +#endif + #ifdef CONFIG_SCHED_HRTICK =20 /* --=20 2.32.0 From nobody Fri Dec 19 19:37:42 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 55E9C2EC08D for ; Wed, 3 Dec 2025 23:01:24 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802886; cv=none; b=d3VrPdnnjHo1v15INzZi2Be9GCCRZHIzY8RvdjoDE/lVfQN7C6RgefM63jeAgMs+Ej4xBAgNM48bikZgcfBK97s516BGyLXX1Rbvhsn/lxdjOTLJb7/BzUSsXmqizKiXSV4Q40vVu+4KUJUTuTrw0EcRJX7axQAupxl66/Njl7g= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802886; c=relaxed/simple; bh=/WXShEpYiDAFPDra61vUPdbNcgE+VqMlav+UUM59jU0=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=sp6UO1OW6Q3DioPA4TyMAxm2w7jZEWfXn+BecCi+DY63bhyHNOAdo2gxE9qPcZ4H/AG5K6vG0sVgNdh5TPmn2YDZ1M3oPRXJYAPeKE66XGC3smKX35V4ctG4LeLd8SIPZYPGBwl8SDEjENvTH1Cw9AGh2YoAZb6Q6CfS4bRt+vY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=mJ6yn4qm; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="mJ6yn4qm" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; 
s=Intel; t=1764802884; x=1796338884; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=/WXShEpYiDAFPDra61vUPdbNcgE+VqMlav+UUM59jU0=; b=mJ6yn4qmNCzGpuUPMZdx+lsUqY/Y8q397TD5tze5hB735PCmFim3TtR3 Eh74z+kUDoOPtNaJnMct+g67IgKwnq6+WYRbc+f3oEEw9Wg1Gcg9yN7oU vI5Oubm8s7zVFVo1CwCylUT7AAgUyeA+NaPz/BoikrttCBobaJqnnubeC HmGkKxv21UFMqlb7bdh2Dv1ZUBuQd/5iPTCr2He8Z4My1BxTJHc0KlROt IrrMfarEIQ6kjL275GsASGznmrL05FEBJGY2at3hHLlbpnBR+lPPkEK0Y B/H+e/fK9u8hElcLfWPp6Axh3PPWmX2TiXZI/s6f1Be/ZF/FgJXPpYRSc Q==; X-CSE-ConnectionGUID: G2tkFvPIT6SY1+ZRXOxXBw== X-CSE-MsgGUID: uYGT49/IQCatA5I80r2gog== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136249" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136249" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:23 -0800 X-CSE-ConnectionGUID: uiDsRffYTdabgXsA6tZowg== X-CSE-MsgGUID: ym+MS4XuQPSAYB5T1Atjpw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763756" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:23 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 04/23] sched/cache: Make LLC id continuous Date: Wed, 3 Dec 2025 15:07:23 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Introduce an index mapping between CPUs and their LLCs. This provides a continuous per LLC index needed for cache-aware load balancing in later patches. The existing per_cpu llc_id usually points to the first CPU of the LLC domain, which is sparse and unsuitable as an array index. Using llc_id directly would waste memory. With the new mapping, CPUs in the same LLC share a continuous id: per_cpu(llc_id, CPU=3D0...15) =3D 0 per_cpu(llc_id, CPU=3D16...31) =3D 1 per_cpu(llc_id, CPU=3D32...47) =3D 2 ... Co-developed-by: Chen Yu Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: Convert the static LLC id to be allocated sequentially as LLCs are discovered, and replace the old sd_llc_id. 
(Peter Zijlstra) kernel/sched/fair.c | 9 ++++++- kernel/sched/sched.h | 1 + kernel/sched/topology.c | 60 +++++++++++++++++++++++++++++++++++++++-- 3 files changed, 67 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 710ed9943d27..0a3918269906 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1210,10 +1210,17 @@ __read_mostly unsigned int llc_imb_pct = =3D 20; =20 static int llc_id(int cpu) { + int llc; + if (cpu < 0) return -1; =20 - return per_cpu(sd_llc_id, cpu); + llc =3D per_cpu(sd_llc_id, cpu); + /* avoid race with cpu hotplug */ + if (unlikely(llc >=3D max_llcs)) + return -1; + + return llc; } =20 void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_s= ched) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index bf72c5bab506..728737641847 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2075,6 +2075,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_= cpucapacity); =20 extern struct static_key_false sched_asym_cpucapacity; extern struct static_key_false sched_cluster_active; +extern int max_llcs; =20 static __always_inline bool sched_asym_cpucap_active(void) { diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 444bdfdab731..f25d950ab015 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -17,6 +17,8 @@ void sched_domains_mutex_unlock(void) mutex_unlock(&sched_domains_mutex); } =20 +int max_llcs; + /* Protected by sched_domains_mutex: */ static cpumask_var_t sched_domains_tmpmask; static cpumask_var_t sched_domains_tmpmask2; @@ -668,6 +670,55 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cp= ucapacity); DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity); DEFINE_STATIC_KEY_FALSE(sched_cluster_active); =20 +/* + * Assign continuous llc id for the CPU, and return + * the assigned llc id. + */ +static int update_llc_id(struct sched_domain *sd, + int cpu) +{ + int id =3D per_cpu(sd_llc_id, cpu), i; + + if (id >=3D 0) + return id; + + if (sd) { + /* Look for any assigned id and reuse it.*/ + for_each_cpu(i, sched_domain_span(sd)) { + id =3D per_cpu(sd_llc_id, i); + + if (id >=3D 0) { + per_cpu(sd_llc_id, cpu) =3D id; + return id; + } + } + } + + /* + * When 1. there is no id assigned to this LLC domain, + * or 2. the sd is NULL, we reach here. + * Consider the following scenario, + * CPU0~CPU95 are in the node0, CPU96~CPU191 are + * in the node1. During bootup, maxcpus=3D96 is + * appended. + * case 1: When running cpu_attach_domain(CPU24) + * during boot up, CPU24 is the first CPU in its + * non-NULL LLC domain. However, + * its corresponding llc id has not been assigned yet. + * + * case 2: After boot up, the CPU100 is brought up + * via sysfs manually. As a result, CPU100 has only a + * Numa domain attached, because CPU100 is the only CPU + * of a sched domain, all its bottom domains are degenerated. + * The LLC domain pointer sd is NULL for CPU100. + * + * For both cases, we want to increase the number of LLCs. 
+ */ + per_cpu(sd_llc_id, cpu) =3D max_llcs++; + + return per_cpu(sd_llc_id, cpu); +} + static void update_top_cache_domain(int cpu) { struct sched_domain_shared *sds =3D NULL; @@ -677,14 +728,13 @@ static void update_top_cache_domain(int cpu) =20 sd =3D highest_flag_domain(cpu, SD_SHARE_LLC); if (sd) { - id =3D cpumask_first(sched_domain_span(sd)); size =3D cpumask_weight(sched_domain_span(sd)); sds =3D sd->shared; } =20 rcu_assign_pointer(per_cpu(sd_llc, cpu), sd); per_cpu(sd_llc_size, cpu) =3D size; - per_cpu(sd_llc_id, cpu) =3D id; + id =3D update_llc_id(sd, cpu); rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds); =20 sd =3D lowest_flag_domain(cpu, SD_CLUSTER); @@ -2488,6 +2538,12 @@ build_sched_domains(const struct cpumask *cpu_map, s= truct sched_domain_attr *att bool has_asym =3D false; bool has_cluster =3D false; =20 + /* first scan of LLCs */ + if (!max_llcs) { + for_each_possible_cpu(i) + per_cpu(sd_llc_id, i) =3D -1; + } + if (WARN_ON(cpumask_empty(cpu_map))) goto error; =20 --=20 2.32.0 From nobody Fri Dec 19 19:37:42 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 679562ECEBB for ; Wed, 3 Dec 2025 23:01:26 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802888; cv=none; b=ETXMSycIjg3hW2uD7ktvuDRCwlm80jzWlfuybxMLSJjuPv1gOLZC1i6pxE62EG9+cDFAU1hLySS0z9EjoSW7h+IC9WTpkMIZz2geJs1QP3R/eObNqU3OG+yETt/G54TGksleKQ7hmlJH6AIkTyDQ9XdCc+AMJOzQkCsvN6AteuA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802888; c=relaxed/simple; bh=0AOJ8UhIDlWuve34OwSELAi4hyIDL68J1uZ46Rj5j/U=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=oT6l76w5OE/CgwV2buKuyAjl0MI2Q/KFcNiA5tSmBm5YfGauRJZvP4km+gtrjR5EEwXVgaCsan/LhKN6+lL1MozMs4acvCaZOIR7MI0TH1a6DN/iL60iGgK73IOwTgFjrIfIZLKuBBoFD14Z4gbqwWYyV8VrRWfEVNe6RZksId4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=MYgWTb60; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="MYgWTb60" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802886; x=1796338886; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=0AOJ8UhIDlWuve34OwSELAi4hyIDL68J1uZ46Rj5j/U=; b=MYgWTb60T6yG49rQ3nLnjAfGEf6N3B3x0R1ujoF4MP+f6thBTMMmV5A2 6gtXPzCButviIBBCpY7AZSw3brie2XhnzEv9X/ke/XBPmw9iwTMQXM9o0 iuW5LJjdLixT+ECza7WcFjH4T9QTfvwhG/w9TZhOFFXAm15dszIkONvBa SXqv+2sjbXByYYFdX59mzr/UJBdZJP29/Qsoq52Bq39LKfBUjAIOaxdni O3Dd1ftGoYiiVuFKxIPrD6KHkaSzbffy0qzla2yFfiBHwoJt7cDfE6IuV V+N5yhbYcGH4NZwhO7yAb7il3S4WiOKkWeUjmgInRdyyz/X833IZzWB7j A==; X-CSE-ConnectionGUID: kjwyz+xTQ4WR9aHYObBogw== X-CSE-MsgGUID: zaCpH7h5Qxu4sNwXX6Krdg== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136266" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; 
d="scan'208";a="77136266" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:25 -0800 X-CSE-ConnectionGUID: alNeGVnjSJOwjLHIJ4OuIg== X-CSE-MsgGUID: 5cfoA8m6Su+MIcnd+RQoGg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763763" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:25 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 05/23] sched/cache: Assign preferred LLC ID to processes Date: Wed, 3 Dec 2025 15:07:24 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" With cache-aware scheduling enabled, each task is assigned a preferred LLC ID. This allows quick identification of the LLC domain where the task prefers to run, similar to numa_preferred_nid in NUMA balancing. Signed-off-by: Tim Chen --- Notes: v1->v2: Align preferred LLC with NUMA balancing's preferred node. include/linux/sched.h | 1 + init/init_task.c | 3 +++ kernel/sched/fair.c | 18 ++++++++++++++++++ 3 files changed, 22 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 278b529c91df..1ad46220cd04 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1408,6 +1408,7 @@ struct task_struct { =20 #ifdef CONFIG_SCHED_CACHE struct callback_head cache_work; + int preferred_llc; #endif =20 #ifdef CONFIG_RSEQ diff --git a/init/init_task.c b/init/init_task.c index a55e2189206f..44bae72b5b7d 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -191,6 +191,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = =3D { .numa_group =3D NULL, .numa_faults =3D NULL, #endif +#ifdef CONFIG_SCHED_CACHE + .preferred_llc =3D -1, +#endif #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS) .kasan_depth =3D 1, #endif diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0a3918269906..10cec83f65d5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1300,6 +1300,7 @@ void account_mm_sched(struct rq *rq, struct task_stru= ct *p, s64 delta_exec) struct mm_struct *mm =3D p->mm; struct mm_sched *pcpu_sched; unsigned long epoch; + int mm_sched_llc =3D -1; =20 if (!sched_cache_enabled()) return; @@ -1330,6 +1331,23 @@ void account_mm_sched(struct rq *rq, struct task_str= uct *p, s64 delta_exec) if (mm->mm_sched_cpu !=3D -1) mm->mm_sched_cpu =3D -1; } + + if (mm->mm_sched_cpu !=3D -1) { + mm_sched_llc =3D llc_id(mm->mm_sched_cpu); + +#ifdef CONFIG_NUMA_BALANCING + /* + * Don't assign preferred LLC if it + * conflicts with NUMA balancing. 
+ */ + if (p->numa_preferred_nid >=3D 0 && + cpu_to_node(mm->mm_sched_cpu) !=3D p->numa_preferred_nid) + mm_sched_llc =3D -1; +#endif + } + + if (p->preferred_llc !=3D mm_sched_llc) + p->preferred_llc =3D mm_sched_llc; } =20 static void task_tick_cache(struct rq *rq, struct task_struct *p) --=20 2.32.0 From nobody Fri Dec 19 19:37:42 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C93962EF652 for ; Wed, 3 Dec 2025 23:01:27 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802890; cv=none; b=EDsiu7g2BtXvvoS9BKwrirW/B8ldDhmwGPx+cdJzoxBtklhxCuicf7XZFi+5IO9eicj+U0q988drhlH0OJjM+IwUt0amTGbw3mfM6d+6WZDelOH8Kc3PIbWBuITzHpbg31UVRdkj3UEviuqp+uvpMTrssPknIugATiCNu3Bm+08= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802890; c=relaxed/simple; bh=VfUUqC84e+k4dM9OCiHr0qSll3wkyw96Z2hiwhlrd+g=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=bMgyCWF3/XMpBtns9xgAQbuvJYsQoxOLy5qU1v3Ure2zyH7eaHG4ZLbKyqgBn1NINjkU2O0RPcPn7whkPdiyLRm36oluEWQ4viCDhC3YxOj/EZYMjqKw4E92UmhMBk5j0NYcW2RvXkMIEQxCZjUg4qUDiMfwP1eraXWWdJgmvkk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=HkSBtET5; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="HkSBtET5" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802887; x=1796338887; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=VfUUqC84e+k4dM9OCiHr0qSll3wkyw96Z2hiwhlrd+g=; b=HkSBtET5tJuyrYVLfwF6tgrJB3jPRTx01PEveXBF1wIqsiOJxkXhzAOm sC1smgQCW8wgJyR4E1u9VSEyU2s5OeGIeEuC988/p/oKmWX8sR4t5I1+Q tI0jgAIHPovP+AIphgRpysIDP7uveWJciGMii/zPUANlnHxP4W7VRq2eJ sBFqpGeZy1Ve8fewNRoxQswiP1fA+sTe9iwHVjtYcP+1v4kzgt4NxJNt7 wXwMA6vcMf7L8X5pDnsHkNo+K4j1B34n8SEcNJu9+4em9z3ghkY3MGzod zaVcGH6lY2mH/znHiuVlkKaau6etkJB5XXnU6Zdt6/ZSkCkDGyN6SoMYU A==; X-CSE-ConnectionGUID: k1qC8aFmROqogX9M8c8KUg== X-CSE-MsgGUID: rOmGsn1SSNWPs0ITFZygdw== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136288" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136288" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:27 -0800 X-CSE-ConnectionGUID: cvzDuwr8RH6q0DcIt04zOw== X-CSE-MsgGUID: uxGA3PlMTN6liJURHzFh/g== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763775" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:26 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . 
Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 06/23] sched/cache: Track LLC-preferred tasks per runqueue Date: Wed, 3 Dec 2025 15:07:25 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" For each runqueue, track the number of tasks with an LLC preference and how many of them are running on their preferred LLC. This mirrors nr_numa_running and nr_preferred_running for NUMA balancing, and will be used by cache-aware load balancing in later patches. Signed-off-by: Tim Chen --- Notes: v1->v2: Invoke task_of() once and reuse its result afterwards. (Peter Zijlstra) Remove hacky reset_llc_stats() and introduce sched_llc_active f= lag to properly pair enqueue/dequeue statistics update (Peter Zijls= tra, K Prateek Nayak) include/linux/sched.h | 2 ++ init/init_task.c | 1 + kernel/sched/core.c | 5 ++++ kernel/sched/fair.c | 60 ++++++++++++++++++++++++++++++++++++++++--- kernel/sched/sched.h | 6 +++++ 5 files changed, 71 insertions(+), 3 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 1ad46220cd04..466ba8b7398c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1408,6 +1408,8 @@ struct task_struct { =20 #ifdef CONFIG_SCHED_CACHE struct callback_head cache_work; + /*the p is currently refcounted in a rq's preferred llc stats*/ + bool sched_llc_active; int preferred_llc; #endif =20 diff --git a/init/init_task.c b/init/init_task.c index 44bae72b5b7d..ee78837b0aa2 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -192,6 +192,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = =3D { .numa_faults =3D NULL, #endif #ifdef CONFIG_SCHED_CACHE + .sched_llc_active =3D false, .preferred_llc =3D -1, #endif #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e8bdf03a4b7f..48626c81ba8e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -531,6 +531,11 @@ void __trace_set_current_state(int state_value) } EXPORT_SYMBOL(__trace_set_current_state); =20 +int task_llc(const struct task_struct *p) +{ + return per_cpu(sd_llc_id, task_cpu(p)); +} + /* * Serialization rules: * diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 10cec83f65d5..d46a70a9d9fb 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1223,6 +1223,43 @@ static int llc_id(int cpu) return llc; } =20 +static void account_llc_enqueue(struct rq *rq, struct task_struct *p) +{ + int pref_llc; + + if (!sched_cache_enabled()) + return; + + pref_llc =3D p->preferred_llc; + if (pref_llc < 0) + return; + + rq->nr_llc_running++; + rq->nr_pref_llc_running +=3D (pref_llc =3D=3D task_llc(p)); + p->sched_llc_active =3D true; +} + +static void account_llc_dequeue(struct rq *rq, struct task_struct *p) +{ + int pref_llc; + + /* + * Borrow the uc_se->active from uclamp_rq_inc_id(), + * uclamp_rq_dec_id() to avoid the unbalanced calculation + * of rq statistics. 
+ */ + if (unlikely(!p->sched_llc_active)) + return; + + pref_llc =3D p->preferred_llc; + if (pref_llc < 0) + return; + + rq->nr_llc_running--; + rq->nr_pref_llc_running -=3D (pref_llc =3D=3D task_llc(p)); + p->sched_llc_active =3D false; +} + void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_s= ched) { unsigned long epoch; @@ -1294,6 +1331,8 @@ static unsigned long __no_profile fraction_mm_sched(s= truct rq *rq, struct mm_sch return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1); } =20 +static unsigned int task_running_on_cpu(int cpu, struct task_struct *p); + static inline void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec) { @@ -1346,8 +1385,13 @@ void account_mm_sched(struct rq *rq, struct task_str= uct *p, s64 delta_exec) #endif } =20 - if (p->preferred_llc !=3D mm_sched_llc) + /* task not on rq accounted later in account_entity_enqueue() */ + if (task_running_on_cpu(rq->cpu, p) && + p->preferred_llc !=3D mm_sched_llc) { + account_llc_dequeue(rq, p); p->preferred_llc =3D mm_sched_llc; + account_llc_enqueue(rq, p); + } } =20 static void task_tick_cache(struct rq *rq, struct task_struct *p) @@ -1475,6 +1519,10 @@ void init_sched_mm(struct task_struct *p) { } =20 static void task_tick_cache(struct rq *rq, struct task_struct *p) { } =20 +static void account_llc_enqueue(struct rq *rq, struct task_struct *p) {} + +static void account_llc_dequeue(struct rq *rq, struct task_struct *p) {} + #endif =20 /* @@ -3965,9 +4013,11 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct= sched_entity *se) { update_load_add(&cfs_rq->load, se->load.weight); if (entity_is_task(se)) { + struct task_struct *p =3D task_of(se); struct rq *rq =3D rq_of(cfs_rq); =20 - account_numa_enqueue(rq, task_of(se)); + account_numa_enqueue(rq, p); + account_llc_enqueue(rq, p); list_add(&se->group_node, &rq->cfs_tasks); } cfs_rq->nr_queued++; @@ -3978,7 +4028,11 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct= sched_entity *se) { update_load_sub(&cfs_rq->load, se->load.weight); if (entity_is_task(se)) { - account_numa_dequeue(rq_of(cfs_rq), task_of(se)); + struct task_struct *p =3D task_of(se); + struct rq *rq =3D rq_of(cfs_rq); + + account_numa_dequeue(rq, p); + account_llc_dequeue(rq, p); list_del_init(&se->group_node); } cfs_rq->nr_queued--; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 728737641847..ee8b70647835 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1126,6 +1126,10 @@ struct rq { unsigned int nr_preferred_running; unsigned int numa_migrate_on; #endif +#ifdef CONFIG_SCHED_CACHE + unsigned int nr_pref_llc_running; + unsigned int nr_llc_running; +#endif #ifdef CONFIG_NO_HZ_COMMON unsigned long last_blocked_load_update_tick; unsigned int has_blocked_load; @@ -1980,6 +1984,8 @@ init_numa_balancing(u64 clone_flags, struct task_stru= ct *p) =20 #endif /* !CONFIG_NUMA_BALANCING */ =20 +int task_llc(const struct task_struct *p); + static inline void queue_balance_callback(struct rq *rq, struct balance_callback *head, --=20 2.32.0 From nobody Fri Dec 19 19:37:42 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7F2A72F0696 for ; Wed, 3 Dec 2025 23:01:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802892; 
cv=none; b=JuI4HP7FPjUZRRvIF57U5a+nyKFVaejSLBjOwb2o4K+dyMy+TzvS6alNai1tmhDlx/F2kpTdrbKJxXsp0ye0xTv9vWh98FuHcXDXimNg3p+EZ0AClnIocNRkMFznzOXiGUgsNO6KJzOsOmRV7MqRji4PoMn2fV9YYulhopDCdW0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802892; c=relaxed/simple; bh=Kgfm8ZrVAem+cuIFSErLp11pWO+uaVSLeCf068dctEI=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=oWgrj/+vmbD4ydPoKoPApP5RMU0UBhjF4mxsiADMVL/t5AARqr//6C8rqPkshWdzhhrhMPF1AzqYud7ZATo+YBem2D9OjWwAWcvEU+adG0BNbDeKX0F/tFC7FpYkxBtH1K1PhGVx8OIwbNowGJZ5W0OZkvMWwyvk09t3vXbHMn4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=Ff+wBHml; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="Ff+wBHml" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802889; x=1796338889; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Kgfm8ZrVAem+cuIFSErLp11pWO+uaVSLeCf068dctEI=; b=Ff+wBHmlGI9Ls+hPQ/icfiRSQpZE9xFA2dMFUkAvN4HoLDq9rsPxZPeZ 8VRONCVnKKzfdp0/tx6ByohayUgQnukEiUM/5FG80edcOUwn8pLvcV6CD rsakyGnOPLHSStQkG1+f0q6DnjhqobEUdJaywwMsE54fftDticAbLprId 3bhB2AwAPJQjK37rs0/N96in+m4FjW7qil9FvPJrQKe2CXx6Vw8vc05XH UOnoKjT+4VoaXotKSh3uNxjPZTKFSxLyHcD1a3z71R7y9pyahaHenJnCZ 3UkyBEcsW2m1c1Cx8k4IAc/bj/uxMr+zGfxYNNEZL+3nmX/2zLcKYH7UG w==; X-CSE-ConnectionGUID: XVdRsMs0TMKO/Xjz8IoNBA== X-CSE-MsgGUID: 8lt8Jb1nTZqY7huRsHsSQw== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136318" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136318" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:28 -0800 X-CSE-ConnectionGUID: HuS/ZH/YT/Cjm+dSF3UiRw== X-CSE-MsgGUID: YDDbEJCdQwCGXEo7YEaNTQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763787" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:28 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . 
Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 07/23] sched/cache: Introduce per runqueue task LLC preference counter Date: Wed, 3 Dec 2025 15:07:26 -0800 Message-Id: <63091f7ca7bb473fbc176af86a87d27a07a6e149.1764801860.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Each runqueue is assigned an array where each element tracks the number of tasks preferring a given LLC, indexed from 0 to max_llcs - 1. For example, rq->nr_pref_llc[3] =3D 2 signifies that there are 2 tasks on this runqueue which prefer to run within LLC3. The load balancer can use this information to identify busy runqueues and migrate tasks to their preferred LLC domains. This array will be reallocated at runtime if the number of LLCs increases due to CPU hotplug. Only extending the buffer(rather than shrinking it) is supported to simplify the implementation. Introduce the buffer allocation mechanism, and the statistics will be calculated in the subsequent patch. Co-developed-by: Chen Yu Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: Remove static allocation of per runqueue LLC preference arrays. Allocate array size to the actual number of LLCs online. (Peter Zij= lstra, Madadi Vineeth Reddy) kernel/sched/core.c | 1 + kernel/sched/sched.h | 1 + kernel/sched/topology.c | 117 +++++++++++++++++++++++++++++++++++++++- 3 files changed, 118 insertions(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 48626c81ba8e..ce533dc485f5 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -8800,6 +8800,7 @@ void __init sched_init(void) #ifdef CONFIG_SCHED_CACHE raw_spin_lock_init(&rq->cpu_epoch_lock); rq->cpu_epoch_next =3D jiffies; + rq->nr_pref_llc =3D NULL; #endif =20 zalloc_cpumask_var_node(&rq->scratch_mask, GFP_KERNEL, cpu_to_node(i)); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index ee8b70647835..8f2a779825e4 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1129,6 +1129,7 @@ struct rq { #ifdef CONFIG_SCHED_CACHE unsigned int nr_pref_llc_running; unsigned int nr_llc_running; + unsigned int *nr_pref_llc; #endif #ifdef CONFIG_NO_HZ_COMMON unsigned long last_blocked_load_update_tick; diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index f25d950ab015..d583399fc6a1 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -17,8 +17,121 @@ void sched_domains_mutex_unlock(void) mutex_unlock(&sched_domains_mutex); } =20 +/* the number of max LLCs being detected */ +static int new_max_llcs; +/* the current number of max LLCs */ int max_llcs; =20 +#ifdef CONFIG_SCHED_CACHE + +static unsigned int *alloc_new_pref_llcs(unsigned int *old, unsigned int *= *gc) +{ + unsigned int *new =3D NULL; + + new =3D kcalloc(new_max_llcs, sizeof(unsigned int), + GFP_KERNEL | __GFP_NOWARN); + + if (!new) { + *gc =3D NULL; + } else { + /* + * Place old entry in garbage collector + * for later disposal. 
+ */ + *gc =3D old; + } + return new; +} + +static void populate_new_pref_llcs(unsigned int *old, unsigned int *new) +{ + int i; + + if (!old) + return; + + for (i =3D 0; i < max_llcs; i++) + new[i] =3D old[i]; +} + +static int resize_llc_pref(void) +{ + unsigned int *__percpu *tmp_llc_pref; + int i, ret =3D 0; + + if (new_max_llcs <=3D max_llcs) + return 0; + + /* + * Allocate temp percpu pointer for old llc_pref, + * which will be released after switching to the + * new buffer. + */ + tmp_llc_pref =3D alloc_percpu_noprof(unsigned int *); + if (!tmp_llc_pref) + return -ENOMEM; + + for_each_present_cpu(i) + *per_cpu_ptr(tmp_llc_pref, i) =3D NULL; + + /* + * Resize the per rq nr_pref_llc buffer and + * switch to this new buffer. + */ + for_each_present_cpu(i) { + struct rq_flags rf; + unsigned int *new; + struct rq *rq; + + rq =3D cpu_rq(i); + new =3D alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i= )); + if (!new) { + ret =3D -ENOMEM; + + goto release_old; + } + + /* + * Locking rq ensures that rq->nr_pref_llc values + * don't change with new task enqueue/dequeue + * when we repopulate the newly enlarged array. + */ + rq_lock_irqsave(rq, &rf); + populate_new_pref_llcs(rq->nr_pref_llc, new); + rq->nr_pref_llc =3D new; + rq_unlock_irqrestore(rq, &rf); + } + +release_old: + /* + * Load balance is done under rcu_lock. + * Wait for load balance before and during resizing to + * be done. They may refer to old nr_pref_llc[] + * that hasn't been resized. + */ + synchronize_rcu(); + for_each_present_cpu(i) + kfree(*per_cpu_ptr(tmp_llc_pref, i)); + + free_percpu(tmp_llc_pref); + + /* succeed and update */ + if (!ret) + max_llcs =3D new_max_llcs; + + return ret; +} + +#else + +static int resize_llc_pref(void) +{ + max_llcs =3D new_max_llcs; + return 0; +} + +#endif + /* Protected by sched_domains_mutex: */ static cpumask_var_t sched_domains_tmpmask; static cpumask_var_t sched_domains_tmpmask2; @@ -714,7 +827,7 @@ static int update_llc_id(struct sched_domain *sd, * * For both cases, we want to increase the number of LLCs. 
*/ - per_cpu(sd_llc_id, cpu) =3D max_llcs++; + per_cpu(sd_llc_id, cpu) =3D new_max_llcs++; =20 return per_cpu(sd_llc_id, cpu); } @@ -2674,6 +2787,8 @@ build_sched_domains(const struct cpumask *cpu_map, st= ruct sched_domain_attr *att if (has_cluster) static_branch_inc_cpuslocked(&sched_cluster_active); =20 + resize_llc_pref(); + if (rq && sched_debug_verbose) pr_info("root domain span: %*pbl\n", cpumask_pr_args(cpu_map)); =20 --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D24952F0C5B for ; Wed, 3 Dec 2025 23:01:30 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802893; cv=none; b=oh6ql8wRtQTKo8nnK9dUK9t3JsNVUN1SrqTTOLrpZpUDsIKZ+qt9qst5oOs9c5FDd2R9eecOFriCSP4q8iJw0WZIClfw/A2n3lz9QanZX0TndqedBRildmD/ptw2VXSsbXzzCrUFl3ehtEIBnQQqE0gyq5YyFY1waemEa1gZMq0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802893; c=relaxed/simple; bh=ubwbCrnLe+FpFs84fmQJ8NDFPPh85CKovnWcqS4HszM=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=Q4hmeOs7hnwqOE8JDGvxpGVeABvVS45aiDvLk6ZpSrPGuTfn+4YcfZc0AFuBMnvnutRPD41rCA1to3LTp3U/rg4Ky2sVe8bcd4xUTzxW+ljCc0tBYewYHhc60QRARoN5k0NGQJalWwDG5Ur5+u4g9f7uSgwIhh8HrXiFwlORSOs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=ngl+EBZ5; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="ngl+EBZ5" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802890; x=1796338890; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=ubwbCrnLe+FpFs84fmQJ8NDFPPh85CKovnWcqS4HszM=; b=ngl+EBZ5jKSYuF1GoScWtzUvUawQCvSqX6vXeypzCig51al5M6EFhEW6 6ZPkta/KDGc5tm3cLZAn+Q0r4sAGXevBcvNbEeEF94NWh0Q5o4Qi40yoE 6fENyQt6WsIYC5Biv3AXCHk/Ns+vA3D+5k8K971vxD5ci0G6jwAhua/Ip V4EYKsxzhnY36WL45Wqmck026Nhmf3XpLNt/wYGNgwSMFF7INI6pnMGxW qdO3IW9AZPldmpFj84igpzlIJMlsU2GHA5/5/K1uwnar4bbN3Va12Jz5l CXyXS2But8o6/1q/DIrjmb1ErBv9PahFCMwFzVlsm1m+7SCCYHQiWGEv4 A==; X-CSE-ConnectionGUID: 1JiC3BwvQxSVrDu/qUQ3kA== X-CSE-MsgGUID: lKtqqExhQ5+Xp1L40Y0AnQ== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136340" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136340" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:30 -0800 X-CSE-ConnectionGUID: lyR4BYigTY+QEoG74KEnKQ== X-CSE-MsgGUID: lE96dcq2TgepzkLPnZNnrg== X-Ironport-Invalid-End-Of-Message: True X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763795" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:30 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , 
"Gautham R . Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 08/23] sched/cache: Calculate the per runqueue task LLC preference Date: Wed, 3 Dec 2025 15:07:27 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Calculate the number of tasks' LLC preferences for each runqueue. This statistic is computed during task enqueue and dequeue operations, and is used by the cache-aware load balancing. Co-developed-by: Chen Yu Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: Split from previous patch for easier review. kernel/sched/fair.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d46a70a9d9fb..b0e87616e377 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1231,11 +1231,12 @@ static void account_llc_enqueue(struct rq *rq, stru= ct task_struct *p) return; =20 pref_llc =3D p->preferred_llc; - if (pref_llc < 0) + if (pref_llc < 0 || pref_llc >=3D max_llcs) return; =20 rq->nr_llc_running++; rq->nr_pref_llc_running +=3D (pref_llc =3D=3D task_llc(p)); + rq->nr_pref_llc[pref_llc]++; p->sched_llc_active =3D true; } =20 @@ -1252,11 +1253,12 @@ static void account_llc_dequeue(struct rq *rq, stru= ct task_struct *p) return; =20 pref_llc =3D p->preferred_llc; - if (pref_llc < 0) + if (pref_llc < 0 || pref_llc >=3D max_llcs) return; =20 rq->nr_llc_running--; rq->nr_pref_llc_running -=3D (pref_llc =3D=3D task_llc(p)); + rq->nr_pref_llc[pref_llc]--; p->sched_llc_active =3D false; } =20 --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 761A92F12BE for ; Wed, 3 Dec 2025 23:01:32 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802894; cv=none; b=VOvR4Yo5MT+v4vvHJHnJrL04tUMLwfYbb4+GQbWJ3QO13hC1zjlHArO6dzcuGLllayHXLBw43BKllYMjOKohjC7Fzd9T9m3hYmCRq3WLpZzHqcCQuO2JcQTdEeD/rjnDRhN1lGZeCfQEi5WHKdPb8iHSUPG9WfZsKEu6JozCWHQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802894; c=relaxed/simple; bh=Hwjod13ydyBeyAl1Bc0MaWee5egwZS7IehFiRUr+3EU=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=Qm3SflXuxBKuYopJgqhcvipXf7FPYSYSF15V5hLWMr9nUpsAfdv+d2spbB0P7Tw1LmX/zkoTpJ7guZJ5VbPuMzy9Baf9HL/h+ZfC7oU8NJtxgafnNNwl0O1u1CDaxlhc7yoqMW17JyUgVXekWAPj30g3bMDCDrz5uBQLCvlVneA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=cSsts8rq; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com 
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="cSsts8rq" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802892; x=1796338892; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Hwjod13ydyBeyAl1Bc0MaWee5egwZS7IehFiRUr+3EU=; b=cSsts8rq9lESYUplMXqyaf7fQNdZgkgjFqCazxZIqivu0ulnrxBxtLfr 2q49FeXJtEzQZUFodeAzsWSFeSbbR0eNrEPCzAiJg3hLVd3plskFuoc8R LSKLX41Wp9fMgp9Ou54k2TxPn+ZJpABPQDMRZBxyysFrDh3CB41EwtGEs RrfwNP72MRObV0Rpqk7QGgKlk2FmXjIY1nC71X0MFH6YEKKSRhWDNHOyK 9xcJGzOrMyQT5S0kQJJP+Yjr1dE5itsHoR0sqlWiS8N54X7izsEc5kZbZ a2UxxHPNluXsMUFiW8C3sWBY39nJzoHIE5rPFYFCFz7BLdiv2vnTIfuTx g==; X-CSE-ConnectionGUID: +jOJhU2XTqKlvSAAbvSNZg== X-CSE-MsgGUID: bWKw4Hx3R3mw3p0kqP8M9w== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136361" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136361" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:31 -0800 X-CSE-ConnectionGUID: fNxL8O0TTpG29riKu1HOzA== X-CSE-MsgGUID: yN1bkFJBSRe3C9XnLxsEJA== X-Ironport-Invalid-End-Of-Message: True X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763802" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:31 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 09/23] sched/cache: Count tasks prefering destination LLC in a sched group Date: Wed, 3 Dec 2025 15:07:28 -0800 Message-Id: <1eb6a231ec82b37483208983f0cf10eec823ec9d.1764801860.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" During LLC load balancing, tabulate the number of tasks on each runqueue that prefer the LLC contains the env->dst_cpu in a sched group. For example, consider a system with 4 LLC sched groups (LLC0 to LLC3) balancing towards LLC3. LLC0 has 3 tasks preferring LLC3, LLC1 has 2, and LLC2 has 1. LLC0, having the most tasks preferring LLC3, is selected as the busiest source to pick tasks from. Within a source LLC, the total number of tasks preferring a destination LLC is computed by summing counts across all CPUs in that LLC. For instance, if LLC0 has CPU0 with 2 tasks and CPU1 with 1 task preferring LLC3, the total for LLC0 is 3. These statistics allow the load balancer to choose tasks from source sched groups that best match their preferred LLCs. Signed-off-by: Tim Chen --- Notes: v1->v2: Convert nr_pref_llc array in sg_lb_stats to a single variable as only the dst LLC stat is needed. 
(K Prateek Nayak) kernel/sched/fair.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b0e87616e377..4d7803f69a74 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10445,6 +10445,9 @@ struct sg_lb_stats { unsigned int nr_numa_running; unsigned int nr_preferred_running; #endif +#ifdef CONFIG_SCHED_CACHE + unsigned int nr_pref_llc; +#endif }; =20 /* @@ -10912,6 +10915,9 @@ static inline void update_sg_lb_stats(struct lb_env= *env, { int i, nr_running, local_group, sd_flags =3D env->sd->flags; bool balancing_at_rd =3D !env->sd->parent; +#ifdef CONFIG_SCHED_CACHE + int dst_llc =3D llc_id(env->dst_cpu); +#endif =20 memset(sgs, 0, sizeof(*sgs)); =20 @@ -10932,6 +10938,12 @@ static inline void update_sg_lb_stats(struct lb_en= v *env, if (cpu_overutilized(i)) *sg_overutilized =3D 1; =20 +#ifdef CONFIG_SCHED_CACHE + if (sched_cache_enabled() && llc_id(i) !=3D dst_llc && + dst_llc >=3D 0) + sgs->nr_pref_llc +=3D rq->nr_pref_llc[dst_llc]; +#endif + /* * No need to call idle_cpu() if nr_running is not 0 */ --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3209E2EDD45 for ; Wed, 3 Dec 2025 23:01:34 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802897; cv=none; b=AN9K8aiWJQG7HDbeaWXpGDetIW2icpqGbDr6zs/psxf+4ZLm2ceitwFSdlkxUNnHO69aqE5S3Lgw8UXlsXoedmM4Pr7i5RbMpn7L1KrlbpjXV6xeAEYh8XRvFtihZU5ev2z3gpc9wUtfTNoORHKd7LfpH7/RywEIWMBBa/DRGKQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802897; c=relaxed/simple; bh=p+3h65+/r+G8M/UVKx3C3o18pTa5Qaadr44RFr//JJM=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=MmMfghNG3eQEQnrI1wgmAlkBPcwScfTCOYIB2L9oD0PhxTEQvycV+raEGlUU7tq/cOm1m41tgx1zgYVTnsY1VCpNGnM6slJtSvukwWoNbVbq6sVz9SyOM9hVO35VnfPEJ/kFPYJD7nSsZDAVCSBbwe4MWGUKumJjlC3jPA1Gp5w= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=oK3XGSFi; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="oK3XGSFi" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802894; x=1796338894; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=p+3h65+/r+G8M/UVKx3C3o18pTa5Qaadr44RFr//JJM=; b=oK3XGSFi2bGDDnHY3Lou8C7HjUQfAlxc1xp5Jsb4tWssOTetEyKk8VhS xWt++svfjbe9DJCu7kK8NB54Iyuv23cDcsruzAVgtKiHf34SlRWKEmzrW D+oCFG7YN+VzH5prFgSppmI032uc/cJAJ/qAKAOk+5EqFUqWcIySUNujp dnKCK0NZsBYY0rnhzU9NpLtzRd0sgBD+P+q/gVsngGR9F8P7Ojt0z+4k+ FNbn0vTsTTr/tR3CHEUKYnt1XKHxIQth0oKpXgg30ClUCUHrWShO5n1wq sHaXMI4sp88m3bKftZXPxnzsOaTk5Sy2iUOBeydtIg4kqCpHbvNeeio00 A==; X-CSE-ConnectionGUID: eUWbdnCGTbS8UdiOjQqeaw== X-CSE-MsgGUID: M8ATD04uQSWlmkQNqjCJ+A== X-IronPort-AV: 
E=McAfee;i="6800,10657,11631"; a="77136382" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136382" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:33 -0800 X-CSE-ConnectionGUID: +B4a0CVGS5aDMise1kmgcw== X-CSE-MsgGUID: mYFfuf8aQyCTLl73SGkFmQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763810" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:33 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 10/23] sched/cache: Check local_group only once in update_sg_lb_stats() Date: Wed, 3 Dec 2025 15:07:29 -0800 Message-Id: <2581fa14a0083bbd22b50837cd86003e59192c00.1764801860.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" There is no need to check the local group twice for both group_asym_packing and group_smt_balance. Adjust the code to facilitate future checks for group types (cache-aware load balancing) as well. No functional changes are expected. Suggested-by: Peter Zijlstra (Intel) Co-developed-by: Chen Yu Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: New code cleanup patch. 
(Peter Zijlstra) kernel/sched/fair.c | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4d7803f69a74..6e4c1ae1bdda 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10984,14 +10984,16 @@ static inline void update_sg_lb_stats(struct lb_e= nv *env, =20 sgs->group_weight =3D group->group_weight; =20 - /* Check if dst CPU is idle and preferred to this group */ - if (!local_group && env->idle && sgs->sum_h_nr_running && - sched_group_asym(env, sgs, group)) - sgs->group_asym_packing =3D 1; - - /* Check for loaded SMT group to be balanced to dst CPU */ - if (!local_group && smt_balance(env, sgs, group)) - sgs->group_smt_balance =3D 1; + if (!local_group) { + /* Check if dst CPU is idle and preferred to this group */ + if (env->idle && sgs->sum_h_nr_running && + sched_group_asym(env, sgs, group)) + sgs->group_asym_packing =3D 1; + + /* Check for loaded SMT group to be balanced to dst CPU */ + if (smt_balance(env, sgs, group)) + sgs->group_smt_balance =3D 1; + } =20 sgs->group_type =3D group_classify(env->sd->imbalance_pct, group, sgs); =20 --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C1A762F6160 for ; Wed, 3 Dec 2025 23:01:36 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802899; cv=none; b=Y+KSvOIGNo37S4ppd6Zqb+qeXMYg9H7oOVVSwDUONcSmPmmNo+OfFtkUVVLhQy9Kszncjru9WbcIa9UEetZqhMPsmMY2k5fVZ6RAWQZpLFm3o5ZOTcH4gt2vkBWUME5YgLQA3NYdBf+3LQy/lgsvGtAErx6vO+QUxr5PuBX7rAE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802899; c=relaxed/simple; bh=CwrhaA/K9sEcx5ifxeMnRiF7w0oKVkh5kmhIRkZCI08=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=oku/UCNYCKxtHDLcs7jWCa04/T613otu/fvMOx46pM5Fk461C8jmF88SnvfkaEKbY/tPKG6ssSj+6jJ5qq4aFqOkczxx9qajmomVw1d15n0Nxc/H0Jxmj7YmItsjsTy0cRx3h5fJ6U2M4vg8NuLnjq+H/GqT2czhMHBhUawwwc0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=NQQGq9b9; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="NQQGq9b9" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802896; x=1796338896; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=CwrhaA/K9sEcx5ifxeMnRiF7w0oKVkh5kmhIRkZCI08=; b=NQQGq9b9vgYmeBpCZdLnkhTURcvp8LDZjz79tucf4QAOjRH6WMoJ7DIc VCEpH4PZk5+dZi9trvIpapAwsuwYkQegVq+/LDqHzSrIt129SaHxgL94Y 5nrvAHUr0MUD5UNXllanE0V0Fykum1uE2UTQDl3LnIDioTcTzOYpAO1X1 4qycYWShsJLluL7efSyQ+/SgISKYo/HIyxL8OBYx1D4XH6mSLaqEpIaiX g8GbNG2ofsWe9Fe2YAYpsC9b78PtUUg4W2Vm4/GWu3tuk8/oeCtghHVCm rv/mHq9+NoDA+NgB2cghgRnsU5NYvBkjZ9v38NvuhidP8frlEkqZR1gb1 Q==; X-CSE-ConnectionGUID: p32r4lRGQkiQ0lJdqe2vYQ== X-CSE-MsgGUID: 
Aoa+xDJNReuNdKo3ZUU7Lw== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136420" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136420" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:36 -0800 X-CSE-ConnectionGUID: bwFbXM5NTD2aXs2HmVfWRA== X-CSE-MsgGUID: oV/d19IyQLihLj6NBNyYIA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763827" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:35 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 11/23] sched/cache: Prioritize tasks preferring destination LLC during balancing Date: Wed, 3 Dec 2025 15:07:30 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" During LLC load balancing, first check for tasks that prefer the destination LLC and balance them to it before others. Mark source sched groups containing tasks preferring non local LLCs with the group_llc_balance flag. This ensures the load balancer later pulls or pushes these tasks toward their preferred LLCs. The load balancer selects the busiest sched_group and migrates tasks to less busy groups to distribute load across CPUs. With cache-aware scheduling enabled, the busiest sched_group is the one with most tasks preferring the destination LLC. If the group has the llc_balance flag set, cache aware load balancing is triggered. Introduce the helper function update_llc_busiest() to identify the sched_group with the most tasks preferring the destination LLC. Suggested-by: K Prateek Nayak Co-developed-by: Chen Yu Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: Fix comparison in can_migrate_llc(), which uses an uninitialized env->src_cpu. Use the candidate group's first CPU instead. (Aaron Lu) =20 Fix a race condition during bootup with build_sched_domains(), where the per-cpu(sd_llc_id) is reset to -1. (lkp/0day) Put the set of group_llc_balance and the usage of it into 1 patch. (Peter Zijlstra) =20 Change group_llc_balance priority to be lower than group_overloaded and embed it into normal load balance path. (Peter Zijlstra) =20 Remove the sched group's SD_SHARE_LLC check in llc_balance(), because we should allow tasks migration across NUMA nodes to their preferred= LLC, where the domain does not have SD_SHARE_LLC flag. kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 65 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6e4c1ae1bdda..db555c11b5b8 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9531,6 +9531,11 @@ enum group_type { * from balancing the load across the system. 
*/ group_imbalanced, + /* + * There are tasks running on non-preferred LLC, possible to move + * them to their preferred LLC without creating too much imbalance. + */ + group_llc_balance, /* * The CPU is overloaded and can't provide expected CPU cycles to all * tasks. @@ -10440,6 +10445,7 @@ struct sg_lb_stats { enum group_type group_type; unsigned int group_asym_packing; /* Tasks should be moved to preferred CP= U */ unsigned int group_smt_balance; /* Task on busy SMT be moved */ + unsigned int group_llc_balance; /* Tasks should be moved to preferred LL= C */ unsigned long group_misfit_task_load; /* A CPU has a task too big for its= capacity */ #ifdef CONFIG_NUMA_BALANCING unsigned int nr_numa_running; @@ -10698,6 +10704,9 @@ group_type group_classify(unsigned int imbalance_pc= t, if (group_is_overloaded(imbalance_pct, sgs)) return group_overloaded; =20 + if (sgs->group_llc_balance) + return group_llc_balance; + if (sg_imbalanced(group)) return group_imbalanced; =20 @@ -10890,11 +10899,55 @@ static void record_sg_llc_stats(struct lb_env *en= v, if (unlikely(READ_ONCE(sd_share->capacity) !=3D sgs->group_capacity)) WRITE_ONCE(sd_share->capacity, sgs->group_capacity); } + +/* + * Do LLC balance on sched group that contains LLC, and have tasks preferr= ing + * to run on LLC in idle dst_cpu. + */ +static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs, + struct sched_group *group) +{ + if (!sched_cache_enabled()) + return false; + + if (env->sd->flags & SD_SHARE_LLC) + return false; + + if (sgs->nr_pref_llc && + can_migrate_llc(cpumask_first(sched_group_span(group)), + env->dst_cpu, 0, true) =3D=3D mig_llc) + return true; + + return false; +} + +static bool update_llc_busiest(struct lb_env *env, + struct sg_lb_stats *busiest, + struct sg_lb_stats *sgs) +{ + /* + * There are more tasks that want to run on dst_cpu's LLC. + */ + return sgs->nr_pref_llc > busiest->nr_pref_llc; +} #else static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_st= ats *sgs, struct sched_group *group) { } + +static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs, + struct sched_group *group) +{ + return false; +} + +static bool update_llc_busiest(struct lb_env *env, + struct sg_lb_stats *busiest, + struct sg_lb_stats *sgs) +{ + return false; +} #endif =20 /** @@ -10993,6 +11046,10 @@ static inline void update_sg_lb_stats(struct lb_en= v *env, /* Check for loaded SMT group to be balanced to dst CPU */ if (smt_balance(env, sgs, group)) sgs->group_smt_balance =3D 1; + + /* Check for tasks in this group can be moved to their preferred LLC */ + if (llc_balance(env, sgs, group)) + sgs->group_llc_balance =3D 1; } =20 sgs->group_type =3D group_classify(env->sd->imbalance_pct, group, sgs); @@ -11056,6 +11113,10 @@ static bool update_sd_pick_busiest(struct lb_env *= env, /* Select the overloaded group with highest avg_load. 
*/ return sgs->avg_load > busiest->avg_load; =20 + case group_llc_balance: + /* Select the group with most tasks preferring dst LLC */ + return update_llc_busiest(env, busiest, sgs); + case group_imbalanced: /* * Select the 1st imbalanced group as we don't have any way to @@ -11318,6 +11379,7 @@ static bool update_pick_idlest(struct sched_group *= idlest, return false; break; =20 + case group_llc_balance: case group_imbalanced: case group_asym_packing: case group_smt_balance: @@ -11450,6 +11512,7 @@ sched_balance_find_dst_group(struct sched_domain *s= d, struct task_struct *p, int return NULL; break; =20 + case group_llc_balance: case group_imbalanced: case group_asym_packing: case group_smt_balance: @@ -11949,7 +12012,8 @@ static struct sched_group *sched_balance_find_src_g= roup(struct lb_env *env) * group's child domain. */ if (sds.prefer_sibling && local->group_type =3D=3D group_has_spare && - sibling_imbalance(env, &sds, busiest, local) > 1) + (busiest->group_type =3D=3D group_llc_balance || + sibling_imbalance(env, &sds, busiest, local) > 1)) goto force_balance; =20 if (busiest->group_type !=3D group_overloaded) { --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8446B2FCBE3 for ; Wed, 3 Dec 2025 23:01:38 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802901; cv=none; b=AczUsF+ErJIRlmzmMhdwLmi7ZupDah78/dCkfXKoZGQ3XVlhu9qwGaFYSDg3FFQU9754xRJEORkGrcVZU1ssicX++R+V0FXfSTdSUEZWfvt980XcoUhlWnK7J8un6y7YNQXxJBfZVrhj31WyccQPJJevDmK67sgqqF6PsKk29mc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802901; c=relaxed/simple; bh=da90OiAHbhR9NPA8Ratl9FUXidYv15t1ql0bzkXvmzA=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=hCA0ZezNljYOVjtzlpNDPYqpoGKoW7yU4ihuYN4DdplXI8ZjqyOysntDUcfzbne+6CzBonX2R+LOUwUNh5V4ZvlW0NEG+WGaT266Gr89t7EmmUAyb0SQ4i4NDSbCHrELFwlVL45n3XsDuBwIKNxjYMRKZj90lzt9XJuGVK0hJpE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=E/oDxO7e; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="E/oDxO7e" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802898; x=1796338898; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=da90OiAHbhR9NPA8Ratl9FUXidYv15t1ql0bzkXvmzA=; b=E/oDxO7ef4nI4G5J3jOvjR+X/vFua+P9e3AZXKLJcxFJriNr7Ua944xG AxkcNTluTudW0fa7LiL2oLSyXQGNm4wxTedztXy+Kb3GNW3m1xItQPgjY yaKpw+/5zQcwTUlI7cSSe2yq6pGi70PjZnOQeUYqx+6LdidqnzQeT9x0d oKfUVrBxLwV+bxjJ5X7pfb+amTWF/9P1/Z2cwQnN4MgR4+xZfJ/oQETi0 OhZkv30WMo989iIGaDW9QOVZENXrnIYuSR0poLGwGoz4vGxEA6oadIK33 rSOZLBiBoM9ORQbnZoVJ4AxudF9GCXu3fDkCd/li1EhJxcKamQHTatJeP g==; X-CSE-ConnectionGUID: Ktrog9qIS3GVBMh0FikIKg== X-CSE-MsgGUID: 
wFdH+7CRS02fjEaxSZiyog== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136444" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136444" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:37 -0800 X-CSE-ConnectionGUID: 6HowfZdBQD20KHd0gzJbtg== X-CSE-MsgGUID: 6WhOzrMuS8+5P3U6JQUdyA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763835" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:37 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 12/23] sched/cache: Add migrate_llc_task migration type for cache-aware balancing Date: Wed, 3 Dec 2025 15:07:31 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Introduce a new migration type, migrate_llc_task, to support cache-aware load balancing. After identifying the busiest sched_group (having the most tasks preferring the destination LLC), mark migrations with this type. During load balancing, each runqueue in the busiest sched_group is examined, and the runqueue with the highest number of tasks preferring the destination CPU is selected as the busiest runqueue. 
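
As an illustration (not part of the patch), the source-runqueue choice under migrate_llc_task boils down to picking the runqueue in the busiest group that holds the most tasks preferring the destination CPU's LLC. Below is a minimal sketch assuming the llc_id() helper and the per-rq nr_pref_llc[] array introduced earlier in this series; busiest_rq_for_dst_llc() is a hypothetical helper used only for illustration, as the actual change folds this logic into the migrate_llc_task case of sched_balance_find_src_rq():

	static struct rq *busiest_rq_for_dst_llc(struct lb_env *env,
						 struct sched_group *group)
	{
		unsigned int best_cnt = 0;
		struct rq *best = NULL;
		int dst_llc = llc_id(env->dst_cpu);
		int i;

		if (dst_llc < 0)
			return NULL;

		for_each_cpu_and(i, sched_group_span(group), env->cpus) {
			struct rq *rq = cpu_rq(i);

			/* the rq with most tasks wanting dst's LLC wins */
			if (rq->nr_pref_llc[dst_llc] > best_cnt) {
				best_cnt = rq->nr_pref_llc[dst_llc];
				best = rq;
			}
		}

		return best;
	}

In the patch itself this selection sits alongside the existing migrate_task and migrate_load cases rather than in a separate helper.
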
Signed-off-by: Tim Chen --- Notes: v1->v2: Remove unnecessary cpus_share_cache() check in sched_balance_find_src_rq() (K Prateek Nayak) kernel/sched/fair.c | 32 +++++++++++++++++++++++++++++++- 1 file changed, 31 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index db555c11b5b8..529adf342ce0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9547,7 +9547,8 @@ enum migration_type { migrate_load =3D 0, migrate_util, migrate_task, - migrate_misfit + migrate_misfit, + migrate_llc_task }; =20 #define LBF_ALL_PINNED 0x01 @@ -10134,6 +10135,10 @@ static int detach_tasks(struct lb_env *env) env->imbalance -=3D util; break; =20 + case migrate_llc_task: + env->imbalance--; + break; + case migrate_task: env->imbalance--; break; @@ -11766,6 +11771,15 @@ static inline void calculate_imbalance(struct lb_e= nv *env, struct sd_lb_stats *s return; } =20 +#ifdef CONFIG_SCHED_CACHE + if (busiest->group_type =3D=3D group_llc_balance) { + /* Move a task that prefer local LLC */ + env->migration_type =3D migrate_llc_task; + env->imbalance =3D 1; + return; + } +#endif + if (busiest->group_type =3D=3D group_imbalanced) { /* * In the group_imb case we cannot rely on group-wide averages @@ -12073,6 +12087,10 @@ static struct rq *sched_balance_find_src_rq(struct= lb_env *env, struct rq *busiest =3D NULL, *rq; unsigned long busiest_util =3D 0, busiest_load =3D 0, busiest_capacity = =3D 1; unsigned int busiest_nr =3D 0; +#ifdef CONFIG_SCHED_CACHE + unsigned int busiest_pref_llc =3D 0; + int dst_llc; +#endif int i; =20 for_each_cpu_and(i, sched_group_span(group), env->cpus) { @@ -12181,6 +12199,16 @@ static struct rq *sched_balance_find_src_rq(struct= lb_env *env, } break; =20 + case migrate_llc_task: +#ifdef CONFIG_SCHED_CACHE + dst_llc =3D llc_id(env->dst_cpu); + if (dst_llc >=3D 0 && + busiest_pref_llc < rq->nr_pref_llc[dst_llc]) { + busiest_pref_llc =3D rq->nr_pref_llc[dst_llc]; + busiest =3D rq; + } +#endif + break; case migrate_task: if (busiest_nr < nr_running) { busiest_nr =3D nr_running; @@ -12363,6 +12391,8 @@ static void update_lb_imbalance_stat(struct lb_env = *env, struct sched_domain *sd case migrate_misfit: __schedstat_add(sd->lb_imbalance_misfit[idle], env->imbalance); break; + case migrate_llc_task: + break; } } =20 --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 97BB42FFF98 for ; Wed, 3 Dec 2025 23:01:39 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802903; cv=none; b=KDFpHcGAhKnHRBZFMFMtHMoRnhc4icrwIxIA8u+Vif5oz7Z18LHjkzu1IOV8tRYJFy4lXDjG6wYe22JV6BPtT9JAf2mUHKRyigHv1MkoPNBeRIKSEJ51iH0zebfyiiIhyx46QCps5MkfKG9xVMGg3N7ENza6Vv2+y6dsL+Zp0lE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802903; c=relaxed/simple; bh=8SM2jHHpi12dQS+zJornGRPQxkuowwvNXMVhwIeDBGA=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=U/JdvWoJw/IU3s2ub70NWLePIaQBRwHwPYibO+bbRJhw5I3xBFJgWmgkN/HfBIb1ABZRWNcUN5ladx9wdRE4q84V9sG4/k/92/pAoHRgP60/SkA1N0lBh+0oNDDaOMmcaJymNEYAB4Y+PlNTanSz07u82e6zrOmPcftrMVq0eQg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; 
dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=QFrFmdLP; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="QFrFmdLP" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802900; x=1796338900; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=8SM2jHHpi12dQS+zJornGRPQxkuowwvNXMVhwIeDBGA=; b=QFrFmdLPstihmD8vLzP896hsrOed6TFf664ZbLxgCKDyVP1ElFu/KxlL cWka8HAx7lSbtKJIRs2zDLb662V+u3vSkOL/+GmAmBZOGy6YahHgzdZ+w Cm8JPiAUQ0kzPS2n/rAw++vW0A14d5QX1S2PZ0RvAxgtjOMIEQght4vtw NlNGyMxSykwrfzzHo/Khc6YFVxKydWs7zQdFb7hjDddawl3rivgSTQ4lM rXsDbUmw/L0HUCnUtshRY/GabXqs3gMSK3t3UfCRyfscjIhW5T7A4/xG6 Ul+07Ph3CpTYgJ6hsHVxiRy1rZKIhjL1V7FiHZTJQ8OxeBn2eVIqDAWXn g==; X-CSE-ConnectionGUID: Ew56K6WcQq6rz+G+xWa07Q== X-CSE-MsgGUID: UdKIolArSn+zoD5IfTBGAQ== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136469" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136469" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:39 -0800 X-CSE-ConnectionGUID: 2oDVFmE5Rvu2YRtXhtrPTg== X-CSE-MsgGUID: oS68+IQVQH2bxeYLezigJQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763850" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:38 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 13/23] sched/cache: Handle moving single tasks to/from their preferred LLC Date: Wed, 3 Dec 2025 15:07:32 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" If the busiest runqueue has only one task, active balancing may be invoked to move it. However, before migration, check whether the task is running on its preferred LLC. Do not move a lone task to another LLC if it would move the task away from its preferred LLC or cause excessive imbalance between LLCs. Co-developed-by: Chen Yu Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: Remove uneeded preferred LLC migration check from active_load_balance_cpu_stop(). 
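
For readers skimming the quoted-printable diff below, the check added to need_active_balance() can be summarised by this condensed sketch (same names as the patch; can_migrate_llc() and nr_pref_llc_running come from earlier patches in the series, and the RCU access to src_rq->curr is elided for brevity, so this is not the exact code):

	static bool break_llc_locality(struct lb_env *env)
	{
		if (!sched_cache_enabled())
			return false;

		/* moving within one LLC never breaks locality */
		if (cpus_share_cache(env->src_cpu, env->dst_cpu))
			return false;

		/* every runnable task on the source rq prefers the LLC it is on */
		if (env->src_rq->nr_pref_llc_running == env->src_rq->cfs.h_nr_runnable) {
			/* a lone task is left where it is */
			if (env->src_rq->nr_running <= 1)
				return true;

			/* otherwise move it only if the LLC policy allows it */
			if (can_migrate_llc(env->src_cpu, env->dst_cpu,
					    task_util(env->src_rq->curr), false) == mig_forbid)
				return true;
		}

		return false;
	}

need_active_balance() then bails out early when this returns true, so active balancing never pulls a lone task off its preferred LLC.
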
kernel/sched/fair.c | 51 ++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 50 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 529adf342ce0..aed3fab98d7c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9878,12 +9878,57 @@ static __maybe_unused enum llc_mig can_migrate_llc_= task(int src_cpu, int dst_cpu task_util(p), to_pref); } =20 +/* + * Check if active load balance breaks LLC locality in + * terms of cache aware load balance. + */ +static inline bool +break_llc_locality(struct lb_env *env) +{ + if (!sched_cache_enabled()) + return false; + + if (cpus_share_cache(env->src_cpu, env->dst_cpu)) + return false; + /* + * All tasks prefer to stay on their current CPU. + * Do not pull a task from its preferred CPU if: + * 1. It is the only task running there; OR + * 2. Migrating it away from its preferred LLC would violate + * the cache-aware scheduling policy. + */ + if (env->src_rq->nr_pref_llc_running =3D=3D env->src_rq->cfs.h_nr_runnabl= e) { + unsigned long util =3D 0; + struct task_struct *cur; + + if (env->src_rq->nr_running <=3D 1) + return true; + + rcu_read_lock(); + cur =3D rcu_dereference(env->src_rq->curr); + if (cur) + util =3D task_util(cur); + rcu_read_unlock(); + + if (can_migrate_llc(env->src_cpu, env->dst_cpu, + util, false) =3D=3D mig_forbid) + return true; + } + + return false; +} #else static inline bool get_llc_stats(int cpu, unsigned long *util, unsigned long *cap) { return false; } + +static inline bool +break_llc_locality(struct lb_env *env) +{ + return false; +} #endif /* * can_migrate_task - may task p from runqueue rq be migrated to this_cpu? @@ -12279,6 +12324,9 @@ static int need_active_balance(struct lb_env *env) { struct sched_domain *sd =3D env->sd; =20 + if (break_llc_locality(env)) + return 0; + if (asym_active_balance(env)) return 1; =20 @@ -12298,7 +12346,8 @@ static int need_active_balance(struct lb_env *env) return 1; } =20 - if (env->migration_type =3D=3D migrate_misfit) + if (env->migration_type =3D=3D migrate_misfit || + env->migration_type =3D=3D migrate_llc_task) return 1; =20 return 0; --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C1938F513 for ; Wed, 3 Dec 2025 23:01:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802905; cv=none; b=sPmV7aM8SfneES++JSxoAMTpkJxsxkIaVzLucunnA9mKqP6A+4Tm600kyT9VTXTzXq34T39lXTUp9sHWoERIl8w+bTu7J1HC+rfyTlXxwEVQV8C99GFpkkbN1BPFHILnrVb4xczJGDnWK5dD50Ye9FIBTMyihvIerGvjfEsmqNE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802905; c=relaxed/simple; bh=lHL2pgABc7GHr6ACmg9H32RJUswizn6AHQFobrHrbFw=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=G7rXkqjakmupf9n++e5JGAkMIXq3jqgQc6G6Gw5IYyY/VhHNnlVMfdVNOcDomPtYPBMavf9m7Y2bsSMUvQExqTt6CASUZ8aGZ8iX+XoR/Ej28b5EwCnggenbKxXL4Xj0/E38v+KIJD/T8MnOLbFEeGjSREtAQxxgu/2prdjZMw8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=FiUlG+0K; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; 
dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="FiUlG+0K" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802901; x=1796338901; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=lHL2pgABc7GHr6ACmg9H32RJUswizn6AHQFobrHrbFw=; b=FiUlG+0K/UC9vVMh/oPWl1WUBZdhy5MrB44PaaHkXUAA4jYHkLTFSSsi qocTAQQFuheK8JLYpFg2R7aU2iv4GZRGXge93BEc9kS9nTpx4oQOMWekm +vXMxJj28JhCGkxAcIYAkVQvbks0I4+snX/or9+O6+kLtJoq4VW98lvHt gsZRnKPvbTAbfB8BLT4mfbZqijYwb7I27I0TW2bqZx35wIeRxh9EeBFyi ROuei6K/cuomwGMaKK20uTZT8/nP1CIoBiGImBAQQNhK7Hgo6jMMsFX23 lLTcZHF+7w8PBbIBEKU+iwv08wqwC5Czno4lf4DE3GutioUzRJHIw2uIq A==; X-CSE-ConnectionGUID: LC9AWrvJQPSiwhGuWRZBoA== X-CSE-MsgGUID: vAm1J3bzStCyct17U9yt9Q== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136497" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136497" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:40 -0800 X-CSE-ConnectionGUID: pVhkYbuLQ6qhXibAglfMuQ== X-CSE-MsgGUID: 3iy0SCYQQeWYaEaxUD8biw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763859" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:40 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 14/23] sched/cache: Consider LLC preference when selecting tasks for load balancing Date: Wed, 3 Dec 2025 15:07:33 -0800 Message-Id: <048601436d24f19e84c0a002e1c5897f95853276.1764801860.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Currently, task selection from the busiest runqueue ignores LLC preferences. Reorder tasks in the busiest queue to prioritize selection as follows: 1. Tasks preferring the destination CPU's LLC 2. Tasks with no LLC preference 3. Tasks preferring an LLC different from their current one 4. Tasks preferring the LLC they are currently on This improves the likelihood that tasks are migrated to their preferred LLC. Signed-off-by: Tim Chen --- Notes: v1->v2: No change. kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 65 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index aed3fab98d7c..dd09a816670e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10092,6 +10092,68 @@ static struct task_struct *detach_one_task(struct = lb_env *env) return NULL; } =20 +#ifdef CONFIG_SCHED_CACHE +/* + * Prepare lists to detach tasks in the following order: + * 1. tasks that prefer dst cpu's LLC + * 2. 
tasks that have no preference in LLC + * 3. tasks that prefer LLC other than the ones they are on + * 4. tasks that prefer the LLC that they are currently on. + */ +static struct list_head +*order_tasks_by_llc(struct lb_env *env, struct list_head *tasks) +{ + struct task_struct *p; + LIST_HEAD(pref_old_llc); + LIST_HEAD(pref_new_llc); + LIST_HEAD(no_pref_llc); + LIST_HEAD(pref_other_llc); + + if (!sched_cache_enabled()) + return tasks; + + if (cpus_share_cache(env->dst_cpu, env->src_cpu)) + return tasks; + + while (!list_empty(tasks)) { + p =3D list_last_entry(tasks, struct task_struct, se.group_node); + + if (p->preferred_llc =3D=3D llc_id(env->dst_cpu)) { + list_move(&p->se.group_node, &pref_new_llc); + continue; + } + + if (p->preferred_llc =3D=3D llc_id(env->src_cpu)) { + list_move(&p->se.group_node, &pref_old_llc); + continue; + } + + if (p->preferred_llc =3D=3D -1) { + list_move(&p->se.group_node, &no_pref_llc); + continue; + } + + list_move(&p->se.group_node, &pref_other_llc); + } + + /* + * We detach tasks from list tail in detach tasks. Put tasks + * to be chosen first at end of list. + */ + list_splice(&pref_new_llc, tasks); + list_splice(&no_pref_llc, tasks); + list_splice(&pref_other_llc, tasks); + list_splice(&pref_old_llc, tasks); + return tasks; +} +#else +static inline struct list_head +*order_tasks_by_llc(struct lb_env *env, struct list_head *tasks) +{ + return tasks; +} +#endif + /* * detach_tasks() -- tries to detach up to imbalance load/util/tasks from * busiest_rq, as part of a balancing operation within domain "sd". @@ -10100,7 +10162,7 @@ static struct task_struct *detach_one_task(struct l= b_env *env) */ static int detach_tasks(struct lb_env *env) { - struct list_head *tasks =3D &env->src_rq->cfs_tasks; + struct list_head *tasks; unsigned long util, load; struct task_struct *p; int detached =3D 0; @@ -10119,6 +10181,8 @@ static int detach_tasks(struct lb_env *env) if (env->imbalance <=3D 0) return 0; =20 + tasks =3D order_tasks_by_llc(env, &env->src_rq->cfs_tasks); + while (!list_empty(tasks)) { /* * We don't want to steal all, otherwise we may be treated likewise, --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F15862EC0B3 for ; Wed, 3 Dec 2025 23:01:43 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802907; cv=none; b=gaatxX9hyfNCQNZuo8e4RU3vaqRhxVWET62DnEKpixJNU5xDEVuougssJt9/6wdKqXoIUOBKaKYsQEEI9+soes2dovmZhy3fGDXwD4VJshA6aArNO/9BRtmRmrSUH+Qeb4uxqCx6TiODM+aPCVtCEwIA755BalFPfrmj7+qULOI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802907; c=relaxed/simple; bh=Qq+bxGUfP5y5uzFrPweEIf2ig+fLfO0Fva+8tsaaHnM=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=PMemmZdPDG2ErK7Z6ebePwKSI9cabjQRZi7fOaAynPsVbH0TYAxCQkgG7kmEu1N/+0Kmoqb2iEytzk5b6Y83O56eTuw4wsJTpcQbn5OA5nrv8fwKgYRvMuPqwTWStSC5o/clmWh6Un/rG7VXFCAXnoxf+tadmloUwr1ceD4Iuek= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=K1c6F2rL; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; 
dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="K1c6F2rL" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802904; x=1796338904; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Qq+bxGUfP5y5uzFrPweEIf2ig+fLfO0Fva+8tsaaHnM=; b=K1c6F2rLIDMugmXFGo0VPRa3CkwpTWx9IJrRa/hsq4UrL7DnV0pw8ajG BaGeCuW4iC0q3KpRjUrb5Gjs2+rOB74bBmgvjzvP0Bgae0TPuFdvMjX23 z6+gGGgG19Wv4ve1vRjEwTT08BRcUINH2YNXiTUVgX6ibcCJComlk0Y6n quNDMVfwdU0hQZhwOtrSHXPRqMojx8I7m9WQ/PmD1woe8uT6yci0V4u2u jfnFFUMEbPvj3J6FUSZjuQwGSGo/EqXqp0xk/5KRyXKafHJF8xEhV/udJ e4v9JDT09EYShziT4Bzd1zuoH2hhzYHA7OeJFLCwdgppCCBWwVA2w3KmJ w==; X-CSE-ConnectionGUID: 9y9zHDIITpm9FBwYNTu03A== X-CSE-MsgGUID: Qrhpktr7Tg2wRc89JHsg/g== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136537" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136537" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:42 -0800 X-CSE-ConnectionGUID: JansXFpeT5WbVZuKS0jHBA== X-CSE-MsgGUID: NjtVIxeZSgSD5Yg4dtwXow== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763888" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:42 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Tim Chen , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 15/23] sched/cache: Respect LLC preference in task migration and detach Date: Wed, 3 Dec 2025 15:07:34 -0800 Message-Id: <1c75f54a2e259737eb9b15c98a5c1d1f142fdef6.1764801860.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" During the final step of load balancing, can_migrate_task() now considers a task's LLC preference before moving it out of its preferred LLC. Additionally, add checks in detach_tasks() to prevent selecting tasks that prefer their current LLC. Co-developed-by: Chen Yu Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: Leave out tasks under core scheduling from the cache aware load balance. (K Prateek Nayak) =20 Reduce the degree of honoring preferred_llc in detach_tasks(). If certain conditions are met, stop migrating tasks that prefer their current LLC and instead continue load balancing from other busiest runqueues. 
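As a reviewer aid, the following userspace sketch (illustrative only, not
part of this patch; the LLC ids and the preferred_llc values are made up)
mirrors the detach order that order_tasks_by_llc() from the previous patch
establishes, which is what the stop check added below relies on:

/*
 * Illustrative userspace sketch -- not kernel code.  Mirrors the bucket
 * ordering of order_tasks_by_llc(): tasks preferring the destination LLC
 * are detached first, tasks preferring their current (source) LLC last.
 */
#include <stdio.h>

enum bucket { PREF_NEW_LLC, NO_PREF_LLC, PREF_OTHER_LLC, PREF_OLD_LLC };

static const char *bucket_name[] = {
	"pref_new_llc (detached first)",
	"no_pref_llc (detached second)",
	"pref_other_llc (detached third)",
	"pref_old_llc (detached last)",
};

static enum bucket classify(int preferred_llc, int src_llc, int dst_llc)
{
	if (preferred_llc == dst_llc)
		return PREF_NEW_LLC;
	if (preferred_llc == -1)
		return NO_PREF_LLC;
	if (preferred_llc == src_llc)
		return PREF_OLD_LLC;
	return PREF_OTHER_LLC;
}

int main(void)
{
	int src_llc = 1, dst_llc = 2;		/* hypothetical LLC ids */
	int prefs[] = { 2, -1, 3, 1 };		/* hypothetical preferred_llc values */

	for (unsigned int i = 0; i < sizeof(prefs) / sizeof(prefs[0]); i++)
		printf("preferred_llc=%2d -> %s\n", prefs[i],
		       bucket_name[classify(prefs[i], src_llc, dst_llc)]);
	return 0;
}

Because tasks that prefer their current LLC sit in the last bucket, once the
detach loop reaches such a task it can assume the remaining tasks all prefer
to stay, which is why detaching from this runqueue can stop there.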
(K Prateek Nayak) kernel/sched/fair.c | 63 ++++++++++++++++++++++++++++++++++++++++++-- kernel/sched/sched.h | 13 +++++++++ 2 files changed, 74 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index dd09a816670e..580a967efdac 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9852,8 +9852,8 @@ static enum llc_mig can_migrate_llc(int src_cpu, int = dst_cpu, * Check if task p can migrate from source LLC to * destination LLC in terms of cache aware load balance. */ -static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int d= st_cpu, - struct task_struct *p) +static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu, + struct task_struct *p) { struct mm_struct *mm; bool to_pref; @@ -10025,6 +10025,13 @@ int can_migrate_task(struct task_struct *p, struct= lb_env *env) if (env->flags & LBF_ACTIVE_LB) return 1; =20 +#ifdef CONFIG_SCHED_CACHE + if (sched_cache_enabled() && + can_migrate_llc_task(env->src_cpu, env->dst_cpu, p) =3D=3D mig_forbid= && + !task_has_sched_core(p)) + return 0; +#endif + degrades =3D migrate_degrades_locality(p, env); if (!degrades) hot =3D task_hot(p, env); @@ -10146,12 +10153,55 @@ static struct list_head list_splice(&pref_old_llc, tasks); return tasks; } + +static bool stop_migrate_src_rq(struct task_struct *p, + struct lb_env *env, + int detached) +{ + if (!sched_cache_enabled() || p->preferred_llc =3D=3D -1 || + cpus_share_cache(env->src_cpu, env->dst_cpu) || + env->sd->nr_balance_failed) + return false; + + /* + * Stop migration for the src_rq and pull from a + * different busy runqueue in the following cases: + * + * 1. Trying to migrate task to its preferred + * LLC, but the chosen task does not prefer dest + * LLC - case 3 in order_tasks_by_llc(). This violates + * the goal of migrate_llc_task. However, we should + * stop detaching only if some tasks have been detached + * and the imbalance has been mitigated. + * + * 2. Don't detach more tasks if the remaining tasks want + * to stay. We know the remaining tasks all prefer the + * current LLC, because after order_tasks_by_llc(), the + * tasks that prefer the current LLC are the least favored + * candidates to be migrated out. + */ + if (env->migration_type =3D=3D migrate_llc_task && + detached && llc_id(env->dst_cpu) !=3D p->preferred_llc) + return true; + + if (llc_id(env->src_cpu) =3D=3D p->preferred_llc) + return true; + + return false; +} #else static inline struct list_head *order_tasks_by_llc(struct lb_env *env, struct list_head *tasks) { return tasks; } + +static bool stop_migrate_src_rq(struct task_struct *p, + struct lb_env *env, + int detached) +{ + return false; +} #endif =20 /* @@ -10205,6 +10255,15 @@ static int detach_tasks(struct lb_env *env) =20 p =3D list_last_entry(tasks, struct task_struct, se.group_node); =20 + /* + * Check if detaching current src_rq should be stopped, because + * doing so would break cache aware load balance. If we stop + * here, the env->flags has LBF_ALL_PINNED, which would cause + * the load balance to pull from another busy runqueue. 
+ */ + if (stop_migrate_src_rq(p, env, detached)) + break; + if (!can_migrate_task(p, env)) goto next; =20 diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 8f2a779825e4..40798a06e058 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1485,6 +1485,14 @@ extern void sched_core_dequeue(struct rq *rq, struct= task_struct *p, int flags); extern void sched_core_get(void); extern void sched_core_put(void); =20 +static inline bool task_has_sched_core(struct task_struct *p) +{ + if (sched_core_disabled()) + return false; + + return !!p->core_cookie; +} + #else /* !CONFIG_SCHED_CORE: */ =20 static inline bool sched_core_enabled(struct rq *rq) @@ -1524,6 +1532,11 @@ static inline bool sched_group_cookie_match(struct r= q *rq, return true; } =20 +static inline bool task_has_sched_core(struct task_struct *p) +{ + return false; +} + #endif /* !CONFIG_SCHED_CORE */ =20 #ifdef CONFIG_RT_GROUP_SCHED --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6E3AB2EFDAD for ; Wed, 3 Dec 2025 23:01:44 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802907; cv=none; b=DqEqfXaSW0ZZxydgpHnr//9Y+r8Kz4ipcj+CchWbORZ48RCt17FQ2DquLW8sfqca/x+abOrEYIPaq71/GVkzdhR5YktmlcdFPno7ta7IuxETAlghruG+YXcsfmrH3WvfypIFBRxcIK9G7zQ7Meao90BbtEmbg2ZH1AORZqaQMHw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802907; c=relaxed/simple; bh=UY2I5n5Zb5eoLU5mFytvpnggFlTCSd5WOZCBICo1NK0=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=ss76k4YY8rB/Z6uAGDFyQbUZ7bARhHHFMR8yOxKyMTjDj6HDUJk3fTrjyBpd8eZwWLWJd6uE+i5j5z2Y9c/kkgK7AnD0FSS5RcyHMwddwez0X8IBpyAwZBkh9Vkri2qy0caEGEQrs66nsLD9/pRtuqh/ensvo0F7AVsRu2xo2+M= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=b/WFlJ1d; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="b/WFlJ1d" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802904; x=1796338904; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=UY2I5n5Zb5eoLU5mFytvpnggFlTCSd5WOZCBICo1NK0=; b=b/WFlJ1dlftZ7EAiu5bb8CTSjdtBeseHX8isQ4Wht5vD1dxWm6RURFOT R1B3Vg98GKNKQd2LzX3IPnNH9KdzkcCltvIyuRjvzvHEAhFOFxsI/nNCA UEadn+0Fte3u19UFuKUeR+zfOfQY/nrc24OBpPT4wpQKXE96Ne4Zzhez9 CGKthr3Nhi0su6EqgFcgXSic3+e2vAZwxOJETpVdCkTcXOxPoH3AQRibc 89EqfPOQ7c13HxarJn7Y8fuv5oRcK9m2z4cMXZ93jLuPQkW6wM0YzTFzA la772T94DglzvBNsM6aU73BVVoFLW1MUMY65Xa6wwGE8bwa6iEUdQtZCN w==; X-CSE-ConnectionGUID: hV5QNWsDRNeWT6DD4+0r9w== X-CSE-MsgGUID: zyp/cB/OQI6PxkSBENXT7w== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136566" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136566" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by 
orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:44 -0800 X-CSE-ConnectionGUID: iafqWAoBQZGMBBV/tLdIww== X-CSE-MsgGUID: S+YoPmfDSRiUGxMt3Qjmgw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763904" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:43 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org, Libo Chen Subject: [PATCH v2 16/23] sched/cache: Introduce sched_cache_present to enable cache aware scheduling for multi LLCs NUMA node Date: Wed, 3 Dec 2025 15:07:35 -0800 Message-Id: <7453e3f901878608959f23dacaa36dfc0432c05b.1764801860.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu Cache-aware load balancing should only be enabled if there are more than 1 LLCs within 1 NUMA node. sched_cache_present is introduced to indicate whether this platform supports this topology. Suggested-by: Libo Chen Suggested-by: Adam Li Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: Use flag sched_cache_present to indicate whether a platform supports cache aware scheduling. Change this flag from staic key. There should be only 1 static key to control the cache aware scheduling. (Peter Zijlstra) kernel/sched/topology.c | 20 +++++++++++++++----- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index d583399fc6a1..9799e3a9a609 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -24,6 +24,8 @@ int max_llcs; =20 #ifdef CONFIG_SCHED_CACHE =20 +static bool sched_cache_present; + static unsigned int *alloc_new_pref_llcs(unsigned int *old, unsigned int *= *gc) { unsigned int *new =3D NULL; @@ -54,7 +56,7 @@ static void populate_new_pref_llcs(unsigned int *old, uns= igned int *new) new[i] =3D old[i]; } =20 -static int resize_llc_pref(void) +static int resize_llc_pref(bool has_multi_llcs) { unsigned int *__percpu *tmp_llc_pref; int i, ret =3D 0; @@ -102,6 +104,11 @@ static int resize_llc_pref(void) rq_unlock_irqrestore(rq, &rf); } =20 + if (has_multi_llcs) { + sched_cache_present =3D true; + pr_info_once("Cache aware load balance is enabled on the platform.\n"); + } + release_old: /* * Load balance is done under rcu_lock. @@ -124,7 +131,7 @@ static int resize_llc_pref(void) =20 #else =20 -static int resize_llc_pref(void) +static int resize_llc_pref(bool has_multi_llcs) { max_llcs =3D new_max_llcs; return 0; @@ -2644,6 +2651,7 @@ static int build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_att= r *attr) { enum s_alloc alloc_state =3D sa_none; + bool has_multi_llcs =3D false; struct sched_domain *sd; struct s_data d; struct rq *rq =3D NULL; @@ -2736,10 +2744,12 @@ build_sched_domains(const struct cpumask *cpu_map, = struct sched_domain_attr *att * between LLCs and memory channels. 
*/ nr_llcs =3D sd->span_weight / child->span_weight; - if (nr_llcs =3D=3D 1) + if (nr_llcs =3D=3D 1) { imb =3D sd->span_weight >> 3; - else + } else { imb =3D nr_llcs; + has_multi_llcs =3D true; + } imb =3D max(1U, imb); sd->imb_numa_nr =3D imb; =20 @@ -2787,7 +2797,7 @@ build_sched_domains(const struct cpumask *cpu_map, st= ruct sched_domain_attr *att if (has_cluster) static_branch_inc_cpuslocked(&sched_cluster_active); =20 - resize_llc_pref(); + resize_llc_pref(has_multi_llcs); =20 if (rq && sched_debug_verbose) pr_info("root domain span: %*pbl\n", cpumask_pr_args(cpu_map)); --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2CA91309EF4 for ; Wed, 3 Dec 2025 23:01:46 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802908; cv=none; b=nzlhLORShQGH6z2OKPCwgPj3fFYQBq0S4kjlB8PdpAMAbRvUDKx69/o9oLg1lRga1/7uLzN7ZJmwClhqm7REccEFVBXjMxnF8O6F1qeXlUxSc5j6wsPAdvgE25W54gtIVxKBjQRnZDVLeIGtXbaxk29EoCqp7pm1fCpS1IY7jQo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802908; c=relaxed/simple; bh=7G8GAR73tqFcdrEyXVcfBaeUwRwA82VAe47pEbdUV2w=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=Jh1NMZniFEQvMeyAac4yMWESOURMqAUIKW5GcomnPyFPuACvinoSr0dUF9HnUWSFLODn+/4wiWm4ySl8YKMzKSgIL7OQSmo169aanmL/sbmdbfeduyjfscZaBGqL5cQYK99GiDZLKPt44QcYP3KC0gclEaC+Rkd8OiTRxeMU500= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=E0yq8JMN; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="E0yq8JMN" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802906; x=1796338906; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=7G8GAR73tqFcdrEyXVcfBaeUwRwA82VAe47pEbdUV2w=; b=E0yq8JMN3sNhZ58s1b5iZ/cpqNuM9N0pDevJEvrPce0R2mUndVkmGScN McHDjEQAdkFny/+9qg6ANdvlFmYlDA/4TibC4Yz5kBPZKGiM/VEgmSwNx Wv+0fExbPAqEqTORsnJ61vyIc7KAkoB0P/ug+G27y1gOBAwA36EGLI/OA /yCpUK6WyND+MO1j8Jd+Z6+AKRhUgaidNDGg0GWIIit5s7o17SsHVlDsV qRWNYanMa3En1ALugyelInfcAx8tLNFNwwlqUz9ZCh6D2uuGRuoBR5fLH VziKp+AH5f2oXxMZP43VD+u7hWt+ni9sCpFuAa1/qPyus5y+HPClviJWH w==; X-CSE-ConnectionGUID: oDqO/ga6T/+BT+4b/VYbEw== X-CSE-MsgGUID: BsO2ZD3WSAih53lppi2XZQ== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136597" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136597" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:45 -0800 X-CSE-ConnectionGUID: vKt+yECETT+2Z5MJs0mW1A== X-CSE-MsgGUID: XpexGbaTSRGCCth9FMIgbg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763921" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by 
fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:45 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 17/23] sched/cache: Record the number of active threads per process for cache-aware scheduling Date: Wed, 3 Dec 2025 15:07:36 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu A performance regression was observed by Prateek when running hackbench with many threads per process (high fd count). To avoid this, processes with a large number of active threads are excluded from cache-aware scheduling. With sched_cache enabled, record the number of active threads in each process during the periodic task_cache_work(). While iterating over CPUs, if the currently running task belongs to the same process as the task that launched task_cache_work(), increment the active thread count. This number will be used by subsequent patch to inhibit cache aware load balance. Suggested-by: K Prateek Nayak Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: No change. include/linux/mm_types.h | 1 + kernel/sched/fair.c | 11 +++++++++-- 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 1ea16ef90566..04743983de4d 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1043,6 +1043,7 @@ struct mm_struct { raw_spinlock_t mm_sched_lock; unsigned long mm_sched_epoch; int mm_sched_cpu; + u64 nr_running_avg ____cacheline_aligned_in_smp; #endif =20 #ifdef CONFIG_MMU diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 580a967efdac..2f38ad82688f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1421,11 +1421,11 @@ static void task_tick_cache(struct rq *rq, struct t= ask_struct *p) =20 static void __no_profile task_cache_work(struct callback_head *work) { - struct task_struct *p =3D current; + struct task_struct *p =3D current, *cur; struct mm_struct *mm =3D p->mm; unsigned long m_a_occ =3D 0; unsigned long curr_m_a_occ =3D 0; - int cpu, m_a_cpu =3D -1; + int cpu, m_a_cpu =3D -1, nr_running =3D 0; cpumask_var_t cpus; =20 WARN_ON_ONCE(work !=3D &p->cache_work); @@ -1458,6 +1458,12 @@ static void __no_profile task_cache_work(struct call= back_head *work) m_occ =3D occ; m_cpu =3D i; } + rcu_read_lock(); + cur =3D rcu_dereference(cpu_rq(i)->curr); + if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) && + cur->mm =3D=3D mm) + nr_running++; + rcu_read_unlock(); } =20 /* @@ -1501,6 +1507,7 @@ static void __no_profile task_cache_work(struct callb= ack_head *work) mm->mm_sched_cpu =3D m_a_cpu; } =20 + update_avg(&mm->nr_running_avg, nr_running); free_cpumask_var(cpus); } =20 --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with 
ESMTPS id 3BC5B30C376 for ; Wed, 3 Dec 2025 23:01:48 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802910; cv=none; b=onpA+M+D8g+bB7DNpp5zLpepvUh9w8T9C2/oqeKTUWUlV9lpl/W31aZarTCR7uvwI9r/kkm/FD7MwcDDnX7hNWvSaLIvFHtht8DxsLrUWb3j5NtWoxy2IAV7VHzxT0RxTQbEVmk6ub/tCK+n4V2wt8/jU8sGCZYABu8xUNFmQzE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802910; c=relaxed/simple; bh=BCBRwLmdA+4IVzADPAWhC/3F5wk90mYr0XsPdVDldug=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=P22IAZf3pO0DwcaeGaXfPF45reu5KwrXd9udmOhkXnd4XQpVPzlUupze8eBT005FfxLXJRNYY4JgHS7VRdg5qBGX8VhBoX9G0rOKgnTr7U9RHG4jdp1TU4xtGdenBrAxzksuJ/5c09oa/Ni6O8HCwsplWWOi+6exHbX7OKSFqwo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=Ty7FUw1A; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="Ty7FUw1A" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802908; x=1796338908; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=BCBRwLmdA+4IVzADPAWhC/3F5wk90mYr0XsPdVDldug=; b=Ty7FUw1AorJFrTn1pShKiLwJJ/bjWAtb7y1krTlw9/SRwaxzgqmmczqo u3N/1SifTNffuhxC1c0FAisXDHgXvvqPgSL0eykN2kILgw5XGJw02WLu5 DTsTU9YL6pY9pb/nL5ZARaF9QKCpSpfipEIM2etVGVvo5Q7kFSTOXs+H8 iIxOD/4oSuYwezAxsdbkRhhzIdd7YfjUSvB9o0XWfU4YnsJl/heMOcJ7B H3ZduMD5RF+5BphEK1nTa5CXhVJ0S2nzOaIUo5QipmWAbfGExiFD7Dfvc B8hxG4haeF2aHk7F8TdO+F6bVlL/xt/ae41Mu5pc0GlLavso3K0AzD+Xh Q==; X-CSE-ConnectionGUID: 93dV145yReO721FOecTa9w== X-CSE-MsgGUID: cCQ1dcHHRZCkHxfc9efEaQ== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136621" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136621" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:47 -0800 X-CSE-ConnectionGUID: gYbWyA1jQSuPW79ZakwQKg== X-CSE-MsgGUID: TU95ucBJS6iZ5dz55kzZsQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763946" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:47 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . 
Shenoy" , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 18/23] sched/cache: Disable cache aware scheduling for processes with high thread counts Date: Wed, 3 Dec 2025 15:07:37 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu If the number of active threads within the process exceeds the number of Cores(divided by SMTs number) in the LLC, do not enable cache-aware scheduling. This is because there is a risk of cache contention within the preferred LLC when too many threads are present. Suggested-by: K Prateek Nayak Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: No change. kernel/sched/fair.c | 29 +++++++++++++++++++++++++++-- 1 file changed, 27 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2f38ad82688f..6afa3f9a4e9b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1223,6 +1223,18 @@ static int llc_id(int cpu) return llc; } =20 +static bool exceed_llc_nr(struct mm_struct *mm, int cpu) +{ + int smt_nr =3D 1; + +#ifdef CONFIG_SCHED_SMT + if (sched_smt_active()) + smt_nr =3D cpumask_weight(cpu_smt_mask(cpu)); +#endif + + return ((mm->nr_running_avg * smt_nr) > per_cpu(sd_llc_size, cpu)); +} + static void account_llc_enqueue(struct rq *rq, struct task_struct *p) { int pref_llc; @@ -1365,10 +1377,12 @@ void account_mm_sched(struct rq *rq, struct task_st= ruct *p, s64 delta_exec) =20 /* * If this task hasn't hit task_cache_work() for a while, or it - * has only 1 thread, invalidate its preferred state. + * has only 1 thread, or has too many active threads, invalidate + * its preferred state. 
*/ if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT || - get_nr_threads(p) <=3D 1) { + get_nr_threads(p) <=3D 1 || + exceed_llc_nr(mm, cpu_of(rq))) { if (mm->mm_sched_cpu !=3D -1) mm->mm_sched_cpu =3D -1; } @@ -1435,6 +1449,13 @@ static void __no_profile task_cache_work(struct call= back_head *work) if (p->flags & PF_EXITING) return; =20 + if (get_nr_threads(p) <=3D 1) { + if (mm->mm_sched_cpu !=3D -1) + mm->mm_sched_cpu =3D -1; + + return; + } + if (!zalloc_cpumask_var(&cpus, GFP_KERNEL)) return; =20 @@ -9874,6 +9895,10 @@ static enum llc_mig can_migrate_llc_task(int src_cpu= , int dst_cpu, if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu)) return mig_unrestricted; =20 + /* skip cache aware load balance for single/too many threads */ + if (get_nr_threads(p) <=3D 1 || exceed_llc_nr(mm, dst_cpu)) + return mig_unrestricted; + if (cpus_share_cache(dst_cpu, cpu)) to_pref =3D true; else if (cpus_share_cache(src_cpu, cpu)) --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9101E2EC54D for ; Wed, 3 Dec 2025 23:01:50 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802913; cv=none; b=mZ6zgozB73YTe2Q60NzNJeXcrA6dwd6hmTIv0PKyoFj0ekz5KBJkRG1qM2/BURh0aF7CFHE0sYQDT25Sh/ho6UmSGiIRzP3Vlf26ErGeRZYynNy7Hu4jA7k4JybnWrC09LDy8qEGxsIyAxdcr/3QTceL1Zxm0kxxCEBV46nlDEI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802913; c=relaxed/simple; bh=ty+thnKFxG9+3T4ifTVEX04pmBe/l14iXANMioAm72I=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=bowCsa1//bbyzKU9WSJiWQsUHsXrqBQlvs/cAKgMyk/m4Bld010TDYg5UwVzdHKRvlpaid+xFoVz12quGwWlGa5F6HadDbBqKTBPP6/p1CNg91urhPN3p32qxubeGCoBIbuMM7MCO6I/YdFGB6u4/f5TpvPg3YmLnLcjC8/C7Xc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=bxWe6OeK; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="bxWe6OeK" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802910; x=1796338910; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=ty+thnKFxG9+3T4ifTVEX04pmBe/l14iXANMioAm72I=; b=bxWe6OeKUH0dPxqgW1jI5HE2e1z6OmOiyR4hMvqwqKai+AqvYcbOCYwu JOlPn9ZWYosHECHx5UGnkdTGEzkOmDWCRC2K3ypKwePUhIyD1337RCjJ3 uixa8Z2lYSQS2J5GJVC48B2f/yhUzBFPqFV4CEHvCoMLsK1cOf7W1aP4l eQBVHvIxVJB4mpBt3ae1f/13ipHHAFwfwmFLo4k5SToBHKxSAT6nyvK8a Vm37u8PzhAmKBcxxBJlGGGzpwc2T4MC/PWSin17i5/r/Xk+DaSUzLnxaF ZlP2B1+lT/NuonQU/h16sWvSe3/WRw4AeV5gKIbsttEfaewPOisfGEd7j g==; X-CSE-ConnectionGUID: Jgmht7L1SaW2ul5kAUA6dw== X-CSE-MsgGUID: 8CE3l3r/SEaFaHk/6vdVRg== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136653" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136653" Received: from fmviesa004.fm.intel.com 
([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:49 -0800 X-CSE-ConnectionGUID: 88MGzjBCTmOWjRxLdU7vUw== X-CSE-MsgGUID: Bi68ivGaS76IdMdbGxb19w== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763965" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:49 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes Date: Wed, 3 Dec 2025 15:07:38 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu Prateek and Tingyin reported that memory-intensive workloads (such as stream) can saturate memory bandwidth and caches on the preferred LLC when sched_cache aggregates too many threads. To mitigate this, estimate a process's memory footprint by comparing its RSS (anonymous and shared pages) to the size of the LLC. If RSS exceeds the LLC size, skip cache-aware scheduling. Note that RSS is only an approximation of the memory footprint. By default, the comparison is strict, but a later patch will allow users to provide a hint to adjust this threshold. According to the test from Adam, some systems do not have shared L3 but with shared L2 as clusters. In this case, the L2 becomes the LLC[1]. Link[1]: https://lore.kernel.org/all/3cb6ebc7-a2fd-42b3-8739-b00e28a09cb6@o= s.amperecomputing.com/ Co-developed-by: Tim Chen Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- Notes: v1->v2: Assigned curr_cpu in task_cache_work() before checking exceed_llc_capacity(mm, curr_cpu) to avoid out-of-bound access.(lkp/0day) include/linux/cacheinfo.h | 21 ++++++++++------- kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++++---- 2 files changed, 57 insertions(+), 13 deletions(-) diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h index c8f4f0a0b874..82d0d59ca0e1 100644 --- a/include/linux/cacheinfo.h +++ b/include/linux/cacheinfo.h @@ -113,18 +113,11 @@ int acpi_get_cache_info(unsigned int cpu, =20 const struct attribute_group *cache_get_priv_group(struct cacheinfo *this_= leaf); =20 -/* - * Get the cacheinfo structure for the cache associated with @cpu at - * level @level. - * cpuhp lock must be held. - */ -static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level) +static inline struct cacheinfo *_get_cpu_cacheinfo_level(int cpu, int leve= l) { struct cpu_cacheinfo *ci =3D get_cpu_cacheinfo(cpu); int i; =20 - lockdep_assert_cpus_held(); - for (i =3D 0; i < ci->num_leaves; i++) { if (ci->info_list[i].level =3D=3D level) { if (ci->info_list[i].attributes & CACHE_ID) @@ -136,6 +129,18 @@ static inline struct cacheinfo *get_cpu_cacheinfo_leve= l(int cpu, int level) return NULL; } =20 +/* + * Get the cacheinfo structure for the cache associated with @cpu at + * level @level. + * cpuhp lock must be held. 
+ */ +static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level) +{ + lockdep_assert_cpus_held(); + + return _get_cpu_cacheinfo_level(cpu, level); +} + /* * Get the id of the cache associated with @cpu at level @level. * cpuhp lock must be held. diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6afa3f9a4e9b..424ec601cfdf 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1223,6 +1223,38 @@ static int llc_id(int cpu) return llc; } =20 +static bool exceed_llc_capacity(struct mm_struct *mm, int cpu) +{ + struct cacheinfo *ci; + unsigned long rss; + unsigned int llc; + + /* + * get_cpu_cacheinfo_level() can not be used + * because it requires the cpu_hotplug_lock + * to be held. Use _get_cpu_cacheinfo_level() + * directly because the 'cpu' can not be + * offlined at the moment. + */ + ci =3D _get_cpu_cacheinfo_level(cpu, 3); + if (!ci) { + /* + * On system without L3 but with shared L2, + * L2 becomes the LLC. + */ + ci =3D _get_cpu_cacheinfo_level(cpu, 2); + if (!ci) + return true; + } + + llc =3D ci->size; + + rss =3D get_mm_counter(mm, MM_ANONPAGES) + + get_mm_counter(mm, MM_SHMEMPAGES); + + return (llc <=3D (rss * PAGE_SIZE)); +} + static bool exceed_llc_nr(struct mm_struct *mm, int cpu) { int smt_nr =3D 1; @@ -1382,7 +1414,8 @@ void account_mm_sched(struct rq *rq, struct task_stru= ct *p, s64 delta_exec) */ if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT || get_nr_threads(p) <=3D 1 || - exceed_llc_nr(mm, cpu_of(rq))) { + exceed_llc_nr(mm, cpu_of(rq)) || + exceed_llc_capacity(mm, cpu_of(rq))) { if (mm->mm_sched_cpu !=3D -1) mm->mm_sched_cpu =3D -1; } @@ -1439,7 +1472,7 @@ static void __no_profile task_cache_work(struct callb= ack_head *work) struct mm_struct *mm =3D p->mm; unsigned long m_a_occ =3D 0; unsigned long curr_m_a_occ =3D 0; - int cpu, m_a_cpu =3D -1, nr_running =3D 0; + int cpu, m_a_cpu =3D -1, nr_running =3D 0, curr_cpu; cpumask_var_t cpus; =20 WARN_ON_ONCE(work !=3D &p->cache_work); @@ -1449,7 +1482,9 @@ static void __no_profile task_cache_work(struct callb= ack_head *work) if (p->flags & PF_EXITING) return; =20 - if (get_nr_threads(p) <=3D 1) { + curr_cpu =3D task_cpu(p); + if (get_nr_threads(p) <=3D 1 || + exceed_llc_capacity(mm, curr_cpu)) { if (mm->mm_sched_cpu !=3D -1) mm->mm_sched_cpu =3D -1; =20 @@ -9895,8 +9930,12 @@ static enum llc_mig can_migrate_llc_task(int src_cpu= , int dst_cpu, if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu)) return mig_unrestricted; =20 - /* skip cache aware load balance for single/too many threads */ - if (get_nr_threads(p) <=3D 1 || exceed_llc_nr(mm, dst_cpu)) + /* + * Skip cache aware load balance for single/too many threads + * or large footprint. 
+ */ + if (get_nr_threads(p) <=3D 1 || exceed_llc_nr(mm, dst_cpu) || + exceed_llc_capacity(mm, dst_cpu)) return mig_unrestricted; =20 if (cpus_share_cache(dst_cpu, cpu)) --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E85E42F0C6F for ; Wed, 3 Dec 2025 23:01:52 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802915; cv=none; b=NKB81c5nkJMF1m/c1AQra8pCalQ/VATWqz8ZHIWg0eoz6hnNECnbqY6IjBOdnDBFvVl/b9HVmkECeNM1mHW2uEI8K209dQ6+mwy42BNPEeHaX20qEOS7RazcHKvkjiS5SxHlmYAv1Sx5K4HGlnkZ+3m/wG0/DRyA26pbDpUaoF0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802915; c=relaxed/simple; bh=j5hfiRZ2EYaCTsQGDmAvNRTgCCnUI1j/ItMFRbl9uzY=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=V2hqbFyqQGneKfxIcpO2Kc5dagTB+TDzJUq23BN2DeHLv/PgsNga9e2rv+hmluwZMbEcHv9RyyZKJ8F8TwCiuK0Z3yMm4l1RIXSG3p6TYCnyj/3zsuh7jcDOrc/cJgzZvLgpTBDOt79ulEa8r4q4GzHG4PsV4tL2S7Y8MOiS1eo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=aHHISq0g; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="aHHISq0g" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802913; x=1796338913; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=j5hfiRZ2EYaCTsQGDmAvNRTgCCnUI1j/ItMFRbl9uzY=; b=aHHISq0gwB38J2pv7w+1lfXdj3ALD4Re5eBGYwuwYbgrSTS87mzWr9d9 6z8UE8JAD8ovVTi9HPH2Dj4nm47BQyJFWTB7aSIByFBZvHQDMif8JcxQo YN44mNhAEn4CrrZXow3MjME9dhVbGveKvuIPn5IfCupOo2V/UomJWHR8v dtkYFqLnVw3S3bkna5BsUdpRh9ZBimaMuGq/+WwGF2nx4rrzpNdxn0j5U 3rhoVYZ01bV7elVPmaWw/ckqsd0iILZe0x+W0mSMx9qrnSVEtbw4rvo6z M5hLadE9a+KUPXiCE/w4A03eCnExBDNTMSqLbTk/r37NYHjbU70zyE3SM g==; X-CSE-ConnectionGUID: EZWPyiB6S9KiFT6DKfxjxA== X-CSE-MsgGUID: 07XoWa+5TBOCIV3mZenWgw== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136682" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136682" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:51 -0800 X-CSE-ConnectionGUID: MiEptcrPQgi3rw/P5nNNDA== X-CSE-MsgGUID: DrdTMc52RGuwpeHC+Js9gg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763975" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:51 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . 
Shenoy" , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling Date: Wed, 3 Dec 2025 15:07:39 -0800 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu Introduce a set of debugfs knobs to control the enabling of and parameters for cache-aware load balancing. (1) llc_enabled llc_enabled acts as the primary switch - users can toggle it to enable or disable cache aware load balancing. (2) llc_aggr_tolerance With sched_cache enabled, the scheduler uses a process's RSS as a proxy for its LLC footprint to determine if aggregating tasks on the preferred LLC could cause cache contention. If RSS exceeds the LLC size, aggregation is skipped. Some workloads with large RSS but small actual memory footprints may still benefit from aggregation. Since the kernel cannot efficiently track per-task cache usage (resctrl is user-space only), userspace can provide a more accurate hint. Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let users control how strictly RSS limits aggregation. Values range from 0 to 100: - 0: Cache-aware scheduling is disabled. - 1: Strict; tasks with RSS larger than LLC size are skipped. - 100: Aggressive; tasks are aggregated regardless of RSS. For example, with a 32MB L3 cache: - llc_aggr_tolerance=3D1 -> tasks with RSS > 32MB are skipped. - llc_aggr_tolerance=3D99 -> tasks with RSS > 784GB are skipped (784GB =3D (1 + (99 - 1) * 256) * 32MB). Similarly, /sys/kernel/debug/sched/llc_aggr_tolerance also controls how strictly the number of active threads is considered when doing cache aware load balance. The number of SMTs is also considered. High SMT counts reduce the aggregation capacity, preventing excessive task aggregation on SMT-heavy systems like Power10/Power11. For example, with 8 Cores/16 CPUs in a L3: - llc_aggr_tolerance=3D1 -> tasks with nr_running > 8 are skipped. - llc_aggr_tolerance=3D99 -> tasks with nr_running > 785 are skipped 785 =3D (1 + (99 - 1) * 8). (3) llc_epoch_period/llc_epoch_affinity_timeout Besides, llc_epoch_period and llc_epoch_affinity_timeout are also turned into tunable. Suggested-by: K Prateek Nayak Suggested-by: Madadi Vineeth Reddy Suggested-by: Shrikanth Hegde Suggested-by: Tingyin Duan Co-developed-by: Tim Chen Signed-off-by: Tim Chen Signed-off-by: Chen Yu --- Notes: v1->v2: Remove the smt_nr check in fits_llc_capacity(). 
(Aaron Lu) include/linux/sched.h | 4 ++- kernel/sched/debug.c | 62 ++++++++++++++++++++++++++++++++++++++++ kernel/sched/fair.c | 63 ++++++++++++++++++++++++++++++++++++----- kernel/sched/sched.h | 5 ++++ kernel/sched/topology.c | 54 +++++++++++++++++++++++++++++++++-- 5 files changed, 178 insertions(+), 10 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 466ba8b7398c..95bf080bbbf0 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2436,9 +2436,11 @@ extern void migrate_enable(void); DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable()) =20 #ifdef CONFIG_SCHED_CACHE +DECLARE_STATIC_KEY_FALSE(sched_cache_on); + static inline bool sched_cache_enabled(void) { - return false; + return static_branch_unlikely(&sched_cache_on); } #endif =20 diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 02e16b70a790..cde324672103 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -169,6 +169,53 @@ static const struct file_operations sched_feat_fops = =3D { .release =3D single_release, }; =20 +#ifdef CONFIG_SCHED_CACHE +#define SCHED_CACHE_CREATE_CONTROL(name, max) \ +static ssize_t sched_cache_write_##name(struct file *filp, \ + const char __user *ubuf, \ + size_t cnt, loff_t *ppos) \ +{ \ + char buf[16]; \ + unsigned int val; \ + if (cnt > 15) \ + cnt =3D 15; \ + if (copy_from_user(&buf, ubuf, cnt)) \ + return -EFAULT; \ + buf[cnt] =3D '\0'; \ + if (kstrtouint(buf, 10, &val)) \ + return -EINVAL; \ + if (val > (max)) \ + return -EINVAL; \ + llc_##name =3D val; \ + if (!strcmp(#name, "enabled")) \ + sched_cache_set(false); \ + *ppos +=3D cnt; \ + return cnt; \ +} \ +static int sched_cache_show_##name(struct seq_file *m, void *v) \ +{ \ + seq_printf(m, "%d\n", llc_##name); \ + return 0; \ +} \ +static int sched_cache_open_##name(struct inode *inode, \ + struct file *filp) \ +{ \ + return single_open(filp, sched_cache_show_##name, NULL); \ +} \ +static const struct file_operations sched_cache_fops_##name =3D { \ + .open =3D sched_cache_open_##name, \ + .write =3D sched_cache_write_##name, \ + .read =3D seq_read, \ + .llseek =3D seq_lseek, \ + .release =3D single_release, \ +} + +SCHED_CACHE_CREATE_CONTROL(overload_pct, 100); +SCHED_CACHE_CREATE_CONTROL(imb_pct, 100); +SCHED_CACHE_CREATE_CONTROL(aggr_tolerance, 100); +SCHED_CACHE_CREATE_CONTROL(enabled, 1); +#endif /* SCHED_CACHE */ + static ssize_t sched_scaling_write(struct file *filp, const char __user *u= buf, size_t cnt, loff_t *ppos) { @@ -523,6 +570,21 @@ static __init int sched_init_debug(void) debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing= _hot_threshold); #endif /* CONFIG_NUMA_BALANCING */ =20 +#ifdef CONFIG_SCHED_CACHE + debugfs_create_file("llc_overload_pct", 0644, debugfs_sched, NULL, + &sched_cache_fops_overload_pct); + debugfs_create_file("llc_imb_pct", 0644, debugfs_sched, NULL, + &sched_cache_fops_imb_pct); + debugfs_create_file("llc_aggr_tolerance", 0644, debugfs_sched, NULL, + &sched_cache_fops_aggr_tolerance); + debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL, + &sched_cache_fops_enabled); + debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched, + &llc_epoch_period); + debugfs_create_u32("llc_epoch_affinity_timeout", 0644, debugfs_sched, + &llc_epoch_affinity_timeout); +#endif + debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops= ); =20 debugfs_fair_server_init(); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 424ec601cfdf..a2e2d6742481 100644 --- a/kernel/sched/fair.c +++ 
b/kernel/sched/fair.c @@ -1207,6 +1207,9 @@ static s64 update_se(struct rq *rq, struct sched_enti= ty *se) =20 __read_mostly unsigned int llc_overload_pct =3D 50; __read_mostly unsigned int llc_imb_pct =3D 20; +__read_mostly unsigned int llc_aggr_tolerance =3D 1; +__read_mostly unsigned int llc_epoch_period =3D EPOCH_PERIOD; +__read_mostly unsigned int llc_epoch_affinity_timeout =3D EPOCH_LLC_AFFINI= TY_TIMEOUT; =20 static int llc_id(int cpu) { @@ -1223,11 +1226,22 @@ static int llc_id(int cpu) return llc; } =20 +static inline int get_sched_cache_scale(int mul) +{ + if (!llc_aggr_tolerance) + return 0; + + if (llc_aggr_tolerance =3D=3D 100) + return INT_MAX; + + return (1 + (llc_aggr_tolerance - 1) * mul); +} + static bool exceed_llc_capacity(struct mm_struct *mm, int cpu) { + unsigned int llc, scale; struct cacheinfo *ci; unsigned long rss; - unsigned int llc; =20 /* * get_cpu_cacheinfo_level() can not be used @@ -1252,19 +1266,54 @@ static bool exceed_llc_capacity(struct mm_struct *m= m, int cpu) rss =3D get_mm_counter(mm, MM_ANONPAGES) + get_mm_counter(mm, MM_SHMEMPAGES); =20 - return (llc <=3D (rss * PAGE_SIZE)); + /* + * Scale the LLC size by 256*llc_aggr_tolerance + * and compare it to the task's RSS size. + * + * Suppose the L3 size is 32MB. If the + * llc_aggr_tolerance is 1: + * When the RSS is larger than 32MB, the process + * is regarded as exceeding the LLC capacity. If + * the llc_aggr_tolerance is 99: + * When the RSS is larger than 784GB, the process + * is regarded as exceeding the LLC capacity because: + * 784GB =3D (1 + (99 - 1) * 256) * 32MB + */ + scale =3D get_sched_cache_scale(256); + if (scale =3D=3D INT_MAX) + return false; + + return ((llc * scale) <=3D (rss * PAGE_SIZE)); } =20 static bool exceed_llc_nr(struct mm_struct *mm, int cpu) { - int smt_nr =3D 1; + int smt_nr =3D 1, scale; =20 #ifdef CONFIG_SCHED_SMT if (sched_smt_active()) smt_nr =3D cpumask_weight(cpu_smt_mask(cpu)); #endif + /* + * Scale the Core number in a LLC by llc_aggr_tolerance + * and compare it to the task's active threads. + * + * Suppose the number of Cores in LLC is 8. + * Every core has 2 SMTs. + * If the llc_aggr_tolerance is 1: When the + * nr_running is larger than 8, the process + * is regarded as exceeding the LLC capacity. + * If the llc_aggr_tolerance is 99: + * When the nr_running is larger than 785, + * the process is regarded as exceeding + * the LLC capacity: + * 785 =3D 1 + (99 - 1) * 8 + */ + scale =3D get_sched_cache_scale(1); + if (scale =3D=3D INT_MAX) + return false; =20 - return ((mm->nr_running_avg * smt_nr) > per_cpu(sd_llc_size, cpu)); + return ((mm->nr_running_avg * smt_nr) > (scale * per_cpu(sd_llc_size, cpu= ))); } =20 static void account_llc_enqueue(struct rq *rq, struct task_struct *p) @@ -1350,9 +1399,9 @@ static inline void __update_mm_sched(struct rq *rq, s= truct mm_sched *pcpu_sched) long delta =3D now - rq->cpu_epoch_next; =20 if (delta > 0) { - n =3D (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD; + n =3D (delta + llc_epoch_period - 1) / llc_epoch_period; rq->cpu_epoch +=3D n; - rq->cpu_epoch_next +=3D n * EPOCH_PERIOD; + rq->cpu_epoch_next +=3D n * llc_epoch_period; __shr_u64(&rq->cpu_runtime, n); } =20 @@ -1412,7 +1461,7 @@ void account_mm_sched(struct rq *rq, struct task_stru= ct *p, s64 delta_exec) * has only 1 thread, or has too many active threads, invalidate * its preferred state. 
*/ - if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT || + if (epoch - READ_ONCE(mm->mm_sched_epoch) > llc_epoch_affinity_timeout || get_nr_threads(p) <=3D 1 || exceed_llc_nr(mm, cpu_of(rq)) || exceed_llc_capacity(mm, cpu_of(rq))) { diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 40798a06e058..15d126bd3728 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2852,6 +2852,11 @@ extern unsigned int sysctl_numa_balancing_hot_thresh= old; #ifdef CONFIG_SCHED_CACHE extern unsigned int llc_overload_pct; extern unsigned int llc_imb_pct; +extern unsigned int llc_aggr_tolerance; +extern unsigned int llc_epoch_period; +extern unsigned int llc_epoch_affinity_timeout; +extern unsigned int llc_enabled; +void sched_cache_set(bool locked); #endif =20 #ifdef CONFIG_SCHED_HRTICK diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 9799e3a9a609..818599ddaaef 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -26,6 +26,49 @@ int max_llcs; =20 static bool sched_cache_present; =20 +unsigned int llc_enabled =3D 1; +DEFINE_STATIC_KEY_FALSE(sched_cache_on); + +/* + * Enable/disable cache aware scheduling according to + * user input and the presence of hardware support. + */ +static void _sched_cache_set(bool enable, bool locked) +{ + if (enable) { + if (locked) + static_branch_enable_cpuslocked(&sched_cache_on); + else + static_branch_enable(&sched_cache_on); + } else { + if (locked) + static_branch_disable_cpuslocked(&sched_cache_on); + else + static_branch_disable(&sched_cache_on); + } +} + +void sched_cache_set(bool locked) +{ + /* hardware does not support */ + if (!sched_cache_present) { + if (static_branch_likely(&sched_cache_on)) + _sched_cache_set(false, locked); + + return; + } + + /* user wants it or not ?*/ + if (llc_enabled) { + if (!static_branch_likely(&sched_cache_on)) + _sched_cache_set(true, locked); + + } else { + if (static_branch_likely(&sched_cache_on)) + _sched_cache_set(false, locked); + } +} + static unsigned int *alloc_new_pref_llcs(unsigned int *old, unsigned int *= *gc) { unsigned int *new =3D NULL; @@ -70,8 +113,12 @@ static int resize_llc_pref(bool has_multi_llcs) * new buffer. 
*/ tmp_llc_pref =3D alloc_percpu_noprof(unsigned int *); - if (!tmp_llc_pref) - return -ENOMEM; + if (!tmp_llc_pref) { + sched_cache_present =3D false; + ret =3D -ENOMEM; + + goto out; + } =20 for_each_present_cpu(i) *per_cpu_ptr(tmp_llc_pref, i) =3D NULL; @@ -89,6 +136,7 @@ static int resize_llc_pref(bool has_multi_llcs) new =3D alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i= )); if (!new) { ret =3D -ENOMEM; + sched_cache_present =3D false; =20 goto release_old; } @@ -126,6 +174,8 @@ static int resize_llc_pref(bool has_multi_llcs) if (!ret) max_llcs =3D new_max_llcs; =20 +out: + sched_cache_set(true); return ret; } =20 --=20 2.32.0 From nobody Fri Dec 19 19:37:43 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D10542F12DD for ; Wed, 3 Dec 2025 23:01:53 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.11 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802915; cv=none; b=AmWzQXbFY2sN5heLcp4s9rWoLO7pjURsg464nsA8jjoqA5nJagwpJv9G+UJULof1tTaFgz2GmAr0hHkABofj6ydnfXE2fd4hRRYb7GE+M+4gERnZr5wAJOQw/zTEmxBeWSSE5iNgbAWmM054GBUn6MCdpITYzuKbb1BP7b3L3sk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764802915; c=relaxed/simple; bh=heewblv8+VUSifHzkX3W2P+i26TuBbpse5E1oodIHu8=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=Xpw0JLbgiphYf3Sab645eHcm9Luo+Mx2FuFXrjcPsXJxnYfglU5zHbY1C3nGcYUTlQht3caQEhhC7tRceDrXIkNZHUg5zn5pvhgic99RbM9RtmxCAUWRJKHEvQHILxmwPExiCxVB0m/pqwl8+stVV67Gqhqd6Lhw1hT41ldDB9s= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=Cxg4oTl4; arc=none smtp.client-ip=198.175.65.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="Cxg4oTl4" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1764802913; x=1796338913; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=heewblv8+VUSifHzkX3W2P+i26TuBbpse5E1oodIHu8=; b=Cxg4oTl4FXmQXHOKywDD1PXh0TwFbaiKduxzegiGnyiEGbaHQGeStB45 heDXhCr5sdgqIhbxUFp1vM0glTwn0l4/6ZiEL/dgHN9LNlGjaYsII9jc1 2qGZ9JRhqrUWqdc8Jm6fWF0Wuz16A6ncwR05z1/osHOGjbKNCnVNF9Y0l 4FSdn5Pg7wz/0mo5Tfd9kz21TLqYSS8tlCVsn5MnhfbvMVKYOtZOb0WKR 3KiZKcH2I7DsvpgO/euP9zAwOTpRdP8eIGES5K1LCg7I6oiUiavAKbHWR nP3xATAIJENhZb+rdETusA0Fs1MIUcnKK88Vr8NJIw3yCIQUWh4CdT9qz A==; X-CSE-ConnectionGUID: 32GrqILbQRmayJvwXZJ8Bg== X-CSE-MsgGUID: F77VqvbYTl+X7/J43cXgzw== X-IronPort-AV: E=McAfee;i="6800,10657,11631"; a="77136713" X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="77136713" Received: from fmviesa004.fm.intel.com ([10.60.135.144]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Dec 2025 15:01:53 -0800 X-CSE-ConnectionGUID: U2qPsSdSRnej2tO9wNUi2g== X-CSE-MsgGUID: YiZKZfnpSMaNZy2O6pI9Xg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.20,247,1758610800"; d="scan'208";a="199763990" Received: from 
b04f130c83f2.jf.intel.com ([10.165.154.98]) by fmviesa004.fm.intel.com with ESMTP; 03 Dec 2025 15:01:52 -0800 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 21/23] -- DO NOT APPLY!!! -- sched/cache/stats: Add schedstat for cache aware load balancing Date: Wed, 3 Dec 2025 15:07:40 -0800 Message-Id: <71b94a7547f7843230270e20b84ecb0a540ab604.1764801860.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu Debug patch only. With cache-aware load balancing enabled, statistics related to its activity are exposed via /proc/schedstat and debugfs. For instance, if users want to verify metrics like the number of exceeding RSS and nr_running limits, they can filter the output of /sys/kernel/debug/sched/debug and compute the requ= ired statistics manually: llc_exceed_cap SUM: 6 llc_exceed_nr SUM: 4531 Furthermore, these statistics exposed in /proc/schedstats can be queried ma= nually or via perf sched stats[1] with minor modifications. Link: https://lore.kernel.org/all/20250909114227.58802-1-swapnil.sapkal@amd= .com #1 Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- include/linux/sched/topology.h | 1 + kernel/sched/fair.c | 1 + kernel/sched/stats.c | 5 +++-- 3 files changed, 5 insertions(+), 2 deletions(-) diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index 0ba4697d74ba..8702c1e731a0 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -108,6 +108,7 @@ struct sched_domain { unsigned int lb_imbalance_util[CPU_MAX_IDLE_TYPES]; unsigned int lb_imbalance_task[CPU_MAX_IDLE_TYPES]; unsigned int lb_imbalance_misfit[CPU_MAX_IDLE_TYPES]; + unsigned int lb_imbalance_llc[CPU_MAX_IDLE_TYPES]; unsigned int lb_gained[CPU_MAX_IDLE_TYPES]; unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES]; unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES]; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a2e2d6742481..742e455b093e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -12684,6 +12684,7 @@ static void update_lb_imbalance_stat(struct lb_env = *env, struct sched_domain *sd __schedstat_add(sd->lb_imbalance_misfit[idle], env->imbalance); break; case migrate_llc_task: + __schedstat_add(sd->lb_imbalance_llc[idle], env->imbalance); break; } } diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c index d1c9429a4ac5..3736f6102261 100644 --- a/kernel/sched/stats.c +++ b/kernel/sched/stats.c @@ -104,7 +104,7 @@ void __update_stats_enqueue_sleeper(struct rq *rq, stru= ct task_struct *p, * Bump this up when changing the output format or the meaning of an exist= ing * format, so that tools can adapt (or abort) */ -#define SCHEDSTAT_VERSION 17 +#define SCHEDSTAT_VERSION 18 =20 static int show_schedstat(struct seq_file *seq, void *v) { @@ -139,7 +139,7 @@ static int show_schedstat(struct seq_file *seq, void *v) seq_printf(seq, "domain%d %s %*pb", dcount++, sd->name, 
 			   cpumask_pr_args(sched_domain_span(sd)));
 		for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) {
-			seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u",
+			seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u %u",
 				   sd->lb_count[itype],
 				   sd->lb_balanced[itype],
 				   sd->lb_failed[itype],
@@ -147,6 +147,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
 				   sd->lb_imbalance_util[itype],
 				   sd->lb_imbalance_task[itype],
 				   sd->lb_imbalance_misfit[itype],
+				   sd->lb_imbalance_llc[itype],
 				   sd->lb_gained[itype],
 				   sd->lb_hot_gained[itype],
 				   sd->lb_nobusyq[itype],
-- 
2.32.0
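As an illustration only (not part of this series), the new per-idle-type
lb_imbalance_llc counter can be totalled from /proc/schedstat with a small
user-space helper such as the sketch below. It assumes the SCHEDSTAT_VERSION 18
layout produced by show_schedstat() above: on each "domainN" line, after the
domain number, name and cpumask, every idle type contributes 12 load-balance
counters, of which lb_imbalance_llc is the 8th. IDLE_TYPES mirrors
CPU_MAX_IDLE_TYPES on current kernels and is an assumption here.

/*
 * Illustrative helper, not part of this series: total the lb_imbalance_llc
 * counters from /proc/schedstat, assuming the version-18 field order above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define IDLE_TYPES	3	/* assumed value of CPU_MAX_IDLE_TYPES */
#define LB_FIELDS	12	/* per-idle-type load-balance counters in v18 */

int main(void)
{
	char line[4096];
	unsigned long long sum = 0;
	FILE *fp = fopen("/proc/schedstat", "r");

	if (!fp) {
		perror("/proc/schedstat");
		return 1;
	}

	while (fgets(line, sizeof(line), fp)) {
		char *tok;
		int field = 0;

		if (strncmp(line, "domain", 6))
			continue;	/* only domain lines carry these counters */

		for (tok = strtok(line, " \n"); tok;
		     tok = strtok(NULL, " \n"), field++) {
			if (field < 3)	/* skip "domainN", name, cpumask */
				continue;
			if (field >= 3 + IDLE_TYPES * LB_FIELDS)
				break;	/* remaining domain-wide stats */
			if ((field - 3) % LB_FIELDS == 7)
				sum += strtoull(tok, NULL, 10);
		}
	}
	fclose(fp);

	printf("lb_imbalance_llc SUM: %llu\n", sum);
	return 0;
}

The printed total is the sum across all CPUs, domains and idle types, in the
same "SUM" style as the changelog above.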
From nobody Fri Dec 19 19:37:43 2025
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy", Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde,
    Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Vern Hao, Len Brown,
    Tim Chen, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen,
    linux-kernel@vger.kernel.org
Subject: [PATCH v2 22/23] -- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load balance statistics
Date: Wed, 3 Dec 2025 15:07:41 -0800
Message-Id: <445303c70d8d464c35c97f33d4be7b752e8db5ae.1764801860.git.tim.c.chen@linux.intel.com>

From: Chen Yu

Debug patch only.

This trace event can be used (via bpftrace, etc.) to monitor cache-aware
load-balancing activity: whether tasks are moved to their preferred LLC
or moved out of it.

Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
 include/trace/events/sched.h | 31 +++++++++++++++++++++++++++++++
 kernel/sched/fair.c          | 10 ++++++++++
 2 files changed, 41 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 7b2645b50e78..bd03f49f7e3c 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -10,6 +10,37 @@
 #include 
 #include 
 
+TRACE_EVENT(sched_attach_task,
+
+	TP_PROTO(struct task_struct *t, int pref_cpu, int pref_llc,
+		 int attach_cpu, int attach_llc),
+
+	TP_ARGS(t, pref_cpu, pref_llc, attach_cpu, attach_llc),
+
+	TP_STRUCT__entry(
+		__array( char,  comm,       TASK_COMM_LEN )
+		__field( pid_t, pid )
+		__field( int,   pref_cpu )
+		__field( int,   pref_llc )
+		__field( int,   attach_cpu )
+		__field( int,   attach_llc )
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		__entry->pid        = t->pid;
+		__entry->pref_cpu   = pref_cpu;
+		__entry->pref_llc   = pref_llc;
+		__entry->attach_cpu = attach_cpu;
+		__entry->attach_llc = attach_llc;
+	),
+
+	TP_printk("comm=%s pid=%d pref_cpu=%d pref_llc=%d attach_cpu=%d attach_llc=%d",
+		  __entry->comm, __entry->pid,
+		  __entry->pref_cpu, __entry->pref_llc,
+		  __entry->attach_cpu, __entry->attach_llc)
+);
+
 /*
  * Tracepoint for calling kthread_stop, performed to end a kthread:
  */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 742e455b093e..e47b4096f0a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10487,6 +10487,16 @@ static void attach_task(struct rq *rq, struct task_struct *p)
 {
 	lockdep_assert_rq_held(rq);
 
+#ifdef CONFIG_SCHED_CACHE
+	if (p->mm) {
+		int pref_cpu = p->mm->mm_sched_cpu;
+
+		trace_sched_attach_task(p,
+					pref_cpu,
+					pref_cpu != -1 ? llc_id(pref_cpu) : -1,
+					cpu_of(rq), llc_id(cpu_of(rq)));
+	}
+#endif
 	WARN_ON_ONCE(task_rq(p) != rq);
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
 	wakeup_preempt(rq, p, 0);
-- 
2.32.0
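For reference, a record emitted by this event would look roughly like the
(made-up) line below, following the TP_printk() format above; an attach_llc
that differs from pref_llc indicates the task was attached outside its
preferred LLC:

  sched_attach_task: comm=schbench pid=1234 pref_cpu=2 pref_llc=0 attach_cpu=8 attach_llc=1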
Shenoy" , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , linux-kernel@vger.kernel.org Subject: [PATCH v2 23/23] -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy for each process via proc fs Date: Wed, 3 Dec 2025 15:07:42 -0800 Message-Id: <0eaf9b9f89f0d97dbf46b760421f65aee3ffe063.1764801860.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu Debug patch only. Show the per-LLC occupancy in /proc/{PID}/schedstat, with each column corresponding to one LLC. This can be used to verify if the cache-aware load balancer works as expected by aggregating threads onto dedicated LLCs. Suppose there are 2 LLCs and the sampling duration is 10 seconds: Enable the cache aware load balance: 0 12281 <--- LLC0 residency delta is 0, LLC1 is 12 seconds 0 18881 0 16217 disable the cache aware load balance: 6497 15802 9299 5435 17811 8278 Signed-off-by: Chen Yu Signed-off-by: Tim Chen --- fs/proc/base.c | 22 ++++++++++++++++++++++ include/linux/mm_types.h | 19 +++++++++++++++++-- include/linux/sched.h | 3 +++ kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++++++++-- 4 files changed, 80 insertions(+), 4 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 6299878e3d97..f4be96f4bd01 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -518,6 +518,28 @@ static int proc_pid_schedstat(struct seq_file *m, stru= ct pid_namespace *ns, (unsigned long long)task->se.sum_exec_runtime, (unsigned long long)task->sched_info.run_delay, task->sched_info.pcount); +#ifdef CONFIG_SCHED_CACHE + if (sched_cache_enabled()) { + struct mm_struct *mm =3D task->mm; + u64 *llc_runtime; + + if (!mm) + return 0; + + llc_runtime =3D kcalloc(max_llcs, sizeof(u64), GFP_KERNEL); + if (!llc_runtime) + return 0; + + if (get_mm_per_llc_runtime(task, llc_runtime)) + goto out; + + for (int i =3D 0; i < max_llcs; i++) + seq_printf(m, "%llu ", llc_runtime[i]); + seq_puts(m, "\n"); +out: + kfree(llc_runtime); + } +#endif =20 return 0; } diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 04743983de4d..255c22be7312 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -944,6 +944,10 @@ struct mm_sched { unsigned long epoch; }; =20 +struct mm_time { + u64 runtime_ns; +}; + struct kioctx_table; struct iommu_mm_data; struct mm_struct { @@ -1040,6 +1044,7 @@ struct mm_struct { * See account_mm_sched() and ... 
 	 */
 	struct mm_sched __percpu *pcpu_sched;
+	struct mm_time __percpu *pcpu_time;
 	raw_spinlock_t mm_sched_lock;
 	unsigned long mm_sched_epoch;
 	int mm_sched_cpu;
@@ -1505,16 +1510,24 @@ static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumas
 #endif /* CONFIG_SCHED_MM_CID */
 
 #ifdef CONFIG_SCHED_CACHE
-void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *pcpu_sched);
+void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *pcpu_sched,
+		   struct mm_time __percpu *pcpu_time);
 
 static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
 {
 	struct mm_sched __percpu *pcpu_sched = alloc_percpu_noprof(struct mm_sched);
+	struct mm_time __percpu *pcpu_time;
 
 	if (!pcpu_sched)
 		return -ENOMEM;
 
-	mm_init_sched(mm, pcpu_sched);
+	pcpu_time = alloc_percpu_noprof(struct mm_time);
+	if (!pcpu_time) {
+		free_percpu(mm->pcpu_sched);
+		return -ENOMEM;
+	}
+
+	mm_init_sched(mm, pcpu_sched, pcpu_time);
 	return 0;
 }
 
@@ -1523,7 +1536,9 @@ static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
 static inline void mm_destroy_sched(struct mm_struct *mm)
 {
 	free_percpu(mm->pcpu_sched);
+	free_percpu(mm->pcpu_time);
 	mm->pcpu_sched = NULL;
+	mm->pcpu_time = NULL;
 }
 #else /* !CONFIG_SCHED_CACHE */
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 95bf080bbbf0..875ac3f4208b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2442,6 +2442,9 @@ static inline bool sched_cache_enabled(void)
 {
 	return static_branch_unlikely(&sched_cache_on);
 }
+
+int get_mm_per_llc_runtime(struct task_struct *p, u64 *buf);
+extern int max_llcs;
 #endif
 
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e47b4096f0a6..205208f061bb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1355,16 +1355,19 @@ static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
 	p->sched_llc_active = false;
 }
 
-void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
+void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched,
+		   struct mm_time __percpu *_pcpu_time)
 {
 	unsigned long epoch;
 	int i;
 
 	for_each_possible_cpu(i) {
 		struct mm_sched *pcpu_sched = per_cpu_ptr(_pcpu_sched, i);
+		struct mm_time *pcpu_time = per_cpu_ptr(_pcpu_time, i);
 		struct rq *rq = cpu_rq(i);
 
 		pcpu_sched->runtime = 0;
+		pcpu_time->runtime_ns = 0;
 		pcpu_sched->epoch = rq->cpu_epoch;
 		epoch = rq->cpu_epoch;
 	}
@@ -1379,6 +1382,8 @@ void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
 	 * the readers may get invalid mm_sched_epoch, etc.
 	 */
 	smp_store_release(&mm->pcpu_sched, _pcpu_sched);
+	/* same as above */
+	smp_store_release(&mm->pcpu_time, _pcpu_time);
 }
 
 /* because why would C be fully specified */
@@ -1428,11 +1433,39 @@ static unsigned long __no_profile fraction_mm_sched(struct rq *rq, struct mm_sch
 
 static unsigned int task_running_on_cpu(int cpu, struct task_struct *p);
 
+/* p->pi_lock is held */
+int get_mm_per_llc_runtime(struct task_struct *p, u64 *buf)
+{
+	struct mm_struct *mm = p->mm;
+	struct mm_time *pcpu_time;
+	int cpu;
+
+	if (!mm)
+		return -EINVAL;
+
+	rcu_read_lock();
+	for_each_online_cpu(cpu) {
+		int llc = llc_id(cpu);
+		u64 runtime_ms;
+
+		if (llc < 0)
+			continue;
+
+		pcpu_time = per_cpu_ptr(mm->pcpu_time, cpu);
+		runtime_ms = div_u64(pcpu_time->runtime_ns, NSEC_PER_MSEC);
+		buf[llc] += runtime_ms;
+	}
+	rcu_read_unlock();
+
+	return 0;
+}
+
 static inline
 void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 {
 	struct mm_struct *mm = p->mm;
 	struct mm_sched *pcpu_sched;
+	struct mm_time *pcpu_time;
 	unsigned long epoch;
 	int mm_sched_llc = -1;
 
@@ -1444,14 +1477,17 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	/*
 	 * init_task and kthreads don't having mm
 	 */
-	if (!mm || !mm->pcpu_sched)
+	if (!mm || !mm->pcpu_sched || !mm->pcpu_time)
 		return;
 
 	pcpu_sched = per_cpu_ptr(p->mm->pcpu_sched, cpu_of(rq));
+	pcpu_time  = per_cpu_ptr(p->mm->pcpu_time, cpu_of(rq));
 
 	scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
 		__update_mm_sched(rq, pcpu_sched);
 		pcpu_sched->runtime += delta_exec;
+		/* pure runtime without decay */
+		pcpu_time->runtime_ns += delta_exec;
 		rq->cpu_runtime += delta_exec;
 		epoch = rq->cpu_epoch;
 	}
-- 
2.32.0
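As an illustration only (not part of this series), the per-LLC residency
deltas quoted in the changelog above can be sampled with a small user-space
helper such as the sketch below. It assumes the per-LLC runtime values appear
as a second line of /proc/<pid>/schedstat, as added by this patch, and that
cache-aware scheduling is enabled; MAX_LLCS is an arbitrary bound chosen for
the example.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_LLCS 64	/* arbitrary upper bound for this example */

/* Read the per-LLC runtime line (second line of /proc/<pid>/schedstat). */
static int read_llc_ms(const char *path, unsigned long long *v, int max)
{
	char buf[1024];
	int n = 0;
	FILE *fp = fopen(path, "r");

	if (!fp)
		return -1;
	/* Line 1: sum_exec_runtime, run_delay, pcount.  Line 2: per-LLC ms. */
	if (!fgets(buf, sizeof(buf), fp) || !fgets(buf, sizeof(buf), fp)) {
		fclose(fp);
		return -1;
	}
	fclose(fp);

	for (char *p = strtok(buf, " \n"); p && n < max; p = strtok(NULL, " \n"))
		v[n++] = strtoull(p, NULL, 10);
	return n;
}

int main(int argc, char **argv)
{
	unsigned long long a[MAX_LLCS] = { 0 }, b[MAX_LLCS] = { 0 };
	char path[64];
	int n1, n2;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s/schedstat", argv[1]);

	n1 = read_llc_ms(path, a, MAX_LLCS);
	sleep(10);		/* sampling duration, as in the changelog */
	n2 = read_llc_ms(path, b, MAX_LLCS);

	if (n1 < 0 || n2 < 0) {
		fprintf(stderr, "no per-LLC line in %s\n", path);
		return 1;
	}

	for (int i = 0; i < n2; i++)
		printf("LLC%d: %llu ms\n", i, b[i] - a[i]);
	return 0;
}

Invoked with a PID, it prints one runtime delta per LLC; with cache-aware
load balancing enabled the deltas should concentrate on a single LLC,
matching the enabled case quoted in the changelog.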