From nobody Sun Feb 8 17:30:11 2026
Received: from mgamail.intel.com by smtp.subspace.kernel.org (Postfix) with ESMTPS; Mon, 21 Apr 2025 03:30:08 +0000 (UTC)
From: Chen Yu 
To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . 
Shenoy" Cc: Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Tim Chen , Vincent Guittot , Libo Chen , Abel Wu , Madadi Vineeth Reddy , Hillf Danton , linux-kernel@vger.kernel.org Subject: [RFC PATCH 1/5] sched: Cache aware load-balancing Date: Mon, 21 Apr 2025 11:24:26 +0800 Message-Id: <391c48836585786ed32d66df9534366459684383.1745199017.git.yu.c.chen@intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Peter Zijlstra Hi all, One of the many things on the eternal todo list has been finishing the below hackery. It is an attempt at modelling cache affinity -- and while the patch really only targets LLC, it could very well be extended to also apply to clusters (L2). Specifically any case of multiple cache domains inside a node. Anyway, I wrote this about a year ago, and I mentioned this at the recent OSPM conf where Gautham and Prateek expressed interest in playing with this code. So here goes, very rough and largely unproven code ahead :-) It applies to current tip/master, but I know it will fail the __percpu validation that sits in -next, although that shouldn't be terribly hard to fix up. As is, it only computes a CPU inside the LLC that has the highest recent runtime, this CPU is then used in the wake-up path to steer towards this LLC and in task_hot() to limit migrations away from it. More elaborate things could be done, notably there is an XXX in there somewhere about finding the best LLC inside a NODE (interaction with NUMA_BALANCING). Signed-off-by: Peter Zijlstra (Intel) --- include/linux/mm_types.h | 44 ++++++ include/linux/sched.h | 4 + init/Kconfig | 4 + kernel/fork.c | 5 + kernel/sched/core.c | 13 +- kernel/sched/fair.c | 330 +++++++++++++++++++++++++++++++++++++-- kernel/sched/sched.h | 8 + 7 files changed, 388 insertions(+), 20 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 56d07edd01f9..013291c6aaa2 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -893,6 +893,12 @@ struct mm_cid { }; #endif =20 +struct mm_sched { + u64 runtime; + unsigned long epoch; + unsigned long occ; +}; + struct kioctx_table; struct iommu_mm_data; struct mm_struct { @@ -983,6 +989,17 @@ struct mm_struct { */ raw_spinlock_t cpus_allowed_lock; #endif +#ifdef CONFIG_SCHED_CACHE + /* + * Track per-cpu-per-process occupancy as a proxy for cache residency. + * See account_mm_sched() and ... + */ + struct mm_sched __percpu *pcpu_sched; + raw_spinlock_t mm_sched_lock; + unsigned long mm_sched_epoch; + int mm_sched_cpu; +#endif + #ifdef CONFIG_MMU atomic_long_t pgtables_bytes; /* size of all page tables */ #endif @@ -1393,6 +1410,33 @@ static inline unsigned int mm_cid_size(void) static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct = cpumask *cpumask) { } #endif /* CONFIG_SCHED_MM_CID */ =20 +#ifdef CONFIG_SCHED_CACHE +extern void mm_init_sched(struct mm_struct *mm, struct mm_sched *pcpu_sche= d); + +static inline int mm_alloc_sched_noprof(struct mm_struct *mm) +{ + struct mm_sched *pcpu_sched =3D alloc_percpu_noprof(struct mm_sched); + if (!pcpu_sched) + return -ENOMEM; + + mm_init_sched(mm, pcpu_sched); + return 0; +} + +#define mm_alloc_sched(...) 
alloc_hooks(mm_alloc_sched_noprof(__VA_ARGS__)) + +static inline void mm_destroy_sched(struct mm_struct *mm) +{ + free_percpu(mm->pcpu_sched); + mm->pcpu_sched =3D NULL; +} +#else /* !CONFIG_SCHED_CACHE */ + +static inline int mm_alloc_sched(struct mm_struct *mm) { return 0; } +static inline void mm_destroy_sched(struct mm_struct *mm) { } + +#endif /* CONFIG_SCHED_CACHE */ + struct mmu_gather; extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm); extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct= *mm); diff --git a/include/linux/sched.h b/include/linux/sched.h index f96ac1982893..d0e4cda2b3cd 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1399,6 +1399,10 @@ struct task_struct { unsigned long numa_pages_migrated; #endif /* CONFIG_NUMA_BALANCING */ =20 +#ifdef CONFIG_SCHED_CACHE + struct callback_head cache_work; +#endif + #ifdef CONFIG_RSEQ struct rseq __user *rseq; u32 rseq_len; diff --git a/init/Kconfig b/init/Kconfig index b2c045c71d7f..7e0104efd138 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -950,6 +950,10 @@ config NUMA_BALANCING =20 This system will be inactive on UMA systems. =20 +config SCHED_CACHE + bool "Cache aware scheduler" + default y + config NUMA_BALANCING_DEFAULT_ENABLED bool "Automatically enable NUMA aware memory/task placement" default y diff --git a/kernel/fork.c b/kernel/fork.c index c4b26cd8998b..974869841e62 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1331,6 +1331,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm= , struct task_struct *p, if (mm_alloc_cid(mm, p)) goto fail_cid; =20 + if (mm_alloc_sched(mm)) + goto fail_sched; + if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT, NR_MM_COUNTERS)) goto fail_pcpu; @@ -1340,6 +1343,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm= , struct task_struct *p, return mm; =20 fail_pcpu: + mm_destroy_sched(mm); +fail_sched: mm_destroy_cid(mm); fail_cid: destroy_context(mm); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 79692f85643f..5a92c02df97b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4524,6 +4524,7 @@ static void __sched_fork(unsigned long clone_flags, s= truct task_struct *p) p->migration_pending =3D NULL; #endif init_sched_mm_cid(p); + init_sched_mm(p); } =20 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing); @@ -8528,6 +8529,7 @@ static struct kmem_cache *task_group_cache __ro_after= _init; =20 void __init sched_init(void) { + unsigned long now =3D jiffies; unsigned long ptr =3D 0; int i; =20 @@ -8602,7 +8604,7 @@ void __init sched_init(void) raw_spin_lock_init(&rq->__lock); rq->nr_running =3D 0; rq->calc_load_active =3D 0; - rq->calc_load_update =3D jiffies + LOAD_FREQ; + rq->calc_load_update =3D now + LOAD_FREQ; init_cfs_rq(&rq->cfs); init_rt_rq(&rq->rt); init_dl_rq(&rq->dl); @@ -8646,7 +8648,7 @@ void __init sched_init(void) rq->cpu_capacity =3D SCHED_CAPACITY_SCALE; rq->balance_callback =3D &balance_push_callback; rq->active_balance =3D 0; - rq->next_balance =3D jiffies; + rq->next_balance =3D now; rq->push_cpu =3D 0; rq->cpu =3D i; rq->online =3D 0; @@ -8658,7 +8660,7 @@ void __init sched_init(void) =20 rq_attach_root(rq, &def_root_domain); #ifdef CONFIG_NO_HZ_COMMON - rq->last_blocked_load_update_tick =3D jiffies; + rq->last_blocked_load_update_tick =3D now; atomic_set(&rq->nohz_flags, 0); =20 INIT_CSD(&rq->nohz_csd, nohz_csd_func, rq); @@ -8683,6 +8685,11 @@ void __init sched_init(void) =20 rq->core_cookie =3D 0UL; #endif +#ifdef CONFIG_SCHED_CACHE + 
raw_spin_lock_init(&rq->cpu_epoch_lock); + rq->cpu_epoch_next =3D now; +#endif + zalloc_cpumask_var_node(&rq->scratch_mask, GFP_KERNEL, cpu_to_node(i)); } =20 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 5e1bd9e8464c..23ea35dbd381 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1166,10 +1166,229 @@ static s64 update_curr_se(struct rq *rq, struct sc= hed_entity *curr) return delta_exec; } =20 -static inline void update_curr_task(struct task_struct *p, s64 delta_exec) +#ifdef CONFIG_SCHED_CACHE + +/* + * XXX numbers come from a place the sun don't shine -- probably wants to = be SD + * tunable or so. + */ +#define EPOCH_PERIOD (HZ/100) /* 10 ms */ +#define EPOCH_OLD 5 /* 50 ms */ + +void mm_init_sched(struct mm_struct *mm, struct mm_sched *_pcpu_sched) +{ + unsigned long epoch; + int i; + + for_each_possible_cpu(i) { + struct mm_sched *pcpu_sched =3D per_cpu_ptr(_pcpu_sched, i); + struct rq *rq =3D cpu_rq(i); + + pcpu_sched->runtime =3D 0; + pcpu_sched->epoch =3D epoch =3D rq->cpu_epoch; + pcpu_sched->occ =3D -1; + } + + raw_spin_lock_init(&mm->mm_sched_lock); + mm->mm_sched_epoch =3D epoch; + mm->mm_sched_cpu =3D -1; + + smp_store_release(&mm->pcpu_sched, _pcpu_sched); +} + +/* because why would C be fully specified */ +static __always_inline void __shr_u64(u64 *val, unsigned int n) +{ + if (n >=3D 64) { + *val =3D 0; + return; + } + *val >>=3D n; +} + +static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_= sched) +{ + lockdep_assert_held(&rq->cpu_epoch_lock); + + unsigned long n, now =3D jiffies; + long delta =3D now - rq->cpu_epoch_next; + + if (delta > 0) { + n =3D (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD; + rq->cpu_epoch +=3D n; + rq->cpu_epoch_next +=3D n * EPOCH_PERIOD; + __shr_u64(&rq->cpu_runtime, n); + } + + n =3D rq->cpu_epoch - pcpu_sched->epoch; + if (n) { + pcpu_sched->epoch +=3D n; + __shr_u64(&pcpu_sched->runtime, n); + } +} + +static unsigned long fraction_mm_sched(struct rq *rq, struct mm_sched *pcp= u_sched) +{ + guard(raw_spinlock_irqsave)(&rq->cpu_epoch_lock); + + __update_mm_sched(rq, pcpu_sched); + + /* + * Runtime is a geometric series (r=3D0.5) and as such will sum to twice + * the accumulation period, this means the multiplcation here should + * not overflow. + */ + return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1); +} + +static inline +void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec) +{ + struct mm_struct *mm =3D p->mm; + struct mm_sched *pcpu_sched; + unsigned long epoch; + + /* + * init_task and kthreads don't be having no mm + */ + if (!mm || !mm->pcpu_sched) + return; + + pcpu_sched =3D this_cpu_ptr(p->mm->pcpu_sched); + + scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) { + __update_mm_sched(rq, pcpu_sched); + pcpu_sched->runtime +=3D delta_exec; + rq->cpu_runtime +=3D delta_exec; + epoch =3D rq->cpu_epoch; + } + + /* + * If this task hasn't hit task_cache_work() for a while, invalidate + * it's preferred state. 
+ */ + if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_OLD) { + mm->mm_sched_cpu =3D -1; + pcpu_sched->occ =3D -1; + } +} + +static void task_tick_cache(struct rq *rq, struct task_struct *p) +{ + struct callback_head *work =3D &p->cache_work; + struct mm_struct *mm =3D p->mm; + + if (!mm || !mm->pcpu_sched) + return; + + if (mm->mm_sched_epoch =3D=3D rq->cpu_epoch) + return; + + guard(raw_spinlock)(&mm->mm_sched_lock); + + if (mm->mm_sched_epoch =3D=3D rq->cpu_epoch) + return; + + if (work->next =3D=3D work) { + task_work_add(p, work, TWA_RESUME); + WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch); + } +} + +static void task_cache_work(struct callback_head *work) +{ + struct task_struct *p =3D current; + struct mm_struct *mm =3D p->mm; + unsigned long m_a_occ =3D 0; + int cpu, m_a_cpu =3D -1; + cpumask_var_t cpus; + + WARN_ON_ONCE(work !=3D &p->cache_work); + + work->next =3D work; + + if (p->flags & PF_EXITING) + return; + + if (!alloc_cpumask_var(&cpus, GFP_KERNEL)) + return; + + scoped_guard (cpus_read_lock) { + cpumask_copy(cpus, cpu_online_mask); + + for_each_cpu(cpu, cpus) { + /* XXX sched_cluster_active */ + struct sched_domain *sd =3D per_cpu(sd_llc, cpu); + unsigned long occ, m_occ =3D 0, a_occ =3D 0; + int m_cpu =3D -1, nr =3D 0, i; + + for_each_cpu(i, sched_domain_span(sd)) { + occ =3D fraction_mm_sched(cpu_rq(i), + per_cpu_ptr(mm->pcpu_sched, i)); + a_occ +=3D occ; + if (occ > m_occ) { + m_occ =3D occ; + m_cpu =3D i; + } + nr++; + trace_printk("(%d) occ: %ld m_occ: %ld m_cpu: %d nr: %d\n", + per_cpu(sd_llc_id, i), occ, m_occ, m_cpu, nr); + } + + a_occ /=3D nr; + if (a_occ > m_a_occ) { + m_a_occ =3D a_occ; + m_a_cpu =3D m_cpu; + } + + trace_printk("(%d) a_occ: %ld m_a_occ: %ld\n", + per_cpu(sd_llc_id, cpu), a_occ, m_a_occ); + + for_each_cpu(i, sched_domain_span(sd)) { + /* XXX threshold ? */ + per_cpu_ptr(mm->pcpu_sched, i)->occ =3D a_occ; + } + + cpumask_andnot(cpus, cpus, sched_domain_span(sd)); + } + } + + /* + * If the max average cache occupancy is 'small' we don't care. 
+ */ + if (m_a_occ < (NICE_0_LOAD >> EPOCH_OLD)) + m_a_cpu =3D -1; + + mm->mm_sched_cpu =3D m_a_cpu; + + free_cpumask_var(cpus); +} + +void init_sched_mm(struct task_struct *p) +{ + struct callback_head *work =3D &p->cache_work; + init_task_work(work, task_cache_work); + work->next =3D work; +} + +#else + +static inline void account_mm_sched(struct rq *rq, struct task_struct *p, + s64 delta_exec) { } + + +void init_sched_mm(struct task_struct *p) { } + +static void task_tick_cache(struct rq *rq, struct task_struct *p) { } + +#endif + +static inline +void update_curr_task(struct rq *rq, struct task_struct *p, s64 delta_exec) { trace_sched_stat_runtime(p, delta_exec); account_group_exec_runtime(p, delta_exec); + account_mm_sched(rq, p, delta_exec); cgroup_account_cputime(p, delta_exec); } =20 @@ -1215,7 +1434,7 @@ s64 update_curr_common(struct rq *rq) =20 delta_exec =3D update_curr_se(rq, &donor->se); if (likely(delta_exec > 0)) - update_curr_task(donor, delta_exec); + update_curr_task(rq, donor, delta_exec); =20 return delta_exec; } @@ -1244,7 +1463,7 @@ static void update_curr(struct cfs_rq *cfs_rq) if (entity_is_task(curr)) { struct task_struct *p =3D task_of(curr); =20 - update_curr_task(p, delta_exec); + update_curr_task(rq, p, delta_exec); =20 /* * If the fair_server is active, we need to account for the @@ -7843,7 +8062,7 @@ static int select_idle_sibling(struct task_struct *p,= int prev, int target) * per-cpu select_rq_mask usage */ lockdep_assert_irqs_disabled(); - +again: if ((available_idle_cpu(target) || sched_idle_cpu(target)) && asym_fits_cpu(task_util, util_min, util_max, target)) return target; @@ -7881,7 +8100,8 @@ static int select_idle_sibling(struct task_struct *p,= int prev, int target) /* Check a recently used CPU as a potential idle candidate: */ recent_used_cpu =3D p->recent_used_cpu; p->recent_used_cpu =3D prev; - if (recent_used_cpu !=3D prev && + if (prev =3D=3D p->wake_cpu && + recent_used_cpu !=3D prev && recent_used_cpu !=3D target && cpus_share_cache(recent_used_cpu, target) && (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cp= u)) && @@ -7934,6 +8154,18 @@ static int select_idle_sibling(struct task_struct *p= , int prev, int target) if ((unsigned)i < nr_cpumask_bits) return i; =20 + if (prev !=3D p->wake_cpu && !cpus_share_cache(prev, p->wake_cpu)) { + /* + * Most likely select_cache_cpu() will have re-directed + * the wakeup, but getting here means the preferred cache is + * too busy, so re-try with the actual previous. + * + * XXX wake_affine is lost for this pass. 
+ */ + prev =3D target =3D p->wake_cpu; + goto again; + } + /* * For cluster machines which have lower sharing cache like L2 or * LLC Tag, we tend to find an idle CPU in the target's cluster @@ -8556,6 +8788,40 @@ static int find_energy_efficient_cpu(struct task_str= uct *p, int prev_cpu) return target; } =20 +#ifdef CONFIG_SCHED_CACHE +static long __migrate_degrades_locality(struct task_struct *p, int src_cpu= , int dst_cpu, bool idle); + +static int select_cache_cpu(struct task_struct *p, int prev_cpu) +{ + struct mm_struct *mm =3D p->mm; + int cpu; + + if (!mm || p->nr_cpus_allowed =3D=3D 1) + return prev_cpu; + + cpu =3D mm->mm_sched_cpu; + if (cpu < 0) + return prev_cpu; + + + if (static_branch_likely(&sched_numa_balancing) && + __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) { + /* + * XXX look for max occupancy inside prev_cpu's node + */ + return prev_cpu; + } + + return cpu; +} +#else +static int select_cache_cpu(struct task_struct *p, int prev_cpu) +{ + return prev_cpu; +} +#endif + + /* * select_task_rq_fair: Select target runqueue for the waking task in doma= ins * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAK= E, @@ -8581,6 +8847,8 @@ select_task_rq_fair(struct task_struct *p, int prev_c= pu, int wake_flags) * required for stable ->cpus_allowed */ lockdep_assert_held(&p->pi_lock); + guard(rcu)(); + if (wake_flags & WF_TTWU) { record_wakee(p); =20 @@ -8588,6 +8856,8 @@ select_task_rq_fair(struct task_struct *p, int prev_c= pu, int wake_flags) cpumask_test_cpu(cpu, p->cpus_ptr)) return cpu; =20 + new_cpu =3D prev_cpu =3D select_cache_cpu(p, prev_cpu); + if (!is_rd_overutilized(this_rq()->rd)) { new_cpu =3D find_energy_efficient_cpu(p, prev_cpu); if (new_cpu >=3D 0) @@ -8598,7 +8868,6 @@ select_task_rq_fair(struct task_struct *p, int prev_c= pu, int wake_flags) want_affine =3D !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr); } =20 - rcu_read_lock(); for_each_domain(cpu, tmp) { /* * If both 'cpu' and 'prev_cpu' are part of this domain, @@ -8631,7 +8900,6 @@ select_task_rq_fair(struct task_struct *p, int prev_c= pu, int wake_flags) /* Fast path */ new_cpu =3D select_idle_sibling(p, prev_cpu, new_cpu); } - rcu_read_unlock(); =20 return new_cpu; } @@ -9281,6 +9549,17 @@ static int task_hot(struct task_struct *p, struct lb= _env *env) if (sysctl_sched_migration_cost =3D=3D 0) return 0; =20 +#ifdef CONFIG_SCHED_CACHE + if (p->mm && p->mm->pcpu_sched) { + /* + * XXX things like Skylake have non-inclusive L3 and might not + * like this L3 centric view. What to do about L2 stickyness ? + */ + return per_cpu_ptr(p->mm->pcpu_sched, env->src_cpu)->occ > + per_cpu_ptr(p->mm->pcpu_sched, env->dst_cpu)->occ; + } +#endif + delta =3D rq_clock_task(env->src_rq) - p->se.exec_start; =20 return delta < (s64)sysctl_sched_migration_cost; @@ -9292,27 +9571,25 @@ static int task_hot(struct task_struct *p, struct l= b_env *env) * Returns 0, if task migration is not affected by locality. * Returns a negative value, if task migration improves locality i.e migra= tion preferred. 
*/ -static long migrate_degrades_locality(struct task_struct *p, struct lb_env= *env) +static long __migrate_degrades_locality(struct task_struct *p, int src_cpu= , int dst_cpu, bool idle) { struct numa_group *numa_group =3D rcu_dereference(p->numa_group); unsigned long src_weight, dst_weight; int src_nid, dst_nid, dist; =20 - if (!static_branch_likely(&sched_numa_balancing)) - return 0; - - if (!p->numa_faults || !(env->sd->flags & SD_NUMA)) + if (!p->numa_faults) return 0; =20 - src_nid =3D cpu_to_node(env->src_cpu); - dst_nid =3D cpu_to_node(env->dst_cpu); + src_nid =3D cpu_to_node(src_cpu); + dst_nid =3D cpu_to_node(dst_cpu); =20 if (src_nid =3D=3D dst_nid) return 0; =20 /* Migrating away from the preferred node is always bad. */ if (src_nid =3D=3D p->numa_preferred_nid) { - if (env->src_rq->nr_running > env->src_rq->nr_preferred_running) + struct rq *src_rq =3D cpu_rq(src_cpu); + if (src_rq->nr_running > src_rq->nr_preferred_running) return 1; else return 0; @@ -9323,7 +9600,7 @@ static long migrate_degrades_locality(struct task_str= uct *p, struct lb_env *env) return -1; =20 /* Leaving a core idle is often worse than degrading locality. */ - if (env->idle =3D=3D CPU_IDLE) + if (idle) return 0; =20 dist =3D node_distance(src_nid, dst_nid); @@ -9338,7 +9615,24 @@ static long migrate_degrades_locality(struct task_st= ruct *p, struct lb_env *env) return src_weight - dst_weight; } =20 +static long migrate_degrades_locality(struct task_struct *p, struct lb_env= *env) +{ + if (!static_branch_likely(&sched_numa_balancing)) + return 0; + + if (!(env->sd->flags & SD_NUMA)) + return 0; + + return __migrate_degrades_locality(p, env->src_cpu, env->dst_cpu, + env->idle =3D=3D CPU_IDLE); +} + #else +static long __migrate_degrades_locality(struct task_struct *p, int src_cpu= , int dst_cpu, bool idle) +{ + return 0; +} + static inline long migrate_degrades_locality(struct task_struct *p, struct lb_env *env) { @@ -13098,8 +13392,8 @@ static inline void task_tick_core(struct rq *rq, st= ruct task_struct *curr) {} */ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int qu= eued) { - struct cfs_rq *cfs_rq; struct sched_entity *se =3D &curr->se; + struct cfs_rq *cfs_rq; =20 for_each_sched_entity(se) { cfs_rq =3D cfs_rq_of(se); @@ -13109,6 +13403,8 @@ static void task_tick_fair(struct rq *rq, struct ta= sk_struct *curr, int queued) if (static_branch_unlikely(&sched_numa_balancing)) task_tick_numa(rq, curr); =20 + task_tick_cache(rq, curr); + update_misfit_status(curr, rq); check_update_overutilized_status(task_rq(curr)); =20 diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index c5a6a503eb6d..1b6d7e374bc3 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1173,6 +1173,12 @@ struct rq { u64 clock_pelt_idle_copy; u64 clock_idle_copy; #endif +#ifdef CONFIG_SCHED_CACHE + raw_spinlock_t cpu_epoch_lock; + u64 cpu_runtime; + unsigned long cpu_epoch; + unsigned long cpu_epoch_next; +#endif =20 atomic_t nr_iowait; =20 @@ -3887,6 +3893,8 @@ static inline void task_tick_mm_cid(struct rq *rq, st= ruct task_struct *curr) { } static inline void init_sched_mm_cid(struct task_struct *t) { } #endif /* !CONFIG_SCHED_MM_CID */ =20 +extern void init_sched_mm(struct task_struct *p); + extern u64 avg_vruntime(struct cfs_rq *cfs_rq); extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se); #ifdef CONFIG_SMP --=20 2.25.1 From nobody Sun Feb 8 17:30:11 2026 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.12]) (using TLSv1.2 with cipher 
ECDHE-RSA-AES256-GCM-SHA384) by smtp.subspace.kernel.org (Postfix) with ESMTPS; Mon, 21 Apr 2025 03:30:32 +0000 (UTC)
From: Chen Yu 
To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . 
Shenoy" Cc: Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Tim Chen , Vincent Guittot , Libo Chen , Abel Wu , Madadi Vineeth Reddy , Hillf Danton , linux-kernel@vger.kernel.org, Chen Yu Subject: [RFC PATCH 2/5] sched: Several fixes for cache aware scheduling Date: Mon, 21 Apr 2025 11:24:41 +0800 Message-Id: <660bc36a8aacc6ba55fbcf8b0f9f05b6326e69ce.1745199017.git.yu.c.chen@intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" 1. Fix the compile errors on per-CPU allocation. 2. Enqueue tasks to the target CPU instead of the current CPU; otherwise, the per-CPU occupancy will be messed up. 3. Fix the NULL LLC sched domain issue(Libo Chen). 4. Avoid duplicated epoch check in task_tick_cache() 5. Introduce sched feature SCHED_CACHE to control cache aware scheduling TBD suggestion in previous version: move cache_work from per task to per mm_struct, consider the actual cpu capacity in fraction_mm_sched() (Abel Wu) Signed-off-by: Chen Yu --- include/linux/mm_types.h | 4 ++-- kernel/sched/fair.c | 15 +++++++++------ kernel/sched/features.h | 1 + 3 files changed, 12 insertions(+), 8 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 013291c6aaa2..9de4a0a13c4d 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1411,11 +1411,11 @@ static inline void mm_set_cpus_allowed(struct mm_st= ruct *mm, const struct cpumas #endif /* CONFIG_SCHED_MM_CID */ =20 #ifdef CONFIG_SCHED_CACHE -extern void mm_init_sched(struct mm_struct *mm, struct mm_sched *pcpu_sche= d); +extern void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *= pcpu_sched); =20 static inline int mm_alloc_sched_noprof(struct mm_struct *mm) { - struct mm_sched *pcpu_sched =3D alloc_percpu_noprof(struct mm_sched); + struct mm_sched __percpu *pcpu_sched =3D alloc_percpu_noprof(struct mm_sc= hed); if (!pcpu_sched) return -ENOMEM; =20 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 23ea35dbd381..22b5830e7e4e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1175,7 +1175,7 @@ static s64 update_curr_se(struct rq *rq, struct sched= _entity *curr) #define EPOCH_PERIOD (HZ/100) /* 10 ms */ #define EPOCH_OLD 5 /* 50 ms */ =20 -void mm_init_sched(struct mm_struct *mm, struct mm_sched *_pcpu_sched) +void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_s= ched) { unsigned long epoch; int i; @@ -1254,7 +1254,7 @@ void account_mm_sched(struct rq *rq, struct task_stru= ct *p, s64 delta_exec) if (!mm || !mm->pcpu_sched) return; =20 - pcpu_sched =3D this_cpu_ptr(p->mm->pcpu_sched); + pcpu_sched =3D per_cpu_ptr(p->mm->pcpu_sched, cpu_of(rq)); =20 scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) { __update_mm_sched(rq, pcpu_sched); @@ -1286,9 +1286,6 @@ static void task_tick_cache(struct rq *rq, struct tas= k_struct *p) =20 guard(raw_spinlock)(&mm->mm_sched_lock); =20 - if (mm->mm_sched_epoch =3D=3D rq->cpu_epoch) - return; - if (work->next =3D=3D work) { task_work_add(p, work, TWA_RESUME); WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch); @@ -1322,6 +1319,9 @@ static void task_cache_work(struct callback_head *wor= k) unsigned long occ, m_occ =3D 0, a_occ =3D 0; int m_cpu =3D -1, nr =3D 0, i; =20 + if (!sd) + continue; + for_each_cpu(i, sched_domain_span(sd)) { occ =3D 
fraction_mm_sched(cpu_rq(i),
 						per_cpu_ptr(mm->pcpu_sched, i));
@@ -8796,6 +8796,9 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 	struct mm_struct *mm = p->mm;
 	int cpu;
 
+	if (!sched_feat(SCHED_CACHE))
+		return prev_cpu;
+
 	if (!mm || p->nr_cpus_allowed == 1)
 		return prev_cpu;
 
@@ -9550,7 +9553,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 		return 0;
 
 #ifdef CONFIG_SCHED_CACHE
-	if (p->mm && p->mm->pcpu_sched) {
+	if (sched_feat(SCHED_CACHE) && p->mm && p->mm->pcpu_sched) {
 		/*
 		 * XXX things like Skylake have non-inclusive L3 and might not
 		 * like this L3 centric view. What to do about L2 stickyness ?
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f93331..d2af7bfd36bf 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -87,6 +87,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_UTIL, true)
 
+SCHED_FEAT(SCHED_CACHE, true)
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
  * in a single rq->lock section. Default disabled because the
-- 
2.25.1
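A note on the occupancy arithmetic used by patches 1-2: the per-CPU and per-mm runtime sums are halved once per EPOCH_PERIOD (10 ms), so each converges to roughly twice the amount accumulated per epoch, and fraction_mm_sched() reports their ratio scaled to NICE_0_LOAD. The stand-alone user-space sketch below only mirrors that math for illustration; the struct, the simulation loop and its 60%-busy workload are invented for the example and are not kernel code.

/*
 * User-space illustration only (not kernel code): mimics the epoch
 * halving of __update_mm_sched() and the occupancy fraction of
 * fraction_mm_sched().  EPOCH_NS and NICE_0_LOAD mirror the patch;
 * the workload figures below are made up.
 */
#include <stdio.h>
#include <stdint.h>

#define NICE_0_LOAD	1024ULL
#define EPOCH_NS	(10ULL * 1000 * 1000)	/* EPOCH_PERIOD: 10 ms */

struct toy_epoch {
	uint64_t cpu_runtime;	/* decayed sum of all runtime on this CPU */
	uint64_t mm_runtime;	/* decayed sum of this mm's runtime here  */
};

/* Age both sums by n epochs: each epoch halves the accumulated value. */
static void age_epochs(struct toy_epoch *t, unsigned int n)
{
	if (n >= 64) {			/* same guard as __shr_u64() */
		t->cpu_runtime = t->mm_runtime = 0;
		return;
	}
	t->cpu_runtime >>= n;
	t->mm_runtime >>= n;
}

/* Occupancy on the same 0..NICE_0_LOAD scale as fraction_mm_sched(). */
static uint64_t occupancy(const struct toy_epoch *t)
{
	return NICE_0_LOAD * t->mm_runtime / (t->cpu_runtime + 1);
}

int main(void)
{
	struct toy_epoch t = { 0, 0 };
	int epoch;

	/* The mm owns 6 ms of every fully busy 10 ms epoch on this CPU... */
	for (epoch = 0; epoch < 10; epoch++) {
		age_epochs(&t, 1);
		t.cpu_runtime += EPOCH_NS;
		t.mm_runtime  += 6 * EPOCH_NS / 10;
		printf("epoch %2d: occ %4llu / %llu\n", epoch,
		       (unsigned long long)occupancy(&t),
		       (unsigned long long)NICE_0_LOAD);
	}

	/* ...then stops running there; watch the occupancy decay away. */
	for (; epoch < 20; epoch++) {
		age_epochs(&t, 1);
		t.cpu_runtime += EPOCH_NS;
		printf("epoch %2d: occ %4llu / %llu\n", epoch,
		       (unsigned long long)occupancy(&t),
		       (unsigned long long)NICE_0_LOAD);
	}
	return 0;
}

Built with a plain cc, it shows the occupancy climbing towards about 614/1024 (roughly 0.6) while the task runs there, then halving every epoch once it stops, which is the signal the mm_sched_cpu selection keys off.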
From nobody Sun Feb 8 17:30:11 2026
Received: from mgamail.intel.com by smtp.subspace.kernel.org (Postfix) with ESMTPS; Mon, 21 Apr 2025 03:30:44 +0000 (UTC)
From: Chen Yu 
To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" 
Cc: Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Tim Chen , Vincent Guittot , Libo Chen , Abel Wu , Madadi Vineeth Reddy , Hillf Danton , linux-kernel@vger.kernel.org, Chen Yu 
Subject: [RFC PATCH 3/5] sched: Avoid task migration within its preferred LLC
Date: Mon, 21 Apr 2025 11:25:04 +0800
Message-Id: <01a54d63193fab5c819aab75321f6aa492491997.1745199017.git.yu.c.chen@intel.com>

It was found that when schbench is running, there is a significant amount
of in-LLC task migration, even if the wakee is woken up on its preferred
LLC. This leads to core-to-core latency and impairs performance. Inhibit
task migration if the wakee is already in its preferred LLC. Meanwhile,
prevent the load balancer from treating the task as cache-hot if this task
is being migrated out of its preferred LLC, instead of comparing occupancy
between CPUs directly.

With this enhancement applied, the in-LLC task migration has been reduced
significantly (use PATCH 5/5 to verify).

Signed-off-by: Chen Yu 
---
 kernel/sched/fair.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22b5830e7e4e..1733eb83042c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8806,6 +8806,12 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 	if (cpu < 0)
 		return prev_cpu;
 
+	/*
+	 * No need to migrate the task if previous and preferred CPU
+	 * are in the same LLC.
+	 */
+	if (cpus_share_cache(prev_cpu, cpu))
+		return prev_cpu;
 
 	if (static_branch_likely(&sched_numa_balancing) &&
 	    __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) {
@@ -9553,14 +9559,13 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 		return 0;
 
 #ifdef CONFIG_SCHED_CACHE
-	if (sched_feat(SCHED_CACHE) && p->mm && p->mm->pcpu_sched) {
-		/*
-		 * XXX things like Skylake have non-inclusive L3 and might not
-		 * like this L3 centric view. What to do about L2 stickyness ?
-		 */
-		return per_cpu_ptr(p->mm->pcpu_sched, env->src_cpu)->occ >
-		       per_cpu_ptr(p->mm->pcpu_sched, env->dst_cpu)->occ;
-	}
+	/*
+	 * Don't migrate task out of its preferred LLC.
+	 */
+	if (sched_feat(SCHED_CACHE) && p->mm && p->mm->mm_sched_cpu >= 0 &&
+	    cpus_share_cache(env->src_cpu, p->mm->mm_sched_cpu) &&
+	    !cpus_share_cache(env->src_cpu, env->dst_cpu))
+		return 1;
 #endif
 
 	delta = rq_clock_task(env->src_rq) - p->se.exec_start;
-- 
2.25.1
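The two checks added by this patch are easiest to see side by side. The toy program below models them against a stubbed two-LLC topology; llc_of[] and the user-space cpus_share_cache() are stand-ins invented for the example, so treat it as a sketch of the decision logic rather than the kernel implementation.

/*
 * Toy model of the two checks in this patch.  The topology, llc_of[]
 * and this user-space cpus_share_cache() are invented for the example.
 */
#include <stdio.h>
#include <stdbool.h>

static const int llc_of[8] = { 0, 0, 0, 0, 1, 1, 1, 1 };	/* 2 LLCs x 4 CPUs */

static bool cpus_share_cache(int a, int b)
{
	return llc_of[a] == llc_of[b];
}

/* Wakeup side: stay on prev_cpu if it already sits in the preferred LLC. */
static int toy_select_cache_cpu(int prev_cpu, int mm_sched_cpu)
{
	if (mm_sched_cpu < 0 || cpus_share_cache(prev_cpu, mm_sched_cpu))
		return prev_cpu;
	return mm_sched_cpu;
}

/*
 * Load-balance side: report "cache hot" when the migration would pull
 * the task out of its preferred LLC.
 */
static bool toy_task_hot(int src_cpu, int dst_cpu, int mm_sched_cpu)
{
	return mm_sched_cpu >= 0 &&
	       cpus_share_cache(src_cpu, mm_sched_cpu) &&
	       !cpus_share_cache(src_cpu, dst_cpu);
}

int main(void)
{
	int pref = 5;	/* preferred LLC is the one holding CPU 5 */

	printf("wakeup prev=6 -> cpu %d (same LLC as preferred, stays)\n",
	       toy_select_cache_cpu(6, pref));
	printf("wakeup prev=1 -> cpu %d (redirected to preferred LLC)\n",
	       toy_select_cache_cpu(1, pref));
	printf("balance 6->2: hot=%d (would leave preferred LLC)\n",
	       toy_task_hot(6, 2, pref));
	printf("balance 6->7: hot=%d (stays inside preferred LLC)\n",
	       toy_task_hot(6, 7, pref));
	return 0;
}

The wakeup side leaves a task on prev_cpu whenever prev_cpu already sits in the preferred LLC, and the load-balance side reports the task as hot exactly when a move would cross out of that LLC; together this is what suppresses the in-LLC bouncing described in the changelog.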
From nobody Sun Feb 8 17:30:11 2026
Received: from mgamail.intel.com by smtp.subspace.kernel.org (Postfix) with ESMTPS; Mon, 21 Apr 2025 03:30:58 +0000 (UTC)
From: Chen Yu 
To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . 
Shenoy" Cc: Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Tim Chen , Vincent Guittot , Libo Chen , Abel Wu , Madadi Vineeth Reddy , Hillf Danton , linux-kernel@vger.kernel.org, Chen Yu Subject: [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated Date: Mon, 21 Apr 2025 11:25:18 +0800 Message-Id: <2c45f6db1efef84c6c1ed514a8d24a9bc4a2ca4b.1745199017.git.yu.c.chen@intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" It is found that when the process's preferred LLC gets saturated by too many threads, task contention is very frequent and causes performance regression. Save the per LLC statistics calculated by periodic load balance. The statis= tics include the average utilization and the average number of runnable tasks. The task wakeup path for cache aware scheduling manipulates these statistics to inhibit cache aware scheduling to avoid performance regression. When eit= her the average utilization of the preferred LLC has reached 25%, or the average number of runnable tasks has exceeded 1/3 of the LLC weight, the cache aware wakeup is disabled. Only when the process has more threads than the LLC wei= ght will this restriction be enabled. Running schbench via mmtests on a Xeon platform, which has 2 sockets, each = socket has 60 Cores/120 CPUs. The DRAM interleave is enabled across NUMA nodes via= BIOS, so there are 2 "LLCs" in 1 NUMA node. compare-mmtests.pl --directory work/log --benchmark schbench --names baseli= ne,sched_cache baselin sched_cach baseline sched_cache Lat 50.0th-qrtle-1 6.00 ( 0.00%) 6.00 ( 0.00%) Lat 90.0th-qrtle-1 10.00 ( 0.00%) 9.00 ( 10.00%) Lat 99.0th-qrtle-1 29.00 ( 0.00%) 13.00 ( 55.17%) Lat 99.9th-qrtle-1 35.00 ( 0.00%) 21.00 ( 40.00%) Lat 20.0th-qrtle-1 266.00 ( 0.00%) 266.00 ( 0.00%) Lat 50.0th-qrtle-2 8.00 ( 0.00%) 6.00 ( 25.00%) Lat 90.0th-qrtle-2 10.00 ( 0.00%) 10.00 ( 0.00%) Lat 99.0th-qrtle-2 19.00 ( 0.00%) 18.00 ( 5.26%) Lat 99.9th-qrtle-2 27.00 ( 0.00%) 29.00 ( -7.41%) Lat 20.0th-qrtle-2 533.00 ( 0.00%) 507.00 ( 4.88%) Lat 50.0th-qrtle-4 6.00 ( 0.00%) 5.00 ( 16.67%) Lat 90.0th-qrtle-4 8.00 ( 0.00%) 5.00 ( 37.50%) Lat 99.0th-qrtle-4 14.00 ( 0.00%) 9.00 ( 35.71%) Lat 99.9th-qrtle-4 22.00 ( 0.00%) 14.00 ( 36.36%) Lat 20.0th-qrtle-4 1070.00 ( 0.00%) 995.00 ( 7.01%) Lat 50.0th-qrtle-8 5.00 ( 0.00%) 5.00 ( 0.00%) Lat 90.0th-qrtle-8 7.00 ( 0.00%) 5.00 ( 28.57%) Lat 99.0th-qrtle-8 12.00 ( 0.00%) 11.00 ( 8.33%) Lat 99.9th-qrtle-8 19.00 ( 0.00%) 16.00 ( 15.79%) Lat 20.0th-qrtle-8 2140.00 ( 0.00%) 2140.00 ( 0.00%) Lat 50.0th-qrtle-16 6.00 ( 0.00%) 5.00 ( 16.67%) Lat 90.0th-qrtle-16 7.00 ( 0.00%) 5.00 ( 28.57%) Lat 99.0th-qrtle-16 12.00 ( 0.00%) 10.00 ( 16.67%) Lat 99.9th-qrtle-16 17.00 ( 0.00%) 14.00 ( 17.65%) Lat 20.0th-qrtle-16 4296.00 ( 0.00%) 4200.00 ( 2.23%) Lat 50.0th-qrtle-32 6.00 ( 0.00%) 5.00 ( 16.67%) Lat 90.0th-qrtle-32 8.00 ( 0.00%) 6.00 ( 25.00%) Lat 99.0th-qrtle-32 12.00 ( 0.00%) 10.00 ( 16.67%) Lat 99.9th-qrtle-32 17.00 ( 0.00%) 14.00 ( 17.65%) Lat 20.0th-qrtle-32 8496.00 ( 0.00%) 8528.00 ( -0.38%) Lat 50.0th-qrtle-64 6.00 ( 0.00%) 5.00 ( 16.67%) Lat 90.0th-qrtle-64 8.00 ( 0.00%) 8.00 ( 0.00%) Lat 99.0th-qrtle-64 12.00 ( 0.00%) 12.00 ( 0.00%) Lat 99.9th-qrtle-64 17.00 ( 0.00%) 17.00 ( 0.00%) Lat 20.0th-qrtle-64 17120.00 ( 0.00%) 17120.00 ( 
0.00%) Lat 50.0th-qrtle-128 7.00 ( 0.00%) 7.00 ( 0.00%) Lat 90.0th-qrtle-128 9.00 ( 0.00%) 9.00 ( 0.00%) Lat 99.0th-qrtle-128 13.00 ( 0.00%) 14.00 ( -7.69%) Lat 99.9th-qrtle-128 20.00 ( 0.00%) 20.00 ( 0.00%) Lat 20.0th-qrtle-128 31776.00 ( 0.00%) 30496.00 ( 4.03%) Lat 50.0th-qrtle-239 9.00 ( 0.00%) 9.00 ( 0.00%) Lat 90.0th-qrtle-239 14.00 ( 0.00%) 18.00 ( -28.57%) Lat 99.0th-qrtle-239 43.00 ( 0.00%) 56.00 ( -30.23%) Lat 99.9th-qrtle-239 106.00 ( 0.00%) 483.00 (-355.66%) Lat 20.0th-qrtle-239 30176.00 ( 0.00%) 29984.00 ( 0.64%) We can see overall latency improvement and some throughput degradation when the system gets saturated. Also, we run schbench (old version) on an EPYC 7543 system, which has 4 NUMA nodes, and each node has 4 LLCs. Monitor the 99.0th latency: case load baseline(std%) compare%( std%) normal 4-mthreads-1-workers 1.00 ( 6.47) +9.02 ( 4= .68) normal 4-mthreads-2-workers 1.00 ( 3.25) +28.03 ( 8= .76) normal 4-mthreads-4-workers 1.00 ( 6.67) -4.32 ( 2= .58) normal 4-mthreads-8-workers 1.00 ( 2.38) +1.27 ( 2= .41) normal 4-mthreads-16-workers 1.00 ( 5.61) -8.48 ( 4= .39) normal 4-mthreads-31-workers 1.00 ( 9.31) -0.22 ( 9= .77) When the LLC is underloaded, the latency improvement is observed. When the = LLC gets saturated, we observe some degradation. The aggregation of tasks will move tasks towards the preferred LLC pretty quickly during wake ups. However load balance will tend to move tasks away from the aggregated LLC. The two migrations are in the opposite directions and tend to bounce tasks between LLCs. Such task migrations should be impeded in load balancing as long as the home LLC. We're working on fixing up the load balancing path to address such issues. Co-developed-by: Tim Chen Signed-off-by: Tim Chen Signed-off-by: Chen Yu --- include/linux/sched/topology.h | 4 ++ kernel/sched/fair.c | 101 ++++++++++++++++++++++++++++++++- 2 files changed, 104 insertions(+), 1 deletion(-) diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index 198bb5cc1774..9625d9d762f5 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -78,6 +78,10 @@ struct sched_domain_shared { atomic_t nr_busy_cpus; int has_idle_cores; int nr_idle_scan; +#ifdef CONFIG_SCHED_CACHE + unsigned long util_avg; + u64 nr_avg; +#endif }; =20 struct sched_domain { diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1733eb83042c..f74d8773c811 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8791,6 +8791,58 @@ static int find_energy_efficient_cpu(struct task_str= uct *p, int prev_cpu) #ifdef CONFIG_SCHED_CACHE static long __migrate_degrades_locality(struct task_struct *p, int src_cpu= , int dst_cpu, bool idle); =20 +/* expected to be protected by rcu_read_lock() */ +static bool get_llc_stats(int cpu, int *nr, int *weight, unsigned long *ut= il) +{ + struct sched_domain_shared *sd_share; + + sd_share =3D rcu_dereference(per_cpu(sd_llc_shared, cpu)); + if (!sd_share) + return false; + + *nr =3D READ_ONCE(sd_share->nr_avg); + *util =3D READ_ONCE(sd_share->util_avg); + *weight =3D per_cpu(sd_llc_size, cpu); + + return true; +} + +static bool valid_target_cpu(int cpu, struct task_struct *p) +{ + int nr_running, llc_weight; + unsigned long util, llc_cap; + + if (!get_llc_stats(cpu, &nr_running, &llc_weight, + &util)) + return false; + + llc_cap =3D llc_weight * SCHED_CAPACITY_SCALE; + + /* + * If this process has many threads, be careful to avoid + * task stacking on the preferred LLC, by checking the system's + * utilization and runnable tasks. 
Otherwise, if this + * process does not have many threads, honor the cache + * aware wakeup. + */ + if (get_nr_threads(p) < llc_weight) + return true; + + /* + * Check if it exceeded 25% of average utiliazation, + * or if it exceeded 33% of CPUs. This is a magic number + * that did not cause heavy cache contention on Xeon or + * Zen. + */ + if (util * 4 >=3D llc_cap) + return false; + + if (nr_running * 3 >=3D llc_weight) + return false; + + return true; +} + static int select_cache_cpu(struct task_struct *p, int prev_cpu) { struct mm_struct *mm =3D p->mm; @@ -8813,6 +8865,9 @@ static int select_cache_cpu(struct task_struct *p, in= t prev_cpu) if (cpus_share_cache(prev_cpu, cpu)) return prev_cpu; =20 + if (!valid_target_cpu(cpu, p)) + return prev_cpu; + if (static_branch_likely(&sched_numa_balancing) && __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) { /* @@ -9564,7 +9619,8 @@ static int task_hot(struct task_struct *p, struct lb_= env *env) */ if (sched_feat(SCHED_CACHE) && p->mm && p->mm->mm_sched_cpu >=3D 0 && cpus_share_cache(env->src_cpu, p->mm->mm_sched_cpu) && - !cpus_share_cache(env->src_cpu, env->dst_cpu)) + !cpus_share_cache(env->src_cpu, env->dst_cpu) && + !valid_target_cpu(env->dst_cpu, p)) return 1; #endif =20 @@ -10634,6 +10690,48 @@ sched_reduced_capacity(struct rq *rq, struct sched= _domain *sd) return check_cpu_capacity(rq, sd); } =20 +#ifdef CONFIG_SCHED_CACHE +/* + * Save this sched group's statistic for later use: + * The task wakeup and load balance can make better + * decision based on these statistics. + */ +static void update_sg_if_llc(struct lb_env *env, struct sg_lb_stats *sgs, + struct sched_group *group) +{ + /* Find the sched domain that spans this group. */ + struct sched_domain *sd =3D env->sd->child; + struct sched_domain_shared *sd_share; + u64 last_nr; + + if (!sched_feat(SCHED_CACHE) || env->idle =3D=3D CPU_NEWLY_IDLE) + return; + + /* only care the sched domain that spans 1 LLC */ + if (!sd || !(sd->flags & SD_SHARE_LLC) || + !sd->parent || (sd->parent->flags & SD_SHARE_LLC)) + return; + + sd_share =3D rcu_dereference(per_cpu(sd_llc_shared, + cpumask_first(sched_group_span(group)))); + if (!sd_share) + return; + + last_nr =3D READ_ONCE(sd_share->nr_avg); + update_avg(&last_nr, sgs->sum_nr_running); + + if (likely(READ_ONCE(sd_share->util_avg) !=3D sgs->group_util)) + WRITE_ONCE(sd_share->util_avg, sgs->group_util); + + WRITE_ONCE(sd_share->nr_avg, last_nr); +} +#else +static inline void update_sg_if_llc(struct lb_env *env, struct sg_lb_stats= *sgs, + struct sched_group *group) +{ +} +#endif + /** * update_sg_lb_stats - Update sched_group's statistics for load balancing. * @env: The load balancing environment. 
@@ -10723,6 +10821,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
+	update_sg_if_llc(env, sgs, group);
 	/* Computing avg_load makes sense only when group is overloaded */
 	if (sgs->group_type == group_overloaded)
 		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
-- 
2.25.1

From nobody Sun Feb 8 17:30:11 2026
Received: from mgamail.intel.com by smtp.subspace.kernel.org (Postfix) with ESMTPS; Mon, 21 Apr 2025 03:31:13 +0000 (UTC)
From: Chen Yu 
To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . 
Shenoy" Cc: Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Tim Chen , Vincent Guittot , Libo Chen , Abel Wu , Madadi Vineeth Reddy , Hillf Danton , linux-kernel@vger.kernel.org, Chen Yu Subject: [RFC PATCH 5/5] sched: Add ftrace to track task migration and load balance within and across LLC Date: Mon, 21 Apr 2025 11:25:33 +0800 Message-Id: <5d5a6e243b88d47a744f3c84d2a3a74832a6ef35.1745199017.git.yu.c.chen@intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" [Not for upstream] Introduce these ftrace events for debugging purposes. The task migration activity is an important indicator to infer the performance regression. Use the following bpftrace script to capture the task migrations: tracepoint:sched:sched_attach_task { $src_cpu =3D args->src_cpu; $dst_cpu =3D args->dst_cpu; $src_llc =3D args->src_llc; $dst_llc =3D args->dst_llc; $idle =3D args->idle; if ($src_llc =3D=3D $dst_llc) { @lb_mig_1llc[$idle] =3D count(); } else { @lb_mig_2llc[$idle] =3D count(); } } tracepoint:sched:sched_select_task_rq { $new_cpu =3D args->new_cpu; $old_cpu =3D args->old_cpu; $new_llc =3D args->new_llc; $old_llc =3D args->old_llc; if ($new_cpu !=3D $old_cpu) { if ($new_llc =3D=3D $old_llc) { @wake_mig_1llc[$new_llc] =3D count(); } else { @wake_mig_2llc =3D count(); } } } interval:s:10 { time("\n%H:%M:%S scheduler statistics: \n"); print(@lb_mig_1llc); clear(@lb_mig_1llc); print(@lb_mig_2llc); clear(@lb_mig_2llc); print(@wake_mig_1llc); clear(@wake_mig_1llc); print(@wake_mig_2llc); clear(@wake_mig_2llc); } Signed-off-by: Chen Yu --- include/trace/events/sched.h | 51 ++++++++++++++++++++++++++++++++++++ kernel/sched/fair.c | 24 ++++++++++++----- 2 files changed, 69 insertions(+), 6 deletions(-) diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index 3bec9fb73a36..9995e09525ed 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -10,6 +10,57 @@ #include #include =20 +TRACE_EVENT(sched_attach_task, + + TP_PROTO(int src_cpu, int dst_cpu, int src_llc, int dst_llc, int idle), + + TP_ARGS(src_cpu, dst_cpu, src_llc, dst_llc, idle), + + TP_STRUCT__entry( + __field( int, src_cpu ) + __field( int, dst_cpu ) + __field( int, src_llc ) + __field( int, dst_llc ) + __field( int, idle ) + ), + + TP_fast_assign( + __entry->src_cpu =3D src_cpu; + __entry->dst_cpu =3D dst_cpu; + __entry->src_llc =3D src_llc; + __entry->dst_llc =3D dst_llc; + __entry->idle =3D idle; + ), + + TP_printk("src_cpu=3D%d dst_cpu=3D%d src_llc=3D%d dst_llc=3D%d idle=3D%d", + __entry->src_cpu, __entry->dst_cpu, __entry->src_llc, + __entry->dst_llc, __entry->idle) +); + +TRACE_EVENT(sched_select_task_rq, + + TP_PROTO(int new_cpu, int old_cpu, int new_llc, int old_llc), + + TP_ARGS(new_cpu, old_cpu, new_llc, old_llc), + + TP_STRUCT__entry( + __field( int, new_cpu ) + __field( int, old_cpu ) + __field( int, new_llc ) + __field( int, old_llc ) + ), + + TP_fast_assign( + __entry->new_cpu =3D new_cpu; + __entry->old_cpu =3D old_cpu; + __entry->new_llc =3D new_llc; + __entry->old_llc =3D old_llc; + ), + + TP_printk("new_cpu=3D%d old_cpu=3D%d new_llc=3D%d old_llc=3D%d", + __entry->new_cpu, __entry->old_cpu, __entry->new_llc, __entry->old_llc) +); + /* * Tracepoint for calling kthread_stop, performed to end a kthread: */ diff --git 
a/kernel/sched/fair.c b/kernel/sched/fair.c index f74d8773c811..635fd3a6009c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8902,7 +8902,7 @@ select_task_rq_fair(struct task_struct *p, int prev_c= pu, int wake_flags) int sync =3D (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING); struct sched_domain *tmp, *sd =3D NULL; int cpu =3D smp_processor_id(); - int new_cpu =3D prev_cpu; + int new_cpu =3D prev_cpu, orig_prev_cpu =3D prev_cpu; int want_affine =3D 0; /* SD_flags and WF_flags share the first nibble */ int sd_flag =3D wake_flags & 0xF; @@ -8965,6 +8965,10 @@ select_task_rq_fair(struct task_struct *p, int prev_= cpu, int wake_flags) new_cpu =3D select_idle_sibling(p, prev_cpu, new_cpu); } =20 + trace_sched_select_task_rq(new_cpu, orig_prev_cpu, + per_cpu(sd_llc_id, new_cpu), + per_cpu(sd_llc_id, orig_prev_cpu)); + return new_cpu; } =20 @@ -10026,11 +10030,17 @@ static int detach_tasks(struct lb_env *env) /* * attach_task() -- attach the task detached by detach_task() to its new r= q. */ -static void attach_task(struct rq *rq, struct task_struct *p) +static void attach_task(struct rq *rq, struct task_struct *p, struct lb_en= v *env) { lockdep_assert_rq_held(rq); =20 WARN_ON_ONCE(task_rq(p) !=3D rq); + + if (env) + trace_sched_attach_task(env->src_cpu, env->dst_cpu, + per_cpu(sd_llc_id, env->src_cpu), + per_cpu(sd_llc_id, env->dst_cpu), + env->idle); activate_task(rq, p, ENQUEUE_NOCLOCK); wakeup_preempt(rq, p, 0); } @@ -10039,13 +10049,13 @@ static void attach_task(struct rq *rq, struct tas= k_struct *p) * attach_one_task() -- attaches the task returned from detach_one_task() = to * its new rq. */ -static void attach_one_task(struct rq *rq, struct task_struct *p) +static void attach_one_task(struct rq *rq, struct task_struct *p, struct l= b_env *env) { struct rq_flags rf; =20 rq_lock(rq, &rf); update_rq_clock(rq); - attach_task(rq, p); + attach_task(rq, p, env); rq_unlock(rq, &rf); } =20 @@ -10066,7 +10076,7 @@ static void attach_tasks(struct lb_env *env) p =3D list_first_entry(tasks, struct task_struct, se.group_node); list_del_init(&p->se.group_node); =20 - attach_task(env->dst_rq, p); + attach_task(env->dst_rq, p, env); } =20 rq_unlock(env->dst_rq, &rf); @@ -12457,6 +12467,7 @@ static int active_load_balance_cpu_stop(void *data) struct sched_domain *sd; struct task_struct *p =3D NULL; struct rq_flags rf; + struct lb_env env_tmp; =20 rq_lock_irq(busiest_rq, &rf); /* @@ -12512,6 +12523,7 @@ static int active_load_balance_cpu_stop(void *data) } else { schedstat_inc(sd->alb_failed); } + memcpy(&env_tmp, &env, sizeof(env)); } rcu_read_unlock(); out_unlock: @@ -12519,7 +12531,7 @@ static int active_load_balance_cpu_stop(void *data) rq_unlock(busiest_rq, &rf); =20 if (p) - attach_one_task(target_rq, p); + attach_one_task(target_rq, p, sd ? &env_tmp : NULL); =20 local_irq_enable(); =20 --=20 2.25.1
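Besides the bpftrace script in the changelog, the same counting can be done with a few lines of C reading trace_pipe. The sketch below is an assumption-laden helper, not part of the series: it presumes the sched_select_task_rq event added above has been enabled, that tracefs is mounted at /sys/kernel/tracing, and that the "new_cpu=..." fields appear exactly as in the TP_printk() format.

/*
 * Assumption-heavy helper: counts placements reported by the
 * sched_select_task_rq tracepoint added above.  Requires the event to
 * be enabled first, e.g.
 *   echo 1 > /sys/kernel/tracing/events/sched/sched_select_task_rq/enable
 * and tracefs mounted at /sys/kernel/tracing.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *fp = fopen("/sys/kernel/tracing/trace_pipe", "r");
	unsigned long no_mig = 0, same_llc = 0, cross_llc = 0;
	char line[512];

	if (!fp) {
		perror("trace_pipe");
		return 1;
	}

	while (fgets(line, sizeof(line), fp)) {
		int new_cpu, old_cpu, new_llc, old_llc;
		char *p = strstr(line, "new_cpu=");

		/* Field layout follows the TP_printk() format in this patch. */
		if (!p || sscanf(p, "new_cpu=%d old_cpu=%d new_llc=%d old_llc=%d",
				 &new_cpu, &old_cpu, &new_llc, &old_llc) != 4)
			continue;

		if (new_cpu == old_cpu)
			no_mig++;
		else if (new_llc == old_llc)
			same_llc++;
		else
			cross_llc++;

		if ((no_mig + same_llc + cross_llc) % 1000 == 0)
			printf("no-mig %lu  same-LLC %lu  cross-LLC %lu\n",
			       no_mig, same_llc, cross_llc);
	}
	fclose(fp);
	return 0;
}

A steadily growing cross-LLC counter under a stable workload is the behaviour the series aims to reduce.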