From: Chen Yu
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R. Shenoy
Shenoy" Cc: Vincent Guittot , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Libo Chen , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Chen Yu , linux-kernel@vger.kernel.org Subject: [RFC PATCH v4 01/28] sched: Cache aware load-balancing Date: Sat, 9 Aug 2025 13:00:59 +0800 Message-Id: <9157186cf9e3fd541f62c637579ff736b3704c51.1754712565.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Peter Zijlstra Hi all, One of the many things on the eternal todo list has been finishing the below hackery. It is an attempt at modelling cache affinity -- and while the patch really only targets LLC, it could very well be extended to also apply to clusters (L2). Specifically any case of multiple cache domains inside a node. Anyway, I wrote this about a year ago, and I mentioned this at the recent OSPM conf where Gautham and Prateek expressed interest in playing with this code. So here goes, very rough and largely unproven code ahead :-) It applies to current tip/master, but I know it will fail the __percpu validation that sits in -next, although that shouldn't be terribly hard to fix up. As is, it only computes a CPU inside the LLC that has the highest recent runtime, this CPU is then used in the wake-up path to steer towards this LLC and in task_hot() to limit migrations away from it. More elaborate things could be done, notably there is an XXX in there somewhere about finding the best LLC inside a NODE (interaction with NUMA_BALANCING). Signed-off-by: Peter Zijlstra (Intel) --- include/linux/mm_types.h | 44 ++++++ include/linux/sched.h | 4 + init/Kconfig | 4 + kernel/fork.c | 5 + kernel/sched/core.c | 13 +- kernel/sched/fair.c | 330 +++++++++++++++++++++++++++++++++++++-- kernel/sched/sched.h | 8 + 7 files changed, 388 insertions(+), 20 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index d6b91e8a66d6..cf26ad8b41ab 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -928,6 +928,12 @@ struct mm_cid { }; #endif =20 +struct mm_sched { + u64 runtime; + unsigned long epoch; + unsigned long occ; +}; + struct kioctx_table; struct iommu_mm_data; struct mm_struct { @@ -1018,6 +1024,17 @@ struct mm_struct { */ raw_spinlock_t cpus_allowed_lock; #endif +#ifdef CONFIG_SCHED_CACHE + /* + * Track per-cpu-per-process occupancy as a proxy for cache residency. + * See account_mm_sched() and ... 
+	 */
+	struct mm_sched __percpu *pcpu_sched;
+	raw_spinlock_t mm_sched_lock;
+	unsigned long mm_sched_epoch;
+	int mm_sched_cpu;
+#endif
+
 #ifdef CONFIG_MMU
 	atomic_long_t pgtables_bytes;	/* size of all page tables */
 #endif
@@ -1432,6 +1449,33 @@ static inline unsigned int mm_cid_size(void)
 static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask) { }
 #endif /* CONFIG_SCHED_MM_CID */
 
+#ifdef CONFIG_SCHED_CACHE
+extern void mm_init_sched(struct mm_struct *mm, struct mm_sched *pcpu_sched);
+
+static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
+{
+	struct mm_sched *pcpu_sched = alloc_percpu_noprof(struct mm_sched);
+	if (!pcpu_sched)
+		return -ENOMEM;
+
+	mm_init_sched(mm, pcpu_sched);
+	return 0;
+}
+
+#define mm_alloc_sched(...)	alloc_hooks(mm_alloc_sched_noprof(__VA_ARGS__))
+
+static inline void mm_destroy_sched(struct mm_struct *mm)
+{
+	free_percpu(mm->pcpu_sched);
+	mm->pcpu_sched = NULL;
+}
+#else /* !CONFIG_SCHED_CACHE */
+
+static inline int mm_alloc_sched(struct mm_struct *mm) { return 0; }
+static inline void mm_destroy_sched(struct mm_struct *mm) { }
+
+#endif /* CONFIG_SCHED_CACHE */
+
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index aa9c5be7a632..02ff8b8be25b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1403,6 +1403,10 @@ struct task_struct {
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_SCHED_CACHE
+	struct callback_head		cache_work;
+#endif
+
 #ifdef CONFIG_RSEQ
 	struct rseq __user *rseq;
 	u32 rseq_len;
diff --git a/init/Kconfig b/init/Kconfig
index 666783eb50ab..27f4012347f9 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -947,6 +947,10 @@ config NUMA_BALANCING
 
 	  This system will be inactive on UMA systems.
 
+config SCHED_CACHE
+	bool "Cache aware scheduler"
+	default y
+
 config NUMA_BALANCING_DEFAULT_ENABLED
 	bool "Automatically enable NUMA aware memory/task placement"
 	default y
diff --git a/kernel/fork.c b/kernel/fork.c
index 1ee8eb11f38b..546c49e46d48 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1073,6 +1073,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (mm_alloc_cid(mm, p))
 		goto fail_cid;
 
+	if (mm_alloc_sched(mm))
+		goto fail_sched;
+
 	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
 				     NR_MM_COUNTERS))
 		goto fail_pcpu;
@@ -1082,6 +1085,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	return mm;
 
 fail_pcpu:
+	mm_destroy_sched(mm);
+fail_sched:
 	mm_destroy_cid(mm);
 fail_cid:
 	destroy_context(mm);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 81c6df746df1..a5fb3057b1c4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4539,6 +4539,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->migration_pending = NULL;
 #endif
 	init_sched_mm_cid(p);
+	init_sched_mm(p);
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -8508,6 +8509,7 @@ static struct kmem_cache *task_group_cache __ro_after_init;
 
 void __init sched_init(void)
 {
+	unsigned long now = jiffies;
 	unsigned long ptr = 0;
 	int i;
 
@@ -8582,7 +8584,7 @@ void __init sched_init(void)
 		raw_spin_lock_init(&rq->__lock);
 		rq->nr_running = 0;
 		rq->calc_load_active = 0;
-		rq->calc_load_update = jiffies + LOAD_FREQ;
+		rq->calc_load_update = now + LOAD_FREQ;
 		init_cfs_rq(&rq->cfs);
 		init_rt_rq(&rq->rt);
 		init_dl_rq(&rq->dl);
@@ -8626,7 +8628,7 @@ void __init sched_init(void)
 		rq->cpu_capacity = SCHED_CAPACITY_SCALE;
 		rq->balance_callback = &balance_push_callback;
 		rq->active_balance = 0;
-		rq->next_balance = jiffies;
+		rq->next_balance = now;
 		rq->push_cpu = 0;
 		rq->cpu = i;
 		rq->online = 0;
@@ -8638,7 +8640,7 @@ void __init sched_init(void)
 
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ_COMMON
-		rq->last_blocked_load_update_tick = jiffies;
+		rq->last_blocked_load_update_tick = now;
 		atomic_set(&rq->nohz_flags, 0);
 
 		INIT_CSD(&rq->nohz_csd, nohz_csd_func, rq);
@@ -8663,6 +8665,11 @@ void __init sched_init(void)
 
 		rq->core_cookie = 0UL;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+		raw_spin_lock_init(&rq->cpu_epoch_lock);
+		rq->cpu_epoch_next = now;
+#endif
+
 		zalloc_cpumask_var_node(&rq->scratch_mask, GFP_KERNEL, cpu_to_node(i));
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a14da5396fb..e3897cd7696d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1166,10 +1166,229 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
 	return delta_exec;
 }
 
-static inline void update_curr_task(struct task_struct *p, s64 delta_exec)
+#ifdef CONFIG_SCHED_CACHE
+
+/*
+ * XXX numbers come from a place the sun don't shine -- probably wants to be SD
+ * tunable or so.
+ */
+#define EPOCH_PERIOD	(HZ/100)	/* 10 ms */
+#define EPOCH_OLD	5		/* 50 ms */
+
+void mm_init_sched(struct mm_struct *mm, struct mm_sched *_pcpu_sched)
+{
+	unsigned long epoch;
+	int i;
+
+	for_each_possible_cpu(i) {
+		struct mm_sched *pcpu_sched = per_cpu_ptr(_pcpu_sched, i);
+		struct rq *rq = cpu_rq(i);
+
+		pcpu_sched->runtime = 0;
+		pcpu_sched->epoch = epoch = rq->cpu_epoch;
+		pcpu_sched->occ = -1;
+	}
+
+	raw_spin_lock_init(&mm->mm_sched_lock);
+	mm->mm_sched_epoch = epoch;
+	mm->mm_sched_cpu = -1;
+
+	smp_store_release(&mm->pcpu_sched, _pcpu_sched);
+}
+
+/* because why would C be fully specified */
+static __always_inline void __shr_u64(u64 *val, unsigned int n)
+{
+	if (n >= 64) {
+		*val = 0;
+		return;
+	}
+	*val >>= n;
+}
+
+static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
+{
+	lockdep_assert_held(&rq->cpu_epoch_lock);
+
+	unsigned long n, now = jiffies;
+	long delta = now - rq->cpu_epoch_next;
+
+	if (delta > 0) {
+		n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
+		rq->cpu_epoch += n;
+		rq->cpu_epoch_next += n * EPOCH_PERIOD;
+		__shr_u64(&rq->cpu_runtime, n);
+	}
+
+	n = rq->cpu_epoch - pcpu_sched->epoch;
+	if (n) {
+		pcpu_sched->epoch += n;
+		__shr_u64(&pcpu_sched->runtime, n);
+	}
+}
+
+static unsigned long fraction_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
+{
+	guard(raw_spinlock_irqsave)(&rq->cpu_epoch_lock);
+
+	__update_mm_sched(rq, pcpu_sched);
+
+	/*
+	 * Runtime is a geometric series (r=0.5) and as such will sum to twice
+	 * the accumulation period; this means the multiplication here should
+	 * not overflow.
+	 */
+	return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1);
+}
+
+static inline
+void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
+{
+	struct mm_struct *mm = p->mm;
+	struct mm_sched *pcpu_sched;
+	unsigned long epoch;
+
+	/*
+	 * init_task and kthreads don't be having no mm
+	 */
+	if (!mm || !mm->pcpu_sched)
+		return;
+
+	pcpu_sched = this_cpu_ptr(p->mm->pcpu_sched);
+
+	scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
+		__update_mm_sched(rq, pcpu_sched);
+		pcpu_sched->runtime += delta_exec;
+		rq->cpu_runtime += delta_exec;
+		epoch = rq->cpu_epoch;
+	}
+
+	/*
+	 * If this task hasn't hit task_cache_work() for a while, invalidate
+	 * its preferred state.
+	 */
+	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_OLD) {
+		mm->mm_sched_cpu = -1;
+		pcpu_sched->occ = -1;
+	}
+}
+
+static void task_tick_cache(struct rq *rq, struct task_struct *p)
+{
+	struct callback_head *work = &p->cache_work;
+	struct mm_struct *mm = p->mm;
+
+	if (!mm || !mm->pcpu_sched)
+		return;
+
+	if (mm->mm_sched_epoch == rq->cpu_epoch)
+		return;
+
+	guard(raw_spinlock)(&mm->mm_sched_lock);
+
+	if (mm->mm_sched_epoch == rq->cpu_epoch)
+		return;
+
+	if (work->next == work) {
+		task_work_add(p, work, TWA_RESUME);
+		WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch);
+	}
+}
+
+static void task_cache_work(struct callback_head *work)
+{
+	struct task_struct *p = current;
+	struct mm_struct *mm = p->mm;
+	unsigned long m_a_occ = 0;
+	int cpu, m_a_cpu = -1;
+	cpumask_var_t cpus;
+
+	WARN_ON_ONCE(work != &p->cache_work);
+
+	work->next = work;
+
+	if (p->flags & PF_EXITING)
+		return;
+
+	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
+		return;
+
+	scoped_guard (cpus_read_lock) {
+		cpumask_copy(cpus, cpu_online_mask);
+
+		for_each_cpu(cpu, cpus) {
+			/* XXX sched_cluster_active */
+			struct sched_domain *sd = per_cpu(sd_llc, cpu);
+			unsigned long occ, m_occ = 0, a_occ = 0;
+			int m_cpu = -1, nr = 0, i;
+
+			for_each_cpu(i, sched_domain_span(sd)) {
+				occ = fraction_mm_sched(cpu_rq(i),
+							per_cpu_ptr(mm->pcpu_sched, i));
+				a_occ += occ;
+				if (occ > m_occ) {
+					m_occ = occ;
+					m_cpu = i;
+				}
+				nr++;
+				trace_printk("(%d) occ: %ld m_occ: %ld m_cpu: %d nr: %d\n",
+					     per_cpu(sd_llc_id, i), occ, m_occ, m_cpu, nr);
+			}
+
+			a_occ /= nr;
+			if (a_occ > m_a_occ) {
+				m_a_occ = a_occ;
+				m_a_cpu = m_cpu;
+			}
+
+			trace_printk("(%d) a_occ: %ld m_a_occ: %ld\n",
+				     per_cpu(sd_llc_id, cpu), a_occ, m_a_occ);
+
+			for_each_cpu(i, sched_domain_span(sd)) {
+				/* XXX threshold ? */
+				per_cpu_ptr(mm->pcpu_sched, i)->occ = a_occ;
+			}
+
+			cpumask_andnot(cpus, cpus, sched_domain_span(sd));
+		}
+	}
+
+	/*
+	 * If the max average cache occupancy is 'small' we don't care.
+	 */
+	if (m_a_occ < (NICE_0_LOAD >> EPOCH_OLD))
+		m_a_cpu = -1;
+
+	mm->mm_sched_cpu = m_a_cpu;
+
+	free_cpumask_var(cpus);
+}
+
+void init_sched_mm(struct task_struct *p)
+{
+	struct callback_head *work = &p->cache_work;
+	init_task_work(work, task_cache_work);
+	work->next = work;
+}
+
+#else
+
+static inline void account_mm_sched(struct rq *rq, struct task_struct *p,
+				    s64 delta_exec) { }
+
+
+void init_sched_mm(struct task_struct *p) { }
+
+static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
+
+#endif
+
+static inline
+void update_curr_task(struct rq *rq, struct task_struct *p, s64 delta_exec)
 {
 	trace_sched_stat_runtime(p, delta_exec);
 	account_group_exec_runtime(p, delta_exec);
+	account_mm_sched(rq, p, delta_exec);
 	cgroup_account_cputime(p, delta_exec);
 }
 
@@ -1215,7 +1434,7 @@ s64 update_curr_common(struct rq *rq)
 
 	delta_exec = update_curr_se(rq, &donor->se);
 	if (likely(delta_exec > 0))
-		update_curr_task(donor, delta_exec);
+		update_curr_task(rq, donor, delta_exec);
 
 	return delta_exec;
 }
@@ -1244,7 +1463,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	if (entity_is_task(curr)) {
 		struct task_struct *p = task_of(curr);
 
-		update_curr_task(p, delta_exec);
+		update_curr_task(rq, p, delta_exec);
 
 		/*
 		 * If the fair_server is active, we need to account for the
@@ -7862,7 +8081,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 * per-cpu select_rq_mask usage
 	 */
 	lockdep_assert_irqs_disabled();
-
+again:
 	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, target))
 		return target;
@@ -7900,7 +8119,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	/* Check a recently used CPU as a potential idle candidate: */
 	recent_used_cpu = p->recent_used_cpu;
 	p->recent_used_cpu = prev;
-	if (recent_used_cpu != prev &&
+	if (prev == p->wake_cpu &&
+	    recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
 	    cpus_share_cache(recent_used_cpu, target) &&
 	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
@@ -7953,6 +8173,18 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
+	if (prev != p->wake_cpu && !cpus_share_cache(prev, p->wake_cpu)) {
+		/*
+		 * Most likely select_cache_cpu() will have re-directed
+		 * the wakeup, but getting here means the preferred cache is
+		 * too busy, so re-try with the actual previous.
+		 *
+		 * XXX wake_affine is lost for this pass.
+		 */
+		prev = target = p->wake_cpu;
+		goto again;
+	}
+
 	/*
 	 * For cluster machines which have lower sharing cache like L2 or
 	 * LLC Tag, we tend to find an idle CPU in the target's cluster
@@ -8575,6 +8807,40 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	return target;
 }
 
+#ifdef CONFIG_SCHED_CACHE
+static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle);
+
+static int select_cache_cpu(struct task_struct *p, int prev_cpu)
+{
+	struct mm_struct *mm = p->mm;
+	int cpu;
+
+	if (!mm || p->nr_cpus_allowed == 1)
+		return prev_cpu;
+
+	cpu = mm->mm_sched_cpu;
+	if (cpu < 0)
+		return prev_cpu;
+
+
+	if (static_branch_likely(&sched_numa_balancing) &&
+	    __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) {
+		/*
+		 * XXX look for max occupancy inside prev_cpu's node
+		 */
+		return prev_cpu;
+	}
+
+	return cpu;
+}
+#else
+static int select_cache_cpu(struct task_struct *p, int prev_cpu)
+{
+	return prev_cpu;
+}
+#endif
+
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
@@ -8600,6 +8866,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	 * required for stable ->cpus_allowed
 	 */
 	lockdep_assert_held(&p->pi_lock);
+	guard(rcu)();
+
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 
@@ -8607,6 +8875,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		    cpumask_test_cpu(cpu, p->cpus_ptr))
 			return cpu;
 
+		new_cpu = prev_cpu = select_cache_cpu(p, prev_cpu);
+
 		if (!is_rd_overutilized(this_rq()->rd)) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)
@@ -8617,7 +8887,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
 	}
 
-	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
 		/*
 		 * If both 'cpu' and 'prev_cpu' are part of this domain,
@@ -8650,7 +8919,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		/* Fast path */
 		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 	}
-	rcu_read_unlock();
 
 	return new_cpu;
 }
@@ -9300,6 +9568,17 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 	if (sysctl_sched_migration_cost == 0)
 		return 0;
 
+#ifdef CONFIG_SCHED_CACHE
+	if (p->mm && p->mm->pcpu_sched) {
+		/*
+		 * XXX things like Skylake have non-inclusive L3 and might not
+		 * like this L3 centric view. What to do about L2 stickiness ?
+		 */
+		return per_cpu_ptr(p->mm->pcpu_sched, env->src_cpu)->occ >
+		       per_cpu_ptr(p->mm->pcpu_sched, env->dst_cpu)->occ;
+	}
+#endif
+
 	delta = rq_clock_task(env->src_rq) - p->se.exec_start;
 
 	return delta < (s64)sysctl_sched_migration_cost;
@@ -9311,27 +9590,25 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
  * Returns 0, if task migration is not affected by locality.
 * Returns a negative value, if task migration improves locality i.e migration preferred.
 */
-static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle)
 {
 	struct numa_group *numa_group = rcu_dereference(p->numa_group);
 	unsigned long src_weight, dst_weight;
 	int src_nid, dst_nid, dist;
 
-	if (!static_branch_likely(&sched_numa_balancing))
-		return 0;
-
-	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+	if (!p->numa_faults)
 		return 0;
 
-	src_nid = cpu_to_node(env->src_cpu);
-	dst_nid = cpu_to_node(env->dst_cpu);
+	src_nid = cpu_to_node(src_cpu);
+	dst_nid = cpu_to_node(dst_cpu);
 
 	if (src_nid == dst_nid)
 		return 0;
 
 	/* Migrating away from the preferred node is always bad. */
 	if (src_nid == p->numa_preferred_nid) {
-		if (env->src_rq->nr_running > env->src_rq->nr_preferred_running)
+		struct rq *src_rq = cpu_rq(src_cpu);
+		if (src_rq->nr_running > src_rq->nr_preferred_running)
 			return 1;
 		else
 			return 0;
@@ -9342,7 +9619,7 @@ static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 		return -1;
 
 	/* Leaving a core idle is often worse than degrading locality. */
-	if (env->idle == CPU_IDLE)
+	if (idle)
 		return 0;
 
 	dist = node_distance(src_nid, dst_nid);
@@ -9357,7 +9634,24 @@ static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 	return src_weight - dst_weight;
 }
 
+static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+{
+	if (!static_branch_likely(&sched_numa_balancing))
+		return 0;
+
+	if (!(env->sd->flags & SD_NUMA))
+		return 0;
+
+	return __migrate_degrades_locality(p, env->src_cpu, env->dst_cpu,
+					   env->idle == CPU_IDLE);
+}
+
 #else
+static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle)
+{
+	return 0;
+}
+
 static inline long migrate_degrades_locality(struct task_struct *p,
 					     struct lb_env *env)
 {
@@ -13117,8 +13411,8 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
  */
 static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 {
-	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &curr->se;
+	struct cfs_rq *cfs_rq;
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
@@ -13128,6 +13422,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
 
+	task_tick_cache(rq, curr);
+
 	update_misfit_status(curr, rq);
 	check_update_overutilized_status(task_rq(curr));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 83e3aa917142..839463027ab0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1173,6 +1173,12 @@ struct rq {
 	u64			clock_pelt_idle_copy;
 	u64			clock_idle_copy;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	raw_spinlock_t		cpu_epoch_lock;
+	u64			cpu_runtime;
+	unsigned long		cpu_epoch;
+	unsigned long		cpu_epoch_next;
+#endif
 
 	atomic_t		nr_iowait;
 
@@ -3885,6 +3891,8 @@ static inline void task_tick_mm_cid(struct rq *rq, struct task_struct *curr) { }
 static inline void init_sched_mm_cid(struct task_struct *t) { }
 #endif /* !CONFIG_SCHED_MM_CID */
 
+extern void init_sched_mm(struct task_struct *p);
+
 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
 extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
 #ifdef CONFIG_SMP
-- 
2.25.1
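
[Editor's note, not part of the patch] The occupancy arithmetic above (geometric decay of runtime per epoch, occupancy expressed in NICE_0_LOAD units as task runtime over total CPU runtime) can be exercised outside the kernel. Below is a minimal userspace sketch under the assumptions NICE_0_LOAD == 1024 and a 10-jiffy epoch; the names toy_sched/toy_age()/toy_occ() are made up for illustration and only mirror __update_mm_sched()/fraction_mm_sched(), they are not kernel APIs.

/*
 * Userspace sketch of the per-mm occupancy bookkeeping: each elapsed
 * epoch halves both the per-mm and the per-CPU runtime sums, so the
 * occupancy fraction decays unless the task keeps running on that CPU.
 */
#include <stdio.h>
#include <stdint.h>

#define NICE_0_LOAD	1024ULL
#define EPOCH_PERIOD	10	/* "jiffies" per epoch, roughly 10 ms */

struct toy_sched {
	uint64_t mm_runtime;	/* mirrors mm_sched::runtime  */
	uint64_t cpu_runtime;	/* mirrors rq::cpu_runtime    */
	unsigned long epoch;	/* mirrors rq::cpu_epoch      */
	unsigned long next;	/* mirrors rq::cpu_epoch_next */
};

/* Rough equivalent of __update_mm_sched(): age both sums by the elapsed epochs. */
static void toy_age(struct toy_sched *s, unsigned long now)
{
	long delta = (long)(now - s->next);

	if (delta > 0) {
		unsigned long n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;

		s->epoch += n;
		s->next += n * EPOCH_PERIOD;
		s->cpu_runtime = (n >= 64) ? 0 : s->cpu_runtime >> n;
		s->mm_runtime = (n >= 64) ? 0 : s->mm_runtime >> n;
	}
}

/* Rough equivalent of fraction_mm_sched(): occupancy in NICE_0_LOAD units. */
static uint64_t toy_occ(struct toy_sched *s)
{
	return NICE_0_LOAD * s->mm_runtime / (s->cpu_runtime + 1);
}

int main(void)
{
	struct toy_sched s = { .next = EPOCH_PERIOD };
	unsigned long now;

	/* Task consumes 3 ms of every 10 ms epoch on this CPU for 100 ms... */
	for (now = 0; now < 100; now += EPOCH_PERIOD) {
		toy_age(&s, now);
		s.mm_runtime += 3000000;	/* 3 ms in ns */
		s.cpu_runtime += 10000000;	/* 10 ms of total CPU time in ns */
	}
	printf("occ while running: %llu/1024\n", (unsigned long long)toy_occ(&s));

	/* ...then goes idle; each further epoch roughly halves its occupancy. */
	for (; now < 150; now += EPOCH_PERIOD) {
		toy_age(&s, now);
		s.cpu_runtime += 10000000;	/* other tasks keep the CPU busy */
	}
	printf("occ after 50 ms idle: %llu/1024\n", (unsigned long long)toy_occ(&s));
	return 0;
}

Because the runtime is a geometric series with r=0.5, the steady-state sums converge to about twice the per-epoch contributions, which is why the occupancy fraction stays bounded and the NICE_0_LOAD multiplication in fraction_mm_sched() does not overflow.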