From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R.
Shenoy" , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , linux-kernel@vger.kernel.org Subject: [Patch v4 16/22] sched/cache: Disable cache aware scheduling for processes with high thread counts Date: Wed, 1 Apr 2026 14:52:28 -0700 Message-Id: <47cc4cffecdac2770a719c84bec3b459a1256def.1775065312.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu A performance regression was observed by Prateek when running hackbench with many threads per process (high fd count). To avoid this, processes with a large number of active threads are excluded from cache-aware scheduling. With sched_cache enabled, record the number of active threads in each process during the periodic task_cache_work(). While iterating over CPUs, if the currently running task belongs to the same process as the task that launched task_cache_work(), increment the active thread count. If the number of active threads within the process exceeds the number of Cores(divided by SMTs number) in the LLC, do not enable cache-aware scheduling. However, on system with smaller number of CPUs within 1 LLC, like Power10/Power11 with SMT4 and LLC size of 4, this check effectively disables cache-aware scheduling for any process. One possible solution suggested by Peter is to use a LLC-mask instead of a single LLC value for preference. Once there are a 'few' LLCs as preference, this constraint becomes a little easier. It could be an enhancement in the future. For users who wish to perform task aggregation regardless, a debugfs knob is provided for tuning in a subsequent patch. 
Suggested-by: K Prateek Nayak
Suggested-by: Aaron Lu
Signed-off-by: Chen Yu
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
---
Notes:
    v3->v4: Use cpu_smt_num_threads instead of
        cpumask_weight(cpu_smt_mask(cpu)). (Peter Zijlstra)

 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 54 +++++++++++++++++++++++++++++++++++++++----
 2 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 526108acc483..dfa4bfd099c6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2392,6 +2392,7 @@ struct sched_cache_stat {
 	struct sched_cache_time __percpu *pcpu_sched;
 	raw_spinlock_t lock;
 	unsigned long epoch;
+	u64 nr_running_avg;
 	int cpu;
 } ____cacheline_aligned_in_smp;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9541e94370e7..077ae7875e2e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1316,6 +1316,12 @@ static inline bool valid_llc_buf(struct sched_domain *sd,
 	return true;
 }
 
+static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
+{
+	return !fits_capacity((mm->sc_stat.nr_running_avg * cpu_smt_num_threads),
+			      per_cpu(sd_llc_size, cpu));
+}
+
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
 {
 	struct sched_domain *sd;
@@ -1507,7 +1513,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 */
 	if (time_after(epoch, READ_ONCE(mm->sc_stat.epoch) +
 		       EPOCH_LLC_AFFINITY_TIMEOUT) ||
-	    get_nr_threads(p) <= 1) {
+	    get_nr_threads(p) <= 1 ||
+	    exceed_llc_nr(mm, cpu_of(rq))) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;
 	}
@@ -1592,13 +1599,31 @@ static void get_scan_cpumasks(cpumask_var_t cpus, struct task_struct *p)
 	cpumask_copy(cpus, cpu_online_mask);
 }
 
+static inline void update_avg_scale(u64 *avg, u64 sample)
+{
+	int factor = per_cpu(sd_llc_size, raw_smp_processor_id());
+	s64 diff = sample - *avg;
+	u32 divisor;
+
+	/*
+	 * Scale the divisor based on the number of CPUs contained
+	 * in the LLC. This scaling ensures smaller LLC domains use
+	 * a smaller divisor to achieve more precise sensitivity to
+	 * changes in nr_running, while larger LLC domains are capped
+	 * at a maximum divisor of 8 which is the default smoothing
+	 * factor of EWMA in update_avg().
+	 */
+	divisor = clamp_t(u32, (factor >> 2), 2, 8);
+	*avg += div64_s64(diff, divisor);
+}
+
 static void task_cache_work(struct callback_head *work)
 {
-	struct task_struct *p = current;
+	struct task_struct *p = current, *cur;
+	int cpu, m_a_cpu = -1, nr_running = 0;
+	unsigned long curr_m_a_occ = 0;
 	struct mm_struct *mm = p->mm;
 	unsigned long m_a_occ = 0;
-	unsigned long curr_m_a_occ = 0;
-	int cpu, m_a_cpu = -1;
 	cpumask_var_t cpus;
 
 	WARN_ON_ONCE(work != &p->cache_work);
@@ -1608,6 +1633,13 @@ static void task_cache_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
+	if (get_nr_threads(p) <= 1) {
+		if (mm->sc_stat.cpu != -1)
+			mm->sc_stat.cpu = -1;
+
+		return;
+	}
+
 	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
 		return;
 
@@ -1631,6 +1663,12 @@ static void task_cache_work(struct callback_head *work)
 			m_occ = occ;
 			m_cpu = i;
 		}
+		scoped_guard (rcu) {
+			cur = rcu_dereference_all(cpu_rq(i)->curr);
+			if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
+			    cur->mm == mm)
+				nr_running++;
+		}
 	}
 
 	/*
@@ -1674,6 +1712,7 @@ static void task_cache_work(struct callback_head *work)
 		mm->sc_stat.cpu = m_a_cpu;
 	}
 
+	update_avg_scale(&mm->sc_stat.nr_running_avg, nr_running);
 	free_cpumask_var(cpus);
 }
 
@@ -10105,6 +10144,13 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
 	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
 		return mig_unrestricted;
 
+	/* skip cache aware load balance for single/too many threads */
+	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu)) {
+		if (mm->sc_stat.cpu != -1)
+			mm->sc_stat.cpu = -1;
+		return mig_unrestricted;
+	}
+
 	if (cpus_share_cache(dst_cpu, cpu))
 		to_pref = true;
 	else if (cpus_share_cache(src_cpu, cpu))
-- 
2.32.0
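
For reference, the divisor used by update_avg_scale() above works out to
sd_llc_size / 4 clamped to [2, 8]: a 4-CPU LLC averages over ~2 epochs and
reacts quickly, while any LLC of 32 or more CPUs uses the stock
update_avg() smoothing factor of 8. The standalone sketch below (plain C,
illustration only; clamp_t() and div64_s64() are open-coded, and the
sample values are made up) shows the convergence behavior:

  #include <stdio.h>

  /* Illustration only: the EWMA from update_avg_scale(), open-coded. */
  static void update_avg_scale(long long *avg, long long sample,
                               int llc_size)
  {
          unsigned int divisor = llc_size >> 2;

          if (divisor < 2)
                  divisor = 2;    /* small LLCs react faster */
          else if (divisor > 8)
                  divisor = 8;    /* cap at update_avg()'s factor */
          *avg += (sample - *avg) / divisor;
  }

  int main(void)
  {
          long long avg = 0;

          /* A 32-CPU LLC (divisor 8) closes 1/8 of the gap per epoch. */
          for (int epoch = 0; epoch < 4; epoch++) {
                  update_avg_scale(&avg, 16, 32);
                  printf("epoch %d: avg = %lld\n", epoch, avg);
          }
          return 0;
  }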