From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R.
 Shenoy, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
 Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton,
 Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
 Len Brown, Tim Chen, Aubrey Li, Zhao Liu, Adam Li, Aaron Lu,
 Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts
Date: Tue, 10 Feb 2026 14:18:55 -0800
X-Mailer: git-send-email 2.32.0

From: Chen Yu

A performance regression was observed by Prateek when running hackbench
with many threads per process (high fd count). To avoid this, exclude
processes with a large number of active threads from cache-aware
scheduling.

With sched_cache enabled, record the number of active threads in each
process during the periodic task_cache_work(). While iterating over the
CPUs, if the currently running task belongs to the same process as the
task that launched task_cache_work(), increment the active thread count.
If the number of active threads within the process exceeds the number of
cores in the LLC (CPUs divided by the number of SMT siblings), do not
enable cache-aware scheduling.

For users who wish to perform task aggregation regardless, a subsequent
patch provides a debugfs knob for tuning.

Suggested-by: K Prateek Nayak
Suggested-by: Aaron Lu
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
Signed-off-by: Chen Yu
---

Notes:
    v2->v3:
    Put the calculation of nr_running_avg and its use into one patch.
    (Peter Zijlstra)

    Use guard(rcu)() when calculating the number of active threads
    of the process. (Peter Zijlstra)

    Introduce update_avg_scale() rather than using update_avg() to fit
    systems with small LLCs. (Aaron Lu)

 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 59 ++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c98bd1c46088..511c9b263386 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2346,6 +2346,7 @@ struct sched_cache_stat {
 	struct sched_cache_time	__percpu *pcpu_sched;
 	raw_spinlock_t		lock;
 	unsigned long		epoch;
+	u64			nr_running_avg;
 	int			cpu;
 } ____cacheline_aligned_in_smp;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d1145997b88d..86b6b08e7e1e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1223,6 +1223,19 @@ static inline bool valid_llc_buf(struct sched_domain *sd,
 	return valid_llc_id(id);
 }
 
+static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
+{
+	int smt_nr = 1;
+
+#ifdef CONFIG_SCHED_SMT
+	if (sched_smt_active())
+		smt_nr = cpumask_weight(cpu_smt_mask(cpu));
+#endif
+
+	return !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
+			      per_cpu(sd_llc_size, cpu));
+}
+
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
 {
 	struct sched_domain *sd;
@@ -1417,7 +1430,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 */
 	if (time_after(epoch, READ_ONCE(mm->sc_stat.epoch) +
		       EPOCH_LLC_AFFINITY_TIMEOUT) ||
-	    get_nr_threads(p) <= 1) {
+	    get_nr_threads(p) <= 1 ||
+	    exceed_llc_nr(mm, cpu_of(rq))) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;
 	}
@@ -1458,13 +1472,31 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
 	}
 }
 
+static inline void update_avg_scale(u64 *avg, u64 sample)
+{
+	int factor = per_cpu(sd_llc_size, raw_smp_processor_id());
+	s64 diff = sample - *avg;
+	u32 divisor;
+
+	/*
+	 * Scale the divisor based on the number of CPUs contained
+	 * in the LLC. This scaling ensures smaller LLC domains use
+	 * a smaller divisor to achieve more precise sensitivity to
+	 * changes in nr_running, while larger LLC domains are capped
+	 * at a maximum divisor of 8, which is the default smoothing
+	 * factor of the EWMA in update_avg().
+	 */
+	divisor = clamp_t(u32, (factor >> 2), 2, 8);
+	*avg += div64_s64(diff, divisor);
+}
+
 static void task_cache_work(struct callback_head *work)
 {
-	struct task_struct *p = current;
+	struct task_struct *p = current, *cur;
 	struct mm_struct *mm = p->mm;
 	unsigned long m_a_occ = 0;
 	unsigned long curr_m_a_occ = 0;
-	int cpu, m_a_cpu = -1;
+	int cpu, m_a_cpu = -1, nr_running = 0;
 	cpumask_var_t cpus;
 
 	WARN_ON_ONCE(work != &p->cache_work);
@@ -1474,6 +1506,13 @@ static void task_cache_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
+	if (get_nr_threads(p) <= 1) {
+		if (mm->sc_stat.cpu != -1)
+			mm->sc_stat.cpu = -1;
+
+		return;
+	}
+
 	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
 		return;
 
@@ -1497,6 +1536,12 @@ static void task_cache_work(struct callback_head *work)
 			m_occ = occ;
 			m_cpu = i;
 		}
+		scoped_guard (rcu) {
+			cur = rcu_dereference(cpu_rq(i)->curr);
+			if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
+			    cur->mm == mm)
+				nr_running++;
+		}
 	}
 
 	/*
@@ -1540,6 +1585,7 @@ static void task_cache_work(struct callback_head *work)
 		mm->sc_stat.cpu = m_a_cpu;
 	}
 
+	update_avg_scale(&mm->sc_stat.nr_running_avg, nr_running);
 	free_cpumask_var(cpus);
 }
 
@@ -9988,6 +10034,13 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
 	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
 		return mig_unrestricted;
 
+	/* skip cache aware load balance for single/too many threads */
+	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu)) {
+		if (mm->sc_stat.cpu != -1)
+			mm->sc_stat.cpu = -1;
+		return mig_unrestricted;
+	}
+
 	if (cpus_share_cache(dst_cpu, cpu))
 		to_pref = true;
 	else if (cpus_share_cache(src_cpu, cpu))
-- 
2.32.0