From: Luo Gengkun
Subject: [PATCH v2] sched/cache: Reduce the overhead of task_cache_work by only scanning the visited CPUs
Date: Tue, 14 Apr 2026 15:07:45 +0000
Message-ID: <20260414150745.225416-1-luogengkun2@huawei.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <4fb7a6da-447d-452a-a920-7cd39b939ccb@intel.com>
References: <4fb7a6da-447d-452a-a920-7cd39b939ccb@intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

The overhead of task_cache_work is high, especially on multi-NUMA
systems. Currently, task_cache_work tries to find the pref_llc by
scanning all CPUs in the system. However, most of these scans are
meaningless, such as those of CPUs that the mm has never run on or
last ran on a long time ago.

To address this problem, this patch introduces visited_cpus to track
the visited CPUs and uses llc_epoch_visited_timeout to evict CPUs
that have timed out.

Signed-off-by: Luo Gengkun
---
Thanks for the reviews. I've updated the patch based on your feedback.

v2 Changes:
1. Added a pre-check before setting/clearing visited_cpus to avoid
   cache-to-cache (C2C) overhead.
2. Optimized llc_epoch_visited_timeout by using a static key to
   minimize overhead.
---
 include/linux/sched.h |  1 +
 kernel/sched/debug.c  | 50 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c   | 25 +++++++++++++++++++++++---
 kernel/sched/sched.h  |  6 ++++++
 4 files changed, 79 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dfa4bfd099c6..f2327a13fda8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2390,6 +2390,7 @@ struct sched_cache_time {
 
 struct sched_cache_stat {
 	struct sched_cache_time __percpu *pcpu_sched;
+	struct cpumask visited_cpus;
 	raw_spinlock_t lock;
 	unsigned long epoch;
 	u64 nr_running_avg;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4469e1c152c8..46aa73939f9e 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -247,6 +247,54 @@ static const struct file_operations sched_cache_enable_fops = {
 	.llseek		= seq_lseek,
 	.release	= single_release,
 };
+
+static void sched_cache_timeout_set(void)
+{
+	if (llc_epoch_visited_timeout) {
+		if (!static_branch_likely(&sched_cache_timeout))
+			static_branch_enable(&sched_cache_timeout);
+	} else {
+		if (static_branch_likely(&sched_cache_timeout))
+			static_branch_disable(&sched_cache_timeout);
+	}
+}
+
+static ssize_t
+sched_cache_timeout_enable_write(struct file *filp, const char __user *ubuf,
+				 size_t cnt, loff_t *ppos)
+{
+	int val, ret;
+
+	ret = kstrtouint_from_user(ubuf, cnt, 10, &val);
+	if (ret)
+		return ret;
+
+	llc_epoch_visited_timeout = val;
+
+	sched_cache_timeout_set();
+
+	return cnt;
+}
+
+static int sched_cache_timeout_enable_show(struct seq_file *m, void *v)
+{
+	seq_printf(m, "%d\n", llc_epoch_visited_timeout);
+	return 0;
+}
+
+static int sched_cache_timeout_enable_open(struct inode *inode,
+					   struct file *filp)
+{
+	return single_open(filp, sched_cache_timeout_enable_show, NULL);
+}
+
+static const struct file_operations sched_cache_timeout_enable_fops = {
+	.open		= sched_cache_timeout_enable_open,
+	.write		= sched_cache_timeout_enable_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
 #endif
 
 #ifdef CONFIG_PREEMPT_DYNAMIC
@@ -669,6 +717,8 @@ static __init int sched_init_debug(void)
 	llc = debugfs_create_dir("llc_balancing", debugfs_sched);
 	debugfs_create_file("enabled", 0644, llc, NULL,
 			    &sched_cache_enable_fops);
+	debugfs_create_file("epoch_visited_timeout", 0644, llc, NULL,
+			    &sched_cache_timeout_enable_fops);
 	debugfs_create_u32("aggr_tolerance", 0644, llc,
 			   &llc_aggr_tolerance);
 	debugfs_create_u32("epoch_period", 0644, llc,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4e22696a0b1..89f44ea97fee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1285,9 +1285,12 @@ static void set_next_buddy(struct sched_entity *se);
 __read_mostly unsigned int llc_aggr_tolerance = 1;
 __read_mostly unsigned int llc_epoch_period = EPOCH_PERIOD;
 __read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
+__read_mostly unsigned int llc_epoch_visited_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
 __read_mostly unsigned int llc_imb_pct = 20;
 __read_mostly unsigned int llc_overaggr_pct = 50;
 
+DEFINE_STATIC_KEY_TRUE(sched_cache_timeout);
+
 static int llc_id(int cpu)
 {
 	if (cpu < 0)
@@ -1466,6 +1469,7 @@ void mm_init_sched(struct mm_struct *mm,
 	raw_spin_lock_init(&mm->sc_stat.lock);
 	mm->sc_stat.epoch = epoch;
 	mm->sc_stat.cpu = -1;
+	cpumask_clear(&mm->sc_stat.visited_cpus);
 
 	/*
 	 * The update to mm->sc_stat should not be reordered
@@ -1582,6 +1586,9 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 		pcpu_sched->runtime += delta_exec;
 		rq->cpu_runtime += delta_exec;
 		epoch = rq->cpu_epoch;
+		if (sched_cache_timeout_enabled() &&
+		    !cpumask_test_cpu(cpu_of(rq), &mm->sc_stat.visited_cpus))
+			cpumask_set_cpu(cpu_of(rq), &mm->sc_stat.visited_cpus);
 	}
 
 	/*
@@ -1724,7 +1731,10 @@ static void task_cache_work(struct callback_head *work)
 		return;
 
 	scoped_guard (cpus_read_lock) {
-		get_scan_cpumasks(cpus, p);
+		if (!sched_cache_timeout_enabled())
+			get_scan_cpumasks(cpus, p);
+		else
+			cpumask_and(cpus, cpu_online_mask, &mm->sc_stat.visited_cpus);
 
 		for_each_cpu(cpu, cpus) {
 			/* XXX sched_cluster_active */
@@ -1736,8 +1746,17 @@ static void task_cache_work(struct callback_head *work)
 				continue;
 
 			for_each_cpu(i, sched_domain_span(sd)) {
-				occ = fraction_mm_sched(cpu_rq(i),
-							per_cpu_ptr(mm->sc_stat.pcpu_sched, i));
+				struct rq *rq = cpu_rq(i);
+				struct sched_cache_time *pcpu_sched = per_cpu_ptr(mm->sc_stat.pcpu_sched, i);
+				/* Skip the rq that has not been hit for a long time */
+				if (sched_cache_timeout_enabled() &&
+				    cpumask_test_cpu(cpu_of(rq), &mm->sc_stat.visited_cpus) &&
+				    (rq->cpu_epoch - pcpu_sched->epoch) >
+				    llc_epoch_visited_timeout) {
+					cpumask_clear_cpu(cpu_of(rq), &mm->sc_stat.visited_cpus);
+					continue;
+				}
+				occ = fraction_mm_sched(rq, pcpu_sched);
 				a_occ += occ;
 				if (occ > m_occ) {
 					m_occ = occ;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b757812725f7..2ba09e9567af 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4037,10 +4037,12 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct
 #ifdef CONFIG_SCHED_CACHE
 DECLARE_STATIC_KEY_FALSE(sched_cache_present);
 DECLARE_STATIC_KEY_FALSE(sched_cache_active);
+DECLARE_STATIC_KEY_TRUE(sched_cache_timeout);
 extern int sysctl_sched_cache_user;
 extern unsigned int llc_aggr_tolerance;
 extern unsigned int llc_epoch_period;
 extern unsigned int llc_epoch_affinity_timeout;
+extern unsigned int llc_epoch_visited_timeout;
 extern unsigned int llc_imb_pct;
 extern unsigned int llc_overaggr_pct;
 
@@ -4051,6 +4053,10 @@ static inline bool sched_cache_enabled(void)
 
 extern void sched_cache_active_set_unlocked(void);
 
+static inline bool sched_cache_timeout_enabled(void)
+{
+	return static_branch_unlikely(&sched_cache_timeout);
+}
 #endif
 
 void sched_domains_free_llc_id(int cpu);
-- 
2.34.1
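
For anyone who wants to experiment with the bookkeeping outside the kernel,
the visited-mask idea can be modeled in userspace roughly as below. This is
only a sketch under stated assumptions, not the kernel code: struct mm_model,
model_account() and model_scan() are hypothetical names, a single 64-bit word
stands in for struct cpumask, last_epoch[] stands in for the per-CPU
pcpu_sched->epoch and is assumed to be refreshed whenever the mm runs on that
CPU, and the per-LLC occupancy calculation is left out entirely.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 64	/* illustrative limit so the mask fits in one word */

/* Hypothetical userspace stand-in for the per-mm state touched by the patch. */
struct mm_model {
	uint64_t visited;		/* models sc_stat.visited_cpus */
	uint64_t last_epoch[NR_CPUS];	/* models pcpu_sched->epoch (assumption) */
};

/*
 * Models the account_mm_sched() hunk: record that the mm ran on @cpu.
 * The bit is tested before it is set so that an already-set bit does not
 * cause a write to the shared word (the "pre-check" from the v2 notes).
 */
static void model_account(struct mm_model *mm, unsigned int cpu, uint64_t now)
{
	uint64_t bit;

	assert(cpu < NR_CPUS);
	bit = UINT64_C(1) << cpu;
	mm->last_epoch[cpu] = now;
	if (!(mm->visited & bit))
		mm->visited |= bit;
}

/*
 * Models the scan in task_cache_work(): only visited CPUs are considered,
 * and a CPU whose last visit is older than @timeout epochs is evicted from
 * the mask. Returns how many CPUs would still be scanned for occupancy.
 */
static unsigned int model_scan(struct mm_model *mm, uint64_t now,
			       uint64_t timeout)
{
	unsigned int cpu, scanned = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		uint64_t bit = UINT64_C(1) << cpu;

		if (!(mm->visited & bit))
			continue;		/* never visited, or already evicted */
		if (now - mm->last_epoch[cpu] > timeout) {
			mm->visited &= ~bit;	/* stale: evict and skip */
			continue;
		}
		scanned++;
	}
	return scanned;
}
```

In this model a CPU that was evicted re-enters the mask the next time
model_account() runs for it, which mirrors how the patch sets the bit again
from account_mm_sched(). The real scan additionally walks every CPU in the
LLC span of each visited CPU, which the model does not attempt to reproduce.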