From: Chen Jinghuang
Subject: [RFC PATCH v5 9/9] sched/fair: Provide idle search schedstats
Date: Fri, 20 Mar 2026 05:59:20 +0000
Message-ID: <20260320055920.2518389-10-chenjinghuang2@huawei.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260320055920.2518389-1-chenjinghuang2@huawei.com>
References: <20260320055920.2518389-1-chenjinghuang2@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain;
	charset="utf-8"

From: Steve Sistare

Add schedstats to measure the effectiveness of searching for idle CPUs
and stealing tasks.  This is a temporary patch intended for use during
development only.

SCHEDSTAT_VERSION is bumped to 16, and the following fields are added
to the per-CPU statistics of /proc/schedstat:

field 10: # of times select_idle_sibling "easily" found an idle CPU --
          prev or target is idle.
field 11: # of times select_idle_sibling searched and found an idle cpu.
field 12: # of times select_idle_sibling searched and found an idle core.
field 13: # of times select_idle_sibling failed to find anything idle.
field 14: time in nanoseconds spent in functions that search for idle
          CPUs and search for tasks to steal.
field 15: # of times an idle CPU steals a task from another CPU.
field 16: # of times try_steal finds overloaded CPUs but no task is
          migratable.

Signed-off-by: Steve Sistare
Signed-off-by: Chen Jinghuang
---
 kernel/sched/core.c  | 31 +++++++++++++++++++++++--
 kernel/sched/fair.c  | 54 ++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h |  9 ++++++++
 kernel/sched/stats.c |  9 ++++++++
 kernel/sched/stats.h | 13 +++++++++++
 5 files changed, 109 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 759777694c78..841a4ca7e173 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4505,17 +4505,44 @@ static int sysctl_numa_balancing(const struct ctl_table *table, int write,
 
 DEFINE_STATIC_KEY_FALSE(sched_schedstats);
 
+unsigned long schedstat_skid;
+
+static void compute_skid(void)
+{
+	int i, n = 0;
+	s64 t;
+	int skid = 0;
+
+	for (i = 0; i < 100; i++) {
+		t = local_clock();
+		t = local_clock() - t;
+		if (t > 0 && t < 1000) {	/* only use sane samples */
+			skid += (int) t;
+			n++;
+		}
+	}
+
+	if (n > 0)
+		schedstat_skid = skid / n;
+	else
+		schedstat_skid = 0;
+	pr_info("schedstat_skid = %lu\n", schedstat_skid);
+}
+
 static void set_schedstats(bool enabled)
 {
-	if (enabled)
+	if (enabled) {
+		compute_skid();
 		static_branch_enable(&sched_schedstats);
-	else
+	} else {
 		static_branch_disable(&sched_schedstats);
+	}
 }
 
 void force_schedstat_enabled(void)
 {
 	if (!schedstat_enabled()) {
+		compute_skid();
 		pr_info("kernel profiling enabled schedstats, disable via kernel.sched_schedstats.\n");
 		static_branch_enable(&sched_schedstats);
 	}
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 500215a57392..ba2b9f811135 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5091,29 +5091,35 @@ static inline void rq_idle_stamp_clear(struct rq *rq)
 static void overload_clear(struct rq *rq)
 {
 	struct sparsemask *overload_cpus;
+	unsigned long time;
 
 	if (!sched_feat(STEAL))
 		return;
 
+	time = schedstat_start_time();
 	rcu_read_lock();
 	overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
 	if (overload_cpus)
 		sparsemask_clear_elem(overload_cpus, rq->cpu);
 	rcu_read_unlock();
+	schedstat_end_time(rq->find_time, time);
 }
 
 static void overload_set(struct rq *rq)
 {
 	struct sparsemask *overload_cpus;
+	unsigned long time;
 
 	if (!sched_feat(STEAL))
 		return;
 
+	time = schedstat_start_time();
 	rcu_read_lock();
 	overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
 	if (overload_cpus)
 		sparsemask_set_elem(overload_cpus, rq->cpu);
 	rcu_read_unlock();
+	schedstat_end_time(rq->find_time, time);
 }
 
 static int try_steal(struct rq *this_rq, struct rq_flags *rf);
@@ -7830,6 +7836,16 @@ static inline bool asym_fits_cpu(unsigned long util,
 	return true;
 }
 
+#define SET_STAT(STAT)						\
+	do {							\
+		if (schedstat_enabled()) {			\
+			struct rq *rq = this_rq();		\
+								\
+			if (rq)					\
+				__schedstat_inc(rq->STAT);	\
+		}						\
+	} while (0)
+
 /*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
@@ -7857,8 +7873,10 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	lockdep_assert_irqs_disabled();
 
 	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
-	    asym_fits_cpu(task_util, util_min, util_max, target))
+	    asym_fits_cpu(task_util, util_min, util_max, target)) {
+		SET_STAT(found_idle_cpu_easy);
 		return target;
+	}
 
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
@@ -7868,8 +7886,10 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	    asym_fits_cpu(task_util, util_min, util_max, prev)) {
 
 		if (!static_branch_unlikely(&sched_cluster_active) ||
-		    cpus_share_resources(prev, target))
+		    cpus_share_resources(prev, target)) {
+			SET_STAT(found_idle_cpu_easy);
 			return prev;
+		}
 
 		prev_aff = prev;
 	}
@@ -7887,6 +7907,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	    prev == smp_processor_id() &&
 	    this_rq()->nr_running <= 1 &&
 	    asym_fits_cpu(task_util, util_min, util_max, prev)) {
+		SET_STAT(found_idle_cpu_easy);
 		return prev;
 	}
 
@@ -7901,8 +7922,10 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	    asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
 
 		if (!static_branch_unlikely(&sched_cluster_active) ||
-		    cpus_share_resources(recent_used_cpu, target))
+		    cpus_share_resources(recent_used_cpu, target)) {
+			SET_STAT(found_idle_cpu_easy);
 			return recent_used_cpu;
+		}
 
 	} else {
 		recent_used_cpu = -1;
@@ -7924,13 +7947,16 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 */
 		if (sd) {
 			i = select_idle_capacity(p, sd, target);
+			SET_STAT(found_idle_cpu_capacity);
 			return ((unsigned)i < nr_cpumask_bits) ?
 				i : target;
 		}
 	}
 
 	sd = rcu_dereference_all(per_cpu(sd_llc, target));
-	if (!sd)
+	if (!sd) {
+		SET_STAT(nofound_idle_cpu);
 		return target;
+	}
 
 	if (sched_smt_active()) {
 		has_idle_core = test_idle_cores(target);
@@ -7943,8 +7969,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	}
 
 	i = select_idle_cpu(p, sd, has_idle_core, target);
-	if ((unsigned)i < nr_cpumask_bits)
+	if ((unsigned)i < nr_cpumask_bits) {
+		SET_STAT(found_idle_cpu);
 		return i;
+	}
+
+	SET_STAT(nofound_idle_cpu);
 
 	/*
 	 * For cluster machines which have lower sharing cache like L2 or
@@ -8580,6 +8610,7 @@ static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 {
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
+	unsigned long time;
 	struct sched_domain *tmp, *sd = NULL;
 	int cpu = smp_processor_id();
 	int new_cpu = prev_cpu;
@@ -8587,6 +8618,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	/* SD_flags and WF_flags share the first nibble */
 	int sd_flag = wake_flags & 0xF;
 
+	time = schedstat_start_time();
+
 	/*
 	 * required for stable ->cpus_allowed
 	 */
@@ -8643,6 +8676,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	}
 	rcu_read_unlock();
 
+	schedstat_end_time(cpu_rq(cpu)->find_time, time);
+
 	return new_cpu;
 }
 
@@ -8981,6 +9016,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	struct sched_entity *se;
 	struct task_struct *p;
 	int new_tasks;
+	unsigned long time;
 
 again:
 	p = pick_task_fair(rq, rf);
@@ -9038,6 +9074,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 
 idle:
 	if (rf) {
+		time = schedstat_start_time();
 		/*
 		 * We must set idle_stamp _before_ calling try_steal() or
 		 * sched_balance_newidle(), such that we measure the duration
@@ -9052,6 +9089,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 		if (new_tasks)
 			rq_idle_stamp_clear(rq);
 
+		schedstat_end_time(rq->find_time, time);
+
 		/*
 		 * Because try_steal() and sched_balance_newidle() release
 		 * (and re-acquire) rq->lock, it is possible for any higher priority
@@ -13215,6 +13254,7 @@ static int steal_from(struct rq *dst_rq, struct rq_flags *dst_rf, bool *locked,
 		update_rq_clock(dst_rq);
 		attach_task(dst_rq, p);
 		stolen = 1;
+		schedstat_inc(dst_rq->steal);
 	}
 	local_irq_restore(rf.flags);
 
@@ -13239,6 +13279,7 @@ static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
 	int dst_cpu = dst_rq->cpu;
 	bool locked = true;
 	int stolen = 0;
+	bool any_overload = false;
 	struct sparsemask *overload_cpus;
 
 	if (!sched_feat(STEAL))
@@ -13281,6 +13322,7 @@ static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
 			stolen = 1;
 			goto out;
 		}
+		any_overload = true;
 	}
 
 out:
@@ -13292,6 +13334,8 @@ static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
 	stolen |= (dst_rq->cfs.h_nr_runnable > 0);
 	if (dst_rq->nr_running != dst_rq->cfs.h_nr_runnable)
 		stolen = -1;
+	if (!stolen && any_overload)
+		schedstat_inc(dst_rq->steal_fail);
 	return stolen;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4989a92eeb9b..530b80fbf897 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1304,6 +1304,15 @@ struct rq {
 	/* try_to_wake_up() stats */
 	unsigned int		ttwu_count;
 	unsigned int		ttwu_local;
+
+	/* Idle search stats */
+	unsigned int		found_idle_cpu_capacity;
+	unsigned int		found_idle_cpu;
+	unsigned int		found_idle_cpu_easy;
+	unsigned int		nofound_idle_cpu;
+	unsigned long		find_time;
+	unsigned int		steal;
+	unsigned int		steal_fail;
 #endif
 
 #ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index d1c9429a4ac5..7063c9712f68 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -129,6 +129,15 @@ static int show_schedstat(struct seq_file *seq, void *v)
 		    rq->rq_cpu_time,
 		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
 
+		seq_printf(seq, " %u %u %u %u %lu %u %u",
+			   rq->found_idle_cpu_easy,
+			   rq->found_idle_cpu_capacity,
+			   rq->found_idle_cpu,
+			   rq->nofound_idle_cpu,
+			   rq->find_time,
+			   rq->steal,
+			   rq->steal_fail);
+
 		seq_printf(seq, "\n");
 
 		/* domain-specific stats */
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index a612cf253c87..55f31a4df8fa 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -43,6 +43,17 @@ rq_sched_info_dequeue(struct rq *rq, unsigned long long delta)
 #define   schedstat_set(var, val)	do { if (schedstat_enabled()) { var = (val); } } while (0)
 #define   schedstat_val(var)		(var)
 #define   schedstat_val_or_zero(var)	((schedstat_enabled()) ? (var) : 0)
+#define   schedstat_start_time()	schedstat_val_or_zero(local_clock())
+#define   schedstat_end_time(stat, time)			\
+	do {							\
+		unsigned long endtime;				\
+								\
+		if (schedstat_enabled() && (time)) {		\
+			endtime = local_clock() - (time) - schedstat_skid; \
+			schedstat_add((stat), endtime);		\
+		}						\
+	} while (0)
+extern unsigned long schedstat_skid;
 
 void __update_stats_wait_start(struct rq *rq, struct task_struct *p,
 			       struct sched_statistics *stats);
@@ -81,6 +92,8 @@ static inline void rq_sched_info_depart (struct rq *rq, unsigned long long delt
 # define   schedstat_set(var, val)	do { } while (0)
 # define   schedstat_val(var)		0
 # define   schedstat_val_or_zero(var)	0
+# define   schedstat_start_time()	0
+# define   schedstat_end_time(stat, t)	do { } while (0)
 
 # define __update_stats_wait_start(rq, p, stats)	do { } while (0)
 # define __update_stats_wait_end(rq, p, stats)		do { } while (0)
-- 
2.34.1
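For development use, the seven new per-CPU counters can be pulled out of /proc/schedstat with a short script. This is not part of the patch; it is a minimal userspace sketch assuming the version-16 layout described in the commit message, where the new fields occupy positions 10-16 after the nine pre-existing per-CPU fields, in the order emitted by the seq_printf above. The sample line and its values are invented for illustration:

```python
def parse_idle_stats(schedstat_text):
    """Extract the idle-search fields (fields 10-16) from version-16
    /proc/schedstat per-CPU lines, keyed by CPU name."""
    names = ("found_idle_cpu_easy", "found_idle_cpu_capacity",
             "found_idle_cpu", "nofound_idle_cpu",
             "find_time", "steal", "steal_fail")
    stats = {}
    for line in schedstat_text.splitlines():
        # Per-CPU lines start with "cpuN"; domain lines start with "domain".
        if line.startswith("cpu"):
            cpu, *vals = line.split()
            vals = [int(v) for v in vals]
            # Skip the nine pre-existing fields; take fields 10-16.
            stats[cpu] = dict(zip(names, vals[9:16]))
    return stats

# Invented sample: nine legacy fields followed by the seven new ones.
sample = """version 16
timestamp 4294892985
cpu0 0 0 0 0 0 0 1000 500 10 100 20 30 5 40000 7 2
"""
s = parse_idle_stats(sample)["cpu0"]
print(s["steal"], s["find_time"])  # prints: 7 40000
```

Note that `find_time` is already skid-compensated by `schedstat_end_time`, so it can be compared across /proc/schedstat snapshots directly; the counters are cumulative, so rates require differencing two reads.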