From: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Modify the wakeup path in `select_task_rq_fair()` to prioritize cache
locality for waking tasks. The previous fast path always attempted to
find an idle sibling, even if the task's prev CPU was not truly busy.
The original problem was that under some circumstances, this could lead
to unnecessary task migrations away from a cache-hot core, even when
the task's prev CPU was a suitable candidate. The scheduler's internal
mechanism `cpu_overutilized()` provides an evaluation of CPU load.
To address this, the wakeup heuristic is updated to check the status of
the task's `prev_cpu` first:
- If the `prev_cpu` is not overutilized (as determined by
`cpu_overutilized()`, via PELT), the task is woken up on
its previous CPU. This leverages cache locality and avoids
a potentially unnecessary migration.
- If the `prev_cpu` is considered busy or overutilized, the scheduler
falls back to the existing behavior of searching for an idle sibling.
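For reference, the overutilization test boils down to a capacity-margin check
on the CPU's PELT utilization. Below is a minimal userspace model of that
comparison (an illustrative sketch, not kernel code: it assumes the ~80%
margin of the kernel's fits_capacity() macro and ignores the uclamp handling
and EAS gating that the real cpu_overutilized() applies):

	#include <stdbool.h>
	#include <stdio.h>

	/* Model of fits_capacity(): utilization must stay below ~80% of capacity. */
	static bool fits_capacity(unsigned long util, unsigned long capacity)
	{
		return util * 1280 < capacity * 1024;
	}

	/* Simplified stand-in for cpu_overutilized(): true once util no longer fits. */
	static bool cpu_overutilized_model(unsigned long util, unsigned long capacity)
	{
		return !fits_capacity(util, capacity);
	}

	int main(void)
	{
		/* Hypothetical PELT utilization values against a capacity of 1024. */
		printf("util=600: overutilized=%d\n", cpu_overutilized_model(600, 1024));
		printf("util=900: overutilized=%d\n", cpu_overutilized_model(900, 1024));
		return 0;
	}

With these assumptions, a prev_cpu sitting at util 600 would be kept, while one
at util 900 would fall back to the idle-sibling search.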
Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
---
This patch optimizes the scheduler's wakeup path to prioritize cache
locality by keeping a task on its previous CPU if it is not overutilized,
falling back to a sibling search only when necessary.
---
kernel/sched/fair.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc0b7ce8a65d6bbe616953f530f7a02bb619537c..bb0d28d7d9872642cb5a4076caeb3ac9d8fe7bcd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8618,7 +8618,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
} else if (wake_flags & WF_TTWU) { /* XXX always ? */
/* Fast path */
- new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+
+ /*
+ * Avoid wakeup on an overutilized CPU.
+ * If the previous CPU is not overloaded, retain the same for cache locality.
+ * Otherwise, search for an idle sibling.
+ */
+ if (!cpu_overutilized(prev_cpu))
+ new_cpu = prev_cpu;
+ else
+ new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
}
rcu_read_unlock();
---
base-commit: 9b332cece987ee1790b2ed4c989e28162fa47860
change-id: 20251017-b4-sched-cfs-refactor-propagate-2c4a820998a4
Best regards,
--
Shubhang Kaushik <shubhang@os.amperecomputing.com>
On 18.10.25 01:00, Shubhang Kaushik via B4 Relay wrote:
> From: Shubhang Kaushik <shubhang@os.amperecomputing.com>
>
> Modify the wakeup path in `select_task_rq_fair()` to prioritize cache
> locality for waking tasks. The previous fast path always attempted to
> find an idle sibling, even if the task's prev CPU was not truly busy.
>
> The original problem was that under some circumstances, this could lead
> to unnecessary task migrations away from a cache-hot core, even when
> the task's prev CPU was a suitable candidate. The scheduler's internal
> mechanism `cpu_overutilized()` provides an evaluation of CPU load.
>
> To address this, the wakeup heuristic is updated to check the status of
> the task's `prev_cpu` first:
> - If the `prev_cpu` is not overutilized (as determined by
> `cpu_overutilized()`, via PELT), the task is woken up on
> its previous CPU. This leverages cache locality and avoids
> a potentially unnecessary migration.
> - If the `prev_cpu` is considered busy or overutilized, the scheduler
> falls back to the existing behavior of searching for an idle sibling.
What does your sched domain topology look like? How many CPUs do you have
in your MC domain?
>
> Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
> ---
> This patch optimizes the scheduler's wakeup path to prioritize cache
> locality by keeping a task on its previous CPU if it is not overutilized,
> falling back to a sibling search only when necessary.
> ---
> kernel/sched/fair.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bc0b7ce8a65d6bbe616953f530f7a02bb619537c..bb0d28d7d9872642cb5a4076caeb3ac9d8fe7bcd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8618,7 +8618,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
> } else if (wake_flags & WF_TTWU) { /* XXX always ? */
> /* Fast path */
> - new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> +
> + /*
> + * Avoid wakeup on an overutilized CPU.
> + * If the previous CPU is not overloaded, retain the same for cache locality.
> + * Otherwise, search for an idle sibling.
> + */
> + if (!cpu_overutilized(prev_cpu))
> + new_cpu = prev_cpu;
IMHO, special conditions like this one are normally coded at the
beginning of select_idle_sibling().
[...]
The system is an 80-core Ampere Altra with a two-level
sched domain topology. The MC domain contains all 80 cores.
I agree that placing the condition earlier in `select_idle_sibling()`
aligns better with convention. I will move the check (EAS-aware) to the
top of the function and submit a v2 patch.
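A rough sketch of that placement (illustrative only, not the actual v2:
the sched_energy_enabled() guard and the affinity check are assumptions,
and the rest of the function is elided):

	static int select_idle_sibling(struct task_struct *p, int prev, int target)
	{
		/*
		 * New: keep the waking task on prev for cache locality when prev
		 * is not overutilized; otherwise fall through to the usual search.
		 */
		if (sched_energy_enabled() && !cpu_overutilized(prev) &&
		    cpumask_test_cpu(prev, p->cpus_ptr))
			return prev;

		/* ... existing idle-core/idle-CPU search and return of target ... */
	}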
Best,
Shubhang Kaushik
On Thu, 30 Oct 2025, Dietmar Eggemann wrote:
> On 18.10.25 01:00, Shubhang Kaushik via B4 Relay wrote:
>> From: Shubhang Kaushik <shubhang@os.amperecomputing.com>
>>
>> Modify the wakeup path in `select_task_rq_fair()` to prioritize cache
>> locality for waking tasks. The previous fast path always attempted to
>> find an idle sibling, even if the task's prev CPU was not truly busy.
>>
>> The original problem was that under some circumstances, this could lead
>> to unnecessary task migrations away from a cache-hot core, even when
>> the task's prev CPU was a suitable candidate. The scheduler's internal
>> mechanism `cpu_overutilized()` provides an evaluation of CPU load.
>>
>> To address this, the wakeup heuristic is updated to check the status of
>> the task's `prev_cpu` first:
>> - If the `prev_cpu` is not overutilized (as determined by
>> `cpu_overutilized()`, via PELT), the task is woken up on
>> its previous CPU. This leverages cache locality and avoids
>> a potentially unnecessary migration.
>> - If the `prev_cpu` is considered busy or overutilized, the scheduler
>> falls back to the existing behavior of searching for an idle sibling.
>
> What does your sched domain topology look like? How many CPUs do you have
> in your MC domain?
>
>>
>> Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
>> ---
>> This patch optimizes the scheduler's wakeup path to prioritize cache
>> locality by keeping a task on its previous CPU if it is not overutilized,
>> falling back to a sibling search only when necessary.
>> ---
>> kernel/sched/fair.c | 11 ++++++++++-
>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index bc0b7ce8a65d6bbe616953f530f7a02bb619537c..bb0d28d7d9872642cb5a4076caeb3ac9d8fe7bcd 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8618,7 +8618,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>> new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
>> } else if (wake_flags & WF_TTWU) { /* XXX always ? */
>> /* Fast path */
>> - new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
>> +
>> + /*
>> + * Avoid wakeup on an overutilized CPU.
>> + * If the previous CPU is not overloaded, retain the same for cache locality.
>> + * Otherwise, search for an idle sibling.
>> + */
>> + if (!cpu_overutilized(prev_cpu))
>> + new_cpu = prev_cpu;
>
> IMHO, special conditions like this one are normally coded at the
> beginning of select_idle_sibling().
>
> [...]
>
On 30.10.25 17:35, Shubhang wrote:
> The system is an 80-core Ampere Altra with a two-level
> sched domain topology. The MC domain contains all 80 cores.

Ah OK. So I assume the other SD is CLS with 2 CPUs? Does this mean you
guys have recently changed the sched topology on this thing? I still
remember setups with 2 CPUs in MC and 80 CPUs in PKG.

If this is the case, is:

  db1e59483dfd - topology: make core_mask include at least cluster_siblings (2022-04-20 Darren Hart)

still needed?

> I agree that placing the condition earlier in `select_idle_sibling()`
> aligns better with convention. I will move the check (EAS-aware) to the
> top of the function and submit a v2 patch.

I can't imagine that you run EAS on this machine? It needs heterogeneous
CPUs, which you shouldn't have. Looks like Christian L. was already asking
you about this on your v2.

[...]
Hello,
we just reported a "76.8% improvement of stress-ng.tee.ops_per_sec" in
https://lore.kernel.org/all/202510281543.28d76c2-lkp@intel.com/
now we captured a regression. FYI.
kernel test robot noticed a 8.5% regression of stress-ng.io-uring.ops_per_sec on:
commit: 24efd1bf8a44f0f51f42f4af4ce22f21e873073d ("[PATCH] sched/fair: Prefer cache-hot prev_cpu for wakeup")
url: https://github.com/intel-lab-lkp/linux/commits/Shubhang-Kaushik-via-B4-Relay/sched-fair-Prefer-cache-hot-prev_cpu-for-wakeup/20251018-092110
patch link: https://lore.kernel.org/all/20251017-b4-sched-cfs-refactor-propagate-v1-1-1eb0dc5b19b3@os.amperecomputing.com/
patch subject: [PATCH] sched/fair: Prefer cache-hot prev_cpu for wakeup
testcase: stress-ng
config: x86_64-rhel-9.4
compiler: gcc-14
test machine: 256 threads 2 sockets Intel(R) Xeon(R) 6768P CPU @ 2.4GHz (Granite Rapids) with 64G memory
parameters:
nr_threads: 100%
testtime: 60s
test: io-uring
cpufreq_governor: performance
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202510291148.b2988254-lkp@intel.com
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20251029/202510291148.b2988254-lkp@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
gcc-14/performance/x86_64-rhel-9.4/100%/debian-13-x86_64-20250902.cgz/lkp-gnr-2sp4/io-uring/stress-ng/60s
commit:
9b332cece9 ("Merge tag 'nfsd-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux")
24efd1bf8a ("sched/fair: Prefer cache-hot prev_cpu for wakeup")
9b332cece987ee17 24efd1bf8a44f0f51f42f4af4ce
---------------- ---------------------------
%stddev %change %stddev
\ | \
3.58e+09 +17.6% 4.21e+09 cpuidle..time
9.276e+08 -35.8% 5.958e+08 ± 2% cpuidle..usage
48009670 -12.7% 41899608 ± 4% numa-numastat.node0.local_node
48122238 -12.8% 41981276 ± 4% numa-numastat.node0.numa_hit
0.89 ± 44% +13.2 14.07 ± 3% turbostat.C1E%
0.67 ± 44% +381.0% 3.22 turbostat.CPU%c1
1.375e+08 ± 44% +199.4% 4.116e+08 turbostat.IRQ
4.70 ± 44% +224.5% 15.25 turbostat.RAMWatt
210.17 ± 77% +1158.0% 2643 perf-c2c.DRAM.local
1725 ± 11% +10694.5% 186276 ± 3% perf-c2c.DRAM.remote
320853 -50.4% 159203 ± 4% perf-c2c.HITM.local
1320 ± 13% +9462.5% 126256 ± 3% perf-c2c.HITM.remote
322174 -11.4% 285460 perf-c2c.HITM.total
14.00 ± 4% -2.1 11.92 ± 4% mpstat.cpu.all.idle%
13.31 +5.2 18.56 mpstat.cpu.all.iowait%
1.48 +4.9 6.39 ± 3% mpstat.cpu.all.irq%
0.85 -0.2 0.68 mpstat.cpu.all.soft%
3.51 -2.1 1.40 ± 5% mpstat.cpu.all.usr%
18.17 ± 4% +12.8% 20.50 ± 4% mpstat.max_utilization.seconds
12518136 -40.6% 7432802 ± 5% meminfo.Active
12518120 -40.6% 7432786 ± 5% meminfo.Active(anon)
14791509 -34.2% 9726112 ± 4% meminfo.Cached
17016588 -29.8% 11943542 ± 3% meminfo.Committed_AS
19860760 -19.5% 15994452 ± 2% meminfo.Memused
11109813 -45.8% 6019207 ± 6% meminfo.Shmem
19916177 -19.5% 16031079 ± 2% meminfo.max_used_kB
104776 ± 14% -24.3% 79337 ± 21% numa-meminfo.node0.KReclaimable
104776 ± 14% -24.3% 79337 ± 21% numa-meminfo.node0.SReclaimable
11913809 -42.7% 6821430 ± 5% numa-meminfo.node1.Active
11913804 -42.7% 6821421 ± 5% numa-meminfo.node1.Active(anon)
11336225 ± 2% -30.4% 7891392 ± 23% numa-meminfo.node1.FilePages
19000428 +14.7% 21787417 ± 8% numa-meminfo.node1.MemFree
11104229 -45.9% 6012466 ± 6% numa-meminfo.node1.Shmem
1.125e+09 -8.4% 1.03e+09 ± 3% stress-ng.io-uring.ops
18779554 -8.5% 17185210 ± 3% stress-ng.io-uring.ops_per_sec
2.353e+08 +58.7% 3.735e+08 ± 3% stress-ng.time.involuntary_context_switches
16880 -11.1% 15008 stress-ng.time.percent_of_cpu_this_job_got
9702 -8.5% 8878 stress-ng.time.system_time
443.21 -67.8% 142.54 ± 2% stress-ng.time.user_time
1.362e+09 -11.5% 1.206e+09 ± 3% stress-ng.time.voluntary_context_switches
26194 ± 14% -24.3% 19834 ± 21% numa-vmstat.node0.nr_slab_reclaimable
48122182 -12.8% 41981349 ± 4% numa-vmstat.node0.numa_hit
48009614 -12.7% 41899680 ± 4% numa-vmstat.node0.numa_local
2981009 -42.7% 1707865 ± 5% numa-vmstat.node1.nr_active_anon
2836469 ± 2% -30.4% 1975086 ± 23% numa-vmstat.node1.nr_file_pages
4747481 +14.7% 5444494 ± 8% numa-vmstat.node1.nr_free_pages
4714110 +14.8% 5411734 ± 8% numa-vmstat.node1.nr_free_pages_blocks
2778450 -45.8% 1505400 ± 6% numa-vmstat.node1.nr_shmem
2981003 -42.7% 1707858 ± 5% numa-vmstat.node1.nr_zone_active_anon
3131938 -40.6% 1860663 ± 5% proc-vmstat.nr_active_anon
1133648 +8.5% 1230219 proc-vmstat.nr_dirty_background_threshold
2270069 +8.5% 2463447 proc-vmstat.nr_dirty_threshold
3700155 -34.2% 2433658 ± 4% proc-vmstat.nr_file_pages
11441308 +8.5% 12408439 proc-vmstat.nr_free_pages
11335855 +8.6% 12314183 proc-vmstat.nr_free_pages_blocks
2779743 -45.8% 1507064 ± 6% proc-vmstat.nr_shmem
50620 -5.9% 47611 proc-vmstat.nr_slab_reclaimable
3131938 -40.6% 1860663 ± 5% proc-vmstat.nr_zone_active_anon
99148879 -9.8% 89432077 ± 3% proc-vmstat.numa_hit
98893495 -9.8% 89168637 ± 3% proc-vmstat.numa_local
54203 ± 24% -57.3% 23128 ± 10% proc-vmstat.numa_pages_migrated
99397243 -9.8% 89638031 ± 3% proc-vmstat.pgalloc_normal
94583491 -8.4% 86624034 ± 3% proc-vmstat.pgfree
54203 ± 24% -57.3% 23128 ± 10% proc-vmstat.pgmigrate_success
39031 +1.8% 39717 proc-vmstat.pgreuse
47196381 -31.6% 32305642 ± 2% proc-vmstat.unevictable_pgs_culled
0.08 ± 2% +3468.8% 2.97 ± 4% perf-stat.i.MPKI
7.499e+10 -21.9% 5.854e+10 ± 2% perf-stat.i.branch-instructions
0.94 -0.3 0.62 perf-stat.i.branch-miss-rate%
6.557e+08 -48.5% 3.38e+08 ± 2% perf-stat.i.branch-misses
0.70 ± 2% +36.7 37.40 ± 4% perf-stat.i.cache-miss-rate%
29724413 ± 2% +2544.3% 7.86e+08 perf-stat.i.cache-misses
5.32e+09 -60.5% 2.103e+09 ± 3% perf-stat.i.cache-references
42032996 -14.0% 36140436 ± 2% perf-stat.i.context-switches
2.29 +18.0% 2.70 ± 2% perf-stat.i.cpi
7.916e+11 -9.6% 7.154e+11 perf-stat.i.cpu-cycles
11062415 -99.9% 15481 perf-stat.i.cpu-migrations
44096 ± 5% -97.9% 910.19 perf-stat.i.cycles-between-cache-misses
3.698e+11 -22.9% 2.852e+11 ± 2% perf-stat.i.instructions
0.46 -14.9% 0.40 ± 2% perf-stat.i.ipc
0.05 ± 47% +96.1% 0.09 ± 14% perf-stat.i.major-faults
207.41 -31.9% 141.15 ± 2% perf-stat.i.metric.K/sec
0.08 ± 2% +3331.9% 2.76 ± 4% perf-stat.overall.MPKI
0.87 -0.3 0.58 perf-stat.overall.branch-miss-rate%
0.56 ± 2% +36.9 37.43 ± 4% perf-stat.overall.cache-miss-rate%
2.14 +17.3% 2.51 ± 2% perf-stat.overall.cpi
26647 ± 2% -96.6% 910.56 perf-stat.overall.cycles-between-cache-misses
0.47 -14.7% 0.40 ± 2% perf-stat.overall.ipc
7.375e+10 -21.9% 5.757e+10 ± 2% perf-stat.ps.branch-instructions
6.449e+08 -48.4% 3.325e+08 ± 2% perf-stat.ps.branch-misses
29243806 ± 2% +2543.5% 7.731e+08 perf-stat.ps.cache-misses
5.233e+09 -60.5% 2.068e+09 ± 3% perf-stat.ps.cache-references
41341425 -14.0% 35549572 ± 2% perf-stat.ps.context-switches
7.786e+11 -9.6% 7.037e+11 perf-stat.ps.cpu-cycles
10881167 -99.9% 15227 perf-stat.ps.cpu-migrations
3.637e+11 -22.9% 2.805e+11 ± 2% perf-stat.ps.instructions
0.05 ± 47% +93.6% 0.09 ± 14% perf-stat.ps.major-faults
2.217e+13 -22.7% 1.713e+13 ± 2% perf-stat.total.instructions
4219859 -17.8% 3469357 sched_debug.cfs_rq:/.avg_vruntime.avg
7247589 ± 9% -38.3% 4469027 ± 7% sched_debug.cfs_rq:/.avg_vruntime.max
4013259 -29.0% 2849620 ± 17% sched_debug.cfs_rq:/.avg_vruntime.min
265810 ± 14% -54.9% 119970 ± 11% sched_debug.cfs_rq:/.avg_vruntime.stddev
3.42 ± 10% -24.4% 2.58 ± 7% sched_debug.cfs_rq:/.h_nr_queued.max
3.33 ± 11% -22.5% 2.58 ± 7% sched_debug.cfs_rq:/.h_nr_runnable.max
4401036 -17.1% 3647494 ± 4% sched_debug.cfs_rq:/.left_deadline.max
1274751 ± 5% -18.7% 1035958 ± 12% sched_debug.cfs_rq:/.left_deadline.stddev
4400687 -17.1% 3647059 ± 4% sched_debug.cfs_rq:/.left_vruntime.max
1274640 ± 5% -18.7% 1035848 ± 12% sched_debug.cfs_rq:/.left_vruntime.stddev
4219859 -17.8% 3469357 sched_debug.cfs_rq:/.min_vruntime.avg
7247589 ± 9% -38.3% 4469027 ± 7% sched_debug.cfs_rq:/.min_vruntime.max
4013259 -29.0% 2849620 ± 17% sched_debug.cfs_rq:/.min_vruntime.min
265810 ± 14% -54.9% 119970 ± 11% sched_debug.cfs_rq:/.min_vruntime.stddev
4400687 -17.1% 3647059 ± 4% sched_debug.cfs_rq:/.right_vruntime.max
1274640 ± 5% -18.7% 1035848 ± 12% sched_debug.cfs_rq:/.right_vruntime.stddev
532.33 -11.4% 471.62 ± 2% sched_debug.cfs_rq:/.runnable_avg.avg
1361 ± 3% +18.4% 1611 ± 10% sched_debug.cfs_rq:/.runnable_avg.max
203.24 ± 4% +38.0% 280.47 ± 3% sched_debug.cfs_rq:/.runnable_avg.stddev
108.79 ± 5% +68.6% 183.41 ± 4% sched_debug.cfs_rq:/.util_avg.stddev
99.93 ± 8% +144.8% 244.58 ± 4% sched_debug.cfs_rq:/.util_est.avg
154.15 ± 10% +41.9% 218.69 ± 5% sched_debug.cfs_rq:/.util_est.stddev
585777 ± 3% +55.0% 907718 ± 6% sched_debug.cpu.avg_idle.avg
257569 ± 15% +30.0% 334947 ± 11% sched_debug.cpu.avg_idle.stddev
581651 ± 2% +97.0% 1146052 ± 3% sched_debug.cpu.max_idle_balance_cost.avg
1334820 ± 4% +10.9% 1479741 sched_debug.cpu.max_idle_balance_cost.max
150290 ± 9% +34.9% 202732 ± 7% sched_debug.cpu.max_idle_balance_cost.stddev
3.42 ± 10% -24.4% 2.58 ± 13% sched_debug.cpu.nr_running.max
4900954 -14.0% 4212806 ± 2% sched_debug.cpu.nr_switches.avg
1872618 ± 12% +57.1% 2941530 ± 17% sched_debug.cpu.nr_switches.min
-24.25 -67.7% -7.83 sched_debug.cpu.nr_uninterruptible.min
8.41 ± 12% -46.3% 4.52 ± 14% sched_debug.cpu.nr_uninterruptible.stddev
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hello,
kernel test robot noticed a 76.8% improvement of stress-ng.tee.ops_per_sec on:
commit: 24efd1bf8a44f0f51f42f4af4ce22f21e873073d ("[PATCH] sched/fair: Prefer cache-hot prev_cpu for wakeup")
url: https://github.com/intel-lab-lkp/linux/commits/Shubhang-Kaushik-via-B4-Relay/sched-fair-Prefer-cache-hot-prev_cpu-for-wakeup/20251018-092110
patch link: https://lore.kernel.org/all/20251017-b4-sched-cfs-refactor-propagate-v1-1-1eb0dc5b19b3@os.amperecomputing.com/
patch subject: [PATCH] sched/fair: Prefer cache-hot prev_cpu for wakeup
testcase: stress-ng
config: x86_64-rhel-9.4
compiler: gcc-14
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory
parameters:
nr_threads: 100%
testtime: 60s
test: tee
cpufreq_governor: performance
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20251028/202510281543.28d76c2-lkp@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
gcc-14/performance/x86_64-rhel-9.4/100%/debian-13-x86_64-20250902.cgz/lkp-spr-2sp4/tee/stress-ng/60s
commit:
9b332cece9 ("Merge tag 'nfsd-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux")
24efd1bf8a ("sched/fair: Prefer cache-hot prev_cpu for wakeup")
9b332cece987ee17 24efd1bf8a44f0f51f42f4af4ce
---------------- ---------------------------
%stddev %change %stddev
\ | \
12097 ± 3% +10.9% 13413 ± 2% uptime.idle
3.662e+08 ± 7% +382.7% 1.768e+09 cpuidle..time
5056131 ± 56% +426.8% 26635997 ± 3% cpuidle..usage
13144587 ± 11% +21.1% 15921410 meminfo.Memused
13326158 ± 11% +20.6% 16067699 meminfo.max_used_kB
58707455 -16.5% 49043102 ± 9% numa-numastat.node1.local_node
58841583 -16.4% 49176968 ± 9% numa-numastat.node1.numa_hit
58770618 -16.3% 49175467 ± 9% numa-vmstat.node1.numa_hit
58636509 -16.4% 49041602 ± 9% numa-vmstat.node1.numa_local
2184 ± 9% +2157.3% 49310 ± 3% perf-c2c.DRAM.remote
3115 ± 11% +1689.3% 55737 ± 3% perf-c2c.HITM.local
1193 ± 13% +2628.6% 32575 ± 3% perf-c2c.HITM.remote
4308 ± 10% +1949.6% 88312 perf-c2c.HITM.total
1.95 ± 6% +10.4 12.34 mpstat.cpu.all.idle%
0.50 ± 3% +1.0 1.53 mpstat.cpu.all.irq%
0.02 ± 6% +0.1 0.09 ± 5% mpstat.cpu.all.soft%
74.24 -7.0 67.21 mpstat.cpu.all.sys%
23.29 -4.5 18.83 mpstat.cpu.all.usr%
232818 ± 35% -18.3% 190138 proc-vmstat.nr_anon_pages
124104 -1.1% 122691 proc-vmstat.nr_slab_unreclaimable
1.167e+08 -15.0% 99106005 proc-vmstat.numa_hit
1.164e+08 -15.1% 98853060 proc-vmstat.numa_local
1.168e+08 -15.2% 99060661 proc-vmstat.pgalloc_normal
1.147e+08 -15.7% 96704739 proc-vmstat.pgfree
1.071e+08 ± 2% +76.8% 1.894e+08 ± 2% stress-ng.tee.ops
1786177 ± 2% +76.8% 3157701 ± 2% stress-ng.tee.ops_per_sec
1.044e+08 -49.4% 52882701 stress-ng.time.involuntary_context_switches
21972 -12.1% 19317 stress-ng.time.percent_of_cpu_this_job_got
10131 -9.6% 9155 stress-ng.time.system_time
3070 -20.2% 2450 stress-ng.time.user_time
1.512e+08 -37.9% 93853736 stress-ng.time.voluntary_context_switches
2816 -10.5% 2519 turbostat.Avg_MHz
97.12 -9.8 87.30 turbostat.Busy%
0.11 ± 52% +0.5 0.66 ± 5% turbostat.C1%
0.40 ± 11% +8.4 8.78 turbostat.C1E%
2.39 ± 3% +1.0 3.42 ± 2% turbostat.C6%
1.08 ± 9% +168.3% 2.90 ± 3% turbostat.CPU%c1
32638444 +167.8% 87395049 turbostat.IRQ
110.56 +14.6 125.14 ± 4% turbostat.PKG_%
23.05 +32.8% 30.62 turbostat.RAMWatt
7559994 -21.3% 5948968 sched_debug.cfs_rq:/.avg_vruntime.avg
11028968 ± 13% -38.2% 6818572 ± 4% sched_debug.cfs_rq:/.avg_vruntime.max
0.34 ± 13% +104.0% 0.69 ± 3% sched_debug.cfs_rq:/.h_nr_queued.stddev
0.38 ± 8% +75.2% 0.67 ± 3% sched_debug.cfs_rq:/.h_nr_runnable.stddev
20.67 ± 33% +3672.8% 779.66 ± 73% sched_debug.cfs_rq:/.load_avg.avg
519.67 +7141.5% 37631 ± 10% sched_debug.cfs_rq:/.load_avg.max
86.71 ± 22% +5134.0% 4538 ± 39% sched_debug.cfs_rq:/.load_avg.stddev
7559994 -21.3% 5948968 sched_debug.cfs_rq:/.min_vruntime.avg
11028968 ± 13% -38.2% 6818572 ± 4% sched_debug.cfs_rq:/.min_vruntime.max
0.12 ± 17% +117.1% 0.27 ± 3% sched_debug.cfs_rq:/.nr_queued.stddev
809.69 ± 2% +15.6% 936.26 sched_debug.cfs_rq:/.runnable_avg.avg
2093 ± 3% +18.5% 2480 ± 8% sched_debug.cfs_rq:/.runnable_avg.max
259.47 ± 18% +71.8% 445.79 ± 3% sched_debug.cfs_rq:/.runnable_avg.stddev
576.64 -10.6% 515.40 sched_debug.cfs_rq:/.util_avg.avg
137.33 ± 12% +85.3% 254.45 ± 2% sched_debug.cfs_rq:/.util_avg.stddev
609.44 +15.6% 704.34 ± 3% sched_debug.cfs_rq:/.util_est.avg
1839 ± 11% +23.7% 2274 ± 7% sched_debug.cfs_rq:/.util_est.max
245.27 ± 7% +82.4% 447.29 ± 4% sched_debug.cfs_rq:/.util_est.stddev
702831 ± 5% +19.6% 840863 ± 3% sched_debug.cpu.avg_idle.avg
378668 ± 14% +32.2% 500458 ± 6% sched_debug.cpu.avg_idle.stddev
44.33 ± 22% -62.7% 16.52 ± 13% sched_debug.cpu.clock.stddev
909.29 ± 12% +160.6% 2369 ± 4% sched_debug.cpu.curr->pid.stddev
639355 ± 5% +80.6% 1154626 sched_debug.cpu.max_idle_balance_cost.avg
500000 +57.3% 786555 ± 11% sched_debug.cpu.max_idle_balance_cost.min
0.00 ± 20% -47.2% 0.00 ± 18% sched_debug.cpu.next_balance.stddev
0.32 ± 14% +111.2% 0.68 ± 3% sched_debug.cpu.nr_running.stddev
574871 -33.2% 383811 sched_debug.cpu.nr_switches.avg
788985 ± 11% -32.4% 533309 ± 6% sched_debug.cpu.nr_switches.max
0.04 ± 19% +1073.1% 0.50 perf-stat.i.MPKI
1.443e+11 -17.9% 1.184e+11 perf-stat.i.branch-instructions
0.08 ± 3% +0.0 0.12 perf-stat.i.branch-miss-rate%
1.049e+08 ± 2% +23.6% 1.296e+08 perf-stat.i.branch-misses
31.56 ± 11% +16.6 48.19 perf-stat.i.cache-miss-rate%
25936672 ± 22% +1080.1% 3.061e+08 perf-stat.i.cache-misses
77849475 ± 13% +714.6% 6.342e+08 perf-stat.i.cache-references
4288231 -33.1% 2868755 perf-stat.i.context-switches
0.85 +9.6% 0.94 perf-stat.i.cpi
6.387e+11 -10.2% 5.735e+11 perf-stat.i.cpu-cycles
2828 ± 24% +596.1% 19688 perf-stat.i.cpu-migrations
32456 ± 26% -94.2% 1870 perf-stat.i.cycles-between-cache-misses
7.486e+11 -18.2% 6.125e+11 perf-stat.i.instructions
1.17 -8.8% 1.07 perf-stat.i.ipc
19.17 -33.2% 12.81 perf-stat.i.metric.K/sec
0.03 ± 22% +1341.1% 0.50 perf-stat.overall.MPKI
0.07 ± 3% +0.0 0.11 perf-stat.overall.branch-miss-rate%
33.00 ± 10% +15.2 48.24 perf-stat.overall.cache-miss-rate%
0.85 +9.7% 0.94 perf-stat.overall.cpi
25848 ± 21% -92.7% 1874 perf-stat.overall.cycles-between-cache-misses
1.17 -8.9% 1.07 perf-stat.overall.ipc
1.419e+11 -18.1% 1.162e+11 perf-stat.ps.branch-instructions
1.028e+08 ± 2% +23.3% 1.268e+08 perf-stat.ps.branch-misses
25499974 ± 22% +1077.5% 3.003e+08 perf-stat.ps.cache-misses
76519245 ± 13% +713.3% 6.224e+08 perf-stat.ps.cache-references
4214394 -33.2% 2815077 perf-stat.ps.context-switches
6.278e+11 -10.4% 5.627e+11 perf-stat.ps.cpu-cycles
2763 ± 24% +598.5% 19305 perf-stat.ps.cpu-migrations
7.358e+11 -18.3% 6.009e+11 perf-stat.ps.instructions
4.489e+13 -18.3% 3.668e+13 perf-stat.total.instructions
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Sat, 18 Oct 2025 at 01:01, Shubhang Kaushik via B4 Relay
<devnull+shubhang.os.amperecomputing.com@kernel.org> wrote:
>
> From: Shubhang Kaushik <shubhang@os.amperecomputing.com>
>
> Modify the wakeup path in `select_task_rq_fair()` to prioritize cache
> locality for waking tasks. The previous fast path always attempted to
> find an idle sibling, even if the task's prev CPU was not truly busy.
>
> The original problem was that under some circumstances, this could lead
> to unnecessary task migrations away from a cache-hot core, even when
> the task's prev CPU was a suitable candidate. The scheduler's internal
> mechanism `cpu_overutilized()` provides an evaluation of CPU load.
>
> To address this, the wakeup heuristic is updated to check the status of
> the task's `prev_cpu` first:
> - If the `prev_cpu` is not overutilized (as determined by
> `cpu_overutilized()`, via PELT), the task is woken up on
> its previous CPU. This leverages cache locality and avoids
> a potentially unnecessary migration.
> - If the `prev_cpu` is considered busy or overutilized, the scheduler
> falls back to the existing behavior of searching for an idle sibling.
>
> Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
> ---
> This patch optimizes the scheduler's wakeup path to prioritize cache
> locality by keeping a task on its previous CPU if it is not overutilized,
> falling back to a sibling search only when necessary.
> ---
> kernel/sched/fair.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bc0b7ce8a65d6bbe616953f530f7a02bb619537c..bb0d28d7d9872642cb5a4076caeb3ac9d8fe7bcd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8618,7 +8618,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
> } else if (wake_flags & WF_TTWU) { /* XXX always ? */
> /* Fast path */
> - new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> +
> + /*
> + * Avoid wakeup on an overutilized CPU.
> + * If the previous CPU is not overloaded, retain the same for cache locality.
> + * Otherwise, search for an idle sibling.
> + */
> + if (!cpu_overutilized(prev_cpu))
cpu_overutilized() returns false if (!sched_energy_enabled()), so
new_cpu is always prev_cpu on non-EAS-aware systems, which is probably
not what you want.
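Spelling out the consequence for the proposed fast path on a system where
EAS is not enabled (annotations only; the two branches are the ones added
by the patch above):

	if (!cpu_overutilized(prev_cpu))	/* always true when !sched_energy_enabled() */
		new_cpu = prev_cpu;		/* prev_cpu is always chosen */
	else
		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);	/* never reached */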
> + new_cpu = prev_cpu;
> + else
> + new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> }
> rcu_read_unlock();
>
>
> ---
> base-commit: 9b332cece987ee1790b2ed4c989e28162fa47860
> change-id: 20251017-b4-sched-cfs-refactor-propagate-2c4a820998a4
>
> Best regards,
> --
> Shubhang Kaushik <shubhang@os.amperecomputing.com>
>
>
Hi,
On Fri, Oct 17, 2025 at 04:00:44PM -0700 Shubhang Kaushik via B4 Relay wrote:
> From: Shubhang Kaushik <shubhang@os.amperecomputing.com>
>
> Modify the wakeup path in `select_task_rq_fair()` to prioritize cache
> locality for waking tasks. The previous fast path always attempted to
> find an idle sibling, even if the task's prev CPU was not truly busy.
>
> The original problem was that under some circumstances, this could lead
> to unnecessary task migrations away from a cache-hot core, even when
> the task's prev CPU was a suitable candidate. The scheduler's internal
> mechanism `cpu_overutilized()` provides an evaluation of CPU load.
>
> To address this, the wakeup heuristic is updated to check the status of
> the task's `prev_cpu` first:
> - If the `prev_cpu` is not overutilized (as determined by
> `cpu_overutilized()`, via PELT), the task is woken up on
> its previous CPU. This leverages cache locality and avoids
> a potentially unnecessary migration.
> - If the `prev_cpu` is considered busy or overutilized, the scheduler
> falls back to the existing behavior of searching for an idle sibling.
>
> Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
> ---
> This patch optimizes the scheduler's wakeup path to prioritize cache
> locality by keeping a task on its previous CPU if it is not overutilized,
> falling back to a sibling search only when necessary.
> ---
> kernel/sched/fair.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bc0b7ce8a65d6bbe616953f530f7a02bb619537c..bb0d28d7d9872642cb5a4076caeb3ac9d8fe7bcd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8618,7 +8618,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
> } else if (wake_flags & WF_TTWU) { /* XXX always ? */
> /* Fast path */
> - new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> +
> + /*
> + * Avoid wakeup on an overutilized CPU.
> + * If the previous CPU is not overloaded, retain the same for cache locality.
> + * Otherwise, search for an idle sibling.
> + */
> + if (!cpu_overutilized(prev_cpu))
> + new_cpu = prev_cpu;
> + else
> + new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
Won't this be checking whether the CPU is overutilized without the wakee? It
might well be overutilized once the wakee is placed there.
I suspect this will hurt some workloads. Do you have numbers to share?
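For illustration, one way the wakee's own utilization could be factored into
the decision (a sketch of the idea only: the helper names are fair.c
internals, and the naive sum below can double-count the wakee's contribution
still decaying on prev_cpu, which the kernel's util_est machinery handles
more carefully):

	/*
	 * Ask whether prev_cpu would still fit once the waking task's
	 * utilization lands there, rather than whether it is overutilized now.
	 */
	unsigned long util = cpu_util_cfs(prev_cpu) + task_util_est(p);

	if (fits_capacity(util, capacity_of(prev_cpu)))
		new_cpu = prev_cpu;
	else
		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);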
Cheers,
Phil
> }
> rcu_read_unlock();
>
>
> ---
> base-commit: 9b332cece987ee1790b2ed4c989e28162fa47860
> change-id: 20251017-b4-sched-cfs-refactor-propagate-2c4a820998a4
>
> Best regards,
> --
> Shubhang Kaushik <shubhang@os.amperecomputing.com>
>
>
>
--
© 2016 - 2026 Red Hat, Inc.