[PATCH] sched/fair: Prefer cache-hot prev_cpu for wakeup

Shubhang Kaushik via B4 Relay posted 1 patch 3 months, 3 weeks ago
From: Shubhang Kaushik <shubhang@os.amperecomputing.com>

Modify the wakeup path in `select_task_rq_fair()` to prioritize cache
locality for waking tasks. The previous fast path always attempted to
find an idle sibling, even if the task's prev CPU was not truly busy.

The original problem was that, under some circumstances, this could lead
to unnecessary task migrations away from a cache-hot core, even when the
task's prev CPU was a suitable candidate. The scheduler's internal
mechanism `cpu_overutilized()` provides an evaluation of CPU load.

To address this, the wakeup heuristic is updated to check the status of
the task's `prev_cpu` first:
- If the `prev_cpu` is not overutilized (as determined by
  `cpu_overutilized()`, via PELT), the task is woken up on
  its previous CPU. This leverages cache locality and avoids
  a potentially unnecessary migration.
- If the `prev_cpu` is considered busy or overutilized, the scheduler
  falls back to the existing behavior of searching for an idle sibling.

Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
---
This patch optimizes the scheduler's wakeup path to prioritize cache 
locality by keeping a task on its previous CPU if it is not overutilized,
falling back to a sibling search only when necessary.
---
 kernel/sched/fair.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc0b7ce8a65d6bbe616953f530f7a02bb619537c..bb0d28d7d9872642cb5a4076caeb3ac9d8fe7bcd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8618,7 +8618,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
 	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
 		/* Fast path */
-		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+
+		/*
+		 * Avoid wakeup on an overutilized CPU.
+		 * If the previous CPU is not overloaded, retain the same for cache locality.
+		 * Otherwise, search for an idle sibling.
+		 */
+		if (!cpu_overutilized(prev_cpu))
+			new_cpu = prev_cpu;
+		else
+			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 	}
 	rcu_read_unlock();
 

---
base-commit: 9b332cece987ee1790b2ed4c989e28162fa47860
change-id: 20251017-b4-sched-cfs-refactor-propagate-2c4a820998a4

Best regards,
-- 
Shubhang Kaushik <shubhang@os.amperecomputing.com>
Re: [PATCH] sched/fair: Prefer cache-hot prev_cpu for wakeup
Posted by Dietmar Eggemann 3 months, 1 week ago
On 18.10.25 01:00, Shubhang Kaushik via B4 Relay wrote:
> From: Shubhang Kaushik <shubhang@os.amperecomputing.com>
> 
> Modify the wakeup path in `select_task_rq_fair()` to prioritize cache
> locality for waking tasks. The previous fast path always attempted to
> find an idle sibling, even if the task's prev CPU was not truly busy.
> 
> The original problem was that under some circumstances, this could lead
> to unnecessary task migrations away from a cache-hot core, even when
> the task's prev CPU was a suitable candidate. The scheduler's internal
> mechanism `cpu_overutilized()` provide an evaluation of CPU load.
> 
> To address this, the wakeup heuristic is updated to check the status of
> the task's `prev_cpu` first:
> - If the `prev_cpu` is  not overutilized (as determined by
>   `cpu_overutilized()`, via PELT), the task is woken up on
>   its previous CPU. This leverages cache locality and avoids
>   a potentially unnecessary migration.
> - If the `prev_cpu` is considered busy or overutilized, the scheduler
>   falls back to the existing behavior of searching for an idle sibling.

What does your sched domain topology look like? How many CPUs do you have
in your MC domain?

> 
> Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
> ---
> This patch optimizes the scheduler's wakeup path to prioritize cache 
> locality by keeping a task on its previous CPU if it is not overutilized,
> falling back to a sibling search only when necessary.
> ---
>  kernel/sched/fair.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bc0b7ce8a65d6bbe616953f530f7a02bb619537c..bb0d28d7d9872642cb5a4076caeb3ac9d8fe7bcd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8618,7 +8618,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>  		new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
>  	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
>  		/* Fast path */
> -		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> +
> +		/*
> +		 * Avoid wakeup on an overutilized CPU.
> +		 * If the previous CPU is not overloaded, retain the same for cache locality.
> +		 * Otherwise, search for an idle sibling.
> +		 */
> +		if (!cpu_overutilized(prev_cpu))
> +			new_cpu = prev_cpu;

IMHO, special conditions like this one are normally coded at the
beginning of select_idle_sibling().

[...]
Re: [PATCH] sched/fair: Prefer cache-hot prev_cpu for wakeup
Posted by Shubhang 3 months, 1 week ago
The system is an 80-core Ampere Altra with a two-level
sched domain topology. The MC domain contains all 80 cores.

I agree that placing the condition earlier in `select_idle_sibling()`
aligns better with convention. I will move the (EAS-aware) check to the
top of the function and submit a v2 patch.

Best,
Shubhang Kaushik

On Thu, 30 Oct 2025, Dietmar Eggemann wrote:

> On 18.10.25 01:00, Shubhang Kaushik via B4 Relay wrote:
>> From: Shubhang Kaushik <shubhang@os.amperecomputing.com>
>>
>> Modify the wakeup path in `select_task_rq_fair()` to prioritize cache
>> locality for waking tasks. The previous fast path always attempted to
>> find an idle sibling, even if the task's prev CPU was not truly busy.
>>
>> The original problem was that under some circumstances, this could lead
>> to unnecessary task migrations away from a cache-hot core, even when
>> the task's prev CPU was a suitable candidate. The scheduler's internal
>> mechanism `cpu_overutilized()` provide an evaluation of CPU load.
>>
>> To address this, the wakeup heuristic is updated to check the status of
>> the task's `prev_cpu` first:
>> - If the `prev_cpu` is  not overutilized (as determined by
>>   `cpu_overutilized()`, via PELT), the task is woken up on
>>   its previous CPU. This leverages cache locality and avoids
>>   a potentially unnecessary migration.
>> - If the `prev_cpu` is considered busy or overutilized, the scheduler
>>   falls back to the existing behavior of searching for an idle sibling.
>
> How does you sched domain topology look like? How many CPUs do you have
> in your MC domain?
>
>>
>> Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
>> ---
>> This patch optimizes the scheduler's wakeup path to prioritize cache
>> locality by keeping a task on its previous CPU if it is not overutilized,
>> falling back to a sibling search only when necessary.
>> ---
>>  kernel/sched/fair.c | 11 ++++++++++-
>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index bc0b7ce8a65d6bbe616953f530f7a02bb619537c..bb0d28d7d9872642cb5a4076caeb3ac9d8fe7bcd 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8618,7 +8618,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>>  		new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
>>  	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
>>  		/* Fast path */
>> -		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
>> +
>> +		/*
>> +		 * Avoid wakeup on an overutilized CPU.
>> +		 * If the previous CPU is not overloaded, retain the same for cache locality.
>> +		 * Otherwise, search for an idle sibling.
>> +		 */
>> +		if (!cpu_overutilized(prev_cpu))
>> +			new_cpu = prev_cpu;
>
> IMHO, special conditions like this one are normally coded at the
> beginning of select_idle_sibling().
>
> [...]
>
Re: [PATCH] sched/fair: Prefer cache-hot prev_cpu for wakeup
Posted by Dietmar Eggemann 3 months, 1 week ago
On 30.10.25 17:35, Shubhang wrote:
> The system is an 80 core Ampere Altra with a two-level
> sched domain topology. The MC domain contains all 80 cores.

Ah OK. So I assume the other SD is CLS with 2 CPUs?

Does this mean you guys have recently changed the sched topology on this
thing? I still remember setups with 2 CPUs in MC and 80 CPUs on PKG.

If so, is:

db1e59483dfd - topology: make core_mask include at least
cluster_siblings (2022-04-20 Darren Hart)

still needed in this case?

> I agree that placing the condition earlier in `select_idle_sibling()`
> aligns better with convention. I will move the check (EAS Aware) to the
> top of the function and submit a v2 patch.

I can't imagine that you run EAS on this machine; it needs heterogeneous
CPUs, which you shouldn't have. Looks like Christian L. was already asking
you about this on your v2.

[...]
Re: [PATCH] sched/fair: Prefer cache-hot prev_cpu for wakeup
Posted by kernel test robot 3 months, 1 week ago

Hello,

we just reported a "76.8% improvement of stress-ng.tee.ops_per_sec" in
https://lore.kernel.org/all/202510281543.28d76c2-lkp@intel.com/

now we have captured a regression. FYI.


kernel test robot noticed an 8.5% regression of stress-ng.io-uring.ops_per_sec on:


commit: 24efd1bf8a44f0f51f42f4af4ce22f21e873073d ("[PATCH] sched/fair: Prefer cache-hot prev_cpu for wakeup")
url: https://github.com/intel-lab-lkp/linux/commits/Shubhang-Kaushik-via-B4-Relay/sched-fair-Prefer-cache-hot-prev_cpu-for-wakeup/20251018-092110
patch link: https://lore.kernel.org/all/20251017-b4-sched-cfs-refactor-propagate-v1-1-1eb0dc5b19b3@os.amperecomputing.com/
patch subject: [PATCH] sched/fair: Prefer cache-hot prev_cpu for wakeup

testcase: stress-ng
config: x86_64-rhel-9.4
compiler: gcc-14
test machine: 256 threads 2 sockets Intel(R) Xeon(R) 6768P  CPU @ 2.4GHz (Granite Rapids) with 64G memory
parameters:

	nr_threads: 100%
	testtime: 60s
	test: io-uring
	cpufreq_governor: performance



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202510291148.b2988254-lkp@intel.com


Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20251029/202510291148.b2988254-lkp@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  gcc-14/performance/x86_64-rhel-9.4/100%/debian-13-x86_64-20250902.cgz/lkp-gnr-2sp4/io-uring/stress-ng/60s

commit: 
  9b332cece9 ("Merge tag 'nfsd-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux")
  24efd1bf8a ("sched/fair: Prefer cache-hot prev_cpu for wakeup")

9b332cece987ee17 24efd1bf8a44f0f51f42f4af4ce 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
  3.58e+09           +17.6%   4.21e+09        cpuidle..time
 9.276e+08           -35.8%  5.958e+08 ±  2%  cpuidle..usage
  48009670           -12.7%   41899608 ±  4%  numa-numastat.node0.local_node
  48122238           -12.8%   41981276 ±  4%  numa-numastat.node0.numa_hit
      0.89 ± 44%     +13.2       14.07 ±  3%  turbostat.C1E%
      0.67 ± 44%    +381.0%       3.22        turbostat.CPU%c1
 1.375e+08 ± 44%    +199.4%  4.116e+08        turbostat.IRQ
      4.70 ± 44%    +224.5%      15.25        turbostat.RAMWatt
    210.17 ± 77%   +1158.0%       2643        perf-c2c.DRAM.local
      1725 ± 11%  +10694.5%     186276 ±  3%  perf-c2c.DRAM.remote
    320853           -50.4%     159203 ±  4%  perf-c2c.HITM.local
      1320 ± 13%   +9462.5%     126256 ±  3%  perf-c2c.HITM.remote
    322174           -11.4%     285460        perf-c2c.HITM.total
     14.00 ±  4%      -2.1       11.92 ±  4%  mpstat.cpu.all.idle%
     13.31            +5.2       18.56        mpstat.cpu.all.iowait%
      1.48            +4.9        6.39 ±  3%  mpstat.cpu.all.irq%
      0.85            -0.2        0.68        mpstat.cpu.all.soft%
      3.51            -2.1        1.40 ±  5%  mpstat.cpu.all.usr%
     18.17 ±  4%     +12.8%      20.50 ±  4%  mpstat.max_utilization.seconds
  12518136           -40.6%    7432802 ±  5%  meminfo.Active
  12518120           -40.6%    7432786 ±  5%  meminfo.Active(anon)
  14791509           -34.2%    9726112 ±  4%  meminfo.Cached
  17016588           -29.8%   11943542 ±  3%  meminfo.Committed_AS
  19860760           -19.5%   15994452 ±  2%  meminfo.Memused
  11109813           -45.8%    6019207 ±  6%  meminfo.Shmem
  19916177           -19.5%   16031079 ±  2%  meminfo.max_used_kB
    104776 ± 14%     -24.3%      79337 ± 21%  numa-meminfo.node0.KReclaimable
    104776 ± 14%     -24.3%      79337 ± 21%  numa-meminfo.node0.SReclaimable
  11913809           -42.7%    6821430 ±  5%  numa-meminfo.node1.Active
  11913804           -42.7%    6821421 ±  5%  numa-meminfo.node1.Active(anon)
  11336225 ±  2%     -30.4%    7891392 ± 23%  numa-meminfo.node1.FilePages
  19000428           +14.7%   21787417 ±  8%  numa-meminfo.node1.MemFree
  11104229           -45.9%    6012466 ±  6%  numa-meminfo.node1.Shmem
 1.125e+09            -8.4%   1.03e+09 ±  3%  stress-ng.io-uring.ops
  18779554            -8.5%   17185210 ±  3%  stress-ng.io-uring.ops_per_sec
 2.353e+08           +58.7%  3.735e+08 ±  3%  stress-ng.time.involuntary_context_switches
     16880           -11.1%      15008        stress-ng.time.percent_of_cpu_this_job_got
      9702            -8.5%       8878        stress-ng.time.system_time
    443.21           -67.8%     142.54 ±  2%  stress-ng.time.user_time
 1.362e+09           -11.5%  1.206e+09 ±  3%  stress-ng.time.voluntary_context_switches
     26194 ± 14%     -24.3%      19834 ± 21%  numa-vmstat.node0.nr_slab_reclaimable
  48122182           -12.8%   41981349 ±  4%  numa-vmstat.node0.numa_hit
  48009614           -12.7%   41899680 ±  4%  numa-vmstat.node0.numa_local
   2981009           -42.7%    1707865 ±  5%  numa-vmstat.node1.nr_active_anon
   2836469 ±  2%     -30.4%    1975086 ± 23%  numa-vmstat.node1.nr_file_pages
   4747481           +14.7%    5444494 ±  8%  numa-vmstat.node1.nr_free_pages
   4714110           +14.8%    5411734 ±  8%  numa-vmstat.node1.nr_free_pages_blocks
   2778450           -45.8%    1505400 ±  6%  numa-vmstat.node1.nr_shmem
   2981003           -42.7%    1707858 ±  5%  numa-vmstat.node1.nr_zone_active_anon
   3131938           -40.6%    1860663 ±  5%  proc-vmstat.nr_active_anon
   1133648            +8.5%    1230219        proc-vmstat.nr_dirty_background_threshold
   2270069            +8.5%    2463447        proc-vmstat.nr_dirty_threshold
   3700155           -34.2%    2433658 ±  4%  proc-vmstat.nr_file_pages
  11441308            +8.5%   12408439        proc-vmstat.nr_free_pages
  11335855            +8.6%   12314183        proc-vmstat.nr_free_pages_blocks
   2779743           -45.8%    1507064 ±  6%  proc-vmstat.nr_shmem
     50620            -5.9%      47611        proc-vmstat.nr_slab_reclaimable
   3131938           -40.6%    1860663 ±  5%  proc-vmstat.nr_zone_active_anon
  99148879            -9.8%   89432077 ±  3%  proc-vmstat.numa_hit
  98893495            -9.8%   89168637 ±  3%  proc-vmstat.numa_local
     54203 ± 24%     -57.3%      23128 ± 10%  proc-vmstat.numa_pages_migrated
  99397243            -9.8%   89638031 ±  3%  proc-vmstat.pgalloc_normal
  94583491            -8.4%   86624034 ±  3%  proc-vmstat.pgfree
     54203 ± 24%     -57.3%      23128 ± 10%  proc-vmstat.pgmigrate_success
     39031            +1.8%      39717        proc-vmstat.pgreuse
  47196381           -31.6%   32305642 ±  2%  proc-vmstat.unevictable_pgs_culled
      0.08 ±  2%   +3468.8%       2.97 ±  4%  perf-stat.i.MPKI
 7.499e+10           -21.9%  5.854e+10 ±  2%  perf-stat.i.branch-instructions
      0.94            -0.3        0.62        perf-stat.i.branch-miss-rate%
 6.557e+08           -48.5%   3.38e+08 ±  2%  perf-stat.i.branch-misses
      0.70 ±  2%     +36.7       37.40 ±  4%  perf-stat.i.cache-miss-rate%
  29724413 ±  2%   +2544.3%   7.86e+08        perf-stat.i.cache-misses
  5.32e+09           -60.5%  2.103e+09 ±  3%  perf-stat.i.cache-references
  42032996           -14.0%   36140436 ±  2%  perf-stat.i.context-switches
      2.29           +18.0%       2.70 ±  2%  perf-stat.i.cpi
 7.916e+11            -9.6%  7.154e+11        perf-stat.i.cpu-cycles
  11062415           -99.9%      15481        perf-stat.i.cpu-migrations
     44096 ±  5%     -97.9%     910.19        perf-stat.i.cycles-between-cache-misses
 3.698e+11           -22.9%  2.852e+11 ±  2%  perf-stat.i.instructions
      0.46           -14.9%       0.40 ±  2%  perf-stat.i.ipc
      0.05 ± 47%     +96.1%       0.09 ± 14%  perf-stat.i.major-faults
    207.41           -31.9%     141.15 ±  2%  perf-stat.i.metric.K/sec
      0.08 ±  2%   +3331.9%       2.76 ±  4%  perf-stat.overall.MPKI
      0.87            -0.3        0.58        perf-stat.overall.branch-miss-rate%
      0.56 ±  2%     +36.9       37.43 ±  4%  perf-stat.overall.cache-miss-rate%
      2.14           +17.3%       2.51 ±  2%  perf-stat.overall.cpi
     26647 ±  2%     -96.6%     910.56        perf-stat.overall.cycles-between-cache-misses
      0.47           -14.7%       0.40 ±  2%  perf-stat.overall.ipc
 7.375e+10           -21.9%  5.757e+10 ±  2%  perf-stat.ps.branch-instructions
 6.449e+08           -48.4%  3.325e+08 ±  2%  perf-stat.ps.branch-misses
  29243806 ±  2%   +2543.5%  7.731e+08        perf-stat.ps.cache-misses
 5.233e+09           -60.5%  2.068e+09 ±  3%  perf-stat.ps.cache-references
  41341425           -14.0%   35549572 ±  2%  perf-stat.ps.context-switches
 7.786e+11            -9.6%  7.037e+11        perf-stat.ps.cpu-cycles
  10881167           -99.9%      15227        perf-stat.ps.cpu-migrations
 3.637e+11           -22.9%  2.805e+11 ±  2%  perf-stat.ps.instructions
      0.05 ± 47%     +93.6%       0.09 ± 14%  perf-stat.ps.major-faults
 2.217e+13           -22.7%  1.713e+13 ±  2%  perf-stat.total.instructions
   4219859           -17.8%    3469357        sched_debug.cfs_rq:/.avg_vruntime.avg
   7247589 ±  9%     -38.3%    4469027 ±  7%  sched_debug.cfs_rq:/.avg_vruntime.max
   4013259           -29.0%    2849620 ± 17%  sched_debug.cfs_rq:/.avg_vruntime.min
    265810 ± 14%     -54.9%     119970 ± 11%  sched_debug.cfs_rq:/.avg_vruntime.stddev
      3.42 ± 10%     -24.4%       2.58 ±  7%  sched_debug.cfs_rq:/.h_nr_queued.max
      3.33 ± 11%     -22.5%       2.58 ±  7%  sched_debug.cfs_rq:/.h_nr_runnable.max
   4401036           -17.1%    3647494 ±  4%  sched_debug.cfs_rq:/.left_deadline.max
   1274751 ±  5%     -18.7%    1035958 ± 12%  sched_debug.cfs_rq:/.left_deadline.stddev
   4400687           -17.1%    3647059 ±  4%  sched_debug.cfs_rq:/.left_vruntime.max
   1274640 ±  5%     -18.7%    1035848 ± 12%  sched_debug.cfs_rq:/.left_vruntime.stddev
   4219859           -17.8%    3469357        sched_debug.cfs_rq:/.min_vruntime.avg
   7247589 ±  9%     -38.3%    4469027 ±  7%  sched_debug.cfs_rq:/.min_vruntime.max
   4013259           -29.0%    2849620 ± 17%  sched_debug.cfs_rq:/.min_vruntime.min
    265810 ± 14%     -54.9%     119970 ± 11%  sched_debug.cfs_rq:/.min_vruntime.stddev
   4400687           -17.1%    3647059 ±  4%  sched_debug.cfs_rq:/.right_vruntime.max
   1274640 ±  5%     -18.7%    1035848 ± 12%  sched_debug.cfs_rq:/.right_vruntime.stddev
    532.33           -11.4%     471.62 ±  2%  sched_debug.cfs_rq:/.runnable_avg.avg
      1361 ±  3%     +18.4%       1611 ± 10%  sched_debug.cfs_rq:/.runnable_avg.max
    203.24 ±  4%     +38.0%     280.47 ±  3%  sched_debug.cfs_rq:/.runnable_avg.stddev
    108.79 ±  5%     +68.6%     183.41 ±  4%  sched_debug.cfs_rq:/.util_avg.stddev
     99.93 ±  8%    +144.8%     244.58 ±  4%  sched_debug.cfs_rq:/.util_est.avg
    154.15 ± 10%     +41.9%     218.69 ±  5%  sched_debug.cfs_rq:/.util_est.stddev
    585777 ±  3%     +55.0%     907718 ±  6%  sched_debug.cpu.avg_idle.avg
    257569 ± 15%     +30.0%     334947 ± 11%  sched_debug.cpu.avg_idle.stddev
    581651 ±  2%     +97.0%    1146052 ±  3%  sched_debug.cpu.max_idle_balance_cost.avg
   1334820 ±  4%     +10.9%    1479741        sched_debug.cpu.max_idle_balance_cost.max
    150290 ±  9%     +34.9%     202732 ±  7%  sched_debug.cpu.max_idle_balance_cost.stddev
      3.42 ± 10%     -24.4%       2.58 ± 13%  sched_debug.cpu.nr_running.max
   4900954           -14.0%    4212806 ±  2%  sched_debug.cpu.nr_switches.avg
   1872618 ± 12%     +57.1%    2941530 ± 17%  sched_debug.cpu.nr_switches.min
    -24.25           -67.7%      -7.83        sched_debug.cpu.nr_uninterruptible.min
      8.41 ± 12%     -46.3%       4.52 ± 14%  sched_debug.cpu.nr_uninterruptible.stddev



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [PATCH] sched/fair: Prefer cache-hot prev_cpu for wakeup
Posted by kernel test robot 3 months, 1 week ago

Hello,

kernel test robot noticed a 76.8% improvement of stress-ng.tee.ops_per_sec on:


commit: 24efd1bf8a44f0f51f42f4af4ce22f21e873073d ("[PATCH] sched/fair: Prefer cache-hot prev_cpu for wakeup")
url: https://github.com/intel-lab-lkp/linux/commits/Shubhang-Kaushik-via-B4-Relay/sched-fair-Prefer-cache-hot-prev_cpu-for-wakeup/20251018-092110
patch link: https://lore.kernel.org/all/20251017-b4-sched-cfs-refactor-propagate-v1-1-1eb0dc5b19b3@os.amperecomputing.com/
patch subject: [PATCH] sched/fair: Prefer cache-hot prev_cpu for wakeup

testcase: stress-ng
config: x86_64-rhel-9.4
compiler: gcc-14
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory
parameters:

	nr_threads: 100%
	testtime: 60s
	test: tee
	cpufreq_governor: performance



Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20251028/202510281543.28d76c2-lkp@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  gcc-14/performance/x86_64-rhel-9.4/100%/debian-13-x86_64-20250902.cgz/lkp-spr-2sp4/tee/stress-ng/60s

commit: 
  9b332cece9 ("Merge tag 'nfsd-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux")
  24efd1bf8a ("sched/fair: Prefer cache-hot prev_cpu for wakeup")

9b332cece987ee17 24efd1bf8a44f0f51f42f4af4ce 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
     12097 ±  3%     +10.9%      13413 ±  2%  uptime.idle
 3.662e+08 ±  7%    +382.7%  1.768e+09        cpuidle..time
   5056131 ± 56%    +426.8%   26635997 ±  3%  cpuidle..usage
  13144587 ± 11%     +21.1%   15921410        meminfo.Memused
  13326158 ± 11%     +20.6%   16067699        meminfo.max_used_kB
  58707455           -16.5%   49043102 ±  9%  numa-numastat.node1.local_node
  58841583           -16.4%   49176968 ±  9%  numa-numastat.node1.numa_hit
  58770618           -16.3%   49175467 ±  9%  numa-vmstat.node1.numa_hit
  58636509           -16.4%   49041602 ±  9%  numa-vmstat.node1.numa_local
      2184 ±  9%   +2157.3%      49310 ±  3%  perf-c2c.DRAM.remote
      3115 ± 11%   +1689.3%      55737 ±  3%  perf-c2c.HITM.local
      1193 ± 13%   +2628.6%      32575 ±  3%  perf-c2c.HITM.remote
      4308 ± 10%   +1949.6%      88312        perf-c2c.HITM.total
      1.95 ±  6%     +10.4       12.34        mpstat.cpu.all.idle%
      0.50 ±  3%      +1.0        1.53        mpstat.cpu.all.irq%
      0.02 ±  6%      +0.1        0.09 ±  5%  mpstat.cpu.all.soft%
     74.24            -7.0       67.21        mpstat.cpu.all.sys%
     23.29            -4.5       18.83        mpstat.cpu.all.usr%
    232818 ± 35%     -18.3%     190138        proc-vmstat.nr_anon_pages
    124104            -1.1%     122691        proc-vmstat.nr_slab_unreclaimable
 1.167e+08           -15.0%   99106005        proc-vmstat.numa_hit
 1.164e+08           -15.1%   98853060        proc-vmstat.numa_local
 1.168e+08           -15.2%   99060661        proc-vmstat.pgalloc_normal
 1.147e+08           -15.7%   96704739        proc-vmstat.pgfree
 1.071e+08 ±  2%     +76.8%  1.894e+08 ±  2%  stress-ng.tee.ops
   1786177 ±  2%     +76.8%    3157701 ±  2%  stress-ng.tee.ops_per_sec
 1.044e+08           -49.4%   52882701        stress-ng.time.involuntary_context_switches
     21972           -12.1%      19317        stress-ng.time.percent_of_cpu_this_job_got
     10131            -9.6%       9155        stress-ng.time.system_time
      3070           -20.2%       2450        stress-ng.time.user_time
 1.512e+08           -37.9%   93853736        stress-ng.time.voluntary_context_switches
      2816           -10.5%       2519        turbostat.Avg_MHz
     97.12            -9.8       87.30        turbostat.Busy%
      0.11 ± 52%      +0.5        0.66 ±  5%  turbostat.C1%
      0.40 ± 11%      +8.4        8.78        turbostat.C1E%
      2.39 ±  3%      +1.0        3.42 ±  2%  turbostat.C6%
      1.08 ±  9%    +168.3%       2.90 ±  3%  turbostat.CPU%c1
  32638444          +167.8%   87395049        turbostat.IRQ
    110.56           +14.6      125.14 ±  4%  turbostat.PKG_%
     23.05           +32.8%      30.62        turbostat.RAMWatt
   7559994           -21.3%    5948968        sched_debug.cfs_rq:/.avg_vruntime.avg
  11028968 ± 13%     -38.2%    6818572 ±  4%  sched_debug.cfs_rq:/.avg_vruntime.max
      0.34 ± 13%    +104.0%       0.69 ±  3%  sched_debug.cfs_rq:/.h_nr_queued.stddev
      0.38 ±  8%     +75.2%       0.67 ±  3%  sched_debug.cfs_rq:/.h_nr_runnable.stddev
     20.67 ± 33%   +3672.8%     779.66 ± 73%  sched_debug.cfs_rq:/.load_avg.avg
    519.67         +7141.5%      37631 ± 10%  sched_debug.cfs_rq:/.load_avg.max
     86.71 ± 22%   +5134.0%       4538 ± 39%  sched_debug.cfs_rq:/.load_avg.stddev
   7559994           -21.3%    5948968        sched_debug.cfs_rq:/.min_vruntime.avg
  11028968 ± 13%     -38.2%    6818572 ±  4%  sched_debug.cfs_rq:/.min_vruntime.max
      0.12 ± 17%    +117.1%       0.27 ±  3%  sched_debug.cfs_rq:/.nr_queued.stddev
    809.69 ±  2%     +15.6%     936.26        sched_debug.cfs_rq:/.runnable_avg.avg
      2093 ±  3%     +18.5%       2480 ±  8%  sched_debug.cfs_rq:/.runnable_avg.max
    259.47 ± 18%     +71.8%     445.79 ±  3%  sched_debug.cfs_rq:/.runnable_avg.stddev
    576.64           -10.6%     515.40        sched_debug.cfs_rq:/.util_avg.avg
    137.33 ± 12%     +85.3%     254.45 ±  2%  sched_debug.cfs_rq:/.util_avg.stddev
    609.44           +15.6%     704.34 ±  3%  sched_debug.cfs_rq:/.util_est.avg
      1839 ± 11%     +23.7%       2274 ±  7%  sched_debug.cfs_rq:/.util_est.max
    245.27 ±  7%     +82.4%     447.29 ±  4%  sched_debug.cfs_rq:/.util_est.stddev
    702831 ±  5%     +19.6%     840863 ±  3%  sched_debug.cpu.avg_idle.avg
    378668 ± 14%     +32.2%     500458 ±  6%  sched_debug.cpu.avg_idle.stddev
     44.33 ± 22%     -62.7%      16.52 ± 13%  sched_debug.cpu.clock.stddev
    909.29 ± 12%    +160.6%       2369 ±  4%  sched_debug.cpu.curr->pid.stddev
    639355 ±  5%     +80.6%    1154626        sched_debug.cpu.max_idle_balance_cost.avg
    500000           +57.3%     786555 ± 11%  sched_debug.cpu.max_idle_balance_cost.min
      0.00 ± 20%     -47.2%       0.00 ± 18%  sched_debug.cpu.next_balance.stddev
      0.32 ± 14%    +111.2%       0.68 ±  3%  sched_debug.cpu.nr_running.stddev
    574871           -33.2%     383811        sched_debug.cpu.nr_switches.avg
    788985 ± 11%     -32.4%     533309 ±  6%  sched_debug.cpu.nr_switches.max
      0.04 ± 19%   +1073.1%       0.50        perf-stat.i.MPKI
 1.443e+11           -17.9%  1.184e+11        perf-stat.i.branch-instructions
      0.08 ±  3%      +0.0        0.12        perf-stat.i.branch-miss-rate%
 1.049e+08 ±  2%     +23.6%  1.296e+08        perf-stat.i.branch-misses
     31.56 ± 11%     +16.6       48.19        perf-stat.i.cache-miss-rate%
  25936672 ± 22%   +1080.1%  3.061e+08        perf-stat.i.cache-misses
  77849475 ± 13%    +714.6%  6.342e+08        perf-stat.i.cache-references
   4288231           -33.1%    2868755        perf-stat.i.context-switches
      0.85            +9.6%       0.94        perf-stat.i.cpi
 6.387e+11           -10.2%  5.735e+11        perf-stat.i.cpu-cycles
      2828 ± 24%    +596.1%      19688        perf-stat.i.cpu-migrations
     32456 ± 26%     -94.2%       1870        perf-stat.i.cycles-between-cache-misses
 7.486e+11           -18.2%  6.125e+11        perf-stat.i.instructions
      1.17            -8.8%       1.07        perf-stat.i.ipc
     19.17           -33.2%      12.81        perf-stat.i.metric.K/sec
      0.03 ± 22%   +1341.1%       0.50        perf-stat.overall.MPKI
      0.07 ±  3%      +0.0        0.11        perf-stat.overall.branch-miss-rate%
     33.00 ± 10%     +15.2       48.24        perf-stat.overall.cache-miss-rate%
      0.85            +9.7%       0.94        perf-stat.overall.cpi
     25848 ± 21%     -92.7%       1874        perf-stat.overall.cycles-between-cache-misses
      1.17            -8.9%       1.07        perf-stat.overall.ipc
 1.419e+11           -18.1%  1.162e+11        perf-stat.ps.branch-instructions
 1.028e+08 ±  2%     +23.3%  1.268e+08        perf-stat.ps.branch-misses
  25499974 ± 22%   +1077.5%  3.003e+08        perf-stat.ps.cache-misses
  76519245 ± 13%    +713.3%  6.224e+08        perf-stat.ps.cache-references
   4214394           -33.2%    2815077        perf-stat.ps.context-switches
 6.278e+11           -10.4%  5.627e+11        perf-stat.ps.cpu-cycles
      2763 ± 24%    +598.5%      19305        perf-stat.ps.cpu-migrations
 7.358e+11           -18.3%  6.009e+11        perf-stat.ps.instructions
 4.489e+13           -18.3%  3.668e+13        perf-stat.total.instructions




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [PATCH] sched/fair: Prefer cache-hot prev_cpu for wakeup
Posted by Vincent Guittot 3 months, 2 weeks ago
On Sat, 18 Oct 2025 at 01:01, Shubhang Kaushik via B4 Relay
<devnull+shubhang.os.amperecomputing.com@kernel.org> wrote:
>
> From: Shubhang Kaushik <shubhang@os.amperecomputing.com>
>
> Modify the wakeup path in `select_task_rq_fair()` to prioritize cache
> locality for waking tasks. The previous fast path always attempted to
> find an idle sibling, even if the task's prev CPU was not truly busy.
>
> The original problem was that under some circumstances, this could lead
> to unnecessary task migrations away from a cache-hot core, even when
> the task's prev CPU was a suitable candidate. The scheduler's internal
> mechanism `cpu_overutilized()` provide an evaluation of CPU load.
>
> To address this, the wakeup heuristic is updated to check the status of
> the task's `prev_cpu` first:
> - If the `prev_cpu` is  not overutilized (as determined by
>   `cpu_overutilized()`, via PELT), the task is woken up on
>   its previous CPU. This leverages cache locality and avoids
>   a potentially unnecessary migration.
> - If the `prev_cpu` is considered busy or overutilized, the scheduler
>   falls back to the existing behavior of searching for an idle sibling.
>
> Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
> ---
> This patch optimizes the scheduler's wakeup path to prioritize cache
> locality by keeping a task on its previous CPU if it is not overutilized,
> falling back to a sibling search only when necessary.
> ---
>  kernel/sched/fair.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bc0b7ce8a65d6bbe616953f530f7a02bb619537c..bb0d28d7d9872642cb5a4076caeb3ac9d8fe7bcd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8618,7 +8618,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>                 new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
>         } else if (wake_flags & WF_TTWU) { /* XXX always ? */
>                 /* Fast path */
> -               new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> +
> +               /*
> +                * Avoid wakeup on an overutilized CPU.
> +                * If the previous CPU is not overloaded, retain the same for cache locality.
> +                * Otherwise, search for an idle sibling.
> +                */
> +               if (!cpu_overutilized(prev_cpu))

cpu_overutilized() returns false if (!sched_energy_enabled())

so  new_cpu is always prev_cpu for non EAS aware system which is
probably not what you want

> +                       new_cpu = prev_cpu;
> +               else
> +                       new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
>         }
>         rcu_read_unlock();
>
>
> ---
> base-commit: 9b332cece987ee1790b2ed4c989e28162fa47860
> change-id: 20251017-b4-sched-cfs-refactor-propagate-2c4a820998a4
>
> Best regards,
> --
> Shubhang Kaushik <shubhang@os.amperecomputing.com>
>
>
Re: [PATCH] sched/fair: Prefer cache-hot prev_cpu for wakeup
Posted by Phil Auld 3 months, 2 weeks ago
Hi,

On Fri, Oct 17, 2025 at 04:00:44PM -0700 Shubhang Kaushik via B4 Relay wrote:
> From: Shubhang Kaushik <shubhang@os.amperecomputing.com>
> 
> Modify the wakeup path in `select_task_rq_fair()` to prioritize cache
> locality for waking tasks. The previous fast path always attempted to
> find an idle sibling, even if the task's prev CPU was not truly busy.
> 
> The original problem was that under some circumstances, this could lead
> to unnecessary task migrations away from a cache-hot core, even when
> the task's prev CPU was a suitable candidate. The scheduler's internal
> mechanism `cpu_overutilized()` provides an evaluation of CPU load.
> 
> To address this, the wakeup heuristic is updated to check the status of
> the task's `prev_cpu` first:
> - If the `prev_cpu` is not overutilized (as determined by
>   `cpu_overutilized()`, via PELT), the task is woken up on
>   its previous CPU. This leverages cache locality and avoids
>   a potentially unnecessary migration.
> - If the `prev_cpu` is considered busy or overutilized, the scheduler
>   falls back to the existing behavior of searching for an idle sibling.
> 
> Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
> ---
> This patch optimizes the scheduler's wakeup path to prioritize cache 
> locality by keeping a task on its previous CPU if it is not overutilized,
> falling back to a sibling search only when necessary.
> ---
>  kernel/sched/fair.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bc0b7ce8a65d6bbe616953f530f7a02bb619537c..bb0d28d7d9872642cb5a4076caeb3ac9d8fe7bcd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8618,7 +8618,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>  		new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
>  	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
>  		/* Fast path */
> -		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> +
> +		/*
> +		 * Avoid wakeup on an overutilized CPU.
> +		 * If the previous CPU is not overloaded, retain the same for cache locality.
> +		 * Otherwise, search for an idle sibling.
> +		 */
> +		if (!cpu_overutilized(prev_cpu))
> +			new_cpu = prev_cpu;
> +		else
> +			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);

Won't this be checking whether the CPU is overutilized without the wakee's
contribution? It might well be overutilized once the wakee is placed there.

I suspect this will hurt some workloads. Do you have numbers to share?


Cheers,
Phil


>  	}
>  	rcu_read_unlock();
>  
> 
> ---
> base-commit: 9b332cece987ee1790b2ed4c989e28162fa47860
> change-id: 20251017-b4-sched-cfs-refactor-propagate-2c4a820998a4
> 
> Best regards,
> -- 
> Shubhang Kaushik <shubhang@os.amperecomputing.com>
> 
> 
> 

--