[PATCH] sched/rt: optimize cpupri_vec layout

Pan Deng posted 1 patch 4 months ago
Posted by Pan Deng 4 months ago
When running a multi-instance ffmpeg transcoding workload that uses RT
threads on a high-core-count system, updates to cpupri_vec->count contend
with reads of cpupri_vec->mask, which sits in the same cache line, in
cpupri_find_fitness() and cpupri_set().

This change places the count and mask of each cpupri_vec on separate
cache lines, using the ____cacheline_aligned attribute, to avoid the
false sharing.

Tested on a 2-socket machine with 240 physical cores and 480 logical
cores, running 60 ffmpeg transcoding instances. With the change, kernel
cycles% drops from ~20% to ~12%, and the fps metric improves by ~11%.

The side effect of this change is that the size of struct cpupri grows
from 26 cache lines to 203 cache lines.

Signed-off-by: Pan Deng <pan.deng@intel.com>
Signed-off-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/cpupri.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index d6cba0020064..245b0fa626be 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -9,7 +9,7 @@
 
 struct cpupri_vec {
 	atomic_t		count;
-	cpumask_var_t		mask;
+	cpumask_var_t		mask	____cacheline_aligned;
 };
 
 struct cpupri {
-- 
2.43.5
Re: [PATCH] sched/rt: optimize cpupri_vec layout
Posted by kernel test robot 3 months, 3 weeks ago
Hello,

kernel test robot noticed a 67.7% improvement of stress-ng.mutex.ops_per_sec on:


commit: cd316a87572309a79102940e1856ee877740156e ("[PATCH] sched/rt: optimize cpupri_vec layout")
url: https://github.com/intel-lab-lkp/linux/commits/Pan-Deng/sched-rt-optimize-cpupri_vec-layout/20250612-110857
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git b01f2d9597250e9c4011cb78d8d46287deaa6a69
patch link: https://lore.kernel.org/all/20250612031148.455046-1-pan.deng@intel.com/
patch subject: [PATCH] sched/rt: optimize cpupri_vec layout

testcase: stress-ng
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 192 threads 2 sockets Intel(R) Xeon(R) 6740E  CPU @ 2.4GHz (Sierra Forest) with 256G memory
parameters:

	nr_threads: 100%
	testtime: 60s
	test: mutex
	cpufreq_governor: performance



Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250616/202506161643.ab40fa8e-lkp@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/lkp-srf-2sp2/mutex/stress-ng/60s

commit: 
  b01f2d9597 ("sched/eevdf: Correct the comment in place_entity")
  cd316a8757 ("sched/rt: optimize cpupri_vec layout")

b01f2d9597250e9c cd316a87572309a79102940e185 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
  22409567           +52.5%   34179472 ±  3%  cpuidle..usage
     21410 ± 30%     +26.2%      27010 ± 16%  numa-vmstat.node0.nr_slab_reclaimable
      0.07 ±  2%      +0.0        0.09 ±  2%  mpstat.cpu.all.soft%
      1.06            +0.6        1.63 ±  3%  mpstat.cpu.all.usr%
     85656 ± 30%     +26.1%     108025 ± 16%  numa-meminfo.node0.KReclaimable
     85656 ± 30%     +26.1%     108025 ± 16%  numa-meminfo.node0.SReclaimable
   2398650           +60.1%    3839452 ±  2%  vmstat.system.cs
   1650319           +44.1%    2378651        vmstat.system.in
      1821 ±  7%     +28.5%       2340 ± 14%  perf-c2c.DRAM.local
     17138 ± 14%     +86.2%      31915 ± 17%  perf-c2c.DRAM.remote
     91166 ± 16%    +134.9%     214147 ± 19%  perf-c2c.HITM.local
     13399 ± 13%    +104.1%      27347 ± 16%  perf-c2c.HITM.remote
    104565 ± 15%    +131.0%     241494 ± 19%  perf-c2c.HITM.total
    125201 ±  2%     -39.4%      75820 ±  2%  stress-ng.mutex.nanosecs_per_mutex
  85791341           +67.7%  1.438e+08        stress-ng.mutex.ops
   1429837           +67.7%    2397156        stress-ng.mutex.ops_per_sec
  68606706           +63.5%  1.122e+08 ±  2%  stress-ng.time.involuntary_context_switches
      9345            -1.3%       9226        stress-ng.time.system_time
     99.39           +61.2%     160.24        stress-ng.time.user_time
  56563097           +57.6%   89151856 ±  2%  stress-ng.time.voluntary_context_switches
 7.208e+09 ±  2%     +42.9%   1.03e+10        perf-stat.i.branch-instructions
  52257508           +47.5%   77078460 ±  2%  perf-stat.i.branch-misses
  37265287 ±  2%     +34.4%   50098262 ±  3%  perf-stat.i.cache-misses
 2.416e+08           +42.7%  3.449e+08 ±  2%  perf-stat.i.cache-references
   2500366           +60.9%    4022250 ±  2%  perf-stat.i.context-switches
     20.66           -29.7%      14.53        perf-stat.i.cpi
    490637           +60.9%     789567 ±  2%  perf-stat.i.cpu-migrations
     15477 ±  4%     -25.1%      11585 ±  3%  perf-stat.i.cycles-between-cache-misses
 3.356e+10 ±  2%     +44.2%  4.838e+10        perf-stat.i.instructions
      0.06 ±  9%     +36.2%       0.08        perf-stat.i.ipc
     15.58           +60.8%      25.06 ±  2%  perf-stat.i.metric.K/sec
     17.01 ±  2%     -29.9%      11.93        perf-stat.overall.cpi
     15347 ±  3%     -24.8%      11539 ±  3%  perf-stat.overall.cycles-between-cache-misses
      0.06 ±  2%     +42.5%       0.08        perf-stat.overall.ipc
 7.096e+09 ±  2%     +42.6%  1.012e+10        perf-stat.ps.branch-instructions
  51310401           +47.6%   75731432 ±  2%  perf-stat.ps.branch-misses
  36634137 ±  2%     +34.4%   49233432 ±  3%  perf-stat.ps.cache-misses
 2.378e+08           +42.6%  3.392e+08 ±  2%  perf-stat.ps.cache-references
   2462472           +60.7%    3956471 ±  2%  perf-stat.ps.context-switches
    483238           +60.7%     776702 ±  2%  perf-stat.ps.cpu-migrations
 3.304e+10 ±  2%     +43.9%  4.756e+10        perf-stat.ps.instructions
 2.059e+12 ±  2%     +43.2%  2.949e+12        perf-stat.total.instructions
      0.61 ± 54%     -66.1%       0.21 ± 34%  perf-sched.sch_delay.avg.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.vma_alloc_folio_noprof.alloc_anon_folio
      0.57 ± 63%     -79.0%       0.12 ±137%  perf-sched.sch_delay.avg.ms.__cond_resched.__do_fault.do_read_fault.do_pte_missing.__handle_mm_fault
     20.28 ±215%     -98.8%       0.25 ± 40%  perf-sched.sch_delay.avg.ms.__cond_resched.__wait_for_common.wait_for_completion_state.kernel_clone.__x64_sys_vfork
      5.85 ±133%     -96.7%       0.19 ± 48%  perf-sched.sch_delay.avg.ms.__cond_resched.change_pud_range.isra.0.change_protection_range
      0.62 ± 41%     -65.0%       0.22 ± 31%  perf-sched.sch_delay.avg.ms.__cond_resched.copy_pte_range.copy_p4d_range.copy_page_range.dup_mmap
      0.42 ± 34%     -52.4%       0.20 ± 30%  perf-sched.sch_delay.avg.ms.__cond_resched.down_write.unlink_anon_vmas.free_pgtables.exit_mmap
      0.46 ± 42%     -54.1%       0.21 ± 45%  perf-sched.sch_delay.avg.ms.__cond_resched.kmem_cache_alloc_noprof.mas_alloc_nodes.mas_preallocate.__mmap_new_vma
      0.50 ± 72%     -49.6%       0.25 ± 20%  perf-sched.sch_delay.avg.ms.__cond_resched.kmem_cache_alloc_noprof.vm_area_alloc.__mmap_new_vma.__mmap_region
      0.47 ± 34%     -62.9%       0.18 ± 29%  perf-sched.sch_delay.avg.ms.__cond_resched.mmput.m_stop.seq_read_iter.seq_read
      0.23 ± 56%     -85.7%       0.03 ±154%  perf-sched.sch_delay.avg.ms.__cond_resched.mutex_lock_killable.pcpu_alloc_noprof.mm_init.dup_mm
    248.83 ± 26%     -60.2%      99.05 ± 73%  perf-sched.sch_delay.avg.ms.anon_pipe_read.fifo_pipe_read.vfs_read.ksys_read
      9.67 ±167%     -97.1%       0.28 ± 21%  perf-sched.sch_delay.avg.ms.irqentry_exit_to_user_mode.asm_exc_page_fault.[unknown].[unknown]
     83.18 ± 21%     -68.9%      25.88 ± 26%  perf-sched.sch_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      4.32 ± 91%     -84.6%       0.67 ± 24%  perf-sched.sch_delay.max.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.vma_alloc_folio_noprof.alloc_anon_folio
      0.90 ± 74%     -75.6%       0.22 ±139%  perf-sched.sch_delay.max.ms.__cond_resched.__do_fault.do_read_fault.do_pte_missing.__handle_mm_fault
      1.40 ± 48%     -74.1%       0.36 ± 85%  perf-sched.sch_delay.max.ms.__cond_resched.__kmalloc_cache_noprof.vmstat_start.seq_read_iter.proc_reg_read_iter
    358.00 ±219%     -99.8%       0.86 ± 51%  perf-sched.sch_delay.max.ms.__cond_resched.__wait_for_common.wait_for_completion_state.kernel_clone.__x64_sys_vfork
      1.18 ± 47%     -65.3%       0.41 ± 52%  perf-sched.sch_delay.max.ms.__cond_resched.copy_pte_range.copy_p4d_range.copy_page_range.dup_mmap
      1.20 ± 53%     -70.4%       0.35 ± 58%  perf-sched.sch_delay.max.ms.__cond_resched.down_read.walk_component.link_path_walk.path_openat
      1.30 ± 34%     -68.0%       0.42 ± 73%  perf-sched.sch_delay.max.ms.__cond_resched.mmput.m_stop.seq_read_iter.seq_read
      1.03 ± 40%     -55.5%       0.46 ± 21%  perf-sched.sch_delay.max.ms.__cond_resched.mutex_lock_killable.pcpu_alloc_noprof.__percpu_counter_init_many.mm_init
      0.30 ± 65%     -88.9%       0.03 ±154%  perf-sched.sch_delay.max.ms.__cond_resched.mutex_lock_killable.pcpu_alloc_noprof.mm_init.dup_mm
    281.41 ± 38%    +143.8%     686.20 ± 35%  perf-sched.sch_delay.max.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_write_begin
      0.72 ± 81%     -70.7%       0.21 ± 83%  perf-sched.sch_delay.max.ms.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
    888.12 ±173%     -99.4%       5.74 ± 97%  perf-sched.sch_delay.max.ms.irqentry_exit_to_user_mode.asm_exc_page_fault.[unknown].[unknown]
      2.66 ±  7%     -15.2%       2.25 ±  5%  perf-sched.total_wait_and_delay.average.ms
      1.68 ±  7%     -18.0%       1.38 ±  6%  perf-sched.total_wait_time.average.ms
      1092 ±  6%     -21.5%     857.36 ± 12%  perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      1160 ± 11%     -37.2%     728.54 ± 28%  perf-sched.wait_and_delay.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
    342.17 ±  8%     -16.0%     287.50        perf-sched.wait_and_delay.count.__cond_resched.__wait_for_common.affine_move_task.__set_cpus_allowed_ptr.__sched_setaffinity
      7.50 ± 27%     -60.0%       3.00 ± 76%  perf-sched.wait_and_delay.count.__cond_resched.rcu_gp_cleanup.rcu_gp_kthread.kthread.ret_from_fork
      3012 ±  9%     +30.1%       3919 ±  9%  perf-sched.wait_and_delay.count.irqentry_exit_to_user_mode.asm_sysvec_call_function_single.[unknown].[unknown]
      2811 ±  5%     +32.8%       3732 ±  7%  perf-sched.wait_and_delay.count.irqentry_exit_to_user_mode.asm_sysvec_reschedule_ipi.[unknown].[unknown]
    116.17 ± 21%     -37.9%      72.17 ± 26%  perf-sched.wait_and_delay.count.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
    699.17 ±  3%     -24.6%     527.50 ±  5%  perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
    249.00 ±  2%     -38.5%     153.17 ±  8%  perf-sched.wait_and_delay.count.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
    562.82 ± 38%    +174.7%       1546 ± 51%  perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_write_begin
      0.95 ± 97%     -78.2%       0.21 ± 34%  perf-sched.wait_time.avg.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.vma_alloc_folio_noprof.alloc_anon_folio
      0.57 ± 63%     -79.0%       0.12 ±137%  perf-sched.wait_time.avg.ms.__cond_resched.__do_fault.do_read_fault.do_pte_missing.__handle_mm_fault
      5.85 ±133%     -96.7%       0.19 ± 48%  perf-sched.wait_time.avg.ms.__cond_resched.change_pud_range.isra.0.change_protection_range
      0.62 ± 41%     -65.0%       0.22 ± 31%  perf-sched.wait_time.avg.ms.__cond_resched.copy_pte_range.copy_p4d_range.copy_page_range.dup_mmap
      0.42 ± 34%     -52.4%       0.20 ± 30%  perf-sched.wait_time.avg.ms.__cond_resched.down_write.unlink_anon_vmas.free_pgtables.exit_mmap
      0.35 ± 20%     -45.7%       0.19 ± 49%  perf-sched.wait_time.avg.ms.__cond_resched.dput.step_into.link_path_walk.path_openat
      0.46 ± 42%     -54.1%       0.21 ± 45%  perf-sched.wait_time.avg.ms.__cond_resched.kmem_cache_alloc_noprof.mas_alloc_nodes.mas_preallocate.__mmap_new_vma
      0.50 ± 72%     -49.6%       0.25 ± 20%  perf-sched.wait_time.avg.ms.__cond_resched.kmem_cache_alloc_noprof.vm_area_alloc.__mmap_new_vma.__mmap_region
      0.47 ± 34%     -62.9%       0.18 ± 29%  perf-sched.wait_time.avg.ms.__cond_resched.mmput.m_stop.seq_read_iter.seq_read
      0.23 ± 56%     -85.7%       0.03 ±154%  perf-sched.wait_time.avg.ms.__cond_resched.mutex_lock_killable.pcpu_alloc_noprof.mm_init.dup_mm
      9.74 ±165%     -96.2%       0.37 ± 48%  perf-sched.wait_time.avg.ms.irqentry_exit_to_user_mode.asm_exc_page_fault.[unknown].[unknown]
      1411 ± 33%     -65.5%     487.33 ±141%  perf-sched.wait_time.avg.ms.schedule_timeout.kcompactd.kthread.ret_from_fork
      1009 ±  7%     -17.6%     831.48 ± 12%  perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      1150 ± 12%     -37.7%     717.38 ± 28%  perf-sched.wait_time.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
     10.09 ±138%     -93.4%       0.67 ± 24%  perf-sched.wait_time.max.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.vma_alloc_folio_noprof.alloc_anon_folio
      0.90 ± 74%     -75.6%       0.22 ±139%  perf-sched.wait_time.max.ms.__cond_resched.__do_fault.do_read_fault.do_pte_missing.__handle_mm_fault
      1.40 ± 48%     -74.1%       0.36 ± 85%  perf-sched.wait_time.max.ms.__cond_resched.__kmalloc_cache_noprof.vmstat_start.seq_read_iter.proc_reg_read_iter
    715.15 ±160%     -99.7%       2.21 ±133%  perf-sched.wait_time.max.ms.__cond_resched.__wait_for_common.wait_for_completion_state.kernel_clone.__x64_sys_vfork
      1.18 ± 47%     -65.3%       0.41 ± 52%  perf-sched.wait_time.max.ms.__cond_resched.copy_pte_range.copy_p4d_range.copy_page_range.dup_mmap
      1.20 ± 53%     -70.4%       0.35 ± 58%  perf-sched.wait_time.max.ms.__cond_resched.down_read.walk_component.link_path_walk.path_openat
      1.30 ± 34%     -68.0%       0.42 ± 73%  perf-sched.wait_time.max.ms.__cond_resched.mmput.m_stop.seq_read_iter.seq_read
      1.03 ± 40%     -55.5%       0.46 ± 21%  perf-sched.wait_time.max.ms.__cond_resched.mutex_lock_killable.pcpu_alloc_noprof.__percpu_counter_init_many.mm_init
      0.30 ± 65%     -88.9%       0.03 ±154%  perf-sched.wait_time.max.ms.__cond_resched.mutex_lock_killable.pcpu_alloc_noprof.mm_init.dup_mm
    281.41 ± 38%    +205.6%     859.89 ± 67%  perf-sched.wait_time.max.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_write_begin
    167.37 ±222%     -99.9%       0.21 ± 83%  perf-sched.wait_time.max.ms.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
     31.88            -1.9       30.02        perf-profile.calltrace.cycles-pp.__schedule.schedule.futex_do_wait.__futex_wait.futex_wait
     31.90            -1.9       30.05        perf-profile.calltrace.cycles-pp.schedule.futex_do_wait.__futex_wait.futex_wait.do_futex
     31.96            -1.8       30.14        perf-profile.calltrace.cycles-pp.futex_do_wait.__futex_wait.futex_wait.do_futex.__x64_sys_futex
     32.28            -1.7       30.62        perf-profile.calltrace.cycles-pp.__futex_wait.futex_wait.do_futex.__x64_sys_futex.do_syscall_64
      9.30 ±  3%      -1.6        7.66 ±  5%  perf-profile.calltrace.cycles-pp.pull_rt_task.balance_callbacks.__sched_setscheduler._sched_setscheduler.do_sched_setscheduler
     32.29            -1.6       30.65        perf-profile.calltrace.cycles-pp.futex_wait.do_futex.__x64_sys_futex.do_syscall_64.entry_SYSCALL_64_after_hwframe
     26.81            -1.6       25.25        perf-profile.calltrace.cycles-pp.find_lock_lowest_rq.push_rt_task.push_rt_tasks.finish_task_switch.__schedule
     10.51 ±  3%      -1.5        9.00 ±  5%  perf-profile.calltrace.cycles-pp.balance_callbacks.__sched_setscheduler._sched_setscheduler.do_sched_setscheduler.__x64_sys_sched_setscheduler
     13.57            -1.5       12.11 ±  4%  perf-profile.calltrace.cycles-pp.cpupri_find_fitness.find_lowest_rq.find_lock_lowest_rq.push_rt_task.push_rt_tasks
     13.55            -1.5       12.09 ±  4%  perf-profile.calltrace.cycles-pp.__cpupri_find.cpupri_find_fitness.find_lowest_rq.find_lock_lowest_rq.push_rt_task
     13.61            -1.4       12.17 ±  4%  perf-profile.calltrace.cycles-pp.find_lowest_rq.find_lock_lowest_rq.push_rt_task.push_rt_tasks.finish_task_switch
      5.90 ±  3%      -1.2        4.68 ±  4%  perf-profile.calltrace.cycles-pp.pull_rt_task.balance_rt.__pick_next_task.__schedule.schedule
      5.92 ±  3%      -1.2        4.70 ±  4%  perf-profile.calltrace.cycles-pp.balance_rt.__pick_next_task.__schedule.schedule.futex_do_wait
     32.27            -1.2       31.09        perf-profile.calltrace.cycles-pp.push_rt_task.push_rt_tasks.finish_task_switch.__schedule.schedule
     38.40            -0.9       37.47        perf-profile.calltrace.cycles-pp.__sched_setscheduler._sched_setscheduler.do_sched_setscheduler.__x64_sys_sched_setscheduler.do_syscall_64
     38.47            -0.9       37.57        perf-profile.calltrace.cycles-pp._sched_setscheduler.do_sched_setscheduler.__x64_sys_sched_setscheduler.do_syscall_64.entry_SYSCALL_64_after_hwframe
     19.36            -0.9       18.47        perf-profile.calltrace.cycles-pp.push_rt_tasks.finish_task_switch.__schedule.schedule.futex_do_wait
     40.08            -0.9       39.21        perf-profile.calltrace.cycles-pp.do_futex.__x64_sys_futex.do_syscall_64.entry_SYSCALL_64_after_hwframe
     40.10            -0.9       39.23        perf-profile.calltrace.cycles-pp.__x64_sys_futex.do_syscall_64.entry_SYSCALL_64_after_hwframe
     19.82            -0.8       18.98        perf-profile.calltrace.cycles-pp.finish_task_switch.__schedule.schedule.futex_do_wait.__futex_wait
      7.86 ±  2%      -0.8        7.05 ±  4%  perf-profile.calltrace.cycles-pp.__pick_next_task.__schedule.schedule.futex_do_wait.__futex_wait
     38.68            -0.8       37.89        perf-profile.calltrace.cycles-pp.do_sched_setscheduler.__x64_sys_sched_setscheduler.do_syscall_64.entry_SYSCALL_64_after_hwframe.__sched_setscheduler
     38.68            -0.8       37.89        perf-profile.calltrace.cycles-pp.__x64_sys_sched_setscheduler.do_syscall_64.entry_SYSCALL_64_after_hwframe.__sched_setscheduler
     42.56            -0.8       41.79        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__sched_setscheduler
     42.57            -0.8       41.81        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__sched_setscheduler
     11.76            -0.8       11.01 ±  2%  perf-profile.calltrace.cycles-pp.cpupri_set.enqueue_task_rt.enqueue_task.__sched_setscheduler._sched_setscheduler
     41.14            -0.7       40.41        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
     41.14            -0.7       40.42        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
     42.78            -0.6       42.15        perf-profile.calltrace.cycles-pp.__sched_setscheduler
     12.24            -0.5       11.75 ±  2%  perf-profile.calltrace.cycles-pp.cpupri_set.dequeue_rt_stack.dequeue_task_rt.__sched_setscheduler._sched_setscheduler
      3.53 ±  2%      -0.5        3.04 ±  3%  perf-profile.calltrace.cycles-pp.cpupri_set.dequeue_rt_stack.dequeue_task_rt.try_to_block_task.__schedule
      3.56 ±  2%      -0.5        3.08 ±  3%  perf-profile.calltrace.cycles-pp.dequeue_rt_stack.dequeue_task_rt.try_to_block_task.__schedule.schedule
     12.30            -0.5       11.83 ±  2%  perf-profile.calltrace.cycles-pp.dequeue_rt_stack.dequeue_task_rt.__sched_setscheduler._sched_setscheduler.do_sched_setscheduler
      3.62 ±  2%      -0.4        3.23 ±  3%  perf-profile.calltrace.cycles-pp.try_to_block_task.__schedule.schedule.futex_do_wait.__futex_wait
      2.19 ±  3%      -0.4        1.80 ±  3%  perf-profile.calltrace.cycles-pp.cpupri_set.enqueue_task_rt.enqueue_task.activate_task.push_rt_task
      3.62 ±  2%      -0.4        3.23 ±  3%  perf-profile.calltrace.cycles-pp.dequeue_task_rt.try_to_block_task.__schedule.schedule.futex_do_wait
      2.27 ±  3%      -0.2        2.04 ±  2%  perf-profile.calltrace.cycles-pp.enqueue_task_rt.enqueue_task.activate_task.push_rt_task.push_rt_tasks
      4.20            -0.2        4.01        perf-profile.calltrace.cycles-pp.finish_task_switch.__schedule.schedule.exit_to_user_mode_loop.do_syscall_64
      3.90            -0.2        3.72        perf-profile.calltrace.cycles-pp.push_rt_tasks.finish_task_switch.__schedule.schedule.exit_to_user_mode_loop
      0.96            +0.0        1.00        perf-profile.calltrace.cycles-pp.__pick_next_task.__schedule.schedule_idle.do_idle.cpu_startup_entry
      0.96 ±  2%      +0.1        1.05 ±  4%  perf-profile.calltrace.cycles-pp.enqueue_pushable_task.enqueue_task.activate_task.push_rt_task.push_rt_tasks
      1.32            +0.1        1.43 ±  3%  perf-profile.calltrace.cycles-pp.finish_task_switch.__schedule.schedule_idle.do_idle.cpu_startup_entry
      1.56 ±  3%      +0.1        1.67 ±  3%  perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.__sched_setscheduler
      1.19 ±  2%      +0.1        1.31 ±  3%  perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.finish_task_switch
      1.56 ±  3%      +0.1        1.68 ±  3%  perf-profile.calltrace.cycles-pp.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.__sched_setscheduler._sched_setscheduler
      1.57 ±  3%      +0.1        1.69 ±  3%  perf-profile.calltrace.cycles-pp.sysvec_call_function_single.asm_sysvec_call_function_single.__sched_setscheduler._sched_setscheduler.do_sched_setscheduler
      1.20 ±  2%      +0.1        1.32 ±  3%  perf-profile.calltrace.cycles-pp.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.finish_task_switch.__schedule
      1.00            +0.1        1.13 ±  2%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.raw_spin_rq_lock_nested.balance_callbacks.__sched_setscheduler
      1.20 ±  2%      +0.1        1.33 ±  3%  perf-profile.calltrace.cycles-pp.sysvec_call_function_single.asm_sysvec_call_function_single.finish_task_switch.__schedule.schedule_idle
      1.59 ±  3%      +0.1        1.72 ±  3%  perf-profile.calltrace.cycles-pp.asm_sysvec_call_function_single.__sched_setscheduler._sched_setscheduler.do_sched_setscheduler.__x64_sys_sched_setscheduler
      0.93 ±  4%      +0.1        1.06 ±  4%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.raw_spin_rq_lock_nested.task_rq_lock.__sched_setscheduler
      1.03            +0.1        1.16 ±  3%  perf-profile.calltrace.cycles-pp.raw_spin_rq_lock_nested.balance_callbacks.__sched_setscheduler._sched_setscheduler.do_sched_setscheduler
      1.02            +0.1        1.15 ±  2%  perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested.balance_callbacks.__sched_setscheduler._sched_setscheduler
      1.22 ±  2%      +0.1        1.34 ±  3%  perf-profile.calltrace.cycles-pp.asm_sysvec_call_function_single.finish_task_switch.__schedule.schedule_idle.do_idle
      0.96 ±  4%      +0.1        1.10 ±  4%  perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested.task_rq_lock.__sched_setscheduler._sched_setscheduler
      0.96 ±  4%      +0.1        1.10 ±  4%  perf-profile.calltrace.cycles-pp.raw_spin_rq_lock_nested.task_rq_lock.__sched_setscheduler._sched_setscheduler.do_sched_setscheduler
      1.65 ±  2%      +0.1        1.80 ±  3%  perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.pv_native_safe_halt
      1.66 ±  2%      +0.1        1.80 ±  3%  perf-profile.calltrace.cycles-pp.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.pv_native_safe_halt.acpi_safe_halt
      1.68 ±  2%      +0.2        1.84 ±  3%  perf-profile.calltrace.cycles-pp.sysvec_call_function_single.asm_sysvec_call_function_single.pv_native_safe_halt.acpi_safe_halt.acpi_idle_do_entry
      1.81 ±  2%      +0.2        1.98 ±  3%  perf-profile.calltrace.cycles-pp.pv_native_safe_halt.acpi_safe_halt.acpi_idle_do_entry.acpi_idle_enter.cpuidle_enter_state
      1.11 ±  4%      +0.2        1.28 ±  4%  perf-profile.calltrace.cycles-pp.task_rq_lock.__sched_setscheduler._sched_setscheduler.do_sched_setscheduler.__x64_sys_sched_setscheduler
      1.75 ±  2%      +0.2        1.92 ±  3%  perf-profile.calltrace.cycles-pp.asm_sysvec_call_function_single.pv_native_safe_halt.acpi_safe_halt.acpi_idle_do_entry.acpi_idle_enter
      1.92 ±  2%      +0.2        2.12 ±  3%  perf-profile.calltrace.cycles-pp.acpi_safe_halt.acpi_idle_do_entry.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter
      1.92 ±  2%      +0.2        2.13 ±  3%  perf-profile.calltrace.cycles-pp.acpi_idle_do_entry.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
      1.92 ±  2%      +0.2        2.13 ±  3%  perf-profile.calltrace.cycles-pp.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
      1.96 ±  2%      +0.2        2.17 ±  3%  perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
     10.09            +0.2       10.31        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__sched_yield
      1.95 ±  2%      +0.2        2.17 ±  3%  perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
     10.10            +0.2       10.32        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__sched_yield
      2.02 ±  2%      +0.2        2.26 ±  2%  perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.common_startup_64
      1.44            +0.3        1.69        perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested.sched_ttwu_pending.__flush_smp_call_function_queue.__sysvec_call_function_single
      1.44            +0.3        1.69        perf-profile.calltrace.cycles-pp.raw_spin_rq_lock_nested.sched_ttwu_pending.__flush_smp_call_function_queue.__sysvec_call_function_single.sysvec_call_function_single
      1.43            +0.3        1.68 ±  2%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.raw_spin_rq_lock_nested.sched_ttwu_pending.__flush_smp_call_function_queue
      0.86            +0.3        1.12 ±  4%  perf-profile.calltrace.cycles-pp.dequeue_task_rt.push_rt_task.push_rt_tasks.finish_task_switch.__schedule
      4.26            +0.3        4.58 ±  2%  perf-profile.calltrace.cycles-pp.sched_ttwu_pending.__flush_smp_call_function_queue.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single
      2.68            +0.3        3.01 ±  2%  perf-profile.calltrace.cycles-pp.__schedule.schedule_idle.do_idle.cpu_startup_entry.start_secondary
      2.69            +0.3        3.02 ±  2%  perf-profile.calltrace.cycles-pp.schedule_idle.do_idle.cpu_startup_entry.start_secondary.common_startup_64
     10.28            +0.4       10.63        perf-profile.calltrace.cycles-pp.__sched_yield
      4.90            +0.6        5.53        perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.common_startup_64
      4.90            +0.6        5.53        perf-profile.calltrace.cycles-pp.start_secondary.common_startup_64
      4.90            +0.6        5.53        perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.common_startup_64
      4.94            +0.6        5.57        perf-profile.calltrace.cycles-pp.common_startup_64
      0.00            +0.7        0.72 ±  8%  perf-profile.calltrace.cycles-pp.balance_fair.__pick_next_task.__schedule.schedule.futex_do_wait
      0.00            +0.7        0.72 ±  8%  perf-profile.calltrace.cycles-pp.sched_balance_newidle.balance_fair.__pick_next_task.__schedule.schedule
      7.78 ±  3%      +0.8        8.55 ±  3%  perf-profile.calltrace.cycles-pp.futex_wake.do_futex.__x64_sys_futex.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.75 ±  8%      +5.8        6.56 ±  6%  perf-profile.calltrace.cycles-pp._find_first_and_bit.__cpupri_find.cpupri_find_fitness.find_lowest_rq.find_lock_lowest_rq
     15.21 ±  3%      -2.9       12.34 ±  5%  perf-profile.children.cycles-pp.pull_rt_task
     32.03            -2.8       29.24 ±  2%  perf-profile.children.cycles-pp.cpupri_set
     46.02            -1.9       44.16        perf-profile.children.cycles-pp.schedule
     31.96            -1.8       30.14        perf-profile.children.cycles-pp.futex_do_wait
     32.28            -1.7       30.63        perf-profile.children.cycles-pp.__futex_wait
     32.29            -1.6       30.65        perf-profile.children.cycles-pp.futex_wait
     28.29            -1.6       26.73        perf-profile.children.cycles-pp.find_lock_lowest_rq
     48.72            -1.5       47.18        perf-profile.children.cycles-pp.__schedule
     81.25            -1.5       79.73        perf-profile.children.cycles-pp.__sched_setscheduler
     15.35            -1.5       13.84 ±  4%  perf-profile.children.cycles-pp.cpupri_find_fitness
     15.33            -1.5       13.82 ±  5%  perf-profile.children.cycles-pp.__cpupri_find
     10.53 ±  3%      -1.5        9.02 ±  5%  perf-profile.children.cycles-pp.balance_callbacks
     15.39            -1.5       13.90 ±  4%  perf-profile.children.cycles-pp.find_lowest_rq
     93.88            -1.3       92.61        perf-profile.children.cycles-pp.do_syscall_64
     93.90            -1.2       92.65        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
      5.94 ±  3%      -1.2        4.72 ±  4%  perf-profile.children.cycles-pp.balance_rt
     32.36            -1.2       31.21        perf-profile.children.cycles-pp.push_rt_tasks
     34.19            -1.1       33.09        perf-profile.children.cycles-pp.push_rt_task
     34.74            -1.0       33.76        perf-profile.children.cycles-pp.finish_task_switch
     15.91            -0.9       14.98 ±  2%  perf-profile.children.cycles-pp.dequeue_rt_stack
     38.47            -0.9       37.57        perf-profile.children.cycles-pp._sched_setscheduler
     40.08            -0.9       39.21        perf-profile.children.cycles-pp.do_futex
     40.10            -0.9       39.23        perf-profile.children.cycles-pp.__x64_sys_futex
     38.68            -0.8       37.89        perf-profile.children.cycles-pp.__x64_sys_sched_setscheduler
     38.68            -0.8       37.89        perf-profile.children.cycles-pp.do_sched_setscheduler
      9.06 ±  2%      -0.7        8.40 ±  3%  perf-profile.children.cycles-pp.__pick_next_task
      3.62 ±  2%      -0.4        3.23 ±  3%  perf-profile.children.cycles-pp.try_to_block_task
      0.22 ±  2%      -0.1        0.16 ±  5%  perf-profile.children.cycles-pp.sched_tick
      0.26 ±  2%      -0.1        0.20 ±  3%  perf-profile.children.cycles-pp.tick_nohz_handler
      0.26            -0.1        0.20 ±  4%  perf-profile.children.cycles-pp.__hrtimer_run_queues
      0.44 ±  2%      -0.1        0.38 ±  4%  perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
      0.25 ±  2%      -0.1        0.19 ±  2%  perf-profile.children.cycles-pp.update_process_times
      0.30 ±  2%      -0.1        0.24 ±  3%  perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
      0.42 ±  2%      -0.1        0.36 ±  4%  perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
      0.29            -0.1        0.23 ±  4%  perf-profile.children.cycles-pp.hrtimer_interrupt
      0.20 ±  4%      -0.0        0.16 ± 12%  perf-profile.children.cycles-pp.sched_balance_update_blocked_averages
      0.36            -0.0        0.32        perf-profile.children.cycles-pp.irq_work_single
      0.09 ±  6%      -0.0        0.06 ± 13%  perf-profile.children.cycles-pp.sched_balance_rq
      0.04 ± 45%      +0.0        0.06 ±  7%  perf-profile.children.cycles-pp.menu_select
      0.06 ±  6%      +0.0        0.08 ± 10%  perf-profile.children.cycles-pp.plist_add
      0.06 ±  8%      +0.0        0.08 ±  5%  perf-profile.children.cycles-pp.native_irq_return_iret
      0.13 ±  3%      +0.0        0.16 ±  4%  perf-profile.children.cycles-pp.irqentry_exit_to_user_mode
      0.05            +0.0        0.08        perf-profile.children.cycles-pp.pick_task_rt
      0.04 ± 44%      +0.0        0.07 ±  9%  perf-profile.children.cycles-pp.do_perf_trace_sched_stat_runtime
      0.06 ±  8%      +0.0        0.09 ±  4%  perf-profile.children.cycles-pp._copy_from_user
      0.06 ±  7%      +0.0        0.10 ±  4%  perf-profile.children.cycles-pp.find_task_by_vpid
      0.04 ± 44%      +0.0        0.07 ±  6%  perf-profile.children.cycles-pp.sched_mm_cid_migrate_to
      0.10 ±  8%      +0.0        0.13 ±  6%  perf-profile.children.cycles-pp.prepare_task_switch
      0.07 ± 11%      +0.0        0.10 ±  8%  perf-profile.children.cycles-pp.llist_reverse_order
      0.06 ±  9%      +0.0        0.10 ± 10%  perf-profile.children.cycles-pp.pthread_mutex_lock
      0.08            +0.0        0.12 ±  5%  perf-profile.children.cycles-pp.sched_clock
      0.05 ±  7%      +0.0        0.09        perf-profile.children.cycles-pp.__get_user_8
      0.06 ±  6%      +0.0        0.10 ±  3%  perf-profile.children.cycles-pp.rseq_get_rseq_cs
      0.07            +0.0        0.11 ±  4%  perf-profile.children.cycles-pp.__resched_curr
      0.09 ±  4%      +0.0        0.13 ±  5%  perf-profile.children.cycles-pp.sched_clock_cpu
      0.08 ±  6%      +0.0        0.12 ±  4%  perf-profile.children.cycles-pp.futex_unqueue
      0.10 ±  9%      +0.0        0.14 ±  6%  perf-profile.children.cycles-pp.native_sched_clock
      0.08 ±  8%      +0.0        0.13        perf-profile.children.cycles-pp.find_get_task
      0.09 ±  4%      +0.0        0.14        perf-profile.children.cycles-pp.wakeup_preempt
      0.00            +0.1        0.05        perf-profile.children.cycles-pp.raw_spin_rq_trylock
      0.10 ±  6%      +0.1        0.15 ±  4%  perf-profile.children.cycles-pp.futex_wake_mark
      0.09 ±  5%      +0.1        0.14 ±  3%  perf-profile.children.cycles-pp.rseq_ip_fixup
      0.09 ±  4%      +0.1        0.14 ±  4%  perf-profile.children.cycles-pp.pthread_setschedparam
      0.10 ±  7%      +0.1        0.16 ±  6%  perf-profile.children.cycles-pp.do_perf_trace_sched_wakeup_template
      0.00            +0.1        0.06 ±  9%  perf-profile.children.cycles-pp.___perf_sw_event
      0.00            +0.1        0.06 ±  6%  perf-profile.children.cycles-pp.__x2apic_send_IPI_dest
      0.00            +0.1        0.06 ±  6%  perf-profile.children.cycles-pp.native_apic_msr_eoi
      0.11 ±  6%      +0.1        0.17 ±  6%  perf-profile.children.cycles-pp.ttwu_queue_wakelist
      0.09 ±  4%      +0.1        0.15 ±  2%  perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.11 ±  4%      +0.1        0.17 ±  2%  perf-profile.children.cycles-pp.stress_mutex_exercise
      0.00            +0.1        0.06        perf-profile.children.cycles-pp.sysvec_reschedule_ipi
      0.13 ±  5%      +0.1        0.19 ±  4%  perf-profile.children.cycles-pp.update_rq_clock
      0.00            +0.1        0.06 ±  6%  perf-profile.children.cycles-pp.rseq_update_cpu_node_id
      0.13 ±  5%      +0.1        0.20 ±  3%  perf-profile.children.cycles-pp.update_rq_clock_task
      0.00            +0.1        0.06 ±  7%  perf-profile.children.cycles-pp.__smp_call_single_queue
      0.10 ±  4%      +0.1        0.17 ±  2%  perf-profile.children.cycles-pp.entry_SYSCALL_64
      0.00            +0.1        0.07 ±  7%  perf-profile.children.cycles-pp.__radix_tree_lookup
      0.16 ±  7%      +0.1        0.22 ±  7%  perf-profile.children.cycles-pp.update_curr_common
      0.00            +0.1        0.07 ±  5%  perf-profile.children.cycles-pp.__wrgsbase_inactive
      0.10 ±  5%      +0.1        0.17 ±  2%  perf-profile.children.cycles-pp.rt_mutex_adjust_pi
      0.13 ±  3%      +0.1        0.20        perf-profile.children.cycles-pp.switch_mm_irqs_off
      0.14 ±  4%      +0.1        0.22 ±  3%  perf-profile.children.cycles-pp.__rseq_handle_notify_resume
      0.17 ±  5%      +0.1        0.26 ±  5%  perf-profile.children.cycles-pp.__futex_hash
      0.24 ±  3%      +0.1        0.34 ±  2%  perf-profile.children.cycles-pp._raw_spin_lock_irqsave
      0.17 ±  3%      +0.1        0.27 ±  3%  perf-profile.children.cycles-pp.os_xsave
      0.15 ±  4%      +0.1        0.25 ±  3%  perf-profile.children.cycles-pp.__switch_to
      0.24 ±  2%      +0.1        0.34 ±  3%  perf-profile.children.cycles-pp.asm_sysvec_reschedule_ipi
      0.18 ±  2%      +0.1        0.29        perf-profile.children.cycles-pp.set_load_weight
      0.22 ±  4%      +0.1        0.34 ±  5%  perf-profile.children.cycles-pp.futex_wait_setup
      0.23 ±  2%      +0.2        0.38 ±  2%  perf-profile.children.cycles-pp.restore_fpregs_from_fpstate
      1.12 ±  4%      +0.2        1.29 ±  4%  perf-profile.children.cycles-pp.task_rq_lock
      0.28            +0.2        0.46 ±  2%  perf-profile.children.cycles-pp.switch_fpu_return
      0.36 ±  5%      +0.2        0.55 ±  4%  perf-profile.children.cycles-pp.futex_hash
      1.93 ±  2%      +0.2        2.14 ±  3%  perf-profile.children.cycles-pp.acpi_idle_do_entry
      1.93 ±  2%      +0.2        2.14 ±  3%  perf-profile.children.cycles-pp.acpi_safe_halt
      1.93 ±  2%      +0.2        2.14 ±  3%  perf-profile.children.cycles-pp.pv_native_safe_halt
      1.93 ±  2%      +0.2        2.14 ±  3%  perf-profile.children.cycles-pp.acpi_idle_enter
      1.97 ±  2%      +0.2        2.18 ±  3%  perf-profile.children.cycles-pp.cpuidle_enter_state
      1.97 ±  2%      +0.2        2.19 ±  3%  perf-profile.children.cycles-pp.cpuidle_enter
      2.04 ±  2%      +0.2        2.28 ±  3%  perf-profile.children.cycles-pp.cpuidle_idle_call
      3.62            +0.3        3.88 ±  3%  perf-profile.children.cycles-pp.enqueue_pushable_task
      0.40 ±  4%      +0.3        0.72 ±  8%  perf-profile.children.cycles-pp.balance_fair
      0.40 ±  4%      +0.3        0.72 ±  8%  perf-profile.children.cycles-pp.sched_balance_newidle
      2.71            +0.3        3.05 ±  2%  perf-profile.children.cycles-pp.schedule_idle
      5.99            +0.3        6.33        perf-profile.children.cycles-pp.ttwu_do_activate
     10.30            +0.4       10.67        perf-profile.children.cycles-pp.__sched_yield
      5.26 ±  2%      +0.5        5.71 ±  3%  perf-profile.children.cycles-pp.sched_ttwu_pending
      5.71 ±  2%      +0.5        6.18 ±  3%  perf-profile.children.cycles-pp.__sysvec_call_function_single
      5.82 ±  2%      +0.5        6.30 ±  3%  perf-profile.children.cycles-pp.__flush_smp_call_function_queue
      5.76 ±  2%      +0.5        6.25 ±  3%  perf-profile.children.cycles-pp.sysvec_call_function_single
      5.91 ±  2%      +0.6        6.48 ±  3%  perf-profile.children.cycles-pp.asm_sysvec_call_function_single
      4.93            +0.6        5.56        perf-profile.children.cycles-pp.do_idle
      4.90            +0.6        5.53        perf-profile.children.cycles-pp.start_secondary
      4.94            +0.6        5.57        perf-profile.children.cycles-pp.common_startup_64
      4.94            +0.6        5.57        perf-profile.children.cycles-pp.cpu_startup_entry
      7.78 ±  3%      +0.8        8.55 ±  3%  perf-profile.children.cycles-pp.futex_wake
      1.45 ±  8%      +6.0        7.48 ±  6%  perf-profile.children.cycles-pp._find_first_and_bit
     12.44            -7.7        4.74 ±  7%  perf-profile.self.cycles-pp.__cpupri_find
     14.90 ±  3%      -2.9       12.04 ±  5%  perf-profile.self.cycles-pp.pull_rt_task
     32.02            -2.8       29.23 ±  2%  perf-profile.self.cycles-pp.cpupri_set
      0.08            -0.0        0.06 ±  7%  perf-profile.self.cycles-pp.irq_work_single
      0.06            +0.0        0.08        perf-profile.self.cycles-pp.prepare_task_switch
      0.06 ±  8%      +0.0        0.08 ±  6%  perf-profile.self.cycles-pp.update_rq_clock
      0.06 ±  9%      +0.0        0.08 ±  6%  perf-profile.self.cycles-pp.plist_add
      0.07 ±  5%      +0.0        0.09 ±  5%  perf-profile.self.cycles-pp.do_sched_setscheduler
      0.06 ±  6%      +0.0        0.08 ±  5%  perf-profile.self.cycles-pp.futex_do_wait
      0.05            +0.0        0.08 ±  4%  perf-profile.self.cycles-pp.pick_task_rt
      0.06 ±  8%      +0.0        0.08 ±  5%  perf-profile.self.cycles-pp.native_irq_return_iret
      0.05            +0.0        0.08        perf-profile.self.cycles-pp._copy_from_user
      0.06 ±  6%      +0.0        0.09 ±  6%  perf-profile.self.cycles-pp.__pick_next_task
      0.06 ± 11%      +0.0        0.09 ±  6%  perf-profile.self.cycles-pp.sched_ttwu_pending
      0.05 ±  7%      +0.0        0.08 ±  5%  perf-profile.self.cycles-pp.push_rt_task
      0.06 ±  7%      +0.0        0.10 ±  6%  perf-profile.self.cycles-pp.do_perf_trace_sched_wakeup_template
      0.13 ±  6%      +0.0        0.16 ±  5%  perf-profile.self.cycles-pp.__flush_smp_call_function_queue
      0.07 ± 11%      +0.0        0.10 ±  8%  perf-profile.self.cycles-pp.llist_reverse_order
      0.05 ±  7%      +0.0        0.09 ±  4%  perf-profile.self.cycles-pp.__get_user_8
      0.08 ±  6%      +0.0        0.11 ±  3%  perf-profile.self.cycles-pp.finish_task_switch
      0.03 ± 70%      +0.0        0.07 ±  5%  perf-profile.self.cycles-pp.sched_mm_cid_migrate_to
      0.06 ±  6%      +0.0        0.10        perf-profile.self.cycles-pp.pthread_setschedparam
      0.07 ±  6%      +0.0        0.11 ±  4%  perf-profile.self.cycles-pp.futex_unqueue
      0.06 ±  7%      +0.0        0.10 ±  4%  perf-profile.self.cycles-pp.entry_SYSCALL_64
      0.11            +0.0        0.15 ±  3%  perf-profile.self.cycles-pp.pv_native_safe_halt
      0.03 ± 70%      +0.0        0.07 ±  6%  perf-profile.self.cycles-pp.ttwu_queue_wakelist
      0.07            +0.0        0.11 ±  6%  perf-profile.self.cycles-pp.__resched_curr
      0.09 ±  7%      +0.0        0.13 ±  2%  perf-profile.self.cycles-pp.futex_wake_mark
      0.10 ±  9%      +0.0        0.14 ±  7%  perf-profile.self.cycles-pp.native_sched_clock
      0.09 ±  5%      +0.0        0.13 ±  3%  perf-profile.self.cycles-pp.stress_mutex_exercise
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.__sched_yield
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.update_curr_common
      0.00            +0.1        0.05 ±  7%  perf-profile.self.cycles-pp.exit_to_user_mode_loop
      0.10 ±  8%      +0.1        0.15 ±  5%  perf-profile.self.cycles-pp.futex_wake
      0.10 ±  4%      +0.1        0.16 ±  7%  perf-profile.self.cycles-pp.select_task_rq_rt
      0.10 ±  3%      +0.1        0.16 ±  3%  perf-profile.self.cycles-pp.update_rq_clock_task
      0.02 ±141%      +0.1        0.07 ±  9%  perf-profile.self.cycles-pp.pthread_mutex_lock
      0.02 ±141%      +0.1        0.07 ±  6%  perf-profile.self.cycles-pp.switch_fpu_return
      0.00            +0.1        0.06 ±  6%  perf-profile.self.cycles-pp.__x2apic_send_IPI_dest
      0.00            +0.1        0.06 ±  6%  perf-profile.self.cycles-pp.native_apic_msr_eoi
      0.09 ±  4%      +0.1        0.15 ±  2%  perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.00            +0.1        0.06        perf-profile.self.cycles-pp.__radix_tree_lookup
      0.00            +0.1        0.06        perf-profile.self.cycles-pp.__wrgsbase_inactive
      0.09 ±  4%      +0.1        0.15 ±  3%  perf-profile.self.cycles-pp.do_syscall_64
      0.00            +0.1        0.06 ±  6%  perf-profile.self.cycles-pp.rseq_update_cpu_node_id
      0.00            +0.1        0.06 ±  6%  perf-profile.self.cycles-pp.select_task_rq
      0.19 ±  3%      +0.1        0.26        perf-profile.self.cycles-pp.find_lock_lowest_rq
      0.13 ±  5%      +0.1        0.20 ±  4%  perf-profile.self.cycles-pp.__sched_setscheduler
      0.11 ±  3%      +0.1        0.18 ±  4%  perf-profile.self.cycles-pp.switch_mm_irqs_off
      0.44            +0.1        0.51 ±  2%  perf-profile.self.cycles-pp._raw_spin_lock
      0.17 ±  6%      +0.1        0.26 ±  4%  perf-profile.self.cycles-pp.__futex_hash
      0.14 ±  3%      +0.1        0.24        perf-profile.self.cycles-pp._raw_spin_lock_irqsave
      0.14 ±  4%      +0.1        0.24 ±  4%  perf-profile.self.cycles-pp.__switch_to
      0.18 ±  3%      +0.1        0.28 ±  4%  perf-profile.self.cycles-pp.futex_hash
      0.16 ±  3%      +0.1        0.26 ±  2%  perf-profile.self.cycles-pp.os_xsave
      0.17 ±  2%      +0.1        0.28 ±  2%  perf-profile.self.cycles-pp.set_load_weight
      0.23 ±  3%      +0.1        0.36 ±  2%  perf-profile.self.cycles-pp.__schedule
      0.23            +0.2        0.38 ±  3%  perf-profile.self.cycles-pp.restore_fpregs_from_fpstate
      3.54 ±  2%      +0.2        3.78 ±  3%  perf-profile.self.cycles-pp.enqueue_pushable_task
      0.14 ±  3%      +0.4        0.52 ±  9%  perf-profile.self.cycles-pp.sched_balance_newidle
      1.45 ±  2%      +0.6        2.03 ±  4%  perf-profile.self.cycles-pp.dequeue_task_rt
      0.72 ±  5%      +2.1        2.80 ± 11%  perf-profile.self.cycles-pp.enqueue_task_rt
      1.45 ±  8%      +6.0        7.47 ±  6%  perf-profile.self.cycles-pp._find_first_and_bit




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
RE: [PATCH] sched/rt: optimize cpupri_vec layout
Posted by Deng, Pan 4 months ago
As an alternative, with a slightly more complicated change, we can separate
the counts and masks into two vectors inlined in struct cpupri (counts[] and
masks[]), and add two paddings:
1. Between counts[0] and counts[1], since counts[0] is updated more frequently
than the others: it changes whenever an rt task is enqueued on an empty runq
or dequeued from a non-overloaded runq.
2. Between the two vectors, since counts[] is read-write while masks[] is
read-only when it stores pointers.
The alternative approach costs 31+/21- LoC of changes and achieves the same
performance as the simple one. At the same time, the size of struct cpupri is
reduced from 26 cache lines to 21 cache lines.
The alternative patch is also prepared and can be sent out if there is any
interest.
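
For illustration only, the three layouts discussed in this thread can be
modelled in userspace C to check the cache line counts. This is a rough sketch,
not the alternative patch itself. It assumes 64-byte cache lines, 101 priority
levels, a 4-byte atomic_int and 8-byte pointers (void * stands in for a
pointer-typed cpumask_var_t). The names vec_orig, vec_aligned, cpupri_split,
count0 and cache_lines are hypothetical, and modelling counts[0] as a separate
field is only one way to express the first padding.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */
#define NR_PRIO    101  /* assumed number of cpupri priority levels */

/* Original layout: count and mask pointer share a cache line. */
struct vec_orig {
	atomic_int count;
	void *mask;     /* stands in for a pointer-typed cpumask_var_t */
};
struct cpupri_orig {
	struct vec_orig pri_to_cpu[NR_PRIO];
	int *cpu_to_pri;
};

/* Posted patch: mask is pushed onto its own cache line. */
struct vec_aligned {
	atomic_int count;
	alignas(CACHE_LINE) void *mask;
};
struct cpupri_aligned {
	struct vec_aligned pri_to_cpu[NR_PRIO];
	int *cpu_to_pri;
};

/* Hypothetical split layout: two vectors plus two paddings. */
struct cpupri_split {
	atomic_int count0;                                  /* counts[0], alone on its line */
	alignas(CACHE_LINE) atomic_int counts[NR_PRIO - 1]; /* counts[1..] */
	alignas(CACHE_LINE) void *masks[NR_PRIO];           /* pointers that are only read */
	int *cpu_to_pri;
};

/* Number of cache lines needed to hold `bytes` bytes. */
static inline size_t cache_lines(size_t bytes)
{
	return (bytes + CACHE_LINE - 1) / CACHE_LINE;
}

static_assert(offsetof(struct vec_aligned, mask) == CACHE_LINE,
	      "mask should start on the second cache line");
```

Under the stated assumptions, cache_lines(sizeof(struct cpupri_orig)),
cache_lines(sizeof(struct cpupri_aligned)) and
cache_lines(sizeof(struct cpupri_split)) should come out as 26, 203 and 21,
in line with the figures quoted in this thread.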

Best Regards
Pan

> -----Original Message-----
> From: Pan Deng <pan.deng@intel.com>
> Sent: Thursday, June 12, 2025 11:12 AM
> To: peterz@infradead.org; mingo@kernel.org
> Cc: linux-kernel@vger.kernel.org; Li, Tianyou <tianyou.li@intel.com>;
> tim.c.chen@linux.intel.com; Deng, Pan <pan.deng@intel.com>
> Subject: [PATCH] sched/rt: optimize cpupri_vec layout
> 
> When running a multi-instance ffmpeg transcoding workload which uses rt
> thread in a high core count system, cpupri_vec->count contends with the
> reading of mask in the same cache line in function cpupri_find_fitness and
> cpupri_set.
> This change separates each count and mask into different cache lines by cache
> aligned attribute to avoid the false sharing.
> Tested in a 2 sockets, 240 physical core 480 logical core machine, running
> 60 ffmpeg transcoding instances. With the change, the kernel cycles% is
> reduced from ~20% to ~12%, the fps metric is improved ~11%.
> The side effect of this change is that struct cpupri size is increased from 26
> cache lines to 203 cache lines.
> 
> Signed-off-by: Pan Deng <pan.deng@intel.com>
> Signed-off-by: Tianyou Li <tianyou.li@intel.com>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/cpupri.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
> index d6cba0020064..245b0fa626be 100644
> --- a/kernel/sched/cpupri.h
> +++ b/kernel/sched/cpupri.h
> @@ -9,7 +9,7 @@
> 
>  struct cpupri_vec {
>  	atomic_t		count;
> -	cpumask_var_t		mask;
> +	cpumask_var_t		mask	____cacheline_aligned;
>  };
> 
>  struct cpupri {
> --
> 2.43.5