[v2] Introduce SIS_CACHE to choose previous CPU during task wakeup

[PATCH v2 0/3] Introduce SIS_CACHE to choose previous CPU during task wakeup

Posted by Chen Yu 2 years, 2 months ago

v1 -> v2:
- Move the task sleep duration from sched_entity to task_struct. (Aaron Lu)
- Refine the task sleep duration calculation based on the task's previous
  running CPU. (Aaron Lu)
- Limit the cache-hot idle CPU scan depth to reduce the time spent on
  searching, to fix the regression. (K Prateek Nayak)
- Add test results of real-life workloads, per request from Ingo:
    Daytrader on a power system. (Madadi Vineeth Reddy)
    OLTP workload on Xeon Sapphire Rapids.
- Refine the commit log, add Reviewed-by tag to PATCH 1/3
  (Mathieu Desnoyers).

RFC -> v1:
- Drop RFC.
- Only record the short sleeping time for each task, to better honor the
  burst sleeping tasks. (Mathieu Desnoyers)
- Keep the forward movement monotonic for the runqueue's cache-hot timeout
  value. (Mathieu Desnoyers, Aaron Lu)
- Introduce a new helper function cache_hot_cpu() that considers
  rq->cache_hot_timeout. (Aaron Lu)
- Add analysis of why inhibiting task migration could bring better throughput
  for some benchmarks. (Gautham R. Shenoy)
- Choose the first cache-hot CPU if all idle CPUs are cache-hot in
  select_idle_cpu(), to avoid possible task stacking on the waker's CPU.
  (K Prateek Nayak)

Thanks for the comments and tests!

----------------------------------------------------------------------

This series aims to continue the discussion of how to make it easier
for the wakee to choose its previous CPU.

When task p is woken up, the scheduler leverages select_idle_sibling()
to find an idle CPU for it. p's previous CPU is usually preferred
because it can improve cache locality. However, in many cases the
previous CPU has already been taken by other wakees, so p has to
find another idle CPU.

Inhibiting task migration could benefit many workloads. Inspired by
Mathieu's proposal to limit the task migration ratio[1], introduce
SIS_CACHE. It considers the sleep time of the task for better
task placement. Based on the task's short sleeping history, tag p's
previous CPU as cache-hot. Later, when p is woken up, it can choose
its previous CPU in select_idle_sibling(). When another task is
woken up, skip this cache-hot idle CPU and try the next idle CPU
when possible. The idea of SIS_CACHE is to optimize the idle CPU
scan sequence. The extra scan time is minimized by restricting the
scan depth of cache-hot CPUs to 50% of the scan depth of SIS_UTIL.
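
As a rough illustration of the scan order described above, here is a
user-space sketch in C. It is not the code from the patches. The names
cache_hot_cpu() and cache_hot_timeout come from the changelog; struct
cpu_state, mark_cache_hot(), select_idle_cpu_sketch() and NR_SKETCH_CPUS
are hypothetical, and the way the 50% limit is applied here is only one
possible reading of it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_SKETCH_CPUS 8	/* hypothetical system size for this sketch */

/* Hypothetical stand-in for the per-runqueue state. */
struct cpu_state {
	bool idle;
	uint64_t cache_hot_timeout;	/* CPU counts as cache-hot before this time */
};

/*
 * Tag a CPU as cache-hot for the task's short sleep duration. The timeout
 * only moves forward, so a later, shorter sleep cannot pull it back.
 */
static void mark_cache_hot(struct cpu_state *c, uint64_t now, uint64_t sleep_ns)
{
	uint64_t timeout = now + sleep_ns;

	if (timeout > c->cache_hot_timeout)
		c->cache_hot_timeout = timeout;
}

static bool cache_hot_cpu(const struct cpu_state *c, uint64_t now)
{
	return now < c->cache_hot_timeout;
}

/*
 * Scan up to nr CPUs starting at start. Return the first idle CPU that is
 * not cache-hot. Give up after seeing nr / 2 cache-hot idle CPUs (assumed
 * interpretation of the 50% limit) and fall back to the first cache-hot
 * idle CPU seen. Return -1 if no idle CPU was found.
 */
static int select_idle_cpu_sketch(const struct cpu_state *cpus, int start,
				  int nr, uint64_t now)
{
	int hot_budget = nr / 2;
	int first_hot = -1;

	assert(nr >= 0 && nr <= NR_SKETCH_CPUS);

	for (int i = 0; i < nr; i++) {
		int cpu = (start + i) % NR_SKETCH_CPUS;

		if (!cpus[cpu].idle)
			continue;
		if (!cache_hot_cpu(&cpus[cpu], now))
			return cpu;
		if (first_hot < 0)
			first_hot = cpu;
		if (--hot_budget <= 0)
			break;
	}

	return first_hot;
}
```

In this sketch, a wakee other than p passes over p's previous CPU while
its timeout has not expired, and still gets that CPU as a fallback when
every idle candidate within the budget is cache-hot.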

The tests are based on tip/sched/core, on top of
commit ada87d23b734
("x86: Fix CPUIDLE_FLAG_IRQ_ENABLE leaking timer reprogram")

This patch set has shown 15% ~ 70% improvements for client/server
workloads like netperf and tbench. It shows a 0.7% improvement in
OLTP with 0.2% run-to-run variation on a Xeon system with 240 CPUs.
There is a 2% improvement in another real-life workload, Daytrader,
per Madadi's test on a power system with 96 CPUs. Prateek
has helped check that there is no obvious microbenchmark regression
with v2 on a 3rd Generation EPYC system with 128 CPUs.

Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1

Chen Yu (3):
  sched/fair: Record the task sleeping time as the cache hot duration
  sched/fair: Calculate the cache-hot time of the idle CPU
  sched/fair: skip the cache hot CPU in select_idle_cpu()

 include/linux/sched.h   |  4 ++
 kernel/sched/fair.c     | 88 +++++++++++++++++++++++++++++++++++++++--
 kernel/sched/features.h |  1 +
 kernel/sched/sched.h    |  1 +
 4 files changed, 91 insertions(+), 3 deletions(-)

--
2.25.1

Re: [PATCH v2 0/3] Introduce SIS_CACHE to choose previous CPU during task wakeup

Posted by Madadi Vineeth Reddy 1 year, 11 months ago

Hi Chen Yu,

On 21/11/23 13:09, Chen Yu wrote:

Any update or progress regarding this patch?

I was working on a patch that improves scheduler performance on Power10 by changing
the order in which domains are accessed for CPU selection during wakeup. It turns out
that this patch is helpful in that regard, and my patch gives better performance on top
of this patch.

So, I am looking forward to knowing the progress/status of this patch.

Thanks and Regards
Madadi Vineeth Reddy

Re: [PATCH v2 0/3] Introduce SIS_CACHE to choose previous CPU during task wakeup

Posted by Chen Yu 1 year, 11 months ago

Hi Madadi,

On 2024-02-18 at 14:57:17 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
> 
> On 21/11/23 13:09, Chen Yu wrote:
> 
> Any update or progress regarding this patch?
> 
> I was working on a patch that improves scheduler performance on Power10 by changing
> the order in which domains are accessed for CPU selection during wakeup. It turns out
> that this patch is helpful in that regard, and my patch gives better performance on top
> of this patch.
> 
> So, I am looking forward to knowing the progress/status of this patch.
>

Thank you for your continued interest in this proposal. Glad to hear
that your patch gets better performance when applied on top of this one.
Let me rebase the patch on the latest kernel, do some code cleanup, and
send out the new version.

thanks,
Chenyu

Re: [PATCH v2 0/3] Introduce SIS_CACHE to choose previous CPU during task wakeup

Posted by Madadi Vineeth Reddy 2 years, 2 months ago

Hi Chen Yu,

On 21/11/23 13:09, Chen Yu wrote:

Tested the patch on a power system with 46 cores, for a total of 368 CPUs.
The system has 8 NUMA nodes.

Below are some of the benchmark results.

schbench(new) 99.0th latency (lower is better)
========
case            load            baseline[pct imp](std%)        SIS_CACHE[pct imp](std%)
normal          1-mthreads      1.00 [ 0.00]( 4.34)            1.02 [ -2.00]( 5.98)
normal          2-mthreads      1.00 [ 0.00](13.95)            1.08 [ -8.00](10.39)
normal          4-mthreads      1.00 [ 0.00]( 6.20)            0.94 [ +6.00](10.90)
normal          6-mthreads      1.00 [ 0.00](12.76)            1.03 [ -3.00]( 9.33)

It seems that schbench is not much impacted by this patch (the pct imp of schbench is within the std%).
I expected some regression in wakeup latency from searching for an idle CPU that is not cache-hot,
but I guess limiting the search depth has helped.
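
For readers of these tables, the schbench rows above are consistent with
the arithmetic below. This is inferred from the numbers, not taken from
the test scripts, and the helper names pct_imp() and within_noise() are
hypothetical. Treating a change as noise when it does not exceed the
larger of the two std% values is one simple reading of "within the std%".

```c
#include <stdbool.h>

/*
 * Percentage improvement of a result normalized to baseline == 1.00.
 * For "lower is better" metrics a value below 1.00 is an improvement;
 * for "higher is better" metrics a value above 1.00 is.
 */
static double pct_imp(double normalized, bool lower_is_better)
{
	double delta = lower_is_better ? 1.0 - normalized : normalized - 1.0;

	return delta * 100.0;
}

/* True when the change is no larger than the larger of the two std% values. */
static bool within_noise(double imp, double std_base, double std_new)
{
	double magnitude = imp < 0 ? -imp : imp;
	double noise = std_base > std_new ? std_base : std_new;

	return magnitude <= noise;
}
```

For example, the 2-mthreads row (normalized 1.08, lower is better) gives
roughly -8%, which is smaller in magnitude than the baseline std% of 13.95.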


producer_consumer avg time/access (lower is better)
========
loads per consumer iteration   baseline[pct imp](std%)        SIS_CACHE[pct imp](std%)
5                              1.00 [ 0.00]( 0.00)            0.93 [ +7.00]( 4.77)
10                             1.00 [ 0.00]( 0.00)            1.00 [  0.00]( 0.00)
20                             1.00 [ 0.00]( 0.00)            1.00 [  0.00]( 0.00)

The patch's main goal of improving cache locality is reflected here: SIS_CACHE improves this workload
only when the number of loads per consumer iteration is lower.


hackbench normalized time in seconds (lower is better)
========
case            load        baseline[pct imp](std%)         SIS_CACHE[pct imp]( std%)
process-sockets 1-groups     1.00 [ 0.00]( 4.78)            0.99 [ +1.00]( 6.45)
process-sockets 2-groups     1.00 [ 0.00]( 0.97)            1.02 [ -2.00]( 1.87)
process-sockets 4-groups     1.00 [ 0.00]( 3.63)            1.01 [ -1.00]( 2.96)
process-sockets 8-groups     1.00 [ 0.00]( 0.43)            1.00 [  0.00]( 0.27)
process-pipe    1-groups     1.00 [ 0.00](23.77)            0.88 [+12.00](22.77)
process-pipe    2-groups     1.00 [ 0.00]( 3.44)            1.03 [ -3.00]( 4.00)
process-pipe    4-groups     1.00 [ 0.00]( 2.41)            0.98 [ +2.00]( 3.88)
process-pipe    8-groups     1.00 [ 0.00]( 7.09)            1.07 [ -7.00]( 4.25)
threads-pipe    1-groups     1.00 [ 0.00](18.47)            1.11 [-11.00](24.21)
threads-pipe    2-groups     1.00 [ 0.00]( 6.45)            0.97 [ +3.00]( 5.58)
threads-pipe    4-groups     1.00 [ 0.00]( 5.63)            0.96 [ +2.00]( 5.90)
threads-pipe    8-groups     1.00 [ 0.00]( 1.65)            1.03 [ -3.00]( 3.97)
threads-sockets 1-groups     1.00 [ 0.00]( 2.00)            1.00 [  0.00]( 0.65)
threads-sockets 2-groups     1.00 [ 0.00]( 1.69)            1.02 [ -2.00]( 1.48)
threads-sockets 4-groups     1.00 [ 0.00]( 5.66)            1.01 [ -1.00]( 3.56)
threads-sockets 8-groups     1.00 [ 0.00]( 0.26)            0.99 [ +1.00]( 0.36)

hackbench is not impacted.


Daytrader throughput (higher is better)
========
instances,users                baseline[pct imp](std%)        SIS_CACHE[pct imp](std%)
3,30                           1.00 [ 0.00]( 2.30)            1.02 [ +2.00]( 1.64)
3,60                           1.00 [ 0.00]( 0.55)            1.01 [ +1.00]( 1.41)
3,90                           1.00 [ 0.00]( 1.20)            1.02 [ +2.00]( 1.04)
3,120                          1.00 [ 0.00]( 0.84)            1.02 [ +2.00]( 1.02)

A real-life workload like Daytrader benefits slightly from this patch.


Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>

Thanks and Regards
Madadi Vineeth Reddy

Re: [PATCH v2 0/3] Introduce SIS_CACHE to choose previous CPU during task wakeup

Posted by Chen Yu 2 years, 2 months ago

On 2023-11-26 at 14:14:20 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
> 
> On 21/11/23 13:09, Chen Yu wrote:
> 
> It seems that schbench is not much impacted by this patch (the pct imp of schbench is within the std%).
> I expected some regression in wakeup latency from searching for an idle CPU that is not cache-hot,
> but I guess limiting the search depth has helped.
>

I think so. Cutting the cache-hot cpu scan depth to 50% seems to also cure the regression
reported by Prateek.
 
> 
> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
>

Thanks!

Best,
Chenyu 