RFC -> v1:
- drop RFC
- Only record the short sleeping time for each task, to better honor
  burst-sleeping tasks. (Mathieu Desnoyers)
- Keep the forward movement of the runqueue's cache-hot timeout value
  monotonic. (Mathieu Desnoyers, Aaron Lu)
- Introduce a new helper function cache_hot_cpu() that considers
  rq->cache_hot_timeout. (Aaron Lu)
- Add analysis of why inhibiting task migration could bring better
  throughput for some benchmarks. (Gautham R. Shenoy)
- Choose the first cache-hot CPU if all idle CPUs are cache-hot in
  select_idle_cpu(), to avoid possible task stacking on the waker's CPU.
  (K Prateek Nayak)
Thanks for your comments and review!
----------------------------------------------------------------------
When task p is woken up, the scheduler leverages select_idle_sibling()
to find an idle CPU for it. p's previous CPU is usually a preference
because it can improve cache locality. However in many cases, the
previous CPU has already been taken by other wakees, thus p has to
find another idle CPU.
Inhibiting task migration while keeping the scheduler work-conserving
could benefit many workloads. Inspired by Mathieu's proposal to limit
the task migration ratio[1], this patch set considers the task's average
sleep duration. If the task is a short sleeper, its previous CPU is
tagged as cache-hot for a short while. During this reservation period,
other wakees are not allowed to pick this idle CPU until a timeout.
If the task is woken up again within that window, it finds its previous
CPU still idle and chooses it in select_idle_sibling().
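A rough sketch of the mechanism, reusing the names mentioned above
(sleep_avg, rq->cache_hot_timeout, cache_hot_cpu()); this is illustrative
only, and the actual patches may differ in detail:

/* Dequeue path: a short sleeper reserves its previous CPU for a while. */
static void tag_cache_hot(struct rq *rq, struct task_struct *p)
{
        u64 now = sched_clock_cpu(cpu_of(rq));

        if (p->se.sleep_avg &&
            p->se.sleep_avg < sysctl_sched_migration_cost)
                /* Keep the timeout monotonic: only move it forward. */
                rq->cache_hot_timeout = max(rq->cache_hot_timeout,
                                            now + p->se.sleep_avg);
}

/* True if @cpu is idle but still reserved for a recently-departed task. */
static bool cache_hot_cpu(int cpu)
{
        return sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout;
}

The max() is what keeps the reservation window moving forward
monotonically, as noted in the changelog above.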
This test is based on tip/sched/core, on top of commit afc1996859a2
("sched/fair: Ratelimit update to tg->load_avg"). That commit already
reduced the cost of task migration significantly; SIS_CACHE reduces it
further, and shows a noticeable throughput improvement for netperf/tbench
at around 100% load.
[patch 1/2] records the task's average short sleeping time in
its per-sched_entity structure.
[patch 2/2] introduces SIS_CACHE to skip cache-hot idle CPUs
during wakeup.
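As a minimal sketch of the recording in [patch 1/2] (the last_dequeue
timestamp field and the halving average below are assumptions for
illustration, not necessarily what the patch implements):

/*
 * Wakeup path: fold the just-finished sleep into a per-entity average,
 * but only when the sleep was short, so burst sleepers keep a short
 * sleep_avg and long sleeps do not dilute it.
 */
static void update_sleep_avg(struct sched_entity *se, u64 now)
{
        u64 slept = now - se->last_dequeue;     /* hypothetical field */

        if (slept < sysctl_sched_migration_cost)
                se->sleep_avg = (se->sleep_avg + slept) / 2;
}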
Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1
Chen Yu (2):
sched/fair: Record the short sleeping time of a task
sched/fair: skip the cache hot CPU in select_idle_cpu()
include/linux/sched.h | 3 ++
kernel/sched/fair.c | 86 +++++++++++++++++++++++++++++++++++++++--
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 1 +
4 files changed, 87 insertions(+), 4 deletions(-)
--
2.25.1
Hi Chen Yu,
On 26/09/23 10:40, Chen Yu wrote:
> [..snip..]
Regarding the trade-off between a longer idle-CPU scan and the cache
benefits, I ran some benchmarks.
I tested the patch on a Power system with 12 cores, 96 CPUs in total.
The system has two NUMA nodes.
Below are some of the benchmark results.
schbench 99.0th latency (lower is better)
========
case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
normal 1-mthreads 1.00 [ 0.00]( 3.66) 1.00 [ 0.00]( 1.71)
normal 2-mthreads 1.00 [ 0.00]( 4.55) 1.02 [ -2.00]( 3.00)
normal 4-mthreads 1.00 [ 0.00]( 4.77) 0.96 [ +4.00]( 4.27)
normal 6-mthreads 1.00 [ 0.00]( 60.37) 2.66 [ -166.00]( 23.67)
The schbench results show that there is not much impact on wakeup latency
despite the additional iterations spent searching for an idle CPU in the
select_idle_cpu() path; interestingly, the numbers are slightly better
for SIS_CACHE in the 4-mthreads case. I think we can ignore the last case
due to huge run-to-run variation.
producer_consumer avg time/access (lower is better)
========
loads per consumer iteration baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
5 1.00 [ 0.00]( 0.00) 0.87 [ +13.0]( 1.92)
20 1.00 [ 0.00]( 0.00) 0.92 [ +8.00]( 0.00)
50 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
100 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
The patch's main goal of improving cache locality shows up here: SIS_CACHE
improves this workload, mainly when the number of loads per consumer
iteration is lower.
hackbench normalized time in seconds (lower is better)
========
case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
process-pipe 1-groups 1.00 [ 0.00]( 1.50) 1.02 [ -2.00]( 3.36)
process-pipe 2-groups 1.00 [ 0.00]( 4.76) 0.99 [ +1.00]( 5.68)
process-sockets 1-groups 1.00 [ 0.00]( 2.56) 1.00 [ 0.00]( 0.86)
process-sockets 2-groups 1.00 [ 0.00]( 0.50) 0.99 [ +1.00]( 0.96)
threads-pipe 1-groups 1.00 [ 0.00]( 3.87) 0.71 [ +29.0]( 3.56)
threads-pipe 2-groups 1.00 [ 0.00]( 1.60) 0.97 [ +3.00]( 3.44)
threads-sockets 1-groups 1.00 [ 0.00]( 7.65) 0.99 [ +1.00]( 1.05)
threads-sockets 2-groups 1.00 [ 0.00]( 3.12) 1.03 [ -3.00]( 1.70)
The hackbench results are similar on both kernels, except for the
threads-pipe case with 1 group, which shows a 29% improvement.
Daytrader throughput (higher is better)
========
As per Ingo's suggestion, I ran a real-life workload, daytrader.
baseline:
===================================================================================
Instance 1
Throughputs Ave. Resp. Time Min. Resp. Time Max. Resp. Time
================ =============== =============== ===============
10124.5 2 0 3970
SIS_CACHE:
===================================================================================
Instance 1
Throughputs Ave. Resp. Time Min. Resp. Time Max. Resp. Time
================ =============== =============== ===============
10319.5 2 0 5771
In the above run, daytrader performance was 2% better with SIS_CACHE.
Thanks and Regards
Madadi Vineeth Reddy
Hi Madadi,

On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
> Regarding the trade-off between a longer idle-CPU scan and the cache
> benefits, I ran some benchmarks.

Thanks very much for your interest and your time on the patch.

> [..snip..]
>
> The schbench results show that there is not much impact on wakeup latency
> despite the additional iterations spent searching for an idle CPU in the
> select_idle_cpu() path; interestingly, the numbers are slightly better
> for SIS_CACHE in the 4-mthreads case.

The 4% improvement is within std%, so I suppose we did not see much
difference in the 4-mthreads case.

> I think we can ignore the last case due to huge run-to-run variation.

Although the run-to-run variation is large, it seems that the decrease is
within that range. Prateek has also reported that when the system is
overloaded there could be some regression from schbench:
https://lore.kernel.org/lkml/27651e14-f441-c1e2-9b5b-b958d6aadc79@amd.com/
Could you also post the raw data printed by schbench? And maybe using the
latest schbench could get the latency in detail.

> [..snip..]
>
> In the above run, daytrader performance was 2% better with SIS_CACHE.

Thanks for bringing this good news, a real-life workload benefits from this
change. I'll tune this patch a little bit to address the regression from
schbench. Also to mention that, I'm working with Mathieu on his proposal to
make it easier for a wakee to choose its previous CPU (similar to
SIS_CACHE, but a little simpler), and we'll check how to make more
platforms benefit from this change.
https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@efficios.com/

thanks,
Chenyu
Hi Chen Yu,
On 17/10/23 16:39, Chen Yu wrote:
> Hi Madadi,
>
> On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
>> Hi Chen Yu,
>>
>> On 26/09/23 10:40, Chen Yu wrote:
>>> [..snip..]
>>
>> Regarding the trade-off between a longer idle-CPU scan and the cache
>> benefits, I ran some benchmarks.
>>
>
> Thanks very much for your interest and your time on the patch.
>
>> I tested the patch on a Power system with 12 cores, 96 CPUs in total.
>> The system has two NUMA nodes.
>>
>> Below are some of the benchmark results
>>
>> schbench 99.0th latency (lower is better)
>> ========
>> case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
>> normal 1-mthreads 1.00 [ 0.00]( 3.66) 1.00 [ 0.00]( 1.71)
>> normal 2-mthreads 1.00 [ 0.00]( 4.55) 1.02 [ -2.00]( 3.00)
>> normal 4-mthreads 1.00 [ 0.00]( 4.77) 0.96 [ +4.00]( 4.27)
>> normal 6-mthreads 1.00 [ 0.00]( 60.37) 2.66 [ -166.00]( 23.67)
>>
>>
>> The schbench results show that there is not much impact on wakeup latency
>> despite the additional iterations spent searching for an idle CPU in the
>> select_idle_cpu() path; interestingly, the numbers are slightly better
>> for SIS_CACHE in the 4-mthreads case.
>
> The 4% improvement is within std%, so I suppose we did not see much difference in the 4-mthreads case.
>
>> I think we can ignore the last case due to huge run-to-run variation.
>
> Although the run-to-run variation is large, it seems that the decrease is within that range.
> Prateek has also reported that when the system is overloaded there could be some regression
> from schbench:
> https://lore.kernel.org/lkml/27651e14-f441-c1e2-9b5b-b958d6aadc79@amd.com/
> Could you also post the raw data printed by schbench? And maybe using the latest schbench could get the
> latency in detail.
>
raw data by schbench(old) with 6-mthreads
======================
Baseline (5 runs)
========
Latency percentiles (usec)
50.0000th: 22
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 981
99.5000th: 4424
99.9000th: 9200
min=0, max=29497
Latency percentiles (usec)
50.0000th: 23
75.0000th: 29
90.0000th: 35
95.0000th: 38
*99.0000th: 495
99.5000th: 3924
99.9000th: 9872
min=0, max=29997
Latency percentiles (usec)
50.0000th: 23
75.0000th: 30
90.0000th: 36
95.0000th: 39
*99.0000th: 1326
99.5000th: 4744
99.9000th: 10000
min=0, max=23394
Latency percentiles (usec)
50.0000th: 23
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 55
99.5000th: 3292
99.9000th: 9104
min=0, max=25196
Latency percentiles (usec)
50.0000th: 23
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 711
99.5000th: 4600
99.9000th: 9424
min=0, max=19997
SIS_CACHE (5 runs)
=========
Latency percentiles (usec)
50.0000th: 23
75.0000th: 30
90.0000th: 35
95.0000th: 38
*99.0000th: 1894
99.5000th: 5464
99.9000th: 10000
min=0, max=19157
Latency percentiles (usec)
50.0000th: 22
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 2396
99.5000th: 6664
99.9000th: 10000
min=0, max=24029
Latency percentiles (usec)
50.0000th: 22
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 2132
99.5000th: 6296
99.9000th: 10000
min=0, max=25313
Latency percentiles (usec)
50.0000th: 22
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 1090
99.5000th: 6232
99.9000th: 9744
min=0, max=27264
Latency percentiles (usec)
50.0000th: 22
75.0000th: 29
90.0000th: 34
95.0000th: 38
*99.0000th: 1786
99.5000th: 5240
99.9000th: 9968
min=0, max=24754
The above data, as indicated, has large run-to-run variation, and in
general the latency is higher with SIS_CACHE at the 99th %ile.
schbench(new) with 6-mthreads
=============
Baseline
========
Wakeup Latencies percentiles (usec) runtime 30 (s) (209403 total samples)
50.0th: 8 (43672 samples)
90.0th: 13 (83908 samples)
* 99.0th: 20 (18323 samples)
99.9th: 775 (1785 samples)
min=1, max=8400
Request Latencies percentiles (usec) runtime 30 (s) (209543 total samples)
50.0th: 13648 (59873 samples)
90.0th: 14000 (82767 samples)
* 99.0th: 14320 (16342 samples)
99.9th: 18720 (1670 samples)
min=5130, max=38334
RPS percentiles (requests) runtime 30 (s) (31 total samples)
20.0th: 6968 (8 samples)
* 50.0th: 6984 (23 samples)
90.0th: 6984 (0 samples)
min=6835, max=6991
average rps: 6984.77
SIS_CACHE
=========
Wakeup Latencies percentiles (usec) runtime 30 (s) (209295 total samples)
50.0th: 9 (49267 samples)
90.0th: 14 (86522 samples)
* 99.0th: 21 (14091 samples)
99.9th: 1146 (1722 samples)
min=1, max=10427
Request Latencies percentiles (usec) runtime 30 (s) (209432 total samples)
50.0th: 13616 (62838 samples)
90.0th: 14000 (85301 samples)
* 99.0th: 14352 (16149 samples)
99.9th: 21408 (1660 samples)
min=5070, max=41866
RPS percentiles (requests) runtime 30 (s) (31 total samples)
20.0th: 6968 (7 samples)
* 50.0th: 6984 (21 samples)
90.0th: 6984 (0 samples)
min=6672, max=6996
average rps: 6981.07
With the new schbench, I didn't observe run-to-run variation, and there
was also no regression with SIS_CACHE at the 99th %ile.
>> [..snip..]
>> In the above run, daytrader performance was 2% better with SIS_CACHE.
>>
>
> Thanks for bringing this good news, a real life workload benefits from this change.
> I'll tune this patch a little bit to address the regression from schbench. Also to mention
> that, I'm working with Mathieu on his proposal to make the wakee choosing its previous
> CPU easier(similar to SIS_CACHE, but a little simpler), and we'll check how to make more
> platform benefit from this change.
> https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@efficios.com/
Oh..ok. Thanks for the pointer!
>
> thanks,
> Chenyu
>
Thanks and Regards
Madadi Vineeth Reddy
On 2023-10-19 at 01:02:16 +0530, Madadi Vineeth Reddy wrote:
> [..snip..]
>
> With the new schbench, I didn't observe run-to-run variation, and there
> was also no regression with SIS_CACHE at the 99th %ile.

Thanks for the test Madadi, in my opinion we can stick with the new
schbench in the future. I'll have a double check on my test machine.

thanks,
Chenyu
Hello Chenyu,
On 9/26/2023 10:40 AM, Chen Yu wrote:
> [..snip..]
Sorry for the delay! I'll leave the test results from a 3rd Generation
EPYC system below.
tl;dr
- Small regression in tbench and netperf, possibly due to more searching
  for an idle CPU.
- Small regression in schbench (old) at 256 workers, albeit with large
  run-to-run variance.
- Other benchmarks are more or less the same.
I'll leave the full results below.
o System details
- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- Boost enabled, C2 Disabled (POLL and MWAIT based C1 remained enabled)
o Kernel Details
- tip: tip:sched/core at commit 5fe7765997b1 (sched/deadline: Make
dl_rq->pushable_dl_tasks update drive dl_rq->overloaded)
- SIS_CACHE: tip + this series
o Benchmark results
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
1-groups 1.00 [ -0.00]( 2.36) 1.01 [ -1.47]( 3.02)
2-groups 1.00 [ -0.00]( 2.35) 0.99 [ 0.92]( 1.01)
4-groups 1.00 [ -0.00]( 1.79) 0.98 [ 2.34]( 0.63)
8-groups 1.00 [ -0.00]( 0.84) 0.98 [ 1.73]( 1.02)
16-groups 1.00 [ -0.00]( 2.39) 0.97 [ 2.76]( 2.33)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
1 1.00 [ 0.00]( 0.86) 0.97 [ -2.68]( 0.74)
2 1.00 [ 0.00]( 0.99) 0.98 [ -2.18]( 0.17)
4 1.00 [ 0.00]( 0.49) 0.98 [ -2.47]( 1.15)
8 1.00 [ 0.00]( 0.96) 0.96 [ -3.81]( 0.24)
16 1.00 [ 0.00]( 1.38) 0.96 [ -4.33]( 1.31)
32 1.00 [ 0.00]( 1.64) 0.95 [ -4.70]( 1.59)
64 1.00 [ 0.00]( 0.92) 0.97 [ -2.97]( 0.49)
128 1.00 [ 0.00]( 0.57) 0.99 [ -1.15]( 0.57)
256 1.00 [ 0.00]( 0.38) 1.00 [ 0.03]( 0.79)
512 1.00 [ 0.00]( 0.04) 1.00 [ 0.43]( 0.34)
1024 1.00 [ 0.00]( 0.20) 1.00 [ 0.41]( 0.13)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
Copy 1.00 [ 0.00]( 2.52) 0.93 [ -6.90]( 6.75)
Scale 1.00 [ 0.00]( 6.38) 0.99 [ -1.18]( 7.45)
Add 1.00 [ 0.00]( 6.54) 0.97 [ -2.55]( 7.34)
Triad 1.00 [ 0.00]( 5.18) 0.95 [ -4.64]( 6.81)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
Copy 1.00 [ 0.00]( 0.74) 1.00 [ -0.20]( 1.69)
Scale 1.00 [ 0.00]( 6.25) 1.03 [ 3.46]( 0.55)
Add 1.00 [ 0.00]( 6.53) 1.05 [ 4.58]( 0.43)
Triad 1.00 [ 0.00]( 5.14) 0.98 [ -1.78]( 6.24)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.27) 0.98 [ -1.50]( 0.14)
2-clients 1.00 [ 0.00]( 1.32) 0.98 [ -2.35]( 0.54)
4-clients 1.00 [ 0.00]( 0.40) 0.98 [ -2.35]( 0.56)
8-clients 1.00 [ 0.00]( 0.97) 0.97 [ -2.72]( 0.50)
16-clients 1.00 [ 0.00]( 0.54) 0.96 [ -3.92]( 0.86)
32-clients 1.00 [ 0.00]( 1.38) 0.97 [ -3.10]( 0.44)
64-clients 1.00 [ 0.00]( 1.78) 0.97 [ -3.44]( 1.70)
128-clients 1.00 [ 0.00]( 1.09) 0.94 [ -5.75]( 2.67)
256-clients 1.00 [ 0.00]( 4.45) 0.97 [ -2.61]( 4.93)
512-clients 1.00 [ 0.00](54.70) 0.98 [ -1.64](55.09)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
1 1.00 [ -0.00]( 3.95) 0.97 [ 2.56](10.42)
2 1.00 [ -0.00]( 5.89) 0.83 [ 16.67](22.56)
4 1.00 [ -0.00](14.28) 1.00 [ -0.00](14.75)
8 1.00 [ -0.00]( 4.90) 0.84 [ 15.69]( 6.01)
16 1.00 [ -0.00]( 4.15) 1.00 [ -0.00]( 4.41)
32 1.00 [ -0.00]( 5.10) 1.01 [ -1.10]( 3.44)
64 1.00 [ -0.00]( 2.69) 1.04 [ -3.72]( 2.57)
128 1.00 [ -0.00]( 2.63) 0.94 [ 6.29]( 2.55)
256 1.00 [ -0.00](26.75) 1.51 [-50.57](11.40)
512 1.00 [ -0.00]( 2.93) 0.96 [ 3.52]( 3.56)
==================================================================
Test : ycsb-cassandra
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Metric tip SIS_CACHE(pct imp)
Throughput 1.00 1.00 (%diff: 0.27%)
==================================================================
Test : ycsb-mongodb
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Metric tip SIS_CACHE(pct imp)
Throughput 1.00 1.00 (%diff: -0.45%)
==================================================================
Test : DeathStarBench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Pinning scaling tip SIS_CACHE(pct imp)
1CCD 1 1.00 1.00 (%diff: -0.47%)
2CCD 2 1.00 0.98 (%diff: -2.34%)
4CCD 4 1.00 1.00 (%diff: -0.29%)
8CCD 8 1.00 1.01 (%diff: 0.54%)
>
> ----------------------------------------------------------------------
>
> [..snip..]
>
--
Thanks and Regards,
Prateek
Hi Prateek,

On 2023-10-05 at 11:52:13 +0530, K Prateek Nayak wrote:
> [..snip..]
>
> Test : schbench
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
> 1 1.00 [ -0.00]( 3.95) 0.97 [ 2.56](10.42)
> 2 1.00 [ -0.00]( 5.89) 0.83 [ 16.67](22.56)
> 4 1.00 [ -0.00](14.28) 1.00 [ -0.00](14.75)
> 8 1.00 [ -0.00]( 4.90) 0.84 [ 15.69]( 6.01)
> 16 1.00 [ -0.00]( 4.15) 1.00 [ -0.00]( 4.41)
> 32 1.00 [ -0.00]( 5.10) 1.01 [ -1.10]( 3.44)
> 64 1.00 [ -0.00]( 2.69) 1.04 [ -3.72]( 2.57)
> 128 1.00 [ -0.00]( 2.63) 0.94 [ 6.29]( 2.55)
> 256 1.00 [ -0.00](26.75) 1.51 [-50.57](11.40)

Thanks for the testing. So the latency regression from schbench is quite
obvious, and as you mentioned, it is possibly due to longer scan time
during select_idle_cpu(). I'll run the same test with split LLC to see if
I can reproduce the issue or not. I'm also working with Mathieu on another
direction, choosing the previous CPU over the current CPU when the system
is overloaded; that should be more moderate, and I'll post the test result
later.

thanks,
Chenyu
* Chen Yu <yu.c.chen@intel.com> wrote:

> [..snip..]
>
> If the task is a short sleeper, its previous CPU is
> tagged as cache-hot for a short while. During this reservation period,
> other wakees are not allowed to pick this idle CPU until a timeout.

Yeah, so I'm not convinced about this at this stage.

By allowing a task to basically hog a CPU after it has gone idle already,
however briefly, we reduce resource utilization efficiency for the sake
of singular benchmark workloads.

In a mixed environment the cost of leaving CPUs idle longer than necessary
will show up - and none of these benchmarks show that kind of side effect
and indirect overhead.

This feature would be a lot more convincing if it tried to measure overhead
in the pathological case, not the case it's been written for.

Thanks,

	Ingo
Hi Ingo,

On 2023-09-27 at 10:00:11 +0200, Ingo Molnar wrote:
> Yeah, so I'm not convinced about this at this stage.
>
> By allowing a task to basically hog a CPU after it has gone idle already,
> however briefly, we reduce resource utilization efficiency for the sake
> of singular benchmark workloads.

Currently in the code we do not really reserve the idle CPU or force it to
stay idle. We just give the other wakees a search-order suggestion for
finding an idle CPU. If all idle CPUs are in the reserved state, the first
reserved idle CPU will be picked up rather than left idle, so the idle CPU
resource is still fully utilized. The main impact is on wakeup latency, if
I understand correctly. Let me run the latest schbench and monitor these
latency statistics in detail.

> In a mixed environment the cost of leaving CPUs idle longer than necessary
> will show up - and none of these benchmarks show that kind of side effect
> and indirect overhead.
>
> This feature would be a lot more convincing if it tried to measure overhead
> in the pathological case, not the case it's been written for.

Thanks for the suggestion, Ingo. Yes, we should launch more tests to
evaluate this proposal. As Tim mentioned, we have previously tested it
using an OLTP benchmark as described in PATCH [2/2]. I'm thinking of
running more benchmarks to get a wider understanding of how this change
would impact them, both the positive and the negative parts.

thanks,
Chenyu
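To make the fallback concrete, a simplified sketch of the scan order
described above (ignoring the scan-depth limiting of the real
select_idle_cpu(), and reusing the hypothetical cache_hot_cpu() helper
named in the cover letter):

static int select_idle_cpu_sketch(struct task_struct *p,
                                  struct sched_domain *sd, int target)
{
        int cpu, first_hot = -1;

        for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
                if (!cpumask_test_cpu(cpu, p->cpus_ptr))
                        continue;
                if (!available_idle_cpu(cpu))
                        continue;
                if (cache_hot_cpu(cpu)) {
                        /* Remember the first reserved idle CPU. */
                        if (first_hot == -1)
                                first_hot = cpu;
                        continue;
                }
                /* A non-reserved idle CPU is the first choice. */
                return cpu;
        }

        /*
         * Work conservation: if every idle CPU is reserved, take the
         * first reserved one instead of stacking on the waker's CPU or
         * leaving it idle.
         */
        return first_hot;
}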
On Wed, 2023-09-27 at 10:00 +0200, Ingo Molnar wrote:
> [..snip..]
>
> This feature would be a lot more convincing if it tried to measure overhead
> in the pathological case, not the case it's been written for.

Ingo,

Mathieu's patches on detecting overly high task migrations and then rate
limiting migration are a way to detect that tasks are playing crazy CPU
musical chairs and are in a pathological state. Will the migration rate be
a reasonable indicator that we need to do something to reduce pathological
migrations, like the SIS_CACHE proposal, so the tasks don't get jerked all
over? Or do you have some other, better indicators in mind?

We did some experiments with the OLTP workload on a 112 core 2 socket SPR
machine. The OLTP workload has a mixture of threads handling database
updates on disks and handling transaction queries over the network. For
Mathieu's original task-migration rate-limit patches we saw a 1.2%
improvement, and for Chen Yu's SIS_CACHE proposal we saw a 0.7%
improvement. The system runs at ~94% busy, so it is under high
utilization. The variation of this workload is less than 0.2%. There are
improvements for such a mixed workload, though not as much as for the
microbenchmarks. These data are preliminary and we are still doing more
experiments.

For the OLTP experiments, each socket with 64 cores is divided into
sub-NUMA clusters of 4 nodes of 16 cores each, so the scheduling overhead
of the idle-CPU search is much less if SNC is off.

Thanks.

Tim