The main purpose of this change is to avoid too many cross CPU wake up when it is unnecessary. The frequent cross CPU wake up brings significant damage to some workloads, especially on high core count systems.

This patch set inhibits the cross CPU wake-up by placing the wakee on waking CPU or previous CPU, if both the waker and wakee are short-duration tasks.

The first patch is to introduce the definition of a short-duration task. The second patch leverages the first patch to choose a local or previous CPU for wakee.

Changes since v3:
1. Honglei and Josh have concern that the threshold of short task duration could be too long. Decreased the threshold from sysctl_sched_min_granularity to (sysctl_sched_min_granularity / 8), and the '8' comes from get_update_sysctl_factor().
2. Export p->se.dur_avg to /proc/{pid}/sched per Yicong's suggestion.
3. Move the calculation of average duration from put_prev_task_fair() to dequeue_task_fair(). Because there is an issue in v3 that, put_prev_task_fair() will not be invoked by pick_next_task_fair() in fast path, thus the dur_avg could not be updated timely.
4. Fix the comment in PATCH 2/2, that "WRITE_ONCE(CPU1->ttwu_pending, 1);" on CPU0 is earlier than CPU1 getting "ttwu_list->p0", per Tianchen.
5. Move the scan for CPU with short duration task from select_idle_cpu() to select_idle_siblings(), because there is no CPU scan involved, per Yicong.

Changes since v2:

1. Peter suggested comparing the duration of waker and the cost to scan for an idle CPU: If the cost is higher than the task duration, do not waste time finding an idle CPU, choose the local or previous CPU directly. A prototype was created based on this suggestion. However, according to the test result, this prototype does not inhibit the cross CPU wakeup and did not bring improvement. Because the cost to find an idle CPU is small in the problematic scenario. The root cause of the problem is a race condition between scanning for an idle CPU and task enqueue (please refer to the commit log in PATCH 2/2). So v3 does not change the core logic of v2, with some refinement based on Peter's suggestion.

2. Simplify the logic to record the task duration per Peter and Abel's suggestion.

This change brings overall improvement on some microbenchmarks, both on Intel and AMD platforms.

v3: https://lore.kernel.org/lkml/cover.1669862147.git.yu.c.chen@intel.com/
v2: https://lore.kernel.org/all/cover.1666531576.git.yu.c.chen@intel.com/
v1: https://lore.kernel.org/lkml/20220915165407.1776363-1-yu.c.chen@intel.com/

Chen Yu (2):
  sched/fair: Introduce short duration task check
  sched/fair: Choose the CPU where short task is running during wake up

 include/linux/sched.h   |  3 +++
 kernel/sched/core.c     |  2 ++
 kernel/sched/debug.c    |  1 +
 kernel/sched/fair.c     | 32 ++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  1 +
 5 files changed, 39 insertions(+)

--
2.25.1
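To make the above concrete, here is a minimal sketch of the two checks. The threshold test mirrors the v4 code quoted later in this thread; the wake-up side is only a simplified illustration of "place the wakee on the waking CPU if both waker and wakee are short-duration tasks", not the exact hunk from PATCH 2/2:

	/*
	 * PATCH 1/2: a task is considered "short" when its average run
	 * duration is below sysctl_sched_min_granularity / 8.
	 */
	static inline int is_short_task(struct task_struct *p)
	{
		return sched_feat(SIS_SHORT) && p->se.dur_avg &&
			((p->se.dur_avg * 8) <= sysctl_sched_min_granularity);
	}

	/*
	 * PATCH 2/2, simplified idea: during wakeup, if both the waker
	 * (current) and the wakee (p) are short-duration tasks, skip the
	 * idle CPU scan and place the wakee on the waking CPU.
	 */
	if (sched_feat(SIS_SHORT) && is_short_task(current) && is_short_task(p))
		return smp_processor_id();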
Hi Chen Yu,

On 2022/12/16 14:08, Chen Yu wrote:
> The main purpose of this change is to avoid too many cross CPU wake up when it is unnecessary. The frequent cross CPU wake up brings significant damage to some workloads, especially on high core count systems.
>
> This patch set inhibits the cross CPU wake-up by placing the wakee on waking CPU or previous CPU, if both the waker and wakee are short-duration tasks.
>
> The first patch is to introduce the definition of a short-duration task. The second patch leverages the first patch to choose a local or previous CPU for wakee.
>

Tested with tbench and netperf locally and MySQL remotely. Unluckily cannot reproduce the improvement and see some regression. Detailed information below. The tests are done based on 6.2-rc1 on a 2x64 CPUs Kunpeng 920 server with 4 nodes and no SMT, CONFIG_SCHED_CLUSTER enabled.

For tbench and netperf tested on single node and two nodes on one socket.

tbench(node 0)
              vanilla         sis_short
  1:         326.2950          312.3700  ( -4.27%)
  4:        1299.7700         1241.3400  ( -4.50%)
  8:        2613.2900         2476.4700  ( -5.24%)
 16:        5091.9400         5096.3100  (  0.09%)
 32:       10104.1000         9905.8500  ( -1.96%)
 64:        7558.8500         7826.2000  (  3.54%)
128:        6641.4000         6608.6600  ( -0.49%)

tbench(node 0-1)
              vanilla         sis_short
  1:         328.0880          313.5830  ( -4.42%)
  4:        1318.4700         1237.2100  ( -6.16%)
  8:        2546.1700         2372.5900  ( -6.82%)
 16:        5147.1700         4651.3400  ( -9.63%)
 32:        9275.1100         8772.0100  ( -5.42%)
 64:       17526.6000        17918.8000  (  2.23%)
128:       13108.1000        13317.0000  (  1.59%)

netperf TCP_RR(node 0)
              vanilla         sis_short
  1:       76444.3100        72681.4200  ( -4.92%)
  4:       77615.7475        73436.4575  ( -5.38%)
  8:       78066.5887        69473.9312  (-11.01%)
 16:       77134.8562        76473.4144  ( -0.86%)
 32:       76626.3322        63785.1650  (-16.76%)
 64:       29072.7872        24552.9916  (-15.55%)
128:       11875.6830        10532.4508  (-11.31%)

netperf TCP_RR(node 0-1)
              vanilla         sis_short
  1:       77645.3700        72879.3300  ( -6.14%)
  4:       78180.0150        72926.2225  ( -6.72%)
  8:       77878.7675        66724.6137  (-14.32%)
 16:       77427.6944        65682.6012  (-15.17%)
 32:       71774.5309        71887.1747  (  0.16%)
 64:       75023.3972        62730.7395  (-16.39%)
128:       28468.2309        25048.1859  (-12.01%)

netperf UDP_RR(node 0)
              vanilla         sis_short
  1:       91614.6400        89437.6600  ( -2.38%)
  4:       91926.3425        90511.9750  ( -1.54%)
  8:       92596.4850        87336.7162  ( -5.68%)
 16:       91803.2525        95212.5069  (  3.71%)
 32:       98482.9212        79793.1803  (-18.98%)
 64:       36373.5367        30577.2208  (-15.94%)
128:       14440.2914        12727.7759  (-11.86%)

netperf UDP_RR(node 0-1)
              vanilla         sis_short
  1:       92609.3400        90915.8600  ( -1.83%)
  4:       93034.8425        91742.4825  ( -1.39%)
  8:       91963.3950        85801.8250  ( -6.70%)
 16:       91589.4900        83339.7944  ( -9.01%)
 32:       83161.0234        90298.3934  (  8.58%)
 64:       88860.0155        74933.8544  (-15.67%)
128:       24715.9842        28022.6998  ( 13.38%)

MySQL tests are done among two servers with same hardware setup and kernel updated on both kernel. Network device locates on node 0 and network interrupts is balanced evenly by irqbalance. No NUMA binding applied.
oltp_read_only
                     vanilla      sis_short
8-TPS                4315.11        4236.45     -1.82%
8-QPS               69041.74       67783.19     -1.82%
8-avg-lat               1.85           1.89     -1.98%
8-95th-lat              2.14           2.24     -4.35%
16-TPS               8315.80        7993.78     -3.87%
16-QPS             133052.87      127900.49     -3.87%
16-avg-lat              1.92           2.00     -3.99%
16-95th-lat             2.36           2.60    -10.17%
24-TPS              11601.29       11387.88     -1.84%
24-QPS             185620.63      182206.08     -1.84%
24-avg-lat              2.07           2.11     -1.94%
24-95th-lat             2.61           2.74     -4.97%
32-TPS              13882.59       13277.71     -4.36%
32-QPS             222121.46      212443.37     -4.36%
32-avg-lat              2.31           2.41     -4.48%
32-95th-lat             2.93           3.38    -15.47%
64-TPS              17558.38       16200.13     -7.74%
64-QPS             280934.13      259202.00     -7.74%
64-avg-lat              3.64           3.95     -8.42%
64-95th-lat             5.12           5.40     -5.53%
96-TPS              21858.44       21104.80     -3.45%
96-QPS             349735.03      337676.78     -3.45%
96-avg-lat              4.39           4.55     -3.57%
96-95th-lat             5.92           6.09     -2.93%
128-TPS             24633.09       24278.48     -1.44%
128-QPS            394129.44      388455.76     -1.44%
128-avg-lat             5.19           5.27     -1.54%
128-95th-lat            6.51           6.79     -4.30%
256-TPS             40793.59       41878.29      2.66%
256-QPS            652697.45      670052.64      2.66%
256-avg-lat             6.27           6.11      2.55%
256-95th-lat            9.67           9.79     -1.21%
512-TPS             50461.71       50593.95      0.26%
512-QPS            807387.36      809503.29      0.26%
512-avg-lat            10.13          10.10      0.30%
512-95th-lat           14.64          14.73     -0.61%
700-TPS             50848.04       50899.43      0.10%
700-QPS            813568.68      814390.94      0.10%
700-avg-lat            13.74          13.73      0.10%
700-95th-lat           18.95          19.06     -0.60%

Further looked into 64 threads since it regress most and mainly looked on the server(mysqld) side. In this case most loads will be on node 0 and partly on node 1. With SIS_SHORT the utilization on node 0 keeps almost the same but utilization on node 1 remains nearly half comparing to baseline. That may partly answers why the performance decreases.

mpstat 60s, non-idle%
               vanilla      sis_short
node 0           94.25          91.67     -2.73%
node 1           46.41          30.03    -35.29%

Previously there's argument that sync wakeup from interrupts maybe incorrectly and SIS_SHORT seems to consolidate that and tie more tasks on the interrupt node.

SIS_SHORT will likely to make short tasks wakeup locally but seems to increase the wait time in this case. schedstat for CPUs on node 0 shows:

                                      vanilla            sis_short
time elapsed                       60.201085s           60.200395s
[cpu Average, counts/sec]
#0 cpu zeros                              0.0                  0.0
#1 sched_yied()                          0.03                 0.12
#2 zeros                                  0.0                  0.0
#3 schedule()                       409415.36            340304.86
#4 sched->idle()                    125616.92             91078.52
#5 try_to_wake_up()                 317404.25            280896.88
#6 try_to_wakeup(local)             115451.87            140567.68
#7 run time                    26071494674.87       26164834556.84
#8 wait time                    3861592891.75        7361228637.54  (+ 90.63%)
#9 timeslices,                       283747.1            249182.34

Since it's a 128 CPU machine the min_granularity is 3000000ns which means a short task running within 375000ns. Most mysql worker threads' se.dur_avg is around 110000ns and will be regarded as short tasks.

To exclude the influence of Patch 1 (which we'll always record the dur_avg and related informations), tested 64 threads case with SIS_SHORT enabled or not through debugfs control and seems Patch 2 mainly dominate the results.

               NO_SIS_SHORT      SIS_SHORT
TPS                17446.54       16464.83     -5.63%
QPS               279144.61      263437.21     -5.63%
avg lat                3.67           3.89     -5.99%
95th lat               5.09           5.37     -5.50%

Please let me know if you need more information.

Thanks.

> Changes since v3:
> 1. Honglei and Josh have concern that the threshold of short task duration could be too long. Decreased the threshold from sysctl_sched_min_granularity to (sysctl_sched_min_granularity / 8), and the '8' comes from get_update_sysctl_factor().
> 2. Export p->se.dur_avg to /proc/{pid}/sched per Yicong's suggestion.
> 3. Move the calculation of average duration from put_prev_task_fair() to dequeue_task_fair(). Because there is an issue in v3 that, put_prev_task_fair() will not be invoked by pick_next_task_fair() in fast path, thus the dur_avg could not be updated timely.
> 4. Fix the comment in PATCH 2/2, that "WRITE_ONCE(CPU1->ttwu_pending, 1);" on CPU0 is earlier than CPU1 getting "ttwu_list->p0", per Tianchen.
> 5. Move the scan for CPU with short duration task from select_idle_cpu() to select_idle_siblings(), because there is no CPU scan involved, per Yicong.
>
> Changes since v2:
>
> 1. Peter suggested comparing the duration of waker and the cost to scan for an idle CPU: If the cost is higher than the task duration, do not waste time finding an idle CPU, choose the local or previous CPU directly. A prototype was created based on this suggestion. However, according to the test result, this prototype does not inhibit the cross CPU wakeup and did not bring improvement. Because the cost to find an idle CPU is small in the problematic scenario. The root cause of the problem is a race condition between scanning for an idle CPU and task enqueue (please refer to the commit log in PATCH 2/2). So v3 does not change the core logic of v2, with some refinement based on Peter's suggestion.
>
> 2. Simplify the logic to record the task duration per Peter and Abel's suggestion.
>
> This change brings overall improvement on some microbenchmarks, both on Intel and AMD platforms.
>
> v3: https://lore.kernel.org/lkml/cover.1669862147.git.yu.c.chen@intel.com/
> v2: https://lore.kernel.org/all/cover.1666531576.git.yu.c.chen@intel.com/
> v1: https://lore.kernel.org/lkml/20220915165407.1776363-1-yu.c.chen@intel.com/
>
> Chen Yu (2):
>   sched/fair: Introduce short duration task check
>   sched/fair: Choose the CPU where short task is running during wake up
>
>  include/linux/sched.h   |  3 +++
>  kernel/sched/core.c     |  2 ++
>  kernel/sched/debug.c    |  1 +
>  kernel/sched/fair.c     | 32 ++++++++++++++++++++++++++++++++
>  kernel/sched/features.h |  1 +
>  5 files changed, 39 insertions(+)
>
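To put the threshold above in numbers (a back-of-the-envelope check using only the values reported in this mail):

	threshold = sysctl_sched_min_granularity / 8
	          = 3000000 ns / 8
	          = 375000 ns

	mysqld worker se.dur_avg ~= 110000 ns < 375000 ns, so the workers are
	classified as short-duration tasks and the SIS_SHORT placement applies
	to their wakeups.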
Hi Yicong, On 2023-01-17 at 10:53:25 +0800, Yicong Yang wrote: > Hi Chen Yu, > > On 2022/12/16 14:08, Chen Yu wrote: > > The main purpose of this change is to avoid too many cross CPU > > wake up when it is unnecessary. The frequent cross CPU wake up > > brings significant damage to some workloads, especially on high > > core count systems. > > > > This patch set inhibits the cross CPU wake-up by placing the wakee > > on waking CPU or previous CPU, if both the waker and wakee are > > short-duration tasks. > > > > The first patch is to introduce the definition of a short-duration > > task. The second patch leverages the first patch to choose a local > > or previous CPU for wakee. > > > > Tested with tbench and netperf locally and MySQL remotely. Unluckily > cannot reproduce the improvement and see some regression. Thank you for the test. > Detailed information below. The tests are done based on 6.2-rc1 on a 2x64 CPUs > Kunpeng 920 server with 4 nodes and no SMT, CONFIG_SCHED_CLUSTER > enabled. > > For tbench and netperf tested on single node and two nodes on one socket. > Does this platform have 2 sockets, each socket has 2 Numa nodes, each node is attached to 32 CPUs, and the benchmarks are bound to the first sockets? And may I know how many CPUs there are in one LLC? Is it also 32? > Further looked into 64 threads since it regress most and mainly looked on the > server(mysqld) side. In this case most loads will be on node 0 and partly > on node 1. With SIS_SHORT the utilization on node 0 keeps almost the same > but utilization on node 1 remains nearly half comparing to baseline. That > may partly answers why the performance decreases. > OK, I see. So it seems that the Numa sched domain load balance does not spread tasks fast enough while SIS_SHORT is waking task on the same node frequently. But I thought select_idle_sibling() would wake up the tasks on the same node(LLC) anyway, why SIS_SHORT inhibits the load balance more? > mpstat 60s, non-idle% > vanilla sis_short > node 0 94.25 91.67 -2.73% > node 1 46.41 30.03 -35.29% > > Previously there's argument that sync wakeup from interrupts maybe incorrectly I see. Do you mean, wakeup from interrupt might cause tasks stacking on the same CPU, because the interrupt is delivered to the same CPU? I thought userspace irq load balance would spread the interrupt? Or do you mean others? > and SIS_SHORT seems to consolidate that and tie more tasks on the interrupt node. > > SIS_SHORT will likely to make short tasks wakeup locally but seems to increase the > wait time in this case. schedstat for CPUs on node 0 shows: > Agree. > vanilla sis_short > time elapsed 60.201085s 60.200395s > [cpu Average, counts/sec] > #0 cpu zeros 0.0 0.0 > #1 sched_yied() 0.03 0.12 > #2 zeros 0.0 0.0 > #3 schedule() 409415.36 340304.86 > #4 sched->idle() 125616.92 91078.52 > #5 try_to_wake_up() 317404.25 280896.88 > #6 try_to_wakeup(local) 115451.87 140567.68 > #7 run time 26071494674.87 26164834556.84 > #8 wait time 3861592891.75 7361228637.54 (+ 90.63%) > #9 timeslices, 283747.1 249182.34 > > Since it's a 128 CPU machine the min_granularity is 3000000ns which means a short > task running within 375000ns. Most mysql worker threads' se.dur_avg is around > 110000ns and will be regarded as short tasks. > > To exclude the influence of Patch 1 (which we'll always record the dur_avg and related > informations), tested 64 threads case with SIS_SHORT enabled or not through debugfs > control and seems Patch 2 mainly dominate the results. 
> > NO_SIS_SHORT SIS_SHORT > TPS 17446.54 16464.83 -5.63% > QPS 279144.61 263437.21 -5.63% > avg lat 3.67 3.89 -5.99% > 95th lat 5.09 5.37 -5.50% > > Please let me know if you need more information. > Very useful information, I'm thinking of taking nr_llc into consideration to adjust the threshold of short task, so on system with smaller llc we do not inhibit the idle CPU scan too much. thanks, Chenyu > Thanks. > > > Changes since v3: > > 1. Honglei and Josh have concern that the threshold of short > > task duration could be too long. Decreased the threshold from > > sysctl_sched_min_granularity to (sysctl_sched_min_granularity / 8), > > and the '8' comes from get_update_sysctl_factor(). > > 2. Export p->se.dur_avg to /proc/{pid}/sched per Yicong's suggestion. > > 3. Move the calculation of average duration from put_prev_task_fair() > > to dequeue_task_fair(). Because there is an issue in v3 that, > > put_prev_task_fair() will not be invoked by pick_next_task_fair() > > in fast path, thus the dur_avg could not be updated timely. > > 4. Fix the comment in PATCH 2/2, that "WRITE_ONCE(CPU1->ttwu_pending, 1);" > > on CPU0 is earlier than CPU1 getting "ttwu_list->p0", per Tianchen. > > 5. Move the scan for CPU with short duration task from select_idle_cpu() > > to select_idle_siblings(), because there is no CPU scan involved, per > > Yicong. > > > > Changes since v2: > > > > 1. Peter suggested comparing the duration of waker and the cost to > > scan for an idle CPU: If the cost is higher than the task duration, > > do not waste time finding an idle CPU, choose the local or previous > > CPU directly. A prototype was created based on this suggestion. > > However, according to the test result, this prototype does not inhibit > > the cross CPU wakeup and did not bring improvement. Because the cost > > to find an idle CPU is small in the problematic scenario. The root > > cause of the problem is a race condition between scanning for an idle > > CPU and task enqueue(please refer to the commit log in PATCH 2/2). > > So v3 does not change the core logic of v2, with some refinement based > > on Peter's suggestion. > > > > 2. Simplify the logic to record the task duration per Peter and Abel's suggestion. > > > > This change brings overall improvement on some microbenchmarks, both on > > Intel and AMD platforms. > > > > v3: https://lore.kernel.org/lkml/cover.1669862147.git.yu.c.chen@intel.com/ > > v2: https://lore.kernel.org/all/cover.1666531576.git.yu.c.chen@intel.com/ > > v1: https://lore.kernel.org/lkml/20220915165407.1776363-1-yu.c.chen@intel.com/ > > > > Chen Yu (2): > > sched/fair: Introduce short duration task check > > sched/fair: Choose the CPU where short task is running during wake up > > > > include/linux/sched.h | 3 +++ > > kernel/sched/core.c | 2 ++ > > kernel/sched/debug.c | 1 + > > kernel/sched/fair.c | 32 ++++++++++++++++++++++++++++++++ > > kernel/sched/features.h | 1 + > > 5 files changed, 39 insertions(+) > >
On 2023/1/20 0:11, Chen Yu wrote:
> Hi Yicong,
> On 2023-01-17 at 10:53:25 +0800, Yicong Yang wrote:
>> Hi Chen Yu,
>>
>> On 2022/12/16 14:08, Chen Yu wrote:
>>> The main purpose of this change is to avoid too many cross CPU wake up when it is unnecessary. The frequent cross CPU wake up brings significant damage to some workloads, especially on high core count systems.
>>>
>>> This patch set inhibits the cross CPU wake-up by placing the wakee on waking CPU or previous CPU, if both the waker and wakee are short-duration tasks.
>>>
>>> The first patch is to introduce the definition of a short-duration task. The second patch leverages the first patch to choose a local or previous CPU for wakee.
>>>
>>
>> Tested with tbench and netperf locally and MySQL remotely. Unluckily cannot reproduce the improvement and see some regression.
> Thank you for the test.
>> Detailed information below. The tests are done based on 6.2-rc1 on a 2x64 CPUs Kunpeng 920 server with 4 nodes and no SMT, CONFIG_SCHED_CLUSTER enabled.
>>
>> For tbench and netperf tested on single node and two nodes on one socket.
>>
> Does this platform have 2 sockets, each socket has 2 Numa nodes, each node is attached to 32 CPUs, and the benchmarks are bound to the first sockets? And may I know how many CPUs there are in one LLC? Is it also 32?

you're right, exactly. There're 32 CPUs sharing one LLC.

>> Further looked into 64 threads since it regress most and mainly looked on the server(mysqld) side. In this case most loads will be on node 0 and partly on node 1. With SIS_SHORT the utilization on node 0 keeps almost the same but utilization on node 1 remains nearly half comparing to baseline. That may partly answers why the performance decreases.
>>
> OK, I see. So it seems that the Numa sched domain load balance does not spread tasks fast enough while SIS_SHORT is waking task on the same node frequently. But I thought select_idle_sibling() would wake up the tasks on the same node(LLC) anyway, why SIS_SHORT inhibits the load balance more?

Wakeup will happen much more frequently than load balance in this case and I don't think load balance can handle it in time. In the fast path of wakeup we care little about the load. SIS_SHORT may aggravate it since we're likely to put 2 short tasks on one single CPU.

>> mpstat 60s, non-idle%
>>              vanilla     sis_short
>> node 0         94.25         91.67    -2.73%
>> node 1         46.41         30.03   -35.29%
>>
>> Previously there's argument that sync wakeup from interrupts maybe incorrectly
> I see. Do you mean, wakeup from interrupt might cause tasks stacking on the same CPU, because the interrupt is delivered to the same CPU? I thought userspace irq load balance would spread the interrupt? Or do you mean others?

Nop. I'm using HiSilicon HNS3 network card which has 32 queues and 33 interrupts spread evenly on 32 CPUs, not on one single CPU. I considering a case that the p1 on cpu0 and p2 on cpu32, interrupt happens on cpu0 and tries to wakeup p1, since p1 and p2 are both short tasks and then p2 will likely to wake and wait on cpu0. p1 is not the actual waker and has not finished yet and this will increase the wait time and also load on cpu0. Previously we may wake on the LLC of cpu0 and try to find an idle CPU, with SIS_SHORT we'll just wake on cpu0.

Does it make sense to allow the SIS_SHORT wakeup only from the task context? I'll have a try on this.
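A minimal sketch of that idea, assuming the short-task shortcut sits in select_idle_sibling() as in v4, with in_task() being the existing helper that is false in hard/soft interrupt context. This only illustrates the proposal above, it is not a posted patch:

	/*
	 * Take the SIS_SHORT shortcut only when the wakeup is issued from
	 * task context; wakeups from (soft)irq context, e.g. the network RX
	 * path, keep doing the normal idle CPU scan.
	 */
	if (sched_feat(SIS_SHORT) && in_task() &&
	    is_short_task(p) && is_short_task(cpu_curr(target)))
		return target;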
Some previous patch arguing interrupt wakeup from Barry and Libo:
[1] https://lkml.org/lkml/2021/11/5/234
[2] https://lore.kernel.org/lkml/20220711224704.1672831-1-libo.chen@oracle.com/

>> and SIS_SHORT seems to consolidate that and tie more tasks on the interrupt node.
>>
>> SIS_SHORT will likely to make short tasks wakeup locally but seems to increase the wait time in this case. schedstat for CPUs on node 0 shows:
>>
> Agree.
>>                                       vanilla            sis_short
>> time elapsed                       60.201085s           60.200395s
>> [cpu Average, counts/sec]
>> #0 cpu zeros                              0.0                  0.0
>> #1 sched_yied()                          0.03                 0.12
>> #2 zeros                                  0.0                  0.0
>> #3 schedule()                       409415.36            340304.86
>> #4 sched->idle()                    125616.92             91078.52
>> #5 try_to_wake_up()                 317404.25            280896.88
>> #6 try_to_wakeup(local)             115451.87            140567.68
>> #7 run time                    26071494674.87       26164834556.84
>> #8 wait time                    3861592891.75        7361228637.54  (+ 90.63%)
>> #9 timeslices,                       283747.1            249182.34
>>
>> Since it's a 128 CPU machine the min_granularity is 3000000ns which means a short task running within 375000ns. Most mysql worker threads' se.dur_avg is around 110000ns and will be regarded as short tasks.
>>
>> To exclude the influence of Patch 1 (which we'll always record the dur_avg and related informations), tested 64 threads case with SIS_SHORT enabled or not through debugfs control and seems Patch 2 mainly dominate the results.
>>
>>               NO_SIS_SHORT      SIS_SHORT
>> TPS               17446.54       16464.83     -5.63%
>> QPS              279144.61      263437.21     -5.63%
>> avg lat               3.67           3.89     -5.99%
>> 95th lat              5.09           5.37     -5.50%
>>
>> Please let me know if you need more information.
>>
> Very useful information, I'm thinking of taking nr_llc into consideration to adjust the threshold of short task, so on system with smaller llc we do not inhibit the idle CPU scan too much.
>

That sounds reasonable to me to link the SIS_SHORT threshold to CPU scale like what min_granule does.

Thanks,
Yicong

> thanks,
> Chenyu
>> Thanks.
>>
>>> Changes since v3:
>>> 1. Honglei and Josh have concern that the threshold of short task duration could be too long. Decreased the threshold from sysctl_sched_min_granularity to (sysctl_sched_min_granularity / 8), and the '8' comes from get_update_sysctl_factor().
>>> 2. Export p->se.dur_avg to /proc/{pid}/sched per Yicong's suggestion.
>>> 3. Move the calculation of average duration from put_prev_task_fair() to dequeue_task_fair(). Because there is an issue in v3 that, put_prev_task_fair() will not be invoked by pick_next_task_fair() in fast path, thus the dur_avg could not be updated timely.
>>> 4. Fix the comment in PATCH 2/2, that "WRITE_ONCE(CPU1->ttwu_pending, 1);" on CPU0 is earlier than CPU1 getting "ttwu_list->p0", per Tianchen.
>>> 5. Move the scan for CPU with short duration task from select_idle_cpu() to select_idle_siblings(), because there is no CPU scan involved, per Yicong.
>>>
>>> Changes since v2:
>>>
>>> 1. Peter suggested comparing the duration of waker and the cost to scan for an idle CPU: If the cost is higher than the task duration, do not waste time finding an idle CPU, choose the local or previous CPU directly. A prototype was created based on this suggestion. However, according to the test result, this prototype does not inhibit the cross CPU wakeup and did not bring improvement. Because the cost to find an idle CPU is small in the problematic scenario. The root cause of the problem is a race condition between scanning for an idle CPU and task enqueue (please refer to the commit log in PATCH 2/2). So v3 does not change the core logic of v2, with some refinement based on Peter's suggestion.
>>>
>>> 2. Simplify the logic to record the task duration per Peter and Abel's suggestion.
>>>
>>> This change brings overall improvement on some microbenchmarks, both on Intel and AMD platforms.
>>>
>>> v3: https://lore.kernel.org/lkml/cover.1669862147.git.yu.c.chen@intel.com/
>>> v2: https://lore.kernel.org/all/cover.1666531576.git.yu.c.chen@intel.com/
>>> v1: https://lore.kernel.org/lkml/20220915165407.1776363-1-yu.c.chen@intel.com/
>>>
>>> Chen Yu (2):
>>>   sched/fair: Introduce short duration task check
>>>   sched/fair: Choose the CPU where short task is running during wake up
>>>
>>>  include/linux/sched.h   |  3 +++
>>>  kernel/sched/core.c     |  2 ++
>>>  kernel/sched/debug.c    |  1 +
>>>  kernel/sched/fair.c     | 32 ++++++++++++++++++++++++++++++++
>>>  kernel/sched/features.h |  1 +
>>>  5 files changed, 39 insertions(+)
>>>
>
> .
>
On 2023-01-20 at 17:09:28 +0800, Yicong Yang wrote:
> On 2023/1/20 0:11, Chen Yu wrote:
> > Hi Yicong,
> > On 2023-01-17 at 10:53:25 +0800, Yicong Yang wrote:
> >> Hi Chen Yu,
> >>
> >> On 2022/12/16 14:08, Chen Yu wrote:
> >>> The main purpose of this change is to avoid too many cross CPU wake up when it is unnecessary. The frequent cross CPU wake up brings significant damage to some workloads, especially on high core count systems.
> >>>
> >>> This patch set inhibits the cross CPU wake-up by placing the wakee on waking CPU or previous CPU, if both the waker and wakee are short-duration tasks.
> >>>
> >>> The first patch is to introduce the definition of a short-duration task. The second patch leverages the first patch to choose a local or previous CPU for wakee.
> >>>
> >>
> >> Tested with tbench and netperf locally and MySQL remotely. Unluckily cannot reproduce the improvement and see some regression.
> > Thank you for the test.
> >> Detailed information below. The tests are done based on 6.2-rc1 on a 2x64 CPUs Kunpeng 920 server with 4 nodes and no SMT, CONFIG_SCHED_CLUSTER enabled.
> >>
> >> For tbench and netperf tested on single node and two nodes on one socket.
> >>
> > Does this platform have 2 sockets, each socket has 2 Numa nodes, each node is attached to 32 CPUs, and the benchmarks are bound to the first sockets? And may I know how many CPUs there are in one LLC? Is it also 32?
>
> you're right, exactly. There're 32 CPUs sharing one LLC.
>
> >> Further looked into 64 threads since it regress most and mainly looked on the server(mysqld) side. In this case most loads will be on node 0 and partly on node 1. With SIS_SHORT the utilization on node 0 keeps almost the same but utilization on node 1 remains nearly half comparing to baseline. That may partly answers why the performance decreases.
> >>
> > OK, I see. So it seems that the Numa sched domain load balance does not spread tasks fast enough while SIS_SHORT is waking task on the same node frequently. But I thought select_idle_sibling() would wake up the tasks on the same node(LLC) anyway, why SIS_SHORT inhibits the load balance more?
>
> Wakeup will happen much more frequently than load balance in this case and I don't think load balance can handle it in time. In the fast path of wakeup we care little about the load. SIS_SHORT may aggravate it since we're likely to put 2 short tasks on one single CPU.
>
I see. Suppose cpu0 ~ cpu31 are on node0, cpu32~cpu64 are on node1. With SIS_SHORT enabled, wakees could be stacked on cpu0. With SIS_SHORT disabled, wakees could be woken up on cpu0~cpu31. I understand that SIS_SHORT inhibits searching for idle CPUs in node0, but why does SIS_SHORT inhibit tasks migrating to node1?

> >> mpstat 60s, non-idle%
> >>              vanilla     sis_short
> >> node 0         94.25         91.67    -2.73%
> >> node 1         46.41         30.03   -35.29%
> >>
> >> Previously there's argument that sync wakeup from interrupts maybe incorrectly
> > I see. Do you mean, wakeup from interrupt might cause tasks stacking on the same CPU, because the interrupt is delivered to the same CPU? I thought userspace irq load balance would spread the interrupt? Or do you mean others?
>
> Nop. I'm using HiSilicon HNS3 network card which has 32 queues and 33 interrupts spread evenly on 32 CPUs, not on one single CPU. I considering a case that the p1 on cpu0 and p2 on cpu32, interrupt happens on cpu0 and tries to wakeup p1, since p1 and p2 are both short tasks and then p2 will likely to wake and wait on cpu0. p1 is not the actual waker and has not finished yet and this will increase the wait time and also load on cpu0. Previously we may wake on the LLC of cpu0 and try to find an idle CPU, with SIS_SHORT we'll just wake on cpu0.
>
> Does it make sense to allow the SIS_SHORT wakeup only from the task context? I'll have a try on this.
>
This sounds reasonable to me. I think we can further enhance it. And before you send out the proposal, could you help test the following experimental change, to see if it makes tbench, netperf and MySQL better?

Hi Prateek, could you also help check if the below change would make the Spec regression go away?

thanks,
Chenyu

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fc8a636eca5e..6e1ab5e2686c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4974,7 +4974,7 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
 static inline int is_short_task(struct task_struct *p)
 {
 	return sched_feat(SIS_SHORT) && p->se.dur_avg &&
-		((p->se.dur_avg * 8) <= sysctl_sched_min_granularity);
+		((p->se.dur_avg * 1024) <= (__this_cpu_read(sd_llc_size) * sysctl_sched_min_granularity));
 }
 
 /*
--
2.25.1
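For scale, a back-of-the-envelope calculation of what this experimental change does on the Kunpeng 920 machine discussed above, using only numbers already reported in this thread (sd_llc_size = 32, sysctl_sched_min_granularity = 3000000 ns, mysqld worker se.dur_avg ~= 110000 ns):

	old threshold: 3000000 / 8          = 375000 ns
	new threshold: 32 * 3000000 / 1024  =  93750 ns

	110000 ns > 93750 ns, so on a 32-CPU LLC the mysqld workers would no
	longer be classified as short tasks and the idle CPU scan would not be
	skipped for them, while larger LLCs keep a proportionally larger
	threshold.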
Hello Chenyu,

Thank you for looking into this.

On 1/22/2023 9:07 PM, Chen Yu wrote:
> On 2023-01-20 at 17:09:28 +0800, Yicong Yang wrote:
> [..snip..]
>
> This sounds reasonable to me. I think we can further enhance it. And before you send out the proposal, could you help test the following experimental change, to see if it makes tbench, netperf and MySQL better?
>
> Hi Prateek, could you also help check if the below change would make the Spec regression go away?

Sure. I'll add this on top of v4 and queue some runs. Will share the results soon.

> thanks,
> Chenyu
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fc8a636eca5e..6e1ab5e2686c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4974,7 +4974,7 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
>  static inline int is_short_task(struct task_struct *p)
>  {
>  	return sched_feat(SIS_SHORT) && p->se.dur_avg &&
> -		((p->se.dur_avg * 8) <= sysctl_sched_min_granularity);
> +		((p->se.dur_avg * 1024) <= (__this_cpu_read(sd_llc_size) * sysctl_sched_min_granularity));
>  }
>
>  /*

--
Thanks and Regards,
Prateek
Hello Chenyu, Including the detailed results from testing below. tl;dr o There seems to be 3 noticeable regressions: - tbench for lower number of clients. The schedstat data shows an increase in wait time. - SpecJBB MultiJVM performance drops as the workload prefers an idle CPU over a busy one. - Unixbench-pipe benchmark performance drops. o Most benchmark numbers remain same. o Small gains seen for ycsb-mongodb and unixbench-syscall. On 12/16/2022 11:38 AM, Chen Yu wrote: > The main purpose of this change is to avoid too many cross CPU > wake up when it is unnecessary. The frequent cross CPU wake up > brings significant damage to some workloads, especially on high > core count systems. > > This patch set inhibits the cross CPU wake-up by placing the wakee > on waking CPU or previous CPU, if both the waker and wakee are > short-duration tasks. > > The first patch is to introduce the definition of a short-duration > task. The second patch leverages the first patch to choose a local > or previous CPU for wakee. > > Changes since v3: > 1. Honglei and Josh have concern that the threshold of short > task duration could be too long. Decreased the threshold from > sysctl_sched_min_granularity to (sysctl_sched_min_granularity / 8), > and the '8' comes from get_update_sysctl_factor(). > 2. Export p->se.dur_avg to /proc/{pid}/sched per Yicong's suggestion. > 3. Move the calculation of average duration from put_prev_task_fair() > to dequeue_task_fair(). Because there is an issue in v3 that, > put_prev_task_fair() will not be invoked by pick_next_task_fair() > in fast path, thus the dur_avg could not be updated timely. > 4. Fix the comment in PATCH 2/2, that "WRITE_ONCE(CPU1->ttwu_pending, 1);" > on CPU0 is earlier than CPU1 getting "ttwu_list->p0", per Tianchen. > 5. Move the scan for CPU with short duration task from select_idle_cpu() > to select_idle_siblings(), because there is no CPU scan involved, per> Yicong. Following are the results from running standard benchmarks on a dual socket Zen3 (2 x 64C/128T) machine configured in different NPS modes. NPS Modes are used to logically divide single socket into multiple NUMA region. Following is the NUMA configuration for each NPS mode on the system: NPS1: Each socket is a NUMA node. Total 2 NUMA nodes in the dual socket machine. Node 0: 0-63, 128-191 Node 1: 64-127, 192-255 NPS2: Each socket is further logically divided into 2 NUMA regions. Total 4 NUMA nodes exist over 2 socket. Node 0: 0-31, 128-159 Node 1: 32-63, 160-191 Node 2: 64-95, 192-223 Node 3: 96-127, 223-255 NPS4: Each socket is logically divided into 4 NUMA regions. Total 8 NUMA nodes exist over 2 socket. 
Node 0: 0-15, 128-143 Node 1: 16-31, 144-159 Node 2: 32-47, 160-175 Node 3: 48-63, 176-191 Node 4: 64-79, 192-207 Node 5: 80-95, 208-223 Node 6: 96-111, 223-231 Node 7: 112-127, 232-255 Benchmark Results: Kernel versions: - tip: 6.1.0-rc2 tip sched/core - sis_short: 6.1.0-rc2 tip sched/core + this series When the testing started, the tip was at: commit d6962c4fe8f9 "sched: Clear ttwu_pending after enqueue_task()" ~~~~~~~~~~~~~ ~ hackbench ~ ~~~~~~~~~~~~~ NPS1 Test: tip sis_short 1-groups: 4.25 (0.00 pct) 4.26 (-0.23 pct) 2-groups: 4.95 (0.00 pct) 4.87 (1.61 pct) 4-groups: 5.19 (0.00 pct) 5.09 (1.92 pct) 8-groups: 5.45 (0.00 pct) 5.37 (1.46 pct) 16-groups: 7.33 (0.00 pct) 7.65 (-4.36 pct) NPS2 Test: tip sis_short 1-groups: 4.09 (0.00 pct) 4.20 (-2.68 pct) 2-groups: 4.68 (0.00 pct) 4.75 (-1.49 pct) 4-groups: 5.05 (0.00 pct) 4.94 (2.17 pct) 8-groups: 5.37 (0.00 pct) 5.29 (1.48 pct) 16-groups: 6.69 (0.00 pct) 6.78 (-1.34 pct) NPS4 Test: tip sis_short 1-groups: 4.28 (0.00 pct) 4.37 (-2.10 pct) 2-groups: 4.78 (0.00 pct) 4.82 (-0.83 pct) 4-groups: 5.11 (0.00 pct) 5.06 (0.97 pct) 8-groups: 5.48 (0.00 pct) 5.38 (1.82 pct) 16-groups: 7.07 (0.00 pct) 6.93 (1.98 pct) ~~~~~~~~~~~~ ~ schbench ~ ~~~~~~~~~~~~ NPS1 #workers: tip sis_short 1: 31.00 (0.00 pct) 32.00 (-3.22 pct) 2: 33.00 (0.00 pct) 32.00 (3.03 pct) 4: 39.00 (0.00 pct) 34.00 (12.82 pct) 8: 45.00 (0.00 pct) 43.00 (4.44 pct) 16: 61.00 (0.00 pct) 66.00 (-8.19 pct) 32: 108.00 (0.00 pct) 107.00 (0.92 pct) 64: 212.00 (0.00 pct) 223.00 (-5.18 pct) 128: 475.00 (0.00 pct) 519.00 (-9.26 pct) * 128: 434.00 (0.00 pct) 438.00 (-0.92 pct) [Verification Run] 256: 44736.00 (0.00 pct) 42048.00 (6.00 pct) 512: 77184.00 (0.00 pct) 76672.00 (0.66 pct) NPS2 #workers: tip sis_short 1: 28.00 (0.00 pct) 32.00 (-14.28 pct) 2: 34.00 (0.00 pct) 37.00 (-8.82 pct) 4: 36.00 (0.00 pct) 37.00 (-2.77 pct) 8: 51.00 (0.00 pct) 47.00 (7.84 pct) 16: 68.00 (0.00 pct) 67.00 (1.47 pct) 32: 113.00 (0.00 pct) 117.00 (-3.53 pct) 64: 221.00 (0.00 pct) 221.00 (0.00 pct) 128: 553.00 (0.00 pct) 567.00 (-2.53 pct) 256: 43840.00 (0.00 pct) 47040.00 (-7.29 pct) * 256: 46016.00 (0.00 pct) 46272.00 (-0.55 pct) [Verification Run] 512: 76672.00 (0.00 pct) 78976.00 (-3.00 pct) NPS4 #workers: tip sis_short 1: 33.00 (0.00 pct) 32.00 (3.03 pct) 2: 29.00 (0.00 pct) 42.00 (-44.82 pct) 4: 39.00 (0.00 pct) 43.00 (-10.25 pct) 8: 58.00 (0.00 pct) 56.00 (3.44 pct) 16: 66.00 (0.00 pct) 68.00 (-3.03 pct) 32: 112.00 (0.00 pct) 113.00 (-0.89 pct) 64: 215.00 (0.00 pct) 214.00 (0.46 pct) 128: 689.00 (0.00 pct) 823.00 (-19.44 pct) * 128: 424.00 (0.00 pct) 449.00 (-5.89 pct) [Verification Run] 256: 45120.00 (0.00 pct) 44608.00 (1.13 pct) 512: 77440.00 (0.00 pct) 79488.00 (-2.64 pct) ~~~~~~~~~~ ~ tbench ~ ~~~~~~~~~~ NPS1 Clients: tip sis_short 1 581.75 (0.00 pct) 545.15 (-6.29 pct) * 1 588.80 (0.00 pct) 547.71 (-6.97 pct) [Verification Run] 2 1145.75 (0.00 pct) 1074.91 (-6.18 pct) * 2 1153.55 (0.00 pct) 1068.06 (-7.41 pct) [Verification Run] 4 2127.94 (0.00 pct) 1958.71 (-7.95 pct) * 4 2133.95 (0.00 pct) 1965.03 (-7.91 pct) [Verification Run] 8 3838.27 (0.00 pct) 3701.75 (-3.55 pct) 16 6272.71 (0.00 pct) 6045.56 (-3.62 pct) 32 11400.12 (0.00 pct) 11688.67 (2.53 pct) 64 21605.96 (0.00 pct) 22070.92 (2.15 pct) 128 30715.43 (0.00 pct) 30772.11 (0.18 pct) 256 55580.78 (0.00 pct) 54998.38 (-1.04 pct) 512 56528.79 (0.00 pct) 55239.00 (-2.28 pct) 1024 56520.40 (0.00 pct) 55153.71 (-2.41 pct) NPS2 Clients: tip sis_short 1 584.13 (0.00 pct) 555.55 (-4.89 pct) 2 1153.63 (0.00 pct) 1094.62 (-5.11 pct) * 2 1147.26 (0.00 pct) 
1063.35 (-7.31 pct) [Verification Run] 4 2212.89 (0.00 pct) 2049.24 (-7.39 pct) * 4 2133.57 (0.00 pct) 2004.66 (-6.04 pct) [Verification Run] 8 3871.35 (0.00 pct) 3793.66 (-2.00 pct) 16 6216.72 (0.00 pct) 5895.87 (-5.16 pct) * 16 6342.15 (0.00 pct) 5962.58 (-5.98 pct) [Verification Run] 32 11766.98 (0.00 pct) 11013.97 (-6.39 pct) * 32 11009.53 (0.00 pct) 10215.06 (-7.21 pct) [Verification Run] 64 22000.93 (0.00 pct) 20563.47 (-6.53 pct) * 64 20315.34 (0.00 pct) 20097.13 (-1.07 pct) [Verification Run] 128 31520.53 (0.00 pct) 27489.63 (-12.78 pct) * 128 29224.27 (0.00 pct) 27872.81 (-4.62 pct) [Verification Run] 256 51420.11 (0.00 pct) 51968.82 (1.06 pct) 512 53935.90 (0.00 pct) 54053.59 (0.21 pct) 1024 55239.73 (0.00 pct) 54505.89 (-1.32 pct) NPS4 Clients: tip sis_short 1 585.83 (0.00 pct) 547.36 (-6.56 pct) * 1 590.36 (0.00 pct) 546.57 (-7.41 pct) [Verification Run] 2 1141.59 (0.00 pct) 1061.99 (-6.97 pct) * 2 1167.56 (0.00 pct) 1059.16 (-9.28 pct) [Verification Run] 4 2174.79 (0.00 pct) 2002.42 (-7.92 pct) * 4 2169.72 (0.00 pct) 1973.22 (-9.05 pct) [Verification Run] 8 3887.56 (0.00 pct) 3543.62 (-8.84 pct) * 8 3747.36 (0.00 pct) 3398.57 (-9.30 pct) [Verification Run] 16 6441.59 (0.00 pct) 5896.92 (-8.45 pct) * 16 6165.54 (0.00 pct) 6019.96 (-2.36 pct) [Verification Run] 32 12133.60 (0.00 pct) 11074.16 (-8.73 pct) * 32 11182.63 (0.00 pct) 11241.82 (0.52 pct) [Verification Run] 64 21769.15 (0.00 pct) 19485.94 (-10.48 pct) * 64 20168.52 (0.00 pct) 20333.98 (0.82 pct) [Verification Run] 128 31396.31 (0.00 pct) 29029.29 (-7.53 pct) * 128 29437.98 (0.00 pct) 27438.60 (-6.79 pct) [Verification Run] 256 52792.39 (0.00 pct) 51560.92 (-2.33 pct) 512 55315.44 (0.00 pct) 52914.21 (-4.34 pct) 1024 52150.27 (0.00 pct) 53286.08 (2.17 pct) ~~~~~~~~~~ ~ stream ~ ~~~~~~~~~~ NPS1 10 Runs: Test: tip sis_short Copy: 307827.79 (0.00 pct) 320671.93 (4.17 pct) Scale: 208872.28 (0.00 pct) 213725.16 (2.32 pct) Add: 239404.64 (0.00 pct) 240872.59 (0.61 pct) Triad: 247258.30 (0.00 pct) 249764.53 (1.01 pct) 100 Runs: Test: tip sis_short Copy: 317217.55 (0.00 pct) 318031.46 (0.25 pct) Scale: 208740.82 (0.00 pct) 213773.01 (2.41 pct) Add: 240550.63 (0.00 pct) 239618.79 (-0.38 pct) Triad: 249594.21 (0.00 pct) 245722.98 (-1.55 pct) NPS2 10 Runs: Test: tip sis_short Copy: 340877.18 (0.00 pct) 336204.91 (-1.37 pct) Scale: 217318.16 (0.00 pct) 218200.84 (0.40 pct) Add: 259078.93 (0.00 pct) 258847.83 (-0.08 pct) Triad: 274500.78 (0.00 pct) 267234.01 (-2.64 pct) 100 Runs: Test: tip sis_short Copy: 341860.73 (0.00 pct) 335116.25 (-1.97 pct) Scale: 218043.00 (0.00 pct) 219265.81 (0.56 pct) Add: 253698.22 (0.00 pct) 260792.08 (2.79 pct) Triad: 265011.84 (0.00 pct) 269655.66 (1.75 pct) NPS4 10 Runs: Test: tip sis_short Copy: 323775.81 (0.00 pct) 363923.22 (12.39 pct) Scale: 237719.83 (0.00 pct) 238074.93 (0.14 pct) Add: 251896.35 (0.00 pct) 269129.04 (6.84 pct) Triad: 264124.72 (0.00 pct) 284456.89 (7.69 pct) 100 Runs: Test: tip sis_short Copy: 360777.62 (0.00 pct) 375325.94 (4.03 pct) Scale: 239333.27 (0.00 pct) 238341.16 (-0.41 pct) Add: 271142.94 (0.00 pct) 269924.55 (-0.44 pct) Triad: 285408.97 (0.00 pct) 290345.65 (1.72 pct) ~~~~~~~~~~~~~~~~ ~ ycsb-mongodb ~ ~~~~~~~~~~~~~~~~ o NPS1 tip: 131696.33 (var: 2.03%) sis_short: 129031.33 (var: 1.71%) (-2.02%) o NPS2: tip: 129895.33 (var: 2.34%) sis_short: 133652.33 (var: 1.11%) (+2.89%) o NPS4: tip: 131165.00 (var: 1.06%) sis_short: 135975.00 (var: 0.34%) (+3.66%) ~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ SPECjbb MultiJVM - NPS1 ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Kernel tip sis_short Max-jOPS 100% 
96% Critical-jOPS 100% 97% ~~~~~~~~~~~~~ ~ unixbench ~ ~~~~~~~~~~~~~ o NPS1 Test Metric Parallelism tip sis_short unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48929419.48 ( 0.00%) 48992339.28 ( 0.13%) unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6266355505.80 ( 0.00%) 6251441423.60 ( -0.24%) unixbench-syscall Amean unixbench-syscall-1 2994319.73 ( 0.00%) 2665595.13 * 10.98%* unixbench-syscall Amean unixbench-syscall-512 7349715.87 ( 0.00%) 7645690.70 * -4.03%* unixbench-pipe Hmean unixbench-pipe-1 2830206.03 ( 0.00%) 2508957.89 * -11.35%* * unixbench-pipe Hmean unixbench-pipe-512 326207828.01 ( 0.00%) 306588592.66 * -6.01%* * unixbench-spawn Hmean unixbench-spawn-1 6394.21 ( 0.00%) 6153.47 ( -3.76%) unixbench-spawn Hmean unixbench-spawn-512 72700.64 ( 0.00%) 87873.75 * 20.87%* unixbench-execl Hmean unixbench-execl-1 4723.61 ( 0.00%) 4678.46 ( -0.96%) unixbench-execl Hmean unixbench-execl-512 11212.05 ( 0.00%) 11067.41 * -1.29%* o NPS2 Test Metric Parallelism tip sis_short unixbench-dhry2reg Hmean unixbench-dhry2reg-1 49271512.85 ( 0.00%) 48842191.93 ( -0.87%) unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6267992483.03 ( 0.00%) 6261579339.17 ( -0.10%) unixbench-syscall Amean unixbench-syscall-1 2995885.93 ( 0.00%) 2668794.40 * 10.92%* unixbench-syscall Amean unixbench-syscall-512 7388865.77 ( 0.00%) 7229531.57 * 2.16%* unixbench-pipe Hmean unixbench-pipe-1 2828971.95 ( 0.00%) 2509560.35 * -11.29%* * unixbench-pipe Hmean unixbench-pipe-512 326225385.37 ( 0.00%) 306941838.69 * -5.91%* * unixbench-spawn Hmean unixbench-spawn-1 6958.71 ( 0.00%) 6613.24 ( -4.96%) unixbench-spawn Hmean unixbench-spawn-512 85443.56 ( 0.00%) 89130.15 * 4.31%* unixbench-execl Hmean unixbench-execl-1 4767.99 ( 0.00%) 4684.44 * -1.75%* unixbench-execl Hmean unixbench-execl-512 11250.72 ( 0.00%) 11197.80 ( -0.47%) o NPS4 Test Metric Parallelism tip sis_short unixbench-dhry2reg Hmean unixbench-dhry2reg-1 49041932.68 ( 0.00%) 48942640.63 ( -0.20%) unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6277864986.70 ( 0.00%) 6263890106.10 ( -0.22%) unixbench-syscall Amean unixbench-syscall-1 2992405.60 ( 0.00%) 2663959.00 * 10.98%* unixbench-syscall Amean unixbench-syscall-512 7971789.70 ( 0.00%) 7834566.33 * 1.72%* unixbench-pipe Hmean unixbench-pipe-1 2822892.54 ( 0.00%) 2505995.43 * -11.23%* * unixbench-pipe Hmean unixbench-pipe-512 326408309.83 ( 0.00%) 306887263.78 * -5.98%* * unixbench-spawn Hmean unixbench-spawn-1 7685.31 ( 0.00%) 7675.29 ( -0.13%) unixbench-spawn Hmean unixbench-spawn-512 72245.56 ( 0.00%) 74190.34 * 2.69%* unixbench-execl Hmean unixbench-execl-1 4761.42 ( 0.00%) 4683.56 * -1.64%* unixbench-execl Hmean unixbench-execl-512 11533.53 ( 0.00%) 11298.59 ( -2.04%) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ Analyzing tbench (2 client) for NPS1 configuration ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ tip: 1126.33 MB/sec sis_short: 1031.26 MB/sec (-8.44%) -> Legend for per CPU stats: rq->yld_count: sched_yield count rq->sched_count: schedule called rq->sched_goidle: schedule left the processor idle rq->ttwu_count: try_to_wake_up was called rq->ttwu_local: try_to_wake_up was called to wake up the local cpu rq->rq_cpu_time: total runtime by tasks on this processor (in jiffies) rq->rq_sched_info.run_delay: total waittime by tasks on this processor (in jiffies) rq->rq_sched_info.pcount: total timeslices run on this cpu -> System wide stats ------------------------------------------------------------------------------------------------------------------------- cpu: all_cpus (avg) vs 
cpu: all_cpus (avg) ------------------------------------------------------------------------------------------------------------------------- sched_yield count : 0, 0 Legacy counter can be ignored : 0, 0 schedule called : 233543, 213786 | -8.46| schedule left the processor idle : 116745, 106863 | -8.46| try_to_wake_up was called : 116772, 106893 | -8.46| try_to_wake_up was called to wake up the local cpu : 74, 84 | 13.51| total runtime by tasks on this processor (in jiffies) : 579771198, 585440128 total waittime by tasks on this processor (in jiffies) : 176365, 258702 | 46.69| * Wait time is much longer total timeslices run on this cpu : 116797, 106922 | -8.45| ------------------------------------------------------------------------------------------------------------------------- < ----------------------------------------------------------------- Wakeup info: ------------------------------------ > Wakeups on same SMT cpus = all_cpus (avg) : 0, 0 Wakeups on same MC cpus = all_cpus (avg) : 116689, 106797 | -8.48| Wakeups on same DIE cpus = all_cpus (avg) : 2, 4 Wakeups on same NUMA cpus = all_cpus (avg) : 5, 7 Affine wakeups on same SMT cpus = all_cpus (avg) : 0, 0 Affine wakeups on same MC cpus = all_cpus (avg) : 116667, 106781 | -8.47| Affine wakeups on same DIE cpus = all_cpus (avg) : 2, 4 Affine wakeups on same NUMA cpus = all_cpus (avg) : 5, 6 -------------------------------------------------------------------------------------------------------------------------- The rq->rq_sched_info.pcount and rq->sched_count seems to have reduced proportionally. > > Changes since v2: > > 1. Peter suggested comparing the duration of waker and the cost to > scan for an idle CPU: If the cost is higher than the task duration, > do not waste time finding an idle CPU, choose the local or previous > CPU directly. A prototype was created based on this suggestion. > However, according to the test result, this prototype does not inhibit > the cross CPU wakeup and did not bring improvement. Because the cost > to find an idle CPU is small in the problematic scenario. The root > cause of the problem is a race condition between scanning for an idle > CPU and task enqueue(please refer to the commit log in PATCH 2/2). > So v3 does not change the core logic of v2, with some refinement based > on Peter's suggestion. > > 2. Simplify the logic to record the task duration per Peter and Abel's suggestion. > > This change brings overall improvement on some microbenchmarks, both on > Intel and AMD platforms. > > v3: https://lore.kernel.org/lkml/cover.1669862147.git.yu.c.chen@intel.com/ > v2: https://lore.kernel.org/all/cover.1666531576.git.yu.c.chen@intel.com/ > v1: https://lore.kernel.org/lkml/20220915165407.1776363-1-yu.c.chen@intel.com/ > > Chen Yu (2): > sched/fair: Introduce short duration task check > sched/fair: Choose the CPU where short task is running during wake up > > include/linux/sched.h | 3 +++ > kernel/sched/core.c | 2 ++ > kernel/sched/debug.c | 1 + > kernel/sched/fair.c | 32 ++++++++++++++++++++++++++++++++ > kernel/sched/features.h | 1 + > 5 files changed, 39 insertions(+) > All numbers are with turbo and C2 enabled. I wonder if the check "(5 * nr < 3 * sd->span_weight)" in v2 helped workloads like tbench and SpecJBB. I'll queue some runs with the condition added back and separate run with turbo and C2 disabled to see if they helps. I'll update the thread once the results are in. -- Thanks and Regards, Prateek
On 2022-12-29 at 12:46:59 +0530, K Prateek Nayak wrote: > Hello Chenyu, > > Including the detailed results from testing below. > > tl;dr > > o There seems to be 3 noticeable regressions: > - tbench for lower number of clients. The schedstat data shows > an increase in wait time. > - SpecJBB MultiJVM performance drops as the workload prefers > an idle CPU over a busy one. > - Unixbench-pipe benchmark performance drops. > > o Most benchmark numbers remain same. > > o Small gains seen for ycsb-mongodb and unixbench-syscall. > Thanks Prateek for your test. > On 12/16/2022 11:38 AM, Chen Yu wrote: > > The main purpose of this change is to avoid too many cross CPU > > wake up when it is unnecessary. The frequent cross CPU wake up > > brings significant damage to some workloads, especially on high > > core count systems. > > > > This patch set inhibits the cross CPU wake-up by placing the wakee > > on waking CPU or previous CPU, if both the waker and wakee are > > short-duration tasks. > > > > The first patch is to introduce the definition of a short-duration > > task. The second patch leverages the first patch to choose a local > > or previous CPU for wakee. > > > > Changes since v3: > > 1. Honglei and Josh have concern that the threshold of short > > task duration could be too long. Decreased the threshold from > > sysctl_sched_min_granularity to (sysctl_sched_min_granularity / 8), > > and the '8' comes from get_update_sysctl_factor(). > > 2. Export p->se.dur_avg to /proc/{pid}/sched per Yicong's suggestion. > > 3. Move the calculation of average duration from put_prev_task_fair() > > to dequeue_task_fair(). Because there is an issue in v3 that, > > put_prev_task_fair() will not be invoked by pick_next_task_fair() > > in fast path, thus the dur_avg could not be updated timely. > > 4. Fix the comment in PATCH 2/2, that "WRITE_ONCE(CPU1->ttwu_pending, 1);" > > on CPU0 is earlier than CPU1 getting "ttwu_list->p0", per Tianchen. > > 5. Move the scan for CPU with short duration task from select_idle_cpu() > > to select_idle_siblings(), because there is no CPU scan involved, per> Yicong. > > Following are the results from running standard benchmarks on a > dual socket Zen3 (2 x 64C/128T) machine configured in different > NPS modes. > > NPS Modes are used to logically divide single socket into > multiple NUMA region. > Following is the NUMA configuration for each NPS mode on the system: > > NPS1: Each socket is a NUMA node. > Total 2 NUMA nodes in the dual socket machine. > > Node 0: 0-63, 128-191 > Node 1: 64-127, 192-255 > > NPS2: Each socket is further logically divided into 2 NUMA regions. > Total 4 NUMA nodes exist over 2 socket. > > Node 0: 0-31, 128-159 > Node 1: 32-63, 160-191 > Node 2: 64-95, 192-223 > Node 3: 96-127, 223-255 > > NPS4: Each socket is logically divided into 4 NUMA regions. > Total 8 NUMA nodes exist over 2 socket. > > Node 0: 0-15, 128-143 > Node 1: 16-31, 144-159 > Node 2: 32-47, 160-175 > Node 3: 48-63, 176-191 > Node 4: 64-79, 192-207 > Node 5: 80-95, 208-223 > Node 6: 96-111, 223-231 > Node 7: 112-127, 232-255 > > Benchmark Results: > > Kernel versions: > - tip: 6.1.0-rc2 tip sched/core > - sis_short: 6.1.0-rc2 tip sched/core + this series > > When the testing started, the tip was at: > commit d6962c4fe8f9 "sched: Clear ttwu_pending after enqueue_task()" > OK, I'll rebase the code and to check if I could reproduce the regression with SNC enabled. 
Previously I tested v3 with SNC enabled and did not see regress so I did not enable SNC when testing v4, I'll do the test with SNC enabled. [...] > > o NPS1 > > Test Metric Parallelism tip sis_short > unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48929419.48 ( 0.00%) 48992339.28 ( 0.13%) > unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6266355505.80 ( 0.00%) 6251441423.60 ( -0.24%) > unixbench-syscall Amean unixbench-syscall-1 2994319.73 ( 0.00%) 2665595.13 * 10.98%* > unixbench-syscall Amean unixbench-syscall-512 7349715.87 ( 0.00%) 7645690.70 * -4.03%* > unixbench-pipe Hmean unixbench-pipe-1 2830206.03 ( 0.00%) 2508957.89 * -11.35%* * > unixbench-pipe Hmean unixbench-pipe-512 326207828.01 ( 0.00%) 306588592.66 * -6.01%* * I'll also run unixbench-pipe in my environment. As will-it-scale context_switch1 also stress pipe and did see improvement, I'll check why unixbench-pipe regress. [...] > total waittime by tasks on this processor (in jiffies) : 176365, 258702 | 46.69| * Wait time is much longer This seems to be fall back to the result of v1, where we inhibit the idle CPU scan and caused problems. > total timeslices run on this cpu : 116797, 106922 | -8.45| > ------------------------------------------------------------------------------------------------------------------------- > < ----------------------------------------------------------------- Wakeup info: ------------------------------------ > > Wakeups on same SMT cpus = all_cpus (avg) : 0, 0 > Wakeups on same MC cpus = all_cpus (avg) : 116689, 106797 | -8.48| > Wakeups on same DIE cpus = all_cpus (avg) : 2, 4 > Wakeups on same NUMA cpus = all_cpus (avg) : 5, 7 > Affine wakeups on same SMT cpus = all_cpus (avg) : 0, 0 > Affine wakeups on same MC cpus = all_cpus (avg) : 116667, 106781 | -8.47| > Affine wakeups on same DIE cpus = all_cpus (avg) : 2, 4 > Affine wakeups on same NUMA cpus = all_cpus (avg) : 5, 6 > -------------------------------------------------------------------------------------------------------------------------- > > The rq->rq_sched_info.pcount and rq->sched_count seems to > have reduced proportionally. > > > > > Changes since v2: > > > > 1. Peter suggested comparing the duration of waker and the cost to > > scan for an idle CPU: If the cost is higher than the task duration, > > do not waste time finding an idle CPU, choose the local or previous > > CPU directly. A prototype was created based on this suggestion. > > However, according to the test result, this prototype does not inhibit > > the cross CPU wakeup and did not bring improvement. Because the cost > > to find an idle CPU is small in the problematic scenario. The root > > cause of the problem is a race condition between scanning for an idle > > CPU and task enqueue(please refer to the commit log in PATCH 2/2). > > So v3 does not change the core logic of v2, with some refinement based > > on Peter's suggestion. > > > > 2. Simplify the logic to record the task duration per Peter and Abel's suggestion. > > > > This change brings overall improvement on some microbenchmarks, both on > > Intel and AMD platforms. 
> > > > v3: https://lore.kernel.org/lkml/cover.1669862147.git.yu.c.chen@intel.com/ > > v2: https://lore.kernel.org/all/cover.1666531576.git.yu.c.chen@intel.com/ > > v1: https://lore.kernel.org/lkml/20220915165407.1776363-1-yu.c.chen@intel.com/ > > > > Chen Yu (2): > > sched/fair: Introduce short duration task check > > sched/fair: Choose the CPU where short task is running during wake up > > > > include/linux/sched.h | 3 +++ > > kernel/sched/core.c | 2 ++ > > kernel/sched/debug.c | 1 + > > kernel/sched/fair.c | 32 ++++++++++++++++++++++++++++++++ > > kernel/sched/features.h | 1 + > > 5 files changed, 39 insertions(+) > > > > All numbers are with turbo and C2 enabled. I wonder if the > check "(5 * nr < 3 * sd->span_weight)" in v2 helped workloads > like tbench and SpecJBB. I'll queue some runs with the condition > added back and separate run with turbo and C2 disabled to see > if they helps. I'll update the thread once the results are in. Thanks for helping check if the nr part in v2 could bring the improvement back. However Peter seems to have concern regarding the nr check, I'll think about it more. thanks, Chenyu > > -- > Thanks and Regards, > Prateek
Hello Chenyu,

On 12/30/2022 8:17 AM, Chen Yu wrote:
> On 2022-12-29 at 12:46:59 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> Including the detailed results from testing below.
>>
>> tl;dr
>>
>> o There seem to be 3 noticeable regressions:
>>    - tbench for lower numbers of clients. The schedstat data shows
>>      an increase in wait time.
>>    - SpecJBB MultiJVM performance drops as the workload prefers
>>      an idle CPU over a busy one.
>>    - Unixbench-pipe benchmark performance drops.
>>
>> o Most benchmark numbers remain the same.
>>
>> o Small gains seen for ycsb-mongodb and unixbench-syscall.
>>

Please ignore the last test results. The tests did not use exactly the
same config for the tip and sis_short kernels, which led to more overhead
in the network stack for the sis_short kernel; the longer wait time seen
in the schedstat data for tbench was a result of each loop taking longer
to finish.

I reran the benchmarks on the latest tip, making sure the configs are
identical this time, and only noticed one regression, in SpecJBB
Critical-jOPS.

tl;dr

o tbench sees good improvement in throughput when the machine is fully
  loaded and beyond.
o Some unixbench test cases show improvement, as does ycsb-mongodb in
  NPS2 and NPS4 mode.
o Most benchmark results are the same.
o SpecJBB Critical-jOPS is still down. I'll share the full schedstat
  dump for tasks separately with you.

Following are the detailed results from a dual socket Zen3 machine.

Kernel versions:
- tip:       6.2.0-rc2 tip:sched/core at commit bbd0b031509b
             ("sched/rseq: Fix concurrency ID handling of usermodehelper kthreads")
- sis_short: tip + series

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test:              tip                     sis_short
 1-groups:         4.36 (0.00 pct)         4.33 (0.68 pct)
 2-groups:         5.17 (0.00 pct)         5.20 (-0.58 pct)
 4-groups:         4.17 (0.00 pct)         4.19 (-0.47 pct)
 8-groups:         4.64 (0.00 pct)         4.71 (-1.50 pct)
16-groups:         5.43 (0.00 pct)         6.60 (-21.54 pct) *
                   [Machine is overloaded with avg 2.5 tasks per run queue]
16-groups:         5.90 (0.00 pct)         5.87 (0.50 pct)   [Verification Run]

o NPS2

Test:              tip                     sis_short
 1-groups:         4.43 (0.00 pct)         4.18 (5.64 pct)
 2-groups:         4.61 (0.00 pct)         4.99 (-8.24 pct)  *
 2-groups:         4.72 (0.00 pct)         4.74 (-0.42 pct)  [Verification Run]
 4-groups:         4.25 (0.00 pct)         4.20 (1.17 pct)
 8-groups:         4.91 (0.00 pct)         5.08 (-3.46 pct)
16-groups:         5.84 (0.00 pct)         5.84 (0.00 pct)

o NPS4

Test:              tip                     sis_short
 1-groups:         4.34 (0.00 pct)         4.39 (-1.15 pct)
 2-groups:         4.64 (0.00 pct)         4.97 (-7.11 pct)  *
 2-groups:         4.86 (0.00 pct)         4.77 (1.85 pct)   [Verification Run]
 4-groups:         4.20 (0.00 pct)         4.26 (-1.42 pct)
 8-groups:         5.21 (0.00 pct)         5.19 (0.38 pct)
16-groups:         6.24 (0.00 pct)         5.93 (4.96 pct)

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

$ schbench -m 2 -t <workers> -s 30

o NPS1

#workers:          tip                     sis_short
  1:              36.00 (0.00 pct)        26.00 (27.77 pct)
  2:              37.00 (0.00 pct)        36.00 (2.70 pct)
  4:              37.00 (0.00 pct)        37.00 (0.00 pct)
  8:              47.00 (0.00 pct)        49.00 (-4.25 pct)
 16:              64.00 (0.00 pct)        69.00 (-7.81 pct)
 32:             109.00 (0.00 pct)       118.00 (-8.25 pct)  *
 32:             117.00 (0.00 pct)       116.00 (0.86 pct)   [Verification Run]
 64:             222.00 (0.00 pct)       219.00 (1.35 pct)
128:             515.00 (0.00 pct)       513.00 (0.38 pct)
256:           39744.00 (0.00 pct)     35776.00 (9.98 pct)
512:           81280.00 (0.00 pct)     76672.00 (5.66 pct)

o NPS2

#workers:          tip                     sis_short
  1:              27.00 (0.00 pct)        25.00 (7.40 pct)
  2:              31.00 (0.00 pct)        31.00 (0.00 pct)
  4:              38.00 (0.00 pct)        37.00 (2.63 pct)
  8:              50.00 (0.00 pct)        55.00 (-10.00 pct)
 16:              66.00 (0.00 pct)        66.00 (0.00 pct)
 32:             116.00 (0.00 pct)       119.00 (-2.58 pct)
 64:             210.00 (0.00 pct)       219.00 (-4.28 pct)
128:             523.00 (0.00 pct)       497.00 (4.97 pct)
256:           44864.00 (0.00 pct)     46656.00 (-3.99 pct)
512:           78464.00 (0.00 pct)     79488.00 (-1.30 pct)

o NPS4

#workers:          tip                     sis_short
  1:              32.00 (0.00 pct)        34.00 (-6.25 pct)
  2:              32.00 (0.00 pct)        35.00 (-9.37 pct)
  4:              34.00 (0.00 pct)        39.00 (-14.70 pct)
  8:              58.00 (0.00 pct)        51.00 (12.06 pct)
 16:              67.00 (0.00 pct)        70.00 (-4.47 pct)
 32:             118.00 (0.00 pct)       123.00 (-4.23 pct)
 64:             224.00 (0.00 pct)       227.00 (-1.33 pct)
128:             533.00 (0.00 pct)       537.00 (-0.75 pct)
256:           43456.00 (0.00 pct)     48192.00 (-10.89 pct) *
                 [Machine overloaded - avg 2 tasks per run queue]
256:           46911.59 (0.00 pct)     47163.28 (-0.53 pct)  [Verification Run]
512:           78976.00 (0.00 pct)     78976.00 (0.00 pct)

~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients:           tip                     sis_short
    1            539.96 (0.00 pct)       532.63 (-1.35 pct)
    2           1068.21 (0.00 pct)      1057.35 (-1.01 pct)
    4           1994.76 (0.00 pct)      2015.05 (1.01 pct)
    8           3602.30 (0.00 pct)      3598.70 (-0.09 pct)
   16           6075.49 (0.00 pct)      6019.96 (-0.91 pct)
   32          11641.07 (0.00 pct)     11774.03 (1.14 pct)
   64          21529.16 (0.00 pct)     21392.97 (-0.63 pct)
  128          30852.92 (0.00 pct)     31355.87 (1.63 pct)
  256          51901.20 (0.00 pct)     54896.08 (5.77 pct)
  512          46797.40 (0.00 pct)     55090.07 (17.72 pct)
 1024          46057.28 (0.00 pct)     54374.89 (18.05 pct)

o NPS2

Clients:           tip                     sis_short
    1            536.11 (0.00 pct)       542.14 (1.12 pct)
    2           1044.58 (0.00 pct)      1057.98 (1.28 pct)
    4           2043.92 (0.00 pct)      1981.64 (-3.04 pct)
    8           3572.50 (0.00 pct)      3579.03 (0.18 pct)
   16           6040.97 (0.00 pct)      5946.20 (-1.56 pct)
   32          10794.10 (0.00 pct)     11348.54 (5.13 pct)
   64          20905.89 (0.00 pct)     21340.31 (2.07 pct)
  128          30885.39 (0.00 pct)     30834.59 (-0.16 pct)
  256          48901.25 (0.00 pct)     51905.17 (6.14 pct)
  512          49673.91 (0.00 pct)     53608.18 (7.92 pct)
 1024          47626.34 (0.00 pct)     53396.88 (12.11 pct)

o NPS4

Clients:           tip                     sis_short
    1            544.91 (0.00 pct)       542.78 (-0.39 pct)
    2           1046.49 (0.00 pct)      1057.16 (1.01 pct)
    4           2007.11 (0.00 pct)      2001.21 (-0.29 pct)
    8           3590.66 (0.00 pct)      3427.33 (-4.54 pct)
   16           5956.60 (0.00 pct)      5898.69 (-0.97 pct)
   32          10431.73 (0.00 pct)     10732.48 (2.88 pct)
   64          21563.37 (0.00 pct)     19141.76 (-11.23 pct) *
   64          21140.75 (0.00 pct)     20883.78 (-1.21 pct)  [Verification Run]
  128          30352.16 (0.00 pct)     28757.44 (-5.25 pct)  *
  128          29537.66 (0.00 pct)     29488.04 (-0.16 pct)  [Verification Run]
  256          49504.51 (0.00 pct)     52492.40 (6.03 pct)
  512          44916.61 (0.00 pct)     52746.38 (17.43 pct)
 1024          49986.21 (0.00 pct)     53169.62 (6.36 pct)

~~~~~~~~~~
~ stream ~
~~~~~~~~~~

o NPS1

- 10 Runs:

Test:              tip                     sis_short
 Copy:         336796.79 (0.00 pct)    336858.84 (0.01 pct)
Scale:         212768.55 (0.00 pct)    213061.98 (0.13 pct)
  Add:         244000.34 (0.00 pct)    237874.08 (-2.51 pct)
Triad:         255042.52 (0.00 pct)    250122.34 (-1.92 pct)

- 100 Runs:

Test:              tip                     sis_short
 Copy:         335938.02 (0.00 pct)    324841.05 (-3.30 pct)
Scale:         212597.92 (0.00 pct)    211516.93 (-0.50 pct)
  Add:         248294.62 (0.00 pct)    241706.28 (-2.65 pct)
Triad:         258400.88 (0.00 pct)    251427.43 (-2.69 pct)

o NPS2

- 10 Runs:

Test:              tip                     sis_short
 Copy:         340709.53 (0.00 pct)    338797.30 (-0.56 pct)
Scale:         216849.08 (0.00 pct)    218167.05 (0.60 pct)
  Add:         257761.46 (0.00 pct)    258717.38 (0.37 pct)
Triad:         268615.11 (0.00 pct)    270284.11 (0.62 pct)

- 100 Runs:

Test:              tip                     sis_short
 Copy:         326385.13 (0.00 pct)    314299.63 (-3.70 pct)
Scale:         216440.37 (0.00 pct)    213573.71 (-1.32 pct)
  Add:         255062.22 (0.00 pct)    250837.63 (-1.65 pct)
Triad:         265442.03 (0.00 pct)    259851.69 (-2.10 pct)

o NPS4

- 10 Runs:

Test:              tip                     sis_short
 Copy:         363927.86 (0.00 pct)    364556.47 (0.17 pct)
Scale:         238190.49 (0.00 pct)    245339.08 (3.00 pct)
  Add:         262806.49 (0.00 pct)    270349.31 (2.87 pct)
Triad:         276492.33 (0.00 pct)    274536.47 (-0.70 pct)

- 100 Runs:

Test:              tip                     sis_short
 Copy:         369197.55 (0.00 pct)    365775.06 (-0.92 pct)
Scale:         250508.46 (0.00 pct)    251164.01 (0.26 pct)
  Add:         267792.19 (0.00 pct)    268477.42 (0.25 pct)
Triad:         280010.98 (0.00 pct)    272448.39 (-2.70 pct)

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1

tip:        131328.67 (var: 2.97%)
sis_short:  130867.33 (var: 3.23%)  (-0.35%)

o NPS2

tip:        132819.67 (var: 0.85%)
sis_short:  135295.33 (var: 2.02%)  (+1.86%)

o NPS4

tip:        134130.00 (var: 4.12%)
sis_short:  138018.67 (var: 1.92%)  (+2.89%)

~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ SPECjbb MultiJVM - NPS1 ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Kernel             tip             sis_short
Max-jOPS          100.00%          100.00%
Critical-jOPS     100.00%           95.62%  *

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

Test                 Metric   Parallelism                 tip                      sis_short
unixbench-dhry2reg   Hmean    unixbench-dhry2reg-1      48518455.50 (  0.00%)     48698507.20 (  0.37%)
unixbench-dhry2reg   Hmean    unixbench-dhry2reg-512  6268185467.60 (  0.00%)   6274727387.00 (  0.10%)
unixbench-syscall    Amean    unixbench-syscall-1        2685321.17 (  0.00%)      2701231.47 *  0.59%*
unixbench-syscall    Amean    unixbench-syscall-512      7291476.20 (  0.00%)      7886916.33 *  8.17%*
unixbench-pipe       Hmean    unixbench-pipe-1           2480858.53 (  0.00%)      2585426.98 *  4.22%*
unixbench-pipe       Hmean    unixbench-pipe-512       300739256.62 (  0.00%)    306860971.25 *  2.04%*
unixbench-spawn      Hmean    unixbench-spawn-1             4358.14 (  0.00%)         4393.55 (  0.81%)
unixbench-spawn      Hmean    unixbench-spawn-512          76497.32 (  0.00%)        76175.34 * -0.42%*
unixbench-execl      Hmean    unixbench-execl-1             4147.12 (  0.00%)         4164.73 *  0.42%*
unixbench-execl      Hmean    unixbench-execl-512          12435.26 (  0.00%)        12669.59 (  1.88%)

o NPS2

Test                 Metric   Parallelism                 tip                      sis_short
unixbench-dhry2reg   Hmean    unixbench-dhry2reg-1      48872335.50 (  0.00%)     48880544.40 (  0.02%)
unixbench-dhry2reg   Hmean    unixbench-dhry2reg-512  6264134378.20 (  0.00%)   6269767014.50 (  0.09%)
unixbench-syscall    Amean    unixbench-syscall-1        2683903.13 (  0.00%)      2698851.60 *  0.56%*
unixbench-syscall    Amean    unixbench-syscall-512      7746773.60 (  0.00%)      7767971.50 (  0.27%)
unixbench-pipe       Hmean    unixbench-pipe-1           2476724.23 (  0.00%)      2587534.80 *  4.47%*
unixbench-pipe       Hmean    unixbench-pipe-512       300277350.41 (  0.00%)    306600469.40 *  2.11%*
unixbench-spawn      Hmean    unixbench-spawn-1             5026.50 (  0.00%)         4765.11 ( -5.20%)  *
unixbench-spawn      Hmean    unixbench-spawn-1             4965.80 (  0.00%)         5283.00 (  6.40%)  [Verification Run]
unixbench-spawn      Hmean    unixbench-spawn-512          80375.59 (  0.00%)        79331.12 * -1.30%*
unixbench-execl      Hmean    unixbench-execl-1             4151.70 (  0.00%)         4139.06 ( -0.30%)
unixbench-execl      Hmean    unixbench-execl-512          13605.15 (  0.00%)        11898.26 ( -12.55%) *
unixbench-execl      Hmean    unixbench-execl-512          12617.90 (  0.00%)        13735.00 (  8.85%)  [Verification Run]

o NPS4

Test                 Metric   Parallelism                 tip                      sis_short
unixbench-dhry2reg   Hmean    unixbench-dhry2reg-1      48506771.20 (  0.00%)     48886194.50 (  0.78%)
unixbench-dhry2reg   Hmean    unixbench-dhry2reg-512  6280954362.50 (  0.00%)   6289332433.50 (  0.13%)
unixbench-syscall    Amean    unixbench-syscall-1        2687259.30 (  0.00%)      2700170.93 *  0.48%*
unixbench-syscall    Amean    unixbench-syscall-512      7350275.67 (  0.00%)      7858736.83 *  6.92%*
unixbench-pipe       Hmean    unixbench-pipe-1           2478893.01 (  0.00%)      2585741.42 *  4.31%*
unixbench-pipe       Hmean    unixbench-pipe-512       301830155.61 (  0.00%)    307556537.55 *  1.90%*
unixbench-spawn      Hmean    unixbench-spawn-1             5208.55 (  0.00%)         5280.85 (  1.39%)
unixbench-spawn      Hmean    unixbench-spawn-512          80745.79 (  0.00%)        80411.55 * -0.41%*
unixbench-execl      Hmean    unixbench-execl-1             4072.72 (  0.00%)         4152.37 *  1.96%*
unixbench-execl      Hmean    unixbench-execl-512          13746.56 (  0.00%)        13247.30 ( -3.63%)  *
unixbench-execl      Hmean    unixbench-execl-512          13797.40 (  0.00%)        13624.00 ( -1.25%)  [Verification Run]

> [..snip..]
>>
>> All numbers are with turbo and C2 enabled. I wonder if the
>> check "(5 * nr < 3 * sd->span_weight)" in v2 helped workloads
>> like tbench and SpecJBB. I'll queue some runs with the condition
>> added back and a separate run with turbo and C2 disabled to see
>> if they help. I'll update the thread once the results are in.
> Thanks for helping check if the nr part in v2 could bring the improvement
> back. However Peter seems to have concern regarding the nr check, I'll
> think about it more.

SpecJBB Critical-jOPS performance is known to suffer when tasks queue
behind each other. I'll share the data separately. I do see the average
wait_sum go up by 1.3%. The Max-jOPS throughput, however, is identical on
both kernels, which means sis_short does not affect the overall throughput;
only for the critical jobs do we see a regression, due to possible queuing
of tasks.

>
> [..snip..]
>

--
Thanks and Regards,
Prateek
On 2023-01-16 at 16:23:13 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 12/30/2022 8:17 AM, Chen Yu wrote:
> > On 2022-12-29 at 12:46:59 +0530, K Prateek Nayak wrote:
> >> Hello Chenyu,
> >>
> >> Including the detailed results from testing below.
> >>
> >> tl;dr
> >>
> >> o There seem to be 3 noticeable regressions:
> >>    - tbench for lower numbers of clients. The schedstat data shows
> >>      an increase in wait time.
> >>    - SpecJBB MultiJVM performance drops as the workload prefers
> >>      an idle CPU over a busy one.
> >>    - Unixbench-pipe benchmark performance drops.
> >>
> >> o Most benchmark numbers remain the same.
> >>
> >> o Small gains seen for ycsb-mongodb and unixbench-syscall.
> >>
>
> Please ignore the last test results. The tests did not use exactly the
> same config for the tip and sis_short kernels, which led to more overhead
> in the network stack for the sis_short kernel; the longer wait time seen
> in the schedstat data for tbench was a result of each loop taking longer
> to finish.
>
> I reran the benchmarks on the latest tip, making sure the configs are
> identical this time, and only noticed one regression, in SpecJBB
> Critical-jOPS.
>
> tl;dr
>
> o tbench sees good improvement in throughput when the machine is fully
>   loaded and beyond.
> o Some unixbench test cases show improvement, as does ycsb-mongodb in
>   NPS2 and NPS4 mode.
> o Most benchmark results are the same.
> o SpecJBB Critical-jOPS is still down. I'll share the full schedstat
>   dump for tasks separately with you.
>
Thanks Prateek! I checked the task duration for these workloads; they fall
into the short-duration task range, so SIS_SHORT takes effect.

[snip]

> SpecJBB Critical-jOPS performance is known to suffer when tasks queue
> behind each other. I'll share the data separately. I do see the average
> wait_sum go up by 1.3%. The Max-jOPS throughput, however, is identical on
> both kernels, which means sis_short does not affect the overall throughput;
> only for the critical jobs do we see a regression, due to possible queuing
> of tasks.
>
If I understand correctly, most workloads on Zen3 prefer to be spread
across idle CPUs in the same LLC, except when the system is extremely busy
(and SIS_UTIL handles that case). As Zen3 has 8C/16T per LLC, it is unlikely
to trigger the race condition I described in PATCH 2/2. I'm preparing a
patch to also take nr_llc into consideration.

thanks,
Chenyu

> --
> Thanks and Regards,
> Prateek
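[Editorial note: for context on the "short duration task range" mentioned above, here is a minimal sketch of the duration check as described in the cover letter, with the threshold of sysctl_sched_min_granularity / 8. The helper name and field access are assumptions for illustration, not a verbatim copy of PATCH 1/2.]

```c
/*
 * Sketch of the short-duration check described in the cover letter;
 * helper name and field layout are assumed, not copied from the patch.
 * A task is considered "short" if its average run duration is below
 * sysctl_sched_min_granularity / 8.
 */
static inline int is_short_task(struct task_struct *p)
{
	return sched_feat(SIS_SHORT) && p->se.dur_avg &&
	       ((p->se.dur_avg * 8) < sysctl_sched_min_granularity);
}
```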