Barry Song first pointed out that replacing sync wakeup with regular wakeup
seems to reduce overeager wakeup pulling and shows a noticeable performance
improvement. [1]
This patch argues that allowing sync for wakeups from interrupt context
is a bug and fixing it can improve performance even when irq/softirq is
evenly spread out.
For wakeups from an ISR, the waking CPU is simply the CPU that handles the
interrupt, and the so-called waker can be any random task that happens to be
running on that CPU when the interrupt comes in. This is completely different
from other wakeups, where the task running on the waking CPU is the actual
waker, e.g. two tasks communicating through a pipe or multiple tasks accessing
the same critical section. The difference matters because with sync we assume
the waker will get off the runqueue and go to sleep immediately after the
wakeup. That assumption is built into wake_affine(), which discounts the
waker's presence from the runqueue when sync is true. A random waker
interrupted by an ISR bears no relation to the wakee and doesn't usually go to
sleep immediately afterwards, unless wakeup granularity is reached. Moreover,
the scheduler no longer enforces preemption of the waker for sync wakeups, as
it did before commit f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature").
Enforcing sync wakeup preemption for wakeups from interrupt context doesn't
seem appropriate either, but at least sync wakeup would then do what it is
supposed to do.
Add a check so that sync can only be set when in_task() is true. If a wakeup
is from interrupt context, the sync flag will be 0 because in_task() returns 0.
Tested in two scenarios: wakeups from 1) task contexts, expected to see no
performance changes; 2) interrupt contexts, expected to see better performance
under low/medium load and no regression under heavy load.
Used hackbench for scenario 1 and pgbench for scenario 2, both from mmtests,
on a 2-socket Xeon E5-2699v3 box with 256G of memory in total, running a 5.18
kernel with SELinux disabled. Scheduler/MM tunables are all default. The
irqbalance daemon is active.
Hackbench (config-scheduler-unbound)
=========
process-pipes:
Baseline Patched
Amean 1 0.4300 ( 0.00%) 0.4420 ( -2.79%)
Amean 4 0.8757 ( 0.00%) 0.8774 ( -0.20%)
Amean 7 1.3712 ( 0.00%) 1.3789 ( -0.56%)
Amean 12 2.3541 ( 0.00%) 2.3714 ( -0.73%)
Amean 21 4.2229 ( 0.00%) 4.2439 ( -0.50%)
Amean 30 5.9113 ( 0.00%) 5.9451 ( -0.57%)
Amean 48 9.3873 ( 0.00%) 9.4898 ( -1.09%)
Amean 79 15.9281 ( 0.00%) 16.1385 ( -1.32%)
Amean 110 22.0961 ( 0.00%) 22.3433 ( -1.12%)
Amean 141 28.2973 ( 0.00%) 28.6209 ( -1.14%)
Amean 172 34.4709 ( 0.00%) 34.9347 ( -1.35%)
Amean 203 40.7621 ( 0.00%) 41.2497 ( -1.20%)
Amean 234 47.0416 ( 0.00%) 47.6470 ( -1.29%)
Amean 265 53.3048 ( 0.00%) 54.1625 ( -1.61%)
Amean 288 58.0595 ( 0.00%) 58.8096 ( -1.29%)
process-sockets:
Baseline Patched
Amean 1 0.6776 ( 0.00%) 0.6611 ( 2.43%)
Amean 4 2.6183 ( 0.00%) 2.5769 ( 1.58%)
Amean 7 4.5662 ( 0.00%) 4.4801 ( 1.89%)
Amean 12 7.7638 ( 0.00%) 7.6201 ( 1.85%)
Amean 21 13.5335 ( 0.00%) 13.2915 ( 1.79%)
Amean 30 19.3369 ( 0.00%) 18.9811 ( 1.84%)
Amean 48 31.0724 ( 0.00%) 30.6015 ( 1.52%)
Amean 79 51.1881 ( 0.00%) 50.4251 ( 1.49%)
Amean 110 71.3399 ( 0.00%) 70.4578 ( 1.24%)
Amean 141 91.4675 ( 0.00%) 90.3769 ( 1.19%)
Amean 172 111.7463 ( 0.00%) 110.3947 ( 1.21%)
Amean 203 131.6927 ( 0.00%) 130.3270 ( 1.04%)
Amean 234 151.7459 ( 0.00%) 150.1320 ( 1.06%)
Amean 265 171.9101 ( 0.00%) 169.9751 ( 1.13%)
Amean 288 186.9231 ( 0.00%) 184.7706 ( 1.15%)
thread-pipes:
Baseline Patched
Amean 1 0.4523 ( 0.00%) 0.4535 ( -0.28%)
Amean 4 0.9041 ( 0.00%) 0.9085 ( -0.48%)
Amean 7 1.4111 ( 0.00%) 1.4146 ( -0.25%)
Amean 12 2.3532 ( 0.00%) 2.3688 ( -0.66%)
Amean 21 4.1550 ( 0.00%) 4.1701 ( -0.36%)
Amean 30 6.1043 ( 0.00%) 6.2391 ( -2.21%)
Amean 48 10.2077 ( 0.00%) 10.3511 ( -1.40%)
Amean 79 16.7922 ( 0.00%) 17.0427 ( -1.49%)
Amean 110 23.3350 ( 0.00%) 23.6522 ( -1.36%)
Amean 141 29.6458 ( 0.00%) 30.2617 ( -2.08%)
Amean 172 35.8649 ( 0.00%) 36.4225 ( -1.55%)
Amean 203 42.4477 ( 0.00%) 42.8332 ( -0.91%)
Amean 234 48.7117 ( 0.00%) 49.4042 ( -1.42%)
Amean 265 54.9171 ( 0.00%) 55.6551 ( -1.34%)
Amean 288 59.5282 ( 0.00%) 60.2903 ( -1.28%)
thread-sockets:
Baseline Patched
Amean 1 0.6917 ( 0.00%) 0.6892 ( 0.37%)
Amean 4 2.6651 ( 0.00%) 2.6017 ( 2.38%)
Amean 7 4.6734 ( 0.00%) 4.5637 ( 2.35%)
Amean 12 8.0156 ( 0.00%) 7.8079 ( 2.59%)
Amean 21 14.0451 ( 0.00%) 13.6679 ( 2.69%)
Amean 30 20.0963 ( 0.00%) 19.5657 ( 2.64%)
Amean 48 32.4115 ( 0.00%) 31.6001 ( 2.50%)
Amean 79 53.1104 ( 0.00%) 51.8395 ( 2.39%)
Amean 110 74.0929 ( 0.00%) 72.4391 ( 2.23%)
Amean 141 95.1506 ( 0.00%) 93.0992 ( 2.16%)
Amean 172 116.1969 ( 0.00%) 113.8307 ( 2.04%)
Amean 203 137.4413 ( 0.00%) 134.5247 ( 2.12%)
Amean 234 158.5395 ( 0.00%) 155.2793 ( 2.06%)
Amean 265 179.7729 ( 0.00%) 176.1099 ( 2.04%)
Amean 288 195.5573 ( 0.00%) 191.3977 ( 2.13%)
Pgbench (config-db-pgbench-timed-ro-small)
=======
Baseline Patched
Hmean 1 68.54 ( 0.00%) 69.72 ( 1.71%)
Hmean 6 27725.78 ( 0.00%) 34119.11 ( 23.06%)
Hmean 12 55724.26 ( 0.00%) 63158.22 ( 13.34%)
Hmean 22 72806.26 ( 0.00%) 73389.98 ( 0.80%)
Hmean 30 79000.45 ( 0.00%) 75197.02 ( -4.81%)
Hmean 48 78180.16 ( 0.00%) 75074.09 ( -3.97%)
Hmean 80 75001.93 ( 0.00%) 70590.72 ( -5.88%)
Hmean 110 74812.25 ( 0.00%) 74128.57 ( -0.91%)
Hmean 142 74261.05 ( 0.00%) 72910.48 ( -1.82%)
Hmean 144 75375.35 ( 0.00%) 71295.72 ( -5.41%)
For hackbench, +-3% fluctuation is normal on this two-socket box, so it's safe
to conclude that there are no performance differences for wakeups from task
context.
For pgbench, after many runs, 10~30% gains are very consistent at lower
client counts (< #cores per socket). For higher client counts, both kernels
are very close; +-5% swings are quite common. Also, NET_TX|RX softirq load
does spread out across both NUMA nodes in this test, so there is no need to do
any explicit RPS/RFS.
[1]: https://lkml.org/lkml/2021/11/5/234
Signed-off-by: Libo Chen <libo.chen@oracle.com>
---
kernel/sched/fair.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..59b210d2cdb5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6704,7 +6704,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
{
- int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
+ /* Don't set sync for wakeup from irq/soft ctx */
+ int sync = in_task() && (wake_flags & WF_SYNC)
+ && !(current->flags & PF_EXITING);
struct sched_domain *tmp, *sd = NULL;
int cpu = smp_processor_id();
int new_cpu = prev_cpu;
--
2.31.1
Hello Libo and Peter,
tl;dr
- We observed a major regression with tbench when testing the latest tip
sched/core at:
commit 14b3f2d9ee8d "sched/fair: Disallow sync wakeup from interrupt context"
The reason for the regression is the smaller number of affine wakeups,
which leaves the client farther away from the data it needs to consume next,
primed in the waker's LLC.
Such regressions can be expected from tasks that use sockets to communicate
significant amounts of data, especially on systems with multiple LLCs.
- Other benchmarks behave comparably to the tip at the previous
commit 91caa5ae2424 "sched/core: Fix the bug that task won't enqueue
into core tree when update cookie"
I'll leave more details below.
On 7/12/2022 4:17 AM, Libo Chen wrote:
> [..snip..]
The two test kernels used are:
- tip at commit: 14b3f2d9ee8d "sched/fair: Disallow sync wakeup from interrupt context"
- tip at commit: 91caa5ae2424 "sched/core: Fix the bug that task won't enqueue into core tree when update cookie"
Below are the tbench results on a dual-socket Zen3 machine
running in NPS1 mode. The NUMA configuration is:
- NPS1: Each socket is a NUMA node, for a total of 2 NUMA nodes
in the dual-socket machine.
Node 0: 0-63, 128-191
Node 1: 64-127, 192-255
Clients: tip (91caa5ae2424) tip (14b3f2d9ee8d)
1 569.24 (0.00 pct) 283.63 (-50.17 pct) *
2 1104.76 (0.00 pct) 590.45 (-46.55 pct) *
4 2058.78 (0.00 pct) 1080.63 (-47.51 pct) *
8 3590.20 (0.00 pct) 2098.05 (-41.56 pct) *
16 6119.21 (0.00 pct) 4348.40 (-28.93 pct) *
32 11383.91 (0.00 pct) 8417.55 (-26.05 pct) *
64 21910.01 (0.00 pct) 19405.11 (-11.43 pct) *
128 33105.27 (0.00 pct) 29791.80 (-10.00 pct) *
256 45550.99 (0.00 pct) 45847.10 (0.65 pct)
512 57753.81 (0.00 pct) 49481.17 (-14.32 pct) *
1024 55684.33 (0.00 pct) 48748.38 (-12.45 pct) *
This regression is consistently reproducible.
Below are the statistics gathered from schedstat data:
Kernel : tip + remove 14b3f2d9ee8d tip
HEAD commit : 91caa5ae2424 14b3f2d9ee8d
sched_yield cnt : 11 17
Legacy counter can be ignored : 0 0
schedule called : 12621212 15104022
schedule left the processor idle : 6306653 ( 49.96% of times ) 7547375 ( 49.96% of times )
try_to_wake_up was called : 6310778 7552147
try_to_wake_up was called to wake up the local cpu : 12305 ( 0.19% of times ) 12816 ( 0.16% of times )
total time by tasks on this processor (in jiffies) : 78497712520 72461915902
total time waiting tasks on this processor (in jiffies) : 56398803 ( 0.07% of times ) 34631178 ( 0.04% of times )
total timeslices run on this cpu : 6314548 7556630
Wakeups on same SMT         : 39 ( 0.00062 )              263 ( 0.00348 )
Wakeups on same MC          : 6297015 ( 99.78% of time )  1079 ( 0.01429 )
Wakeups on same DIE         : 559 ( 0.00886 )             7537909 ( 99.81147 )
Wakeups on same NUMA        : 860 ( 0.01363 )             80 ( 0.00106 )
Affine wakeups on same SMT  : 25 ( 0.00040 )              255 ( 0.00338 )
Affine wakeups on same MC   : 6282684 ( 99.55% of time )  961 ( 0.01272 )
Affine wakeups on same DIE  : 523 ( 0.00829 )             7537220 ( 99.80235 )
Affine wakeups on same NUMA : 839 ( 0.01329 )             46 ( 0.00061 )
Note: With the patch, the task prefers to wake on the same LLC where it
previously ran rather than the waker's LLC. This degrades performance because
the task ends up farther away from the data it will consume next.
All the statistics are comparable except for the reduced number of affine
wakeups to the waker's LLC, which results in the task being placed on its
previous LLC, farther away from the data residing in the waker's LLC that the
wakee will consume next.
All wakeups in tbench happen in in_serving_softirq() context, making
in_task() false in all the cases where sync would otherwise have been true.
We wanted to highlight that there are workloads which would still benefit
from affine wakeups even when they happen in interrupt context. It would be
great if we could spot such cases and bias wakeups towards the waker's LLC.
Results from other benchmarks are comparable to the tip in most cases.
All benchmarks were run on a machine configured in NPS1 mode.
Following are the results:
~~~~~~~~~
hackbench
~~~~~~~~~
Test: tip (91caa5ae2424) tip (14b3f2d9ee8d)
1-groups: 4.22 (0.00 pct) 4.48 (-6.16 pct) *
1-groups: 4.22 (0.00 pct) 4.30 (-1.89 pct) [Verification run]
2-groups: 5.01 (0.00 pct) 4.87 (2.79 pct)
4-groups: 5.49 (0.00 pct) 5.34 (2.73 pct)
8-groups: 5.64 (0.00 pct) 5.50 (2.48 pct)
16-groups: 7.54 (0.00 pct) 7.34 (2.65 pct)
~~~~~~~~
schbench
~~~~~~~~
#workers: tip (91caa5ae2424) tip (14b3f2d9ee8d)
1: 22.00 (0.00 pct) 22.00 (0.00 pct)
2: 22.00 (0.00 pct) 27.00 (-22.72 pct) * Known to have run to run
4: 33.00 (0.00 pct) 38.00 (-15.15 pct) * variations.
8: 48.00 (0.00 pct) 51.00 (-6.25 pct) *
16: 70.00 (0.00 pct) 70.00 (0.00 pct)
32: 118.00 (0.00 pct) 120.00 (-1.69 pct)
64: 217.00 (0.00 pct) 223.00 (-2.76 pct)
128: 485.00 (0.00 pct) 488.00 (-0.61 pct)
256: 1066.00 (0.00 pct) 1054.00 (1.12 pct)
512: 47296.00 (0.00 pct) 47168.00 (0.27 pct)
Note: schbench results at lower worker count have a large
run-to-run variance and depends on certain characteristics
of new-idle balance.
~~~~~~
stream
~~~~~~
- 10 runs
Test: tip (91caa5ae2424) tip (14b3f2d9ee8d)
Copy: 336140.45 (0.00 pct) 334362.29 (-0.52 pct)
Scale: 214679.13 (0.00 pct) 218016.44 (1.55 pct)
Add: 251691.67 (0.00 pct) 249734.04 (-0.77 pct)
Triad: 262174.93 (0.00 pct) 260063.57 (-0.80 pct)
- 100 runs
Test: tip (91caa5ae2424) tip (14b3f2d9ee8d)
Copy: 336576.38 (0.00 pct) 334646.27 (-0.57 pct)
Scale: 219124.86 (0.00 pct) 223480.29 (1.98 pct)
Add: 251796.93 (0.00 pct) 250845.95 (-0.37 pct)
Triad: 262286.47 (0.00 pct) 258020.57 (-1.62 pct)
~~~~~~~~~~~~
ycsb-mongodb
~~~~~~~~~~~~
tip (91caa5ae2424): 290479.00 (var: 1.53)
tip (14b3f2d9ee8d): 287361.67 (var: 0.80) (-1.07 pct)
>
> [..snip..]
>
We also ran tbench on a dual-socket Icelake Xeon system (2 x 32C/64T,
Intel Xeon Platinum 8362) and the following are the results:
Clients: tip (91caa5ae2424) tip (14b3f2d9ee8d)
1 131.56 (0.00 pct) 145.05 (10.25 pct)
2 279.46 (0.00 pct) 245.94 (-11.99 pct) *
4 552.19 (0.00 pct) 577.94 (4.66 pct)
8 1176.12 (0.00 pct) 1309.40 (11.33 pct)
16 2685.86 (0.00 pct) 2724.96 (1.45 pct)
32 11582.50 (0.00 pct) 10817.42 (-6.60 pct) *
64 18309.47 (0.00 pct) 18287.92 (-0.11 pct)
128 15689.97 (0.00 pct) 15322.08 (-2.34 pct)
256 37130.34 (0.00 pct) 37332.91 (0.54 pct)
512 36448.68 (0.00 pct) 36162.17 (-0.78 pct)
Both kernel versions show a lot of run-to-run variance, but the two
highlighted cases are relatively more stable, and the case with 32 clients is
outside the NUMA imbalance threshold.
Below are detailed stats of the individual runs on the Icelake machine:
Kernel: tip tip
(91caa5ae2424) (14b3f2d9ee8d)
o 1 Clients
Min : 126.23 128.01
Max : 137.75 168.99
Median : 131.20 143.83
AMean : 131.73 146.94
GMean : 131.64 145.99
HMean : 131.56 145.05
AMean Stddev : 5.78 20.66
AMean CoefVar : 4.39 pct 14.06 pct
o 2 clients *
Min : 261.71 234.26
Max : 293.38 253.03
Median : 285.27 251.43
AMean : 280.12 246.24
GMean : 279.79 246.09
HMean : 279.46 245.94
AMean Stddev : 16.45 10.40
AMean CoefVar : 5.87 pct 4.22 pct
o 4 clients
Min : 515.98 497.81
Max : 611.56 744.49
Median : 537.71 543.82
AMean : 555.08 595.37
GMean : 553.61 586.31
HMean : 552.19 577.94
AMean Stddev : 50.10 131.17
AMean CoefVar : 9.03 pct 22.03 pct
o 8 clients
Min : 1101.30 1178.59
Max : 1293.18 1442.30
Median : 1150.15 1334.56
AMean : 1181.54 1318.48
GMean : 1178.80 1313.97
HMean : 1176.12 1309.40
AMean Stddev : 99.72 132.59
AMean CoefVar : 8.44 pct 10.06 pct
o 16 clients
Min : 2478.03 2522.28
Max : 2829.35 3043.49
Median : 2777.96 2660.31
AMean : 2695.11 2742.03
GMean : 2690.54 2733.37
HMean : 2685.86 2724.96
AMean Stddev : 189.75 270.04
AMean CoefVar : 7.04 pct 9.85 pct
o 32 clients *
Min : 11479.70 10569.00
Max : 11670.50 11154.90
Median : 11598.90 10744.90
AMean : 11583.03 10822.93
GMean : 11582.77 10820.17
HMean : 11582.50 10817.42
AMean Stddev : 96.38 300.64
AMean CoefVar : 0.83 pct 2.78 pct
Please let me know if you would like me to gather any more data
on the test system.
--
Thanks and Regards,
Prateek
On 7/28/22 21:47, K Prateek Nayak wrote:
> Hello Libo and Peter,
>
> tl;dr
>
> - We observed a major regression with tbench when testing the latest tip
> sched/core at:
> commit 14b3f2d9ee8d "sched/fair: Disallow sync wakeup from interrupt context"
> Reason for the regression are the fewer affine wakeups that leaves the
> client farther away from the data it needs to consume next primed in the
> waker's LLC.
> Such regressions can be expected from tasks that use sockets to communicate
> significant amount of data especially on system with multiple LLCs.
>
> - Other benchmarks have a comparable behavior to the tip at previous commit
> commit : 91caa5ae2424 "sched/core: Fix the bug that task won't enqueue
> into core tree when update cookie"
>
> I'll leave more details below.
>
> On 7/12/2022 4:17 AM, Libo Chen wrote:
>> [..snip..]
> The two tests kernels used are:
>
> - tip at commit: 14b3f2d9ee8d "sched/fair: Disallow sync wakeup from interrupt context"
> - tip at commit: 91caa5ae2424 "sched/core: Fix the bug that task won't enqueue into core tree when update cookie"
>
> Below are the tbench result on a dual socket Zen3 machine
> running in NPS1 mode. Following is the NUMA configuration
> NPS1 mode:
>
> - NPS1: Each socket is a NUMA node.
> Total 2 NUMA nodes in the dual socket machine.
>
> Node 0: 0-63, 128-191
> Node 1: 64-127, 192-255
>
> Clients: tip (91caa5ae2424) tip (14b3f2d9ee8d)
> 1 569.24 (0.00 pct) 283.63 (-50.17 pct) *
> 2 1104.76 (0.00 pct) 590.45 (-46.55 pct) *
> 4 2058.78 (0.00 pct) 1080.63 (-47.51 pct) *
> 8 3590.20 (0.00 pct) 2098.05 (-41.56 pct) *
> 16 6119.21 (0.00 pct) 4348.40 (-28.93 pct) *
> 32 11383.91 (0.00 pct) 8417.55 (-26.05 pct) *
> 64 21910.01 (0.00 pct) 19405.11 (-11.43 pct) *
> 128 33105.27 (0.00 pct) 29791.80 (-10.00 pct) *
> 256 45550.99 (0.00 pct) 45847.10 (0.65 pct)
> 512 57753.81 (0.00 pct) 49481.17 (-14.32 pct) *
> 1024 55684.33 (0.00 pct) 48748.38 (-12.45 pct) *
>
> This regression is consistently reproducible.
I ran tbench with 1 client on my 8-node Zen2 machine because client counts
of 1~4 generally shouldn't be affected by this patch. I do see throughput
regress with the patch, but latency improves pretty much equally. Furthermore,
I don't see tbench tasks being separated into different LLC domains in my
ftrace output; they are almost always in the same CCXes.
What I do see is a lot fewer interrupts and context switches, and the average
CPU frequency is lower too with the patch. It is bizarre that Intel doesn't
seem to be impacted. I am trying to understand why right now.
> Below are the statistics gathered from schedstat data:
>
> Kernel : tip + remove 14b3f2d9ee8d tip
> HEAD commit : 91caa5ae2424 14b3f2d9ee8d
> sched_yield cnt : 11 17
> Legacy counter can be ignored : 0 0
> schedule called : 12621212 15104022
> schedule left the processor idle : 6306653 ( 49.96% of times ) 7547375 ( 49.96% of times )
> try_to_wake_up was called : 6310778 7552147
> try_to_wake_up was called to wake up the local cpu : 12305 ( 0.19% of times ) 12816 ( 0.16% of times )
> total time by tasks on this processor (in jiffies) : 78497712520 72461915902
> total time waiting tasks on this processor (in jiffies) : 56398803 ( 0.07% of times ) 34631178 ( 0.04% of times )
> total timeslices run on this cpu : 6314548 7556630
>
> Wakeups on same SMT : 39 ( 0.00062 ) 263 ( 0.00348 )
> Wakeups on same MC : 6297015 ( 99.78% of time ) <--- 1079 ( 0.01429 )
> Wakeups on same DIE : 559 ( 0.00886 ) 7537909 ( 99.81147 ) <--- With the patch, the task will prefer
I don't have a Zen3 right now. What are the spans of your MC and DIE
domains?
Thanks for the testing.
Libo
> [..snip..]
>
>> [..snip..]
>>
> We also ran tbench on a dual socket Icelake Xeon system (2 x 32C/64T)
> (Intel Xeon Platinum 8362) and following are the results:
>
> Clients: tip (91caa5ae2424) tip (14b3f2d9ee8d)
> 1 131.56 (0.00 pct) 145.05 (10.25 pct)
> 2 279.46 (0.00 pct) 245.94 (-11.99 pct) *
> 4 552.19 (0.00 pct) 577.94 (4.66 pct)
> 8 1176.12 (0.00 pct) 1309.40 (11.33 pct)
> 16 2685.86 (0.00 pct) 2724.96 (1.45 pct)
> 32 11582.50 (0.00 pct) 10817.42 (-6.60 pct) *
> 64 18309.47 (0.00 pct) 18287.92 (-0.11 pct)
> 128 15689.97 (0.00 pct) 15322.08 (-2.34 pct)
> 256 37130.34 (0.00 pct) 37332.91 (0.54 pct)
> 512 36448.68 (0.00 pct) 36162.17 (-0.78 pct)
>
> Both kernel versions show a lot of run-to-run variance but the two highlighted
> cases are relatively more stable and the case with 32 clients is outside of
> the NUMA imbalance threshold.
>
> Below are detailed stats of the individual runs on the Icelake machine:
>
> Kernel: tip tip
> (91caa5ae2424) (14b3f2d9ee8d)
>
> o 1 Clients
>
> Min : 126.23 128.01
> Max : 137.75 168.99
> Median : 131.20 143.83
> AMean : 131.73 146.94
> GMean : 131.64 145.99
> HMean : 131.56 145.05
> AMean Stddev : 5.78 20.66
> AMean CoefVar : 4.39 pct 14.06 pct
>
> o 2 clients *
>
> Min : 261.71 234.26
> Max : 293.38 253.03
> Median : 285.27 251.43
> AMean : 280.12 246.24
> GMean : 279.79 246.09
> HMean : 279.46 245.94
> AMean Stddev : 16.45 10.40
> AMean CoefVar : 5.87 pct 4.22 pct
>
> o 4 clients
>
> Min : 515.98 497.81
> Max : 611.56 744.49
> Median : 537.71 543.82
> AMean : 555.08 595.37
> GMean : 553.61 586.31
> HMean : 552.19 577.94
> AMean Stddev : 50.10 131.17
> AMean CoefVar : 9.03 pct 22.03 pct
>
> o 8 clients
>
> Min : 1101.30 1178.59
> Max : 1293.18 1442.30
> Median : 1150.15 1334.56
> AMean : 1181.54 1318.48
> GMean : 1178.80 1313.97
> HMean : 1176.12 1309.40
> AMean Stddev : 99.72 132.59
> AMean CoefVar : 8.44 pct 10.06 pct
>
> o 16 clients
>
> Min : 2478.03 2522.28
> Max : 2829.35 3043.49
> Median : 2777.96 2660.31
> AMean : 2695.11 2742.03
> GMean : 2690.54 2733.37
> HMean : 2685.86 2724.96
> AMean Stddev : 189.75 270.04
> AMean CoefVar : 7.04 pct 9.85 pct
>
> o 32 clients *
>
> Min : 11479.70 10569.00
> Max : 11670.50 11154.90
> Median : 11598.90 10744.90
> AMean : 11583.03 10822.93
> GMean : 11582.77 10820.17
> HMean : 11582.50 10817.42
> AMean Stddev : 96.38 300.64
> AMean CoefVar : 0.83 pct 2.78 pct
>
> Please let me know if you would like me to gather any more data
> on the test system.
> --
> Thanks and Regards,
> Prateek
Hello Libo,
Thank you for looking into this.
On 8/1/2022 8:27 PM, Libo Chen wrote:
>
>
> On 7/28/22 21:47, K Prateek Nayak wrote:
>> Hello Libo and Peter,
>>
>> tl;dr
>>
>> - We observed a major regression with tbench when testing the latest tip
>> sched/core at:
>> commit 14b3f2d9ee8d "sched/fair: Disallow sync wakeup from interrupt context"
>> Reason for the regression are the fewer affine wakeups that leaves the
>> client farther away from the data it needs to consume next primed in the
>> waker's LLC.
>> Such regressions can be expected from tasks that use sockets to communicate
>> significant amount of data especially on system with multiple LLCs.
>>
>> - Other benchmarks have a comparable behavior to the tip at previous commit
>> commit : 91caa5ae2424 "sched/core: Fix the bug that task won't enqueue
>> into core tree when update cookie"
>>
>> I'll leave more details below.
>>
>> On 7/12/2022 4:17 AM, Libo Chen wrote:
>>> [..snip..]
>>
>> The two tests kernels used are:
>>
>> - tip at commit: 14b3f2d9ee8d "sched/fair: Disallow sync wakeup from interrupt context"
>> - tip at commit: 91caa5ae2424 "sched/core: Fix the bug that task won't enqueue into core tree when update cookie"
>>
>> Below are the tbench result on a dual socket Zen3 machine
>> running in NPS1 mode. Following is the NUMA configuration
>> NPS1 mode:
>>
>> - NPS1: Each socket is a NUMA node.
>> Total 2 NUMA nodes in the dual socket machine.
>>
>> Node 0: 0-63, 128-191
>> Node 1: 64-127, 192-255
>>
>> Clients: tip (91caa5ae2424) tip (14b3f2d9ee8d)
>> 1 569.24 (0.00 pct) 283.63 (-50.17 pct) *
>> 2 1104.76 (0.00 pct) 590.45 (-46.55 pct) *
>> 4 2058.78 (0.00 pct) 1080.63 (-47.51 pct) *
>> 8 3590.20 (0.00 pct) 2098.05 (-41.56 pct) *
>> 16 6119.21 (0.00 pct) 4348.40 (-28.93 pct) *
>> 32 11383.91 (0.00 pct) 8417.55 (-26.05 pct) *
>> 64 21910.01 (0.00 pct) 19405.11 (-11.43 pct) *
>> 128 33105.27 (0.00 pct) 29791.80 (-10.00 pct) *
>> 256 45550.99 (0.00 pct) 45847.10 (0.65 pct)
>> 512 57753.81 (0.00 pct) 49481.17 (-14.32 pct) *
>> 1024 55684.33 (0.00 pct) 48748.38 (-12.45 pct) *
>>
>> This regression is consistently reproducible.
> I ran tbench with 1 client on my 8-node Zen2 machine because 1~4 client
> counts generally shouldn't be affected by this patch. I do see throughput
> regress with the patch but the latency improves pretty much equally.
> Furthermore, I also don't see tbench tasks being separated into different
> LLC domains in my ftrace; they are almost always in the same CCXes. What
> I do see is that there are a lot fewer interrupts and context switches,
> and the average CPU frequency is lower too with the patch. This is
> bizarre, given that Intel doesn't seem to be impacted. Trying to
> understand why right now.
Thank you for analyzing this. I see a drop in max latency with the patch
but the average latency has increased on the patched kernel. Following are
the logs from one of the runs for the 1 client case:
- tip (91caa5ae2424)
Operation Count AvgLat MaxLat
--------------------------------------------------
Deltree 28 0.000 0.001
Flush 76361 0.008 0.018
Close 800338 0.008 0.080
LockX 3546 0.008 0.015
Mkdir 14 0.008 0.009
Rename 46131 0.008 0.050
ReadX 1707761 0.009 0.139
WriteX 543274 0.012 0.092
Unlink 220019 0.008 0.083
UnlockX 3546 0.008 0.016
FIND_FIRST 381795 0.008 0.079
SET_FILE_INFORMATION 88740 0.008 0.080
QUERY_FILE_INFORMATION 173062 0.008 0.061
QUERY_PATH_INFORMATION 987524 0.008 0.070
QUERY_FS_INFORMATION 181068 0.008 0.049
NTCreateX 1089543 0.008 0.083
Throughput 570.36 MB/sec 1 clients 1 procs max_latency=0.140 ms
- tip (14b3f2d9ee8d)
Operation Count AvgLat MaxLat
--------------------------------------------------
Deltree 14 0.000 0.001
Flush 38993 0.017 0.059
Close 408696 0.017 0.085
LockX 1810 0.017 0.023
Mkdir 7 0.016 0.017
Rename 23555 0.017 0.052
ReadX 871996 0.018 0.097
WriteX 277343 0.025 0.105
Unlink 112357 0.017 0.055
UnlockX 1810 0.017 0.023
FIND_FIRST 194961 0.017 0.089
SET_FILE_INFORMATION 45312 0.017 0.032
QUERY_FILE_INFORMATION 88356 0.017 0.078
QUERY_PATH_INFORMATION 504289 0.017 0.119
QUERY_FS_INFORMATION 92460 0.017 0.085
NTCreateX 556374 0.017 0.097
Throughput 291.163 MB/sec 1 clients 1 procs max_latency=0.119 ms
I had only analyzed the schedstat data which showed a clear shift
in the number of affine wakeups. I haven't measured the average CPU
frequency during the runs. The numbers reported are with the performance
governor. I'll try to get more data on the CPU frequency front.
>> Below are the statistics gathered from schedstat data:
>>
>> Kernel : tip + remove 14b3f2d9ee8d tip
>> HEAD commit : 91caa5ae2424 14b3f2d9ee8d
>> sched_yield cnt : 11 17
>> Legacy counter can be ignored : 0 0
>> schedule called : 12621212 15104022
>> schedule left the processor idle : 6306653 ( 49.96% of times ) 7547375 ( 49.96% of times )
>> try_to_wake_up was called : 6310778 7552147
>> try_to_wake_up was called to wake up the local cpu : 12305 ( 0.19% of times ) 12816 ( 0.16% of times )
>> total time by tasks on this processor (in jiffies) : 78497712520 72461915902
>> total time waiting tasks on this processor (in jiffies) : 56398803 ( 0.07% of times ) 34631178 ( 0.04% of times )
>> total timeslices run on this cpu : 6314548 7556630
>>
>> Wakeups on same SMT : 39 ( 0.00062 ) 263 ( 0.00348 )
>> Wakeups on same MC : 6297015 ( 99.78% of time ) <--- 1079 ( 0.01429 )
>> Wakeups on same DIE : 559 ( 0.00886 ) 7537909 ( 99.81147 ) <--- With the patch, the task will prefer
> I don't have a zen3 right now. What is the span of your MC domain as well as DIE?
On Zen3, a group in the MC domain consists of the 16 CPUs on the same CCD.
On a dual socket Zen3 system (2 x 64C/128T) running in NPS1 mode,
the DIE domain consists of all the CPUs on the same socket. There are two
DIE groups in the dual socket test system. Following are the spans of each:
- DIE0: 0-63,128-191
DIE 0 MC 0: 0-7,128-135
DIE 0 MC 1: 8-15,136-143
DIE 0 MC 2: 16-23,144-151
DIE 0 MC 3: 24-31,152-159
DIE 0 MC 4: 32-39,160-167
DIE 0 MC 5: 40-47,168-175
DIE 0 MC 6: 48-55,176-183
DIE 0 MC 7: 56-63,184-191
- DIE1: 64-127,192-255
DIE 1 MC 0: 64-71,192-199
DIE 1 MC 1: 72-79,200-207
DIE 1 MC 2: 80-87,208-215
DIE 1 MC 3: 88-95,216-223
DIE 1 MC 4: 96-103,224-231
DIE 1 MC 5: 104-111,232-239
DIE 1 MC 6: 112-119,240-247
DIE 1 MC 7: 120-127,248-255
>
> Thanks for the testing.
>
> Libo
>> Wakeups on same NUMA : 860 ( 0.01363 ) 80 ( 0.00106 ) to wake on the same LLC where it previously
>> Affine wakeups on same SMT : 25 ( 0.00040 ) 255 ( 0.00338 ) ran as compared to the LLC of waker.
>> Affine wakeups on same MC : 6282684 ( 99.55% of time ) <--- 961 ( 0.01272 ) This results in performance degradation as
>> Affine wakeups on same DIE : 523 ( 0.00829 ) 7537220 ( 99.80235 ) <--- the task is farther away from data it will
>> Affine wakeups on same NUMA : 839 ( 0.01329 ) 46 ( 0.00061 ) consume next.
>>
>> All the statistics are comparable except for the reduced number of affine
>> wakeups on the waker's LLC, resulting in the task being placed on its previous
>> LLC, farther away from the data residing in the waker's LLC that the wakee
>> will consume next.
>>
>> All wakeups in tbench happen in in_serving_softirq() context, making in_task()
>> false for all the cases where sync would have been true otherwise.
>>
>> We wanted to highlight there are workloads which would still benefit from
>> affine wakeups even when it happens in an interrupt context. It would be
>> great if we could spot such cases and bias wakeups towards waker's LLC.
>>
>> Other benchmarks results are comparable to the tip in most cases.
>> All benchmarks were run on machine configured in NPS1 mode.
>> Following are the results:
>>
>> ~~~~~~~~~
>> hackbench
>> ~~~~~~~~~
>>
>> Test: tip (91caa5ae2424) tip (14b3f2d9ee8d)
>> 1-groups: 4.22 (0.00 pct) 4.48 (-6.16 pct) *
>> 1-groups: 4.22 (0.00 pct) 4.30 (-1.89 pct) [Verification run]
>> 2-groups: 5.01 (0.00 pct) 4.87 (2.79 pct)
>> 4-groups: 5.49 (0.00 pct) 5.34 (2.73 pct)
>> 8-groups: 5.64 (0.00 pct) 5.50 (2.48 pct)
>> 16-groups: 7.54 (0.00 pct) 7.34 (2.65 pct)
>>
>> ~~~~~~~~
>> schbench
>> ~~~~~~~~
>>
>> #workers: tip (91caa5ae2424) tip (14b3f2d9ee8d)
>> 1: 22.00 (0.00 pct) 22.00 (0.00 pct)
>> 2: 22.00 (0.00 pct) 27.00 (-22.72 pct) * Known to have run to run
>> 4: 33.00 (0.00 pct) 38.00 (-15.15 pct) * variations.
>> 8: 48.00 (0.00 pct) 51.00 (-6.25 pct) *
>> 16: 70.00 (0.00 pct) 70.00 (0.00 pct)
>> 32: 118.00 (0.00 pct) 120.00 (-1.69 pct)
>> 64: 217.00 (0.00 pct) 223.00 (-2.76 pct)
>> 128: 485.00 (0.00 pct) 488.00 (-0.61 pct)
>> 256: 1066.00 (0.00 pct) 1054.00 (1.12 pct)
>> 512: 47296.00 (0.00 pct) 47168.00 (0.27 pct)
>>
>> Note: schbench results at lower worker counts have a large
>> run-to-run variance and depend on certain characteristics
>> of new-idle balance.
>>
>> ~~~~~~
>> stream
>> ~~~~~~
>>
>> - 10 runs
>>
>> Test: tip (91caa5ae2424) tip (14b3f2d9ee8d)
>> Copy: 336140.45 (0.00 pct) 334362.29 (-0.52 pct)
>> Scale: 214679.13 (0.00 pct) 218016.44 (1.55 pct)
>> Add: 251691.67 (0.00 pct) 249734.04 (-0.77 pct)
>> Triad: 262174.93 (0.00 pct) 260063.57 (-0.80 pct)
>>
>> - 100 runs
>>
>> Test: tip (91caa5ae2424) tip (14b3f2d9ee8d)
>> Copy: 336576.38 (0.00 pct) 334646.27 (-0.57 pct)
>> Scale: 219124.86 (0.00 pct) 223480.29 (1.98 pct)
>> Add: 251796.93 (0.00 pct) 250845.95 (-0.37 pct)
>> Triad: 262286.47 (0.00 pct) 258020.57 (-1.62 pct)
>>
>> ~~~~~~~~~~~~
>> ycsb-mongodb
>> ~~~~~~~~~~~~
>>
>> tip (91caa5ae2424): 290479.00 (var: 1.53)
>> tip (14b3f2d9ee8d): 287361.67 (var: 0.80) (-1.07 pct)
>>
>>> [..snip..]
>
Thank you again for looking into this issue and for sharing the
observations on the Zen2 system.
--
Thanks and Regards,
Prateek
* K Prateek Nayak <kprateek.nayak@amd.com> wrote:

> Hello Libo and Peter,
>
> tl;dr
>
> - We observed a major regression with tbench when testing the latest tip
>   sched/core at:
>   commit 14b3f2d9ee8d "sched/fair: Disallow sync wakeup from interrupt context"
>   Reason for the regression are the fewer affine wakeups that leaves the
>   client farther away from the data it needs to consume next primed in the
>   waker's LLC.
>   Such regressions can be expected from tasks that use sockets to communicate
>   significant amount of data especially on system with multiple LLCs.
>
> - Other benchmarks have a comparable behavior to the tip at previous commit
>   commit : 91caa5ae2424 "sched/core: Fix the bug that task won't enqueue
>   into core tree when update cookie"
>
> I'll leave more details below.

Mel Gorman also warned about this negative side-effect in:

  Subject: Re: [PATCH] sched/fair: no sync wakeup from interrupt context
  Date: Fri, 15 Jul 2022 11:07:38 +0100
  Message-ID: <20220715100738.GD3493@suse.de>

  https://lore.kernel.org/all/20220715100738.GD3493@suse.de/

I've reverted this commit (14b3f2d9ee8df3b) for the time being from
tip:sched/core.

Thanks for the heads-up!

Thanks,

	Ingo
On 8/1/22 06:26, Ingo Molnar wrote:
> * K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
>> [..snip..]
>
> Mel Gorman also warned about this negative side-effect in:
>
>   Subject: Re: [PATCH] sched/fair: no sync wakeup from interrupt context
>   Date: Fri, 15 Jul 2022 11:07:38 +0100
>   Message-ID: <20220715100738.GD3493@suse.de>
>
>   https://lore.kernel.org/all/20220715100738.GD3493@suse.de/
?? Mel was talking about a completely different thing, I brought up a
different patch that I wanted to revert and Mel thought it would hurt
other workloads which don't benefit from pulling but as you can see,
tbench somehow benefits from it, at least according to one metric from
one workload.

Libo
> I've reverted this commit (14b3f2d9ee8df3b) for the time being from
> tip:sched/core.
>
> Thanks for the heads-up!
>
> Thanks,
>
> 	Ingo
* Libo Chen <libo.chen@oracle.com> wrote:

> On 8/1/22 06:26, Ingo Molnar wrote:
> > * K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> >
> > > [..snip..]
> >
> > Mel Gorman also warned about this negative side-effect in:
> >
> >   Subject: Re: [PATCH] sched/fair: no sync wakeup from interrupt context
> >   Date: Fri, 15 Jul 2022 11:07:38 +0100
> >   Message-ID: <20220715100738.GD3493@suse.de>
> >
> >   https://lore.kernel.org/all/20220715100738.GD3493@suse.de/
>
> ?? Mel was talking about a completely different thing, I brought up a
> different patch that I wanted to revert and Mel thought it would hurt other
> workloads which don't benefit from pulling but as you can see, tbench
> somehow benefits from it, at least according to one metric from one
> workload.

Yeah - but nevertheless the discussion with Mel was open-ended AFAICS, and
the 'major tbench regression' report by K Prateek Nayak above still stands
and needs to be investigated/understood, right?

Thanks,

	Ingo
On 8/3/22 02:18, Ingo Molnar wrote:
> * Libo Chen <libo.chen@oracle.com> wrote:
>
>> [..snip..]
>>
>> ?? Mel was talking about a completely different thing, I brought up a
>> different patch that I wanted to revert and Mel thought it would hurt other
>> workloads which don't benefit from pulling but as you can see, tbench
>> somehow benefits from it, at least according to one metric from one
>> workload.
> Yeah - but nevertheless the discussion with Mel was open-ended AFAICS, and
> the 'major tbench regression' report by K Prateek Nayak above still stands
> and needs to be investigated/understood, right?

Oh yes, I have no issue with holding the patch back until the regression
is fully understood. I was just a little confused on your reference to
Mel's comments. Anyway, I will post my investigation soon.

Libo

> Thanks,
>
> Ingo
* Libo Chen <libo.chen@oracle.com> wrote:

> Oh yes, I have no issue with holding the patch back until the regression
> is fully understood. I was just a little confused on your reference to
> Mel's comments. [...]

Yeah, that was just me getting confused about which change Mel was
referring to, as I was looking for external confirmation saying what I was
thinking about the patch: in_task()/in_interrupt() heuristics rarely do
well. ;-)

> Anyway, I will post my investigation soon.

Thx - and measurements will always be able to override any negative
expectations of mine, so my comments weren't a NAK.

Thanks,

	Ingo
On Thu, Aug 04, 2022 at 10:55:12AM +0200, Ingo Molnar wrote:
> * Libo Chen <libo.chen@oracle.com> wrote:
>
> > Oh yes, I have no issue with holding the patch back until the regression
> > is fully understood. I was just a little confused on your reference to
> > Mel's comments. [...]
>
> Yeah, that was just me getting confused about which change Mel was
> referring to, as I was looking for external confirmation saying what I was
> thinking about the patch: in_task()/in_interrupt() heuristics rarely do
> well. ;-)

Even though I was referring to something else, the reported regression is
still a regression. Prateek's report and what I brought up in
https://lore.kernel.org/all/20220715100738.GD3493@suse.de/ are both simply
examples where changes to affine wakeup can have surprising results.

-- 
Mel Gorman
SUSE Labs
On Mon, Jul 11, 2022 at 03:47:04PM -0700, Libo Chen wrote:
> [1]: https://lkml.org/lkml/2021/11/5/234

Please use lore.kernel.org based links that contain the MsgID of the
email in question, I think this wants to be:

  https://lkml.kernel.org/r/20211105105136.12137-1-21cnbao@gmail.com

Right?
On 7/14/22 04:35, Peter Zijlstra wrote:
> On Mon, Jul 11, 2022 at 03:47:04PM -0700, Libo Chen wrote:
>> [1]: https://lkml.org/lkml/2021/11/5/234
> Please use lore.kernel.org based links that contain the MsgID of the
> email in question, I think this wants to be:
>
> https://lkml.kernel.org/r/20211105105136.12137-1-21cnbao@gmail.com
>
> Right?
Oh yes, thanks.

Libo
On Mon, Jul 11, 2022 at 03:47:04PM -0700, Libo Chen wrote:
> Barry Song first pointed out that replacing sync wakeup with regular wakeup
> seems to reduce overeager wakeup pulling and shows noticeable performance
> improvement.[1]
>
> This patch argues that allowing sync for wakeups from interrupt context
> is a bug
Yes.
> The
> assumption is built into wake_affine() where it discounts the waker's presence
> from the runqueue when sync is true. The random waker from interrupts bears no
> relation to the wakee and doesn't usually go to sleep immediately afterwards
> unless wakeup granularity is reached.
Exactly that.
> Signed-off-by: Libo Chen <libo.chen@oracle.com>
> ---
> kernel/sched/fair.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 794c2cb945f8..59b210d2cdb5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6704,7 +6704,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> static int
> select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> {
> - int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
> + /* Don't set sync for wakeup from irq/soft ctx */
> + int sync = in_task() && (wake_flags & WF_SYNC)
> + && !(current->flags & PF_EXITING);
That's not a coding style you'll find anywhere near this code though.
I'll fix it up.
> struct sched_domain *tmp, *sd = NULL;
> int cpu = smp_processor_id();
> int new_cpu = prev_cpu;
> --
> 2.31.1
>
On Mon, 2022-07-11 at 15:47 -0700, Libo Chen wrote:
> Barry Song first pointed out that replacing sync wakeup with regular wakeup
> seems to reduce overeager wakeup pulling and shows noticeable performance
> improvement.[1]
>
> This patch argues that allowing sync for wakeups from interrupt context
> is a bug and fixing it can improve performance even when irq/softirq is
> evenly spread out.
>
> For wakeups from an ISR, the waking CPU is just the CPU of the ISR and the
> so-called waker can be any random task that happens to be running on that
> CPU when the interrupt comes in. This is completely different from other
> wakeups where the task running on the waking CPU is the actual waker. For
> example, two tasks communicate through a pipe or multiple tasks access the
> same critical section, etc. This difference is important because with sync
> we assume the waker will get off the runqueue and go to sleep immediately
> after the wakeup. The assumption is built into wake_affine() where it
> discounts the waker's presence from the runqueue when sync is true. The
> random waker from interrupts bears no relation to the wakee and doesn't
> usually go to sleep immediately afterwards unless wakeup granularity is
> reached. Plus the scheduler no longer enforces the preemption of the waker
> for sync wakeups as it used to before
> patch f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature"). Enforcing sync
> wakeup preemption for wakeups from interrupt contexts doesn't seem
> appropriate either, but at least sync wakeup will do what it's supposed to do.
Will there be scenarios where you do want the task being woken up to be
pulled to the CPU where the interrupt happened, because the data it needs
to access is local to the CPU/NUMA node where the interrupt happened? For
example, interrupts associated with received network packets. Sync still
seems desirable, at least if there is no task currently running on the CPU
where the interrupt happened. So maybe we should have some consideration
of the load on the CPU/NUMA node before deciding whether we should do a
sync wakeup for such interrupts.
>
> Add a check to make sure that sync can only be set when in_task() is true. If
> a wakeup is from interrupt context, sync flag will be 0 because in_task()
> returns 0.
>
> Tested in two scenarios: wakeups from 1) task contexts, expected to see no
> performance changes; 2) interrupt contexts, expected to see better performance
> under low/medium load and no regression under heavy load.
>
> Use hackbench for scenario 1 and pgbench for scenarios 2 both from mmtests on
> a 2-socket Xeon E5-2699v3 box with 256G memory in total. Running 5.18 kernel
> with SELinux disabled. Scheduler/MM tunables are all default. Irqbalance
> daemon is active.
>
> Hackbench (config-scheduler-unbound)
> =========
> process-pipes:
> Baseline Patched
> Amean 1 0.4300 ( 0.00%) 0.4420 ( -2.79%)
> Amean 4 0.8757 ( 0.00%) 0.8774 ( -0.20%)
> Amean 7 1.3712 ( 0.00%) 1.3789 ( -0.56%)
> Amean 12 2.3541 ( 0.00%) 2.3714 ( -0.73%)
> Amean 21 4.2229 ( 0.00%) 4.2439 ( -0.50%)
> Amean 30 5.9113 ( 0.00%) 5.9451 ( -0.57%)
> Amean 48 9.3873 ( 0.00%) 9.4898 ( -1.09%)
> Amean 79 15.9281 ( 0.00%) 16.1385 ( -1.32%)
> Amean 110 22.0961 ( 0.00%) 22.3433 ( -1.12%)
> Amean 141 28.2973 ( 0.00%) 28.6209 ( -1.14%)
> Amean 172 34.4709 ( 0.00%) 34.9347 ( -1.35%)
> Amean 203 40.7621 ( 0.00%) 41.2497 ( -1.20%)
> Amean 234 47.0416 ( 0.00%) 47.6470 ( -1.29%)
> Amean 265 53.3048 ( 0.00%) 54.1625 ( -1.61%)
> Amean 288 58.0595 ( 0.00%) 58.8096 ( -1.29%)
>
> process-sockets:
> Baseline Patched
> Amean 1 0.6776 ( 0.00%) 0.6611 ( 2.43%)
> Amean 4 2.6183 ( 0.00%) 2.5769 ( 1.58%)
> Amean 7 4.5662 ( 0.00%) 4.4801 ( 1.89%)
> Amean 12 7.7638 ( 0.00%) 7.6201 ( 1.85%)
> Amean 21 13.5335 ( 0.00%) 13.2915 ( 1.79%)
> Amean 30 19.3369 ( 0.00%) 18.9811 ( 1.84%)
> Amean 48 31.0724 ( 0.00%) 30.6015 ( 1.52%)
> Amean 79 51.1881 ( 0.00%) 50.4251 ( 1.49%)
> Amean 110 71.3399 ( 0.00%) 70.4578 ( 1.24%)
> Amean 141 91.4675 ( 0.00%) 90.3769 ( 1.19%)
> Amean 172 111.7463 ( 0.00%) 110.3947 ( 1.21%)
> Amean 203 131.6927 ( 0.00%) 130.3270 ( 1.04%)
> Amean 234 151.7459 ( 0.00%) 150.1320 ( 1.06%)
> Amean 265 171.9101 ( 0.00%) 169.9751 ( 1.13%)
> Amean 288 186.9231 ( 0.00%) 184.7706 ( 1.15%)
>
> thread-pipes:
> Baseline Patched
> Amean 1 0.4523 ( 0.00%) 0.4535 ( -0.28%)
> Amean 4 0.9041 ( 0.00%) 0.9085 ( -0.48%)
> Amean 7 1.4111 ( 0.00%) 1.4146 ( -0.25%)
> Amean 12 2.3532 ( 0.00%) 2.3688 ( -0.66%)
> Amean 21 4.1550 ( 0.00%) 4.1701 ( -0.36%)
> Amean 30 6.1043 ( 0.00%) 6.2391 ( -2.21%)
> Amean 48 10.2077 ( 0.00%) 10.3511 ( -1.40%)
> Amean 79 16.7922 ( 0.00%) 17.0427 ( -1.49%)
> Amean 110 23.3350 ( 0.00%) 23.6522 ( -1.36%)
> Amean 141 29.6458 ( 0.00%) 30.2617 ( -2.08%)
> Amean 172 35.8649 ( 0.00%) 36.4225 ( -1.55%)
> Amean 203 42.4477 ( 0.00%) 42.8332 ( -0.91%)
> Amean 234 48.7117 ( 0.00%) 49.4042 ( -1.42%)
> Amean 265 54.9171 ( 0.00%) 55.6551 ( -1.34%)
> Amean 288 59.5282 ( 0.00%) 60.2903 ( -1.28%)
>
> thread-sockets:
> Baseline Patched
> Amean 1 0.6917 ( 0.00%) 0.6892 ( 0.37%)
> Amean 4 2.6651 ( 0.00%) 2.6017 ( 2.38%)
> Amean 7 4.6734 ( 0.00%) 4.5637 ( 2.35%)
> Amean 12 8.0156 ( 0.00%) 7.8079 ( 2.59%)
> Amean 21 14.0451 ( 0.00%) 13.6679 ( 2.69%)
> Amean 30 20.0963 ( 0.00%) 19.5657 ( 2.64%)
> Amean 48 32.4115 ( 0.00%) 31.6001 ( 2.50%)
> Amean 79 53.1104 ( 0.00%) 51.8395 ( 2.39%)
> Amean 110 74.0929 ( 0.00%) 72.4391 ( 2.23%)
> Amean 141 95.1506 ( 0.00%) 93.0992 ( 2.16%)
> Amean 172 116.1969 ( 0.00%) 113.8307 ( 2.04%)
> Amean 203 137.4413 ( 0.00%) 134.5247 ( 2.12%)
> Amean 234 158.5395 ( 0.00%) 155.2793 ( 2.06%)
> Amean 265 179.7729 ( 0.00%) 176.1099 ( 2.04%)
> Amean 288 195.5573 ( 0.00%) 191.3977 ( 2.13%)
>
> Pgbench (config-db-pgbench-timed-ro-small)
> =======
> Baseline Patched
> Hmean 1 68.54 ( 0.00%) 69.72 ( 1.71%)
> Hmean 6 27725.78 ( 0.00%) 34119.11 ( 23.06%)
> Hmean 12 55724.26 ( 0.00%) 63158.22 ( 13.34%)
> Hmean 22 72806.26 ( 0.00%) 73389.98 ( 0.80%)
> Hmean 30 79000.45 ( 0.00%) 75197.02 ( -4.81%)
> Hmean 48 78180.16 ( 0.00%) 75074.09 ( -3.97%)
> Hmean 80 75001.93 ( 0.00%) 70590.72 ( -5.88%)
> Hmean 110 74812.25 ( 0.00%) 74128.57 ( -0.91%)
> Hmean 142 74261.05 ( 0.00%) 72910.48 ( -1.82%)
> Hmean 144 75375.35 ( 0.00%) 71295.72 ( -5.41%)
>
> For hackbench, +-3% fluctuation is normal on this two-socket box, it's safe to
> conclude that there are no performance differences for wakeups from task context.
> For pgbench, after many runs, 10~30% gains are very consistent at lower
> client counts (< #cores per socket).
Can you provide some further insight into why pgbench is helped in the
low load case? Is it because the woken tasks tend to stay put, not get
moved around by interrupts, and so maintain cache warmth?
Tim
> For higher client counts, both kernels are
> very close, +-5% swings are quite common. Also NET_TX|RX softirq load
> does spread out across both NUMA nodes in this test so there is no need to do
> any explicit RPS/RFS.
>
> [1]: https://lkml.org/lkml/2021/11/5/234
>
> Signed-off-by: Libo Chen <libo.chen@oracle.com>
> ---
> kernel/sched/fair.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 794c2cb945f8..59b210d2cdb5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6704,7 +6704,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> static int
> select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> {
> - int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
> + /* Don't set sync for wakeup from irq/soft ctx */
> + int sync = in_task() && (wake_flags & WF_SYNC)
> + && !(current->flags & PF_EXITING);
> struct sched_domain *tmp, *sd = NULL;
> int cpu = smp_processor_id();
> int new_cpu = prev_cpu;
> --
> 2.31.1
>
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 14b3f2d9ee8df3b6040f7e21f9fcd1d848938fd9
Gitweb: https://git.kernel.org/tip/14b3f2d9ee8df3b6040f7e21f9fcd1d848938fd9
Author: Libo Chen <libo.chen@oracle.com>
AuthorDate: Mon, 11 Jul 2022 15:47:04 -07:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 21 Jul 2022 10:39:39 +02:00
sched/fair: Disallow sync wakeup from interrupt context
Barry Song first pointed out that replacing sync wakeup with regular wakeup
seems to reduce overeager wakeup pulling and shows noticeable performance
improvement.[1]
This patch argues that allowing sync for wakeups from interrupt context is a
bug, and that fixing it can improve performance even when irq/softirq load is
evenly spread out.
For wakeups from an ISR, the waking CPU is simply the CPU running the ISR,
and the so-called waker can be any random task that happens to be running on
that CPU when the interrupt comes in. This is completely different from other
wakeups, where the task running on the waking CPU is the actual waker: for
example, two tasks communicating through a pipe, or multiple tasks accessing
the same critical section. The difference matters because with sync we assume
the waker will get off the runqueue and go to sleep immediately after the
wakeup. That assumption is built into wake_affine(), which discounts the
waker's presence from the runqueue when sync is true. A random waker from an
interrupt bears no relation to the wakee and doesn't usually go to sleep
immediately afterwards, unless wakeup granularity is reached. Plus, the
scheduler no longer enforces preemption of the waker for sync wakeups, as it
did before commit f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature").
Enforcing sync wakeup preemption for wakeups from interrupt context doesn't
seem appropriate either, but at least then sync wakeup would do what it's
supposed to do.
Add a check so that sync can only be set when in_task() is true. If a wakeup
comes from interrupt context, the sync flag will be 0 because in_task()
returns 0.
Tested in two scenarios: wakeups from 1) task context, where no performance
change is expected; and 2) interrupt context, where better performance is
expected under low/medium load and no regression under heavy load.
Hackbench was used for scenario 1 and pgbench for scenario 2, both from
mmtests, on a 2-socket Xeon E5-2699v3 box with 256G of memory in total,
running a 5.18 kernel with SELinux disabled. Scheduler/MM tunables are all at
their defaults. The irqbalance daemon is active.
Hackbench (config-scheduler-unbound)
=========
process-pipes:
Baseline Patched
Amean 1 0.4300 ( 0.00%) 0.4420 ( -2.79%)
Amean 4 0.8757 ( 0.00%) 0.8774 ( -0.20%)
Amean 7 1.3712 ( 0.00%) 1.3789 ( -0.56%)
Amean 12 2.3541 ( 0.00%) 2.3714 ( -0.73%)
Amean 21 4.2229 ( 0.00%) 4.2439 ( -0.50%)
Amean 30 5.9113 ( 0.00%) 5.9451 ( -0.57%)
Amean 48 9.3873 ( 0.00%) 9.4898 ( -1.09%)
Amean 79 15.9281 ( 0.00%) 16.1385 ( -1.32%)
Amean 110 22.0961 ( 0.00%) 22.3433 ( -1.12%)
Amean 141 28.2973 ( 0.00%) 28.6209 ( -1.14%)
Amean 172 34.4709 ( 0.00%) 34.9347 ( -1.35%)
Amean 203 40.7621 ( 0.00%) 41.2497 ( -1.20%)
Amean 234 47.0416 ( 0.00%) 47.6470 ( -1.29%)
Amean 265 53.3048 ( 0.00%) 54.1625 ( -1.61%)
Amean 288 58.0595 ( 0.00%) 58.8096 ( -1.29%)
process-sockets:
Baseline Patched
Amean 1 0.6776 ( 0.00%) 0.6611 ( 2.43%)
Amean 4 2.6183 ( 0.00%) 2.5769 ( 1.58%)
Amean 7 4.5662 ( 0.00%) 4.4801 ( 1.89%)
Amean 12 7.7638 ( 0.00%) 7.6201 ( 1.85%)
Amean 21 13.5335 ( 0.00%) 13.2915 ( 1.79%)
Amean 30 19.3369 ( 0.00%) 18.9811 ( 1.84%)
Amean 48 31.0724 ( 0.00%) 30.6015 ( 1.52%)
Amean 79 51.1881 ( 0.00%) 50.4251 ( 1.49%)
Amean 110 71.3399 ( 0.00%) 70.4578 ( 1.24%)
Amean 141 91.4675 ( 0.00%) 90.3769 ( 1.19%)
Amean 172 111.7463 ( 0.00%) 110.3947 ( 1.21%)
Amean 203 131.6927 ( 0.00%) 130.3270 ( 1.04%)
Amean 234 151.7459 ( 0.00%) 150.1320 ( 1.06%)
Amean 265 171.9101 ( 0.00%) 169.9751 ( 1.13%)
Amean 288 186.9231 ( 0.00%) 184.7706 ( 1.15%)
thread-pipes:
Baseline Patched
Amean 1 0.4523 ( 0.00%) 0.4535 ( -0.28%)
Amean 4 0.9041 ( 0.00%) 0.9085 ( -0.48%)
Amean 7 1.4111 ( 0.00%) 1.4146 ( -0.25%)
Amean 12 2.3532 ( 0.00%) 2.3688 ( -0.66%)
Amean 21 4.1550 ( 0.00%) 4.1701 ( -0.36%)
Amean 30 6.1043 ( 0.00%) 6.2391 ( -2.21%)
Amean 48 10.2077 ( 0.00%) 10.3511 ( -1.40%)
Amean 79 16.7922 ( 0.00%) 17.0427 ( -1.49%)
Amean 110 23.3350 ( 0.00%) 23.6522 ( -1.36%)
Amean 141 29.6458 ( 0.00%) 30.2617 ( -2.08%)
Amean 172 35.8649 ( 0.00%) 36.4225 ( -1.55%)
Amean 203 42.4477 ( 0.00%) 42.8332 ( -0.91%)
Amean 234 48.7117 ( 0.00%) 49.4042 ( -1.42%)
Amean 265 54.9171 ( 0.00%) 55.6551 ( -1.34%)
Amean 288 59.5282 ( 0.00%) 60.2903 ( -1.28%)
thread-sockets:
Baseline Patched
Amean 1 0.6917 ( 0.00%) 0.6892 ( 0.37%)
Amean 4 2.6651 ( 0.00%) 2.6017 ( 2.38%)
Amean 7 4.6734 ( 0.00%) 4.5637 ( 2.35%)
Amean 12 8.0156 ( 0.00%) 7.8079 ( 2.59%)
Amean 21 14.0451 ( 0.00%) 13.6679 ( 2.69%)
Amean 30 20.0963 ( 0.00%) 19.5657 ( 2.64%)
Amean 48 32.4115 ( 0.00%) 31.6001 ( 2.50%)
Amean 79 53.1104 ( 0.00%) 51.8395 ( 2.39%)
Amean 110 74.0929 ( 0.00%) 72.4391 ( 2.23%)
Amean 141 95.1506 ( 0.00%) 93.0992 ( 2.16%)
Amean 172 116.1969 ( 0.00%) 113.8307 ( 2.04%)
Amean 203 137.4413 ( 0.00%) 134.5247 ( 2.12%)
Amean 234 158.5395 ( 0.00%) 155.2793 ( 2.06%)
Amean 265 179.7729 ( 0.00%) 176.1099 ( 2.04%)
Amean 288 195.5573 ( 0.00%) 191.3977 ( 2.13%)
Pgbench (config-db-pgbench-timed-ro-small)
=======
Baseline Patched
Hmean 1 68.54 ( 0.00%) 69.72 ( 1.71%)
Hmean 6 27725.78 ( 0.00%) 34119.11 ( 23.06%)
Hmean 12 55724.26 ( 0.00%) 63158.22 ( 13.34%)
Hmean 22 72806.26 ( 0.00%) 73389.98 ( 0.80%)
Hmean 30 79000.45 ( 0.00%) 75197.02 ( -4.81%)
Hmean 48 78180.16 ( 0.00%) 75074.09 ( -3.97%)
Hmean 80 75001.93 ( 0.00%) 70590.72 ( -5.88%)
Hmean 110 74812.25 ( 0.00%) 74128.57 ( -0.91%)
Hmean 142 74261.05 ( 0.00%) 72910.48 ( -1.82%)
Hmean 144 75375.35 ( 0.00%) 71295.72 ( -5.41%)
For hackbench, +-3% fluctuation is normal on this two-socket box, so it is
safe to conclude that there are no performance differences for wakeups from
task context. For pgbench, after many runs, 10~30% gains are very consistent
at lower client counts (< #cores per socket). For higher client counts, both
kernels are very close; +-5% swings are quite common. Also, NET_TX|RX softirq
load does spread out across both NUMA nodes in this test, so there is no need
to do any explicit RPS/RFS.
[1]: https://lkml.kernel.org/r/20211105105136.12137-1-21cnbao@gmail.com
Signed-off-by: Libo Chen <libo.chen@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220711224704.1672831-1-libo.chen@oracle.com
---
kernel/sched/fair.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 914096c..2fc4725 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7013,7 +7013,10 @@ unlock:
static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
{
- int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
+ /*
+ * Only consider WF_SYNC from task context, since only a task can schedule out.
+ */
+ int sync = in_task() && (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
struct sched_domain *tmp, *sd = NULL;
int cpu = smp_processor_id();
int new_cpu = prev_cpu;