[RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
Posted by Peter Zijlstra 7 months ago
Hi!

So Chris poked me about a wee performance drop they're seeing since around
6.11. He's extended his schbench tool to mimic the workload in question.

Specifically, the commandline given:

  schbench -L -m 4 -M auto -t 128 -n 0 -r 60

This benchmark wants to stay on a single (large) LLC (Chris, perhaps add an
option to start the CPU mask with
/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list or something). Both
the machine Chris has (SKL, 20+ cores per LLC) and the machines I ran this on
(SKL, SPR, 20+ cores) are Intel; AMD has smaller LLCs and the problem wasn't
as pronounced there.
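
Something like this (an untested sketch; shared_cpu_list is already in the
cpulist format that taskset -c accepts):

  # pin the whole benchmark to cpu0's L3
  LLC=$(cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list)
  taskset -c "$LLC" schbench -L -m 4 -M auto -t 128 -n 0 -r 60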

Use the performance CPU governor (as always when benchmarking). Also, if the
test results are unstable as all heck, disable turbo.
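
With intel_pstate that amounts to something like this (paths will differ
with other cpufreq drivers):

  echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo   # 1 disables turbo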

After a fair amount of tinkering I managed to reproduce this on my SPR and
Thomas' SKL. The SKL would only give usable numbers with the second socket
offline and turbo disabled -- YMMV.
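
Offlining the second socket amounts to something like this (a sketch assuming
its CPUs are 20-39; check lscpu for the actual topology):

  for i in $(seq 20 39); do echo 0 > /sys/devices/system/cpu/cpu$i/online; done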

Chris further provided a bisect into the DELAY_DEQUEUE patches and a bisect
leading to commit 5f6bd380c7bd ("sched/rt: Remove default bandwidth control")
-- which enables the dl_server by default.
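
For reference, on kernels that have it, the dl_server exposes per-CPU knobs
in debugfs; a sketch, treating the exact fair_server paths as an assumption
from that series:

  cat /sys/kernel/debug/sched/fair_server/cpu0/runtime
  echo 0 > /sys/kernel/debug/sched/fair_server/cpu0/runtime  # 0 disables it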


SKL (performance, no_turbo):

schbench-6.9.0-1.txt:average rps: 2040360.55
schbench-6.9.0-2.txt:average rps: 2038846.78
schbench-6.9.0-3.txt:average rps: 2037892.28

schbench-6.15.0-rc6+-1.txt:average rps: 1907718.18
schbench-6.15.0-rc6+-2.txt:average rps: 1906931.07
schbench-6.15.0-rc6+-3.txt:average rps: 1903190.38

schbench-6.15.0-rc6+-dirty-1.txt:average rps: 2002224.78
schbench-6.15.0-rc6+-dirty-2.txt:average rps: 2007116.80
schbench-6.15.0-rc6+-dirty-3.txt:average rps: 2005294.57

schbench-6.15.0-rc6+-dirty-delayed-1.txt:average rps: 2011282.15
schbench-6.15.0-rc6+-dirty-delayed-2.txt:average rps: 2016347.10
schbench-6.15.0-rc6+-dirty-delayed-3.txt:average rps: 2014515.47

schbench-6.15.0-rc6+-dirty-delayed-default-1.txt:average rps: 2042169.00
schbench-6.15.0-rc6+-dirty-delayed-default-2.txt:average rps: 2032789.77
schbench-6.15.0-rc6+-dirty-delayed-default-3.txt:average rps: 2040313.95


SPR (performance):

schbench-6.9.0-1.txt:average rps: 2975450.75
schbench-6.9.0-2.txt:average rps: 2975464.38
schbench-6.9.0-3.txt:average rps: 2974881.02

schbench-6.15.0-rc6+-1.txt:average rps: 2882537.37
schbench-6.15.0-rc6+-2.txt:average rps: 2881658.70
schbench-6.15.0-rc6+-3.txt:average rps: 2884293.37

schbench-6.15.0-rc6+-dl_server-1.txt:average rps: 2924423.18
schbench-6.15.0-rc6+-dl_server-2.txt:average rps: 2920422.63

schbench-6.15.0-rc6+-dirty-1.txt:average rps: 3011540.97
schbench-6.15.0-rc6+-dirty-2.txt:average rps: 3010124.10

schbench-6.15.0-rc6+-dirty-delayed-1.txt:average rps: 3030883.15
schbench-6.15.0-rc6+-dirty-delayed-2.txt:average rps: 3031627.05

schbench-6.15.0-rc6+-dirty-delayed-default-1.txt:average rps: 3053005.98
schbench-6.15.0-rc6+-dirty-delayed-default-2.txt:average rps: 3052972.80


As can be seen, the SPR is much easier to please than the SKL for whatever
reason. I'm thinking we can make TTWU_QUEUE_DELAYED default on, but I suspect
TTWU_QUEUE_DEFAULT might be a harder sell -- we'd need to run more than this
one benchmark.
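
For anyone wanting to compare, the features can be flipped at runtime through
debugfs (assuming SCHED_DEBUG; a NO_ prefix clears a feature):

  echo TTWU_QUEUE_DELAYED > /sys/kernel/debug/sched/features
  echo NO_TTWU_QUEUE_DEFAULT > /sys/kernel/debug/sched/features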

Anyway, the patches are stable (finally!, I hope, knock on wood) but in a
somewhat rough state. At the very least the last patch is missing ttwu_stat(),
still need to figure out how to account it ;-)

Chris, I'm hoping your machine will agree with these numbers; it hasn't been
straight sailing in that regard.
Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
Posted by Peter Zijlstra 6 months, 3 weeks ago
On Tue, May 20, 2025 at 11:45:38AM +0200, Peter Zijlstra wrote:

> Anyway, the patches are stable (finally!, I hope, knock on wood) but in a
> somewhat rough state. At the very least the last patch is missing ttwu_stat(),
> still need to figure out how to account it ;-)
> 
> Chris, I'm hoping your machine will agree with these numbers; it hasn't been
> straight sailing in that regard.

Anybody? -- If no comments I'll just stick them in sched/core or so.
Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
Posted by Chris Mason 6 months, 3 weeks ago
On 5/28/25 3:59 PM, Peter Zijlstra wrote:
> On Tue, May 20, 2025 at 11:45:38AM +0200, Peter Zijlstra wrote:
> 
>> Anyway, the patches are stable (finally!, I hope, knock on wood) but in a
>> somewhat rough state. At the very least the last patch is missing ttwu_stat(),
>> still need to figure out how to account it ;-)
>>
>> Chris, I'm hoping your machine will agree with these numbers; it hasn't been
>> straight sailing in that regard.
> 
> Anybody? -- If no comments I'll just stick them in sched/core or so.

My initial numbers were quite bad, roughly 50% fewer RPS than the old
6.9 kernel on the big Turin machine.  I need to redo things and make
sure the numbers are all valid; I'll try and do that today.

-chris
Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
Posted by Beata Michalska 6 months, 3 weeks ago
On Wed, May 28, 2025 at 09:59:44PM +0200, Peter Zijlstra wrote:
> On Tue, May 20, 2025 at 11:45:38AM +0200, Peter Zijlstra wrote:
> 
> > Anyway, the patches are stable (finally!, I hope, knock on wood) but in a
> > somewhat rough state. At the very least the last patch is missing ttwu_stat(),
> > still need to figure out how to account it ;-)
> > 
> > Chris, I'm hoping your machine will agree with these numbers; it hasn't been
> > straight sailing in that regard.
> 
> Anybody? -- If no comments I'll just stick them in sched/core or so.
>
Hi Peter,

I've tried out your series on top of 6.15 on an Ampere Altra Mt Jade
dual-socket (160-core) system, which enables SCHED_CLUSTER (2-core MC domains).
I'm sharing preliminary test results of 50 runs per setup; so far, the data
show quite a bit of run-to-run variability, so I'm not sure how useful they
will be. I haven't done a deep dive yet, which is probably needed and will
hopefully come later on.
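
For reference, the resulting domain hierarchy can be inspected through
debugfs (a sketch, assuming SCHED_DEBUG):

  grep . /sys/kernel/debug/sched/domains/cpu0/domain*/name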


Results for average rps (60s) sorted based on P90

CFG |   min      |  max       |   stdev    |   90th
----+------------+------------+------------+-----------
 1  | 704577.50  | 942665.67  | 46439.49   | 891272.09
 4  | 656313.48  | 877223.85  | 47871.43   | 837693.28
 3  | 658665.75  | 859520.32  | 50257.35   | 832174.80
 5  | 630419.62  | 842170.47  | 47267.52   | 815911.81
 2  | 647163.57  | 815392.65  | 35559.98   | 783884.00

Legend:
#1 : kernel 6.9
#2 : kernel 6.15
#3 : kernel 6.15 patched def (TTWU_QUEUE_ON_CPU + NO_TTWU_QUEUE_DEFAULT)
#4 : kernel 6.15 patched + TTWU_QUEUE_ON_CPU + TTWU_QUEUE_DEFAULT
#5 : kernel 6.15 patched + NO_TTWU_QUEUE_ON_CPU + NO_TTWU_QUEUE_DEFAULT

---
BR
Beata
Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
Posted by Peter Zijlstra 6 months, 3 weeks ago
On Thu, May 29, 2025 at 12:18:54PM +0200, Beata Michalska wrote:
> On Wed, May 28, 2025 at 09:59:44PM +0200, Peter Zijlstra wrote:
> > On Tue, May 20, 2025 at 11:45:38AM +0200, Peter Zijlstra wrote:
> > 
> > > Anyway, the patches are stable (finally!, I hope, knock on wood) but in a
> > > somewhat rough state. At the very least the last patch is missing ttwu_stat(),
> > > still need to figure out how to account it ;-)
> > > 
> > > Chris, I'm hoping your machine will agree with these numbers; it hasn't been
> > > straight sailing in that regard.
> > 
> > Anybody? -- If no comments I'll just stick them in sched/core or so.
> >
> Hi Peter,
> 
> I've tried out your series on top of 6.15 on an Ampere Altra Mt Jade
> dual-socket (160-core) system, which enables SCHED_CLUSTER (2-core MC domains).

Ah, that's a radically different system than what we set out with. Good
to get some feedback on that indeed.

> I'm sharing preliminary test results of 50 runs per setup; so far, the data
> show quite a bit of run-to-run variability, so I'm not sure how useful they
> will be.

Yeah, I had some of that on the Skylake system; I had to disable turbo
for the numbers to become stable enough to say anything much.

> I haven't done a deep dive yet, which is probably needed and will
> hopefully come later on.
> 
> 
> Results for average rps (60s) sorted based on P90
> 
> CFG |   min      |  max       |   stdev    |   90th
> ----+------------+------------+------------+-----------
>  1  | 704577.50  | 942665.67  | 46439.49   | 891272.09
>  2  | 647163.57  | 815392.65  | 35559.98   | 783884.00
>  3  | 658665.75  | 859520.32  | 50257.35   | 832174.80

>  4  | 656313.48  | 877223.85  | 47871.43   | 837693.28
>  5  | 630419.62  | 842170.47  | 47267.52   | 815911.81
> 
> Legend:
> #1 : kernel 6.9
> #2 : kernel 6.15
> #3 : kernel 6.15 patched def (TTWU_QUEUE_ON_CPU + NO_TTWU_QUEUE_DEFAULT)
> #4 : kernel 6.15 patched + TTWU_QUEUE_ON_CPU + TTWU_QUEUE_DEFAULT
> #5 : kernel 6.15 patched + NO_TTWU_QUEUE_ON_CPU + NO_TTWU_QUEUE_DEFAULT

Right, minor improvement. At least it's not making it worse :-)

The new toy is TTWU_QUEUE_DELAYED, and yeah, I did notice that disabling
TTWU_QUEUE_ON_CPU was a bad idea.
Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
Posted by Chris Mason 6 months, 3 weeks ago
On 5/28/25 3:59 PM, Peter Zijlstra wrote:
> On Tue, May 20, 2025 at 11:45:38AM +0200, Peter Zijlstra wrote:
> 
>> Anyway, the patches are stable (finally!, I hope, knock on wood) but in a
>> somewhat rough state. At the very least the last patch is missing ttwu_stat(),
>> still need to figure out how to account it ;-)
>>
>> Chris, I'm hoping your machine will agree with these numbers; it hasn't been
>> straight sailing in that regard.
> 
> Anybody? -- If no comments I'll just stick them in sched/core or so.

Hi Peter,

I'll get all of these run on the big Turin machine; should have some
numbers tomorrow.

-chris
Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
Posted by Peter Zijlstra 6 months, 1 week ago
On Wed, May 28, 2025 at 09:41:33PM -0400, Chris Mason wrote:

> I'll get all of these run on the big Turin machine; should have some
> numbers tomorrow.

Right... so Turin. I had a quick look through our IRC logs but I
couldn't find exactly which model you had, and unfortunately AMD uses
the Turin name for both Zen 5c and Zen 5 Epyc :-(

Anyway, the big and obvious difference between the Intel and AMD
machines is the L3. So far we've been looking at SKL/SPR single-L3
performance, but Turin (whichever that might be) will have many L3s,
with Zen 5 having 8 cores per L3 and Zen 5c having 16.

Additionally, your schbench -M auto thing is doing exactly the wrong
thing for them. What you want is for those message threads to be spread
out across the L3s, not all stuck to the first (which is what -M auto
would end up doing). And then the associated worker threads would
ideally stick to their respective L3s and not scatter all over the
machine.
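
Something like this (untested) at least lists the distinct L3 masks to
spread the message threads over:

  cat /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort -u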

Anyway, most of the data we shared was about single-socket SKL; might be
we missed some obvious things for the multi-L3 case.

I'll go poke at some of the things I've so far neglected because of the
single L3 focus.
Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
Posted by Chris Mason 6 months ago
On 6/14/25 6:04 AM, Peter Zijlstra wrote:
> On Wed, May 28, 2025 at 09:41:33PM -0400, Chris Mason wrote:
> 
>> I'll get all of these run on the big Turin machine; should have some
>> numbers tomorrow.
> 
> Right... so Turin. I had a quick look through our IRC logs but I
> couldn't find exactly which model you had, and unfortunately AMD uses
> the Turin name for both Zen 5c and Zen 5 Epyc :-(

Looks like the one I've been testing on is the Epyc variant.  But,
stepping back for a bit, I bisected a few regressions between 6.9 and 6.13:

- DL server
- DELAY_{DEQUEUE,ZERO}
- PSI fun (not really me, but relevant)

I think these are all important and relevant, but it was strange that
none of these patches seemed to move the needle much on the Turin
machines (+/- futexes), so I went back to the drawing board.

Our internal 6.9 kernel was the "fast" one, and comparing it with
vanilla 6.9, it turns out we'd carried some patches that significantly
improved our web workload on top of vanilla 6.9.

In other words, I've been trying to find a regression that Vernet
actually fixed in 6.9 already.  Bisecting pointed to:

Author: David Vernet <void@manifault.com>
Date:   Tue May 7 08:15:32 2024 -0700

    Revert "sched/fair: Remove sysctl_sched_migration_cost condition"

    This reverts commit c5b0a7eefc70150caf23e37bc9d639c68c87a097.

Comparing schedstat.py output from the fast and slow kernels... this is 6.9
vs 6.13, but I'll get a comparison tomorrow where the schedstat versions
actually match.

# grep balance slow.stat.out
lb_balance_not_idle: 687
lb_imbalance_not_idle: 0
lb_balance_idle: 659054
lb_imbalance_idle: 2635
lb_balance_newly_idle: 2051682
lb_imbalance_newly_idle: 500328
sbe_balanced: 0
sbf_balanced: 0
ttwu_move_balance: 0

# grep balance fast.stat.out
lb_balance_idle: 606600
lb_imbalance_idle: 1911
lb_balance_not_idle: 680
lb_imbalance_not_idle: 0
lb_balance_newly_idle: 11697
lb_imbalance_newly_idle: 22868
sbe_balanced: 0
sbf_balanced: 0
ttwu_move_balance: 0

Reverting that commit above on vanilla 6.9 makes us fast.  Disabling new
idle balance completely is fast on our 6.13 kernel, but reverting that
one commit doesn't change much.  I'll switch back to upstream and
compare newidle balance behavior.
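
For reference, the knob behind that reverted condition is in debugfs on
recent kernels (a sketch, assuming SCHED_DEBUG):

  # sysctl_sched_migration_cost; with the revert in place, newidle
  # balance bails out early when the rq's avg_idle is below this
  cat /sys/kernel/debug/sched/migration_cost_ns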

> 
> Anyway, the big and obvious difference between the Intel and AMD
> machines is the L3. So far we've been looking at SKL/SPR single-L3
> performance, but Turin (whichever that might be) will have many L3s,
> with Zen 5 having 8 cores per L3 and Zen 5c having 16.
> 
> Additionally, your schbench -M auto thing is doing exactly the wrong
> thing for them. What you want is for those message threads to be spread
> out across the L3s, not all stuck to the first (which is what -M auto
> would end up doing). And then the associated worker threads would
> ideally stick to their respective L3s and not scatter all over the
> machine.
> 
> Anyway, most of the data we shared was about single-socket SKL; might be
> we missed some obvious things for the multi-L3 case.
> 
> I'll go poke at some of the things I've so far neglected because of the
> single L3 focus.

You're 100% right about all of this, and I really do want to add some
better smarts to the pinning for both numa and chiplets.

-chris
Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
Posted by K Prateek Nayak 6 months, 2 weeks ago
Hello Peter,

On 5/20/2025 3:15 PM, Peter Zijlstra wrote:
> As can be seen, the SPR is much easier to please than the SKL for whatever
> reason. I'm thinking we can make TTWU_QUEUE_DELAYED default on, but I suspect
> TTWU_QUEUE_DEFAULT might be a harder sell -- we'd need to run more than this
> one benchmark.

I haven't tried toggling any of the newly added SCHED_FEAT() yet.
Following are the numbers for the out-of-the-box variant:

tl;dr: Minor improvements across the board; no noticeable regressions
except for a few schbench datapoints, but those also have high
run-to-run variance, so we should be good.

o Machine details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)

o Kernel details

tip:	  tip:sched/core at commit 914873bc7df9 ("Merge tag
           'x86-build-2025-05-25' of
           git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")

ttwu_opt: tip + this series as is

o Benchmark results

     ==================================================================
     Test          : hackbench
     Units         : Normalized time in seconds
     Interpretation: Lower is better
     Statistic     : AMean
     ==================================================================
     Case:           tip[pct imp](CV)       ttwu_opt[pct imp](CV)
      1-groups     1.00 [ -0.00](13.74)     0.92 [  7.68]( 6.04)
      2-groups     1.00 [ -0.00]( 9.58)     1.04 [ -3.56]( 4.96)
      4-groups     1.00 [ -0.00]( 2.10)     1.01 [ -1.30]( 2.27)
      8-groups     1.00 [ -0.00]( 1.51)     0.99 [  1.26]( 1.70)
     16-groups     1.00 [ -0.00]( 1.10)     0.97 [  3.01]( 1.62)


     ==================================================================
     Test          : tbench
     Units         : Normalized throughput
     Interpretation: Higher is better
     Statistic     : AMean
     ==================================================================
     Clients:    tip[pct imp](CV)       ttwu_opt[pct imp](CV)
         1     1.00 [  0.00]( 0.82)     1.04 [  4.33]( 1.84)
         2     1.00 [  0.00]( 1.13)     1.06 [  5.52]( 1.04)
         4     1.00 [  0.00]( 1.12)     1.05 [  5.41]( 0.53)
         8     1.00 [  0.00]( 0.93)     1.06 [  5.72]( 0.47)
        16     1.00 [  0.00]( 0.38)     1.07 [  6.99]( 0.50)
        32     1.00 [  0.00]( 0.66)     1.05 [  4.68]( 1.79)
        64     1.00 [  0.00]( 1.18)     1.06 [  5.53]( 0.37)
       128     1.00 [  0.00]( 1.12)     1.06 [  5.52]( 0.13)
       256     1.00 [  0.00]( 0.42)     0.99 [ -0.83]( 1.01)
       512     1.00 [  0.00]( 0.14)     1.01 [  1.06]( 0.13)
      1024     1.00 [  0.00]( 0.26)     1.02 [  1.82]( 0.41)


     ==================================================================
     Test          : stream-10
     Units         : Normalized Bandwidth, MB/s
     Interpretation: Higher is better
     Statistic     : HMean
     ==================================================================
     Test:       tip[pct imp](CV)       ttwu_opt[pct imp](CV)
      Copy     1.00 [  0.00]( 8.37)     0.97 [ -2.79]( 9.17)
     Scale     1.00 [  0.00]( 2.85)     1.00 [  0.12]( 2.91)
       Add     1.00 [  0.00]( 3.39)     0.98 [ -2.36]( 4.85)
     Triad     1.00 [  0.00]( 6.39)     1.01 [  1.45]( 8.42)


     ==================================================================
     Test          : stream-100
     Units         : Normalized Bandwidth, MB/s
     Interpretation: Higher is better
     Statistic     : HMean
     ==================================================================
     Test:       tip[pct imp](CV)       ttwu_opt[pct imp](CV)
      Copy     1.00 [  0.00]( 3.91)     0.98 [ -1.84]( 2.07)
     Scale     1.00 [  0.00]( 4.34)     0.96 [ -3.80]( 6.38)
       Add     1.00 [  0.00]( 4.14)     0.97 [ -3.04]( 6.31)
     Triad     1.00 [  0.00]( 1.00)     0.98 [ -2.36]( 2.60)


     ==================================================================
     Test          : netperf
     Units         : Normalized Throughput
     Interpretation: Higher is better
     Statistic     : AMean
     ==================================================================
     Clients:         tip[pct imp](CV)       ttwu_opt[pct imp](CV)
      1-clients     1.00 [  0.00]( 0.41)     1.06 [  5.63]( 1.17)
      2-clients     1.00 [  0.00]( 0.58)     1.06 [  6.25]( 0.85)
      4-clients     1.00 [  0.00]( 0.35)     1.06 [  5.59]( 0.49)
      8-clients     1.00 [  0.00]( 0.48)     1.06 [  5.76]( 0.81)
     16-clients     1.00 [  0.00]( 0.66)     1.06 [  5.95]( 0.69)
     32-clients     1.00 [  0.00]( 1.15)     1.06 [  5.84]( 1.34)
     64-clients     1.00 [  0.00]( 1.38)     1.05 [  5.20]( 1.50)
     128-clients    1.00 [  0.00]( 0.87)     1.04 [  4.39]( 1.03)
     256-clients    1.00 [  0.00]( 5.36)     1.00 [  0.10]( 3.48)
     512-clients    1.00 [  0.00](54.39)     0.98 [ -1.93](52.45)


     ==================================================================
     Test          : schbench
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)       ttwu_opt[pct imp](CV)
       1     1.00 [ -0.00]( 8.54)     0.89 [ 10.87](35.39)
       2     1.00 [ -0.00]( 1.15)     0.88 [ 12.00]( 4.55)
       4     1.00 [ -0.00](13.46)     0.96 [  4.17](10.60)
       8     1.00 [ -0.00]( 7.14)     0.84 [ 15.79]( 8.44)
      16     1.00 [ -0.00]( 3.49)     1.08 [ -8.47]( 4.69)
      32     1.00 [ -0.00]( 1.06)     1.10 [ -9.57]( 2.91)
      64     1.00 [ -0.00]( 5.48)     1.25 [-25.00]( 5.36)
     128     1.00 [ -0.00](10.45)     1.18 [-17.99](12.54)
     256     1.00 [ -0.00](31.14)     1.28 [-27.79](17.66)
     512     1.00 [ -0.00]( 1.52)     1.01 [ -0.51]( 2.78)


     ==================================================================
     Test          : new-schbench-requests-per-second
     Units         : Normalized Requests per second
     Interpretation: Higher is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)       ttwu_opt[pct imp](CV)
       1     1.00 [  0.00]( 1.07)     1.00 [  0.29]( 0.00)
       2     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.15)
       4     1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.15)
       8     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.15)
      16     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)
      32     1.00 [  0.00]( 3.41)     0.99 [ -0.95]( 2.06)
      64     1.00 [  0.00]( 1.05)     0.92 [ -7.58]( 9.01)
     128     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)
     256     1.00 [  0.00]( 0.72)     1.00 [ -0.31]( 0.42)
     512     1.00 [  0.00]( 0.57)     1.00 [  0.00]( 0.45)


     ==================================================================
     Test          : new-schbench-wakeup-latency
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)       ttwu_opt[pct imp](CV)
       1     1.00 [ -0.00]( 9.11)     0.75 [ 25.00](11.08)
       2     1.00 [ -0.00]( 0.00)     1.00 [ -0.00]( 3.78)
       4     1.00 [ -0.00]( 3.78)     0.93 [  7.14]( 3.87)
       8     1.00 [ -0.00]( 0.00)     1.08 [ -8.33](12.91)
      16     1.00 [ -0.00]( 7.56)     0.92 [  7.69](11.71)
      32     1.00 [ -0.00](15.11)     1.07 [ -6.67]( 3.30)
      64     1.00 [ -0.00]( 9.63)     1.00 [ -0.00]( 8.15)
     128     1.00 [ -0.00]( 4.86)     0.89 [ 11.06]( 7.83)
     256     1.00 [ -0.00]( 2.34)     1.00 [  0.20]( 0.10)
     512     1.00 [ -0.00]( 0.40)     1.00 [  0.38]( 0.20)


     ==================================================================
     Test          : new-schbench-request-latency
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)       ttwu_opt[pct imp](CV)
       1     1.00 [ -0.00]( 2.73)     0.98 [  2.08]( 1.04)
       2     1.00 [ -0.00]( 0.87)     1.05 [ -5.40]( 3.10)
       4     1.00 [ -0.00]( 1.21)     0.99 [  0.54]( 1.27)
       8     1.00 [ -0.00]( 0.27)     0.99 [  0.79]( 2.14)
      16     1.00 [ -0.00]( 4.04)     1.01 [ -0.53]( 0.55)
      32     1.00 [ -0.00]( 7.35)     1.10 [ -9.97](21.10)
      64     1.00 [ -0.00]( 3.54)     1.03 [ -2.89]( 1.55)
     128     1.00 [ -0.00]( 0.37)     0.99 [  0.62]( 0.00)
     256     1.00 [ -0.00]( 9.57)     0.92 [  8.36]( 2.22)
     512     1.00 [ -0.00]( 1.82)     1.01 [ -1.23]( 0.94)


     ==================================================================
     Test          : Various longer running benchmarks
     Units         : %diff in throughput reported
     Interpretation: Higher is better
     Statistic     : Median
     ==================================================================
     Benchmarks:                 %diff
     ycsb-cassandra              -0.05%
     ycsb-mongodb                -0.80%

     deathstarbench-1x            2.44%
     deathstarbench-2x            5.47%
     deathstarbench-3x            0.36%
     deathstarbench-6x            1.14%

     hammerdb+mysql 16VU          1.08%
     hammerdb+mysql 64VU         -0.43%

> 
> Anyway, the patches are stable (finally!, I hope, knock on wood) but in a
> somewhat rough state. At the very least the last patch is missing ttwu_stat(),
> still need to figure out how to account it ;-)
> 

Since TTWU_QUEUE_DELAYED is off by default, feel free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

if you are planning on retaining the current defaults for the
SCHED_FEATs. I'll get back with numbers for TTWU_QUEUE_DELAYED and
TTWU_QUEUE_DEFAULT soon.

-- 
Thanks and Regards,
Prateek
Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
Posted by K Prateek Nayak 6 months, 1 week ago
Hello Peter,

On 6/2/2025 10:14 AM, K Prateek Nayak wrote:
> Hello Peter,
> 
> On 5/20/2025 3:15 PM, Peter Zijlstra wrote:
>> As can be seen, the SPR is much easier to please than the SKL for whatever
>> reason. I'm thinking we can make TTWU_QUEUE_DELAYED default on, but I suspect
>> TTWU_QUEUE_DEFAULT might be a harder sell -- we'd need to run more than this
>> one benchmark.
> 
> I haven't tried toggling any of the newly added SCHED_FEAT() yet.

Here are the full results:

tl;dr:

- schbench (old) has a consistent regression for 16, 32, 64,
   128, and 256 workers (> CCX size, < overloaded), except for the
   256-workers case with TTWU_QUEUE_DEFAULT, which shows an
   improvement.

- new schbench has a few regressions around 32, 64, and 128
   workers for wakeup and request latency.

- Most other benchmarks show minor improvements /
   regressions, but nothing serious.

o Variants

"DELAYED" enables "TTWU_QUEUE_DELAYED" alone, "DEFAULT" enables
"TTWU_QUEUE_DEFAULT" alone, and "BOTH" variant enables both.
vanilla was shared previously which is same as out of box with no
changes made to the sched features.
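
For the record, selecting a variant boils down to flipping the corresponding
sched features before the run (a sketch, assuming SCHED_DEBUG; shown for the
BOTH case):

  echo TTWU_QUEUE_DELAYED > /sys/kernel/debug/sched/features
  echo TTWU_QUEUE_DEFAULT > /sys/kernel/debug/sched/features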


o Benchmark numbers

     ==================================================================
     Test          : hackbench
     Units         : Normalized time in seconds
     Interpretation: Lower is better
     Statistic     : AMean
     ==================================================================
     Case:           tip[pct imp](CV)       vanilla[pct imp](CV)     DELAYED[pct imp](CV)     DEFAULT[pct imp](CV)      BOTH[pct imp](CV)
      1-groups     1.00 [ -0.00](13.74)     0.92 [  7.68]( 6.04)     0.95 [  5.12](10.12)     1.02 [ -1.92]( 6.70)     0.95 [  4.90]( 5.28)
      2-groups     1.00 [ -0.00]( 9.58)     1.04 [ -3.56]( 4.96)     1.03 [ -3.12]( 5.12)     0.98 [  1.56]( 4.30)     1.01 [ -1.11]( 5.78)
      4-groups     1.00 [ -0.00]( 2.10)     1.01 [ -1.30]( 2.27)     1.01 [ -1.09]( 2.68)     1.00 [ -0.43]( 2.58)     1.01 [ -0.65]( 1.38)
      8-groups     1.00 [ -0.00]( 1.51)     0.99 [  1.26]( 1.70)     0.99 [  0.95]( 4.92)     0.97 [  3.15]( 1.60)     1.00 [ -0.00]( 3.67)
     16-groups     1.00 [ -0.00]( 1.10)     0.97 [  3.01]( 1.62)     0.96 [  3.77]( 1.42)     0.95 [  4.60]( 0.67)     0.96 [  4.44]( 1.10)
     
     
     ==================================================================
     Test          : tbench
     Units         : Normalized throughput
     Interpretation: Higher is better
     Statistic     : AMean
     ==================================================================
     Clients:    tip[pct imp](CV)       vanilla[pct imp](CV)     DELAYED[pct imp](CV)     DEFAULT[pct imp](CV)      BOTH[pct imp](CV)
         1     1.00 [  0.00]( 0.82)     1.04 [  4.33]( 1.84)     1.06 [  5.97]( 0.42)     1.06 [  6.12]( 1.02)     1.06 [  5.54]( 0.73)
         2     1.00 [  0.00]( 1.13)     1.06 [  5.52]( 1.04)     1.07 [  7.17]( 0.42)     1.07 [  6.81]( 0.30)     1.08 [  7.96]( 0.39)
         4     1.00 [  0.00]( 1.12)     1.05 [  5.41]( 0.53)     1.07 [  7.39]( 0.67)     1.06 [  6.45]( 0.91)     1.07 [  7.36]( 0.63)
         8     1.00 [  0.00]( 0.93)     1.06 [  5.72]( 0.47)     1.07 [  6.90]( 0.24)     1.07 [  7.09]( 1.45)     1.07 [  6.94]( 0.45)
        16     1.00 [  0.00]( 0.38)     1.07 [  6.99]( 0.50)     1.05 [  4.95]( 0.98)     1.05 [  5.39]( 0.71)     1.05 [  5.43]( 1.05)
        32     1.00 [  0.00]( 0.66)     1.05 [  4.68]( 1.79)     1.06 [  5.70]( 0.54)     1.07 [  6.93]( 2.39)     1.03 [  3.17]( 1.06)
        64     1.00 [  0.00]( 1.18)     1.06 [  5.53]( 0.37)     1.04 [  4.05]( 0.84)     1.07 [  7.35]( 1.57)     1.06 [  5.62]( 1.13)
       128     1.00 [  0.00]( 1.12)     1.06 [  5.52]( 0.13)     1.05 [  4.94]( 0.75)     1.08 [  7.56]( 0.81)     1.05 [  4.80]( 0.55)
       256     1.00 [  0.00]( 0.42)     0.99 [ -0.83]( 1.01)     0.99 [ -0.58]( 0.57)     1.00 [  0.06]( 0.68)     1.00 [  0.03]( 1.47)
       512     1.00 [  0.00]( 0.14)     1.01 [  1.06]( 0.13)     1.02 [  1.67]( 0.18)     1.03 [  2.62]( 0.28)     1.02 [  2.17]( 0.33)
      1024     1.00 [  0.00]( 0.26)     1.02 [  1.82]( 0.41)     1.02 [  2.48]( 0.27)     1.03 [  3.38]( 0.37)     1.01 [  1.39]( 0.03)
     
     
     ==================================================================
     Test          : stream-10
     Units         : Normalized Bandwidth, MB/s
     Interpretation: Higher is better
     Statistic     : HMean
     ==================================================================
     Test:       tip[pct imp](CV)       vanilla[pct imp](CV)     DELAYED[pct imp](CV)     DEFAULT[pct imp](CV)      BOTH[pct imp](CV)
      Copy     1.00 [  0.00]( 8.37)     0.97 [ -2.79]( 9.17)     0.99 [ -1.29]( 4.68)     1.01 [  1.25]( 4.86)     0.99 [ -0.66]( 9.29)
     Scale     1.00 [  0.00]( 2.85)     1.00 [  0.12]( 2.91)     0.99 [ -1.34]( 5.55)     1.00 [ -0.20]( 3.38)     0.98 [ -2.09]( 5.33)
       Add     1.00 [  0.00]( 3.39)     0.98 [ -2.36]( 4.85)     0.98 [ -2.32]( 5.23)     1.00 [  0.10]( 3.17)     0.98 [ -1.99]( 4.73)
     Triad     1.00 [  0.00]( 6.39)     1.01 [  1.45]( 8.42)     1.00 [ -0.38]( 8.28)     1.05 [  4.69]( 5.66)     1.06 [  6.02]( 4.53)
     
     
     ==================================================================
     Test          : stream-100
     Units         : Normalized Bandwidth, MB/s
     Interpretation: Higher is better
     Statistic     : HMean
     ==================================================================
     Test:       tip[pct imp](CV)       vanilla[pct imp](CV)     DELAYED[pct imp](CV)     DEFAULT[pct imp](CV)      BOTH[pct imp](CV)
      Copy     1.00 [  0.00]( 3.91)     0.98 [ -1.84]( 2.07)     0.98 [ -2.06]( 6.75)     1.01 [  1.31]( 2.86)     1.02 [  2.12]( 3.30)
     Scale     1.00 [  0.00]( 4.34)     0.96 [ -3.80]( 6.38)     0.97 [ -2.88]( 6.99)     0.97 [ -2.62]( 5.70)     1.00 [ -0.37]( 3.94)
       Add     1.00 [  0.00]( 4.14)     0.97 [ -3.04]( 6.31)     0.97 [ -3.14]( 6.91)     0.99 [ -0.79]( 4.24)     1.00 [ -0.35]( 4.06)
     Triad     1.00 [  0.00]( 1.00)     0.98 [ -2.36]( 2.60)     0.96 [ -3.80]( 6.15)     0.99 [ -0.61]( 1.33)     0.97 [ -3.05]( 5.48)
     
     
     ==================================================================
     Test          : netperf
     Units         : Normalized Throughput
     Interpretation: Higher is better
     Statistic     : AMean
     ==================================================================
     Clients:         tip[pct imp](CV)       vanilla[pct imp](CV)     DELAYED[pct imp](CV)     DEFAULT[pct imp](CV)      BOTH[pct imp](CV)
      1-clients     1.00 [  0.00]( 0.41)     1.06 [  5.63]( 1.17)     1.06 [  6.03]( 0.53)     1.09 [  8.63]( 0.79)     1.06 [  6.36]( 0.09)
      2-clients     1.00 [  0.00]( 0.58)     1.06 [  6.25]( 0.85)     1.05 [  5.47]( 0.83)     1.08 [  8.24]( 1.29)     1.05 [  5.15]( 0.57)
      4-clients     1.00 [  0.00]( 0.35)     1.06 [  5.59]( 0.49)     1.05 [  5.06]( 0.65)     1.08 [  8.15]( 0.82)     1.05 [  5.46]( 0.62)
      8-clients     1.00 [  0.00]( 0.48)     1.06 [  5.76]( 0.81)     1.05 [  5.26]( 0.71)     1.08 [  8.19]( 0.60)     1.05 [  5.34]( 0.80)
     16-clients     1.00 [  0.00]( 0.66)     1.06 [  5.95]( 0.69)     1.06 [  5.52]( 0.78)     1.08 [  8.31]( 0.86)     1.06 [  5.76]( 0.48)
     32-clients     1.00 [  0.00]( 1.15)     1.06 [  5.84]( 1.34)     1.06 [  5.57]( 0.96)     1.08 [  8.30]( 0.90)     1.06 [  5.66]( 1.45)
     64-clients     1.00 [  0.00]( 1.38)     1.05 [  5.20]( 1.50)     1.05 [  4.67]( 1.39)     1.07 [  7.43]( 1.47)     1.05 [  5.18]( 1.48)
     128-clients    1.00 [  0.00]( 0.87)     1.04 [  4.39]( 1.03)     1.04 [  4.43]( 0.98)     1.06 [  5.98]( 1.01)     1.05 [  4.60]( 1.06)
     256-clients    1.00 [  0.00]( 5.36)     1.00 [  0.10]( 3.48)     1.00 [  0.09]( 4.22)     1.01 [  0.71]( 3.18)     1.01 [  1.25]( 3.69)
     512-clients    1.00 [  0.00](54.39)     0.98 [ -1.93](52.45)     1.00 [ -0.35](53.30)     1.02 [  1.75](54.93)     1.02 [  1.76](55.71)
     
     
     ==================================================================
     Test          : schbench
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)       vanilla[pct imp](CV)     DELAYED[pct imp](CV)     DEFAULT[pct imp](CV)      BOTH[pct imp](CV)
       1     1.00 [ -0.00]( 8.54)     0.89 [ 10.87](35.39)     0.78 [ 21.74](34.41)     0.91 [  8.70](12.44)     0.72 [ 28.26](26.70)
       2     1.00 [ -0.00]( 1.15)     0.88 [ 12.00]( 4.55)     0.78 [ 22.00]( 6.61)     0.90 [ 10.00]( 5.75)     0.82 [ 18.00](17.98)
       4     1.00 [ -0.00](13.46)     0.96 [  4.17](10.60)     1.00 [ -0.00]( 8.54)     0.96 [  4.17]( 3.30)     0.98 [  2.08]( 8.19)
       8     1.00 [ -0.00]( 7.14)     0.84 [ 15.79]( 8.44)     0.98 [  1.75]( 3.67)     0.95 [  5.26]( 4.99)     0.91 [  8.77]( 2.92)
      16     1.00 [ -0.00]( 3.49)     1.08 [ -8.47]( 4.69)     1.07 [ -6.78]( 0.92)     1.07 [ -6.78]( 0.91)     1.07 [ -6.78]( 3.27)
      32     1.00 [ -0.00]( 1.06)     1.10 [ -9.57]( 2.91)     1.07 [ -7.45]( 2.97)     1.07 [ -7.45]( 4.23)     1.05 [ -5.32]( 7.80)
      64     1.00 [ -0.00]( 5.48)     1.25 [-25.00]( 5.36)     1.17 [-17.44]( 1.44)     1.23 [-23.26]( 2.79)     1.20 [-19.77]( 2.19)
     128     1.00 [ -0.00](10.45)     1.18 [-17.99](12.54)     1.16 [-16.36](21.21)     1.13 [-12.85](12.71)     1.09 [ -8.64]( 3.05)
     256     1.00 [ -0.00](31.14)     1.28 [-27.79](17.66)     0.84 [ 16.21](32.14)     1.19 [-19.21]( 1.68)     1.07 [ -6.86]( 7.48)
     512     1.00 [ -0.00]( 1.52)     1.01 [ -0.51]( 2.78)     0.97 [  3.03]( 2.91)     0.98 [  1.77]( 1.07)     1.01 [ -0.51]( 1.01)
     
     
     ==================================================================
     Test          : new-schbench-requests-per-second
     Units         : Normalized Requests per second
     Interpretation: Higher is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)       vanilla[pct imp](CV)     DELAYED[pct imp](CV)     DEFAULT[pct imp](CV)      BOTH[pct imp](CV)
       1     1.00 [  0.00]( 1.07)     1.00 [  0.29]( 0.00)     1.00 [  0.29]( 0.15)     0.99 [ -0.59]( 0.46)     1.00 [  0.29]( 0.30)
       2     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)
       4     1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.15)     1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.15)     1.00 [  0.00]( 0.00)
       8     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.15)     1.00 [  0.29]( 0.00)     1.00 [  0.00]( 0.40)     1.00 [  0.29]( 0.15)
      16     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.00)
      32     1.00 [  0.00]( 3.41)     0.99 [ -0.95]( 2.06)     0.98 [ -2.23]( 3.41)     0.98 [ -2.23]( 3.31)     1.03 [  2.54]( 0.32)
      64     1.00 [  0.00]( 1.05)     0.92 [ -7.58]( 9.01)     0.86 [-13.92](11.30)     1.00 [  0.00]( 4.74)     1.00 [ -0.38]( 9.98)
     128     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.00 [  0.38]( 0.00)
     256     1.00 [  0.00]( 0.72)     1.00 [ -0.31]( 0.42)     1.01 [  1.23]( 1.33)     1.01 [  0.61]( 0.83)     1.01 [  0.92]( 1.36)
     512     1.00 [  0.00]( 0.57)     1.00 [  0.00]( 0.45)     0.99 [ -0.72]( 1.18)     1.00 [  0.48]( 0.33)     1.01 [  1.44]( 0.49)
     
     
     ==================================================================
     Test          : new-schbench-wakeup-latency
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)       vanilla[pct imp](CV)     DELAYED[pct imp](CV)     DEFAULT[pct imp](CV)      BOTH[pct imp](CV)
       1     1.00 [ -0.00]( 9.11)     0.75 [ 25.00](11.08)     0.69 [ 31.25]( 8.13)     0.75 [ 25.00](11.08)     0.62 [ 37.50]( 8.94)
       2     1.00 [ -0.00]( 0.00)     1.00 [ -0.00]( 3.78)     0.86 [ 14.29]( 7.45)     0.93 [  7.14]( 3.87)     0.79 [ 21.43]( 4.84)
       4     1.00 [ -0.00]( 3.78)     0.93 [  7.14]( 3.87)     0.79 [ 21.43]( 4.56)     0.93 [  7.14]( 0.00)     0.79 [ 21.43]( 8.85)
       8     1.00 [ -0.00]( 0.00)     1.08 [ -8.33](12.91)     0.92 [  8.33]( 0.00)     0.83 [ 16.67](18.23)     1.08 [ -8.33](12.91)
      16     1.00 [ -0.00]( 7.56)     0.92 [  7.69](11.71)     0.85 [ 15.38](12.06)     1.08 [ -7.69](11.92)     0.85 [ 15.38](12.91)
      32     1.00 [ -0.00](15.11)     1.07 [ -6.67]( 3.30)     1.00 [ -0.00](19.06)     1.00 [ -0.00](15.11)     0.80 [ 20.00]( 4.43)
      64     1.00 [ -0.00]( 9.63)     1.00 [ -0.00]( 8.15)     1.00 [ -0.00]( 5.34)     1.05 [ -5.00]( 7.75)     0.90 [ 10.00]( 9.94)
     128     1.00 [ -0.00]( 4.86)     0.89 [ 11.06]( 7.83)     0.91 [  8.54]( 7.87)     0.88 [ 12.06]( 8.73)     0.86 [ 14.07]( 5.01)
     256     1.00 [ -0.00]( 2.34)     1.00 [  0.20]( 0.10)     1.04 [ -4.50]( 4.59)     1.03 [ -2.90]( 1.95)     1.04 [ -3.70]( 4.13)
     512     1.00 [ -0.00]( 0.40)     1.00 [  0.38]( 0.20)     1.00 [  0.38]( 0.20)     0.99 [  0.77]( 0.20)     1.00 [ -0.00]( 0.40)
     
     
     ==================================================================
     Test          : new-schbench-request-latency
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)       vanilla[pct imp](CV)     DELAYED[pct imp](CV)     DEFAULT[pct imp](CV)      BOTH[pct imp](CV)
       1     1.00 [ -0.00]( 2.73)     0.98 [  2.08]( 1.04)     0.99 [  1.30]( 1.07)     1.02 [ -1.82]( 0.00)     1.01 [ -1.30]( 3.10)
       2     1.00 [ -0.00]( 0.87)     1.05 [ -5.40]( 3.10)     1.02 [ -1.89]( 1.58)     1.01 [ -1.08]( 2.76)     1.02 [ -1.62]( 1.45)
       4     1.00 [ -0.00]( 1.21)     0.99 [  0.54]( 1.27)     0.99 [  1.08]( 1.67)     1.01 [ -1.21]( 1.21)     1.01 [ -1.35]( 1.91)
       8     1.00 [ -0.00]( 0.27)     0.99 [  0.79]( 2.14)     0.98 [  2.37]( 0.72)     0.99 [  1.05]( 2.53)     0.99 [  0.79]( 1.12)
      16     1.00 [ -0.00]( 4.04)     1.01 [ -0.53]( 0.55)     1.01 [ -0.80]( 1.08)     1.00 [ -0.27]( 0.36)     0.99 [  0.53]( 0.50)
      32     1.00 [ -0.00]( 7.35)     1.10 [ -9.97](21.10)     1.01 [ -0.66](10.27)     1.25 [-25.36](21.41)     0.90 [  9.52]( 2.08)
      64     1.00 [ -0.00]( 3.54)     1.03 [ -2.89]( 1.55)     1.02 [ -2.00]( 0.98)     1.01 [ -0.67]( 3.62)     1.01 [ -0.89]( 4.98)
     128     1.00 [ -0.00]( 0.37)     0.99 [  0.62]( 0.00)     0.99 [  0.72]( 0.11)     0.99 [  0.62]( 0.11)     0.99 [  0.83]( 0.11)
     256     1.00 [ -0.00]( 9.57)     0.92 [  8.36]( 2.22)     1.03 [ -3.11](12.58)     1.05 [ -5.02]( 8.36)     1.00 [ -0.00](11.71)
     512     1.00 [ -0.00]( 1.82)     1.01 [ -1.23]( 0.94)     1.02 [ -2.45]( 1.53)     1.00 [  0.35]( 0.83)     1.02 [ -1.93]( 1.40)
     
     ==================================================================
     Test          : Various longer running benchmarks
     Units         : %diff in throughput reported
     Interpretation: Higher is better
     Statistic     : Median
     ==================================================================
     Benchmarks:                 vanilla     DELAYED   DEFAULT    BOTH
     ycsb-cassandra              -0.05%       0.65%    -0.49%    -0.48%
     ycsb-mongodb                -0.80%      -0.85%    -1.00%    -0.98%
      
     deathstarbench-1x            2.44%       1.54%     1.65%     0.18%
     deathstarbench-2x            5.47%       4.88%     7.92%     6.75%
     deathstarbench-3x            0.36%       1.74%    -1.75%     0.31%
     deathstarbench-6x            1.14%       1.94%     2.24%     1.58%
     
     hammerdb+mysql 16VU          1.08%       5.21%     2.69%     3.80%
     hammerdb+mysql 64VU         -0.43%      -0.31%     2.12%    -0.25%


-- 
Thanks and Regards,
Prateek
Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
Posted by Peter Zijlstra 6 months, 1 week ago
On Fri, Jun 13, 2025 at 08:58:56AM +0530, K Prateek Nayak wrote:

> - schbench (old) has a consistent regression for 16, 32, 64,
>   128, and 256 workers (> CCX size, < overloaded), except for the
>   256-workers case with TTWU_QUEUE_DEFAULT, which shows an
>   improvement.
> 
> - new schbench has a few regressions around 32, 64, and 128
>   workers for wakeup and request latency.

Right, so I actually made Chris' favourite workloads worse with these
patches :/

Let me go try this again...