[PATCH v2 00/12] sched: Address schbench regression
Posted by Peter Zijlstra 3 months, 1 week ago
Hi!

Previous version:

  https://lkml.kernel.org/r/20250520094538.086709102@infradead.org


Changes:
 - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
 - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
 - fixed lockdep splat (dietmar)
 - added a few preparatory patches


Patches apply on top of tip/master (which includes the disabling of private futex)
and clm's newidle balance patch (which I'm awaiting vingu's ack on).

Performance is similar to the last version; as tested on my SPR on v6.15 base:

v6.15:
schbench-6.15.0-1.txt:average rps: 2891403.72
schbench-6.15.0-2.txt:average rps: 2889997.02
schbench-6.15.0-3.txt:average rps: 2894745.17

v6.15 + patches 1-10:
schbench-6.15.0-dirty-4.txt:average rps: 3038265.95
schbench-6.15.0-dirty-5.txt:average rps: 3037327.50
schbench-6.15.0-dirty-6.txt:average rps: 3038160.15

v6.15 + all patches:
schbench-6.15.0-dirty-deferred-1.txt:average rps: 3043404.30
schbench-6.15.0-dirty-deferred-2.txt:average rps: 3046124.17
schbench-6.15.0-dirty-deferred-3.txt:average rps: 3043627.10


Patches can also be had here:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core


I'm hoping we can get this merged for next cycle so we can all move on from this.

Re: [PATCH v2 00/12] sched: Address schbench regression
Posted by Shrikanth Hegde 3 months ago

On 7/2/25 17:19, Peter Zijlstra wrote:
> Hi!
> 
> Previous version:
> 
>    https://lkml.kernel.org/r/20250520094538.086709102@infradead.org
> 
> 
> Changes:
>   - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
>   - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
>   - fixed lockdep splat (dietmar)
>   - added a few preperatory patches
> 
> 
> Patches apply on top of tip/master (which includes the disabling of private futex)
> and clm's newidle balance patch (which I'm awaiting vingu's ack on).
> 
> Performance is similar to the last version; as tested on my SPR on v6.15 base:
>


Hi Peter,
Gave this a spin on a 5-core (SMT8) PowerPC system.

I see a significant regression in schbench. Let me know if I have to test a different
number of threads based on the system size.
Will go through the series and try a bisect meanwhile.


schbench command used, varying thread groups over 16, 32, 64, 128:
./schbench -L -m 4 -M auto -n 0 -r 60 -t <thread_groups>


base: commit 8784fb5fa2e0042fe3b1632d4876e1037b695f56 (origin/master, origin/HEAD)
Merge: 11119b0b378a 94b59d5f567a
Author: Borislav Petkov (AMD) <bp@alien8.de>
Date:   Sat Jul 5 19:24:35 2025 +0200

     Merge irq/drivers into tip/master


====================================
16 threads   base       base+series
====================================
(figures in parentheses here and below are % change vs base; negative means worse)
                              
Wakeup Latencies percentiles (usec) runtime 30 (s)
50.0th:       7.20,      12.40(-72.22)
90.0th:      14.00,      32.60(-132.86)
99.0th:      23.80,      56.00(-135.29)
99.9th:      33.80,      74.80(-121.30)

RPS percentiles (requests) runtime 30 (s)
20.0th:  381235.20,  350720.00(-8.00)
50.0th:  382054.40,  353996.80(-7.34)
90.0th:  382464.00,  356044.80(-6.91)

====================================
32 threads   base       base+series
====================================
Wakeup Latencies percentiles (usec) runtime 30 (s)
50.0th:       9.00,      47.60(-428.89)
90.0th:      19.00,     104.00(-447.37)
99.0th:      32.00,     144.20(-350.62)
99.9th:      46.00,     178.20(-287.39)

RPS percentiles (requests) runtime 30 (s)
20.0th:  763699.20,  515379.20(-32.52)
50.0th:  764928.00,  519168.00(-32.13)
90.0th:  766156.80,  530227.20(-30.79)


====================================
64 threads   base       base+series
====================================
Wakeup Latencies percentiles (usec) runtime 30 (s)
50.0th:      13.40,     112.80(-741.79)
90.0th:      25.00,     216.00(-764.00)
99.0th:      38.40,     282.00(-634.38)
99.9th:      60.00,     331.40(-452.33)

RPS percentiles (requests) runtime 30 (s)
20.0th: 1500364.80,  689152.00(-54.07)
50.0th: 1501184.00,  693248.00(-53.82)
90.0th: 1502822.40,  695296.00(-53.73)


====================================
128 threads   base       base+series
====================================
Wakeup Latencies percentiles (usec) runtime 30 (s)
50.0th:      22.00,     168.80(-667.27)
90.0th:      43.60,     320.60(-635.32)
99.0th:      71.40,     395.60(-454.06)
99.9th:     100.00,     445.40(-345.40)

RPS percentiles (requests) runtime 30 (s)
20.0th: 2686156.80, 1034854.40(-61.47)
50.0th: 2730393.60, 1057587.20(-61.27)
90.0th: 2763161.60, 1084006.40(-60.77)

Re: [PATCH v2 00/12] sched: Address schbench regression
Posted by Chris Mason 3 months ago
On 7/7/25 5:05 AM, Shrikanth Hegde wrote:
> 
> 
> On 7/2/25 17:19, Peter Zijlstra wrote:
>> Hi!
>>
>> Previous version:
>>
>>    https://lkml.kernel.org/r/20250520094538.086709102@infradead.org
>>
>> Changes:
>>   - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
>>   - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
>>   - fixed lockdep splat (dietmar)
>>   - added a few preperatory patches
>>
>>
>> Patches apply on top of tip/master (which includes the disabling of
>> private futex)
>> and clm's newidle balance patch (which I'm awaiting vingu's ack on).
>>
>> Performance is similar to the last version; as tested on my SPR on
>> v6.15 base:
>>
> 
> 
> Hi Peter,
> Gave this a spin on a machine with 5 cores (SMT8) PowerPC system.
> 
> I see significant regression in schbench. let me know if i have to test
> different
> number of threads based on the system size.
> Will go through the series and will try a bisect meanwhile.

Not questioning the git bisect results you had later in this thread, but
double checking that you had the newidle balance patch in place that
Peter mentioned?

https://lore.kernel.org/lkml/20250626144017.1510594-2-clm@fb.com/

The newidle balance frequency changes the cost of everything else, so I
wanted to make sure we were measuring the same things.

Thanks!

-chris

Re: [PATCH v2 00/12] sched: Address schbench regression
Posted by Shrikanth Hegde 3 months ago

On 7/8/25 20:39, Chris Mason wrote:
> On 7/7/25 5:05 AM, Shrikanth Hegde wrote:
>>
>>
>> On 7/2/25 17:19, Peter Zijlstra wrote:
>>> Hi!
>>>
>>> Previous version:
>>>
>>>     https://lkml.kernel.org/r/20250520094538.086709102@infradead.org
>>>
>>> Changes:
>>>    - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
>>>    - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
>>>    - fixed lockdep splat (dietmar)
>>>    - added a few preperatory patches
>>>
>>>
>>> Patches apply on top of tip/master (which includes the disabling of
>>> private futex)
>>> and clm's newidle balance patch (which I'm awaiting vingu's ack on).
>>>
>>> Performance is similar to the last version; as tested on my SPR on
>>> v6.15 base:
>>>
>>
>>
>> Hi Peter,
>> Gave this a spin on a machine with 5 cores (SMT8) PowerPC system.
>>
>> I see significant regression in schbench. let me know if i have to test
>> different
>> number of threads based on the system size.
>> Will go through the series and will try a bisect meanwhile.
> 
> Not questioning the git bisect results you had later in this thread, but
> double checking that you had the newidle balance patch in place that
> Peter mentioned?
> 
> https://lore.kernel.org/lkml/20250626144017.1510594-2-clm@fb.com/
> 
> The newidle balance frequency changes the cost of everything else, so I
> wanted to make sure we were measuring the same things.
> 

Hi Chris.

It was base + series only, and base was 8784fb5fa2e0 ("Merge irq/drivers into tip/master").
So it didn't have your changes.

I tested again with your changes and I still see a major regression.


./schbench -L -m 4 -M auto -t 64 -n 0 -t 60 -i 60
Wakeup Latencies percentiles (usec) runtime 30 (s) (18848611 total samples)
	  50.0th: 115        (5721408 samples)
	  90.0th: 238        (7500535 samples)
	* 99.0th: 316        (1670597 samples)
	  99.9th: 360        (162283 samples)
	  min=1, max=1487

RPS percentiles (requests) runtime 30 (s) (31 total samples)
	  20.0th: 623616     (7 samples)
	* 50.0th: 629760     (15 samples)
	  90.0th: 631808     (7 samples)
	  min=617820, max=635475
average rps: 628514.30


git log --oneline
7aaf5ef0841b (HEAD) sched: Add ttwu_queue support for delayed tasks
f77b53b6766a sched: Change ttwu_runnable() vs sched_delayed
986ced69ba7b sched: Use lock guard in sched_ttwu_pending()
2c0eb5c88134 sched: Clean up ttwu comments
e1374ac7f74a sched: Re-arrange __ttwu_queue_wakelist()
7e673db9e90f psi: Split psi_ttwu_dequeue()
e2225f1c24a9 sched: Introduce ttwu_do_migrate()
80765734f127 sched: Add ttwu_queue controls
745406820d30 sched: Use lock guard in ttwu_runnable()
d320cebe6e28 sched: Optimize ttwu() / select_task_rq()
329fc7eaad76 sched/deadline: Less agressive dl_server handling
708281193493 sched/psi: Optimize psi_group_change() cpu_clock() usage
c28590ad7b91 sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails
8784fb5fa2e0 (origin/master, origin/HEAD) Merge irq/drivers into tip/master

Re: [PATCH v2 00/12] sched: Address schbench regression
Posted by Shrikanth Hegde 3 months ago

On 7/7/25 14:35, Shrikanth Hegde wrote:
> 
> 
> On 7/2/25 17:19, Peter Zijlstra wrote:
>> Hi!
>>
>> Previous version:
>>
>>    https://lkml.kernel.org/r/20250520094538.086709102@infradead.org
>>
>>
>> Changes:
>>   - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
>>   - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
>>   - fixed lockdep splat (dietmar)
>>   - added a few preperatory patches
>>
>>
>> Patches apply on top of tip/master (which includes the disabling of 
>> private futex)
>> and clm's newidle balance patch (which I'm awaiting vingu's ack on).
>>
>> Performance is similar to the last version; as tested on my SPR on 
>> v6.15 base:
>>
> 
> 
> Hi Peter,
> Gave this a spin on a machine with 5 cores (SMT8) PowerPC system.
> 
> I see significant regression in schbench. let me know if i have to test 
> different
> number of threads based on the system size.
> Will go through the series and will try a bisect meanwhile.
> 
> 

Used "./schbench -L -m 4 -M auto -t 64 -n 0 -t 60 -i 60" for git bisect.
Also kept HZ=1000


Git bisect points to
# first bad commit: [dc968ba0544889883d0912360dd72d90f674c140] sched: Add ttwu_queue support for delayed tasks

Note:
at commit "sched: Change ttwu_runnable() vs sched_delayed" there is a small regression.

-------------------------------------
Numbers at different commits:
-------------------------------------
commit 8784fb5fa2e0042fe3b1632d4876e1037b695f56      <<<< baseline
Merge: 11119b0b378a 94b59d5f567a
Author: Borislav Petkov (AMD) <bp@alien8.de>
Date:   Sat Jul 5 19:24:35 2025 +0200

     Merge irq/drivers into tip/master

Wakeup Latencies percentiles (usec) runtime 30 (s) (39778894 total samples)
           50.0th: 14         (11798914 samples)
           90.0th: 27         (15931329 samples)
         * 99.0th: 42         (3032865 samples)
           99.9th: 64         (346598 samples)

RPS percentiles (requests) runtime 30 (s) (31 total samples)
           20.0th: 1394688    (18 samples)
         * 50.0th: 1394688    (0 samples)
           90.0th: 1398784    (11 samples)

--------------------------------------

commit 88ca74dd6fe5d5b03647afb4698238e4bec3da39 (HEAD)       <<< Still good commit
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Jul 2 13:49:34 2025 +0200

     sched: Use lock guard in sched_ttwu_pending()

Wakeup Latencies percentiles (usec) runtime 30 (s) (40132792 total samples)
           50.0th: 14         (11986044 samples)
           90.0th: 27         (15143836 samples)
         * 99.0th: 46         (3267133 samples)
           99.9th: 72         (333940 samples)
RPS percentiles (requests) runtime 30 (s) (31 total samples)
           20.0th: 1402880    (23 samples)
         * 50.0th: 1402880    (0 samples)
           90.0th: 1406976    (8 samples)

-----------------------------------------------------------------

commit 755d11feca4544b4bc6933dcdef29c41585fa747        <<< There is a small regression.
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Jul 2 13:49:35 2025 +0200

     sched: Change ttwu_runnable() vs sched_delayed

Wakeup Latencies percentiles (usec) runtime 30 (s) (39308991 total samples)
           50.0th: 18         (12991812 samples)
           90.0th: 34         (14381736 samples)
         * 99.0th: 56         (3399332 samples)
           99.9th: 84         (342508 samples)

RPS percentiles (requests) runtime 30 (s) (31 total samples)
           20.0th: 1353728    (21 samples)
         * 50.0th: 1353728    (0 samples)
           90.0th: 1357824    (10 samples)

-----------------------------------------------------------

commit dc968ba0544889883d0912360dd72d90f674c140              <<<< Major regression
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Jul 2 13:49:36 2025 +0200

     sched: Add ttwu_queue support for delayed tasks

Wakeup Latencies percentiles (usec) runtime 30 (s) (19818598 total samples)
           50.0th: 111        (5891601 samples)
           90.0th: 214        (7947099 samples)
         * 99.0th: 283        (1749294 samples)
           99.9th: 329        (177336 samples)

RPS percentiles (requests) runtime 30 (s) (31 total samples)
           20.0th: 654336     (7 samples)
         * 50.0th: 660480     (11 samples)
           90.0th: 666624     (11 samples)
  


Re: [PATCH v2 00/12] sched: Address schbench regression
Posted by Peter Zijlstra 3 months ago
On Mon, Jul 07, 2025 at 11:49:17PM +0530, Shrikanth Hegde wrote:

> Git bisect points to
> # first bad commit: [dc968ba0544889883d0912360dd72d90f674c140] sched: Add ttwu_queue support for delayed tasks

Moo.. Are IPIs particularly expensive on your platform?

The 5 cores makes me think this is a partition of sorts, but IIRC the
power LPAR stuff was fixed physical, so routing interrupts shouldn't be
much more expensive vs native hardware.

> Note:
> at commit: "sched: Change ttwu_runnable() vs sched_delayed" there is a small regression.

Yes, that was more or less expected. I also see a dip because of that
patch, but it's small compared to the gains from the previous
patches -- so I was hoping I'd get away with it :-)

Re: [PATCH v2 00/12] sched: Address schbench regression
Posted by Shrikanth Hegde 2 months, 2 weeks ago

On 7/9/25 00:32, Peter Zijlstra wrote:
> On Mon, Jul 07, 2025 at 11:49:17PM +0530, Shrikanth Hegde wrote:
> 
>> Git bisect points to
>> # first bad commit: [dc968ba0544889883d0912360dd72d90f674c140] sched: Add ttwu_queue support for delayed tasks
> 
> Moo.. Are IPIs particularly expensive on your platform?
> 
>
It seems the cost of IPIs is likely what's hurting here.

IPI latency really depends on whether the CPU was busy, in a shallow idle state, or in a deep idle state.
When it is in a deep idle state, numbers show close to 5-8us on average on this small system.
When the system is busy (it could be running another schbench thread), it is around 1-2us.

Measured the time it took to take the remote rq lock in the baseline; that is only around 1-1.5us.
Also, here the LLC is small (per SMT4 core), so quite often the series would choose to send an IPI.


Did one more experiment: pinned worker and message threads such that wakeups always send an IPI.

NO_TTWU_QUEUE_DELAYED

./schbench -L -m 4 -M auto -t 64 -n 0 -r 5 -i 5
average rps: 1549224.72
./schbench -L -m 4 -M 0-3 -W 4-39 -t 64 -n 0 -r 5 -i 5
average rps: 1560839.00

TTWU_QUEUE_DELAYED

./schbench -L -m 4 -M auto -t 64 -n 0 -r 5 -i 5             << IPIs could be sent quite often ***
average rps: 959522.31
./schbench -L -m 4 -M 0-3 -W 4-39 -t 64 -n 0 -r 5 -i 5      << IPIs are always sent; (M,W) don't share cache.
average rps: 470865.00                                      << rps goes even lower


=================================

*** issues/observations in schbench.

Chris,

When one does -W auto or -M auto, I think the code is meant to run n message threads on the first n CPUs and worker
threads on the remaining CPUs?
I don't see that happening. The above behavior can be achieved only with -M <cpus> -W <cpus>.

         int i = 0;
         CPU_ZERO(m_cpus);
         for (int i = 0; i < m_threads; ++i) {
                 CPU_SET(i, m_cpus);
                 CPU_CLR(i, w_cpus);
         }
         for (; i < CPU_SETSIZE; i++) {             << here i refers to the one in scope, which is 0. Hence w_cpus is set for all CPUs,
                                                       and hence workers end up running on all CPUs even with -W auto
                 CPU_SET(i, w_cpus);
         }
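
A minimal sketch of a possible fix (names as in the snippet above; assuming the intent is that
w_cpus ends up as the complement of m_cpus): drop the shadowing declaration so the second loop
really continues from m_threads:

         int i;

         CPU_ZERO(m_cpus);
         for (i = 0; i < m_threads; ++i) {          /* message threads: first m_threads CPUs */
                 CPU_SET(i, m_cpus);
                 CPU_CLR(i, w_cpus);
         }
         for (; i < CPU_SETSIZE; i++)               /* workers: only the remaining CPUs */
                 CPU_SET(i, w_cpus);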


Another issue is that if CPU0 is offline, then auto pinning fails. Maybe no one cares about that case?

Re: [PATCH v2 00/12] sched: Address schbench regression
Posted by Chris Mason 2 months, 2 weeks ago
On 7/21/25 12:37 PM, Shrikanth Hegde wrote:

> *** issues/observations in schbench.
> 
> Chris,
> 
> When one does -W auto or -M auto i think code is meant to run, n message
> threads on first n CPUs and worker threads
> on remaining CPUs?
> I don't see that happening.  above behavior can be achieved only with -M
> <cpus> -W <cpus>
> 
>         int i = 0;
>         CPU_ZERO(m_cpus);
>         for (int i = 0; i < m_threads; ++i) {
>                 CPU_SET(i, m_cpus);
>                 CPU_CLR(i, w_cpus);
>         }
>         for (; i < CPU_SETSIZE; i++) {             << here i refers to
> the one in scope. which is 0. Hence w_cpus is set for all cpus.
>                                                       And hence workers
> end up running on all CPUs even with -W auto
>                 CPU_SET(i, w_cpus);
>         }

Oh, you're exactly right.  Fixing this up, thanks.  I'll do some runs to
see if this changes things on my test boxes as well.

> 
> 
> Another issue, is that if CPU0 if offline, then auto pinning fails.
> Maybe no one cares about that case?

The auto pinning is pretty simple right now; I'm planning on making it
numa/ccx aware.  Are CPUs offline often enough on test systems that we want to
worry about that?

-chris

Re: [PATCH v2 00/12] sched: Address schbench regression
Posted by Chris Mason 2 months, 2 weeks ago
On 7/22/25 1:20 PM, Chris Mason wrote:
> On 7/21/25 12:37 PM, Shrikanth Hegde wrote:
> 
>> *** issues/observations in schbench.
>>
>> Chris,
>>
>> When one does -W auto or -M auto i think code is meant to run, n message
>> threads on first n CPUs and worker threads
>> on remaining CPUs?
>> I don't see that happening.  above behavior can be achieved only with -M
>> <cpus> -W <cpus>
>>
>>         int i = 0;
>>         CPU_ZERO(m_cpus);
>>         for (int i = 0; i < m_threads; ++i) {
>>                 CPU_SET(i, m_cpus);
>>                 CPU_CLR(i, w_cpus);
>>         }
>>         for (; i < CPU_SETSIZE; i++) {             << here i refers to
>> the one in scope. which is 0. Hence w_cpus is set for all cpus.
>>                                                       And hence workers
>> end up running on all CPUs even with -W auto
>>                 CPU_SET(i, w_cpus);
>>         }
> 
> Oh, you're exactly right.  Fixing this up, thanks.  I'll do some runs to
> see if this changes things on my test boxes as well.

Fixing this makes it substantially slower (5.2M RPS -> 3.8M RPS), with
more time spent in select_task_rq().  I need to trace a bit to
understand if the message thread CPUs are actually getting used that
often for workers, or if the exclusion makes our idle CPU hunt slower
somehow.

-chris

Re: [PATCH v2 00/12] sched: Address schbench regression
Posted by Shrikanth Hegde 2 months, 3 weeks ago

On 7/9/25 00:32, Peter Zijlstra wrote:
> On Mon, Jul 07, 2025 at 11:49:17PM +0530, Shrikanth Hegde wrote:
> 
>> Git bisect points to
>> # first bad commit: [dc968ba0544889883d0912360dd72d90f674c140] sched: Add ttwu_queue support for delayed tasks
> 
> Moo.. Are IPIs particularly expensive on your platform?
> 
> The 5 cores makes me think this is a partition of sorts, but IIRC the
> power LPAR stuff was fixed physical, so routing interrupts shouldn't be
> much more expensive vs native hardware.
> 
Some more data from the regression. I am looking at rps numbers
while running ./schbench -L -m 4 -M auto -t 64 -n 0 -r 5 -i 5.
All the data is from an LPAR (VM) with 5 cores.


echo TTWU_QUEUE_DELAYED > features
average rps: 970491.00

echo NO_TTWU_QUEUE_DELAYED > features
current rps: 1555456.78
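
(For reference, the features file here is the scheduler debugfs knob; assuming debugfs is
mounted at the usual path, the toggles are:

	cd /sys/kernel/debug/sched
	echo TTWU_QUEUE_DELAYED > features        # enable
	echo NO_TTWU_QUEUE_DELAYED > features     # disable
	cat features                              # NO_-prefixed names are disabled features
)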

So the data points below are with the feature enabled or disabled, with the series + clm's patch applied.
-------------------------------------------------------
./hardirqs

TTWU_QUEUE_DELAYED
HARDIRQ                    TOTAL_usecs
env2                               816
IPI-2                          1421603       << less total IPI time than with NO_TTWU_QUEUE_DELAYED.


NO_TTWU_QUEUE_DELAYED
HARDIRQ                    TOTAL_usecs
ibmvscsi                             8
env2                               266
IPI-2                          6489980

-------------------------------------------------------

Disabled all the idle states. Regression still exists.

-------------------------------------------------------

I see this warning every time I run schbench. This happens with PATCH 12/12 only.

It is triggering this assert. Is some clock update getting messed up?

1637 static inline void assert_clock_updated(struct rq *rq)
1638 {
1639         /*
1640          * The only reason for not seeing a clock update since the
1641          * last rq_pin_lock() is if we're currently skipping updates.
1642          */
1643         WARN_ON_ONCE(rq->clock_update_flags < RQCF_ACT_SKIP);
1644 }
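
For reference, as I read kernel/sched/sched.h, the flag values this assert compares against are:

	#define RQCF_REQ_SKIP	0x01
	#define RQCF_ACT_SKIP	0x02
	#define RQCF_UPDATED	0x04

so the WARN fires when clock_update_flags is 0 or RQCF_REQ_SKIP, i.e. nothing has updated the
rq clock (and no skip is actually active) since rq_pin_lock().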
  

WARNING: kernel/sched/sched.h:1643 at update_load_avg+0x424/0x48c, CPU#6: swapper/6/0
CPU: 6 UID: 0 PID: 0 Comm: swapper/6 Kdump: loaded Not tainted 6.16.0-rc4+ #276 PREEMPT(voluntary)
NIP:  c0000000001cea60 LR: c0000000001d7254 CTR: c0000000001d77b0
REGS: c000000003a674c0 TRAP: 0700   Not tainted  (6.16.0-rc4+)
MSR:  8000000000021033 <SF,ME,IR,DR,RI,LE>  CR: 28008208  XER: 20040000
CFAR: c0000000001ce68c IRQMASK: 3
GPR00: c0000000001d7254 c000000003a67760 c000000001bc8100 c000000061915400
GPR04: c00000008c80f480 0000000000000005 c000000003a679b0 0000000000000000
GPR08: 0000000000000001 0000000000000000 c0000003ff14d480 0000000000004000
GPR12: c0000000001d77b0 c0000003ffff7880 0000000000000000 000000002eef18c0
GPR16: 0000000000000006 0000000000000006 0000000000000008 c000000002ca2468
GPR20: 0000000000000000 0000000000000004 0000000000000009 0000000000000001
GPR24: 0000000000000000 0000000000000001 0000000000000001 c0000003ff14d480
GPR28: 0000000000000001 0000000000000005 c00000008c80f480 c000000061915400
NIP [c0000000001cea60] update_load_avg+0x424/0x48c
LR [c0000000001d7254] enqueue_entity+0x5c/0x5b8
Call Trace:
[c000000003a67760] [c000000003a677d0] 0xc000000003a677d0 (unreliable)
[c000000003a677d0] [c0000000001d7254] enqueue_entity+0x5c/0x5b8
[c000000003a67880] [c0000000001d7918] enqueue_task_fair+0x168/0x7d8
[c000000003a678f0] [c0000000001b9554] enqueue_task+0x5c/0x1c8
[c000000003a67930] [c0000000001c3f40] ttwu_do_activate+0x98/0x2fc
[c000000003a67980] [c0000000001c4460] sched_ttwu_pending+0x2bc/0x72c
[c000000003a67a60] [c0000000002c16ac] __flush_smp_call_function_queue+0x1a0/0x750
[c000000003a67b10] [c00000000005e1c4] smp_ipi_demux_relaxed+0xec/0xf4
[c000000003a67b50] [c000000000057dd4] doorbell_exception+0xe0/0x25c
[c000000003a67b90] [c0000000000383d0] __replay_soft_interrupts+0xf0/0x154
[c000000003a67d40] [c000000000038684] arch_local_irq_restore.part.0+0x1cc/0x214
[c000000003a67d90] [c0000000001b6ec8] finish_task_switch.isra.0+0xb4/0x2f8
[c000000003a67e30] [c00000000110fb9c] __schedule+0x294/0x83c
[c000000003a67ee0] [c0000000011105f0] schedule_idle+0x3c/0x64
[c000000003a67f10] [c0000000001f27f0] do_idle+0x15c/0x1ac
[c000000003a67f60] [c0000000001f2b08] cpu_startup_entry+0x4c/0x50
[c000000003a67f90] [c00000000005ede0] start_secondary+0x284/0x288
[c000000003a67fe0] [c00000000000e058] start_secondary_prolog+0x10/0x14

----------------------------------------------------------------

perf stat -a:  ( idle states enabled)

TTWU_QUEUE_DELAYED:

         13,612,930      context-switches                 #    0.000 /sec
            912,737      cpu-migrations                   #    0.000 /sec
              1,245      page-faults                      #    0.000 /sec
    449,817,741,085      cycles
    137,051,199,092      instructions                     #    0.30  insn per cycle
     25,789,965,217      branches                         #    0.000 /sec
        286,202,628      branch-misses                    #    1.11% of all branches

NO_TTWU_QUEUE_DELAYED:

         24,782,786      context-switches                 #    0.000 /sec
          4,697,384      cpu-migrations                   #    0.000 /sec
              1,250      page-faults                      #    0.000 /sec
    701,934,506,023      cycles
    220,728,025,829      instructions                     #    0.31  insn per cycle
     40,271,327,989      branches                         #    0.000 /sec
        474,496,395      branch-misses                    #    1.18% of all branches

Both cycles and instructions are lower with the feature enabled.
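
Rough arithmetic on the above numbers:

	instructions per context switch:
	TTWU_QUEUE_DELAYED:     137,051,199,092 / 13,612,930  ~= 10.1k
	NO_TTWU_QUEUE_DELAYED:  220,728,025,829 / 24,782,786  ~=  8.9k

So the per-switch instruction cost is in the same ballpark; the feature mainly halves the
context switches and cuts migrations to roughly a fifth.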

-------------------------------------------------------------------

perf stat -a:  ( idle states disabled)

TTWU_QUEUE_DELAYED:
            
         15,402,193      context-switches                 #    0.000 /sec
          1,237,128      cpu-migrations                   #    0.000 /sec
              1,245      page-faults                      #    0.000 /sec
    781,215,992,865      cycles
    149,112,303,840      instructions                     #    0.19  insn per cycle
     28,240,010,182      branches                         #    0.000 /sec
        294,485,795      branch-misses                    #    1.04% of all branches

NO_TTWU_QUEUE_DELAYED:

         25,332,898      context-switches                 #    0.000 /sec
          4,756,682      cpu-migrations                   #    0.000 /sec
              1,256      page-faults                      #    0.000 /sec
    781,318,730,494      cycles
    220,536,732,094      instructions                     #    0.28  insn per cycle
     40,424,495,545      branches                         #    0.000 /sec
        446,724,952      branch-misses                    #    1.11% of all branches

Since idle states are disabled, cycles are always spent on the CPU, so cycles are more or less the same
while instructions differ. Does it mean that with the feature enabled, a lock (maybe rq) is held for too long?

--------------------------------------------------------------------

Will try to gather more insight into why this is happening.

Re: [PATCH v2 00/12] sched: Address schbench regression
Posted by Shrikanth Hegde 3 months ago

On 7/9/25 00:32, Peter Zijlstra wrote:
> On Mon, Jul 07, 2025 at 11:49:17PM +0530, Shrikanth Hegde wrote:
> 
>> Git bisect points to
>> # first bad commit: [dc968ba0544889883d0912360dd72d90f674c140] sched: Add ttwu_queue support for delayed tasks
> 
> Moo.. Are IPIs particularly expensive on your platform?
> 
> The 5 cores makes me think this is a partition of sorts, but IIRC the
> power LPAR stuff was fixed physical, so routing interrupts shouldn't be
> much more expensive vs native hardware.
> 

Yes, we call it a dedicated LPAR. (The hypervisor optimises such that overhead is minimal;
I think that is true for interrupts too.)


Some more variations of testing and numbers:

The system had some configs which I had messed up, such as CONFIG_SCHED_SMT=n. I copied the default
distro config back and ran the benchmark again. Slightly better numbers compared to earlier, but still
a major regression. Collected mpstat numbers; they show a much lower busy percentage than the
baseline.

--------------------------------------------------------------------------
base: 8784fb5fa2e0 (tip/master)

Wakeup Latencies percentiles (usec) runtime 30 (s) (41567569 total samples)
           50.0th: 11         (10767158 samples)
           90.0th: 22         (16782627 samples)
         * 99.0th: 36         (3347363 samples)
           99.9th: 52         (344977 samples)
           min=1, max=731
RPS percentiles (requests) runtime 30 (s) (31 total samples)
           20.0th: 1443840    (31 samples)
         * 50.0th: 1443840    (0 samples)
           90.0th: 1443840    (0 samples)
           min=1433480, max=1444037
average rps: 1442889.23

CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
all    3.24    0.00   11.39    0.00   37.30    0.00    0.00    0.00    0.00   48.07
all    2.59    0.00   11.56    0.00   37.62    0.00    0.00    0.00    0.00   48.23



base + clm's patch + series:
Wakeup Latencies percentiles (usec) runtime 30 (s) (27166787 total samples)
           50.0th: 57         (8242048 samples)
           90.0th: 120        (10677365 samples)
         * 99.0th: 182        (2435082 samples)
           99.9th: 262        (241664 samples)
           min=1, max=89984
RPS percentiles (requests) runtime 30 (s) (31 total samples)
           20.0th: 896000     (8 samples)
         * 50.0th: 902144     (10 samples)
           90.0th: 928768     (10 samples)
           min=881548, max=971101
average rps: 907530.10                                               <<< close to 40% drop in RPS.

CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
all    1.95    0.00    7.67    0.00   14.84    0.00    0.00    0.00    0.00   75.55
all    1.61    0.00    7.91    0.00   13.53    0.05    0.00    0.00    0.00   76.90

-----------------------------------------------------------------------------

- To be sure, I tried on another system. That system had 30 cores.

base:
Wakeup Latencies percentiles (usec) runtime 30 (s) (40339785 total samples)
           50.0th: 12         (12585268 samples)
           90.0th: 24         (15194626 samples)
         * 99.0th: 44         (3206872 samples)
           99.9th: 59         (320508 samples)
           min=1, max=1049
RPS percentiles (requests) runtime 30 (s) (31 total samples)
           20.0th: 1320960    (14 samples)
         * 50.0th: 1333248    (2 samples)
           90.0th: 1386496    (12 samples)
           min=1309615, max=1414281

base + clm's patch + series:
Wakeup Latencies percentiles (usec) runtime 30 (s) (34318584 total samples)
           50.0th: 23         (10486283 samples)
           90.0th: 64         (13436248 samples)
         * 99.0th: 122        (3039318 samples)
           99.9th: 166        (306231 samples)
           min=1, max=7255
RPS percentiles (requests) runtime 30 (s) (31 total samples)
           20.0th: 1006592    (8 samples)
         * 50.0th: 1239040    (9 samples)
           90.0th: 1259520    (11 samples)
           min=852462, max=1268841
average rps: 1144229.23                                             << close to a 10-15% drop in RPS


- Then I resized that 30-core LPAR into a 5-core LPAR to see if the issue pops up in a smaller
config. It did. I see a similar regression, a 40-50% drop in RPS.

- Then I made it a 6-core system, to see if this is due to any ping-pong because of odd numbers.
Numbers are similar to the 5-core case.

- Maybe the regression is higher in smaller configurations.

Re: [PATCH v2 00/12] sched: Address schbench regression
Posted by Peter Zijlstra 3 months ago
On Mon, Jul 07, 2025 at 02:35:38PM +0530, Shrikanth Hegde wrote:
> 
> 
> On 7/2/25 17:19, Peter Zijlstra wrote:
> > Hi!
> > 
> > Previous version:
> > 
> >    https://lkml.kernel.org/r/20250520094538.086709102@infradead.org
> > 
> > 
> > Changes:
> >   - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
> >   - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
> >   - fixed lockdep splat (dietmar)
> >   - added a few preperatory patches
> > 
> > 
> > Patches apply on top of tip/master (which includes the disabling of private futex)
> > and clm's newidle balance patch (which I'm awaiting vingu's ack on).
> > 
> > Performance is similar to the last version; as tested on my SPR on v6.15 base:
> > 
> 
> 
> Hi Peter,
> Gave this a spin on a machine with 5 cores (SMT8) PowerPC system.
> 
> I see significant regression in schbench. let me know if i have to test different
> number of threads based on the system size.
> Will go through the series and will try a bisect meanwhile.

Urgh, those are terrible numbers :/

What do the caches look like on that setup? Obviously all the 8 SMT
(is this the supercore that glues two SMT4 things together for backwards
compat?) share some cache, but is there some shared cache between the
cores?

Re: [PATCH v2 00/12] sched: Address schbench regression
Posted by Shrikanth Hegde 3 months ago

On 7/7/25 14:41, Peter Zijlstra wrote:
> On Mon, Jul 07, 2025 at 02:35:38PM +0530, Shrikanth Hegde wrote:
>>
>>
>> On 7/2/25 17:19, Peter Zijlstra wrote:
>>> Hi!
>>>
>>> Previous version:
>>>
>>>     https://lkml.kernel.org/r/20250520094538.086709102@infradead.org
>>>
>>>
>>> Changes:
>>>    - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
>>>    - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
>>>    - fixed lockdep splat (dietmar)
>>>    - added a few preperatory patches
>>>
>>>
>>> Patches apply on top of tip/master (which includes the disabling of private futex)
>>> and clm's newidle balance patch (which I'm awaiting vingu's ack on).
>>>
>>> Performance is similar to the last version; as tested on my SPR on v6.15 base:
>>>
>>
>>
>> Hi Peter,
>> Gave this a spin on a machine with 5 cores (SMT8) PowerPC system.
>>
>> I see significant regression in schbench. let me know if i have to test different
>> number of threads based on the system size.
>> Will go through the series and will try a bisect meanwhile.
> 
> Urgh, those are terrible numbers :/
> 
> What do the caches look like on that setup? Obviously all the 8 SMT
> (is this the supercore that glues two SMT4 things together for backwards
> compat?) share some cache, but is there some shared cache between the
> cores?

It is a supercore (we call it a bigcore) which glues two SMT4 cores. LLC
is per SMT4 core, so from the scheduler's perspective the system is 10 cores (SMT4).

Re: [PATCH v2 00/12] sched: Address schbench regression
Posted by Phil Auld 2 months, 3 weeks ago
Hi Peter,

On Mon, Jul 07, 2025 at 03:08:08PM +0530 Shrikanth Hegde wrote:
> 
> 
> On 7/7/25 14:41, Peter Zijlstra wrote:
> > On Mon, Jul 07, 2025 at 02:35:38PM +0530, Shrikanth Hegde wrote:
> > > 
> > > 
> > > On 7/2/25 17:19, Peter Zijlstra wrote:
> > > > Hi!
> > > > 
> > > > Previous version:
> > > > 
> > > >     https://lkml.kernel.org/r/20250520094538.086709102@infradead.org
> > > > 
> > > > 
> > > > Changes:
> > > >    - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
> > > >    - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
> > > >    - fixed lockdep splat (dietmar)
> > > >    - added a few preperatory patches
> > > > 
> > > > 
> > > > Patches apply on top of tip/master (which includes the disabling of private futex)
> > > > and clm's newidle balance patch (which I'm awaiting vingu's ack on).
> > > > 
> > > > Performance is similar to the last version; as tested on my SPR on v6.15 base:
> > > > 
> > > 
> > > 
> > > Hi Peter,
> > > Gave this a spin on a machine with 5 cores (SMT8) PowerPC system.
> > > 
> > > I see significant regression in schbench. let me know if i have to test different
> > > number of threads based on the system size.
> > > Will go through the series and will try a bisect meanwhile.
> > 
> > Urgh, those are terrible numbers :/
> > 
> > What do the caches look like on that setup? Obviously all the 8 SMT
> > (is this the supercore that glues two SMT4 things together for backwards
> > compat?) share some cache, but is there some shared cache between the
> > cores?
> 
> It is a supercore(we call it as bigcore) which glues two SMT4 cores. LLC is
> per SMT4 core. So from scheduler perspective system is 10 cores (SMT4)
> 

We've confirmed the issue with schbench on EPYC hardware. It's not limited
to PPC systems, although this system may also have interesting caching. 
We don't see issues with our other tests.

---------------

Here are the latency reports from schbench on a single-socket AMD EPYC
9655P server with 96 cores and 192 CPUs.

Results for this test:
./schbench/schbench -L -m 4 -t 192 -i 30 -r 30

6.15.0-rc6  baseline
threads  wakeup_99_usec  request_99_usec
1        5               3180
16       5               3996
64       3452            14256
128      7112            32960
192      11536           46016

6.15.0-rc6.pz_fixes2 (with the 12-patch series)
threads  wakeup_99_usec  request_99_usec
1        5               3172
16       5               3844
64       3348            17376
128      21024           100480
192      44224           176384

For 128 and 192 threads, Wakeup and Request latencies increased by a factor of
3-4x.

We're testing now with NO_TTWU_QUEUE_DELAYED and I'll try to report on
that when we have results. 

Cheers,
Phil
-- 

Re: [PATCH v2 00/12] sched: Address schbench regression
Posted by Phil Auld 2 months, 3 weeks ago
On Wed, Jul 16, 2025 at 09:46:40AM -0400 Phil Auld wrote:
> 
> Hi Peter,
> 
> On Mon, Jul 07, 2025 at 03:08:08PM +0530 Shrikanth Hegde wrote:
> > 
> > 
> > On 7/7/25 14:41, Peter Zijlstra wrote:
> > > On Mon, Jul 07, 2025 at 02:35:38PM +0530, Shrikanth Hegde wrote:
> > > > 
> > > > 
> > > > On 7/2/25 17:19, Peter Zijlstra wrote:
> > > > > Hi!
> > > > > 
> > > > > Previous version:
> > > > > 
> > > > >     https://lkml.kernel.org/r/20250520094538.086709102@infradead.org
> > > > > 
> > > > > 
> > > > > Changes:
> > > > >    - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
> > > > >    - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
> > > > >    - fixed lockdep splat (dietmar)
> > > > >    - added a few preperatory patches
> > > > > 
> > > > > 
> > > > > Patches apply on top of tip/master (which includes the disabling of private futex)
> > > > > and clm's newidle balance patch (which I'm awaiting vingu's ack on).
> > > > > 
> > > > > Performance is similar to the last version; as tested on my SPR on v6.15 base:
> > > > > 
> > > > 
> > > > 
> > > > Hi Peter,
> > > > Gave this a spin on a machine with 5 cores (SMT8) PowerPC system.
> > > > 
> > > > I see significant regression in schbench. let me know if i have to test different
> > > > number of threads based on the system size.
> > > > Will go through the series and will try a bisect meanwhile.
> > > 
> > > Urgh, those are terrible numbers :/
> > > 
> > > What do the caches look like on that setup? Obviously all the 8 SMT
> > > (is this the supercore that glues two SMT4 things together for backwards
> > > compat?) share some cache, but is there some shared cache between the
> > > cores?
> > 
> > It is a supercore(we call it as bigcore) which glues two SMT4 cores. LLC is
> > per SMT4 core. So from scheduler perspective system is 10 cores (SMT4)
> > 
> 
> We've confirmed the issue with schbench on EPYC hardware. It's not limited
> to PPC systems, although this system may also have interesting caching. 
> We don't see issues with our other tests.
> 
> ---------------
> 
> Here are the latency reports from schbench on a single-socket AMD EPYC
> 9655P server with 96 cores and 192 CPUs.
> 
> Results for this test:
> ./schbench/schbench -L -m 4 -t 192 -i 30 -r 30
> 
> 6.15.0-rc6  baseline
> threads  wakeup_99_usec  request_99_usec
> 1        5               3180
> 16       5               3996
> 64       3452            14256
> 128      7112            32960
> 192      11536           46016
> 
> 6.15.0-rc6.pz_fixes2 (with 12 part series))
> threads  wakeup_99_usec  request_99_usec
> 1        5               3172
> 16       5               3844
> 64       3348            17376
> 128      21024           100480
> 192      44224           176384
> 
> For 128 and 192 threads, Wakeup and Request latencies increased by a factor of
> 3x.
> 
> We're testing now with NO_TTWU_QUEUE_DELAYED and I'll try to report on
> that when we have results. 
>

To follow up on this: With NO_TTWU_QUEUE_DELAYED the above latency issues
with schbench go away.

In addition, the randwrite regression we were having with delayed tasks
remains resolved.  And the assorted small gains here and there are still
present. 

Overall with NO_TTWU_QUEUE_DELAYED this series is helpful. We'd probably
make that the default if it got merged as is.  But maybe there is no
need for that part of the code.  


Thanks,
Phil

> Cheers,
> Phil
> -- 
> 
> 

-- 

Re: [PATCH v2 00/12] sched: Address schbench regression
Posted by Beata Michalska 2 months, 3 weeks ago
Hi Peter,

Below are the results of running schbench on Altra
(as a reminder: 2-core MC, 2 NUMA nodes, 160 cores).

Legend:
- 'Flags=none' means neither TTWU_QUEUE_DEFAULT nor
  TTWU_QUEUE_DELAYED is set (or available).
- '*…*' marks Top-3 Min & Max, Bottom-3 Std dev, and
  Top-3 90th-percentile values.

Base 6.16-rc5
  Flags=none
  Min=681870.77 | Max=913649.50 | Std=53802.90       | 90th=890201.05

sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails
  Flags=none
  Min=770952.12 | Max=888047.45 | Std=34430.24       | 90th=877347.24

sched/psi: Optimize psi_group_change() cpu_clock() usage
  Flags=none
  Min=748137.65 | Max=936312.33 | Std=56818.23       | 90th=*921497.27*

sched/deadline: Less agressive dl_server handling
  Flags=none
  Min=783621.95 | Max=*944604.67* | Std=43538.64     | 90th=*909961.16*

sched: Optimize ttwu() / select_task_rq()
  Flags=none
  Min=*826038.87* | Max=*1003496.73* | Std=49875.43  | 90th=*971944.88*

sched: Use lock guard in ttwu_runnable()
  Flags=none
  Min=780172.75 | Max=914170.20 | Std=35998.33       | 90th=866095.80

sched: Add ttwu_queue controls
  Flags=TTWU_QUEUE_DEFAULT
  Min=*792430.45* | Max=903422.78 | Std=33582.71     | 90th=887256.68

  Flags=none
  Min=*803532.80* | Max=894772.48 | Std=29359.35     | 90th=877920.34

sched: Introduce ttwu_do_migrate()
  Flags=TTWU_QUEUE_DEFAULT
  Min=749824.30 | Max=*965139.77* | Std=57022.47     | 90th=903659.07
 
  Flags=none
  Min=787464.65 | Max=885349.20 | Std=27030.82       | 90th=875750.44

psi: Split psi_ttwu_dequeue()
  Flags=TTWU_QUEUE_DEFAULT
  Min=762960.98 | Max=916538.12 | Std=42002.19       | 90th=876425.84
 
  Flags=none
  Min=773608.48 | Max=920812.87 | Std=42189.17       | 90th=871760.47

sched: Re-arrange __ttwu_queue_wakelist()
  Flags=TTWU_QUEUE_DEFAULT
  Min=702870.58 | Max=835243.42 | Std=44224.02       | 90th=825311.12

  Flags=none
  Min=712499.38 | Max=838492.03 | Std=38351.20       | 90th=817135.94

sched: Use lock guard in sched_ttwu_pending()
  Flags=TTWU_QUEUE_DEFAULT
  Min=729080.55 | Max=853609.62 | Std=43440.63       | 90th=838684.48

  Flags=none
  Min=708123.47 | Max=850804.48 | Std=40642.28       | 90th=830295.08

sched: Change ttwu_runnable() vs sched_delayed
  Flags=TTWU_QUEUE_DEFAULT
  Min=580218.87 | Max=838684.07 | Std=57078.24       | 90th=792973.33

  Flags=none
  Min=721274.90 | Max=784897.92 | Std=*19017.78*     | 90th=774792.30

sched: Add ttwu_queue support for delayed tasks
  Flags=none
  Min=712979.48 | Max=830192.10 | Std=33173.90       | 90th=798599.66

  Flags=TTWU_QUEUE_DEFAULT
  Min=698094.12 | Max=857627.93 | Std=38294.94       | 90th=789981.59
 
  Flags=TTWU_QUEUE_DEFAULT/TTWU_QUEUE_DELAYED
  Min=683348.77 | Max=782179.15 | Std=25086.71       | 90th=750947.00

  Flags=TTWU_QUEUE_DELAYED
  Min=669822.23 | Max=807768.85 | Std=38766.41       | 90th=794052.05

sched: fix ttwu_delayed
  Flags=none
  Min=671844.35 | Max=798737.67 | Std=33438.64       | 90th=788584.62

  Flags=TTWU_QUEUE_DEFAULT
  Min=688607.40 | Max=828679.53 | Std=33184.78       | 90th=782490.23

  Flags=TTWU_QUEUE_DEFAULT/TTWU_QUEUE_DELAYED
  Min=579171.13 | Max=643929.18 | Std=*14644.92*     | 90th=639764.16

  Flags=TTWU_QUEUE_DELAYED
  Min=614265.22 | Max=675172.05 | Std=*13309.92*     | 90th=647181.10


Best overall performer:
sched: Optimize ttwu() / select_task_rq()
  Flags=none
  Min=*826038.87* | Max=*1003496.73* | Std=49875.43 | 90th=*971944.88*

Hope this will be somewhat helpful.

---
BR
Beata

On Wed, Jul 02, 2025 at 01:49:24PM +0200, Peter Zijlstra wrote:
> Hi!
> 
> Previous version:
> 
>   https://lkml.kernel.org/r/20250520094538.086709102@infradead.org
> 
> 
> Changes:
>  - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
>  - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
>  - fixed lockdep splat (dietmar)
>  - added a few preperatory patches
> 
> 
> Patches apply on top of tip/master (which includes the disabling of private futex)
> and clm's newidle balance patch (which I'm awaiting vingu's ack on).
> 
> Performance is similar to the last version; as tested on my SPR on v6.15 base:
> 
> v6.15:
> schbench-6.15.0-1.txt:average rps: 2891403.72
> schbench-6.15.0-2.txt:average rps: 2889997.02
> schbench-6.15.0-3.txt:average rps: 2894745.17
> 
> v6.15 + patches 1-10:
> schbench-6.15.0-dirty-4.txt:average rps: 3038265.95
> schbench-6.15.0-dirty-5.txt:average rps: 3037327.50
> schbench-6.15.0-dirty-6.txt:average rps: 3038160.15
> 
> v6.15 + all patches:
> schbench-6.15.0-dirty-deferred-1.txt:average rps: 3043404.30
> schbench-6.15.0-dirty-deferred-2.txt:average rps: 3046124.17
> schbench-6.15.0-dirty-deferred-3.txt:average rps: 3043627.10
> 
> 
> Patches can also be had here:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
> 
> 
> I'm hoping we can get this merged for next cycle so we can all move on from this.
> 
> 
Re: [PATCH v2 00/12] sched: Address schbench regression
Posted by Beata Michalska 2 months, 3 weeks ago
On Thu, Jul 17, 2025 at 03:04:55PM +0200, Beata Michalska wrote:
> Hi Peter,
> 
> Below are the results of running the schbench on Altra
> (as a reminder 2-core MC, 2 Numa Nodes, 160 cores)
> 
> `Legend:
> - 'Flags=none' means neither TTWU_QUEUE_DEFAULT nor
>   TTWU_QUEUE_DELAYED is set (or available).
> - '*…*' marks Top-3 Min & Max, Bottom-3 Std dev, and
>   Top-3 90th-percentile values.
> 
> Base 6.16-rc5
>   Flags=none
>   Min=681870.77 | Max=913649.50 | Std=53802.90       | 90th=890201.05
> 
> sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails
>   Flags=none
>   Min=770952.12 | Max=888047.45 | Std=34430.24       | 90th=877347.24
> 
> sched/psi: Optimize psi_group_change() cpu_clock() usage
>   Flags=none
>   Min=748137.65 | Max=936312.33 | Std=56818.23       | 90th=*921497.27*
> 
> sched/deadline: Less agressive dl_server handling
>   Flags=none
>   Min=783621.95 | Max=*944604.67* | Std=43538.64     | 90th=*909961.16*
> 
> sched: Optimize ttwu() / select_task_rq()
>   Flags=none
>   Min=*826038.87* | Max=*1003496.73* | Std=49875.43  | 90th=*971944.88*
> 
> sched: Use lock guard in ttwu_runnable()
>   Flags=none
>   Min=780172.75 | Max=914170.20 | Std=35998.33       | 90th=866095.80
> 
> sched: Add ttwu_queue controls
>   Flags=TTWU_QUEUE_DEFAULT
>   Min=*792430.45* | Max=903422.78 | Std=33582.71     | 90th=887256.68
> 
>   Flags=none
>   Min=*803532.80* | Max=894772.48 | Std=29359.35     | 90th=877920.34
> 
> sched: Introduce ttwu_do_migrate()
>   Flags=TTWU_QUEUE_DEFAULT
>   Min=749824.30 | Max=*965139.77* | Std=57022.47     | 90th=903659.07
>  
>   Flags=none
>   Min=787464.65 | Max=885349.20 | Std=27030.82       | 90th=875750.44
> 
> psi: Split psi_ttwu_dequeue()
>   Flags=TTWU_QUEUE_DEFAULT
>   Min=762960.98 | Max=916538.12 | Std=42002.19       | 90th=876425.84
>  
>   Flags=none
>   Min=773608.48 | Max=920812.87 | Std=42189.17       | 90th=871760.47
> 
> sched: Re-arrange __ttwu_queue_wakelist()
>   Flags=TTWU_QUEUE_DEFAULT
>   Min=702870.58 | Max=835243.42 | Std=44224.02       | 90th=825311.12
> 
>   Flags=none
>   Min=712499.38 | Max=838492.03 | Std=38351.20       | 90th=817135.94
> 
> sched: Use lock guard in sched_ttwu_pending()
>   Flags=TTWU_QUEUE_DEFAULT
>   Min=729080.55 | Max=853609.62 | Std=43440.63       | 90th=838684.48
> 
>   Flags=none
>   Min=708123.47 | Max=850804.48 | Std=40642.28       | 90th=830295.08
> 
> sched: Change ttwu_runnable() vs sched_delayed
>   Flags=TTWU_QUEUE_DEFAULT
>   Min=580218.87 | Max=838684.07 | Std=57078.24       | 90th=792973.33
> 
>   Flags=none
>   Min=721274.90 | Max=784897.92 | Std=*19017.78*     | 90th=774792.30
> 
> sched: Add ttwu_queue support for delayed tasks
>   Flags=none
>   Min=712979.48 | Max=830192.10 | Std=33173.90       | 90th=798599.66
> 
>   Flags=TTWU_QUEUE_DEFAULT
>   Min=698094.12 | Max=857627.93 | Std=38294.94       | 90th=789981.59
>  
>   Flags=TTWU_QUEUE_DEFAULT/TTWU_QUEUE_DELAYED
>   Min=683348.77 | Max=782179.15 | Std=25086.71       | 90th=750947.00
> 
>   Flags=TTWU_QUEUE_DELAYED
>   Min=669822.23 | Max=807768.85 | Std=38766.41       | 90th=794052.05
> 
> sched: fix ttwu_delayed
This one is actually:
sched: Add ttwu_queue support for delayed tasks
+
https://lore.kernel.org/all/0672c7df-543c-4f3e-829a-46969fad6b34@amd.com/

Apologies for that.

---
BR
Beata
>   Flags=none
>   Min=671844.35 | Max=798737.67 | Std=33438.64       | 90th=788584.62
> 
>   Flags=TTWU_QUEUE_DEFAULT
>   Min=688607.40 | Max=828679.53 | Std=33184.78       | 90th=782490.23
> 
>   Flags=TTWU_QUEUE_DEFAULT/TTWU_QUEUE_DELAYED
>   Min=579171.13 | Max=643929.18 | Std=*14644.92*     | 90th=639764.16
> 
>   Flags=TTWU_QUEUE_DELAYED
>   Min=614265.22 | Max=675172.05 | Std=*13309.92*     | 90th=647181.10
> 
> 
> Best overall performer:
> sched: Optimize ttwu() / select_task_rq()
>   Flags=none
>   Min=*826038.87* | Max=*1003496.73* | Std=49875.43 | 90th=*971944.88*
> 
> Hope this will he somehwat helpful.
> 
> ---
> BR
> Beata
> 
> On Wed, Jul 02, 2025 at 01:49:24PM +0200, Peter Zijlstra wrote:
> > Hi!
> > 
> > Previous version:
> > 
> >   https://lkml.kernel.org/r/20250520094538.086709102@infradead.org
> > 
> > 
> > Changes:
> >  - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
> >  - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
> >  - fixed lockdep splat (dietmar)
> >  - added a few preperatory patches
> > 
> > 
> > Patches apply on top of tip/master (which includes the disabling of private futex)
> > and clm's newidle balance patch (which I'm awaiting vingu's ack on).
> > 
> > Performance is similar to the last version; as tested on my SPR on v6.15 base:
> > 
> > v6.15:
> > schbench-6.15.0-1.txt:average rps: 2891403.72
> > schbench-6.15.0-2.txt:average rps: 2889997.02
> > schbench-6.15.0-3.txt:average rps: 2894745.17
> > 
> > v6.15 + patches 1-10:
> > schbench-6.15.0-dirty-4.txt:average rps: 3038265.95
> > schbench-6.15.0-dirty-5.txt:average rps: 3037327.50
> > schbench-6.15.0-dirty-6.txt:average rps: 3038160.15
> > 
> > v6.15 + all patches:
> > schbench-6.15.0-dirty-deferred-1.txt:average rps: 3043404.30
> > schbench-6.15.0-dirty-deferred-2.txt:average rps: 3046124.17
> > schbench-6.15.0-dirty-deferred-3.txt:average rps: 3043627.10
> > 
> > 
> > Patches can also be had here:
> > 
> >   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
> > 
> > 
> > I'm hoping we can get this merged for next cycle so we can all move on from this.
> > 
> > 
> 
Re: [PATCH v2 00/12] sched: Address schbench regression
Posted by Chris Mason 3 months, 1 week ago
On 7/2/25 7:49 AM, Peter Zijlstra wrote:
> Hi!
> 
> Previous version:
> 
>   https://lkml.kernel.org/r/20250520094538.086709102@infradead.org 
> 
> 
> Changes:
>  - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
>  - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
>  - fixed lockdep splat (dietmar)
>  - added a few preperatory patches
> 
> 
> Patches apply on top of tip/master (which includes the disabling of private futex)
> and clm's newidle balance patch (which I'm awaiting vingu's ack on).
> 
> Performance is similar to the last version; as tested on my SPR on v6.15 base:

Thanks for working on these! I'm on vacation until July 14th, but I'll
give them a shot when I'm back in the office.

-chris