Here are a few patches to fix a lockup caused by very slow progress due
to a scalability problem in workqueue watchdog touch being hammered by
thousands of CPUs in multi_cpu_stop. Patch 2 is the fix.

I did notice when making a microbenchmark reproducer that the RCU call
was actually also causing slowdowns. Not nearly so bad as the workqueue
touch, but workqueue queueing of dummy jobs slowed down by a factor of
several times when lots of other CPUs were making
rcu_momentary_dyntick_idle() calls. So I did the stop_machine patches to
reduce that. So those patches 3,4 are independent of the first two and
can go in any order.

Thanks,
Nick

Nicholas Piggin (4):
  workqueue: wq_watchdog_touch is always called with valid CPU
  workqueue: Improve scalability of workqueue watchdog touch
  stop_machine: Rearrange multi_cpu_stop state machine loop
  stop_machine: Add a delay between multi_cpu_stop touching watchdogs

 kernel/stop_machine.c | 31 +++++++++++++++++++++++--------
 kernel/workqueue.c    | 12 ++++++++++--
 2 files changed, 33 insertions(+), 10 deletions(-)

--
2.45.1
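To make the cacheline contention concrete, here is a minimal sketch of
the rate-limiting idea in patch 2 (a simplified sketch, not the actual
diff; wq_watchdog_touched, wq_watchdog_touched_cpu and wq_watchdog_thresh
are the existing workqueue.c symbols, and the thresh/4 store interval is
the one discussed later in this thread):

void wq_watchdog_touch(int cpu)
{
	unsigned long thresh = READ_ONCE(wq_watchdog_thresh) * HZ;
	unsigned long touch_ts = READ_ONCE(wq_watchdog_touched);
	unsigned long now = jiffies;

	/* The per-CPU stamp is private to this CPU: cheap to store every time. */
	per_cpu(wq_watchdog_touched_cpu, cpu) = now;

	/*
	 * Only store to the shared stamp when it is at least thresh/4
	 * stale; thousands of concurrent touchers then mostly take a
	 * shared (read-only) cache hit instead of bouncing the
	 * cacheline with a store on every call.
	 */
	if (time_after(now, touch_ts + thresh / 4))
		WRITE_ONCE(wq_watchdog_touched, now);
}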
Hello Nicholas.
On Tue, Jun 25, 2024 at 09:42:43PM GMT, Nicholas Piggin <npiggin@gmail.com> wrote:
> Here are a few patches to fix a lockup caused by very slow progress due
> to a scalability problem in workqueue watchdog touch being hammered by
> thousands of CPUs in multi_cpu_stop. Patch 2 is the fix.
Is this something you noticed with stop_machine alone or in some broader
context? I see you mention CPU hotplug in patch 2. Was it a single
CPU or SMT offlining?
Good job tracking it down to everyone touching the same cacheline.
I had a suspicion [1] back in the day that cpuhp_smt_disable() would
scale in O(nr_cpus^2) but I didn't dedicate time to verifying that. Has
something similar popped up in your examination?
Also, I think your change in patch 2 effectively reduces
wq_watchdog_thresh to 3/4 of the configured value. (Not sure if the
default should be scaled accordingly.)
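To spell that out (assuming the detector still checks
time_after(jiffies, touched + thresh)): with stores suppressed while the
stamp is less than thresh/4 old, the recorded stamp can lag the last
real touch by up to thresh/4, so in the worst case:

  t0                   stamp stored
  t0 + thresh/4 - eps  last real touch (store suppressed, stamp == t0)
                       stall begins
  t0 + thresh          detector fires

which is only thresh - thresh/4 = 3/4 * thresh of actual stall time.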
Thanks,
Michal
[1]
cpuhp_smt_disable
  mutex_lock(&cpu_add_remove_lock);        // cpu_maps_update_begin()
  for_each_online_cpu                      // <-- nr_cpus
    takedown_cpu(cpu)
      stop_machine_cpuslocked              // cpu_hotplug_lock
        active_cpus = cpumask_of(cpu)
        for_each_cpu(cpu, cpu_online_mask) // <-- nr_cpus
          cpu_stop_queue_work(cpu, work)   // this is serial
        multi_cpu_stop                     // this runs nr_cpus in parallel
          take_cpu_down                    // runs on downed cpu only
          // other cpus are spinning in multi_cpu_stop() until take_cpu_down
          // above is finished on downed cpu, with interrupts disabled
          do { cpu_relax() } while (curstate != MULTI_STOP_EXIT)
  lockup_detector_cleanup()                // not a touch
  mutex_unlock(&cpu_add_remove_lock);
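For reference, the loop the spinning CPUs sit in looks roughly like this
(condensed from multi_cpu_stop() in kernel/stop_machine.c, details
elided); before patches 3-4, every iteration touched the NMI watchdog
(which reaches wq_watchdog_touch() via touch_softlockup_watchdog()) and
entered a momentary RCU quiescent state, which is the traffic the cover
letter measures:

do {
	stop_machine_yield(cpumask);	/* cpu_relax() flavour */
	newstate = READ_ONCE(msdata->state);
	if (newstate != curstate) {
		curstate = newstate;
		/* perform this CPU's step of the state machine */
		ack_state(msdata);
	} else if (curstate > MULTI_STOP_PREPARE) {
		touch_nmi_watchdog();	/* ends up in wq_watchdog_touch() */
		rcu_momentary_dyntick_idle();
	}
} while (curstate != MULTI_STOP_EXIT);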
On Tue, Jun 25, 2024 at 09:42:43PM +1000, Nicholas Piggin wrote:
> Here are a few patches to fix a lockup caused by very slow progress due
> to a scalability problem in workqueue watchdog touch being hammered by
> thousands of CPUs in multi_cpu_stop. Patch 2 is the fix.
>
> I did notice when making a microbenchmark reproducer that the RCU call
> was actually also causing slowdowns. Not nearly so bad as the workqueue
> touch, but workqueue queueing of dummy jobs slowed down by a factor of
> several times when lots of other CPUs were making
> rcu_momentary_dyntick_idle() calls. So I did the stop_machine patches to
> reduce that. So those patches 3,4 are independent of the first two and
> can go in any order.

For the series:

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

> Thanks,
> Nick
>
> Nicholas Piggin (4):
>   workqueue: wq_watchdog_touch is always called with valid CPU
>   workqueue: Improve scalability of workqueue watchdog touch
>   stop_machine: Rearrange multi_cpu_stop state machine loop
>   stop_machine: Add a delay between multi_cpu_stop touching watchdogs
>
>  kernel/stop_machine.c | 31 +++++++++++++++++++++++--------
>  kernel/workqueue.c    | 12 ++++++++++--
>  2 files changed, 33 insertions(+), 10 deletions(-)
>
> --
> 2.45.1
>
On Wed Jun 26, 2024 at 12:53 AM AEST, Paul E. McKenney wrote:
> On Tue, Jun 25, 2024 at 09:42:43PM +1000, Nicholas Piggin wrote:
> > Here are a few patches to fix a lockup caused by very slow progress due
> > to a scalability problem in workqueue watchdog touch being hammered by
> > thousands of CPUs in multi_cpu_stop. Patch 2 is the fix.
> >
> > I did notice when making a microbenchmark reproducer that the RCU call
> > was actually also causing slowdowns. Not nearly so bad as the workqueue
> > touch, but workqueue queueing of dummy jobs slowed down by a factor of
> > several times when lots of other CPUs were making
> > rcu_momentary_dyntick_idle() calls. So I did the stop_machine patches to
> > reduce that. So those patches 3,4 are independent of the first two and
> > can go in any order.
>
> For the series:
>
> Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

Oh, it did get a comment :) Thanks Paul. Not sure who owns the
multi_cpu_stop loop, Tejun and Peter I guess but that was 10+
years ago :P

I might ask Andrew if he would take patches 3-4, if there are
no objections.

Thanks,
Nick
* Nicholas Piggin <npiggin@gmail.com> [2024-06-26 10:57:36]:
> On Wed Jun 26, 2024 at 12:53 AM AEST, Paul E. McKenney wrote:
> > On Tue, Jun 25, 2024 at 09:42:43PM +1000, Nicholas Piggin wrote:
> > > Here are a few patches to fix a lockup caused by very slow progress due
> > > to a scalability problem in workqueue watchdog touch being hammered by
> > > thousands of CPUs in multi_cpu_stop. Patch 2 is the fix.
> > >
> > > I did notice when making a microbenchmark reproducer that the RCU call
> > > was actually also causing slowdowns. Not nearly so bad as the workqueue
> > > touch, but workqueue queueing of dummy jobs slowed down by a factor of
> > > several times when lots of other CPUs were making
> > > rcu_momentary_dyntick_idle() calls. So I did the stop_machine patches to
> > > reduce that. So those patches 3,4 are independent of the first two and
> > > can go in any order.
> >
> > For the series:
> >
> > Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
>
> Oh, it did get a comment :) Thanks Paul. Not sure who owns the
> multi_cpu_stop loop, Tejun and Peter I guess but that was 10+
> years ago :P
>
> I might ask Andrew if he would take patches 3-4, if there are
> no objections.
>
Patches 3 and 4 are still not part of any tree.
Can we please include them, or are there any reservations about them?
The patches still seem to apply on top of Linus' tree, except for one
line where rcu_momentary_dyntick_idle() has been renamed to
rcu_momentary_eqs() by commit 32a9f26e5e26 ("rcu: Rename
rcu_momentary_dyntick_idle() into rcu_momentary_eqs()").
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/patch/?id=32a9f26e5e26
--
Thanks and Regards
Srikar Dronamraju