watchdog: Better handling of concurrent lockups

[PATCH 0/4] watchdog: Better handling of concurrent lockups

Posted by Douglas Anderson 2 years, 1 month ago

When we get multiple lockups at roughly the same time, the output in
the kernel logs can be very confusing since the reports about the
lockups end up interleaved in the logs. There is some code in the
kernel to try to handle this but it wasn't that complete.

Li Zhe recently made this a bit better for softlockups (specifically
for the case where `kernel.softlockup_all_cpu_backtrace` is not set)
in commit 9d02330abd3e ("softlockup: serialized softlockup's log"),
but that only handled softlockup reports. Hardlockup reports still had
similar issues.

This series also has a small fix to avoid dumping all stacks a second
time in the case of a panic. This is a bit unrelated to the
interleaving fixes but it does also improve the clarity of lockup
reports.


Douglas Anderson (4):
  watchdog/hardlockup: Adopt softlockup logic avoiding double-dumps
  watchdog/softlockup: Use printk_cpu_sync_get_irqsave() to serialize
    reporting
  watchdog/hardlockup: Use printk_cpu_sync_get_irqsave() to serialize
    reporting
  watchdog: If panicking and we dumped everything, don't re-enable
    dumping

 kernel/watchdog.c | 43 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 32 insertions(+), 11 deletions(-)

-- 
2.43.0.472.g3155946c3a-goog

Re: [PATCH 0/4] watchdog: Better handling of concurrent lockups

Posted by Petr Mladek 2 years ago

Hi,

On Wed 2023-12-20 13:15:33, Douglas Anderson wrote:
> 
> When we get multiple lockups at roughly the same time, the output in
> the kernel logs can be very confusing since the reports about the
> lockups end up interleaved in the logs. There is some code in the
> kernel to try to handle this but it wasn't that complete.
> 
> Li Zhe recently made this a bit better for softlockups (specifically
> for the case where `kernel.softlockup_all_cpu_backtrace` is not set)
> in commit 9d02330abd3e ("softlockup: serialized softlockup's log"),
> but that only handled softlockup reports. Hardlockup reports still had
> similar issues.
> 
> This series also has a small fix to avoid dumping all stacks a second
> time in the case of a panic. This is a bit unrelated to the
> interleaving fixes but it does also improve the clarity of lockup
> reports.

Just for record. This patchset has finally got on top of my queue
(after Christmas and a sick leave). And it looks good from my POV.

I was slightly afraid to use printk_cpu_sync_get_irqsave() on more
locations. It has to be used with care to avoid deadlock.

But the patchset looks good. It takes the lock only around code
proceed on the same CPU. And it always releases the lock before
triggering backtrace on another CPU.

Idea:

I have just got an idea how to make printk_cpu_sync_get_irqsave()
less error prone for deadlock on the panic() CPU. The idea is
to ignore the lock or give up locking after a timeout on
the panic CPU.

AFAIK, the lock is currently used only to serialize related
printk() calls. The only risk is that some messages might be
interleaved when it is ignored.

I am not sure if this is a good idea though. It might create
another risk when the lock gets used to serialize more
things in the future and a race might create a real problem.

Best Regards,
Petr

Re: [PATCH 0/4] watchdog: Better handling of concurrent lockups

Posted by John Ogness 2 years ago

On 2024-02-06, Petr Mladek <pmladek@suse.com> wrote:
> I have just got an idea how to make printk_cpu_sync_get_irqsave()
> less error prone for deadlock on the panic() CPU. The idea is
> to ignore the lock or give up locking after a timeout on
> the panic CPU.

This idea is out of scope for this series. But it is something we should
think about. The issue has always been a possible problem in panic().

> AFAIK, the lock is currently used only to serialize related
> printk() calls. The only risk is that some messages might be
> interleaved when it is ignored.
>
> I am not sure if this is a good idea though. It might create
> another risk when the lock gets used to serialize more
> things in the future and a race might create a real problem.

With the printk series we are currently working on [0], only the panic
CPU can store new printk messages anyway. So there would be nothing to
synchronize against (and it could be safely ignored).

kgdb uses the same technique to quiesce the CPUs. It does not use the
printk_cpu_sync for this, but it is an example of a possible future
usage not related to printk.

My vote is to make it a NOP for the panic CPU and then keep an eye on
any future uses. Should I add this to v4 of [0]?

John

[0] https://lore.kernel.org/lkml/20231214214201.499426-1-john.ogness@linutronix.de

Re: [PATCH 0/4] watchdog: Better handling of concurrent lockups

Posted by Petr Mladek 2 years ago

On Tue 2024-02-06 11:51:50, John Ogness wrote:
> On 2024-02-06, Petr Mladek <pmladek@suse.com> wrote:
> > I have just got an idea how to make printk_cpu_sync_get_irqsave()
> > less error prone for deadlock on the panic() CPU. The idea is
> > to ignore the lock or give up locking after a timeout on
> > the panic CPU.
> 
> This idea is out of scope for this series. But it is something we should
> think about. The issue has always been a possible problem in panic().
> 
> > AFAIK, the lock is currently used only to serialize related
> > printk() calls. The only risk is that some messages might be
> > interleaved when it is ignored.
> >
> > I am not sure if this is a good idea though. It might create
> > another risk when the lock gets used to serialize more
> > things in the future and a race might create a real problem.
> 
> With the printk series we are currently working on [0], only the panic
> CPU can store new printk messages anyway. So there would be nothing to
> synchronize against (and it could be safely ignored).

Right.

> kgdb uses the same technique to quiesce the CPUs. It does not use the
> printk_cpu_sync for this, but it is an example of a possible future
> usage not related to printk.
> 
> My vote is to make it a NOP for the panic CPU and then keep an eye on
> any future uses.
Sounds good.

> Should I add this to v4 of [0]?

Let's not complicate this series any more. It is almost ready ;-)
We could do it by a separate patch in top of it or in another
patchset.

> 
> [0] https://lore.kernel.org/lkml/20231214214201.499426-1-john.ogness@linutronix.de

Best Regards,
Petr

Re: [PATCH 0/4] watchdog: Better handling of concurrent lockups

Posted by Doug Anderson 2 years ago

Hi,

On Tue, Feb 6, 2024 at 2:46 AM John Ogness <john.ogness@linutronix.de> wrote:
>
> On 2024-02-06, Petr Mladek <pmladek@suse.com> wrote:
> > I have just got an idea how to make printk_cpu_sync_get_irqsave()
> > less error prone for deadlock on the panic() CPU. The idea is
> > to ignore the lock or give up locking after a timeout on
> > the panic CPU.
>
> This idea is out of scope for this series. But it is something we should
> think about. The issue has always been a possible problem in panic().

One thing to be at least a little cognizant of is how this interacts
with the 10 second timeout in nmi_trigger_cpumask_backtrace(), which
we can hit twice in some of the lockup reports since we first trace
the locked CPU and then the rest. Ideally we don't hit that timeout
lots, except that on arm64 if you don't have pseudo-NMI turned on then
it's actually pretty easy to hit the timeout when you've got a
hard-locked CPU. Probably that 10 second timeout should be
shortened...

-Doug