Both nmi_backtrace and dump_stack_lvl call printk_cpu_sync_get_irqsave.
However, dump_stack_lvl will call into the printk code while holding
the printk_cpu_sync_get lock, and then take the console lock.
Another CPU, after being stuck printing something to the serial
console for too long with the console lock held, may end up getting
an NMI stack trace printed.
This results in the following lock order:
CPU A: printk_cpu_sync_get lock -> console_lock
CPU B: console_lock -> (nmi) printk_cpu_sync_get lock
This will cause the system to hang with an ABBA deadlock, which
lockdep cannot detect (because the printk_cpu_sync_get "lock" is
special). It also leaves the kernel panic code unable to get
CPU B (stuck in the NMI handler) to even dump its own register
state, so the kdump does not contain everything needed to get a
clean backtrace of what that CPU is doing.
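For reference, the dump_stack_lvl() side of this looks roughly as
follows (a simplified sketch of lib/dump_stack.c; the real function
differs in detail):
/* Simplified sketch of dump_stack_lvl() in lib/dump_stack.c. */
asmlinkage __visible void dump_stack_lvl(const char *log_lvl)
{
	unsigned long flags;

	/* Serialize backtraces from multiple CPUs: take the cpu-sync "lock"... */
	printk_cpu_sync_get_irqsave(flags);
	/* ...then printk() the backtrace, which can take the console lock. */
	__dump_stack(log_lvl);
	printk_cpu_sync_put_irqrestore(flags);
}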
Omar and I poked around the kernel stacks of a few crash dumps to
figure out what was going on before we found this deadlock.
I am not entirely happy about this solution, since it can result
in NMI and non-NMI backtraces getting printed out in an interleaved
fashion, but I cannot think of a better way to avoid this deadlock.
I'm open to suggestions :)
Signed-off-by: Rik van Riel <riel@surriel.com>
Cc: Omar Sandoval <osandov@meta.com>
Fixes: 55d6af1d6688 ("lib/nmi_backtrace: explicitly serialize banner and regs")
---
include/linux/printk.h | 28 +++++++++++++++++-----------
kernel/printk/printk.c | 41 ++++++++++++++++++++++++++++++++---------
lib/nmi_backtrace.c | 4 ++--
3 files changed, 51 insertions(+), 22 deletions(-)
diff --git a/include/linux/printk.h b/include/linux/printk.h
index 65c5184470f1..37b10f43bb58 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -280,15 +280,15 @@ static inline void console_replay_all(void)
bool this_cpu_in_panic(void);
#ifdef CONFIG_SMP
-extern int __printk_cpu_sync_try_get(void);
-extern void __printk_cpu_sync_wait(void);
-extern void __printk_cpu_sync_put(void);
+extern int __printk_cpu_sync_try_get(bool nmi);
+extern void __printk_cpu_sync_wait(bool nmi);
+extern void __printk_cpu_sync_put(bool nmi);
#else
-#define __printk_cpu_sync_try_get() true
-#define __printk_cpu_sync_wait()
-#define __printk_cpu_sync_put()
+#define __printk_cpu_sync_try_get(_unused) true
+#define __printk_cpu_sync_wait(_unused)
+#define __printk_cpu_sync_put(_unused)
#endif /* CONFIG_SMP */
/**
@@ -312,28 +312,34 @@ extern void __printk_cpu_sync_put(void);
* CPUs after calling this function from any context. This includes
* using spinlocks or any other busy-waiting synchronization methods.
*/
-#define printk_cpu_sync_get_irqsave(flags) \
+#define do_printk_cpu_sync_get(flags, nmi) \
for (;;) { \
local_irq_save(flags); \
- if (__printk_cpu_sync_try_get()) \
+ if (__printk_cpu_sync_try_get(nmi)) \
break; \
local_irq_restore(flags); \
- __printk_cpu_sync_wait(); \
+ __printk_cpu_sync_wait(nmi); \
}
+#define printk_cpu_sync_get_irqsave(flags) do_printk_cpu_sync_get(flags, 0)
+#define printk_cpu_sync_get_from_nmi(flags) do_printk_cpu_sync_get(flags, 1)
+
/**
* printk_cpu_sync_put_irqrestore() - Release the printk cpu-reentrant spinning
* lock and restore interrupts.
* @flags: Caller's saved interrupt state, from printk_cpu_sync_get_irqsave().
*/
-#define printk_cpu_sync_put_irqrestore(flags) \
+#define do_printk_cpu_sync_put(flags, nmi) \
do { \
- __printk_cpu_sync_put(); \
+ __printk_cpu_sync_put(nmi); \
local_irq_restore(flags); \
} while (0)
extern int kptr_restrict;
+#define printk_cpu_sync_put_irqrestore(flags) do_printk_cpu_sync_put(flags, 0)
+#define printk_cpu_sync_put_from_nmi(flags) do_printk_cpu_sync_put(flags, 1)
+
/**
* pr_fmt - used by the pr_*() macros to generate the printk format string
* @fmt: format string passed from a pr_*() macro
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index dddb15f48d59..881d1dda6d3a 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -4319,7 +4319,23 @@ void console_replay_all(void)
#ifdef CONFIG_SMP
static atomic_t printk_cpu_sync_owner = ATOMIC_INIT(-1);
+static atomic_t printk_cpu_sync_owner_nmi = ATOMIC_INIT(-1);
static atomic_t printk_cpu_sync_nested = ATOMIC_INIT(0);
+static atomic_t printk_cpu_sync_nested_nmi = ATOMIC_INIT(0);
+
+static atomic_t *get_printk_sync_owner(bool nmi)
+{
+ if (!nmi)
+ return &printk_cpu_sync_owner;
+ return &printk_cpu_sync_owner_nmi;
+}
+
+static atomic_t *get_printk_sync_nested(bool nmi)
+{
+ if (!nmi)
+ return &printk_cpu_sync_nested;
+ return &printk_cpu_sync_nested_nmi;
+}
/**
* __printk_cpu_sync_wait() - Busy wait until the printk cpu-reentrant
@@ -4327,11 +4343,13 @@ static atomic_t printk_cpu_sync_nested = ATOMIC_INIT(0);
*
* Context: Any context.
*/
-void __printk_cpu_sync_wait(void)
+void __printk_cpu_sync_wait(bool nmi)
{
+ atomic_t *owner = get_printk_sync_owner(nmi);
+
do {
cpu_relax();
- } while (atomic_read(&printk_cpu_sync_owner) != -1);
+ } while (atomic_read(owner) != -1);
}
EXPORT_SYMBOL(__printk_cpu_sync_wait);
@@ -4346,8 +4364,9 @@ EXPORT_SYMBOL(__printk_cpu_sync_wait);
* Context: Any context. Expects interrupts to be disabled.
* Return: 1 on success, otherwise 0.
*/
-int __printk_cpu_sync_try_get(void)
+int __printk_cpu_sync_try_get(bool nmi)
{
+ atomic_t *owner = get_printk_sync_owner(nmi);
int cpu;
int old;
@@ -4372,7 +4391,7 @@ int __printk_cpu_sync_try_get(void)
* ACQUIRE from __printk_cpu_sync_try_get:A to
* __printk_cpu_sync_try_get:B of this CPU
*/
- old = atomic_cmpxchg_acquire(&printk_cpu_sync_owner, -1,
+ old = atomic_cmpxchg_acquire(owner, -1,
cpu); /* LMM(__printk_cpu_sync_try_get:A) */
if (old == -1) {
/*
@@ -4382,8 +4401,9 @@ int __printk_cpu_sync_try_get(void)
return 1;
} else if (old == cpu) {
+ atomic_t *nested = get_printk_sync_nested(nmi);
/* This CPU is already the owner. */
- atomic_inc(&printk_cpu_sync_nested);
+ atomic_inc(nested);
return 1;
}
@@ -4398,10 +4418,13 @@ EXPORT_SYMBOL(__printk_cpu_sync_try_get);
*
* Context: Any context. Expects interrupts to be disabled.
*/
-void __printk_cpu_sync_put(void)
+void __printk_cpu_sync_put(bool nmi)
{
- if (atomic_read(&printk_cpu_sync_nested)) {
- atomic_dec(&printk_cpu_sync_nested);
+ atomic_t *owner = get_printk_sync_owner(nmi);
+ atomic_t *nested = get_printk_sync_nested(nmi);
+
+ if (atomic_read(nested)) {
+ atomic_dec(nested);
return;
}
@@ -4428,7 +4451,7 @@ void __printk_cpu_sync_put(void)
* ACQUIRE from __printk_cpu_sync_try_get:A to
* __printk_cpu_sync_try_get:B of the next CPU
*/
- atomic_set_release(&printk_cpu_sync_owner,
+ atomic_set_release(owner,
-1); /* LMM(__printk_cpu_sync_put:B) */
}
EXPORT_SYMBOL(__printk_cpu_sync_put);
diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c
index 33c154264bfe..8fff7d5b86e0 100644
--- a/lib/nmi_backtrace.c
+++ b/lib/nmi_backtrace.c
@@ -101,7 +101,7 @@ bool nmi_cpu_backtrace(struct pt_regs *regs)
* Allow nested NMI backtraces while serializing
* against other CPUs.
*/
- printk_cpu_sync_get_irqsave(flags);
+ printk_cpu_sync_get_from_nmi(flags);
if (!READ_ONCE(backtrace_idle) && regs && cpu_in_idle(instruction_pointer(regs))) {
pr_warn("NMI backtrace for cpu %d skipped: idling at %pS\n",
cpu, (void *)instruction_pointer(regs));
@@ -112,7 +112,7 @@ bool nmi_cpu_backtrace(struct pt_regs *regs)
else
dump_stack();
}
- printk_cpu_sync_put_irqrestore(flags);
+ printk_cpu_sync_put_from_nmi(flags);
cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
return true;
}
--
2.45.2
On 2024-07-15, Rik van Riel <riel@surriel.com> wrote:
> Both nmi_backtrace and dump_stack_lvl call printk_cpu_sync_get_irqsave.
>
> However, dump_stack_lvl will call into the printk code while holding
> the printk_cpu_sync_get lock, and then take the console lock.
>
> Another CPU, after being stuck printing something to the serial
> console for too long with the console lock held, may end up getting
> an NMI stack trace printed.
>
> This results in the following lock order:
> CPU A: printk_cpu_sync_get lock -> console_lock
> CPU B: console_lock -> (nmi) printk_cpu_sync_get lock
>
> This will cause the system to hang with an ABBA deadlock
The console lock is acquired via trylock, so that will not yield a
deadlock here. However, if CPU B was printing, then CPU A will spin on
@console_waiter (in console_trylock_spinning()). _That_ is a deadlock.
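For reference, the spin in question looks roughly like this (a heavily
simplified sketch of console_trylock_spinning() in kernel/printk/printk.c
as of this thread; lockdep annotations and panic handling are omitted):
static int console_trylock_spinning(void)
{
	struct task_struct *owner;
	bool waiter, spin = false;
	unsigned long flags;

	if (console_trylock())
		return 1;

	printk_safe_enter_irqsave(flags);

	raw_spin_lock(&console_owner_lock);
	owner = READ_ONCE(console_owner);
	waiter = READ_ONCE(console_waiter);
	if (!waiter && owner && owner != current) {
		/* Ask the actively printing owner to hand over the lock. */
		WRITE_ONCE(console_waiter, true);
		spin = true;
	}
	raw_spin_unlock(&console_owner_lock);

	if (!spin) {
		printk_safe_exit_irqrestore(flags);
		return 0;
	}

	/* Busy-wait until the owner hands over; this is where CPU A spins. */
	while (READ_ONCE(console_waiter))
		cpu_relax();

	printk_safe_exit_irqrestore(flags);
	return 1;
}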
The purpose of printk_cpu_sync_get_irqsave() is to avoid having the
different backtraces interleaved in the _ringbuffer_. It really
isn't necessary that they are printed in that context. And indeed, there
is no guarantee that they will be printed in that context anyway.
Perhaps a simple solution would be for printk_cpu_sync_get_irqsave() to
call printk_deferred_enter/_exit. Something like the below patch.
John Ogness
diff --git a/include/linux/printk.h b/include/linux/printk.h
index 65c5184470f1..1a6f5aac28bf 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -315,8 +315,10 @@ extern void __printk_cpu_sync_put(void);
#define printk_cpu_sync_get_irqsave(flags) \
for (;;) { \
local_irq_save(flags); \
+ printk_deferred_enter(); \
if (__printk_cpu_sync_try_get()) \
break; \
+ printk_deferred_exit(); \
local_irq_restore(flags); \
__printk_cpu_sync_wait(); \
}
@@ -329,6 +331,7 @@ extern void __printk_cpu_sync_put(void);
#define printk_cpu_sync_put_irqrestore(flags) \
do { \
__printk_cpu_sync_put(); \
+ printk_deferred_exit(); \
local_irq_restore(flags); \
} while (0)
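For context, printk_deferred_enter()/_exit() work by bumping a per-CPU
counter that vprintk() consults before deciding whether to flush the
consoles directly. A rough sketch, loosely based on
kernel/printk/printk_safe.c (simplified; details vary between kernel
versions):
static DEFINE_PER_CPU(int, printk_context);

void __printk_safe_enter(void)
{
	this_cpu_inc(printk_context);
}

void __printk_safe_exit(void)
{
	this_cpu_dec(printk_context);
}

asmlinkage int vprintk(const char *fmt, va_list args)
{
	/*
	 * In NMI context, or when deferral was requested, only store the
	 * message in the ringbuffer; consoles are flushed later via
	 * irq_work instead of from this context.
	 */
	if (this_cpu_read(printk_context) || in_nmi())
		return vprintk_deferred(fmt, args);

	/* No obstacles: print directly. */
	return vprintk_default(fmt, args);
}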
On Wed 2024-07-17 09:22:21, John Ogness wrote:
> On 2024-07-15, Rik van Riel <riel@surriel.com> wrote:
> > Both nmi_backtrace and dump_stack_lvl call printk_cpu_sync_get_irqsave.
> >
> > However, dump_stack_lvl will call into the printk code while holding
> > the printk_cpu_sync_get lock, and then take the console lock.
> >
> > Another CPU, after being stuck printing something to the serial
> > console for too long with the console lock held, may end up getting
> > an NMI stack trace printed.
> >
> > This results in the following lock order:
> > CPU A: printk_cpu_sync_get lock -> console_lock
> > CPU B: console_lock -> (nmi) printk_cpu_sync_get lock
> >
> > This will cause the system to hang with an ABBA deadlock
>
> The console lock is acquired via trylock, so that will not yield
> a deadlock here. However, if CPU B was printing, then CPU A will spin on
> @console_waiter (in console_trylock_spinning()). _That_ is a deadlock.
>
> The purpose of printk_cpu_sync_get_irqsave() is to avoid having the
> different backtraces interleaved in the _ringbuffer_. It really
> isn't necessary that they are printed in that context. And indeed, there
> is no guarantee that they will be printed in that context anyway.
Well, backtraces are printed when unexpected things happen. It is
usually an emergency situation and we should do our best to
flush consoles ASAP.
> Perhaps a simple solution would be for printk_cpu_sync_get_irqsave() to
> call printk_deferred_enter/_exit. Something like the below patch.
>
> John Ogness
>
> diff --git a/include/linux/printk.h b/include/linux/printk.h
> index 65c5184470f1..1a6f5aac28bf 100644
> --- a/include/linux/printk.h
> +++ b/include/linux/printk.h
> @@ -315,8 +315,10 @@ extern void __printk_cpu_sync_put(void);
> #define printk_cpu_sync_get_irqsave(flags) \
> for (;;) { \
> local_irq_save(flags); \
> + printk_deferred_enter(); \
> if (__printk_cpu_sync_try_get()) \
> break; \
> + printk_deferred_exit(); \
> local_irq_restore(flags); \
> __printk_cpu_sync_wait(); \
> }
> @@ -329,6 +331,7 @@ extern void __printk_cpu_sync_put(void);
> #define printk_cpu_sync_put_irqrestore(flags) \
> do { \
> __printk_cpu_sync_put(); \
> + printk_deferred_exit(); \
> local_irq_restore(flags); \
> } while (0)
>
OK, there is the basic rule: "Never take a lock in NMI when the lock might
be taken in another context". printk_cpu_sync_get() violates the rule.
But it is safe as long as the lock is re-entrant and used as a tail
lock, i.e. no other lock is taken while holding it.
The above patch fixes a situation where printk_cpu_sync_get() was not
a tail lock. So it looks like a reasonable fix if we want to keep it simple.
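To illustrate the tail-lock rule, here is a hypothetical example (not
taken from kernel code; other_lock and both functions are made up):
static DEFINE_SPINLOCK(other_lock);

static void safe_tail_lock_usage(void)
{
	unsigned long flags;

	/* Safe: the NMI-reachable lock is the last ("tail") lock taken. */
	printk_cpu_sync_get_irqsave(flags);
	/* ... only lock-free work here ... */
	printk_cpu_sync_put_irqrestore(flags);
}

static void unsafe_non_tail_usage(void)
{
	unsigned long flags;

	/* Unsafe: taking another lock underneath makes it non-tail. */
	printk_cpu_sync_get_irqsave(flags);
	spin_lock(&other_lock);
	/*
	 * If another CPU holds other_lock and then takes an NMI that
	 * calls printk_cpu_sync_get_irqsave(), both CPUs deadlock (ABBA).
	 */
	spin_unlock(&other_lock);
	printk_cpu_sync_put_irqrestore(flags);
}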
Well, I still prefer the alternative solution at
https://lore.kernel.org/r/93155b2ccafa43ed4845ae450451c6b8e27509cc.camel@surriel.com
which does not defer the console output in normal/IRQ context. There
are doubts about whether it is safe. I think that it is. Let me reply there.
Best Regards,
Petr
On Wed, 2024-07-17 at 09:22 +0206, John Ogness wrote:
>
> The purpose of printk_cpu_sync_get_irqsave() is to avoid having the
> different backtraces interleaved in the _ringbuffer_. It really
> isn't necessary that they are printed in that context. And indeed,
> there
> is no guarantee that they will be printed in that context anyway.
>
> Perhaps a simple solution would be for printk_cpu_sync_get_irqsave()
> to
> call printk_deferred_enter/_exit. Something like the below patch.
>
I think that would do the trick. The nmi_backtrace() printk is already
deferred, because of the check for in_nmi() in vprintk(), and this
change would put all the other users of printk_cpu_sync_get_irqsave()
on the exact same footing as nmi_backtrace().
Combing through the code a little, it looks like that would remove
the potential for this deadlock to happen again.
>
>
> diff --git a/include/linux/printk.h b/include/linux/printk.h
> index 65c5184470f1..1a6f5aac28bf 100644
> --- a/include/linux/printk.h
> +++ b/include/linux/printk.h
> @@ -315,8 +315,10 @@ extern void __printk_cpu_sync_put(void);
> #define printk_cpu_sync_get_irqsave(flags) \
> for (;;) { \
> local_irq_save(flags); \
> + printk_deferred_enter(); \
> if (__printk_cpu_sync_try_get()) \
> break; \
> + printk_deferred_exit(); \
> local_irq_restore(flags); \
> __printk_cpu_sync_wait(); \
> }
> @@ -329,6 +331,7 @@ extern void __printk_cpu_sync_put(void);
> #define printk_cpu_sync_put_irqrestore(flags) \
> do { \
> __printk_cpu_sync_put(); \
> + printk_deferred_exit(); \
> local_irq_restore(flags); \
> } while (0)
>
>
--
All Rights Reversed.
On 2024-07-17, Rik van Riel <riel@surriel.com> wrote:
> I think that would do the trick. The nmi_backtrace() printk is already
> deferred, because of the check for in_nmi() in vprintk(), and this
> change would put all the other users of printk_cpu_sync_get_irqsave()
> on the exact same footing as nmi_backtrace().
>
> Combing through the code a little, it looks like that would remove
> the potential for this deadlock to happen again.
Let's see what Petr has to say. (He'll be back on Monday.) He might
prefer a solution that does not result in deferring printing for all
cases, i.e. allow the console_lock if it is available, but avoid the
spinning if it is not. Below is a patch that would achieve this.
John
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index dddb15f48d59..36f40db0bf93 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1060,6 +1060,8 @@ static int __init log_buf_len_setup(char *str)
early_param("log_buf_len", log_buf_len_setup);
#ifdef CONFIG_SMP
+static bool vprintk_emit_may_spin(void);
+
#define __LOG_CPU_MAX_BUF_LEN (1 << CONFIG_LOG_CPU_MAX_BUF_SHIFT)
static void __init log_buf_add_cpu(void)
@@ -1090,6 +1092,7 @@ static void __init log_buf_add_cpu(void)
}
#else /* !CONFIG_SMP */
static inline void log_buf_add_cpu(void) {}
+static inline bool vprintk_emit_may_spin(void) { return true; }
#endif /* CONFIG_SMP */
static void __init set_percpu_data_ready(void)
@@ -2330,6 +2333,8 @@ asmlinkage int vprintk_emit(int facility, int level,
/* If called from the scheduler, we can not call up(). */
if (!in_sched) {
+ int ret;
+
/*
* The caller may be holding system-critical or
* timing-sensitive locks. Disable preemption during
@@ -2344,7 +2349,11 @@ asmlinkage int vprintk_emit(int facility, int level,
* spinning variant, this context tries to take over the
* printing from another printing context.
*/
- if (console_trylock_spinning())
+ if (vprintk_emit_may_spin())
+ ret = console_trylock_spinning();
+ else
+ ret = console_trylock();
+ if (ret)
console_unlock();
preempt_enable();
}
@@ -4321,6 +4330,15 @@ void console_replay_all(void)
static atomic_t printk_cpu_sync_owner = ATOMIC_INIT(-1);
static atomic_t printk_cpu_sync_nested = ATOMIC_INIT(0);
+/*
+ * As documented in printk_cpu_sync_get_irqsave(), a context holding the
+ * printk_cpu_sync must not spin waiting for another CPU.
+ */
+static bool vprintk_emit_may_spin(void)
+{
+ return (atomic_read(&printk_cpu_sync_owner) != smp_processor_id());
+}
+
/**
* __printk_cpu_sync_wait() - Busy wait until the printk cpu-reentrant
* spinning lock is not owned by any CPU.
On Thu, 2024-07-18 at 09:31 +0206, John Ogness wrote:
> On 2024-07-17, Rik van Riel <riel@surriel.com> wrote:
> > I think that would do the trick. The nmi_backtrace() printk is
> > already
> > deferred, because of the check for in_nmi() in vprintk(), and this
> > change would put all the other users of
> > printk_cpu_sync_get_irqsave()
> > on the exact same footing as nmi_backtrace().
> >
> > Combing through the code a little, it looks like that would remove
> > the potential for this deadlock to happen again.
>
> Let's see what Petr has to say. (He'll be back on Monday.) He might
> prefer a solution that does not result in deferring printing for all
> cases, i.e. allow the console_lock if it is available, but avoid the
> spinning if it is not. Below is a patch that would achieve this.
>
> John
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index dddb15f48d59..36f40db0bf93 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1060,6 +1060,8 @@ static int __init log_buf_len_setup(char *str)
> early_param("log_buf_len", log_buf_len_setup);
>
> #ifdef CONFIG_SMP
> +static bool vprintk_emit_may_spin(void);
> +
> #define __LOG_CPU_MAX_BUF_LEN (1 << CONFIG_LOG_CPU_MAX_BUF_SHIFT)
>
> static void __init log_buf_add_cpu(void)
> @@ -1090,6 +1092,7 @@ static void __init log_buf_add_cpu(void)
> }
> #else /* !CONFIG_SMP */
> static inline void log_buf_add_cpu(void) {}
> +static inline bool vprintk_emit_may_spin(void) { return true; }
> #endif /* CONFIG_SMP */
>
> static void __init set_percpu_data_ready(void)
> @@ -2330,6 +2333,8 @@ asmlinkage int vprintk_emit(int facility, int
> level,
>
> /* If called from the scheduler, we can not call up(). */
> if (!in_sched) {
> + int ret;
> +
> /*
> * The caller may be holding system-critical or
> * timing-sensitive locks. Disable preemption during
> @@ -2344,7 +2349,11 @@ asmlinkage int vprintk_emit(int facility, int
> level,
> * spinning variant, this context tries to take over
> the
> * printing from another printing context.
> */
> - if (console_trylock_spinning())
> + if (vprintk_emit_may_spin())
> + ret = console_trylock_spinning();
> + else
> + ret = console_trylock();
> + if (ret)
> console_unlock();
> preempt_enable();
> }
> @@ -4321,6 +4330,15 @@ void console_replay_all(void)
> static atomic_t printk_cpu_sync_owner = ATOMIC_INIT(-1);
> static atomic_t printk_cpu_sync_nested = ATOMIC_INIT(0);
>
> +/*
> + * As documented in printk_cpu_sync_get_irqsave(), a context holding
> the
> + * printk_cpu_sync must not spin waiting for another CPU.
> + */
> +static bool vprintk_emit_may_spin(void)
> +{
> + return (atomic_read(&printk_cpu_sync_owner) !=
> smp_processor_id());
> +}
I think the above would still deadlock, because the reported
deadlock is an ABBA deadlock between two different CPUs.
I think what the code would have to do is only trylock, and never
spin after taking the printk_cpu_sync_get_irqsave lock.
Were you thinking of moving the this_cpu_read(printk_context)
check from vprintk() into vprintk_emit() and using that to decide
whether to spin for the lock, or to give up if the trylock fails?
--
All Rights Reversed.
On 2024-07-18, Rik van Riel <riel@surriel.com> wrote:
>> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
>> index dddb15f48d59..36f40db0bf93 100644
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -1060,6 +1060,8 @@ static int __init log_buf_len_setup(char *str)
>> early_param("log_buf_len", log_buf_len_setup);
>>
>> #ifdef CONFIG_SMP
>> +static bool vprintk_emit_may_spin(void);
>> +
>> #define __LOG_CPU_MAX_BUF_LEN (1 << CONFIG_LOG_CPU_MAX_BUF_SHIFT)
>>
>> static void __init log_buf_add_cpu(void)
>> @@ -1090,6 +1092,7 @@ static void __init log_buf_add_cpu(void)
>> }
>> #else /* !CONFIG_SMP */
>> static inline void log_buf_add_cpu(void) {}
>> +static inline bool vprintk_emit_may_spin(void) { return true; }
>> #endif /* CONFIG_SMP */
>>
>> static void __init set_percpu_data_ready(void)
>> @@ -2330,6 +2333,8 @@ asmlinkage int vprintk_emit(int facility, int
>> level,
>>
>> /* If called from the scheduler, we can not call up(). */
>> if (!in_sched) {
>> + int ret;
>> +
>> /*
>> * The caller may be holding system-critical or
>> * timing-sensitive locks. Disable preemption during
>> @@ -2344,7 +2349,11 @@ asmlinkage int vprintk_emit(int facility, int
>> level,
>> * spinning variant, this context tries to take over
>> the
>> * printing from another printing context.
>> */
>> - if (console_trylock_spinning())
>> + if (vprintk_emit_may_spin())
>> + ret = console_trylock_spinning();
>> + else
>> + ret = console_trylock();
>> + if (ret)
>> console_unlock();
>> preempt_enable();
>> }
>> @@ -4321,6 +4330,15 @@ void console_replay_all(void)
>> static atomic_t printk_cpu_sync_owner = ATOMIC_INIT(-1);
>> static atomic_t printk_cpu_sync_nested = ATOMIC_INIT(0);
>>
>> +/*
>> + * As documented in printk_cpu_sync_get_irqsave(), a context holding
>> the
>> + * printk_cpu_sync must not spin waiting for another CPU.
>> + */
>> +static bool vprintk_emit_may_spin(void)
>> +{
>> + return (atomic_read(&printk_cpu_sync_owner) !=
>> smp_processor_id());
>> +}
>
> I think what the code would have to do is only trylock, and never
> spin after taking the printk_cpu_sync_get_irqsave lock.
That is what the code does. If @printk_cpu_sync_owner is set to the
current CPU, the context is holding the cpu_sync and will call the
non-spinning variant, console_trylock().
However, my first suggestion to defer whenever the cpu_sync is held
really is the only option because console_unlock() will spin on the uart
port lock, and that is also not allowed when holding the cpu_sync.
John
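For illustration, a typical serial console ->write() callback looks
roughly like this sketch (modeled loosely on 8250-style drivers; the
example_* names are hypothetical):
/*
 * Rough sketch of why console_unlock() can spin: a typical serial
 * console ->write() callback takes the uart port spinlock. The
 * example_* helpers below are made up for illustration.
 */
static void example_console_write(struct console *co, const char *s,
				  unsigned int count)
{
	struct uart_port *port = example_port_from_console(co); /* hypothetical */
	unsigned long flags;

	/* Spins if any other CPU currently holds the port lock. */
	spin_lock_irqsave(&port->lock, flags);
	uart_console_write(port, s, count, example_putchar); /* hypothetical putchar */
	spin_unlock_irqrestore(&port->lock, flags);
}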
On Thu 2024-07-18 16:15:43, John Ogness wrote:
> On 2024-07-18, Rik van Riel <riel@surriel.com> wrote:
> >> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> >> index dddb15f48d59..36f40db0bf93 100644
> >> --- a/kernel/printk/printk.c
> >> +++ b/kernel/printk/printk.c
> >> @@ -1060,6 +1060,8 @@ static int __init log_buf_len_setup(char *str)
> >> early_param("log_buf_len", log_buf_len_setup);
> >>
> >> #ifdef CONFIG_SMP
> >> +static bool vprintk_emit_may_spin(void);
> >> +
> >> #define __LOG_CPU_MAX_BUF_LEN (1 << CONFIG_LOG_CPU_MAX_BUF_SHIFT)
> >>
> >> static void __init log_buf_add_cpu(void)
> >> @@ -1090,6 +1092,7 @@ static void __init log_buf_add_cpu(void)
> >> }
> >> #else /* !CONFIG_SMP */
> >> static inline void log_buf_add_cpu(void) {}
> >> +static inline bool vprintk_emit_may_spin(void) { return true; }
> >> #endif /* CONFIG_SMP */
> >>
> >> static void __init set_percpu_data_ready(void)
> >> @@ -2330,6 +2333,8 @@ asmlinkage int vprintk_emit(int facility, int
> >> level,
> >>
> >> /* If called from the scheduler, we can not call up(). */
> >> if (!in_sched) {
> >> + int ret;
> >> +
> >> /*
> >> * The caller may be holding system-critical or
> >> * timing-sensitive locks. Disable preemption during
> >> @@ -2344,7 +2349,11 @@ asmlinkage int vprintk_emit(int facility, int
> >> level,
> >> * spinning variant, this context tries to take over
> >> the
> >> * printing from another printing context.
> >> */
> >> - if (console_trylock_spinning())
> >> + if (vprintk_emit_may_spin())
I would either open code the check or change the function to
is_printk_cpu_sync_owner().
> >> + ret = console_trylock_spinning();
> >> + else
> >> + ret = console_trylock();
> >> + if (ret)
> >> console_unlock();
> >> preempt_enable();
> >> }
> >> @@ -4321,6 +4330,15 @@ void console_replay_all(void)
> >> static atomic_t printk_cpu_sync_owner = ATOMIC_INIT(-1);
> >> static atomic_t printk_cpu_sync_nested = ATOMIC_INIT(0);
> >>
> >> +/*
> >> + * As documented in printk_cpu_sync_get_irqsave(), a context holding
> >> the
> >> + * printk_cpu_sync must not spin waiting for another CPU.
> >> + */
> >> +static bool vprintk_emit_may_spin(void)
> >> +{
> >> + return (atomic_read(&printk_cpu_sync_owner) !=
> >> smp_processor_id());
> >> +}
> >
> > I think what the code would have to do is only trylock, and never
> > spin after taking the printk_cpu_sync_get_irqsave lock.
>
> That is what the code does. If @printk_cpu_sync_owner is set to the
> current CPU, the context is holding the cpu_sync and will call the
> non-spinning variant, console_trylock().
>
> However, my first suggestion to defer whenever the cpu_sync is held
> really is the only option because console_unlock() will spin on the uart
> port lock, and that is also not allowed when holding the cpu_sync.
It would have helped if Rik added backtraces from the crash dumps.
He just wrote:
> > CPU A: printk_cpu_sync_get lock -> console_lock
> > CPU B: console_lock -> (nmi) printk_cpu_sync_get lock
My guess is that it looked like:

   CPU A				CPU B

					printk()
					console_trylock_spinning()
					console_unlock()
					console_emit_next_record()
					console_lock_spinning_enable();
					con->write()
					  spin_lock(port->lock);

   printk_cpu_sync_get()
   printk()
   console_trylock_spinning()
     # spinning and waiting for CPU B

					NMI:
					printk_cpu_sync_get()
					  # waiting for CPU A

   => DEADLOCK
The deadlock is caused under/by printk_cpu_sync_get() but only because
console_trylock_spinning() is blocked. It is not a true "trylock"
operation, which should never block.
=> The above patch should solve the problem as well. It will cause
console_trylock_spinning() to fail immediately on CPU A.
Note that port->lock can't cause any deadlock in this scenario.
console_trylock_spinning() will always fail on CPU A until
the NMI gets handled on CPU B.
In other words, printk_cpu_sync_get() will behave as a tail lock
on CPU A because of the failing trylock.
Best Regards,
Petr
On Wed, 2024-07-24 at 14:56 +0200, Petr Mladek wrote:
> On Thu 2024-07-18 16:15:43, John Ogness wrote:
> >
> > However, my first suggestion to defer whenever the cpu_sync is held
> > really is the only option because console_unlock() will spin on the
> > uart port lock, and that is also not allowed when holding the cpu_sync.
>
> It would have helped if Rik added backtraces from the crash dumps.
> He just wrote:
>
I would have preferred that, as well. However, this deadlock prevented
us from capturing a good backtrace of the CPU that was stuck on the
lock while in the NMI handler!
We ended up having to dig through the NMI stack of the stuck CPU by
hand, looking for anything that might be a function address, and guess
at what the CPU was doing.
This deadlock not only prevented the CPU from printing, but also from
dumping its register state at panic time before the kdump kernel was
kexeced.
--
All Rights Reversed.
On 2024-07-24, Petr Mladek <pmladek@suse.com> wrote:
> My guess is that it looked like:
>
>    CPU A				CPU B
>
>					printk()
>					console_trylock_spinning()
>					console_unlock()
>					console_emit_next_record()
>					console_lock_spinning_enable();
>					con->write()
>					  spin_lock(port->lock);
>
>    printk_cpu_sync_get()
>    printk()
>    console_trylock_spinning()
>      # spinning and waiting for CPU B
>
>					NMI:
>					printk_cpu_sync_get()
>					  # waiting for CPU A
>
>    => DEADLOCK
>
> The deadlock is caused under/by printk_cpu_sync_get() but only because
> console_trylock_spinning() is blocked. It is not a true "trylock"
> operation, which should never block.
>
> => The above patch should solve the problem as well. It will cause
> console_trylock_spinning() to fail immediately on CPU A.
>
> Note that port->lock can't cause any deadlock in this scenario.
> console_trylock_spinning() will always fail on CPU A until
> the NMI gets handled on CPU B.
>
> In other words, printk_cpu_sync_get() will behave as a tail lock
> on CPU A because of the failing trylock.

But only in _this_ scenario. The port lock could be taken by CPU B for
non-console-printing reasons. Then you still have a deadlock, due to
spinning on the port lock.

The documentation of printk_cpu_sync_get() clearly states that the
owner must never perform any activities where it waits for a CPU.

I think deferring printk on that CPU [0] is the only viable solution.

Note that in the case of emergency states, atomic consoles are not
prevented from printing when printk is deferred. This is appropriate
because they do not spin-wait indefinitely for other CPUs. Although I
expect they would not be able to print emergency messages because the
console is likely in an unsafe region at that time.

John Ogness

[0] https://lore.kernel.org/lkml/87plrcqyii.fsf@jogness.linutronix.de
On Wed 2024-07-24 16:51:46, John Ogness wrote:
> On 2024-07-24, Petr Mladek <pmladek@suse.com> wrote:
> > My guess is that it looked like:
> >
> >    CPU A				CPU B
> >
> >					printk()
> >					console_trylock_spinning()
> >					console_unlock()
> >					console_emit_next_record()
> >					console_lock_spinning_enable();
> >					con->write()
> >					  spin_lock(port->lock);
> >
> >    printk_cpu_sync_get()
> >    printk()
> >    console_trylock_spinning()
> >      # spinning and waiting for CPU B
> >
> >					NMI:
> >					printk_cpu_sync_get()
> >					  # waiting for CPU A
> >
> >    => DEADLOCK
> >
> > The deadlock is caused under/by printk_cpu_sync_get() but only because
> > console_trylock_spinning() is blocked. It is not a true "trylock"
> > operation, which should never block.
> >
> > => The above patch should solve the problem as well. It will cause
> > console_trylock_spinning() to fail immediately on CPU A.
> >
> > Note that port->lock can't cause any deadlock in this scenario.
> > console_trylock_spinning() will always fail on CPU A until
> > the NMI gets handled on CPU B.
> >
> > In other words, printk_cpu_sync_get() will behave as a tail lock
> > on CPU A because of the failing trylock.
>
> But only in _this_ scenario. The port lock could be taken by CPU B for
> non-console-printing reasons. Then you still have a deadlock, due to
> spinning on the port lock.

I see. I agree that deferring printk on that CPU [0] is the right
solution.

> [0] https://lore.kernel.org/lkml/87plrcqyii.fsf@jogness.linutronix.de

Best Regards,
Petr
On Wed, 2024-07-24 at 17:08 +0200, Petr Mladek wrote:
> On Wed 2024-07-24 16:51:46, John Ogness wrote:
> >
> I see. I agree that deferring printk on that CPU [0] is the right
> solution.
>
> > [0]
> > https://lore.kernel.org/lkml/87plrcqyii.fsf@jogness.linutronix.de
>
Hi Petr,

we have been running with John Ogness's fix (from the mail linked
above) for about a month now, and it has reduced the number of
deadlocks in the Linux kernel in our fleet by maybe 90%, comparing
a kernel with the fix to a kernel without the fix.

Would it be possible to get John Ogness's fix queued up for merging
in the printk tree, and then later to Linus?

--
All Rights Reversed.
On Fri 2024-09-13 13:25:33, Rik van Riel wrote:
> On Wed, 2024-07-24 at 17:08 +0200, Petr Mladek wrote:
> > On Wed 2024-07-24 16:51:46, John Ogness wrote:
> > >
> > I see. I agree that deferring printk on that CPU [0] is the right
> > solution.
> >
> > > [0]
> > > https://lore.kernel.org/lkml/87plrcqyii.fsf@jogness.linutronix.de
>
> Hi Petr,
>
> we have been running with John Ogness's fix (from the mail linked
> above) for about a month now, and it has reduced the number of
> deadlocks in the Linux kernel in our fleet by maybe 90%, comparing
> a kernel with the fix to a kernel without the fix.
>
> Would it be possible to get John Ogness's fix queued up for merging
> in the printk tree, and then later to Linus?

Sure. Thanks for the reminder.

John, could you please send the fix as a proper patch with some
reasonable commit message?

Best Regards,
Petr
On Thu, 2024-07-18 at 16:15 +0206, John Ogness wrote:
>
> However, my first suggestion to defer whenever the cpu_sync is held
> really is the only option because console_unlock() will spin on the
> uart port lock, and that is also not allowed when holding the cpu_sync.
>
Agreed, even with using only trylocks inside the printk code and
console code, there is no way to control what other locks might be
lurking in the device drivers the console uses.

I'll go try out your first suggestion.

--
All Rights Reversed.