In emergency contexts, printk() tries to flush messages directly even
on nbcon consoles. It is allowed to take over the console ownership
and interrupt the printk kthread in the middle of a message.
Only one takeover and one repeated message should be enough in most
situations. The first emergency message flushes the backlog and the
printk kthreads go to sleep. Subsequent emergency messages are flushed
directly and printk() does not wake up the kthreads.
However, even the single takeover is not guaranteed. Any printk() in
normal context on another CPU could wake up the kthreads. Or a new
emergency message might be added before the kthreads go to sleep. Note
that interrupted .write_kthread() callbacks usually have to call
nbcon_reacquire_nobuf() and restore the original device settings
before checking for pending messages.
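For illustration, an interrupted .write_thread()-style callback usually
follows a pattern like this (a simplified sketch, loosely modeled on the
8250 driver; my_emit_byte() and my_restore_device() are hypothetical
placeholders, only the nbcon_*() helpers are the real API):

	static void my_write_thread(struct console *con,
				    struct nbcon_write_context *wctxt)
	{
		bool owned = nbcon_enter_unsafe(wctxt);
		int i;

		for (i = 0; owned && i < wctxt->len; i++) {
			my_emit_byte(con, wctxt->outbuf[i]);	/* hypothetical */

			/* Toggle the unsafe state to allow a friendly takeover. */
			owned = nbcon_exit_unsafe(wctxt) && nbcon_enter_unsafe(wctxt);
		}

		if (owned)
			owned = nbcon_exit_unsafe(wctxt);

		/*
		 * When the ownership has been taken over in the middle of
		 * the message, busy wait until it comes back (without any
		 * output buffer) so that the device state can be restored
		 * before checking for pending messages.
		 */
		if (!owned)
			nbcon_reacquire_nobuf(wctxt);

		my_restore_device(con);		/* hypothetical */
	}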
The risk of repeated takeovers is going to grow even bigger because
__nbcon_atomic_flush_pending_con() is going to release the console
ownership after each emitted record. This will be needed to prevent
hardlockup reports on other CPUs which are busy waiting for
the console ownership, for example, in nbcon_reacquire_nobuf() or
__uart_port_nbcon_acquire().
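Roughly, the flush loop would then do something like this (a sketch
using the existing nbcon.c internals, nbcon_context_try_acquire(),
nbcon_emit_next_record(), and nbcon_context_release(); the exact shape
in the follow-up patch may differ):

	/* Flush records up to stop_seq, releasing the ownership in between. */
	while (nbcon_seq_read(con) < stop_seq) {
		if (!nbcon_context_try_acquire(ctxt))
			return -EPERM;

		/*
		 * When emitting fails, the ownership was already taken
		 * over by a higher priority context. Do not release it.
		 */
		if (!nbcon_emit_next_record(ctxt, true))
			return -EAGAIN;

		/*
		 * Release the ownership after each record so that CPUs
		 * busy waiting for it do not trigger hardlockup reports.
		 */
		nbcon_context_release(ctxt);
	}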
The repeated takeovers break the output, for example:
[ 5042.650211][ T2220] Call Trace:
[ 5042.6511
** replaying previous printk message **
[ 5042.651192][ T2220] <TASK>
[ 5042.652160][ T2220] kunit_run_
** replaying previous printk message **
[ 5042.652160][ T2220] kunit_run_tests+0x72/0x90
[ 5042.653340][ T22
** replaying previous printk message **
[ 5042.653340][ T2220] ? srso_alias_return_thunk+0x5/0xfbef5
[ 5042.654628][ T2220] ? stack_trace_save+0x4d/0x70
[ 5042.6553
** replaying previous printk message **
[ 5042.655394][ T2220] ? srso_alias_return_thunk+0x5/0xfbef5
[ 5042.656713][ T2220] ? save_trace+0x5b/0x180
A more robust solution is to block the printk kthreads entirely whenever
*any* CPU enters an emergency context. This ensures that critical messages
can be flushed without contention from the normal, non-atomic printing
path.
Link: https://lore.kernel.org/all/aNQO-zl3k1l4ENfy@pathway.suse.cz
Signed-off-by: Petr Mladek <pmladek@suse.com>
---
kernel/printk/nbcon.c | 32 +++++++++++++++++++++++++++++++-
1 file changed, 31 insertions(+), 1 deletion(-)
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index d5d8c8c657e0..08b196e898cd 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -117,6 +117,9 @@
  * from scratch.
  */
 
+/* Counter of active nbcon emergency contexts. */
+atomic_t nbcon_cpu_emergency_cnt;
+
 /**
  * nbcon_state_set - Helper function to set the console state
  * @con:	Console to update
@@ -1168,6 +1171,16 @@ static bool nbcon_kthread_should_wakeup(struct console *con, struct nbcon_contex
 	if (kthread_should_stop())
 		return true;
 
+	/*
+	 * Block the kthread when the system is in an emergency or panic mode.
+	 * It increases the chance that these contexts would be able to show
+	 * the messages directly. And it reduces the risk of interrupted writes
+	 * where the context with a higher priority takes over the nbcon console
+	 * ownership in the middle of a message.
+	 */
+	if (unlikely(atomic_read(&nbcon_cpu_emergency_cnt)))
+		return false;
+
 	cookie = console_srcu_read_lock();
 
 	flags = console_srcu_read_flags(con);
@@ -1219,6 +1232,13 @@ static int nbcon_kthread_func(void *__console)
 	if (kthread_should_stop())
 		return 0;
 
+	/*
+	 * Block the kthread when the system is in an emergency or panic
+	 * mode. See nbcon_kthread_should_wakeup() for more details.
+	 */
+	if (unlikely(atomic_read(&nbcon_cpu_emergency_cnt)))
+		goto wait_for_event;
+
 	backlog = false;
 
 	/*
@@ -1660,6 +1680,8 @@ void nbcon_cpu_emergency_enter(void)
 
 	preempt_disable();
 
+	atomic_inc(&nbcon_cpu_emergency_cnt);
+
 	cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting();
 	(*cpu_emergency_nesting)++;
 }
@@ -1674,10 +1696,18 @@ void nbcon_cpu_emergency_exit(void)
 	unsigned int *cpu_emergency_nesting;
 
 	cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting();
-
 	if (!WARN_ON_ONCE(*cpu_emergency_nesting == 0))
 		(*cpu_emergency_nesting)--;
 
+	/*
+	 * Wake up kthreads because there might be some pending messages
+	 * added by other CPUs with normal priority since the last flush
+	 * in the emergency context.
+	 */
+	if (!WARN_ON_ONCE(atomic_read(&nbcon_cpu_emergency_cnt) == 0))
+		if (atomic_dec_return(&nbcon_cpu_emergency_cnt) == 0)
+			nbcon_kthreads_wake();
+
 	preempt_enable();
 }
 
--
2.51.0
Hi Petr,
kernel test robot noticed the following build warnings:
[auto build test WARNING on soc/for-next]
[also build test WARNING on linus/master v6.17 next-20250929]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Petr-Mladek/printk-nbcon-Block-printk-kthreads-when-any-CPU-is-in-an-emergency-context/20250926-205414
base: https://git.kernel.org/pub/scm/linux/kernel/git/soc/soc.git for-next
patch link: https://lore.kernel.org/r/20250926124912.243464-2-pmladek%40suse.com
patch subject: [PATCH 1/3] printk/nbcon: Block printk kthreads when any CPU is in an emergency context
config: arc-randconfig-r131-20251001 (https://download.01.org/0day-ci/archive/20251001/202510010320.jV84a9vM-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251001/202510010320.jV84a9vM-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510010320.jV84a9vM-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
>> kernel/printk/nbcon.c:121:10: sparse: sparse: symbol 'nbcon_cpu_emergency_cnt' was not declared. Should it be static?
vim +/nbcon_cpu_emergency_cnt +121 kernel/printk/nbcon.c
4
5 #include <linux/atomic.h>
6 #include <linux/bug.h>
7 #include <linux/console.h>
8 #include <linux/delay.h>
9 #include <linux/errno.h>
10 #include <linux/export.h>
11 #include <linux/init.h>
12 #include <linux/irqflags.h>
13 #include <linux/kthread.h>
14 #include <linux/minmax.h>
15 #include <linux/percpu.h>
16 #include <linux/preempt.h>
17 #include <linux/slab.h>
18 #include <linux/smp.h>
19 #include <linux/stddef.h>
20 #include <linux/string.h>
21 #include <linux/types.h>
22 #include "internal.h"
23 #include "printk_ringbuffer.h"
24 /*
25 * Printk console printing implementation for consoles which does not depend
26 * on the legacy style console_lock mechanism.
27 *
28 * The state of the console is maintained in the "nbcon_state" atomic
29 * variable.
30 *
31 * The console is locked when:
32 *
33 * - The 'prio' field contains the priority of the context that owns the
34 * console. Only higher priority contexts are allowed to take over the
35 * lock. A value of 0 (NBCON_PRIO_NONE) means the console is not locked.
36 *
37 * - The 'cpu' field denotes on which CPU the console is locked. It is used
38 * to prevent busy waiting on the same CPU. Also it informs the lock owner
39 * that it has lost the lock in a more complex scenario when the lock was
40 * taken over by a higher priority context, released, and taken on another
41 * CPU with the same priority as the interrupted owner.
42 *
43 * The acquire mechanism uses a few more fields:
44 *
45 * - The 'req_prio' field is used by the handover approach to make the
46 * current owner aware that there is a context with a higher priority
47 * waiting for the friendly handover.
48 *
49 * - The 'unsafe' field allows to take over the console in a safe way in the
50 * middle of emitting a message. The field is set only when accessing some
51 * shared resources or when the console device is manipulated. It can be
52 * cleared, for example, after emitting one character when the console
53 * device is in a consistent state.
54 *
55 * - The 'unsafe_takeover' field is set when a hostile takeover took the
56 * console in an unsafe state. The console will stay in the unsafe state
57 * until re-initialized.
58 *
59 * The acquire mechanism uses three approaches:
60 *
61 * 1) Direct acquire when the console is not owned or is owned by a lower
62 * priority context and is in a safe state.
63 *
64 * 2) Friendly handover mechanism uses a request/grant handshake. It is used
65 * when the current owner has lower priority and the console is in an
66 * unsafe state.
67 *
68 * The requesting context:
69 *
70 * a) Sets its priority into the 'req_prio' field.
71 *
72 * b) Waits (with a timeout) for the owning context to unlock the
73 * console.
74 *
75 * c) Takes the lock and clears the 'req_prio' field.
76 *
77 * The owning context:
78 *
79 * a) Observes the 'req_prio' field set on exit from the unsafe
80 * console state.
81 *
82 * b) Gives up console ownership by clearing the 'prio' field.
83 *
84 * 3) Unsafe hostile takeover allows to take over the lock even when the
85 * console is an unsafe state. It is used only in panic() by the final
86 * attempt to flush consoles in a try and hope mode.
87 *
88 * Note that separate record buffers are used in panic(). As a result,
89 * the messages can be read and formatted without any risk even after
90 * using the hostile takeover in unsafe state.
91 *
92 * The release function simply clears the 'prio' field.
93 *
94 * All operations on @console::nbcon_state are atomic cmpxchg based to
95 * handle concurrency.
96 *
97 * The acquire/release functions implement only minimal policies:
98 *
99 * - Preference for higher priority contexts.
100 * - Protection of the panic CPU.
101 *
102 * All other policy decisions must be made at the call sites:
103 *
104 * - What is marked as an unsafe section.
105 * - Whether to spin-wait if there is already an owner and the console is
106 * in an unsafe state.
107 * - Whether to attempt an unsafe hostile takeover.
108 *
109 * The design allows to implement the well known:
110 *
111 * acquire()
112 * output_one_printk_record()
113 * release()
114 *
115 * The output of one printk record might be interrupted with a higher priority
116 * context. The new owner is supposed to reprint the entire interrupted record
117 * from scratch.
118 */
119
120 /* Counter of active nbcon emergency contexts. */
> 121 atomic_t nbcon_cpu_emergency_cnt;
122
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Fri, 26 Sept 2025 at 13:50, Petr Mladek <pmladek@suse.com> wrote:
> In emergency contexts, printk() tries to flush messages directly even
> on nbcon consoles. And it is allowed to takeover the console ownership
> and interrupt the printk kthread in the middle of a message.
[... full patch quoted ...]

Reviewed-by: Andrew Murray <amurray@thegoodpenguin.co.uk>

Thanks,
Andrew Murray
On 2025-09-26, Petr Mladek <pmladek@suse.com> wrote:
> In emergency contexts, printk() tries to flush messages directly even
> on nbcon consoles. And it is allowed to takeover the console ownership
> and interrupt the printk kthread in the middle of a message.
>
> Only one takeover and one repeated message should be enough in most
> situations. The first emergency message flushes the backlog and printk
> kthreads get to sleep. Next emergency messages are flushed directly
> and printk() does not wake up the kthreads.
>
> However, the one takeover is not guaranteed. Any printk() in normal
> context on another CPU could wake up the kthreads. Or a new emergency
> message might be added before the kthreads get to sleep. Note that
> the interrupted .write_kthread() callbacks usually have to call
.write_thread()
> nbcon_reacquire_nobuf() and restore the original device setting
> before checking for pending messages.
>
> The risk of the repeated takeovers will be even bigger because
> __nbcon_atomic_flush_pending_con is going to release the console
> ownership after each emitted record. It will be needed to prevent
> hardlockup reports on other CPUs which are busy waiting for
> the context ownership, for example, by nbcon_reacquire_nobuf() or
> __uart_port_nbcon_acquire().
>
> The repeated takeovers break the output, for example:
>
> [ 5042.650211][ T2220] Call Trace:
> [ 5042.6511
> ** replaying previous printk message **
> [ 5042.651192][ T2220] <TASK>
> [ 5042.652160][ T2220] kunit_run_
> ** replaying previous printk message **
> [ 5042.652160][ T2220] kunit_run_tests+0x72/0x90
> [ 5042.653340][ T22
> ** replaying previous printk message **
> [ 5042.653340][ T2220] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 5042.654628][ T2220] ? stack_trace_save+0x4d/0x70
> [ 5042.6553
> ** replaying previous printk message **
> [ 5042.655394][ T2220] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 5042.656713][ T2220] ? save_trace+0x5b/0x180
>
> A more robust solution is to block the printk kthread entirely whenever
> *any* CPU enters an emergency context. This ensures that critical messages
> can be flushed without contention from the normal, non-atomic printing
> path.
>
> Link: https://lore.kernel.org/all/aNQO-zl3k1l4ENfy@pathway.suse.cz
> Signed-off-by: Petr Mladek <pmladek@suse.com>
> ---
> kernel/printk/nbcon.c | 32 +++++++++++++++++++++++++++++++-
> 1 file changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
> index d5d8c8c657e0..08b196e898cd 100644
> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -117,6 +117,9 @@
> * from scratch.
> */
>
> +/* Counter of active nbcon emergency contexts. */
> +atomic_t nbcon_cpu_emergency_cnt;
This can be static and should be initialized:
static atomic_t nbcon_cpu_emergency_cnt = ATOMIC_INIT(0);
> +
> /**
> * nbcon_state_set - Helper function to set the console state
> * @con: Console to update
> @@ -1168,6 +1171,16 @@ static bool nbcon_kthread_should_wakeup(struct console *con, struct nbcon_contex
> if (kthread_should_stop())
> return true;
>
> + /*
> + * Block the kthread when the system is in an emergency or panic mode.
> + * It increases the chance that these contexts would be able to show
> + * the messages directly. And it reduces the risk of interrupted writes
> + * where the context with a higher priority takes over the nbcon console
> + * ownership in the middle of a message.
> + */
> + if (unlikely(atomic_read(&nbcon_cpu_emergency_cnt)))
> + return false;
> +
> cookie = console_srcu_read_lock();
>
> flags = console_srcu_read_flags(con);
> @@ -1219,6 +1232,13 @@ static int nbcon_kthread_func(void *__console)
> if (kthread_should_stop())
> return 0;
>
> + /*
> + * Block the kthread when the system is in an emergency or panic
> + * mode. See nbcon_kthread_should_wakeup() for more details.
> + */
> + if (unlikely(atomic_read(&nbcon_cpu_emergency_cnt)))
> + goto wait_for_event;
> +
> backlog = false;
>
> /*
> @@ -1660,6 +1680,8 @@ void nbcon_cpu_emergency_enter(void)
>
> preempt_disable();
>
> + atomic_inc(&nbcon_cpu_emergency_cnt);
> +
> cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting();
> (*cpu_emergency_nesting)++;
> }
> @@ -1674,10 +1696,18 @@ void nbcon_cpu_emergency_exit(void)
> unsigned int *cpu_emergency_nesting;
>
> cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting();
> -
> if (!WARN_ON_ONCE(*cpu_emergency_nesting == 0))
> (*cpu_emergency_nesting)--;
>
> + /*
> + * Wake up kthreads because there might be some pending messages
> + * added by other CPUs with normal priority since the last flush
> + * in the emergency context.
> + */
> + if (!WARN_ON_ONCE(atomic_read(&nbcon_cpu_emergency_cnt) == 0))
> + if (atomic_dec_return(&nbcon_cpu_emergency_cnt) == 0)
> + nbcon_kthreads_wake();
Although technically it doesn't hurt to blindly call
nbcon_kthreads_wake(), you may want to do it more formally. Maybe like
this:
	if (!WARN_ON_ONCE(atomic_read(&nbcon_cpu_emergency_cnt) == 0)) {
		if (atomic_dec_return(&nbcon_cpu_emergency_cnt) == 0) {
			struct console_flush_type ft;

			printk_get_console_flush_type(&ft);
			if (ft.nbcon_offload)
				nbcon_kthreads_wake();
		}
	}
I leave it up to you.
With the static+initializer change:
Reviewed-by: John Ogness <john.ogness@linutronix.de>
On Fri 2025-09-26 16:43:33, John Ogness wrote:
> On 2025-09-26, Petr Mladek <pmladek@suse.com> wrote:
> > In emergency contexts, printk() tries to flush messages directly even
> > on nbcon consoles. And it is allowed to takeover the console ownership
> > and interrupt the printk kthread in the middle of a message.
> >
> > Only one takeover and one repeated message should be enough in most
> > situations. The first emergency message flushes the backlog and printk
> > kthreads get to sleep. Next emergency messages are flushed directly
> > and printk() does not wake up the kthreads.
> >
> > However, the one takeover is not guaranteed. Any printk() in normal
> > context on another CPU could wake up the kthreads. Or a new emergency
> > message might be added before the kthreads get to sleep. Note that
> > the interrupted .write_kthread() callbacks usually have to call
>
> .write_thread()
Oh my muscle memory ;-)
> > nbcon_reacquire_nobuf() and restore the original device setting
> > before checking for pending messages.
[...]
> > --- a/kernel/printk/nbcon.c
> > +++ b/kernel/printk/nbcon.c
> > @@ -1674,10 +1696,18 @@ void nbcon_cpu_emergency_exit(void)
> > unsigned int *cpu_emergency_nesting;
> >
> > cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting();
> > -
> > if (!WARN_ON_ONCE(*cpu_emergency_nesting == 0))
> > (*cpu_emergency_nesting)--;
> >
> > + /*
> > + * Wake up kthreads because there might be some pending messages
> > + * added by other CPUs with normal priority since the last flush
> > + * in the emergency context.
> > + */
> > + if (!WARN_ON_ONCE(atomic_read(&nbcon_cpu_emergency_cnt) == 0))
> > + if (atomic_dec_return(&nbcon_cpu_emergency_cnt) == 0)
> > + nbcon_kthreads_wake();
>
> Although technically it doesn't hurt to blindly call
> nbcon_kthreads_wake(), you may want to do it more formally. Maybe like
> this:
>
> 	if (!WARN_ON_ONCE(atomic_read(&nbcon_cpu_emergency_cnt) == 0)) {
> 		if (atomic_dec_return(&nbcon_cpu_emergency_cnt) == 0) {
> 			struct console_flush_type ft;
>
> 			printk_get_console_flush_type(&ft);
> 			if (ft.nbcon_offload)
> 				nbcon_kthreads_wake();
> 		}
> 	}
>
> I leave it up to you.
I agree that this is better. I'll use it in v2.
> With the static+initializer change:
>
> Reviewed-by: John Ogness <john.ogness@linutronix.de>
Thanks a lot for quick review.
I am going to send v2 when the panic state API patchset (in -mm tree)
gets accepted upstream.
Best Regards,
Petr