[PATCH] ring-buffer: Use a housekeeping CPU to wake up waiters
Posted by Petr Tesarik 1 month ago
Avoid running the wakeup irq_work on an isolated CPU. Since the wakeup can
run on any CPU, let's pick a housekeeping CPU to do the job.

This change reduces additional noise when tracing isolated CPUs. For
example, the following ipi_send_cpu stack trace was captured with
nohz_full=2 on the isolated CPU:

          <idle>-0       [002] d.h4.  1255.379293: ipi_send_cpu: cpu=2 callsite=irq_work_queue+0x2d/0x50 callback=rb_wake_up_waiters+0x0/0x80
          <idle>-0       [002] d.h4.  1255.379329: <stack trace>
 => trace_event_raw_event_ipi_send_cpu
 => __irq_work_queue_local
 => irq_work_queue
 => ring_buffer_unlock_commit
 => trace_buffer_unlock_commit_regs
 => trace_event_buffer_commit
 => trace_event_raw_event_x86_irq_vector
 => __sysvec_apic_timer_interrupt
 => sysvec_apic_timer_interrupt
 => asm_sysvec_apic_timer_interrupt
 => pv_native_safe_halt
 => default_idle
 => default_idle_call
 => do_idle
 => cpu_startup_entry
 => start_secondary
 => common_startup_64

The IRQ work interrupt alone adds considerable noise, but the impact can
get even worse with PREEMPT_RT, because the IRQ work interrupt is then
handled by a separate kernel thread. This requires a task switch and makes
tracing useless for analyzing latency on an isolated CPU.

Signed-off-by: Petr Tesarik <ptesarik@suse.com>
---
 kernel/trace/ring_buffer.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 41c9f5d079beb..ed9160599091d 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -4,6 +4,7 @@
  *
  * Copyright (C) 2008 Steven Rostedt <srostedt@redhat.com>
  */
+#include <linux/sched/isolation.h>
 #include <linux/trace_recursion.h>
 #include <linux/trace_events.h>
 #include <linux/ring_buffer.h>
@@ -4011,19 +4012,31 @@ static void rb_commit(struct ring_buffer_per_cpu *cpu_buffer)
 	rb_end_commit(cpu_buffer);
 }
 
+static inline bool
+rb_irq_work_queue(struct rb_irq_work *irq_work)
+{
+	int cpu = housekeeping_any_cpu(HK_TYPE_KERNEL_NOISE);
+
+	/*
+	 * If CPU isolation is not active, cpu is always the current
+	 * CPU, and the following is equivalent to irq_work_queue().
+	 */
+	return irq_work_queue_on(&irq_work->work, cpu);
+}
+
 static __always_inline void
 rb_wakeups(struct trace_buffer *buffer, struct ring_buffer_per_cpu *cpu_buffer)
 {
 	if (buffer->irq_work.waiters_pending) {
 		buffer->irq_work.waiters_pending = false;
 		/* irq_work_queue() supplies it's own memory barriers */
-		irq_work_queue(&buffer->irq_work.work);
+		rb_irq_work_queue(&buffer->irq_work);
 	}
 
 	if (cpu_buffer->irq_work.waiters_pending) {
 		cpu_buffer->irq_work.waiters_pending = false;
 		/* irq_work_queue() supplies it's own memory barriers */
-		irq_work_queue(&cpu_buffer->irq_work.work);
+		rb_irq_work_queue(&cpu_buffer->irq_work);
 	}
 
 	if (cpu_buffer->last_pages_touch == local_read(&cpu_buffer->pages_touched))
@@ -4043,7 +4056,7 @@ rb_wakeups(struct trace_buffer *buffer, struct ring_buffer_per_cpu *cpu_buffer)
 	cpu_buffer->irq_work.wakeup_full = true;
 	cpu_buffer->irq_work.full_waiters_pending = false;
 	/* irq_work_queue() supplies it's own memory barriers */
-	irq_work_queue(&cpu_buffer->irq_work.work);
+	rb_irq_work_queue(&cpu_buffer->irq_work);
 }
 
 #ifdef CONFIG_RING_BUFFER_RECORD_RECURSION
-- 
2.52.0
Re: [PATCH] ring-buffer: Use a housekeeping CPU to wake up waiters
Posted by Steven Rostedt 1 month ago
On Tue,  6 Jan 2026 10:10:39 +0100
Petr Tesarik <ptesarik@suse.com> wrote:

> Avoid running the wakeup irq_work on an isolated CPU. Since the wakeup can
> run on any CPU, let's pick a housekeeping CPU to do the job.
> 
> This change reduces additional noise when tracing isolated CPUs. For
> example, the following ipi_send_cpu stack trace was captured with
> nohz_full=2 on the isolated CPU:
> 
>           <idle>-0       [002] d.h4.  1255.379293: ipi_send_cpu: cpu=2 callsite=irq_work_queue+0x2d/0x50 callback=rb_wake_up_waiters+0x0/0x80
>           <idle>-0       [002] d.h4.  1255.379329: <stack trace>
>  => trace_event_raw_event_ipi_send_cpu
>  => __irq_work_queue_local
>  => irq_work_queue
>  => ring_buffer_unlock_commit
>  => trace_buffer_unlock_commit_regs
>  => trace_event_buffer_commit
>  => trace_event_raw_event_x86_irq_vector
>  => __sysvec_apic_timer_interrupt
>  => sysvec_apic_timer_interrupt
>  => asm_sysvec_apic_timer_interrupt
>  => pv_native_safe_halt
>  => default_idle
>  => default_idle_call
>  => do_idle
>  => cpu_startup_entry
>  => start_secondary
>  => common_startup_64  

I take it that even with this patch you would still get the above events.
The only difference would be the "cpu=" in the event info will not be the
same as the CPU it executed on, right?

> 
> The IRQ work interrupt alone adds considerable noise, but the impact can
> get even worse with PREEMPT_RT, because the IRQ work interrupt is then
> handled by a separate kernel thread. This requires a task switch and makes
> tracing useless for analyzing latency on an isolated CPU.
> 
> Signed-off-by: Petr Tesarik <ptesarik@suse.com>

LGTM,

I'll queue it up for the next merge window.

-- Steve
Re: [PATCH] ring-buffer: Use a housekeeping CPU to wake up waiters
Posted by Petr Tesarik 1 month ago
On Tue, 6 Jan 2026 17:04:05 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue,  6 Jan 2026 10:10:39 +0100
> Petr Tesarik <ptesarik@suse.com> wrote:
> 
> > Avoid running the wakeup irq_work on an isolated CPU. Since the wakeup can
> > run on any CPU, let's pick a housekeeping CPU to do the job.
> > 
> > This change reduces additional noise when tracing isolated CPUs. For
> > example, the following ipi_send_cpu stack trace was captured with
> > nohz_full=2 on the isolated CPU:
> > 
> >           <idle>-0       [002] d.h4.  1255.379293: ipi_send_cpu: cpu=2 callsite=irq_work_queue+0x2d/0x50 callback=rb_wake_up_waiters+0x0/0x80
> >           <idle>-0       [002] d.h4.  1255.379329: <stack trace>  
> >  => trace_event_raw_event_ipi_send_cpu
> >  => __irq_work_queue_local
> >  => irq_work_queue
> >  => ring_buffer_unlock_commit
> >  => trace_buffer_unlock_commit_regs
> >  => trace_event_buffer_commit
> >  => trace_event_raw_event_x86_irq_vector
> >  => __sysvec_apic_timer_interrupt
> >  => sysvec_apic_timer_interrupt
> >  => asm_sysvec_apic_timer_interrupt
> >  => pv_native_safe_halt
> >  => default_idle
> >  => default_idle_call
> >  => do_idle
> >  => cpu_startup_entry
> >  => start_secondary
> >  => common_startup_64    
> 
> I take it that even with this patch you would still get the above events.
> The only difference would be the "cpu=" in the event info will not be the
> same as the CPU it executed on, right?

Yes, this is a trace of a similar event after applying the patch:

          <idle>-0       [002] d.h4.   313.334367: ipi_send_cpu: cpu=1 callsite=irq_work_queue_on+0x55/0x90 callback=generic_smp_call_function_single_interrupt+0x0/0x20
          <idle>-0       [002] d.h4.   313.334390: <stack trace>
 => trace_event_raw_event_ipi_send_cpu
 => __smp_call_single_queue
 => irq_work_queue_on
 => ring_buffer_unlock_commit
 => trace_buffer_unlock_commit_regs
 => trace_event_buffer_commit
 => trace_event_raw_event_x86_irq_vector
 => __sysvec_apic_timer_interrupt
 => sysvec_apic_timer_interrupt
 => asm_sysvec_apic_timer_interrupt
 => pv_native_safe_halt
 => default_idle
 => default_idle_call
 => do_idle
 => cpu_startup_entry
 => start_secondary
 => common_startup_64

The callback function in the trace event is different. That's because
send_call_function_single_ipi() always reports this callback, regardless
of what was actually queued. Maybe it can be improved, and I can look
into it, but that's clearly a separate issue.
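
For reference, this is roughly where the callback gets hardcoded,
paraphrased from kernel/smp.c of a recent tree (details may differ):

	void send_call_function_single_ipi(int cpu)
	{
		if (call_function_single_prep_ipi(cpu)) {
			trace_ipi_send_cpu(cpu, _RET_IP_,
					   generic_smp_call_function_single_interrupt);
			arch_send_call_function_single_ipi(cpu);
		}
	}

Whatever is queued through __smp_call_single_queue(), the tracepoint
always reports the same callback.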

> > The IRQ work interrupt alone adds considerable noise, but the impact can
> > get even worse with PREEMPT_RT, because the IRQ work interrupt is then
> > handled by a separate kernel thread. This requires a task switch and makes
> > tracing useless for analyzing latency on an isolated CPU.
> > 
> > Signed-off-by: Petr Tesarik <ptesarik@suse.com>  
> 
> LGTM,
> 
> I'll queue it up for the next merge window.

Thank you!

Petr T
Re: [PATCH] ring-buffer: Use a housekeeping CPU to wake up waiters
Posted by Petr Tesarik 1 month ago
On Wed, 7 Jan 2026 08:50:09 +0100
Petr Tesarik <ptesarik@suse.com> wrote:

> On Tue, 6 Jan 2026 17:04:05 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Tue,  6 Jan 2026 10:10:39 +0100
> > Petr Tesarik <ptesarik@suse.com> wrote:
> >   
> > > Avoid running the wakeup irq_work on an isolated CPU. Since the wakeup can
> > > run on any CPU, let's pick a housekeeping CPU to do the job.
> > > 
> > > This change reduces additional noise when tracing isolated CPUs. For
> > > example, the following ipi_send_cpu stack trace was captured with
> > > nohz_full=2 on the isolated CPU:
> > > 
> > >           <idle>-0       [002] d.h4.  1255.379293: ipi_send_cpu: cpu=2 callsite=irq_work_queue+0x2d/0x50 callback=rb_wake_up_waiters+0x0/0x80
> > >           <idle>-0       [002] d.h4.  1255.379329: <stack trace>    
> > >  => trace_event_raw_event_ipi_send_cpu
> > >  => __irq_work_queue_local
> > >  => irq_work_queue
> > >  => ring_buffer_unlock_commit
> > >  => trace_buffer_unlock_commit_regs
> > >  => trace_event_buffer_commit
> > >  => trace_event_raw_event_x86_irq_vector
> > >  => __sysvec_apic_timer_interrupt
> > >  => sysvec_apic_timer_interrupt
> > >  => asm_sysvec_apic_timer_interrupt
> > >  => pv_native_safe_halt
> > >  => default_idle
> > >  => default_idle_call
> > >  => do_idle
> > >  => cpu_startup_entry
> > >  => start_secondary
> > >  => common_startup_64      
> > 
> > I take it that even with this patch you would still get the above events.
> > The only difference would be the "cpu=" in the event info will not be the
> > same as the CPU it executed on, right?  
> 
> Yes, this is a trace of a similar event after applying the patch:
> 
>           <idle>-0       [002] d.h4.   313.334367: ipi_send_cpu: cpu=1 callsite=irq_work_queue_on+0x55/0x90 callback=generic_smp_call_function_single_interrupt+0x0/0x20
>           <idle>-0       [002] d.h4.   313.334390: <stack trace>
>  => trace_event_raw_event_ipi_send_cpu
>  => __smp_call_single_queue
>  => irq_work_queue_on
>  => ring_buffer_unlock_commit
>  => trace_buffer_unlock_commit_regs
>  => trace_event_buffer_commit
>  => trace_event_raw_event_x86_irq_vector
>  => __sysvec_apic_timer_interrupt
>  => sysvec_apic_timer_interrupt
>  => asm_sysvec_apic_timer_interrupt
>  => pv_native_safe_halt
>  => default_idle
>  => default_idle_call
>  => do_idle
>  => cpu_startup_entry
>  => start_secondary
>  => common_startup_64  
> 
> The callback function in the trace event is different. That's because
> send_call_function_single_ipi() always reports this callback, regardless
> of what was actually queued. Maybe it can be improved, and I can look
> into it, but that's clearly a separate issue.

Erm. It's actually good I had a look. :-(

A helpful comment in irq_work_queue_on() explains that "arch remote IPI
send/receive backend aren't NMI safe". That's something I wasn't aware
of, and I'm afraid it's the end of the story. The comment is followed by a
WARN_ON_ONCE(in_nmi()), and I can easily trigger it with "perf top"
while nmi:nmi_handler is traced.
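
For the record, the relevant part of irq_work_queue_on(), paraphrased
from kernel/irq_work.c (a recent tree; details may differ):

	preempt_disable();
	if (cpu != smp_processor_id()) {
		/* Arch remote IPI send/receive backend aren't NMI safe */
		WARN_ON_ONCE(in_nmi());
		__smp_call_single_queue(cpu, &work->node.llist);
	} else {
		__irq_work_queue_local(work);
	}
	preempt_enable();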

Please, remove the patch again. I'm sorry.

Petr T
Re: [PATCH] ring-buffer: Use a housekeeping CPU to wake up waiters
Posted by Steven Rostedt 1 month ago
On Wed, 7 Jan 2026 10:51:37 +0100
Petr Tesarik <ptesarik@suse.com> wrote:

> Erm. It's actually good I had a look. :-(
> 
> A helpful comment in irq_work_queue_on() explains that "arch remote IPI
> send/receive backend aren't NMI safe". That's something I wasn't aware
> of, and I'm afraid it's the end of story. The comment is followed by a
> WARN_ON_ONCE(in_nmi()), and I can easily trigger it with "perf top"
> while nmi:nmi_handler is traced.
> 
> Please, remove the patch again. I'm sorry.

Or we simply change it to:

static inline void
rb_irq_work_queue(struct rb_irq_work *irq_work)
{
	int cpu;

	/* irq_work_queue_on() is not allowed in NMI context */
	if (in_nmi()) {
		irq_work_queue(&irq_work->work);
		return;
	}

	cpu = housekeeping_any_cpu(HK_TYPE_KERNEL_NOISE);
	/*
	 * If CPU isolation is not active, cpu is always the current
	 * CPU, and the following is equivalent to irq_work_queue().
	 */
	irq_work_queue_on(&irq_work->work, cpu);
}


-- Steve
Re: [PATCH] ring-buffer: Use a housekeeping CPU to wake up waiters
Posted by Steven Rostedt 1 month ago
On Wed, 7 Jan 2026 11:17:09 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> Or we simply change it to:
> 
> static inline void

Actually, the above should be noinline, as it's in a slower path, and
should not be adding logic into the cache of the fast path.

-- Steve


> rb_irq_work_queue(struct rb_irq_work *irq_work)
> {
> 	int cpu;
> 
> 	/* irq_work_queue_on() is not allowed in NMI context */
> 	if (in_nmi()) {
> 		irq_work_queue(&irq_work->work);
> 		return;
> 	}
> 
> 	cpu = housekeeping_any_cpu(HK_TYPE_KERNEL_NOISE);
> 	/*
> 	 * If CPU isolation is not active, cpu is always the current
> 	 * CPU, and the following is equivalent to irq_work_queue().
> 	 */
> 	irq_work_queue_on(&irq_work->work, cpu);
> }
Re: [PATCH] ring-buffer: Use a housekeeping CPU to wake up waiters
Posted by Petr Tesarik 1 month ago
On Wed, 7 Jan 2026 11:19:35 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 7 Jan 2026 11:17:09 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > Or we simply change it to:
> > 
> > static inline void  
> 
> Actually, the above should be noinline, as it's in a slower path, and
> should not be adding logic into the cache of the fast path.

However, to be honest, I'm surprised this is considered a slow path. My
use case is to record a few selected trace events with "trace-cmd
record", which spends most of its time polling trace_pipe_raw.
Consequently, there is almost always a pending waiter that requires a
wakeup.

In short, irq_work_queue() is the hot path for me.

OTOH I don't mind making it noinline, because on recent Intel and AMD
systems, a function call (noinline) is often cheaper than an increase
in L1 cache footprint (caused by inlining). But I'm confused. I have
always thought most people use tracing the same way as I do.

> > rb_irq_work_queue(struct rb_irq_work *irq_work)
> > {
> > 	int cpu;
> > 
> > 	/* irq_work_queue_on() is not allowed in NMI context */
> > 	if (in_nmi()) {
> > 		irq_work_queue(&irq_work->work);
> > 		return;
> > 	}

Thanks for the idea. There are some downsides. IIUC there is no
fundamental reason IPIs to other CPUs cannot be sent from NMI context.
It's just a limitation of the current Linux kernel code. As such, it
may be lifted in the future, and at that point nobody will remember to
remove this condition.

My current plan is to keep the patch on hold and have a look at why IPI
backends are not NMI-safe. In fact, I'm not even 100% sure the comment
is correct. The issue may have fixed itself e.g. by removing the last
affected architecture. ;-)

Petr T
Re: [PATCH] ring-buffer: Use a housekeeping CPU to wake up waiters
Posted by Steven Rostedt 4 weeks, 1 day ago
On Thu, 8 Jan 2026 09:39:32 +0100
Petr Tesarik <ptesarik@suse.com> wrote:

> > > Or we simply change it to:
> > > 
> > > static inline void    
> > 
> > Actually, the above should be noinline, as it's in a slower path, and
> > should not be adding logic into the cache of the fast path.  
> 
> However, to be honest, I'm surprised this is considered a slow path. My
> use case is to record a few selected trace events with "trace-cmd
> record", which spends most of its time polling trace_pipe_raw.
> Consequently, there is almost always a pending waiter that requires a
> wakeup.
> 
> In short, irq_work_queue() is the hot path for me.
> 
> OTOH I don't mind making it noinline, because on recent Intel and AMD
> systems, a function call (noinline) is often cheaper than an increase
> in L1 cache footprint (caused by inlining). But I'm confused. I have
> always thought most people use tracing the same way as I do.

The call to rb_wakeups() is a fast path, but the wakeup itself is a slow
path. This is the case even when you have user space in a loop that is just
waiting on data.

User space tool:

  ring_buffer_wait() {
    wait_event_interruptible(.., rb_wait_cond(..));
  }

Writer:

  rb_wakeups() {
    if (!full_hit())
      return;
  }

full_hit() is the watermark check. If you look at the tracefs
directory, you'll see a "buffer_percent" file, which defaults to 50.
That means that full_hit() will not return true until the ring buffer is
around 50 percent full. This function is called thousands of times before
the first wakeup happens.
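
The math itself boils down to something like this standalone sketch (not
the exact kernel code, which works on per-CPU sub-buffer counts):

	#include <stdbool.h>
	#include <stddef.h>

	/* wake readers only once "full" percent of the pages hold
	 * unread data; full == 0 means wake on any data */
	static bool watermark_hit(size_t dirty_pages, size_t nr_pages,
				  int full)
	{
		if (!nr_pages || !full)
			return true;
		return dirty_pages * 100 >= (size_t)full * nr_pages;
	}

With the default of 50, a writer can fill half the buffer before the
first wakeup is ever queued.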

Let's look at even a waiter that isn't using the buffer percent. This means
it will be woken up on any event in the buffer.

  rb_wakeups() {
	if (buffer->irq_work.waiters_pending) {
		buffer->irq_work.waiters_pending = false;
		/* irq_work_queue() supplies it's own memory barriers */
		irq_work_queue(&buffer->irq_work.work);


So it clears the waiters_pending flag and wakes up the waiter. Now the
waiter wakes up and starts reading the ring buffer. While the ring buffer
has content, it will continue to read and doesn't block again until the
ring buffer is empty. This means that thousands of events are being
recorded with no waiters to wake up.

See why this is a slow path?

-- Steve
Re: [PATCH] ring-buffer: Use a housekeeping CPU to wake up waiters
Posted by Petr Tesarik 4 weeks, 1 day ago
On Thu, 8 Jan 2026 11:58:00 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 8 Jan 2026 09:39:32 +0100
> Petr Tesarik <ptesarik@suse.com> wrote:
> 
> > > > Or we simply change it to:
> > > > 
> > > > static inline void      
> > > 
> > > Actually, the above should be noinline, as it's in a slower path, and
> > > should not be adding logic into the cache of the fast path.    
> > 
> > However, to be honest, I'm surprised this is considered a slow path. My
> > use case is to record a few selected trace events with "trace-cmd
> > record", which spends most of its time polling trace_pipe_raw.
> > Consequently, there is almost always a pending waiter that requires a
> > wakeup.
> > 
> > In short, irq_work_queue() is the hot path for me.
> > 
> > OTOH I don't mind making it noinline, because on recent Intel and AMD
> > systems, a function call (noinline) is often cheaper than an increase
> > in L1 cache footprint (caused by inlining). But I'm confused. I have
> > always thought most people use tracing the same way as I do.
> 
> The call to rb_wakeups() is a fast path, but the wakeup itself is a slow
> path. This is the case even when you have user space in a loop that is just
> waiting on data.
> 
> User space tool:
> 
>   ring_buffer_wait() {
>     wait_event_interruptible(.., rb_wait_cond(..));
>   }
> 
> Writer:
> 
>   rb_wakeups() {
>     if (!full_hit())
>       return;
>   }
> 
> full_hit() is the watermark check. If you look at the tracefs
> directory, you'll see a "buffer_percent" file, which defaults to 50.
> That means that full_hit() will not return true until the ring buffer is
> around 50 percent full. This function is called thousands of times before
> the first wakeup happens.
> 
> Let's look at even a waiter that isn't using the buffer percent. This means
> it will be woken up on any event in the buffer.
> 
>   rb_wakeups() {
> 	if (buffer->irq_work.waiters_pending) {
> 		buffer->irq_work.waiters_pending = false;
> 		/* irq_work_queue() supplies it's own memory barriers */
> 		irq_work_queue(&buffer->irq_work.work);
> 
> 
> So it clears the waiters_pending flag and wakes up the waiter. Now the
> waiter wakes up and starts reading the ring buffer. While the ring buffer
> has content, it will continue to read and doesn't block again until the
> ring buffer is empty. This means that thousands of events are being
> recorded with no waiters to wake up.
> 
> See why this is a slow path?

Thank you for the detailed explanation. So, yeah, most people use it
differently from me, generating trace events fast enough that the
reader does not consume the previous event before the next one arrives.

I have removed both "inline" and "noinline" in v2, leaving it at the
discretion of the compiler. If you believe it deserves a "noinline",
feel free to add it. FWIW on x86-64, I didn't observe any measurable
difference either in latency or instruction cache footprint.

Petr T
Re: [PATCH] ring-buffer: Use a housekeeping CPU to wake up waiters
Posted by Steven Rostedt 4 weeks ago
On Fri, 9 Jan 2026 09:57:56 +0100
Petr Tesarik <ptesarik@suse.com> wrote:

> I have removed both "inline" and "noinline" in v2, leaving it at the
> discretion of the compiler. If you believe it deserves a "noinline",
> feel free to add it. FWIW on x86-64, I didn't observe any measurable
> difference either in latency or instruction cache footprint.

Please add the noinline. I went through and added strategic "noinline" and
"__always_inline", as well as placing "likely()" and "unlikely()" and
dropped the cost of adding an event from just under 300ns down to less than
150ns.

This code is called during function tracing (hit at every function call),
and yes, every little bit helps!
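
Assuming v2 declares the helper as plain "static void", v3 would just
grow the annotation, i.e. something like:

-static void
+static noinline void
 rb_irq_work_queue(struct rb_irq_work *irq_work)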

-- Steve
Re: [PATCH] ring-buffer: Use a housekeeping CPU to wake up waiters
Posted by Petr Tesarik 4 weeks ago
On Fri, 9 Jan 2026 11:15:06 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 9 Jan 2026 09:57:56 +0100
> Petr Tesarik <ptesarik@suse.com> wrote:
> 
> > I have removed both "inline" and "noinline" in v2, leaving it at the
> > discretion of the compiler. If you believe it deserves a "noinline",
> > feel free to add it. FWIW on x86-64, I didn't observe any measurable
> > difference either in latency or instruction cache footprint.
> 
> Please add the noinline. I went through and added strategic "noinline" and
> "__always_inline", as well as placing "likely()" and "unlikely()" and
> dropped the cost of adding an event from just under 300ns down to less than
> 150ns.
> 
> This code is called during function tracing (hit at every function call),
> and yes, every little bit helps!

I'm going to wait a few days for further comments, then add it in v3.

Thanks for your time!

Petr T
Re: [PATCH] ring-buffer: Use a housekeeping CPU to wake up waiters
Posted by Petr Tesarik 1 month ago
On Thu, 8 Jan 2026 09:39:32 +0100
Petr Tesarik <ptesarik@suse.com> wrote:

> On Wed, 7 Jan 2026 11:19:35 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Wed, 7 Jan 2026 11:17:09 -0500
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> >   
> > > Or we simply change it to:
> > > 
> > > static inline void    
> > 
> > Actually, the above should be noinline, as it's in a slower path, and
> > should not be adding logic into the cache of the fast path.  
> 
> However, to be honest, I'm surprised this is considered a slow path. My
> use case is to record a few selected trace events with "trace-cmd
> record", which spends most of its time polling trace_pipe_raw.
> Consequently, there is almost always a pending waiter that requires a
> wakeup.
> 
> In short, irq_work_queue() is the hot path for me.
> 
> OTOH I don't mind making it noinline, because on recent Intel and AMD
> systems, a function call (noinline) is often cheaper than an increase
> in L1 cache footprint (caused by inlining). But I'm confused. I have
> always thought most people use tracing the same way as I do.
> 
> > > rb_irq_work_queue(struct rb_irq_work *irq_work)
> > > {
> > > 	int cpu;
> > > 
> > > 	/* irq_work_queue_on() is not allowed in NMI context */
> > > 	if (in_nmi()) {
> > > 		irq_work_queue(&irq_work->work);
> > > 		return;
> > > 	}  
> 
> Thanks for the idea. There are some downsides. IIUC there is no
> fundamental reason IPIs to other CPUs cannot be sent from NMI context.
> It's just a limitation of the current Linux kernel code. As such, it
> may be lifted in the future, and at that point nobody will remember to
> remove this condition.
> 
> My current plan is to keep the patch on hold and have a look at why IPI
> backends are not NMI-safe. In fact, I'm not even 100% sure the comment
> is correct. The issue may have fixed itself e.g. by removing the last
> affected architecture. ;-)

This turned out to be an interesting digression. Since we still support
old xAPIC (not x2APIC) systems, there is a reason in hardware. The xAPIC
ICR is programmed by writing to two 32-bit registers. If an NMI occurs
between those two writes, we'd have to restore the upper 32 bits of
the ICR. Alternatively, we could queue the request if an ICR write is in
progress and flush the queue after finishing the write to the ICR (out
of NMI context). The code could even be arch-independent...
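
Roughly what the xAPIC backend does (paraphrased from
arch/x86/kernel/apic/ipi.c; exact names from memory):

	/* the destination goes into the upper ICR half first... */
	native_apic_mem_write(APIC_ICR2, dest << 24);
	/* ...then writing the lower half sends the IPI; an NMI taken
	 * between the two writes could clobber APIC_ICR2 */
	native_apic_mem_write(APIC_ICR, dest_mode | vector);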

However, it's not worth the effort just for this one corner case.
Besides, it seems that other people have always been aware that
irq_work_queue_on() is NMI-unsafe, so in case the future brings a
better reason to make it NMI-safe, there's a fair chance that all
the extra code in rb_irq_work_queue() gets reviewed.

Petr T