[PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq

Posted by Joel Fernandes 1 month, 1 week ago (patch 1 of an 8-patch series)
From: Yao Kai <yaokai34@huawei.com>

Commit 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in
__rcu_read_unlock()") removed the recursion-protection code from
__rcu_read_unlock(). As a result, with ftrace enabled, we can end up in a
deadloop through raise_softirq_irqoff(), as follows:

WARNING: CPU: 0 PID: 0 at kernel/trace/trace.c:3021 __ftrace_trace_stack.constprop.0+0x172/0x180
Modules linked in: my_irq_work(O)
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.18.0-rc7-dirty #23 PREEMPT(full)
Tainted: [O]=OOT_MODULE
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:__ftrace_trace_stack.constprop.0+0x172/0x180
RSP: 0018:ffffc900000034a8 EFLAGS: 00010002
RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
RDX: 0000000000000003 RSI: ffffffff826d7b87 RDI: ffffffff826e9329
RBP: 0000000000090009 R08: 0000000000000005 R09: ffffffff82afbc4c
R10: 0000000000000008 R11: 0000000000011d7a R12: 0000000000000000
R13: ffff888003874100 R14: 0000000000000003 R15: ffff8880038c1054
FS:  0000000000000000(0000) GS:ffff8880fa8ea000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055b31fa7f540 CR3: 00000000078f4005 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <IRQ>
 trace_buffer_unlock_commit_regs+0x6d/0x220
 trace_event_buffer_commit+0x5c/0x260
 trace_event_raw_event_softirq+0x47/0x80
 raise_softirq_irqoff+0x6e/0xa0
 rcu_read_unlock_special+0xb1/0x160
 unwind_next_frame+0x203/0x9b0
 __unwind_start+0x15d/0x1c0
 arch_stack_walk+0x62/0xf0
 stack_trace_save+0x48/0x70
 __ftrace_trace_stack.constprop.0+0x144/0x180
 trace_buffer_unlock_commit_regs+0x6d/0x220
 trace_event_buffer_commit+0x5c/0x260
 trace_event_raw_event_softirq+0x47/0x80
 raise_softirq_irqoff+0x6e/0xa0
 rcu_read_unlock_special+0xb1/0x160
 unwind_next_frame+0x203/0x9b0
 __unwind_start+0x15d/0x1c0
 arch_stack_walk+0x62/0xf0
 stack_trace_save+0x48/0x70
 __ftrace_trace_stack.constprop.0+0x144/0x180
 trace_buffer_unlock_commit_regs+0x6d/0x220
 trace_event_buffer_commit+0x5c/0x260
 trace_event_raw_event_softirq+0x47/0x80
 raise_softirq_irqoff+0x6e/0xa0
 rcu_read_unlock_special+0xb1/0x160
 unwind_next_frame+0x203/0x9b0
 __unwind_start+0x15d/0x1c0
 arch_stack_walk+0x62/0xf0
 stack_trace_save+0x48/0x70
 __ftrace_trace_stack.constprop.0+0x144/0x180
 trace_buffer_unlock_commit_regs+0x6d/0x220
 trace_event_buffer_commit+0x5c/0x260
 trace_event_raw_event_softirq+0x47/0x80
 raise_softirq_irqoff+0x6e/0xa0
 rcu_read_unlock_special+0xb1/0x160
 __is_insn_slot_addr+0x54/0x70
 kernel_text_address+0x48/0xc0
 __kernel_text_address+0xd/0x40
 unwind_get_return_address+0x1e/0x40
 arch_stack_walk+0x9c/0xf0
 stack_trace_save+0x48/0x70
 __ftrace_trace_stack.constprop.0+0x144/0x180
 trace_buffer_unlock_commit_regs+0x6d/0x220
 trace_event_buffer_commit+0x5c/0x260
 trace_event_raw_event_softirq+0x47/0x80
 __raise_softirq_irqoff+0x61/0x80
 __flush_smp_call_function_queue+0x115/0x420
 __sysvec_call_function_single+0x17/0xb0
 sysvec_call_function_single+0x8c/0xc0
 </IRQ>

As the trace above shows, raise_softirq_irqoff() traces the softirq-raise
event, the resulting stack walk takes an RCU read-side critical section, and
its rcu_read_unlock() lands back in rcu_read_unlock_special(), which raises
the softirq yet again. Commit b41642c87716 ("rcu: Fix rcu_read_unlock()
deadloop due to IRQ work") fixed the corresponding infinite loop in
rcu_read_unlock_special() for the IRQ-work case by setting a flag before
calling irq_work_queue_on(). Fix the softirq case the same way: set the same
flag before calling raise_softirq_irqoff(), and rename the flag to
defer_qs_pending to reflect its more general role.

Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
Reported-by: Tengda Wu <wutengda2@huawei.com>
Signed-off-by: Yao Kai <yaokai34@huawei.com>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/rcu/tree.h        |  2 +-
 kernel/rcu/tree_plugin.h | 15 +++++++++------
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index b8bbe7960cda..2265b9c2906e 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -203,7 +203,7 @@ struct rcu_data {
 					/*  during and after the last grace */
 					/* period it is aware of. */
 	struct irq_work defer_qs_iw;	/* Obtain later scheduler attention. */
-	int defer_qs_iw_pending;	/* Scheduler attention pending? */
+	int defer_qs_pending;		/* irqwork or softirq pending? */
 	struct work_struct strict_work;	/* Schedule readers for strict GPs. */
 
 	/* 2) batch handling */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index dbe2d02be824..95ad967adcf3 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -487,8 +487,8 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
 	union rcu_special special;
 
 	rdp = this_cpu_ptr(&rcu_data);
-	if (rdp->defer_qs_iw_pending == DEFER_QS_PENDING)
-		rdp->defer_qs_iw_pending = DEFER_QS_IDLE;
+	if (rdp->defer_qs_pending == DEFER_QS_PENDING)
+		rdp->defer_qs_pending = DEFER_QS_IDLE;
 
 	/*
 	 * If RCU core is waiting for this CPU to exit its critical section,
@@ -645,7 +645,7 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
 	 * 5. Deferred QS reporting does not happen.
 	 */
 	if (rcu_preempt_depth() > 0)
-		WRITE_ONCE(rdp->defer_qs_iw_pending, DEFER_QS_IDLE);
+		WRITE_ONCE(rdp->defer_qs_pending, DEFER_QS_IDLE);
 }
 
 /*
@@ -747,7 +747,10 @@ static void rcu_read_unlock_special(struct task_struct *t)
 			// Using softirq, safe to awaken, and either the
 			// wakeup is free or there is either an expedited
 			// GP in flight or a potential need to deboost.
-			raise_softirq_irqoff(RCU_SOFTIRQ);
+			if (rdp->defer_qs_pending != DEFER_QS_PENDING) {
+				rdp->defer_qs_pending = DEFER_QS_PENDING;
+				raise_softirq_irqoff(RCU_SOFTIRQ);
+			}
 		} else {
 			// Enabling BH or preempt does reschedule, so...
 			// Also if no expediting and no possible deboosting,
@@ -755,11 +758,11 @@ static void rcu_read_unlock_special(struct task_struct *t)
 			// tick enabled.
 			set_need_resched_current();
 			if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
-			    needs_exp && rdp->defer_qs_iw_pending != DEFER_QS_PENDING &&
+			    needs_exp && rdp->defer_qs_pending != DEFER_QS_PENDING &&
 			    cpu_online(rdp->cpu)) {
 				// Get scheduler to re-evaluate and call hooks.
 				// If !IRQ_WORK, FQS scan will eventually IPI.
-				rdp->defer_qs_iw_pending = DEFER_QS_PENDING;
+				rdp->defer_qs_pending = DEFER_QS_PENDING;
 				irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
 			}
 		}
-- 
2.34.1
Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
Posted by Frederic Weisbecker 1 month ago
On Thu, Jan 01, 2026 at 11:34:10AM -0500, Joel Fernandes wrote:
> From: Yao Kai <yaokai34@huawei.com>
> 
> [... quoted patch and call trace snipped ...]

Looks good but, BTW, what happens if rcu_qs() is called
before rcu_preempt_deferred_qs() had a chance to be called?

current->rcu_read_unlock_special.b.need_qs is reset by rcu_qs()
so subsequent calls to rcu_read_unlock() won't issue rcu_read_unlock_special()
(unless the task is blocked). And further calls to rcu_preempt_deferred_qs()
through rcu_core() will be ignored as well.

But rdp->defer_qs_pending will remain in the DEFER_QS_PENDING state until
the next grace period. And if rcu_read_unlock_special() is called again
during the next GP at an unfortunate place needing a deferred qs, the state
machine will spuriously assume that either rcu_core or the irq_work is
pending, when neither is anymore.

The state should be reset by rcu_qs().
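
A minimal sketch of such a reset, reusing the names from the patch above;
the placement at the top of rcu_qs() and the exact form are assumptions,
not something spelled out in this thread:

	static void rcu_qs(void)
	{
		struct rcu_data *rdp = this_cpu_ptr(&rcu_data);

		/* Drop any stale deferred-QS-pending state for this CPU. */
		if (READ_ONCE(rdp->defer_qs_pending) == DEFER_QS_PENDING)
			WRITE_ONCE(rdp->defer_qs_pending, DEFER_QS_IDLE);

		/* ... existing quiescent-state bookkeeping, including the
		   clearing of ->rcu_read_unlock_special.b.need_qs ... */
	}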

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
Posted by Joel Fernandes 1 month ago

> On Jan 7, 2026, at 6:15 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> 
> On Thu, Jan 01, 2026 at 11:34:10AM -0500, Joel Fernandes wrote:
>> From: Yao Kai <yaokai34@huawei.com>
>> 
>> [... quoted patch and call trace snipped ...]
> 
> Looks good but, BTW, what happens if rcu_qs() is called
> before rcu_preempt_deferred_qs() had a chance to be called?

Could you provide an example of when that can happen? 

If rcu_qs() results in reporting of a quiescent state up the node tree before the deferred reporting work had a chance to act, then indeed we should be clearing the flag (after canceling the pending raise_softirq_irqoff()).

As far as I can see, even if that were to happen (and I think you are right that it can), we will still go through the path to report deferred quiescent states and cancel the pending work (reset the flag).

> current->rcu_read_unlock_special.b.need_qs is reset by rcu_qs()
> so subsequent calls to rcu_read_unlock() won't issue rcu_read_unlock_special()
> (unless the task is blocked). And further calls to rcu_preempt_deferred_qs()
> through rcu_core() will be ignored as well.

I am not sure this implies that the deferred quiescent state gets cancelled just because we have already called unlock once. We still have to go through the deferred quiescent state path on all subsequent quiescent state reporting, even if need_qs was reset. How else would the GP complete?
> 
> But rdp->defer_qs_pending will remain in the DEFER_QS_PENDING state until
> the next grace period. And if rcu_read_unlock_special() is called again
> during the next GP at an unfortunate place needing a deferred qs, the state
> machine will spuriously assume that either rcu_core or the irq_work is
> pending, when neither is anymore.
> 
> The state should be reset by rcu_qs().

In fact I would say that if a deferred QS is pending, we should absolutely not reset its state from rcu_qs().

Maybe we should reset it from rcu_report_qs_rdp/rnp?

Unfortunately, all of this is coming from me being on a phone and not at a computer, so I will revise my response, but probably tomorrow, because today the human body is not cooperating. 

thanks,

 - Joel


> 
> Thanks.
> 
> --
> Frederic Weisbecker
> SUSE Labs
> 
Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
Posted by Frederic Weisbecker 1 month ago
On Wed, Jan 07, 2026 at 08:02:43PM -0500, Joel Fernandes wrote:
> 
> > On Jan 7, 2026, at 6:15 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> > 
> > [... quoted patch and call trace snipped ...]
> > 
> > Looks good but, BTW, what happens if rcu_qs() is called
> > before rcu_preempt_deferred_qs() had a chance to be called?
> 
> Could you provide an example of when that can happen?

It can happen because rcu_qs() is called before rcu_preempt_deferred_qs()
in rcu_softirq_qs(). Inverting the calls could help but IRQs must be disabled
to ensure there is no read side between rcu_preempt_deferred_qs() and rcu_qs().
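
For reference, a simplified sketch of the ordering in question, written
from memory of mainline's rcu_softirq_qs() (the rcu_tasks_qs() call and
the lockdep assertion are elided):

	void rcu_softirq_qs(void)
	{
		rcu_qs();				/* resets ->rcu_read_unlock_special.b.need_qs */
		rcu_preempt_deferred_qs(current);	/* runs after the QS has already been noted */
	}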

I'm not aware of other ways to trigger that, except perhaps this:

https://lore.kernel.org/rcu/20251230004124.438070-1-joelagnelf@nvidia.com/T/#u

Either we fix those sites and make sure that rcu_preempt_deferred_qs() is always
called before rcu_qs() in the same IRQ disabled section (or there are other
fields set in ->rcu_read_unlock_special for later clearance). If we do that we
must WARN_ON_ONCE(rdp->defer_qs_pending == DEFER_QS_PENDING) in rcu_qs().

Or we reset rdp->defer_qs_pending from rcu_qs(), which sounds more robust.

Ah, an alternative is to make rdp::defer_qs_pending a field in union rcu_special,
which, sadly, would need to be expanded to a u64.
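
A rough sketch of that alternative; the existing fields match today's
union rcu_special, while the new field and the padding are hypothetical:

	union rcu_special {
		struct {
			u8 blocked;
			u8 need_qs;
			u8 exp_hint;
			u8 need_mb;
			u8 defer_qs_pending;	/* hypothetical new field */
			u8 pad[3];
		} b;
		u64 s;				/* expanded from u32 */
	};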

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
Posted by Joel Fernandes 1 month ago
Hi Frederic,

On 1/8/2026 10:25 AM, Frederic Weisbecker wrote:
> On Wed, Jan 07, 2026 at 08:02:43PM -0500, Joel Fernandes wrote:
>>
>>> On Jan 7, 2026, at 6:15 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
>>>
>>> [... quoted patch and call trace snipped ...]
>>>
>>> Looks good but, BTW, what happens if rcu_qs() is called
>>> before rcu_preempt_deferred_qs() had a chance to be called?
>>
>> Could you provide an example of when that can happen?
> 
> It can happen because rcu_qs() is called before rcu_preempt_deferred_qs()
> in rcu_softirq_qs(). Inverting the calls could help but IRQs must be disabled
> to ensure there is no read side between rcu_preempt_deferred_qs() and rcu_qs().

Ah, the rcu_softirq_qs() path. Indeed, I see what you're saying now. Not sure
how to trigger it, but yeah, good catch: it would delay the reset of the flag.

> I'm not aware of other ways to trigger that, except perhaps this:
> 
> https://lore.kernel.org/rcu/20251230004124.438070-1-joelagnelf@nvidia.com/T/#u
> 
> Either we fix those sites and make sure that rcu_preempt_deferred_qs() is always
> called before rcu_qs() in the same IRQ disabled section (or there are other
> fields set in ->rcu_read_unlock_special for later clearance). If we do that we
> must WARN_ON_ONCE(rdp->defer_qs_pending == DEFER_QS_PENDING) in rcu_qs().
> 
> Or we reset rdp->defer_qs_pending from rcu_qs(), which sounds more robust.

If we did that, can the following not happen? I believe I tried that and it
did not fix the IRQ work recursion. Suppose you have a timer interrupt and an
IRQ that triggers BPF on exit. Both are pending on the CPU's IRQ controller.

First the non-timer interrupt does this:

irq_exit()
  __irq_exit_rcu()
    /* in_hardirq() returns false after this */
    preempt_count_sub(HARDIRQ_OFFSET)
    tick_irq_exit()
      tick_nohz_irq_exit()
	    tick_nohz_stop_sched_tick()
	      trace_tick_stop()  /* a bpf prog is hooked on this trace point */
		   __bpf_trace_tick_stop()
		      bpf_trace_run2()
			    rcu_read_unlock_special()
                              /* will send a IPI to itself */
			      irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);

<timer interrupt runs>

The timer interrupt runs, and does the clean up that the IRQ work was supposed
to do.

<IPI now runs for the IRQ work>
  ->irq_exit()
   ... recursion since IRQ work issued again.

Maybe it is unlikely to happen, but it feels a bit fragile still.  All it
takes is one call to rcu_qs() after the IRQ work was queued and before it
ran, coupled with an RCU reader that somehow always enters the slow-path.

> Ah, an alternative is to make rdp::defer_qs_pending a field in union rcu_special,
> which, sadly, would need to be expanded to a u64.

I was thinking maybe the most robust approach is something like the following.
We _have_ to go through the node tree to report the QS; once we "defer the QS",
there is no other way out, and that path is guaranteed to be taken in order to
end the GP. So just unconditionally clear the flag there and in all such
places, something like the following, which passes light rcutorture on all
scenarios.

Once we issue an IRQ work or raise a softirq, we don't need to do that again
for the same CPU until the GP ends.

(EDIT: actually rcu_disable_urgency_upon_qs() or its callsites might just be
the place, since it is present at (almost?) all call sites where we are about
to report up the node tree.)

Thoughts? I need to double check whether there is any possibility of requiring
IRQ work more than once during the same GP on the same CPU. I don't think so,
though.

---8<-----------------------

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index b7c818cabe44..81c3af5d1f67 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -729,6 +729,12 @@ static void rcu_disable_urgency_upon_qs(struct rcu_data *rdp)
 	}
 }
 
+static void rcu_defer_qs_clear_pending(struct rcu_data *rdp)
+{
+	if (READ_ONCE(rdp->defer_qs_pending) == DEFER_QS_PENDING)
+		WRITE_ONCE(rdp->defer_qs_pending, DEFER_QS_IDLE);
+}
+
 /**
  * rcu_is_watching - RCU read-side critical sections permitted on current CPU?
  *
@@ -2483,6 +2490,8 @@ rcu_report_qs_rdp(struct rcu_data *rdp)
 		}
 
 		rcu_disable_urgency_upon_qs(rdp);
+		rcu_defer_qs_clear_pending(rdp);
+
 		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
 		/* ^^^ Released rnp->lock */
 	}
@@ -2767,6 +2776,12 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
 			if (ret > 0) {
 				mask |= rdp->grpmask;
 				rcu_disable_urgency_upon_qs(rdp);
+				/*
+				 * Clear any stale defer_qs_pending for idle/offline
+				 * CPUs reporting QS. This can happen if a CPU went
+				 * idle after raising softirq but before it ran.
+				 */
+				rcu_defer_qs_clear_pending(rdp);
 			}
 			if (ret < 0)
 				rsmask |= rdp->grpmask;
@@ -4373,6 +4388,7 @@ void rcutree_report_cpu_starting(unsigned int cpu)
 
 		local_irq_save(flags);
 		rcu_disable_urgency_upon_qs(rdp);
+		rcu_defer_qs_clear_pending(rdp);
 		/* Report QS -after- changing ->qsmaskinitnext! */
 		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
 	} else {
@@ -4432,6 +4448,7 @@ void rcutree_report_cpu_dead(void)
 	if (rnp->qsmask & mask) { /* RCU waiting on outgoing CPU? */
 		/* Report quiescent state -before- changing ->qsmaskinitnext! */
 		rcu_disable_urgency_upon_qs(rdp);
+		rcu_defer_qs_clear_pending(rdp);
 		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 	}
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 96c49c56fc14..7f2af0e45883 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -272,6 +272,10 @@ static void rcu_report_exp_rdp(struct rcu_data *rdp)
 	raw_spin_lock_irqsave_rcu_node(rnp, flags);
 	WRITE_ONCE(rdp->cpu_no_qs.b.exp, false);
 	ASSERT_EXCLUSIVE_WRITER(rdp->cpu_no_qs.b.exp);
+
+	/* Expedited QS reported. TODO: what happens if we deferred both exp and normal QS (and vice versa for the other callsites)? */
+	rcu_defer_qs_clear_pending(rdp);
+
 	rcu_report_exp_cpu_mult(rnp, flags, rdp->grpmask, true);
 }
 
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 6c86c7b96c63..d706daea021f 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -487,8 +487,6 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
 	union rcu_special special;
 
 	rdp = this_cpu_ptr(&rcu_data);
-	if (rdp->defer_qs_pending == DEFER_QS_PENDING)
-		rdp->defer_qs_pending = DEFER_QS_IDLE;
 
 	/*
 	 * If RCU core is waiting for this CPU to exit its critical section,
Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
Posted by Frederic Weisbecker 1 month ago
On Thu, Jan 08, 2026 at 08:12:56PM -0500, Joel Fernandes wrote:
> Hi Frederic,
> 
> On 1/8/2026 10:25 AM, Frederic Weisbecker wrote:
> > On Wed, Jan 07, 2026 at 08:02:43PM -0500, Joel Fernandes wrote:
> >>
> >>> On Jan 7, 2026, at 6:15 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> >>>
> >>> [... quoted patch and call trace snipped ...]
> >>>
> >>> Looks good but, BTW, what happens if rcu_qs() is called
> >>> before rcu_preempt_deferred_qs() had a chance to be called?
> >>
> >> Could you provide an example of when that can happen?
> > 
> > It can happen because rcu_qs() is called before rcu_preempt_deferred_qs()
> > in rcu_softirq_qs(). Inverting the calls could help but IRQs must be disabled
> > to ensure there is no read side between rcu_preempt_deferred_qs() and rcu_qs().
> 
> Ah, the rcu_softirq_qs() path. Indeed, I see what you're saying now. Not sure
> how to trigger it, but yeah, good catch: it would delay the reset of the flag.
> 
> > I'm not aware of other ways to trigger that, except perhaps this:
> > 
> > https://lore.kernel.org/rcu/20251230004124.438070-1-joelagnelf@nvidia.com/T/#u
> > 
> > Either we fix those sites and make sure that rcu_preempt_deferred_qs() is always
> > called before rcu_qs() in the same IRQ disabled section (or there are other
> > fields set in ->rcu_read_unlock_special for later clearance). If we do that we
> > must WARN_ON_ONCE(rdp->defer_qs_pending == DEFER_QS_PENDING) in rcu_qs().
> > 
> > Or we reset rdp->defer_qs_pending from rcu_qs(), which sounds more robust.
> 
> If we did that, can the following not happen? I believe I tried that and it
> did not fix the IRQ work recursion. Suppose you have a timer interrupt and an
> IRQ that triggers BPF on exit. Both are pending on the CPU's IRQ controller.
> 
> First the non-timer interrupt does this:
> 
> irq_exit()
>   __irq_exit_rcu()
>     /* in_hardirq() returns false after this */
>     preempt_count_sub(HARDIRQ_OFFSET)
>     tick_irq_exit()
>       tick_nohz_irq_exit()
> 	    tick_nohz_stop_sched_tick()
> 	      trace_tick_stop()  /* a bpf prog is hooked on this trace point */
> 		   __bpf_trace_tick_stop()
> 		      bpf_trace_run2()
> 			    rcu_read_unlock_special()
>                               /* will send a IPI to itself */
> 			      irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
> 
> <timer interrupt runs>
> 
> The timer interrupt runs, and does the clean up that the IRQ work was supposed
> to do.
> 
> <IPI now runs for the IRQ work>
>   ->irq_exit()
>    ... recursion since IRQ work issued again.

If defer_qs_pending is only cleared when rcu_qs() or rcu_report_exp_rdp()
runs, I don't think it can happen, but I could be missing something...

> 
> Maybe it is unlikely to happen, but it feels a bit fragile still.  All it
> takes is one call to rcu_qs() after the IRQ work was queued and before it
> ran, coupled with an RCU reader that somehow always enters the slow-path.

But if rcu_qs() or rcu_report_exp_rdp() has been called there is no more need
to enter rcu_read_unlock_special(), right? Unless the task is still blocked
but I'm not sure it could recurse...

> 
> [... quoted proposal and patch snipped ...]

Looks like a possible direction.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
Posted by Joel Fernandes 1 month ago

> On Jan 7, 2026, at 8:02 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> 
> 
>> On Jan 7, 2026, at 6:15 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
>> 
>> On Thu, Jan 01, 2026 at 11:34:10AM -0500, Joel Fernandes wrote:
>>> From: Yao Kai <yaokai34@huawei.com>
>>> 
>>> [... quoted patch and call trace snipped ...]
>> 
>> Looks good but, BTW, what happens if rcu_qs() is called
>> before rcu_preempt_deferred_qs() had a chance to be called?
> 
> Could you provide an example of when that can happen?
> 
> If rcu_qs() results in reporting of a quiescent state up the node tree before the deferred reporting work had a chance to act, then indeed we should be clearing the flag (after canceling the pending raise_softirq_irqoff()).
> 
> As far as I can see, even if that were to happen (and I think you are right that it can), we will still go through the path to report deferred quiescent states and cancel the pending work (reset the flag).
> 
>> current->rcu_read_unlock_special.b.need_qs is reset by rcu_qs()
>> so subsequent calls to rcu_read_unlock() won't issue rcu_read_unlock_special()
>> (unless the task is blocked). And further calls to rcu_preempt_deferred_qs()
>> through rcu_core() will be ignored as well.
> 
> I am not sure this implies that the deferred quiescent state gets cancelled just because we have already called unlock once. We still have to go through the deferred quiescent state path on all subsequent quiescent state reporting, even if need_qs was reset. How else would the GP complete?
>> 
>> But rdp->defer_qs_pending will remain in the DEFER_QS_PENDING state until
>> the next grace period. And if rcu_read_unlock_special() is called again
>> during the next GP at an unfortunate place needing a deferred qs, the state
>> machine will spuriously assume that either rcu_core or the irq_work is
>> pending, when neither is anymore.
>> 
>> The state should be reset by rcu_qs().
> 
> In fact I would say that if a deferred QS is pending, we should absolutely not reset its state from rcu_qs().
> 
> Maybe we should reset it from rcu_report_qs_rdp/rnp?
> 
> Unfortunately, all of this is coming from me being on a phone and not at a computer, so I will revise my response, but probably tomorrow, because today the human body is not cooperating.
> 
> thanks,
> 
> - Joel


By the way, when I last tried to do it from rcu_qs, it was not fixing the original bug with the IRQ work recursion. 

I found that it was always resetting the flag. But probably it is not even the right place to do it in the first place. 

Thanks,

 - Joel

> 
> - Joel
> 
> 
>> 
>> Thanks.
>> 
>> --
>> Frederic Weisbecker
>> SUSE Labs
>> 
Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
Posted by Joel Fernandes 1 month ago

> On Jan 7, 2026, at 8:35 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> 
> 
>> On Jan 7, 2026, at 8:02 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
>> 
>> 
>> 
>>>> On Jan 7, 2026, at 6:15 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
>>> 
>>> On Thu, Jan 01, 2026 at 11:34:10AM -0500, Joel Fernandes wrote:
>>>> From: Yao Kai <yaokai34@huawei.com>
>>>> 
>>>> Commit 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in
>>>> __rcu_read_unlock()") removes the recursion-protection code from
>>>> __rcu_read_unlock(). Therefore, we could invoke the deadloop in
>>>> raise_softirq_irqoff() with ftrace enabled as follows:
>>>> 
>>>> WARNING: CPU: 0 PID: 0 at kernel/trace/trace.c:3021 __ftrace_trace_stack.constprop.0+0x172/0x180
>>>> [... register state and recursive call trace trimmed; identical to the splat quoted in full earlier in the thread ...]
>>>> 
>>>> Commit b41642c87716 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
>>>> fixed the infinite loop in rcu_read_unlock_special() for IRQ work by
>>>> setting a flag before calling irq_work_queue_on(). We fix this issue by
>>>> setting the same flag before calling raise_softirq_irqoff() and renaming the
>>>> flag to defer_qs_pending to make it more generic.
>>>> 
>>>> Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
>>>> Reported-by: Tengda Wu <wutengda2@huawei.com>
>>>> Signed-off-by: Yao Kai <yaokai34@huawei.com>
>>>> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
>>>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
>>> 
>>> Looks good but, BTW, what happens if rcu_qs() is called
>>> before rcu_preempt_deferred_qs() had a chance to be called?
>> 
>> Could you provide an example of when that can happen?
>> 
>> If rcu_qs() results in reporting of a quiescent state up the node tree before the deferred reporting work had a chance to act, then indeed we should be clearing the flag (after canceling the pending raise_softirq_irqoff()).
>> 
>> As far as I can see, even if that were to happen (and I think you are right that it can), we will still go through the path to report deferred quiescent states and cancel the pending work (reset the flag).
>> 
>>> current->rcu_read_unlock_special.b.need_qs is reset by rcu_qs()
>>> so subsequent calls to rcu_read_unlock() won't issue rcu_read_unlock_special()
>>> (unless the task is blocked). And further calls to rcu_preempt_deferred_qs()
>>> through rcu_core() will be ignored as well.
>> 
>> I am not sure this implies that the deferred quiescent state gets cancelled because we have already called unlock once. We have to go through the deferred quiescent state path on all subsequent quiescent state reporting, even if need_qs is reset. How else would the GP complete?
>>> 
>>> But rdp->defer_qs_pending will remain in the DEFER_QS_PENDING state until
>>> the next grace period. And if rcu_read_unlock_special() is called again
>>> during the next GP at an unfortunate place needing a deferred qs, the state machine
>>> will spuriously assume that either rcu_core or the irq_work is pending, when
>>> neither is anymore.
>>> 
>>> The state should be reset by rcu_qs().
>> 
>> In fact I would say that if a deferred QS is pending, we should absolutely not reset its state from rcu_qs().
>> 
>> Maybe we should reset it from rcu_report_qs_rdp/rnp?
>> 
>> Unfortunately, all of this is coming from me being on a phone and not at a computer, so I will revise my response, but probably tomorrow, because today the human body is not cooperating.
>> 
>> thanks,
>> 
>> - Joel
>> 
> 
> 
> By the way, when I last tried to do it from rcu_qs(), it did not fix the original bug with the IRQ work recursion.
> 
> I found that it was always resetting the flag. But it is probably not the right place to do it in the first place.

I think we need to reset the flag in rcu_report_exp_rdp() as well if exp_hint is set and we reported exp qs.
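
Roughly like this (an untested sketch of the idea only: the surrounding lines are from memory of kernel/rcu/tree_exp.h and may not match the tree exactly, the exp_hint check is elided, and DEFER_QS_IDLE is assumed to be the counterpart of the DEFER_QS_PENDING state discussed above):

--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ static void rcu_report_exp_rdp(struct rcu_data *rdp)
 {
 	WRITE_ONCE(rdp->cpu_no_qs.b.exp, false);
+	/* The deferred-QS request has been satisfied, so let later
+	 * rcu_read_unlock()s skip the slowpath. */
+	WRITE_ONCE(rdp->defer_qs_pending, DEFER_QS_IDLE);
 	rcu_report_exp_cpu_mult(rdp->mynode, rdp->grpmask, true);
 }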

 I am working on a series to cover all cases and will send an RFC soon. However, the patch we are
reviewing can go in for this merge window, and the rest I am preparing (further improvements) can go in the next merge window, if that sounds good.

Thanks!

 - Joel


> 
> Thanks,
> 
> - Joel
>> 
>>> 
>>> Thanks.
>>> 
>>> --
>>> Frederic Weisbecker
>>> SUSE Labs
>>> 
Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
Posted by Frederic Weisbecker 1 month ago
On Wed, Jan 07, 2026 at 10:35:44PM -0500, Joel Fernandes wrote:
> > 
> > By the way, when I last tried to do it from rcu_qs(), it did not fix the original bug with the IRQ work recursion.
> > 
> > I found that it was always resetting the flag. But it is probably not the right place to do it in the first place.
> 
> I think we need to reset the flag in rcu_report_exp_rdp() as well if exp_hint
> is set and we reported exp qs.

To avoid needlessly reaching the rcu_read_unlock() slowpath whenever the exp QS has
already been reported, yes indeed.

>  I am working on a series to cover all cases and will send an RFC soon. However, the patch we are
> reviewing can go in for this merge window, and the rest I am preparing (further
> improvements) can go in the next merge window, if that sounds good.

Ok.

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
Posted by Mathieu Desnoyers 1 month ago
On 2026-01-08 10:39, Frederic Weisbecker wrote:
> On Wed, Jan 07, 2026 at 10:35:44PM -0500, Joel Fernandes wrote:
>>>
>>> By the way, when I last tried to do it from rcu_qs(), it did not fix the original bug with the IRQ work recursion.
>>>
>>> I found that it was always resetting the flag. But it is probably not the right place to do it in the first place.
>>
>> I think we need to reset the flag in rcu_report_exp_rdp() as well if exp_hint
>> is set and we reported exp qs.
> 
> To avoid needlessly reaching the rcu_read_unlock() slowpath whenever the exp QS has
> already been reported, yes indeed.

This seems related to:

https://lore.kernel.org/lkml/6c96dbb5-bffc-423f-bb6a-3072abb5f711@efficios.com/

Is it the same issue?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
Posted by Steven Rostedt 1 month, 1 week ago
On Thu,  1 Jan 2026 11:34:10 -0500
Joel Fernandes <joelagnelf@nvidia.com> wrote:

>  trace_buffer_unlock_commit_regs+0x6d/0x220
>  trace_event_buffer_commit+0x5c/0x260
>  trace_event_raw_event_softirq+0x47/0x80
>  raise_softirq_irqoff+0x6e/0xa0
>  rcu_read_unlock_special+0xb1/0x160
>  unwind_next_frame+0x203/0x9b0
>  __unwind_start+0x15d/0x1c0
>  arch_stack_walk+0x62/0xf0
>  stack_trace_save+0x48/0x70
>  __ftrace_trace_stack.constprop.0+0x144/0x180
>  trace_buffer_unlock_commit_regs+0x6d/0x220
>  trace_event_buffer_commit+0x5c/0x260
>  trace_event_raw_event_softirq+0x47/0x80
>  raise_softirq_irqoff+0x6e/0xa0
>  rcu_read_unlock_special+0xb1/0x160
>  unwind_next_frame+0x203/0x9b0
>  __unwind_start+0x15d/0x1c0
>  arch_stack_walk+0x62/0xf0
>  stack_trace_save+0x48/0x70
>  __ftrace_trace_stack.constprop.0+0x144/0x180

Stacktrace should have recursion protection too.

Can you try this patch to see if it would have fixed the problem too?

-- Steve

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index ae04054a1be3..e6ca052b2a85 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -34,6 +34,13 @@ enum {
 	TRACE_INTERNAL_SIRQ_BIT,
 	TRACE_INTERNAL_TRANSITION_BIT,
 
+	/* Internal event use recursion bits */
+	TRACE_INTERNAL_EVENT_BIT,
+	TRACE_INTERNAL_EVENT_NMI_BIT,
+	TRACE_INTERNAL_EVENT_IRQ_BIT,
+	TRACE_INTERNAL_EVENT_SIRQ_BIT,
+	TRACE_INTERNAL_EVENT_TRANSITION_BIT,
+
 	TRACE_BRANCH_BIT,
 /*
  * Abuse of the trace_recursion.
@@ -58,6 +65,8 @@ enum {
 
 #define TRACE_LIST_START	TRACE_INTERNAL_BIT
 
+#define TRACE_EVENT_START	TRACE_INTERNAL_EVENT_BIT
+
 #define TRACE_CONTEXT_MASK	((1 << (TRACE_LIST_START + TRACE_CONTEXT_BITS)) - 1)
 
 /*
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2d387d56dcd4..e145d1c7f604 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3013,6 +3013,11 @@ static void __ftrace_trace_stack(struct trace_array *tr,
 	struct ftrace_stack *fstack;
 	struct stack_entry *entry;
 	int stackidx;
+	int bit;
+
+	bit = trace_test_and_set_recursion(_THIS_IP_, _RET_IP_, TRACE_EVENT_START);
+	if (bit < 0)
+		return;
 
 	/*
 	 * Add one, for this function and the call to save_stack_trace()
@@ -3081,6 +3086,7 @@ static void __ftrace_trace_stack(struct trace_array *tr,
 	/* Again, don't let gcc optimize things here */
 	barrier();
 	__this_cpu_dec(ftrace_stack_reserve);
+	trace_clear_recursion(bit);
 }
 
 static inline void ftrace_trace_stack(struct trace_array *tr,
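
For readers unfamiliar with the trace recursion counters, here is a stand-alone
user-space toy of the test-and-set/clear pattern the patch applies to
__ftrace_trace_stack(). All names below are made up for illustration, and the
context is passed in by hand, whereas the kernel derives it from
preempt_count():

#include <stdio.h>

enum { CTX_NORMAL, CTX_SOFTIRQ, CTX_IRQ, CTX_NMI };

static unsigned long recursion;		/* lives in task_struct in the kernel */

static int test_and_set_recursion(int start, int ctx)
{
	int bit = start + ctx;

	if (recursion & (1UL << bit))
		return -1;		/* already inside in this context: bail */
	recursion |= 1UL << bit;
	return bit;
}

static void clear_recursion(int bit)
{
	recursion &= ~(1UL << bit);
}

#define EVENT_START	0		/* stands in for TRACE_INTERNAL_EVENT_BIT */

static void trace_stack_model(int ctx, int depth)
{
	int bit = test_and_set_recursion(EVENT_START, ctx);

	if (bit < 0) {
		printf("recursion stopped at depth %d\n", depth);
		return;
	}
	/* Pretend the stack walk fires the tracepoint again, as in the splat. */
	trace_stack_model(ctx, depth + 1);
	clear_recursion(bit);
}

int main(void)
{
	trace_stack_model(CTX_SOFTIRQ, 0);	/* stops at depth 1, no deadloop */
	return 0;
}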
Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
Posted by Steven Rostedt 1 month, 1 week ago
On Fri, 2 Jan 2026 12:28:07 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> Stacktrace should have recursion protection too.
> 
> Can you try this patch to see if it would have fixed the problem too?

I believe the recursion protection should be in the tracing
infrastructure more than in RCU, as RCU is an active participant in
the kernel whereas tracing is supposed to be only an observer.

If tracing is the culprit, it should be the one that is fixed.

Thanks,

-- Steve
Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
Posted by Paul E. McKenney 1 month, 1 week ago
On Fri, Jan 02, 2026 at 12:30:09PM -0500, Steven Rostedt wrote:
> On Fri, 2 Jan 2026 12:28:07 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > Stacktrace should have recursion protection too.
> > 
> > Can you try this patch to see if it would have fixed the problem too?
> 
> I believe the recursion protection should be in the tracing
> infrastructure more than in RCU, as RCU is an active participant in
> the kernel whereas tracing is supposed to be only an observer.
> 
> If tracing is the culprit, it should be the one that is fixed.

Makes sense to me!  But then it would...  ;-)

							Thanx, Paul
Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
Posted by Joel Fernandes 1 month ago

On 1/2/2026 2:51 PM, Paul E. McKenney wrote:
> On Fri, Jan 02, 2026 at 12:30:09PM -0500, Steven Rostedt wrote:
>> On Fri, 2 Jan 2026 12:28:07 -0500
>> Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>>> Stacktrace should have recursion protection too.
>>>
>>> Can you try this patch to see if it would have fixed the problem too?
>>
>> I believe the recursion protection should be in the tracing
>> infrastructure more than in RCU, as RCU is an active participant in
>> the kernel whereas tracing is supposed to be only an observer.
>>
>> If tracing is the culprit, it should be the one that is fixed.
> 
> Makes sense to me!  But then it would...  ;-)
> 
Could we fix it in both (RCU and tracing)? The patch just adds 3 net new lines
to the RCU code. It'd be good to have a guard rail against softirq recursion in the RCU
read unlock path, just like the existing guard rail we already have with
irq_work. After all, both paths attempt to do deferred work when it is safer to
do so.

Yao, could you test Steve's patch and reply whether it fixes the issue too?

thanks,

 - Joel
Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
Posted by Boqun Feng 1 month ago
On Fri, Jan 02, 2026 at 07:41:38PM -0500, Joel Fernandes wrote:
> 
> 
> On 1/2/2026 2:51 PM, Paul E. McKenney wrote:
> > On Fri, Jan 02, 2026 at 12:30:09PM -0500, Steven Rostedt wrote:
> >> On Fri, 2 Jan 2026 12:28:07 -0500
> >> Steven Rostedt <rostedt@goodmis.org> wrote:
> >>
> >>> Stacktrace should have recursion protection too.
> >>>
> >>> Can you try this patch to see if it would have fixed the problem too?
> >>
> >> I believe the recursion protection should be in the tracing
> >> infrastructure more than in RCU, as RCU is an active participant in
> >> the kernel whereas tracing is supposed to be only an observer.
> >>
> >> If tracing is the culprit, it should be the one that is fixed.
> > 
> > Makes sense to me!  But then it would...  ;-)
> > 
> Could we fix it in both (RCU and tracing)? The patch just adds 3 net new lines
> to the RCU code. It'd be good to have a guard rail against softirq recursion in the RCU
> read unlock path, just like the existing guard rail we already have with
> irq_work. After all, both paths attempt to do deferred work when it is safer to
> do so.
> 

Agreed. First, it's crucial that RCU itself can prevent indefinite
re-entry into rcu_read_unlock_special(), because although unlikely, any RCU
reader in raise_softirq_irqoff() would cause a similar infinite loop.
Second, with only the tracing fix, there still exists a call chain:

	rcu_read_unlock_special():
	  raise_softirq_irqoff():
	    trace_softirq_raise():
	      rcu_read_unlock_special():
	        raise_softirq_irqoff():
		  trace_softirq_raise(); // <- recursion ends here

while with the RCU fix added, the call chain ends at the second
rcu_read_unlock_special():

	rcu_read_unlock_special():
	  raise_softirq_irqoff():
	    trace_softirq_raise():
	      rcu_read_unlock_special(); // <- recursion ends here

which would slightly improve performance because of fewer calls.
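
To make the difference concrete, here is a minimal user-space sketch of the
guard (all names are illustrative stand-ins, not kernel code): because the
flag is set *before* raise_softirq_irqoff(), the re-entry from the tracepoint
sees it and bails out at the second level:

#include <stdbool.h>
#include <stdio.h>

static bool defer_qs_pending;	/* stand-in for rdp->defer_qs_pending */
static int depth;

static void rcu_read_unlock_special_model(void);

/* Stands in for trace_softirq_raise(): the tracepoint itself may end an
 * RCU read-side critical section and re-enter the slowpath. */
static void trace_softirq_raise_model(void)
{
	rcu_read_unlock_special_model();
}

static void raise_softirq_irqoff_model(void)
{
	trace_softirq_raise_model();
}

static void rcu_read_unlock_special_model(void)
{
	depth++;
	printf("rcu_read_unlock_special(), depth %d\n", depth);
	if (!defer_qs_pending) {
		defer_qs_pending = true;	/* guard set before raising */
		raise_softirq_irqoff_model();
	}
	depth--;
}

int main(void)
{
	rcu_read_unlock_special_model();	/* prints depths 1 and 2, then stops */
	return 0;
}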

I'm going to include this in the RCU PR for 7.0 if no one objects.
Thanks!

Regards,
Boqun

> Yao, could you test Steve's patch and reply whether it fixes the issue too?
> 
> thanks,
> 
>  - Joel
> 
Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
Posted by Yao Kai 1 month ago

On 1/3/2026 8:41 AM, Joel Fernandes wrote:
> 
> 
> On 1/2/2026 2:51 PM, Paul E. McKenney wrote:
>> On Fri, Jan 02, 2026 at 12:30:09PM -0500, Steven Rostedt wrote:
>>> On Fri, 2 Jan 2026 12:28:07 -0500
>>> Steven Rostedt <rostedt@goodmis.org> wrote:
>>>
>>>> Stacktrace should have recursion protection too.
>>>>
>>>> Can you try this patch to see if it would have fixed the problem too?
>>>
>>> I believe the recursion protection should be in the tracing
>>> infrastructure more than in RCU, as RCU is an active participant in
>>> the kernel whereas tracing is supposed to be only an observer.
>>>
>>> If tracing is the culprit, it should be the one that is fixed.
>>
>> Makes sense to me!  But then it would...  ;-)
>>
> Could we fix it in both (RCU and tracing)? The patch just adds 3 net new lines
> to the RCU code. It'd be good to have a guard rail against softirq recursion in the RCU
> read unlock path, just like the existing guard rail we already have with
> irq_work. After all, both paths attempt to do deferred work when it is safer to
> do so.
> 
> Yao, could you test Steve's patch and reply whether it fixes the issue too?
> 
> thanks,
> 
>   - Joel

Yes, I tested Steve's patch. It fixes the issue too.

Tested-by: Yao Kai <yaokai34@huawei.com>

  - Yao
Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
Posted by Steven Rostedt 1 month ago
On Sun, 4 Jan 2026 11:20:07 +0800
Yao Kai <yaokai34@huawei.com> wrote:

> Yes, I tested Steve's patch. It fixes the issue too.
> 
> Tested-by: Yao Kai <yaokai34@huawei.com>

Thanks for testing. I'll send out a formal patch.

And yes, I agree we should do both.

-- Steve
Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
Posted by Steven Rostedt 1 month ago
On Mon, 5 Jan 2026 12:16:11 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Sun, 4 Jan 2026 11:20:07 +0800
> Yao Kai <yaokai34@huawei.com> wrote:
> 
> > Yes, I tested Steve's patch. It fixes the issue too.
> > 
> > Tested-by: Yao Kai <yaokai34@huawei.com>  
> 
> Thanks for testing. I'll send out a formal patch.
> 
> And yes, I agree we should do both.

FYI, the tracing recursion protection made it to mainline:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5f1ef0dfcb5b7f4a91a9b0e0ba533efd9f7e2cdb

-- Steve