Hello!

This series contains RCU CPU stall-warning changes for v6.13:

1.	Delete unused rcu_gp_might_be_stalled() function.

2.	Stop stall warning from dumping stacks if grace period ends.

3.	Finer-grained grace-period-end checks in rcu_dump_cpu_stacks().

						Thanx, Paul

------------------------------------------------------------------------

 b/include/linux/rcutiny.h |    1 -
 b/include/linux/rcutree.h |    1 -
 b/kernel/rcu/tree_stall.h |   30 ------------------------------
 kernel/rcu/tree_stall.h   |   35 ++++++++++++++++++++++-------------
 4 files changed, 22 insertions(+), 45 deletions(-)
Hello!

This series contains RCU CPU stall-warning changes for v6.13:

1.	Delete unused rcu_gp_might_be_stalled() function.

2.	Stop stall warning from dumping stacks if grace period ends.

3.	Finer-grained grace-period-end checks in rcu_dump_cpu_stacks().

Changes since v1:

o	Add data_race() to grace-period sequence-number access per
	feedback from Joel Fernandes.

o	Add Reviewed-by tags.

						Thanx, Paul

------------------------------------------------------------------------

 b/include/linux/rcutiny.h |    1 -
 b/include/linux/rcutree.h |    1 -
 b/kernel/rcu/tree_stall.h |   30 ------------------------------
 kernel/rcu/tree_stall.h   |   35 ++++++++++++++++++++++-------------
 4 files changed, 22 insertions(+), 45 deletions(-)
On Wed, Oct 16, 2024 at 09:18:52AM -0700, Paul E. McKenney wrote:
> Hello!
>
> This series contains RCU CPU stall-warning changes for v6.13:
>
> 1.	Delete unused rcu_gp_might_be_stalled() function.
>
> 2.	Stop stall warning from dumping stacks if grace period ends.
>
> 3.	Finer-grained grace-period-end checks in rcu_dump_cpu_stacks().
>
> Changes since v1:
>
> o	Add data_race() to grace-period sequence-number access per
>	feedback from Joel Fernandes.
>
> o	Add Reviewed-by tags.

Applied and pushed to rcu/stall

Thanks!

> 						Thanx, Paul
>
> ------------------------------------------------------------------------
>
>  b/include/linux/rcutiny.h |    1 -
>  b/include/linux/rcutree.h |    1 -
>  b/kernel/rcu/tree_stall.h |   30 ------------------------------
>  kernel/rcu/tree_stall.h   |   35 ++++++++++++++++++++++-------------
>  4 files changed, 22 insertions(+), 45 deletions(-)
The rcu_gp_might_be_stalled() function is no longer used, so this commit
removes it.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
include/linux/rcutiny.h | 1 -
include/linux/rcutree.h | 1 -
kernel/rcu/tree_stall.h | 30 ------------------------------
3 files changed, 32 deletions(-)
diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 0ee270b3f5ed2..fe42315f667fc 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -165,7 +165,6 @@ static inline bool rcu_inkernel_boot_has_ended(void) { return true; }
static inline bool rcu_is_watching(void) { return true; }
static inline void rcu_momentary_eqs(void) { }
static inline void kfree_rcu_scheduler_running(void) { }
-static inline bool rcu_gp_might_be_stalled(void) { return false; }
/* Avoid RCU read-side critical sections leaking across. */
static inline void rcu_all_qs(void) { barrier(); }
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 90a684f94776e..27d86d9127817 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -40,7 +40,6 @@ void kvfree_rcu_barrier(void);
void rcu_barrier(void);
void rcu_momentary_eqs(void);
void kfree_rcu_scheduler_running(void);
-bool rcu_gp_might_be_stalled(void);
struct rcu_gp_oldstate {
unsigned long rgos_norm;
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index 4432db6d0b99b..d7cdd535e50b1 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -76,36 +76,6 @@ int rcu_jiffies_till_stall_check(void)
}
EXPORT_SYMBOL_GPL(rcu_jiffies_till_stall_check);
-/**
- * rcu_gp_might_be_stalled - Is it likely that the grace period is stalled?
- *
- * Returns @true if the current grace period is sufficiently old that
- * it is reasonable to assume that it might be stalled. This can be
- * useful when deciding whether to allocate memory to enable RCU-mediated
- * freeing on the one hand or just invoking synchronize_rcu() on the other.
- * The latter is preferable when the grace period is stalled.
- *
- * Note that sampling of the .gp_start and .gp_seq fields must be done
- * carefully to avoid false positives at the beginnings and ends of
- * grace periods.
- */
-bool rcu_gp_might_be_stalled(void)
-{
- unsigned long d = rcu_jiffies_till_stall_check() / RCU_STALL_MIGHT_DIV;
- unsigned long j = jiffies;
-
- if (d < RCU_STALL_MIGHT_MIN)
- d = RCU_STALL_MIGHT_MIN;
- smp_mb(); // jiffies before .gp_seq to avoid false positives.
- if (!rcu_gp_in_progress())
- return false;
- // Long delays at this point avoids false positive, but a delay
- // of ULONG_MAX/4 jiffies voids your no-false-positive warranty.
- smp_mb(); // .gp_seq before second .gp_start
- // And ditto here.
- return !time_before(j, READ_ONCE(rcu_state.gp_start) + d);
-}
-
/* Don't do RCU CPU stall warnings during long sysrq printouts. */
void rcu_sysrq_start(void)
{
--
2.40.1
Currently, once an RCU CPU stall warning decides to dump the stalling
CPUs' stacks, the rcu_dump_cpu_stacks() function persists until it
has gone through the full list. Unfortunately, if the stalled grace
period ends midway through, this function will be dumping stacks of
innocent-bystander CPUs that happen to be blocking not the old grace
period, but instead the new one. This can cause serious confusion.
This commit therefore stops dumping stacks if and when the stalled grace
period ends.
[ paulmck: Apply Joel Fernandes feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
kernel/rcu/tree_stall.h | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index d7cdd535e50b1..b530844becf85 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -335,13 +335,17 @@ static int rcu_print_task_stall(struct rcu_node *rnp, unsigned long flags)
* that don't support NMI-based stack dumps. The NMI-triggered stack
* traces are more accurate because they are printed by the target CPU.
*/
-static void rcu_dump_cpu_stacks(void)
+static void rcu_dump_cpu_stacks(unsigned long gp_seq)
{
int cpu;
unsigned long flags;
struct rcu_node *rnp;
rcu_for_each_leaf_node(rnp) {
+ if (gp_seq != data_race(rcu_state.gp_seq)) {
+ pr_err("INFO: Stall ended during stack backtracing.\n");
+ return;
+ }
printk_deferred_enter();
raw_spin_lock_irqsave_rcu_node(rnp, flags);
for_each_leaf_node_possible_cpu(rnp, cpu)
@@ -608,7 +612,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps)
(long)rcu_seq_current(&rcu_state.gp_seq), totqlen,
data_race(rcu_state.n_online_cpus)); // Diagnostic read
if (ndetected) {
- rcu_dump_cpu_stacks();
+ rcu_dump_cpu_stacks(gp_seq);
/* Complain about tasks blocking the grace period. */
rcu_for_each_leaf_node(rnp)
@@ -640,7 +644,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps)
rcu_force_quiescent_state(); /* Kick them all. */
}
-static void print_cpu_stall(unsigned long gps)
+static void print_cpu_stall(unsigned long gp_seq, unsigned long gps)
{
int cpu;
unsigned long flags;
@@ -677,7 +681,7 @@ static void print_cpu_stall(unsigned long gps)
rcu_check_gp_kthread_expired_fqs_timer();
rcu_check_gp_kthread_starvation();
- rcu_dump_cpu_stacks();
+ rcu_dump_cpu_stacks(gp_seq);
raw_spin_lock_irqsave_rcu_node(rnp, flags);
/* Rewrite if needed in case of slow consoles. */
@@ -759,7 +763,8 @@ static void check_cpu_stall(struct rcu_data *rdp)
gs2 = READ_ONCE(rcu_state.gp_seq);
if (gs1 != gs2 ||
ULONG_CMP_LT(j, js) ||
- ULONG_CMP_GE(gps, js))
+ ULONG_CMP_GE(gps, js) ||
+ !rcu_seq_state(gs2))
return; /* No stall or GP completed since entering function. */
rnp = rdp->mynode;
jn = jiffies + ULONG_MAX / 2;
@@ -780,7 +785,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
pr_err("INFO: %s detected stall, but suppressed full report due to a stuck CSD-lock.\n", rcu_state.name);
} else if (self_detected) {
/* We haven't checked in, so go dump stack. */
- print_cpu_stall(gps);
+ print_cpu_stall(gs2, gps);
} else {
/* They had a few time units to dump stack, so complain. */
print_other_cpu_stall(gs2, gps);
--
2.40.1
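A note for readers following the gp_seq checks above: the low-order bits of
rcu_state.gp_seq encode the phase of the current grace period (nonzero while
one is in progress) and the upper bits count grace periods, which is why the
snapshot comparison and the new !rcu_seq_state(gs2) test can detect that the
sampled grace period has ended.  The standalone program below is only a rough
userspace model of that encoding, patterned after the helpers in
kernel/rcu/rcu.h; it is an illustration, not the kernel code itself.

#include <stdio.h>

/*
 * Userspace model of the kernel's grace-period sequence counter (see
 * kernel/rcu/rcu.h).  The bottom RCU_SEQ_CTR_SHIFT bits hold the phase
 * of the current grace period (0 means idle), the remaining bits count
 * grace periods.  Illustration only.
 */
#define RCU_SEQ_CTR_SHIFT	2
#define RCU_SEQ_STATE_MASK	((1UL << RCU_SEQ_CTR_SHIFT) - 1)

static unsigned long rcu_seq_state(unsigned long s)
{
	return s & RCU_SEQ_STATE_MASK;	/* nonzero: a GP is in progress */
}

static void rcu_seq_start(unsigned long *sp)
{
	*sp += 1;			/* idle -> in progress */
}

static void rcu_seq_end(unsigned long *sp)
{
	*sp = (*sp | RCU_SEQ_STATE_MASK) + 1;	/* round up to the next idle value */
}

int main(void)
{
	unsigned long gp_seq = 0;
	unsigned long snap;

	rcu_seq_start(&gp_seq);
	snap = gp_seq;			/* what check_cpu_stall() samples into gs2 */
	printf("GP in progress: state=%lu\n", rcu_seq_state(snap));

	rcu_seq_end(&gp_seq);		/* the stalled GP ends mid-dump */
	printf("snapshot %lu vs current %lu: %s\n", snap, gp_seq,
	       snap != gp_seq ? "stop dumping stacks" : "keep dumping");
	return 0;
}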
This commit pushes the grace-period-end checks further down into
rcu_dump_cpu_stacks(), and also uses lockless checks coupled with
finer-grained locking.
The result is that the current leaf rcu_node structure's ->lock is
acquired only if a stack backtrace might be needed from the current CPU,
and is held across only that CPU's backtrace. As a result, if there are
no stalled CPUs associated with a given rcu_node structure, then its
->lock will not be acquired at all. On large systems, it is usually
(though not always) the case that a small number of CPUs are stalling
the current grace period, which means that the ->lock need be acquired
only for a small fraction of the rcu_node structures.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
kernel/rcu/tree_stall.h | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index b530844becf85..8994391b95c76 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -342,20 +342,24 @@ static void rcu_dump_cpu_stacks(unsigned long gp_seq)
struct rcu_node *rnp;
rcu_for_each_leaf_node(rnp) {
- if (gp_seq != data_race(rcu_state.gp_seq)) {
- pr_err("INFO: Stall ended during stack backtracing.\n");
- return;
- }
printk_deferred_enter();
- raw_spin_lock_irqsave_rcu_node(rnp, flags);
- for_each_leaf_node_possible_cpu(rnp, cpu)
+ for_each_leaf_node_possible_cpu(rnp, cpu) {
+ if (gp_seq != data_race(rcu_state.gp_seq)) {
+ printk_deferred_exit();
+ pr_err("INFO: Stall ended during stack backtracing.\n");
+ return;
+ }
+ if (!(data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)))
+ continue;
+ raw_spin_lock_irqsave_rcu_node(rnp, flags);
if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
if (cpu_is_offline(cpu))
pr_err("Offline CPU %d blocking current GP.\n", cpu);
else
dump_cpu_task(cpu);
+ raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
}
- raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+ }
printk_deferred_exit();
}
}
--
2.40.1
This commit pushes the grace-period-end checks further down into
rcu_dump_cpu_stacks(), and also uses lockless checks coupled with
finer-grained locking.
The result is that the current leaf rcu_node structure's ->lock is
acquired only if a stack backtrace might be needed from the current CPU,
and is held across only that CPU's backtrace. As a result, if there are
no stalled CPUs associated with a given rcu_node structure, then its
->lock will not be acquired at all. On large systems, it is usually
(though not always) the case that a small number of CPUs are stalling
the current grace period, which means that the ->lock need be acquired
only for a small fraction of the rcu_node structures.
[ paulmck: Apply Dan Carpenter feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index b530844becf85..925fcdad5dea2 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -342,20 +342,24 @@ static void rcu_dump_cpu_stacks(unsigned long gp_seq)
struct rcu_node *rnp;
rcu_for_each_leaf_node(rnp) {
- if (gp_seq != data_race(rcu_state.gp_seq)) {
- pr_err("INFO: Stall ended during stack backtracing.\n");
- return;
- }
printk_deferred_enter();
- raw_spin_lock_irqsave_rcu_node(rnp, flags);
- for_each_leaf_node_possible_cpu(rnp, cpu)
+ for_each_leaf_node_possible_cpu(rnp, cpu) {
+ if (gp_seq != data_race(rcu_state.gp_seq)) {
+ printk_deferred_exit();
+ pr_err("INFO: Stall ended during stack backtracing.\n");
+ return;
+ }
+ if (!(data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)))
+ continue;
+ raw_spin_lock_irqsave_rcu_node(rnp, flags);
if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
if (cpu_is_offline(cpu))
pr_err("Offline CPU %d blocking current GP.\n", cpu);
else
dump_cpu_task(cpu);
}
- raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+ raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+ }
printk_deferred_exit();
}
}
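For reference, piecing the hunks above together gives roughly the following
shape for the final function; the comments are added here as a reading aid
and are not part of the applied patch.

static void rcu_dump_cpu_stacks(unsigned long gp_seq)
{
	int cpu;
	unsigned long flags;
	struct rcu_node *rnp;

	rcu_for_each_leaf_node(rnp) {
		printk_deferred_enter();
		for_each_leaf_node_possible_cpu(rnp, cpu) {
			/* Bail out if the stalled grace period has ended. */
			if (gp_seq != data_race(rcu_state.gp_seq)) {
				printk_deferred_exit();
				pr_err("INFO: Stall ended during stack backtracing.\n");
				return;
			}
			/* Lockless fast path: skip CPUs that already reported a QS. */
			if (!(data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)))
				continue;
			/* Hold ->lock only across this one CPU's backtrace. */
			raw_spin_lock_irqsave_rcu_node(rnp, flags);
			if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
				if (cpu_is_offline(cpu))
					pr_err("Offline CPU %d blocking current GP.\n", cpu);
				else
					dump_cpu_task(cpu);
			}
			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
		}
		printk_deferred_exit();
	}
}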
On Mon, 2024-10-28 at 17:22 -0700, Paul E. McKenney wrote:
> The result is that the current leaf rcu_node structure's ->lock is
> acquired only if a stack backtrace might be needed from the current CPU,
> and is held across only that CPU's backtrace. As a result, if there are

After upgrading our device to kernel-6.11, we encountered a lockup
scenario under stall warning. I had prepared a patch to submit, but I
noticed that this series has already addressed some issues, though it
hasn't been merged into the mainline yet. So, I decided to reply to this
series for discussion on how to fix it before pushing. Here is the
lockup scenario we encountered:

Devices: arm64 with only 8 cores

One CPU holds rnp->lock in rcu_dump_cpu_stacks() while trying to dump
other CPUs, but it waits for the corresponding CPU to dump backtrace,
with a 10-second timeout.

  __delay()
  __const_udelay()
  nmi_trigger_cpumask_backtrace()
  arch_trigger_cpumask_backtrace()
  trigger_single_cpu_backtrace()
  dump_cpu_task()
  rcu_dump_cpu_stacks()    <- holding rnp->lock
  print_other_cpu_stall()
  check_cpu_stall()
  rcu_pending()
  rcu_sched_clock_irq()
  update_process_times()

However, the other 7 CPUs are waiting for rnp->lock on the path to
report qs.

  queued_spin_lock_slowpath()
  queued_spin_lock()
  do_raw_spin_lock()
  __raw_spin_lock_irqsave()
  _raw_spin_lock_irqsave()
  rcu_report_qs_rdp()
  rcu_check_quiescent_state()
  rcu_core()
  rcu_core_si()
  handle_softirqs()
  __do_softirq()
  ____do_softirq()
  call_on_irq_stack()

Since the arm64 architecture uses IPI instead of true NMI to implement
arch_trigger_cpumask_backtrace(), spin_lock_irqsave disables interrupts,
which is enough to block this IPI request. Therefore, if other CPUs
start waiting for the lock before receiving the IPI, a semi-deadlock
scenario like the following occurs:

CPU0                      CPU1                      CPU2
-----                     -----                     -----
lock_irqsave(rnp->lock)
                          lock_irqsave(rnp->lock)
                          <can't receive IPI>
<send ipi to CPU 1>
<wait CPU 1 for 10s>
                                                    lock_irqsave(rnp->lock)
                                                    <can't receive IPI>
<send ipi to CPU 2>
<wait CPU 2 for 10s>
...

In our scenario, with 7 CPUs to dump, the lockup takes nearly 70
seconds, causing subsequent useful logs to be unable to print, leading
to a watchdog timeout and system reboot.

This series of changes re-acquires the lock after each dump,
significantly reducing lock-holding time. However, since it still holds
the lock while dumping a CPU backtrace, there's still a chance for two
CPUs to wait for each other for 10 seconds, which is still too long.

So, I would like to ask if it's necessary to dump backtrace within the
spinlock section? If not, especially now that lockless checks are
possible, maybe it can be changed as follows?

-			if (!(data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)))
-				continue;
-			raw_spin_lock_irqsave_rcu_node(rnp, flags);
-			if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
+			if (data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)) {
 				if (cpu_is_offline(cpu))
 					pr_err("Offline CPU %d blocking current GP.\n", cpu);
 				else
 					dump_cpu_task(cpu);
 			}
-			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

Or should this be considered an arm64 issue, and they should switch to
true NMI, otherwise, they shouldn't use nmi_trigger_cpumask_backtrace()?
On Tue, Oct 29, 2024 at 02:20:51AM +0000, Cheng-Jui Wang (王正睿) wrote:
> On Mon, 2024-10-28 at 17:22 -0700, Paul E. McKenney wrote:
> > The result is that the current leaf rcu_node structure's ->lock is
> > acquired only if a stack backtrace might be needed from the current CPU,
> > and is held across only that CPU's backtrace. As a result, if there are
>
> After upgrading our device to kernel-6.11, we encountered a lockup
> scenario under stall warning.

[ . . . ]

> Or should this be considered an arm64 issue, and they should switch to
> true NMI, otherwise, they shouldn't use nmi_trigger_cpumask_backtrace()?

Thank you for looking into this!

We do assume that nmi_trigger_cpumask_backtrace() uses true NMIs, so,
yes, nmi_trigger_cpumask_backtrace() should use true NMIs, just like
the name says.  ;-)

Alternatively, arm64 could continue using nmi_trigger_cpumask_backtrace()
with normal interrupts (for example, on SoCs not implementing true NMIs),
but have a short timeout (maybe a few jiffies?) after which it returns
false (and presumably also cancels the backtrace request so that when
the non-NMI interrupt eventually does happen, its handler simply returns
without backtracing).  This should be implemented using atomics to avoid
deadlock issues.  This alternative approach would provide accurate arm64
backtraces in the common case where interrupts are enabled, but allow
a graceful fallback to remote tracing otherwise.

Would you be interested in working this issue, whatever solution the
arm64 maintainers end up preferring?

							Thanx, Paul
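One possible shape for the short-timeout fallback suggested above is sketched
below.  The arch_backtrace_timeout_ms() hook and the wait_for_backtraces()
wrapper are hypothetical names invented for this sketch; only mdelay() and
cpumask_empty() are existing kernel interfaces, and the 10 * 1000 default
mirrors the current hard-coded wait in nmi_trigger_cpumask_backtrace().

#include <linux/cpumask.h>
#include <linux/delay.h>

/*
 * Hypothetical sketch only: neither arch_backtrace_timeout_ms() nor
 * wait_for_backtraces() exists in the kernel.  The idea is that an
 * architecture whose backtrace "NMI" is really a maskable IRQ (for
 * example, arm64 without pseudo-NMI) overrides the timeout so that the
 * requester gives up after a few milliseconds instead of 10 seconds.
 */
#ifndef arch_backtrace_timeout_ms
#define arch_backtrace_timeout_ms()	(10 * 1000)	/* current default wait */
#endif

static bool wait_for_backtraces(struct cpumask *pending)
{
	unsigned long timeout_ms = arch_backtrace_timeout_ms();
	unsigned long i;

	for (i = 0; i < timeout_ms; i++) {
		if (cpumask_empty(pending))
			return true;	/* every requested CPU has dumped */
		mdelay(1);
	}
	/*
	 * Timed out: the caller would clear the request bits (atomically,
	 * to avoid racing with a late IRQ handler) and then fall back to
	 * dumping the remaining CPUs remotely, as dump_cpu_task() does
	 * when trigger_single_cpu_backtrace() returns false.
	 */
	return false;
}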
On Tue, 2024-10-29 at 09:29 -0700, Paul E. McKenney wrote:
> We do assume that nmi_trigger_cpumask_backtrace() uses true NMIs, so,
> yes, nmi_trigger_cpumask_backtrace() should use true NMIs, just like
> the name says.  ;-)

In the comments of the following patch, the arm64 maintainers have
differing views on the use of nmi_trigger_cpumask_backtrace(). I'm a bit
confused and unsure which perspective is more reasonable.

https://lore.kernel.org/all/20230906090246.v13.4.Ie6c132b96ebbbcddbf6954b9469ed40a6960343c@changeid/

> /*
>  * NOTE: though nmi_trigger_cpumask_backtrace() has "nmi_" in the name,
>  * nothing about it truly needs to be implemented using an NMI, it's
>  * just that it's _allowed_ to work with NMIs. If ipi_should_be_nmi()
>  * returned false our backtrace attempt will just use a regular IPI.
>  */

> Alternatively, arm64 could continue using nmi_trigger_cpumask_backtrace()
> with normal interrupts (for example, on SoCs not implementing true NMIs),
> but have a short timeout (maybe a few jiffies?) after which it returns
> false (and presumably also cancels the backtrace request so that when
> the non-NMI interrupt eventually does happen, its handler simply returns
> without backtracing).  This should be implemented using atomics to avoid
> deadlock issues.  This alternative approach would provide accurate arm64
> backtraces in the common case where interrupts are enabled, but allow
> a graceful fallback to remote tracing otherwise.
>
> Would you be interested in working this issue, whatever solution the
> arm64 maintainers end up preferring?

The 10-second timeout is hard-coded in nmi_trigger_cpumask_backtrace().
It is shared code and not architecture-specific. Currently, I haven't
thought of a feasible solution. I have also CC'd the authors of the
aforementioned patch to see if they have any other ideas.

Regarding the rcu stall warning, I think the purpose of acquiring
`rnp->lock` is to protect the rnp->qsmask variable rather than to
protect the `dump_cpu_task()` operation, right?

Therefore, there is no need to call dump_cpu_task() while holding the
lock. When holding the spinlock, we can store the CPUs that need to be
dumped into a cpumask, and then dump them all at once after releasing
the lock. Here is my temporary solution used locally based on
kernel-6.11.

+	cpumask_var_t mask;
+	bool mask_ok;
+
+	mask_ok = zalloc_cpumask_var(&mask, GFP_ATOMIC);
 	rcu_for_each_leaf_node(rnp) {
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 		for_each_leaf_node_possible_cpu(rnp, cpu)
 			if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
 				if (cpu_is_offline(cpu))
 					pr_err("Offline CPU %d blocking current GP.\n", cpu);
+				else if (mask_ok)
+					cpumask_set_cpu(cpu, mask);
 				else
 					dump_cpu_task(cpu);
 			}
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	}
+	if (mask_ok) {
+		if (!trigger_cpumask_backtrace(mask)) {
+			for_each_cpu(cpu, mask)
+				dump_cpu_task(cpu);
+		}
+		free_cpumask_var(mask);
+	}

After applying this, I haven't encountered the lockup issue for five
days, whereas it used to occur about once a day.

> 							Thanx, Paul
On Wed, Oct 30, 2024 at 03:55:56AM +0000, Cheng-Jui Wang (王正睿) wrote:

[ . . . ]

> In the comments of the following patch, the arm64 maintainers have
> differing views on the use of nmi_trigger_cpumask_backtrace(). I'm a bit
> confused and unsure which perspective is more reasonable.
>
> https://lore.kernel.org/all/20230906090246.v13.4.Ie6c132b96ebbbcddbf6954b9469ed40a6960343c@changeid/

I clearly need to have a chat with the arm64 maintainers, and thank
you for checking.

> The 10-second timeout is hard-coded in nmi_trigger_cpumask_backtrace().
> It is shared code and not architecture-specific. Currently, I haven't
> thought of a feasible solution. I have also CC'd the authors of the
> aforementioned patch to see if they have any other ideas.

It should be possible for arm64 to have an architecture-specific hook
that enables them to use a much shorter timeout.  Or, to eventually
switch to real NMIs.

> Regarding the rcu stall warning, I think the purpose of acquiring
> `rnp->lock` is to protect the rnp->qsmask variable rather than to
> protect the `dump_cpu_task()` operation, right?

As noted below, it is also to prevent false-positive stack dumps.

> Therefore, there is no need to call dump_cpu_task() while holding the
> lock. When holding the spinlock, we can store the CPUs that need to be
> dumped into a cpumask, and then dump them all at once after releasing
> the lock. Here is my temporary solution used locally based on
> kernel-6.11.

[ . . . ]

> After applying this, I haven't encountered the lockup issue for five
> days, whereas it used to occur about once a day.

We used to do it this way, and the reason that we changed was to avoid
false-positive (and very confusing) stack dumps in the surprisingly
common case where the act of dumping the first stack caused the stalled
grace period to end.

So sorry, but we really cannot go back to doing it that way.

							Thanx, Paul
On Wed, 2024-10-30 at 06:54 -0700, Paul E. McKenney wrote:
> It should be possible for arm64 to have an architecture-specific hook
> that enables them to use a much shorter timeout.  Or, to eventually
> switch to real NMIs.

There is already another thread discussing the timeout issue, but I
still have some questions about RCU. To avoid mixing the discussions, I
start this separate thread to discuss RCU.

> > Regarding the rcu stall warning, I think the purpose of acquiring
> > `rnp->lock` is to protect the rnp->qsmask variable rather than to
> > protect the `dump_cpu_task()` operation, right?
>
> As noted below, it is also to prevent false-positive stack dumps.

[ . . . ]

> We used to do it this way, and the reason that we changed was to avoid
> false-positive (and very confusing) stack dumps in the surprisingly
> common case where the act of dumping the first stack caused the stalled
> grace period to end.
>
> So sorry, but we really cannot go back to doing it that way.
>
> 							Thanx, Paul

Let me clarify: the reason for the issue mentioned above is that it
pre-determines all the CPUs to be dumped before starting the dump
process. Then, dumping the first stack caused the stalled grace period
to end. Subsequently, many CPUs that do not need to be dumped (false
positives) are dumped.

So, to prevent false positives, it should be about excluding those CPUs
that do not need to be dumped, right? Therefore, the action that truly
helps is actually "releasing the lock after each dump (allowing other
CPUs to update qsmask) and rechecking (gp_seq and qsmask) to confirm
whether to continue dumping".

I think holding the lock while dumping CPUs does not help prevent false
positives; it only blocks those CPUs waiting for the lock (e.g., CPUs
about to report qs). For CPUs that do not interact with this lock,
holding it should not have any impact. Did I miss anything?

-Cheng-Jui
On Fri, Nov 01, 2024 at 07:41:27AM +0000, Cheng-Jui Wang (王正睿) wrote:

[ . . . ]

> I think holding the lock while dumping CPUs does not help prevent false
> positives; it only blocks those CPUs waiting for the lock (e.g., CPUs
> about to report qs). For CPUs that do not interact with this lock,
> holding it should not have any impact. Did I miss anything?

Yes.  The stalled CPU could unstall itself just after the lock was
released, so that the stack dump would be from some random irrelevant
and confusing point in the code.  This would not be a good thing.

In contrast, with the lock held, the stalled CPU cannot fully exit its
RCU read-side critical section, so that the dump has at least some
relevance.

Let's please instead confine the change to architecture-specific code
that chooses to use interrupts instead of NMIs, as suggested in my
previous email.  If there is more than one such architecture (arm64
and arm32?), they can of course share code, if appropriate.

							Thanx, Paul
Hi,

On Wed, Oct 30, 2024 at 6:54 AM Paul E. McKenney <paulmck@kernel.org> wrote:

[ . . . ]

> It should be possible for arm64 to have an architecture-specific hook
> that enables them to use a much shorter timeout.  Or, to eventually
> switch to real NMIs.

Note that:

* Switching to real NMIs is impossible on many existing arm64 CPUs.
The hardware support simply isn't there. Pseudo-NMIs should work fine
and are (in nearly all cases) just as good as NMIs but they have a
small performance impact. There are also compatibility issues with
some pre-existing bootloaders. ...so code can't assume even Pseudo-NMI
will work and needs to be able to fall back. Prior to my recent
changes arm64 CPUs wouldn't even do stacktraces in some scenarios. Now
at least they fall back to regular IPIs.

* Even if we decided that we absolutely had to disable stacktraces on
arm64 CPUs without some type of NMI, that won't fix arm32. arm32 has
been using regular IPIs for backtraces like this for many, many years.
Maybe folks don't care as much about arm32 anymore but it feels bad if
we totally break it.

IMO waiting 10 seconds for a backtrace is pretty crazy to begin with.
What about just changing that globally to 1 second? If not, it doesn't
feel like it would be impossibly hard to make an arch-specific
callback to choose the time and that callback could even take into
account whether we managed to get an NMI. I'd be happy to review such
a patch.

-Doug
On Wed, Oct 30, 2024 at 01:12:01PM -0700, Doug Anderson wrote:

[ . . . ]

> * Switching to real NMIs is impossible on many existing arm64 CPUs.
> The hardware support simply isn't there. Pseudo-NMIs should work fine
> and are (in nearly all cases) just as good as NMIs but they have a
> small performance impact. There are also compatibility issues with
> some pre-existing bootloaders. ...so code can't assume even Pseudo-NMI
> will work and needs to be able to fall back. Prior to my recent
> changes arm64 CPUs wouldn't even do stacktraces in some scenarios. Now
> at least they fall back to regular IPIs.

But those IPIs are of no help whatsoever when the target CPU is spinning
with interrupts disabled, which is the scenario under discussion.

Hence my suggestion that arm64, when using IRQs to request stack
backtraces, have an additional short timeout (milliseconds) in
order to fall back to remote stack tracing.  In many cases, this
fallback happens automatically, as you can see in dump_cpu_task().
If trigger_single_cpu_backtrace() returns false, the stack is dumped
remotely.

> * Even if we decided that we absolutely had to disable stacktraces on
> arm64 CPUs without some type of NMI, that won't fix arm32. arm32 has
> been using regular IPIs for backtraces like this for many, many years.
> Maybe folks don't care as much about arm32 anymore but it feels bad if
> we totally break it.

Because arm32 isn't used much for datacenter workloads, it will not
be suffering from this issue.  Not enough to have motivated anyone to
fix it, anyway.  In contrast, in the datacenter, CPUs really can and
do have interrupts disabled for many seconds.  (No, this is not a good
thing, but it is common enough for us to need to avoid disabling other
debugging facilities.)

So it might well be that arm32 will continue to do just fine with
IRQ-based stack tracing.  Or maybe someday arm32 will also need to deal
with this same issue.  But no "maybe" for arm64, given its increasing
use in the datacenter.

> IMO waiting 10 seconds for a backtrace is pretty crazy to begin with.
> What about just changing that globally to 1 second? If not, it doesn't
> feel like it would be impossibly hard to make an arch-specific
> callback to choose the time and that callback could even take into
> account whether we managed to get an NMI. I'd be happy to review such
> a patch.

Let's please keep the 10-second timeout for NMI-based backtraces.

But I take it that you are not opposed to a shorter timeout for the
special case of IRQ-based stack backtrace requests?

							Thanx, Paul
Hi,

On Wed, Oct 30, 2024 at 4:27 PM Paul E. McKenney <paulmck@kernel.org> wrote:

[ . . . ]

> > * Switching to real NMIs is impossible on many existing arm64 CPUs.
> > The hardware support simply isn't there. Pseudo-NMIs should work fine
> > and are (in nearly all cases) just as good as NMIs but they have a
> > small performance impact. There are also compatibility issues with
> > some pre-existing bootloaders. ...so code can't assume even Pseudo-NMI
> > will work and needs to be able to fall back. Prior to my recent
> > changes arm64 CPUs wouldn't even do stacktraces in some scenarios. Now
> > at least they fall back to regular IPIs.
>
> But those IPIs are of no help whatsoever when the target CPU is spinning
> with interrupts disabled, which is the scenario under discussion.

Right that we can't trace the target CPU spinning with interrupts
disabled. ...but in the case where NMI isn't available it _still_
makes sense to make arch_trigger_cpumask_backtrace() fall back to IPI
because:

1. There are many cases where the trigger.*cpu.*backtrace() class of
functions are used to trace CPUs that _aren't_ looping with interrupts
disabled. We still want to be able to backtrace in that case. For
instance, you can see in "panic.c" that there are cases where
trigger_all_cpu_backtrace() is called. It's valuable to make cases
like this (where there's no indication that a CPU is locked) actually
work.

2. Even if there's a CPU that's looping with interrupts disabled and
we can't trace it because of no NMI, it could still be useful to get
backtraces of other CPUs. It's possible that what the other CPUs are
doing could give a hint about the CPU that's wedged. This isn't a
great hint, but some info is better than no info.

> Hence my suggestion that arm64, when using IRQs to request stack
> backtraces, have an additional short timeout (milliseconds) in
> order to fall back to remote stack tracing.  In many cases, this
> fallback happens automatically, as you can see in dump_cpu_task().
> If trigger_single_cpu_backtrace() returns false, the stack is dumped
> remotely.

I think the answer here is that the API needs to change. The whole
boolean return value for trigger.*cpu.*backtrace() is mostly ignored
by callers. Yes, you've pointed at one of the two places that it's not
ignored and it tries a reasonable fallback, but all the other callers
just silently do nothing when the function returns false. That led to
many places where arm64 devices were simply not getting _any_
stacktrace.

Perhaps we need some type of API that says "I actually have a
fallback, so if you don't think the stacktrace will succeed then it's
OK to return false".

> Because arm32 isn't used much for datacenter workloads, it will not
> be suffering from this issue.  Not enough to have motivated anyone to
> fix it, anyway.  In contrast, in the datacenter, CPUs really can and
> do have interrupts disabled for many seconds.
>
> So it might well be that arm32 will continue to do just fine with
> IRQ-based stack tracing.  Or maybe someday arm32 will also need to deal
> with this same issue.  But no "maybe" for arm64, given its increasing
> use in the datacenter.

IMO the answer here is that anyone in datacenters should just turn on
pseudo NMI (or NMI on newer arm64 CPUs).

> Let's please keep the 10-second timeout for NMI-based backtraces.
>
> But I take it that you are not opposed to a shorter timeout for the
> special case of IRQ-based stack backtrace requests?

IMO the 10 second is still awfully long (HW watchdogs can trigger
during that time), but I'm not that upset by it.

I'm OK with shorter timeouts for IRQ-based ones, though I'm not sure
I'd be OK w/ timeouts measured in milliseconds unless the caller has
actually said that they can handle a "false" return or the caller says
that it's more important to be quick than to wait for a long time.

-Doug
On Wed, Oct 30, 2024 at 05:21:13PM -0700, Doug Anderson wrote: > Hi, > > On Wed, Oct 30, 2024 at 4:27 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Wed, Oct 30, 2024 at 01:12:01PM -0700, Doug Anderson wrote: > > > Hi, > > > > > > On Wed, Oct 30, 2024 at 6:54 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > > > We do assume that nmi_trigger_cpumask_backtrace() uses true NMIs, so, > > > > > > yes, nmi_trigger_cpumask_backtrace() should use true NMIs, just like > > > > > > the name says. ;-) > > > > > > > > > > In the comments of following patch, the arm64 maintainers have > > > > > differing views on the use of nmi_trigger_cpumask_backtrace(). I'm a > > > > > bit confused and unsure which perspective is more reasonable. > > > > > > > > > > https://lore.kernel.org/all/20230906090246.v13.4.Ie6c132b96ebbbcddbf6954b9469ed40a6960343c@changeid/ > > > > > > > > I clearly need to have a chat with the arm64 maintainers, and thank > > > > you for checking. > > > > > > > > > > /* > > > > > > * NOTE: though nmi_trigger_cpumask_backtrace() has "nmi_" in the > > > > > name, > > > > > > * nothing about it truly needs to be implemented using an NMI, it's > > > > > > * just that it's _allowed_ to work with NMIs. If ipi_should_be_nmi() > > > > > > * returned false our backtrace attempt will just use a regular IPI. > > > > > > */ > > > > > > > > > > > Alternatively, arm64 could continue using nmi_trigger_cpumask_backtrace() > > > > > > with normal interrupts (for example, on SoCs not implementing true NMIs), > > > > > > but have a short timeout (maybe a few jiffies?) after which its returns > > > > > > false (and presumably also cancels the backtrace request so that when > > > > > > the non-NMI interrupt eventually does happen, its handler simply returns > > > > > > without backtracing). This should be implemented using atomics to avoid > > > > > > deadlock issues. This alternative approach would provide accurate arm64 > > > > > > backtraces in the common case where interrupts are enabled, but allow > > > > > > a graceful fallback to remote tracing otherwise. > > > > > > > > > > > > Would you be interested in working this issue, whatever solution the > > > > > > arm64 maintainers end up preferring? > > > > > > > > > > The 10-second timeout is hard-coded in nmi_trigger_cpumask_backtrace(). > > > > > It is shared code and not architecture-specific. Currently, I haven't > > > > > thought of a feasible solution. I have also CC'd the authors of the > > > > > aforementioned patch to see if they have any other ideas. > > > > > > > > It should be possible for arm64 to have an architecture-specific hook > > > > that enables them to use a much shorter timeout. Or, to eventually > > > > switch to real NMIs. > > > > > > Note that: > > > > > > * Switching to real NMIs is impossible on many existing arm64 CPUs. > > > The hardware support simply isn't there. Pseudo-NMIs should work fine > > > and are (in nearly all cases) just as good as NMIs but they have a > > > small performance impact. There are also compatibility issues with > > > some pre-existing bootloaders. ...so code can't assume even Pseudo-NMI > > > will work and needs to be able to fall back. Prior to my recent > > > changes arm64 CPUs wouldn't even do stacktraces in some scenarios. Now > > > at least they fall back to regular IPIs. > > > > But those IPIs are of no help whatsoever when the target CPU is spinning > > with interrupts disabled, which is the scenario under discussion. 
> > Right that we can't trace the target CPU spinning with interrupts > disabled. You can by invoking sched_show_task(cpu_curr(cpu)) on the CPU requesting the backtrace. The resulting backtrace will not be as good as you would get from using an NMI, but if you don't have NMIs, it is better than nothing. > ...but in the case where NMI isn't available it _still_ > makes sense to make arch_trigger_cpumask_backtrace() fall back to IPI > because: > > 1. There are many cases where the trigger.*cpu.*backtrace() class of > functions are used to trace CPUs that _aren't_ looping with interrupts > disabled. We still want to be able to backtrace in that case. For > instance, you can see in "panic.c" that there are cases where > trigger_all_cpu_backtrace() is called. It's valuable to make cases > like this (where there's no indication that a CPU is locked) actually > work. > > 2. Even if there's a CPU that's looping with interrupts disabled and > we can't trace it because of no NMI, it could still be useful to get > backtraces of other CPUs. It's possible that what the other CPUs are > doing could give a hint about the CPU that's wedged. This isn't a > great hint, but some info is better than no info. And it also makes sense for an IRQ-based backtrace to fall back to something like the aforementioned sched_show_task(cpu_curr(cpu)) if the destination CPU has interrupts disabled. > > Hence my suggestion that arm64, when using IRQs to request stack > > backtraces, have an additional short timeout (milliseconds) in > > order to fall back to remote stack tracing. In many cases, this > > fallback happens automatically, as you can see in dump_cpu_task(). > > If trigger_single_cpu_backtrace() returns false, the stack is dumped > > remotely. > > I think the answer here is that the API needs to change. The whole > boolean return value for trigger.*cpu.*backtrace() is mostly ignored > by callers. Yes, you've pointed at one of the two places that it's not > ignored and it tries a reasonable fallback, but all the other callers > just silently do nothing when the function returns false. That led to > many places where arm64 devices were simply not getting _any_ > stacktrace. > > Perhaps we need some type of API that says "I actually have a > fallback, so if you don't think the stracktrace will succeed then it's > OK to return false". Boolean is fine for trigger_single_cpu_backtrace(), which is directed at a single CPU. And the one call to this function that does not currently check its return value could just call dump_cpu_task() instead. There are only 20 or so uses of all of these functions, so the API can change, give or take the pain involved in changing architecture-specific code. > > > * Even if we decided that we absolutely had to disable stacktrades on > > > arm64 CPUs without some type of NMI, that won't fix arm32. arm32 has > > > been using regular IPIs for backtraces like this for many, many years. > > > Maybe folks don't care as much about arm32 anymore but it feels bad if > > > we totally break it. > > > > Because arm32 isn't used much for datacenter workloads, it will not > > be suffering from this issue. Not enough to have motivated anyone to > > fix it, anyway. In contrast, in the datacenter, CPUs really can and > > do have interrupts disabled for many seconds. (No, this is not a good > > thing, but it is common enough for us to need to avoid disabling other > > debugging facilities.) > > > > So it might well be that arm32 will continue to do just fine with > > IRQ-based stack tracing. 
Or maybe someday arm32 will also need to deal > > with this same issue. But no "maybe" for arm64, given its increasing > > use in the datacenter. > > IMO the answer here is that anyone in datacenters should just turn on > pseudo NMI (or NMI on newer arm64 CPUs). Suppose datacenters all do this. What part of this issue remains? > > > IMO waiting 10 seconds for a backtrace is pretty crazy to begin with. > > > What about just changing that globally to 1 second? If not, it doesn't > > > feel like it would be impossibly hard to make an arch-specific > > > callback to choose the time and that callback could even take into > > > account whether we managed to get an NMI. I'd be happy to review such > > > a patch. > > > > Let's please keep the 10-second timeout for NMI-based backtraces. > > > > But I take it that you are not opposed to a shorter timeout for the > > special case of IRQ-based stack backtrace requests? > > IMO the 10 second is still awfully long (HW watchdogs can trigger > during that time), but I'm not that upset by it. I'm OK with shorter > timeouts for IRQ-based ones, though I'm not sure I'd be OK w/ timeouts > measured in milliseconds unless the caller has actually said that they > can handle a "false" return or the caller says that it's more > important to be quick than to wait for a long time. If you are using NMIs, and the CPU doesn't respond in 10 seconds, something is likely broken. Maybe bad firmware or bad hardware. You are right that watchdogs might trigger, but they are likely already triggering due to the target CPU being so firmly locked up that it is not responding even to NMIs. On the other hand, when you are not using NMIs, then I agree you want a shorter timeout before remotely tracing the staek via sched_show_task(cpu_curr(cpu)). I picked a few milliseconds, but if datacenters are using real NMIs or pseudo NMIs (and thus are not being blocked by normal interrupt disabling), then I must defer to others on the default timeout. Much depends on the workload and configuration. Thanx, Paul
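For reference, the automatic fallback mentioned above looks roughly like this in dump_cpu_task(); this is a simplified sketch of the pattern rather than the exact kernel/sched/core.c source, and details vary by kernel version:

	void dump_cpu_task(int cpu)
	{
		/* If the architecture can trigger an NMI/IPI backtrace on the
		 * target CPU, rely on that and we are done. */
		if (trigger_single_cpu_backtrace(cpu))
			return;

		/* Otherwise trace the CPU's current task remotely from the
		 * requesting CPU. */
		pr_info("Task dump for CPU %d:\n", cpu);
		sched_show_task(cpu_curr(cpu));
	}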
Hi, On Wed, Oct 30, 2024 at 10:03 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > Note that: > > > > > > > > * Switching to real NMIs is impossible on many existing arm64 CPUs. > > > > The hardware support simply isn't there. Pseudo-NMIs should work fine > > > > and are (in nearly all cases) just as good as NMIs but they have a > > > > small performance impact. There are also compatibility issues with > > > > some pre-existing bootloaders. ...so code can't assume even Pseudo-NMI > > > > will work and needs to be able to fall back. Prior to my recent > > > > changes arm64 CPUs wouldn't even do stacktraces in some scenarios. Now > > > > at least they fall back to regular IPIs. > > > > > > But those IPIs are of no help whatsoever when the target CPU is spinning > > > with interrupts disabled, which is the scenario under discussion. > > > > Right that we can't trace the target CPU spinning with interrupts > > disabled. > > You can by invoking sched_show_task(cpu_curr(cpu)) on the CPU requesting > the backtrace. The resulting backtrace will not be as good as you > would get from using an NMI, but if you don't have NMIs, it is better > than nothing. > > > ...but in the case where NMI isn't available it _still_ > > makes sense to make arch_trigger_cpumask_backtrace() fall back to IPI > > because: > > > > 1. There are many cases where the trigger.*cpu.*backtrace() class of > > functions are used to trace CPUs that _aren't_ looping with interrupts > > disabled. We still want to be able to backtrace in that case. For > > instance, you can see in "panic.c" that there are cases where > > trigger_all_cpu_backtrace() is called. It's valuable to make cases > > like this (where there's no indication that a CPU is locked) actually > > work. > > > > 2. Even if there's a CPU that's looping with interrupts disabled and > > we can't trace it because of no NMI, it could still be useful to get > > backtraces of other CPUs. It's possible that what the other CPUs are > > doing could give a hint about the CPU that's wedged. This isn't a > > great hint, but some info is better than no info. > > And it also makes sense for an IRQ-based backtrace to fall back to > something like the aforementioned sched_show_task(cpu_curr(cpu)) if > the destination CPU has interrupts disabled. > > > > Hence my suggestion that arm64, when using IRQs to request stack > > > backtraces, have an additional short timeout (milliseconds) in > > > order to fall back to remote stack tracing. In many cases, this > > > fallback happens automatically, as you can see in dump_cpu_task(). > > > If trigger_single_cpu_backtrace() returns false, the stack is dumped > > > remotely. > > > > I think the answer here is that the API needs to change. The whole > > boolean return value for trigger.*cpu.*backtrace() is mostly ignored > > by callers. Yes, you've pointed at one of the two places that it's not > > ignored and it tries a reasonable fallback, but all the other callers > > just silently do nothing when the function returns false. That led to > > many places where arm64 devices were simply not getting _any_ > > stacktrace. > > > > Perhaps we need some type of API that says "I actually have a > > fallback, so if you don't think the stracktrace will succeed then it's > > OK to return false". > > Boolean is fine for trigger_single_cpu_backtrace(), which is directed at > a single CPU. And the one call to this function that does not currently > check its return value could just call dump_cpu_task() instead. 
> > There are only 20 or so uses of all of these functions, so the API can > change, give or take the pain involved in changing architecture-specific > code. Right. Falling back to "sched_show_task(cpu_curr(cpu))" or something similar if the IPI doesn't go through is a good idea. Indeed, falling back to that if the NMI doesn't go through is _also_ a good idea, right? ...and we don't want to change all 20 callers to all add a fallback. To that end, it feels like someone should change it so that the common code includes the fallback and we get rid of the boolean return value. > > > > * Even if we decided that we absolutely had to disable stacktrades on > > > > arm64 CPUs without some type of NMI, that won't fix arm32. arm32 has > > > > been using regular IPIs for backtraces like this for many, many years. > > > > Maybe folks don't care as much about arm32 anymore but it feels bad if > > > > we totally break it. > > > > > > Because arm32 isn't used much for datacenter workloads, it will not > > > be suffering from this issue. Not enough to have motivated anyone to > > > fix it, anyway. In contrast, in the datacenter, CPUs really can and > > > do have interrupts disabled for many seconds. (No, this is not a good > > > thing, but it is common enough for us to need to avoid disabling other > > > debugging facilities.) > > > > > > So it might well be that arm32 will continue to do just fine with > > > IRQ-based stack tracing. Or maybe someday arm32 will also need to deal > > > with this same issue. But no "maybe" for arm64, given its increasing > > > use in the datacenter. > > > > IMO the answer here is that anyone in datacenters should just turn on > > pseudo NMI (or NMI on newer arm64 CPUs). > > Suppose datacenters all do this. What part of this issue remains? I believe that if datacenters enable pseudo NMIs then the issue is "fixed" and thus only remains for anyone else (datacenters or other users of Linux) that don't turn on pseudo NMIs. > > > > IMO waiting 10 seconds for a backtrace is pretty crazy to begin with. > > > > What about just changing that globally to 1 second? If not, it doesn't > > > > feel like it would be impossibly hard to make an arch-specific > > > > callback to choose the time and that callback could even take into > > > > account whether we managed to get an NMI. I'd be happy to review such > > > > a patch. > > > > > > Let's please keep the 10-second timeout for NMI-based backtraces. > > > > > > But I take it that you are not opposed to a shorter timeout for the > > > special case of IRQ-based stack backtrace requests? > > > > IMO the 10 second is still awfully long (HW watchdogs can trigger > > during that time), but I'm not that upset by it. I'm OK with shorter > > timeouts for IRQ-based ones, though I'm not sure I'd be OK w/ timeouts > > measured in milliseconds unless the caller has actually said that they > > can handle a "false" return or the caller says that it's more > > important to be quick than to wait for a long time. > > If you are using NMIs, and the CPU doesn't respond in 10 seconds, > something is likely broken. Maybe bad firmware or bad hardware. You are > right that watchdogs might trigger, but they are likely already triggering > due to the target CPU being so firmly locked up that it is not responding > even to NMIs. Yeah, if NMIs aren't working then things are pretty bad. It still seems like it would be better to let Linux dump more info before a hardware watchdog triggers. 
For instance it could perhaps call something akin to "sched_show_task(cpu_curr(cpu))". I don't feel super strongly about it, but in my mind even if a CPU is unresponsive to NMIs for 1-2 seconds then things are in pretty bad shape and I'd want to start dumping some more info before the hardware watchdog kicks in and we can't dump anything. > On the other hand, when you are not using NMIs, then I agree > you want a shorter timeout before remotely tracing the staek via > sched_show_task(cpu_curr(cpu)). I picked a few milliseconds, but if > datacenters are using real NMIs or pseudo NMIs (and thus are not being > blocked by normal interrupt disabling), then I must defer to others on > the default timeout. Much depends on the workload and configuration. As I said, I don't feel strongly about this, so if people want to keep the timeout or shorten it or add a knob, any of those would be fine with me. I personally would object to a timeout that's _too_ short or a timeout that is longer than the current 10 seconds. -Doug
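A hypothetical sketch of that direction, with the fallback folded into common code so the helper can become void; the function name is invented, not an existing kernel API:

	void trigger_all_cpu_backtrace_with_fallback(void)
	{
		int cpu;

		/* Use the NMI/IPI-based path when the architecture provides one. */
		if (trigger_all_cpu_backtrace())
			return;

		/* Otherwise do best-effort remote traces so callers always get
		 * some output instead of silently nothing. */
		for_each_online_cpu(cpu) {
			pr_info("Task dump for CPU %d (remote fallback):\n", cpu);
			sched_show_task(cpu_curr(cpu));
		}
	}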
On Thu, 2024-10-31 at 14:27 -0700, Doug Anderson wrote: > Hi, > > On Wed, Oct 30, 2024 at 10:03 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > Note that: > > > > > > > > > > * Switching to real NMIs is impossible on many existing arm64 CPUs. > > > > > The hardware support simply isn't there. Pseudo-NMIs should work fine > > > > > and are (in nearly all cases) just as good as NMIs but they have a > > > > > small performance impact. There are also compatibility issues with > > > > > some pre-existing bootloaders. ...so code can't assume even Pseudo-NMI > > > > > will work and needs to be able to fall back. Prior to my recent > > > > > changes arm64 CPUs wouldn't even do stacktraces in some scenarios. Now > > > > > at least they fall back to regular IPIs. > > > > > > > > But those IPIs are of no help whatsoever when the target CPU is spinning > > > > with interrupts disabled, which is the scenario under discussion. > > > > > > Right that we can't trace the target CPU spinning with interrupts > > > disabled. > > > > You can by invoking sched_show_task(cpu_curr(cpu)) on the CPU requesting > > the backtrace. The resulting backtrace will not be as good as you > > would get from using an NMI, but if you don't have NMIs, it is better > > than nothing. > > > > > ...but in the case where NMI isn't available it _still_ > > > makes sense to make arch_trigger_cpumask_backtrace() fall back to IPI > > > because: > > > > > > 1. There are many cases where the trigger.*cpu.*backtrace() class of > > > functions are used to trace CPUs that _aren't_ looping with interrupts > > > disabled. We still want to be able to backtrace in that case. For > > > instance, you can see in "panic.c" that there are cases where > > > trigger_all_cpu_backtrace() is called. It's valuable to make cases > > > like this (where there's no indication that a CPU is locked) actually > > > work. > > > > > > 2. Even if there's a CPU that's looping with interrupts disabled and > > > we can't trace it because of no NMI, it could still be useful to get > > > backtraces of other CPUs. It's possible that what the other CPUs are > > > doing could give a hint about the CPU that's wedged. This isn't a > > > great hint, but some info is better than no info. > > > > And it also makes sense for an IRQ-based backtrace to fall back to > > something like the aforementioned sched_show_task(cpu_curr(cpu)) if > > the destination CPU has interrupts disabled. > > > > > > Hence my suggestion that arm64, when using IRQs to request stack > > > > backtraces, have an additional short timeout (milliseconds) in > > > > order to fall back to remote stack tracing. In many cases, this > > > > fallback happens automatically, as you can see in dump_cpu_task(). > > > > If trigger_single_cpu_backtrace() returns false, the stack is dumped > > > > remotely. > > > > > > I think the answer here is that the API needs to change. The whole > > > boolean return value for trigger.*cpu.*backtrace() is mostly ignored > > > by callers. Yes, you've pointed at one of the two places that it's not > > > ignored and it tries a reasonable fallback, but all the other callers > > > just silently do nothing when the function returns false. That led to > > > many places where arm64 devices were simply not getting _any_ > > > stacktrace. > > > > > > Perhaps we need some type of API that says "I actually have a > > > fallback, so if you don't think the stracktrace will succeed then it's > > > OK to return false". 
> > > > Boolean is fine for trigger_single_cpu_backtrace(), which is directed at > > a single CPU. And the one call to this function that does not currently > > check its return value could just call dump_cpu_task() instead. > > > > There are only 20 or so uses of all of these functions, so the API can > > change, give or take the pain involved in changing architecture-specific > > code. > > Right. Falling back to "sched_show_task(cpu_curr(cpu))" or something > similar if the IPI doesn't go through is a good idea. Indeed, falling > back to that if the NMI doesn't go through is _also_ a good idea, > right? ...and we don't want to change all 20 callers to all add a > fallback. To that end, it feels like someone should change it so that > the common code includes the fallback and we get rid of the boolean > return value. > > > > > > * Even if we decided that we absolutely had to disable stacktrades on > > > > > arm64 CPUs without some type of NMI, that won't fix arm32. arm32 has > > > > > been using regular IPIs for backtraces like this for many, many years. > > > > > Maybe folks don't care as much about arm32 anymore but it feels bad if > > > > > we totally break it. > > > > > > > > Because arm32 isn't used much for datacenter workloads, it will not > > > > be suffering from this issue. Not enough to have motivated anyone to > > > > fix it, anyway. In contrast, in the datacenter, CPUs really can and > > > > do have interrupts disabled for many seconds. (No, this is not a good > > > > thing, but it is common enough for us to need to avoid disabling other > > > > debugging facilities.) > > > > > > > > So it might well be that arm32 will continue to do just fine with > > > > IRQ-based stack tracing. Or maybe someday arm32 will also need to deal > > > > with this same issue. But no "maybe" for arm64, given its increasing > > > > use in the datacenter. > > > > > > IMO the answer here is that anyone in datacenters should just turn on > > > pseudo NMI (or NMI on newer arm64 CPUs). > > > > Suppose datacenters all do this. What part of this issue remains? > > I believe that if datacenters enable pseudo NMIs then the issue is > "fixed" and thus only remains for anyone else (datacenters or other > users of Linux) that don't turn on pseudo NMIs. > > > > > > > IMO waiting 10 seconds for a backtrace is pretty crazy to begin with. > > > > > What about just changing that globally to 1 second? If not, it doesn't > > > > > feel like it would be impossibly hard to make an arch-specific > > > > > callback to choose the time and that callback could even take into > > > > > account whether we managed to get an NMI. I'd be happy to review such > > > > > a patch. > > > > > > > > Let's please keep the 10-second timeout for NMI-based backtraces. > > > > > > > > But I take it that you are not opposed to a shorter timeout for the > > > > special case of IRQ-based stack backtrace requests? > > > > > > IMO the 10 second is still awfully long (HW watchdogs can trigger > > > during that time), but I'm not that upset by it. I'm OK with shorter > > > timeouts for IRQ-based ones, though I'm not sure I'd be OK w/ timeouts > > > measured in milliseconds unless the caller has actually said that they > > > can handle a "false" return or the caller says that it's more > > > important to be quick than to wait for a long time. > > > > If you are using NMIs, and the CPU doesn't respond in 10 seconds, > > something is likely broken. Maybe bad firmware or bad hardware. 
You are > > right that watchdogs might trigger, but they are likely already triggering > > due to the target CPU being so firmly locked up that it is not responding > > even to NMIs. > > Yeah, if NMIs aren't working then things are pretty bad. It still > seems like it would be better to let Linux dump more info before a > hardware watchdog triggers. For instance it could perhaps call > something akin to "sched_show_task(cpu_curr(cpu))". > > I don't feel super strongly about it, but in my mind even if a CPU is > unresponsive to NMIs for 1-2 seconds then things are in pretty bad > shape and I'd want to start dumping some more info before the hardware > watchdog kicks in and we can't dump anything. > > > > On the other hand, when you are not using NMIs, then I agree > > you want a shorter timeout before remotely tracing the staek via > > sched_show_task(cpu_curr(cpu)). I picked a few milliseconds, but if > > datacenters are using real NMIs or pseudo NMIs (and thus are not being > > blocked by normal interrupt disabling), then I must defer to others on > > the default timeout. Much depends on the workload and configuration. > > As I said, I don't feel strongly about this, so if people want to keep > the timeout or shorten it or add a knob, any of those would be fine > with me. I personally would object to a timeout that's _too_ short or > a timeout that is longer than the current 10 seconds. > > -Doug

I hope this fix can be applied to the stable kernels. Modifying an API that is called by many architectures may encounter difficulties during the backporting process.

How about this: we keep the original nmi_trigger_cpumask_backtrace() but add a __nmi_trigger_cpumask_backtrace() in the middle that can accept a timeout parameter.

+#define NMI_BACKTRACE_TIMEOUT (10 * 1000)

 void nmi_trigger_cpumask_backtrace(...)
+{
+	__nmi_trigger_cpumask_backtrace(..., NMI_BACKTRACE_TIMEOUT);
+}
+
+void __nmi_trigger_cpumask_backtrace(..., unsigned long timeout)
 {
 	...
-	for (i = 0; i < 10 * 1000; i++) {
+	for (i = 0; i < timeout; i++) {

Then, the arch_trigger_cpumask_backtrace() for different architectures can pass in their desired timeout, for example:

	/* 1 second if falling back to IPI */
	timeout = has_nmi() ? NMI_BACKTRACE_TIMEOUT : 1000;
	__nmi_trigger_cpumask_backtrace(..., timeout);

This way, architectures that want a different timeout can set one without having to implement complex logic of their own.

-Cheng-Jui
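Fleshing that proposal out a little, as a sketch only: the three-argument signature below is assumed to match what lib/nmi_backtrace.c currently exports, the wait-loop internals are elided rather than copied, and NMI_BACKTRACE_TIMEOUT_MS is an invented name.

	#define NMI_BACKTRACE_TIMEOUT_MS	(10 * 1000)

	void __nmi_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu,
					     void (*raise)(cpumask_t *mask),
					     unsigned long timeout_ms)
	{
		unsigned long i;

		/* ... existing setup: mark the target CPUs, call raise(mask) ... */

		/* Poll for completion, but only for a caller-chosen time. */
		for (i = 0; i < timeout_ms; i++) {
			/* ... existing "have all CPUs responded?" check breaks out here ... */
			mdelay(1);
			touch_softlockup_watchdog();
		}

		/* ... existing teardown ... */
	}

	void nmi_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu,
					   void (*raise)(cpumask_t *mask))
	{
		/* Preserves today's 10-second behavior for all existing callers. */
		__nmi_trigger_cpumask_backtrace(mask, exclude_cpu, raise,
						NMI_BACKTRACE_TIMEOUT_MS);
	}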
On Fri, Nov 01, 2024 at 01:44:15AM +0000, Cheng-Jui Wang (王正睿) wrote: > On Thu, 2024-10-31 at 14:27 -0700, Doug Anderson wrote: > > Hi, > > > > On Wed, Oct 30, 2024 at 10:03 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > > Note that: > > > > > > > > > > > > * Switching to real NMIs is impossible on many existing arm64 CPUs. > > > > > > The hardware support simply isn't there. Pseudo-NMIs should work fine > > > > > > and are (in nearly all cases) just as good as NMIs but they have a > > > > > > small performance impact. There are also compatibility issues with > > > > > > some pre-existing bootloaders. ...so code can't assume even Pseudo-NMI > > > > > > will work and needs to be able to fall back. Prior to my recent > > > > > > changes arm64 CPUs wouldn't even do stacktraces in some scenarios. Now > > > > > > at least they fall back to regular IPIs. > > > > > > > > > > But those IPIs are of no help whatsoever when the target CPU is spinning > > > > > with interrupts disabled, which is the scenario under discussion. > > > > > > > > Right that we can't trace the target CPU spinning with interrupts > > > > disabled. > > > > > > You can by invoking sched_show_task(cpu_curr(cpu)) on the CPU requesting > > > the backtrace. The resulting backtrace will not be as good as you > > > would get from using an NMI, but if you don't have NMIs, it is better > > > than nothing. > > > > > > > ...but in the case where NMI isn't available it _still_ > > > > makes sense to make arch_trigger_cpumask_backtrace() fall back to IPI > > > > because: > > > > > > > > 1. There are many cases where the trigger.*cpu.*backtrace() class of > > > > functions are used to trace CPUs that _aren't_ looping with interrupts > > > > disabled. We still want to be able to backtrace in that case. For > > > > instance, you can see in "panic.c" that there are cases where > > > > trigger_all_cpu_backtrace() is called. It's valuable to make cases > > > > like this (where there's no indication that a CPU is locked) actually > > > > work. > > > > > > > > 2. Even if there's a CPU that's looping with interrupts disabled and > > > > we can't trace it because of no NMI, it could still be useful to get > > > > backtraces of other CPUs. It's possible that what the other CPUs are > > > > doing could give a hint about the CPU that's wedged. This isn't a > > > > great hint, but some info is better than no info. > > > > > > And it also makes sense for an IRQ-based backtrace to fall back to > > > something like the aforementioned sched_show_task(cpu_curr(cpu)) if > > > the destination CPU has interrupts disabled. > > > > > > > > Hence my suggestion that arm64, when using IRQs to request stack > > > > > backtraces, have an additional short timeout (milliseconds) in > > > > > order to fall back to remote stack tracing. In many cases, this > > > > > fallback happens automatically, as you can see in dump_cpu_task(). > > > > > If trigger_single_cpu_backtrace() returns false, the stack is dumped > > > > > remotely. > > > > > > > > I think the answer here is that the API needs to change. The whole > > > > boolean return value for trigger.*cpu.*backtrace() is mostly ignored > > > > by callers. Yes, you've pointed at one of the two places that it's not > > > > ignored and it tries a reasonable fallback, but all the other callers > > > > just silently do nothing when the function returns false. That led to > > > > many places where arm64 devices were simply not getting _any_ > > > > stacktrace. 
> > > > > > > > Perhaps we need some type of API that says "I actually have a > > > > fallback, so if you don't think the stracktrace will succeed then it's > > > > OK to return false". > > > > > > Boolean is fine for trigger_single_cpu_backtrace(), which is directed at > > > a single CPU. And the one call to this function that does not currently > > > check its return value could just call dump_cpu_task() instead. > > > > > > There are only 20 or so uses of all of these functions, so the API can > > > change, give or take the pain involved in changing architecture-specific > > > code. > > > > Right. Falling back to "sched_show_task(cpu_curr(cpu))" or something > > similar if the IPI doesn't go through is a good idea. Indeed, falling > > back to that if the NMI doesn't go through is _also_ a good idea, > > right? ...and we don't want to change all 20 callers to all add a > > fallback. To that end, it feels like someone should change it so that > > the common code includes the fallback and we get rid of the boolean > > return value. > > > > > > > > * Even if we decided that we absolutely had to disable stacktrades on > > > > > > arm64 CPUs without some type of NMI, that won't fix arm32. arm32 has > > > > > > been using regular IPIs for backtraces like this for many, many years. > > > > > > Maybe folks don't care as much about arm32 anymore but it feels bad if > > > > > > we totally break it. > > > > > > > > > > Because arm32 isn't used much for datacenter workloads, it will not > > > > > be suffering from this issue. Not enough to have motivated anyone to > > > > > fix it, anyway. In contrast, in the datacenter, CPUs really can and > > > > > do have interrupts disabled for many seconds. (No, this is not a good > > > > > thing, but it is common enough for us to need to avoid disabling other > > > > > debugging facilities.) > > > > > > > > > > So it might well be that arm32 will continue to do just fine with > > > > > IRQ-based stack tracing. Or maybe someday arm32 will also need to deal > > > > > with this same issue. But no "maybe" for arm64, given its increasing > > > > > use in the datacenter. > > > > > > > > IMO the answer here is that anyone in datacenters should just turn on > > > > pseudo NMI (or NMI on newer arm64 CPUs). > > > > > > Suppose datacenters all do this. What part of this issue remains? > > > > I believe that if datacenters enable pseudo NMIs then the issue is > > "fixed" and thus only remains for anyone else (datacenters or other > > users of Linux) that don't turn on pseudo NMIs. > > > > > > > > > > IMO waiting 10 seconds for a backtrace is pretty crazy to begin with. > > > > > > What about just changing that globally to 1 second? If not, it doesn't > > > > > > feel like it would be impossibly hard to make an arch-specific > > > > > > callback to choose the time and that callback could even take into > > > > > > account whether we managed to get an NMI. I'd be happy to review such > > > > > > a patch. > > > > > > > > > > Let's please keep the 10-second timeout for NMI-based backtraces. > > > > > > > > > > But I take it that you are not opposed to a shorter timeout for the > > > > > special case of IRQ-based stack backtrace requests? > > > > > > > > IMO the 10 second is still awfully long (HW watchdogs can trigger > > > > during that time), but I'm not that upset by it. 
I'm OK with shorter > > > > timeouts for IRQ-based ones, though I'm not sure I'd be OK w/ timeouts > > > > measured in milliseconds unless the caller has actually said that they > > > > can handle a "false" return or the caller says that it's more > > > > important to be quick than to wait for a long time. > > > > > > If you are using NMIs, and the CPU doesn't respond in 10 seconds, > > > something is likely broken. Maybe bad firmware or bad hardware. You are > > > right that watchdogs might trigger, but they are likely already triggering > > > due to the target CPU being so firmly locked up that it is not responding > > > even to NMIs. > > > > Yeah, if NMIs aren't working then things are pretty bad. It still > > seems like it would be better to let Linux dump more info before a > > hardware watchdog triggers. For instance it could perhaps call > > something akin to "sched_show_task(cpu_curr(cpu))". > > > > I don't feel super strongly about it, but in my mind even if a CPU is > > unresponsive to NMIs for 1-2 seconds then things are in pretty bad > > shape and I'd want to start dumping some more info before the hardware > > watchdog kicks in and we can't dump anything. > > > > > > > On the other hand, when you are not using NMIs, then I agree > > > you want a shorter timeout before remotely tracing the staek via > > > sched_show_task(cpu_curr(cpu)). I picked a few milliseconds, but if > > > datacenters are using real NMIs or pseudo NMIs (and thus are not being > > > blocked by normal interrupt disabling), then I must defer to others on > > > the default timeout. Much depends on the workload and configuration. > > > > As I said, I don't feel strongly about this, so if people want to keep > > the timeout or shorten it or add a knob, any of those would be fine > > with me. I personally would object to a timeout that's _too_ short or > > a timeout that is longer than the current 10 seconds. > > > > -Doug > > I hope this fix can be applied to the stable kernels. Modifying an API > that is called by many architectures may encounter difficulties during > the backporting process. > > How about this: we keep the original nmi_trigger_cpumask_backtrace() > but add a __nmi_trigger_cpumask_backtrace() in the middle that can > accept a timeout parameter. > > > +#define NMI_BACKTRACE_TIMEOUT (10 * 1000) > > void nmi_trigger_cpumask_backtrace(...) > +{ > + __nmi_trigger_cpumask_backtrace(..., NMI_BACKTRACE_TIMEOUT); > +} > + > +void __nmi_trigger_cpumask_backtrace(..., unsigned long timeout) > { > ... > - for (i = 0; i < 10 * 1000; i++) { > + for (i = 0; i < timeout; i++) { > > > Then, the arch_trigger_cpumask_backtrace() for different architectures > can pass in their desired timeout, for example: > > /* 1 seconds if fallbacked to IPI */ > timeout = has_nmi() ? NMI_BACKTRACE_TIMEOUT : 1000; > __nmi_trigger_cpu_mask_backtrace(..., timeout); > > This way, archiectures that want different timeouts can modify it > themselves without having to implement complex logic on their own. Why not leave the current 10-second timeout (which is needed to handle completely locked-up CPUs), and then add logic to the arm64 code that does the substitution of the plain interrupt for the NMI? If needed, arm32 can do the same thing. That way, we don't need to change the common-code API, we don't need coordinated changes among multiple architectures, and architectures using real NMIs need not change. Thanx, Paul
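A rough arch-side sketch of that suggestion; arm64_backtrace_uses_nmi(), arm64_raise_backtrace_ipi(), and arm64_irq_backtrace_short_wait() are invented names used only for illustration, not existing arm64 code:

	void arch_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu)
	{
		int cpu;

		if (arm64_backtrace_uses_nmi()) {
			/* Real or pseudo-NMI available: the common 10-second
			 * wait in nmi_trigger_cpumask_backtrace() is fine. */
			nmi_trigger_cpumask_backtrace(mask, exclude_cpu,
						      arm64_raise_backtrace_ipi);
			return;
		}

		/* Plain IRQ only: request backtraces but wait just briefly... */
		if (arm64_irq_backtrace_short_wait(mask, exclude_cpu, 10 /* ms */))
			return;

		/* ...then fall back to best-effort remote traces of whatever
		 * the unresponsive CPUs are currently running. */
		for_each_cpu(cpu, mask) {
			if (cpu == exclude_cpu)
				continue;
			pr_info("Task dump for CPU %d (remote fallback):\n", cpu);
			sched_show_task(cpu_curr(cpu));
		}
	}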
On Wed, Oct 09, 2024 at 11:05:02AM -0700, Paul E. McKenney wrote: > Hello! > > This series contains RCU CPU stall-warning changes for v6.13: > > 1. Delete unused rcu_gp_might_be_stalled() function. > > 2. Stop stall warning from dumping stacks if grace period ends. > > 3. Finer-grained grace-period-end checks in rcu_dump_cpu_stacks(). > Other than small nit in 2/3, Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> I was curious if you're seeing perf or other improvements with the finer-grained rnp locking. thanks, - Joel > Thanx, Paul > > ------------------------------------------------------------------------ > > b/include/linux/rcutiny.h | 1 - > b/include/linux/rcutree.h | 1 - > b/kernel/rcu/tree_stall.h | 30 ------------------------------ > kernel/rcu/tree_stall.h | 35 ++++++++++++++++++++++------------- > 4 files changed, 22 insertions(+), 45 deletions(-) >
On Tue, Oct 15, 2024 at 02:49:06PM -0400, Joel Fernandes wrote: > On Wed, Oct 09, 2024 at 11:05:02AM -0700, Paul E. McKenney wrote: > > Hello! > > > > This series contains RCU CPU stall-warning changes for v6.13: > > > > 1. Delete unused rcu_gp_might_be_stalled() function. > > > > 2. Stop stall warning from dumping stacks if grace period ends. > > > > 3. Finer-grained grace-period-end checks in rcu_dump_cpu_stacks(). > > > > Other than small nit in 2/3, > > Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Applied, thank you, and I also added the data_race() in 2/3. > I was curious if you're seeing perf or other improvements with the > finer-grained rnp locking. This is about robustness rather than performance, though I suppose you could argue that lack of robustness is an extreme form of bad performance. Holding the leaf rcu_node locks for too long while dumping stacks can result in things like CSD-lock timeouts due to the dumping CPU having interrupts disabled for an extended period. An earlier commit, 1ecd9d68eb44 ("rcu: Defer printing stall-warning backtrace when holding rcu_node lock"), took care of most of the issue by deferring the actual output. But while in the area, it seemed wise to avoid up to 64 dumps being generated (but no longer printed) while holding a leaf rcu_node lock. Hence this commit. Thanx, Paul > thanks, > > - Joel > > > > Thanx, Paul > > > > ------------------------------------------------------------------------ > > > > b/include/linux/rcutiny.h | 1 - > > b/include/linux/rcutree.h | 1 - > > b/kernel/rcu/tree_stall.h | 30 ------------------------------ > > kernel/rcu/tree_stall.h | 35 ++++++++++++++++++++++------------- > > 4 files changed, 22 insertions(+), 45 deletions(-) > >
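For readers joining the thread here, the finer-grained check being discussed ends up shaped roughly like the following; this is an approximation of the tree_stall.h change described above, not a verbatim copy of the patch:

	static void rcu_dump_cpu_stacks(unsigned long gp_seq)
	{
		int cpu;
		unsigned long flags;
		struct rcu_node *rnp;

		rcu_for_each_leaf_node(rnp) {
			printk_deferred_enter();
			for_each_leaf_node_possible_cpu(rnp, cpu) {
				/* Recheck before each dump: if the stalled grace
				 * period has ended, stop generating output. */
				if (gp_seq != data_race(rcu_state.gp_seq)) {
					printk_deferred_exit();
					pr_err("INFO: Stall ended during stack backtracing.\n");
					return;
				}
				/* Hold the leaf rcu_node lock per CPU rather than
				 * across all (up to 64) CPUs of the leaf node. */
				raw_spin_lock_irqsave_rcu_node(rnp, flags);
				if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
					if (cpu_is_offline(cpu))
						pr_err("Offline CPU %d blocking current GP.\n", cpu);
					else
						dump_cpu_task(cpu);
				}
				raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
			}
			printk_deferred_exit();
		}
	}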
On Tue, Oct 15, 2024 at 04:02:37PM -0700, Paul E. McKenney wrote: > On Tue, Oct 15, 2024 at 02:49:06PM -0400, Joel Fernandes wrote: > > On Wed, Oct 09, 2024 at 11:05:02AM -0700, Paul E. McKenney wrote: > > > Hello! > > > > > > This series contains RCU CPU stall-warning changes for v6.13: > > > > > > 1. Delete unused rcu_gp_might_be_stalled() function. > > > > > > 2. Stop stall warning from dumping stacks if grace period ends. > > > > > > 3. Finer-grained grace-period-end checks in rcu_dump_cpu_stacks(). > > > > > > > Other than small nit in 2/3, > > > > Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > Applied, thank you, and I also added the data_race() in 2/3. > > > I was curious if you're seeing perf or other improvements with the > > finer-grained rnp locking. > > This is about robustness rather than performance, though I suppose you > could argue that lack of robustness is an extreme form of bad performance. > Holding the leaf rcu_node locks for too long while dumping stacks can > result in things like CSD-lock timeouts due to the dumping CPU having > interrupts disabled for an extended period. > > An earlier commit, 1ecd9d68eb44 ("rcu: Defer printing stall-warning > backtrace when holding rcu_node lock"), took care of most of the issue > by deferring the actual output. But while in the area, it seemed wise > to avoid up to 64 dumps being generated (but no longer printed) while > holding a leaf rcu_node lock. > > Hence this commit. That's smart to do and makes sense, thanks! - Joel > > Thanx, Paul > > > thanks, > > > > - Joel > > > > > > > Thanx, Paul > > > > > > ------------------------------------------------------------------------ > > > > > > b/include/linux/rcutiny.h | 1 - > > > b/include/linux/rcutree.h | 1 - > > > b/kernel/rcu/tree_stall.h | 30 ------------------------------ > > > kernel/rcu/tree_stall.h | 35 ++++++++++++++++++++++------------- > > > 4 files changed, 22 insertions(+), 45 deletions(-) > > >