This issue was found when an EFI pstore was configured for kdump
logging with the NMI hard lockup detector enabled. The efi-pstore
write operation was slow, and with a large number of logs, the
pstore dump callback within kmsg_dump() took a long time.
This delay triggered the NMI watchdog, leading to a nested panic.
The call flow demonstrates how the secondary panic caused an
emergency_restart() to be triggered before the initial pstore
operation could finish, leading to a failure to dump the logs:
real panic() {
kmsg_dump() {
...
pstore_dump() {
start_dump();
... // long time operation triggers NMI watchdog
nmi panic() {
...
emergency_restart(); // pstore unfinished
}
...
finish_dump(); // never reached
}
}
}
Both watchdog_buddy_check_hardlockup() and watchdog_overflow_callback() may
trigger during a panic. This can lead to recursive panic handling.
Add panic_in_progress() checks so watchdog activity is skipped once a panic
has begun.
This prevents recursive panic and keeps the panic path more reliable.
Signed-off-by: Jinchao Wang <wangjinchao600@gmail.com>
Reviewed-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
---
kernel/watchdog.c | 6 ++++++
kernel/watchdog_perf.c | 4 ++++
2 files changed, 10 insertions(+)
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 80b56c002c7f..597c0d947c93 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -740,6 +740,12 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
if (!watchdog_enabled)
return HRTIMER_NORESTART;
+ /*
+ * pass the buddy check if a panic is in process
+ */
+ if (panic_in_progress())
+ return HRTIMER_NORESTART;
+
watchdog_hardlockup_kick();
/* kick the softlockup detector */
diff --git a/kernel/watchdog_perf.c b/kernel/watchdog_perf.c
index 9c58f5b4381d..d3ca70e3c256 100644
--- a/kernel/watchdog_perf.c
+++ b/kernel/watchdog_perf.c
@@ -12,6 +12,7 @@
#define pr_fmt(fmt) "NMI watchdog: " fmt
+#include <linux/panic.h>
#include <linux/nmi.h>
#include <linux/atomic.h>
#include <linux/module.h>
@@ -108,6 +109,9 @@ static void watchdog_overflow_callback(struct perf_event *event,
/* Ensure the watchdog never gets throttled */
event->hw.interrupts = 0;
+ if (panic_in_progress())
+ return;
+
if (!watchdog_check_timestamp())
return;
--
2.43.0