From: Mayank Rungta via B4 Relay
Date: Thu, 12 Mar 2026 16:22:05 -0700
Subject: [PATCH v2 4/5] watchdog/hardlockup: improve buddy system detection timeliness
X-Mailing-List: linux-kernel@vger.kernel.org
Message-Id: <20260312-hardlockup-watchdog-fixes-v2-4-45bd8a0cc7ed@google.com>
References: <20260312-hardlockup-watchdog-fixes-v2-0-45bd8a0cc7ed@google.com>
In-Reply-To: <20260312-hardlockup-watchdog-fixes-v2-0-45bd8a0cc7ed@google.com>
To: Petr Mladek, Jinchao Wang, Yunhui Cui, Stephane Eranian, Ian Rogers,
 Li Huafei, Feng Tang, Max Kellermann, Jonathan Corbet, Douglas Anderson,
 Andrew Morton, Florian Delizy, Shuah Khan
Cc: linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, Mayank Rungta
Reply-To: mrungta@google.com

From: Mayank Rungta

Currently, the buddy system only performs checks every 3rd sample. With
a 4-second sample interval, that is one check every 12 seconds.
If a check window is missed, the next check occurs 12 seconds later,
potentially delaying hard lockup detection for up to 24 seconds.

Modify the buddy system to perform checks at every interval (4s).
Introduce a missed-interrupt threshold to maintain the existing grace
period while reducing the detection window to 8-12 seconds.

Best and worst case detection scenarios:

Before (12s check window):
- Best case: Lockup occurs after first check but just before heartbeat
  interval. Detected in ~8s (8s till next check).
- Worst case: Lockup occurs just after a check.
  Detected in ~24s (missed check + 12s till next check + 12s logic).

After (4s check window with threshold of 3):
- Best case: Lockup occurs just before a check.
  Detected in ~8s (0s till 1st check + 4s till 2nd + 4s till 3rd).
- Worst case: Lockup occurs just after a check.
  Detected in ~12s (4s till 1st check + 4s till 2nd + 4s till 3rd).

Reviewed-by: Douglas Anderson
Signed-off-by: Mayank Rungta
Reviewed-by: Petr Mladek
---
 include/linux/nmi.h     |  1 +
 kernel/watchdog.c       | 19 ++++++++++++++++---
 kernel/watchdog_buddy.c |  9 +--------
 3 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 207156f2143c..bc1162895f35 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -21,6 +21,7 @@ void lockup_detector_soft_poweroff(void);
 extern int watchdog_user_enabled;
 extern int watchdog_thresh;
 extern unsigned long watchdog_enabled;
+extern int watchdog_hardlockup_miss_thresh;
 
 extern struct cpumask watchdog_cpumask;
 extern unsigned long *watchdog_cpumask_bits;
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 431c540bd035..87dd5e0f6968 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -60,6 +60,13 @@ unsigned long *watchdog_cpumask_bits = cpumask_bits(&watchdog_cpumask);
 int __read_mostly sysctl_hardlockup_all_cpu_backtrace;
 # endif /* CONFIG_SMP */
 
+/*
+ * Number of consecutive missed interrupts before declaring a lockup.
+ * Default to 1 (immediate) for NMI/Perf. Buddy will overwrite this to 3.
+ */
+int __read_mostly watchdog_hardlockup_miss_thresh = 1;
+EXPORT_SYMBOL_GPL(watchdog_hardlockup_miss_thresh);
+
 /*
  * Should we panic when a soft-lockup or hard-lockup occurs:
  */
@@ -137,6 +144,7 @@ __setup("nmi_watchdog=", hardlockup_panic_setup);
 
 static DEFINE_PER_CPU(atomic_t, hrtimer_interrupts);
 static DEFINE_PER_CPU(int, hrtimer_interrupts_saved);
+static DEFINE_PER_CPU(int, hrtimer_interrupts_missed);
 static DEFINE_PER_CPU(bool, watchdog_hardlockup_warned);
 static DEFINE_PER_CPU(bool, watchdog_hardlockup_touched);
 static unsigned long hard_lockup_nmi_warn;
@@ -159,7 +167,7 @@ void watchdog_hardlockup_touch_cpu(unsigned int cpu)
 	per_cpu(watchdog_hardlockup_touched, cpu) = true;
 }
 
-static void watchdog_hardlockup_update(unsigned int cpu)
+static void watchdog_hardlockup_update_reset(unsigned int cpu)
 {
 	int hrint = atomic_read(&per_cpu(hrtimer_interrupts, cpu));
 
@@ -169,6 +177,7 @@ static void watchdog_hardlockup_update(unsigned int cpu)
 	 * written/read by a single CPU.
 	 */
 	per_cpu(hrtimer_interrupts_saved, cpu) = hrint;
+	per_cpu(hrtimer_interrupts_missed, cpu) = 0;
 }
 
 static bool is_hardlockup(unsigned int cpu)
@@ -176,10 +185,14 @@ static bool is_hardlockup(unsigned int cpu)
 	int hrint = atomic_read(&per_cpu(hrtimer_interrupts, cpu));
 
 	if (per_cpu(hrtimer_interrupts_saved, cpu) != hrint) {
-		watchdog_hardlockup_update(cpu);
+		watchdog_hardlockup_update_reset(cpu);
 		return false;
 	}
 
+	per_cpu(hrtimer_interrupts_missed, cpu)++;
+	if (per_cpu(hrtimer_interrupts_missed, cpu) % watchdog_hardlockup_miss_thresh)
+		return false;
+
 	return true;
 }
 
@@ -198,7 +211,7 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
 	unsigned long flags;
 
 	if (per_cpu(watchdog_hardlockup_touched, cpu)) {
-		watchdog_hardlockup_update(cpu);
+		watchdog_hardlockup_update_reset(cpu);
 		per_cpu(watchdog_hardlockup_touched, cpu) = false;
 		return;
 	}
diff --git a/kernel/watchdog_buddy.c b/kernel/watchdog_buddy.c
index ee754d767c21..3a1e57080c1c 100644
--- a/kernel/watchdog_buddy.c
+++ b/kernel/watchdog_buddy.c
@@ -21,6 +21,7 @@ static unsigned int watchdog_next_cpu(unsigned int cpu)
 
 int __init watchdog_hardlockup_probe(void)
 {
+	watchdog_hardlockup_miss_thresh = 3;
 	return 0;
 }
 
@@ -86,14 +87,6 @@ void watchdog_buddy_check_hardlockup(int hrtimer_interrupts)
 {
 	unsigned int next_cpu;
 
-	/*
-	 * Test for hardlockups every 3 samples. The sample period is
-	 * watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
-	 * watchdog_thresh (over by 20%).
-	 */
-	if (hrtimer_interrupts % 3 != 0)
-		return;
-
 	/* check for a hardlockup on the next CPU */
 	next_cpu = watchdog_next_cpu(smp_processor_id());
 	if (next_cpu >= nr_cpu_ids)
-- 
2.53.0.851.ga537e3e6e9-goog
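
The "After" arithmetic in the commit message (8s best case, 12s worst
case with a threshold of 3) can be checked with a small userspace model
of the patched is_hardlockup(). This is only a sketch, not kernel code:
struct hl_state, model_is_hardlockup(), checks_until_report() and
detection_latency_s() are hypothetical names introduced for
illustration, and the per-CPU variables and atomic counter from
kernel/watchdog.c are replaced by a plain struct. It assumes the
heartbeat counter never advances once the lockup starts and that the
first check after the lockup already counts as a miss, as in the
scenarios listed in the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the per-CPU state (hypothetical, illustration only). */
struct hl_state {
	int saved;	/* last heartbeat count observed by the checker */
	int missed;	/* consecutive checks that saw no new heartbeat */
};

/*
 * Mirrors the logic of the patched is_hardlockup(): a changed counter
 * resets the miss count; otherwise report on every miss_thresh-th
 * consecutive stalled check.
 */
bool model_is_hardlockup(struct hl_state *s, int hrint, int miss_thresh)
{
	assert(miss_thresh > 0);

	if (s->saved != hrint) {
		s->saved = hrint;
		s->missed = 0;
		return false;
	}

	s->missed++;
	return (s->missed % miss_thresh) == 0;
}

/*
 * Number of checks, counted from the first stalled one, until a lockup
 * is reported when the heartbeat counter stays frozen.
 */
int checks_until_report(int miss_thresh)
{
	struct hl_state s = { .saved = 7, .missed = 0 };
	int checks = 0;

	do {
		checks++;
	} while (!model_is_hardlockup(&s, 7, miss_thresh));

	return checks;
}

/*
 * Detection latency in seconds, given the delay between the lockup and
 * the first stalled check, and the check interval.
 */
int detection_latency_s(int delay_to_first_check_s, int interval_s,
			int miss_thresh)
{
	return delay_to_first_check_s +
	       (checks_until_report(miss_thresh) - 1) * interval_s;
}
```

Under these assumptions, detection_latency_s(0, 4, 3) evaluates to 8 and
detection_latency_s(4, 4, 3) to 12, matching the best and worst cases
above. With a threshold of 1 (the NMI/perf default), the first stalled
check reports the lockup.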