Context of the Issue:
In an ARM64 environment, the following steps were performed:
1. Repeatedly ran stress-ng to stress the CPU, memory, and I/O.
2. Cyclically executed test case pty06 from the LTP test suite.
3. Added mitigations=off to the GRUB parameters.
After 1–2 hours of stress testing, a hardlockup occurred,
causing a system crash.
Root Cause of the Hardlockup:
Each time stress-ng starts, it invokes the /sys/kernel/debug/clear_warn_once
interface, which clears the values in the memory section from __start_once
to __end_once. This caused functions like pr_info_once() — originally
designed to print only once — to print again every time stress-ng was called.
If the pty06 test case happened to be using the serial module at that same
moment, it would sleep in waiter.list within the __down_common function.
After pr_info_once() completed its output using the serial module,
it invoked the semaphore up() function to wake up the process waiting
in waiter.list. This sequence triggered an A-A deadlock, ultimately
leading to a hardlockup and system crash.
To prevent this, use a function-local static flag, which lies outside
the __start_once..__end_once section, so that the message is printed
only once.
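The root cause above can be modeled in a few lines of userspace C. This is only a sketch, not kernel code: every name in it is hypothetical, one boolean stands in for a flag in the __start_once..__end_once section, and a second boolean stands in for a flag that the clear does not touch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a flag inside __start_once..__end_once. */
static bool section_once_flag;
/* Hypothetical stand-in for a flag outside that section. */
static bool private_once_flag;

static int section_prints;
static int private_prints;

/* Models pr_info_once(): "prints" only while the section flag is clear. */
void model_pr_info_once(void)
{
	if (!section_once_flag) {
		section_once_flag = true;
		section_prints++;
	}
}

/* Models a print guarded by a flag that clear_warn_once cannot reach. */
void model_private_once(void)
{
	if (!private_once_flag) {
		private_once_flag = true;
		private_prints++;
	}
}

/* Models a write to clear_warn_once: only section flags are reset. */
void model_clear_warn_once(void)
{
	section_once_flag = false;
}

/* Each round models one stress-ng start followed by one print attempt. */
void model_stress_rounds(int rounds)
{
	for (int i = 0; i < rounds; i++) {
		model_clear_warn_once();
		model_pr_info_once();
		model_private_once();
	}
}

int model_section_prints(void) { return section_prints; }
int model_private_prints(void) { return private_prints; }
```

After three rounds the section-backed message has been emitted three times and the privately guarded one only once, which is the difference the proposed change relies on.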
Hard lockup call stack:
_raw_spin_lock_nested+168
ttwu_queue+180 (rq_lock(rq, &rf); 2nd acquiring the rq->__lock)
try_to_wake_up+548
wake_up_process+32
__up+88
up+100
__up_console_sem+96
console_unlock+696
vprintk_emit+428
vprintk_default+64
vprintk_func+220
printk+104
spectre_v4_enable_task_mitigation+344
__switch_to+100
__schedule+1028 (rq_lock(rq, &rf); 1st acquiring the rq->__lock)
schedule_idle+48
do_idle+388
cpu_startup_entry+44
secondary_start_kernel+352
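The A-A pattern in this call stack can be illustrated with a small userspace model. It is a sketch under stated assumptions: all names are hypothetical, a C11 atomic_flag stands in for rq->__lock, and a trylock is used in place of a real spinning lock so that the second acquisition reports failure instead of spinning forever.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for rq->__lock. */
static atomic_flag rq_lock_model = ATOMIC_FLAG_INIT;

/* Returns true if the lock was taken, false if it is already held. */
bool model_trylock(void)
{
	return !atomic_flag_test_and_set_explicit(&rq_lock_model,
						  memory_order_acquire);
}

void model_unlock(void)
{
	atomic_flag_clear_explicit(&rq_lock_model, memory_order_release);
}

/* Models the wake-up path (ttwu_queue), which needs the lock itself. */
bool model_wakeup_path(void)
{
	bool got = model_trylock();

	if (got)
		model_unlock();
	return got;
}

/*
 * Models the context-switch path: the lock is taken first (__schedule),
 * then a print ends up in the wake-up path while the lock is still held.
 * Returns whether the nested wake-up could take the lock.
 */
bool model_switch_with_print(void)
{
	bool wake_ok;

	if (!model_trylock())
		return false;
	wake_ok = model_wakeup_path();	/* 2nd acquisition attempt */
	model_unlock();
	return wake_ok;
}
```

The nested attempt fails while the outer path holds the lock; with a real spinlock the CPU would spin on itself, which matches the reported hard lockup.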
Signed-off-by: shechenglong <shechenglong@xfusion.com>
---
arch/arm64/kernel/proton-pack.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kernel/proton-pack.c b/arch/arm64/kernel/proton-pack.c
index edf1783ffc81..f8663157e041 100644
--- a/arch/arm64/kernel/proton-pack.c
+++ b/arch/arm64/kernel/proton-pack.c
@@ -424,8 +424,10 @@ static bool spectre_v4_mitigations_off(void)
 	bool ret = cpu_mitigations_off() ||
 		   __spectre_v4_policy == SPECTRE_V4_POLICY_MITIGATION_DISABLED;
 
-	if (ret)
-		pr_info_once("spectre-v4 mitigation disabled by command-line option\n");
+	static atomic_t __printk_once = ATOMIC_INIT(0);
+
+	if (ret && !atomic_cmpxchg(&__printk_once, 0, 1))
+		pr_info("spectre-v4 mitigation disabled by command-line option\n");
 
 	return ret;
 }
--
2.33.0
On Thu, Sep 18, 2025 at 02:49:07PM +0800, shechenglong wrote:
> Context of the Issue:
> In an ARM64 environment, the following steps were performed:
>
> 1. Repeatedly ran stress-ng to stress the CPU, memory, and I/O.
> 2. Cyclically executed test case pty06 from the LTP test suite.
> 3. Added mitigations=off to the GRUB parameters.
>
> After 1–2 hours of stress testing, a hardlockup occurred,
> causing a system crash.
>
> Root Cause of the Hardlockup:
> Each time stress-ng starts, it invokes the /sys/kernel/debug/clear_warn_once
> interface, which clears the values in the memory section from __start_once
> to __end_once. This caused functions like pr_info_once() — originally
> designed to print only once — to print again every time stress-ng was called.
> If the pty06 test case happened to be using the serial module at that same
> moment, it would sleep in waiter.list within the __down_common function.
>
> After pr_info_once() completed its output using the serial module,
> it invoked the semaphore up() function to wake up the process waiting
> in waiter.list. This sequence triggered an A-A deadlock, ultimately
> leading to a hardlockup and system crash.
>
> To prevent this, a local variable should be used to control and ensure
> the print operation occurs only once.
>
> Hard lockup call stack:
>
> _raw_spin_lock_nested+168
> ttwu_queue+180 (rq_lock(rq, &rf); 2nd acquiring the rq->__lock)
> try_to_wake_up+548
> wake_up_process+32
> __up+88
> up+100
> __up_console_sem+96
> console_unlock+696
> vprintk_emit+428
> vprintk_default+64
> vprintk_func+220
> printk+104
> spectre_v4_enable_task_mitigation+344
> __switch_to+100
> __schedule+1028 (rq_lock(rq, &rf); 1st acquiring the rq->__lock)
> schedule_idle+48
> do_idle+388
> cpu_startup_entry+44
> secondary_start_kernel+352

Is the problem actually that we call the spectre v4 stuff on the
switch_to() path (we can't change this) under the rq_lock() and it
subsequently calls printk() which takes the console semaphore? I think
the "once" aspect makes it less likely but does not address the actual
problem.

> diff --git a/arch/arm64/kernel/proton-pack.c b/arch/arm64/kernel/proton-pack.c
> index edf1783ffc81..f8663157e041 100644
> --- a/arch/arm64/kernel/proton-pack.c
> +++ b/arch/arm64/kernel/proton-pack.c
> @@ -424,8 +424,10 @@ static bool spectre_v4_mitigations_off(void)
>  	bool ret = cpu_mitigations_off() ||
>  		   __spectre_v4_policy == SPECTRE_V4_POLICY_MITIGATION_DISABLED;
>
> -	if (ret)
> -		pr_info_once("spectre-v4 mitigation disabled by command-line option\n");
> +	static atomic_t __printk_once = ATOMIC_INIT(0);
> +
> +	if (ret && !atomic_cmpxchg(&__printk_once, 0, 1))
> +		pr_info("spectre-v4 mitigation disabled by command-line option\n");
>
>  	return ret;
>  }

I think we should just avoid the printk() on the
spectre_v4_enable_task_mitigation() path. Well, I'd remove it altogether
from the spectre_v4_mitigations_off() as it's called on kernel entry as
well. Just add a different way to print the status during kernel boot if
there isn't one already, maybe an initcall.

-- 
Catalin
On Thu, Sep 18, 2025 at 12:28:05PM +0100, Catalin Marinas wrote:
> On Thu, Sep 18, 2025 at 02:49:07PM +0800, shechenglong wrote:
> > Context of the Issue:
> > In an ARM64 environment, the following steps were performed:
> >
> > 1. Repeatedly ran stress-ng to stress the CPU, memory, and I/O.
> > 2. Cyclically executed test case pty06 from the LTP test suite.
> > 3. Added mitigations=off to the GRUB parameters.
> >
> > After 1–2 hours of stress testing, a hardlockup occurred,
> > causing a system crash.
> >
> > Root Cause of the Hardlockup:
> > Each time stress-ng starts, it invokes the /sys/kernel/debug/clear_warn_once
> > interface, which clears the values in the memory section from __start_once
> > to __end_once. This caused functions like pr_info_once() — originally
> > designed to print only once — to print again every time stress-ng was called.
> > If the pty06 test case happened to be using the serial module at that same
> > moment, it would sleep in waiter.list within the __down_common function.
> >
> > After pr_info_once() completed its output using the serial module,
> > it invoked the semaphore up() function to wake up the process waiting
> > in waiter.list. This sequence triggered an A-A deadlock, ultimately
> > leading to a hardlockup and system crash.
> >
> > To prevent this, a local variable should be used to control and ensure
> > the print operation occurs only once.
> >
> > Hard lockup call stack:
> >
> > _raw_spin_lock_nested+168
> > ttwu_queue+180 (rq_lock(rq, &rf); 2nd acquiring the rq->__lock)
> > try_to_wake_up+548
> > wake_up_process+32
> > __up+88
> > up+100
> > __up_console_sem+96
> > console_unlock+696
> > vprintk_emit+428
> > vprintk_default+64
> > vprintk_func+220
> > printk+104
> > spectre_v4_enable_task_mitigation+344
> > __switch_to+100
> > __schedule+1028 (rq_lock(rq, &rf); 1st acquiring the rq->__lock)
> > schedule_idle+48
> > do_idle+388
> > cpu_startup_entry+44
> > secondary_start_kernel+352
>
> Is the problem actually that we call the spectre v4 stuff on the
> switch_to() path (we can't change this) under the rq_lock() and it
> subsequently calls printk() which takes the console semaphore? I think
> the "once" aspect makes it less likely but does not address the actual
> problem.

Agreed; I think what we do here is structurally wrong, even if (in the
absence of writes to the 'clear_warn_once' file) this happens to largely
do what we want today. We really shouldn't print in accessors for kernel
state.

> > diff --git a/arch/arm64/kernel/proton-pack.c b/arch/arm64/kernel/proton-pack.c
> > index edf1783ffc81..f8663157e041 100644
> > --- a/arch/arm64/kernel/proton-pack.c
> > +++ b/arch/arm64/kernel/proton-pack.c
> > @@ -424,8 +424,10 @@ static bool spectre_v4_mitigations_off(void)
> >  	bool ret = cpu_mitigations_off() ||
> >  		   __spectre_v4_policy == SPECTRE_V4_POLICY_MITIGATION_DISABLED;
> >
> > -	if (ret)
> > -		pr_info_once("spectre-v4 mitigation disabled by command-line option\n");
> > +	static atomic_t __printk_once = ATOMIC_INIT(0);
> > +
> > +	if (ret && !atomic_cmpxchg(&__printk_once, 0, 1))
> > +		pr_info("spectre-v4 mitigation disabled by command-line option\n");
> >
> >  	return ret;
> >  }
>
> I think we should just avoid the printk() on the
> spectre_v4_enable_task_mitigation() path. Well, I'd remove it altogether
> from the spectre_v4_mitigations_off() as it's called on kernel entry as
> well. Just add a different way to print the status during kernel boot if
> there isn't one already, maybe an initcall.

I agree; I think we want to rip that out of spectre_v2_mitigations_off()
too. We print a bunch of things under setup_system_capabilities(), so
hanging something off that feels like the right thing to do.

Mark.
Okay, understood. Thank you! May I ask when the fix/patch is expected to be available?

-----Original Message-----
From: Catalin Marinas <catalin.marinas@arm.com>
Sent: September 18, 2025 19:28
To: shechenglong <shechenglong@xfusion.com>
Cc: will@kernel.org; linux-arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org; xulei <stone.xulei@xfusion.com>; chenjialong <chenjialong@xfusion.com>; yuxiating <yuxiating@xfusion.com>
Subject: Re: [PATCH] cpu: fix hard lockup triggered during stress-ng stress testing.

On Thu, Sep 18, 2025 at 02:49:07PM +0800, shechenglong wrote:
> Context of the Issue:
> In an ARM64 environment, the following steps were performed:
>
> 1. Repeatedly ran stress-ng to stress the CPU, memory, and I/O.
> 2. Cyclically executed test case pty06 from the LTP test suite.
> 3. Added mitigations=off to the GRUB parameters.
>
> After 1–2 hours of stress testing, a hardlockup occurred, causing a
> system crash.
>
> Root Cause of the Hardlockup:
> Each time stress-ng starts, it invokes the
> /sys/kernel/debug/clear_warn_once interface, which clears the values
> in the memory section from __start_once to __end_once. This caused
> functions like pr_info_once() — originally designed to print only once — to print again every time stress-ng was called.
> If the pty06 test case happened to be using the serial module at that
> same moment, it would sleep in waiter.list within the __down_common function.
>
> After pr_info_once() completed its output using the serial module, it
> invoked the semaphore up() function to wake up the process waiting in
> waiter.list. This sequence triggered an A-A deadlock, ultimately
> leading to a hardlockup and system crash.
>
> To prevent this, a local variable should be used to control and ensure
> the print operation occurs only once.
>
> Hard lockup call stack:
>
> _raw_spin_lock_nested+168
> ttwu_queue+180 (rq_lock(rq, &rf); 2nd acquiring the rq->__lock)
> try_to_wake_up+548
> wake_up_process+32
> __up+88
> up+100
> __up_console_sem+96
> console_unlock+696
> vprintk_emit+428
> vprintk_default+64
> vprintk_func+220
> printk+104
> spectre_v4_enable_task_mitigation+344
> __switch_to+100
> __schedule+1028 (rq_lock(rq, &rf); 1st acquiring the rq->__lock)
> schedule_idle+48
> do_idle+388
> cpu_startup_entry+44
> secondary_start_kernel+352

Is the problem actually that we call the spectre v4 stuff on the
switch_to() path (we can't change this) under the rq_lock() and it
subsequently calls printk() which takes the console semaphore? I think
the "once" aspect makes it less likely but does not address the actual
problem.

> diff --git a/arch/arm64/kernel/proton-pack.c
> b/arch/arm64/kernel/proton-pack.c index edf1783ffc81..f8663157e041
> 100644
> --- a/arch/arm64/kernel/proton-pack.c
> +++ b/arch/arm64/kernel/proton-pack.c
> @@ -424,8 +424,10 @@ static bool spectre_v4_mitigations_off(void)
>  	bool ret = cpu_mitigations_off() ||
>  		   __spectre_v4_policy == SPECTRE_V4_POLICY_MITIGATION_DISABLED;
>
> -	if (ret)
> -		pr_info_once("spectre-v4 mitigation disabled by command-line option\n");
> +	static atomic_t __printk_once = ATOMIC_INIT(0);
> +
> +	if (ret && !atomic_cmpxchg(&__printk_once, 0, 1))
> +		pr_info("spectre-v4 mitigation disabled by command-line option\n");
>
>  	return ret;
>  }

I think we should just avoid the printk() on the
spectre_v4_enable_task_mitigation() path. Well, I'd remove it altogether
from the spectre_v4_mitigations_off() as it's called on kernel entry as
well. Just add a different way to print the status during kernel boot if
there isn't one already, maybe an initcall.

-- 
Catalin
On Fri, Sep 19, 2025 at 12:05:38PM +0000, shechenglong wrote:
> Okay, understood. Thank you! May I ask when the fix/patch is expected
> to be available?

If you send one, that could be really soon ;). See Mark's suggestions
for where to add the pr_info().

-- 
Catalin
Relocate the printk() calls from spectre_v4_mitigations_off() and
spectre_v2_mitigations_off() into setup_system_capabilities(). This
prevents the hard lockup that occurs when printk() is invoked from
scheduler context, where rq->__lock is already held.
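The shape of this change can be sketched in userspace C. This is a minimal model, not the kernel implementation: every name is hypothetical, and a plain boolean stands in for the command-line policy. The accessor has no side effects, and the message is emitted from a single init-time reporting function.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the state set by mitigations=off. */
static bool mitigations_off_model = true;
static int report_lines;

/* Side-effect-free accessor: it never prints, so any context may call it. */
bool model_spectre_v4_mitigations_off(void)
{
	return mitigations_off_model;
}

/* Called once from init code, where printing is harmless. */
void model_report_at_boot(void)
{
	if (model_spectre_v4_mitigations_off()) {
		report_lines++;
		printf("spectre-v4 mitigation disabled by command-line option\n");
	}
}

/* Models many context switches that only query the state. */
int model_hot_path(int switches)
{
	int off = 0;

	for (int i = 0; i < switches; i++)
		off += model_spectre_v4_mitigations_off();
	return off;
}

int model_report_count(void)
{
	return report_lines;
}
```

However often the hot path queries the accessor, the message count changes only when the init-time report runs.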
Link: https://patchwork.kernel.org/project/linux-arm-kernel/patch/20250918064907.1832-1-shechenglong@xfusion.com/
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: shechenglong <shechenglong@xfusion.com>
---
arch/arm64/include/asm/spectre.h | 3 +++
arch/arm64/kernel/cpufeature.c | 9 +++++++++
arch/arm64/kernel/proton-pack.c | 18 ++++--------------
3 files changed, 16 insertions(+), 14 deletions(-)
diff --git a/arch/arm64/include/asm/spectre.h b/arch/arm64/include/asm/spectre.h
index 8fef12626090..6fe29df41788 100644
--- a/arch/arm64/include/asm/spectre.h
+++ b/arch/arm64/include/asm/spectre.h
@@ -118,5 +118,8 @@ void spectre_bhb_patch_wa3(struct alt_instr *alt,
 void spectre_bhb_patch_clearbhb(struct alt_instr *alt,
 				__le32 *origptr, __le32 *updptr, int nr_inst);
 
+bool spectre_v2_mitigations_off(void);
+bool spectre_v4_mitigations_off(void);
+
 #endif /* __ASSEMBLY__ */
 #endif /* __ASM_SPECTRE_H */
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index ef269a5a37e1..7d1f541e66a0 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -94,6 +94,7 @@
 #include <asm/traps.h>
 #include <asm/vectors.h>
 #include <asm/virt.h>
+#include <asm/spectre.h>
 
 /* Kernel representation of AT_HWCAP and AT_HWCAP2 */
 static DECLARE_BITMAP(elf_hwcap, MAX_CPU_FEATURES) __read_mostly;
@@ -3942,6 +3943,14 @@ static void __init setup_system_capabilities(void)
 	 */
 	if (system_uses_ttbr0_pan())
 		pr_info("emulated: Privileged Access Never (PAN) using TTBR0_EL1 switching\n");
+
+	/*
+	 * Report Spectre mitigations status.
+	 */
+	if (spectre_v2_mitigations_off())
+		pr_info("spectre-v2 mitigation disabled by command line option\n");
+	if (spectre_v4_mitigations_off())
+		pr_info("spectre-v4 mitigation disabled by command-line option\n");
 }
 
 void __init setup_system_features(void)
diff --git a/arch/arm64/kernel/proton-pack.c b/arch/arm64/kernel/proton-pack.c
index edf1783ffc81..0d4a8a123e07 100644
--- a/arch/arm64/kernel/proton-pack.c
+++ b/arch/arm64/kernel/proton-pack.c
@@ -89,14 +89,9 @@ static int __init parse_spectre_v2_param(char *str)
 }
 early_param("nospectre_v2", parse_spectre_v2_param);
 
-static bool spectre_v2_mitigations_off(void)
+bool spectre_v2_mitigations_off(void)
 {
-	bool ret = __nospectre_v2 || cpu_mitigations_off();
-
-	if (ret)
-		pr_info_once("spectre-v2 mitigation disabled by command line option\n");
-
-	return ret;
+	return __nospectre_v2 || cpu_mitigations_off();
 }
 
 static const char *get_bhb_affected_string(enum mitigation_state bhb_state)
@@ -419,15 +414,10 @@ early_param("ssbd", parse_spectre_v4_param);
  * with contradictory parameters. The mitigation is always either "off",
  * "dynamic" or "on".
  */
-static bool spectre_v4_mitigations_off(void)
+bool spectre_v4_mitigations_off(void)
 {
-	bool ret = cpu_mitigations_off() ||
+	return cpu_mitigations_off() ||
 		   __spectre_v4_policy == SPECTRE_V4_POLICY_MITIGATION_DISABLED;
-
-	if (ret)
-		pr_info_once("spectre-v4 mitigation disabled by command-line option\n");
-
-	return ret;
 }
 
 /* Do we need to toggle the mitigation state on entry to/exit from the kernel? */
--
2.33.0
On Wed, Sep 24, 2025 at 08:32:47PM +0800, shechenglong wrote:
> relocate the printk() calls from spectre_v4_mitigations_off() and
> spectre_v2_mitigations_off() into setup_system_capabilities() function,
> preventing hard lockups that occur when printk() is invoked from scheduler context.
>
> Link: https://patchwork.kernel.org/project/linux-arm-kernel/patch/20250918064907.1832-1-shechenglong@xfusion.com/
> Suggested-by: Mark Rutland <mark.rutland@arm.com>
> Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
> Signed-off-by: shechenglong <shechenglong@xfusion.com>

It looks fine to me. Thanks.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

Will, do you want to take this for 6.18? It might be worth a cc stable.
In general it's unlikely to happen unless you keep writing to
/sys/kernel/debug/clear_warn_once.

-- 
Catalin
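For anyone trying to reproduce the trigger mentioned above, a write to the debugfs file can be sketched as a small C helper. The helper name write_one() is hypothetical, and the sketch assumes debugfs is mounted at /sys/kernel/debug and that writing "1" is enough to reset the once-flags; it typically needs root.

```c
#include <stdio.h>

/*
 * Writes "1" to the file at `path`.
 * Returns 0 on success and -1 if the file cannot be opened or written.
 */
int write_one(const char *path)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs("1\n", f) == EOF) {
		fclose(f);
		return -1;
	}
	/* fclose() flushes, so a failed write can surface here. */
	return fclose(f) == 0 ? 0 : -1;
}
```

Calling write_one("/sys/kernel/debug/clear_warn_once") in a loop would model what the stress-ng runs were described as doing at each start.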