[PATCH] cpu: fix hard lockup triggered during stress-ng stress testing.

shechenglong posted 1 patch 2 weeks ago
arch/arm64/kernel/proton-pack.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
[PATCH] cpu: fix hard lockup triggered during stress-ng stress testing.
Posted by shechenglong 2 weeks ago
Context of the Issue:
In an ARM64 environment, the following steps were performed:

1. Repeatedly ran stress-ng to stress the CPU, memory, and I/O.
2. Cyclically executed test case pty06 from the LTP test suite.
3. Added mitigations=off to the GRUB parameters.

After 1–2 hours of stress testing, a hard lockup occurred,
crashing the system.

Root Cause of the Hardlockup:
Each time stress-ng starts, it invokes the /sys/kernel/debug/clear_warn_once
interface, which clears the values in the memory section from __start_once
to __end_once. This caused functions like pr_info_once() — originally
designed to print only once — to print again every time stress-ng was called.
If the pty06 test case happened to be using the serial module at that same
moment, it would sleep in waiter.list within the __down_common function.

After pr_info_once() finished writing to the serial console, it called
up() on the console semaphore to wake the process waiting in waiter.list.
The wakeup path (try_to_wake_up() -> ttwu_queue()) then tried to take
rq->__lock, which __schedule() on the same CPU already held. The result
was an A-A deadlock, ultimately leading to a hard lockup and system crash.

To prevent this, use a function-local static flag, which clear_warn_once
does not reset, to ensure the message is printed only once.
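
The difference between the two kinds of "once" flag can be modelled in
user space. The following is a minimal sketch, not kernel code: the array
once_section stands in for the region between __start_once and __end_once,
and model_clear_warn_once(), print_once_resettable() and
print_once_private() are hypothetical names introduced only for
illustration. It assumes that a write to clear_warn_once simply zeroes
every flag in that region.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the kernel's once-flag region (__start_once..__end_once). */
static bool once_section[4];

/* Hypothetical model of a write to clear_warn_once: zero every flag. */
static void model_clear_warn_once(void)
{
	memset(once_section, 0, sizeof(once_section));
}

/* Models pr_info_once(): the flag lives in the resettable region. */
static int print_once_resettable(const char *msg)
{
	bool *already_done = &once_section[0];

	if (*already_done)
		return 0;
	*already_done = true;
	fputs(msg, stdout);
	return 1;	/* one message was printed */
}

/* Models the patch: a private atomic flag that the region reset never touches. */
static int print_once_private(const char *msg)
{
	static atomic_int printed;	/* zero-initialised static storage */
	int expected = 0;

	/* Only the first caller swaps 0 -> 1 and gets to print. */
	if (!atomic_compare_exchange_strong(&printed, &expected, 1))
		return 0;
	fputs(msg, stdout);
	return 1;
}
```

Calling print_once_resettable() twice prints once, but it prints again
after model_clear_warn_once(). print_once_private() stays silent after its
first call regardless of how often the region is cleared. Note that this
only makes the print rarer; the first print can still happen in scheduler
context.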

Hard lockup call stack:

_raw_spin_lock_nested+168
ttwu_queue+180 (rq_lock(rq, &rf); 2nd acquiring the rq->__lock)
try_to_wake_up+548
wake_up_process+32
__up+88
up+100
__up_console_sem+96
console_unlock+696
vprintk_emit+428
vprintk_default+64
vprintk_func+220
printk+104
spectre_v4_enable_task_mitigation+344
__switch_to+100
__schedule+1028 (rq_lock(rq, &rf); 1st acquiring the rq->__lock)
schedule_idle+48
do_idle+388
cpu_startup_entry+44
secondary_start_kernel+352

Signed-off-by: shechenglong <shechenglong@xfusion.com>
---
 arch/arm64/kernel/proton-pack.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kernel/proton-pack.c b/arch/arm64/kernel/proton-pack.c
index edf1783ffc81..f8663157e041 100644
--- a/arch/arm64/kernel/proton-pack.c
+++ b/arch/arm64/kernel/proton-pack.c
@@ -424,8 +424,10 @@ static bool spectre_v4_mitigations_off(void)
 	bool ret = cpu_mitigations_off() ||
 		   __spectre_v4_policy == SPECTRE_V4_POLICY_MITIGATION_DISABLED;
 
-	if (ret)
-		pr_info_once("spectre-v4 mitigation disabled by command-line option\n");
+	static atomic_t __printk_once = ATOMIC_INIT(0);
+
+	if (ret && !atomic_cmpxchg(&__printk_once, 0, 1))
+		pr_info("spectre-v4 mitigation disabled by command-line option\n");
 
 	return ret;
 }
-- 
2.33.0

Re: [PATCH] cpu: fix hard lockup triggered during stress-ng stress testing.
Posted by Catalin Marinas 1 week, 6 days ago
On Thu, Sep 18, 2025 at 02:49:07PM +0800, shechenglong wrote:
> Context of the Issue:
> In an ARM64 environment, the following steps were performed:
> 
> 1. Repeatedly ran stress-ng to stress the CPU, memory, and I/O.
> 2. Cyclically executed test case pty06 from the LTP test suite.
> 3. Added mitigations=off to the GRUB parameters.
> 
> After 1–2 hours of stress testing, a hardlockup occurred,
> causing a system crash.
> 
> Root Cause of the Hardlockup:
> Each time stress-ng starts, it invokes the /sys/kernel/debug/clear_warn_once
> interface, which clears the values in the memory section from __start_once
> to __end_once. This caused functions like pr_info_once() — originally
> designed to print only once — to print again every time stress-ng was called.
> If the pty06 test case happened to be using the serial module at that same
> moment, it would sleep in waiter.list within the __down_common function.
> 
> After pr_info_once() completed its output using the serial module,
> it invoked the semaphore up() function to wake up the process waiting
> in waiter.list. This sequence triggered an A-A deadlock, ultimately
> leading to a hardlockup and system crash.
> 
> To prevent this, a local variable should be used to control and ensure
> the print operation occurs only once.
> 
> Hard lockup call stack:
> 
> _raw_spin_lock_nested+168
> ttwu_queue+180 (rq_lock(rq, &rf); 2nd acquiring the rq->__lock)
> try_to_wake_up+548
> wake_up_process+32
> __up+88
> up+100
> __up_console_sem+96
> console_unlock+696
> vprintk_emit+428
> vprintk_default+64
> vprintk_func+220
> printk+104
> spectre_v4_enable_task_mitigation+344
> __switch_to+100
> __schedule+1028 (rq_lock(rq, &rf); 1st acquiring the rq->__lock)
> schedule_idle+48
> do_idle+388
> cpu_startup_entry+44
> secondary_start_kernel+352

Is the problem actually that we call the spectre v4 stuff on the
switch_to() path (we can't change this) under the rq_lock() and it
subsequently calls printk() which takes the console semaphore? I think
the "once" aspect makes it less likely but does not address the actual
problem.
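
The lock ordering described above can be illustrated with a small
user-space model. This is only a sketch under stated assumptions: a single
atomic_flag stands in for the non-recursive rq->__lock, all model_* names
are hypothetical, and the woken console waiter is assumed to be queued on
the runqueue that is already locked, as the annotated stack suggests. A
try-lock is used instead of a spin so that the sketch reports the
self-deadlock rather than hanging.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the non-recursive rq->__lock. */
static atomic_flag rq_lock_model = ATOMIC_FLAG_INIT;

/* Try-lock: returns true if the lock was free and is now held. */
static bool model_trylock(void)
{
	return !atomic_flag_test_and_set(&rq_lock_model);
}

static void model_unlock(void)
{
	atomic_flag_clear(&rq_lock_model);
}

/* Models console_unlock() -> up() -> try_to_wake_up() -> ttwu_queue(). */
static bool model_wake_console_waiter(void)
{
	if (!model_trylock())
		return false;	/* a real spinlock would spin here forever */
	model_unlock();
	return true;
}

/* Models __schedule() -> __switch_to() -> printk() with the lock held. */
static bool model_schedule_with_print(void)
{
	bool wake_ok;

	if (!model_trylock())
		return false;
	wake_ok = model_wake_console_waiter();	/* second acquisition, same CPU */
	model_unlock();
	return wake_ok;
}
```

Outside the scheduler, model_wake_console_waiter() succeeds. Called from
model_schedule_with_print(), it fails because the lock is already held by
the same caller, which is the A-A pattern in the reported stack. Printing
only once does not change this ordering; it only reduces how often the
path is taken.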

> diff --git a/arch/arm64/kernel/proton-pack.c b/arch/arm64/kernel/proton-pack.c
> index edf1783ffc81..f8663157e041 100644
> --- a/arch/arm64/kernel/proton-pack.c
> +++ b/arch/arm64/kernel/proton-pack.c
> @@ -424,8 +424,10 @@ static bool spectre_v4_mitigations_off(void)
>  	bool ret = cpu_mitigations_off() ||
>  		   __spectre_v4_policy == SPECTRE_V4_POLICY_MITIGATION_DISABLED;
>  
> -	if (ret)
> -		pr_info_once("spectre-v4 mitigation disabled by command-line option\n");
> +	static atomic_t __printk_once = ATOMIC_INIT(0);
> +
> +	if (ret && !atomic_cmpxchg(&__printk_once, 0, 1))
> +		pr_info("spectre-v4 mitigation disabled by command-line option\n");
>  
>  	return ret;
>  }

I think we should just avoid the printk() on the
spectre_v4_enable_task_mitigation() path. Well, I'd remove it altogether
from the spectre_v4_mitigations_off() as it's called on kernel entry as
well. Just add a different way to print the status during kernel boot if
there isn't one already, maybe an initcall.

-- 
Catalin
Re: [PATCH] cpu: fix hard lockup triggered during stress-ng stress testing.
Posted by Mark Rutland 1 week, 2 days ago
On Thu, Sep 18, 2025 at 12:28:05PM +0100, Catalin Marinas wrote:
> On Thu, Sep 18, 2025 at 02:49:07PM +0800, shechenglong wrote:
> > Context of the Issue:
> > In an ARM64 environment, the following steps were performed:
> > 
> > 1. Repeatedly ran stress-ng to stress the CPU, memory, and I/O.
> > 2. Cyclically executed test case pty06 from the LTP test suite.
> > 3. Added mitigations=off to the GRUB parameters.
> > 
> > After 1–2 hours of stress testing, a hardlockup occurred,
> > causing a system crash.
> > 
> > Root Cause of the Hardlockup:
> > Each time stress-ng starts, it invokes the /sys/kernel/debug/clear_warn_once
> > interface, which clears the values in the memory section from __start_once
> > to __end_once. This caused functions like pr_info_once() — originally
> > designed to print only once — to print again every time stress-ng was called.
> > If the pty06 test case happened to be using the serial module at that same
> > moment, it would sleep in waiter.list within the __down_common function.
> > 
> > After pr_info_once() completed its output using the serial module,
> > it invoked the semaphore up() function to wake up the process waiting
> > in waiter.list. This sequence triggered an A-A deadlock, ultimately
> > leading to a hardlockup and system crash.
> > 
> > To prevent this, a local variable should be used to control and ensure
> > the print operation occurs only once.
> > 
> > Hard lockup call stack:
> > 
> > _raw_spin_lock_nested+168
> > ttwu_queue+180 (rq_lock(rq, &rf); 2nd acquiring the rq->__lock)
> > try_to_wake_up+548
> > wake_up_process+32
> > __up+88
> > up+100
> > __up_console_sem+96
> > console_unlock+696
> > vprintk_emit+428
> > vprintk_default+64
> > vprintk_func+220
> > printk+104
> > spectre_v4_enable_task_mitigation+344
> > __switch_to+100
> > __schedule+1028 (rq_lock(rq, &rf); 1st acquiring the rq->__lock)
> > schedule_idle+48
> > do_idle+388
> > cpu_startup_entry+44
> > secondary_start_kernel+352
> 
> Is the problem actually that we call the spectre v4 stuff on the
> switch_to() path (we can't change this) under the rq_lock() and it
> subsequently calls printk() which takes the console semaphore? I think
> the "once" aspect makes it less likely but does not address the actual
> problem.

Agreed; I think what we do here is structurally wrong, even if (in the
absence of writes to the 'clear_warn_once' file) this happens to largely
do what we want today.

We really shouldn't print in accessors for kernel state.

> > diff --git a/arch/arm64/kernel/proton-pack.c b/arch/arm64/kernel/proton-pack.c
> > index edf1783ffc81..f8663157e041 100644
> > --- a/arch/arm64/kernel/proton-pack.c
> > +++ b/arch/arm64/kernel/proton-pack.c
> > @@ -424,8 +424,10 @@ static bool spectre_v4_mitigations_off(void)
> >  	bool ret = cpu_mitigations_off() ||
> >  		   __spectre_v4_policy == SPECTRE_V4_POLICY_MITIGATION_DISABLED;
> >  
> > -	if (ret)
> > -		pr_info_once("spectre-v4 mitigation disabled by command-line option\n");
> > +	static atomic_t __printk_once = ATOMIC_INIT(0);
> > +
> > +	if (ret && !atomic_cmpxchg(&__printk_once, 0, 1))
> > +		pr_info("spectre-v4 mitigation disabled by command-line option\n");
> >  
> >  	return ret;
> >  }
> 
> I think we should just avoid the printk() on the
> spectre_v4_enable_task_mitigation() path. Well, I'd remove it altogether
> from the spectre_v4_mitigations_off() as it's called on kernel entry as
> well. Just add a different way to print the status during kernel boot if
> there isn't one already, maybe an initcall.

I agree; I think we want to rip that out of spectre_v2_mitigations_off()
too.

We print a bunch of things under setup_system_capabilities(), so hanging
something off that feels like the right thing to do.
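
As a rough illustration of that split, here is a user-space sketch rather
than the actual arm64 code. The two bool variables are hypothetical
stand-ins for cpu_mitigations_off() and the parsed policy, and
model_report_mitigations() is an invented name for a one-shot boot-time
reporting hook. The accessor has no side effects, so it is safe in any
calling context, and all printing happens in the report function.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for cpu_mitigations_off() and the parsed policy. */
static bool model_cpu_mitigations_off;
static bool model_v4_policy_disabled;

/* Side-effect-free accessor: it only reads state and never prints. */
static bool model_spectre_v4_mitigations_off(void)
{
	return model_cpu_mitigations_off || model_v4_policy_disabled;
}

/* One-shot boot-time report; returns the number of lines written. */
static int model_report_mitigations(FILE *out)
{
	int lines = 0;

	if (model_spectre_v4_mitigations_off()) {
		fputs("spectre-v4 mitigation disabled by command-line option\n", out);
		lines++;
	}
	return lines;
}
```

With both flags clear, model_report_mitigations(stdout) writes nothing.
After setting model_cpu_mitigations_off, it writes the single status line,
while the accessor itself stays silent however often it is called.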

Mark.
Reply: [PATCH] cpu: fix hard lockup triggered during stress-ng stress testing.
Posted by shechenglong 1 week, 5 days ago
Okay, understood. Thank you! May I ask when the fix/patch is expected to be available?

-----Original Message-----
From: Catalin Marinas <catalin.marinas@arm.com>
Sent: September 18, 2025 19:28
To: shechenglong <shechenglong@xfusion.com>
Cc: will@kernel.org; linux-arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org; xulei <stone.xulei@xfusion.com>; chenjialong <chenjialong@xfusion.com>; yuxiating <yuxiating@xfusion.com>
Subject: Re: [PATCH] cpu: fix hard lockup triggered during stress-ng stress testing.

On Thu, Sep 18, 2025 at 02:49:07PM +0800, shechenglong wrote:
> Context of the Issue:
> In an ARM64 environment, the following steps were performed:
> 
> 1. Repeatedly ran stress-ng to stress the CPU, memory, and I/O.
> 2. Cyclically executed test case pty06 from the LTP test suite.
> 3. Added mitigations=off to the GRUB parameters.
> 
> After 1–2 hours of stress testing, a hardlockup occurred, causing a 
> system crash.
> 
> Root Cause of the Hardlockup:
> Each time stress-ng starts, it invokes the 
> /sys/kernel/debug/clear_warn_once interface, which clears the values 
> in the memory section from __start_once to __end_once. This caused 
> functions like pr_info_once() — originally designed to print only once — to print again every time stress-ng was called.
> If the pty06 test case happened to be using the serial module at that 
> same moment, it would sleep in waiter.list within the __down_common function.
> 
> After pr_info_once() completed its output using the serial module, it 
> invoked the semaphore up() function to wake up the process waiting in 
> waiter.list. This sequence triggered an A-A deadlock, ultimately 
> leading to a hardlockup and system crash.
> 
> To prevent this, a local variable should be used to control and ensure 
> the print operation occurs only once.
> 
> Hard lockup call stack:
> 
> _raw_spin_lock_nested+168
> ttwu_queue+180 (rq_lock(rq, &rf); 2nd acquiring the rq->__lock)
> try_to_wake_up+548
> wake_up_process+32
> __up+88
> up+100
> __up_console_sem+96
> console_unlock+696
> vprintk_emit+428
> vprintk_default+64
> vprintk_func+220
> printk+104
> spectre_v4_enable_task_mitigation+344
> __switch_to+100
> __schedule+1028 (rq_lock(rq, &rf); 1st acquiring the rq->__lock)
> schedule_idle+48
> do_idle+388
> cpu_startup_entry+44
> secondary_start_kernel+352

Is the problem actually that we call the spectre v4 stuff on the
switch_to() path (we can't change this) under the rq_lock() and it
subsequently calls printk() which takes the console semaphore? I think
the "once" aspect makes it less likely but does not address the actual
problem.

> diff --git a/arch/arm64/kernel/proton-pack.c b/arch/arm64/kernel/proton-pack.c
> index edf1783ffc81..f8663157e041 100644
> --- a/arch/arm64/kernel/proton-pack.c
> +++ b/arch/arm64/kernel/proton-pack.c
> @@ -424,8 +424,10 @@ static bool spectre_v4_mitigations_off(void)
>  	bool ret = cpu_mitigations_off() ||
>  		   __spectre_v4_policy == SPECTRE_V4_POLICY_MITIGATION_DISABLED;
>  
> -	if (ret)
> -		pr_info_once("spectre-v4 mitigation disabled by command-line option\n");
> +	static atomic_t __printk_once = ATOMIC_INIT(0);
> +
> +	if (ret && !atomic_cmpxchg(&__printk_once, 0, 1))
> +		pr_info("spectre-v4 mitigation disabled by command-line option\n");
>  
>  	return ret;
>  }

I think we should just avoid the printk() on the
spectre_v4_enable_task_mitigation() path. Well, I'd remove it altogether
from the spectre_v4_mitigations_off() as it's called on kernel entry as
well. Just add a different way to print the status during kernel boot if
there isn't one already, maybe an initcall.

--
Catalin
Re: Reply: [PATCH] cpu: fix hard lockup triggered during stress-ng stress testing.
Posted by Catalin Marinas 1 week, 2 days ago
On Fri, Sep 19, 2025 at 12:05:38PM +0000, shechenglong wrote:
> Okay, understood. Thank you! May I ask when the fix/patch is expected
> to be available?

If you send one, that could be really soon ;). See Mark's suggestions
for where to add the pr_info().

-- 
Catalin
[PATCH] cpu: fix hard lockup triggered by printk calls within scheduling context
Posted by shechenglong 1 week ago
Relocate the printk() calls from spectre_v4_mitigations_off() and
spectre_v2_mitigations_off() into setup_system_capabilities(),
preventing hard lockups that occur when printk() is invoked from
scheduler context.

Link: https://patchwork.kernel.org/project/linux-arm-kernel/patch/20250918064907.1832-1-shechenglong@xfusion.com/
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: shechenglong <shechenglong@xfusion.com>
---
 arch/arm64/include/asm/spectre.h |  3 +++
 arch/arm64/kernel/cpufeature.c   |  9 +++++++++
 arch/arm64/kernel/proton-pack.c  | 18 ++++--------------
 3 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/include/asm/spectre.h b/arch/arm64/include/asm/spectre.h
index 8fef12626090..6fe29df41788 100644
--- a/arch/arm64/include/asm/spectre.h
+++ b/arch/arm64/include/asm/spectre.h
@@ -118,5 +118,8 @@ void spectre_bhb_patch_wa3(struct alt_instr *alt,
 void spectre_bhb_patch_clearbhb(struct alt_instr *alt,
 				__le32 *origptr, __le32 *updptr, int nr_inst);
 
+bool spectre_v2_mitigations_off(void);
+bool spectre_v4_mitigations_off(void);
+
 #endif	/* __ASSEMBLY__ */
 #endif	/* __ASM_SPECTRE_H */
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index ef269a5a37e1..7d1f541e66a0 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -94,6 +94,7 @@
 #include <asm/traps.h>
 #include <asm/vectors.h>
 #include <asm/virt.h>
+#include <asm/spectre.h>
 
 /* Kernel representation of AT_HWCAP and AT_HWCAP2 */
 static DECLARE_BITMAP(elf_hwcap, MAX_CPU_FEATURES) __read_mostly;
@@ -3942,6 +3943,14 @@ static void __init setup_system_capabilities(void)
 	 */
 	if (system_uses_ttbr0_pan())
 		pr_info("emulated: Privileged Access Never (PAN) using TTBR0_EL1 switching\n");
+
+	/*
+	 * Report Spectre mitigations status.
+	 */
+	if (spectre_v2_mitigations_off())
+		pr_info("spectre-v2 mitigation disabled by command line option\n");
+	if (spectre_v4_mitigations_off())
+		pr_info("spectre-v4 mitigation disabled by command-line option\n");
 }
 
 void __init setup_system_features(void)
diff --git a/arch/arm64/kernel/proton-pack.c b/arch/arm64/kernel/proton-pack.c
index edf1783ffc81..0d4a8a123e07 100644
--- a/arch/arm64/kernel/proton-pack.c
+++ b/arch/arm64/kernel/proton-pack.c
@@ -89,14 +89,9 @@ static int __init parse_spectre_v2_param(char *str)
 }
 early_param("nospectre_v2", parse_spectre_v2_param);
 
-static bool spectre_v2_mitigations_off(void)
+bool spectre_v2_mitigations_off(void)
 {
-	bool ret = __nospectre_v2 || cpu_mitigations_off();
-
-	if (ret)
-		pr_info_once("spectre-v2 mitigation disabled by command line option\n");
-
-	return ret;
+	return __nospectre_v2 || cpu_mitigations_off();
 }
 
 static const char *get_bhb_affected_string(enum mitigation_state bhb_state)
@@ -419,15 +414,10 @@ early_param("ssbd", parse_spectre_v4_param);
  * with contradictory parameters. The mitigation is always either "off",
  * "dynamic" or "on".
  */
-static bool spectre_v4_mitigations_off(void)
+bool spectre_v4_mitigations_off(void)
 {
-	bool ret = cpu_mitigations_off() ||
+	return cpu_mitigations_off() ||
 		   __spectre_v4_policy == SPECTRE_V4_POLICY_MITIGATION_DISABLED;
-
-	if (ret)
-		pr_info_once("spectre-v4 mitigation disabled by command-line option\n");
-
-	return ret;
 }
 
 /* Do we need to toggle the mitigation state on entry to/exit from the kernel? */
-- 
2.33.0
Re: [PATCH] cpu: fix hard lockup triggered by printk calls within scheduling context
Posted by Catalin Marinas 6 days, 19 hours ago
On Wed, Sep 24, 2025 at 08:32:47PM +0800, shechenglong wrote:
> relocate the printk() calls from spectre_v4_mitigations_off() and
> spectre_v2_mitigations_off() into setup_system_capabilities() function,
> preventing hard lockups that occur when printk() is invoked from scheduler context.
> 
> Link: https://patchwork.kernel.org/project/linux-arm-kernel/patch/20250918064907.1832-1-shechenglong@xfusion.com/
> Suggested-by: Mark Rutland <mark.rutland@arm.com>
> Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
> Signed-off-by: shechenglong <shechenglong@xfusion.com>

It looks fine to me. Thanks.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

Will, do you want to take this for 6.18? It might be worth a cc stable.
In general it's unlikely to happen unless you keep writing to
/sys/kernel/debug/clear_warn_once.

-- 
Catalin