kernel/locking/qspinlock.c | 4 ++++ kernel/locking/qspinlock.h | 5 +++++ 2 files changed, 9 insertions(+)
From: Yu Peng <pengyu@kylinos.cn>
A hardlock detected on arm64: rq->lock was released, but a CPU
blocked at mcs_node->locked and timed out.
We found xchg_tail and atomic_try_cmpxchg_relaxed used _relaxed
versions without memory barriers. Suspected insufficient coherence
guarantees on some arm64 microarchitectures, potentially leading to
the following issues occurred:
CPU0: CPU1:
// Set tail to CPU0
old = xchg_tail(lock, tail);
//CPU0 read tail is itself
if ((val & _Q_TAIL_MASK) == tail)
// CPU1 exchanges the tail
old = xchg_tail(lock, tail)
//assuming CPU0 not see tail change
atomic_try_cmpxchg_relaxed(
&lock->val, &val, _Q_LOCKED_VAL)
//released without notifying CPU1
goto release;
//hardlock detected
arch_mcs_spin_lock_contended(
&node->locked)
Therefore, xchg_tail and atomic_try_cmpxchg using _mb to replace _relaxed.
Signed-off-by: pengyu <pengyu@kylinos.cn>
---
kernel/locking/qspinlock.c | 4 ++++
kernel/locking/qspinlock.h | 5 +++++
2 files changed, 9 insertions(+)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index af8d122bb649..5e0c489d0d81 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -350,7 +350,11 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
* PENDING will make the uncontended transition fail.
*/
if ((val & _Q_TAIL_MASK) == tail) {
+#if defined(CONFIG_ARM64)
+ if (atomic_try_cmpxchg(&lock->val, &val, _Q_LOCKED_VAL))
+#else
if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
+#endif
goto release; /* No contention */
}
diff --git a/kernel/locking/qspinlock.h b/kernel/locking/qspinlock.h
index d69958a844f7..c89a6877b96b 100644
--- a/kernel/locking/qspinlock.h
+++ b/kernel/locking/qspinlock.h
@@ -113,12 +113,17 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
*/
static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
+#if defined(CONFIG_ARM64)
+ return (u32)xchg(&lock->tail,
+ tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+#else
/*
* We can use relaxed semantics since the caller ensures that the
* MCS node is properly initialized before updating the tail.
*/
return (u32)xchg_relaxed(&lock->tail,
tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+#endif
}
#else /* _Q_PENDING_BITS == 8 */
2.25.1
On Tue, Sep 16, 2025 at 11:39:03AM +0800, pengyu wrote: > From: Yu Peng <pengyu@kylinos.cn> > > A hardlock detected on arm64: rq->lock was released, but a CPU > blocked at mcs_node->locked and timed out. > > We found xchg_tail and atomic_try_cmpxchg_relaxed used _relaxed > versions without memory barriers. Suspected insufficient coherence > guarantees on some arm64 microarchitectures, potentially leading to > the following issues occurred: > > CPU0: CPU1: > // Set tail to CPU0 > old = xchg_tail(lock, tail); > > //CPU0 read tail is itself > if ((val & _Q_TAIL_MASK) == tail) > // CPU1 exchanges the tail > old = xchg_tail(lock, tail) > //assuming CPU0 not see tail change > atomic_try_cmpxchg_relaxed( > &lock->val, &val, _Q_LOCKED_VAL) > //released without notifying CPU1 > goto release; > //hardlock detected > arch_mcs_spin_lock_contended( > &node->locked) > > Therefore, xchg_tail and atomic_try_cmpxchg using _mb to replace _relaxed. Yeah, no. We do not apply patches based on suspicion. And we most certainly do not sprinkle #ifdef ARM64 in generic code. There is this thread: https://lkml.kernel.org/r/cb83e3e4-9e22-4457-bf61-5614cc4396ad@tu-bs.de Which is also concerned with xchg_tail(). Reading back, I'm not sure we've ever heard back from ARM on whether that extra ;si was correct or not, Will? Anyway, as Waiman already asked, please state your exact ARM64 microarch. Barring the ;si, the above thread suggests that they can prove the code correct with the below change, does that resolve your problem? Other than that, I'm going to have to leave this to Will and co. diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index af8d122bb649..249231460ea0 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -261,14 +261,8 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) goto release; /* - * Ensure that the initialisation of @node is complete before we - * publish the updated tail via xchg_tail() and potentially link - * @node into the waitqueue via WRITE_ONCE(prev->next, node) below. - */ - smp_wmb(); - - /* - * Publish the updated tail. + * Publish the updated tail. This is a RELEASE barrier and ensures the + * @node is complete before the link becomes visible. * We have already touched the queueing cacheline; don't bother with * pending stuff. * diff --git a/kernel/locking/qspinlock.h b/kernel/locking/qspinlock.h index d69958a844f7..bb9a006e76eb 100644 --- a/kernel/locking/qspinlock.h +++ b/kernel/locking/qspinlock.h @@ -117,7 +117,7 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) * We can use relaxed semantics since the caller ensures that the * MCS node is properly initialized before updating the tail. */ - return (u32)xchg_relaxed(&lock->tail, + return (u32)xchg_release(&lock->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET; } @@ -167,7 +167,7 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) * the MCS node is properly initialized before updating the * tail. */ - } while (!atomic_try_cmpxchg_relaxed(&lock->val, &old, new)); + } while (!atomic_try_cmpxchg_release(&lock->val, &old, new)); return old; }
On 9/16/25 10:10 AM, Peter Zijlstra wrote: > On Tue, Sep 16, 2025 at 11:39:03AM +0800, pengyu wrote: >> From: Yu Peng <pengyu@kylinos.cn> >> >> A hardlock detected on arm64: rq->lock was released, but a CPU >> blocked at mcs_node->locked and timed out. >> >> We found xchg_tail and atomic_try_cmpxchg_relaxed used _relaxed >> versions without memory barriers. Suspected insufficient coherence >> guarantees on some arm64 microarchitectures, potentially leading to >> the following issues occurred: >> >> CPU0: CPU1: >> // Set tail to CPU0 >> old = xchg_tail(lock, tail); >> >> //CPU0 read tail is itself >> if ((val & _Q_TAIL_MASK) == tail) >> // CPU1 exchanges the tail >> old = xchg_tail(lock, tail) >> //assuming CPU0 not see tail change >> atomic_try_cmpxchg_relaxed( >> &lock->val, &val, _Q_LOCKED_VAL) >> //released without notifying CPU1 >> goto release; >> //hardlock detected >> arch_mcs_spin_lock_contended( >> &node->locked) >> >> Therefore, xchg_tail and atomic_try_cmpxchg using _mb to replace _relaxed. > Yeah, no. We do not apply patches based on suspicion. And we most > certainly do not sprinkle #ifdef ARM64 in generic code. > > There is this thread: > > https://lkml.kernel.org/r/cb83e3e4-9e22-4457-bf61-5614cc4396ad@tu-bs.de Ah, I was not cc'ed on this email thread. That is why I was not aware of this discussion about xchg_tail(). It is an interesting read. Anyway, this particular problem may be about the clarity of the arm64 memory model and whether any microarch's strictly follow it or not. Cheers, Longman
On Tue, Sep 16, 2025 at 04:10:32PM +0200, Peter Zijlstra wrote: > On Tue, Sep 16, 2025 at 11:39:03AM +0800, pengyu wrote: > > From: Yu Peng <pengyu@kylinos.cn> > > > > A hardlock detected on arm64: rq->lock was released, but a CPU > > blocked at mcs_node->locked and timed out. > > > > We found xchg_tail and atomic_try_cmpxchg_relaxed used _relaxed > > versions without memory barriers. Suspected insufficient coherence > > guarantees on some arm64 microarchitectures, potentially leading to > > the following issues occurred: > > > > CPU0: CPU1: > > // Set tail to CPU0 > > old = xchg_tail(lock, tail); > > > > //CPU0 read tail is itself > > if ((val & _Q_TAIL_MASK) == tail) > > // CPU1 exchanges the tail > > old = xchg_tail(lock, tail) > > //assuming CPU0 not see tail change > > atomic_try_cmpxchg_relaxed( > > &lock->val, &val, _Q_LOCKED_VAL) > > //released without notifying CPU1 > > goto release; > > //hardlock detected > > arch_mcs_spin_lock_contended( > > &node->locked) > > > > Therefore, xchg_tail and atomic_try_cmpxchg using _mb to replace _relaxed. > > Yeah, no. We do not apply patches based on suspicion. And we most > certainly do not sprinkle #ifdef ARM64 in generic code. Absolutely. > There is this thread: > > https://lkml.kernel.org/r/cb83e3e4-9e22-4457-bf61-5614cc4396ad@tu-bs.de > > Which is also concerned with xchg_tail(). Reading back, I'm not sure > we've ever heard back from ARM on whether that extra ;si was correct or > not, Will? It's still under discussion with the Arm architects but it was _very_ close to concluding last time we met and I wouldn't worry about it for the purposes of this report. > Anyway, as Waiman already asked, please state your exact ARM64 > microarch. > > Barring the ;si, the above thread suggests that they can prove the code > correct with the below change, does that resolve your problem? > > Other than that, I'm going to have to leave this to Will and co. I'll take a look but it's light on details. Will
On 9/17/25 00:00, Will Deacon wrote: > On Tue, Sep 16, 2025 at 04:10:32PM +0200, Peter Zijlstra wrote: >> On Tue, Sep 16, 2025 at 11:39:03AM +0800, pengyu wrote: >>> From: Yu Peng <pengyu@kylinos.cn> >>> >>> A hardlock detected on arm64: rq->lock was released, but a CPU >>> blocked at mcs_node->locked and timed out. >>> >>> We found xchg_tail and atomic_try_cmpxchg_relaxed used _relaxed >>> versions without memory barriers. Suspected insufficient coherence >>> guarantees on some arm64 microarchitectures, potentially leading to >>> the following issues occurred: >>> >>> CPU0: CPU1: >>> // Set tail to CPU0 >>> old = xchg_tail(lock, tail); >>> >>> //CPU0 read tail is itself >>> if ((val & _Q_TAIL_MASK) == tail) >>> // CPU1 exchanges the tail >>> old = xchg_tail(lock, tail) >>> //assuming CPU0 not see tail change >>> atomic_try_cmpxchg_relaxed( >>> &lock->val, &val, _Q_LOCKED_VAL) >>> //released without notifying CPU1 >>> goto release; >>> //hardlock detected >>> arch_mcs_spin_lock_contended( >>> &node->locked) >>> >>> Therefore, xchg_tail and atomic_try_cmpxchg using _mb to replace _relaxed. >> >> Yeah, no. We do not apply patches based on suspicion. And we most >> certainly do not sprinkle #ifdef ARM64 in generic code. > > Absolutely. > >> There is this thread: >> >> https://lkml.kernel.org/r/cb83e3e4-9e22-4457-bf61-5614cc4396ad@tu-bs.de >> >> Which is also concerned with xchg_tail(). Reading back, I'm not sure >> we've ever heard back from ARM on whether that extra ;si was correct or >> not, Will? > > It's still under discussion with the Arm architects but it was _very_ > close to concluding last time we met and I wouldn't worry about it for > the purposes of this report. > >> Anyway, as Waiman already asked, please state your exact ARM64 >> microarch. >> >> Barring the ;si, the above thread suggests that they can prove the code >> correct with the below change, does that resolve your problem? >> >> Other than that, I'm going to have to leave this to Will and co. > > I'll take a look but it's light on details. > > Will Yes, this issue occurred on a kunpeng920 96-core machine and only affected a small number of systems that had been running for over a year. Vmcore Analysis: • Panic triggered by CPU 83 detecting a hard lockup at queued_spin_lock_slowpath+0x1d8/0x320. • Corresponding code: arch_mcs_spin_lock_contended(&node->locked); • The qspinlock involved was the rq lock, which showed a cleared state: crash> rq.lock,cpu ffffad96ff2907c0 lock = { raw_lock = { { val = { counter = 0 }, { locked = 0 '\000', pending = 0 '\000' }, { locked_pending = 0, tail = 0 } } } }, cpu = 50, • CPU 83’s MCS node remained in a locked=0 state, with no previous node found in the qnodes list. crash> p qnodes:83 per_cpu(qnodes, 83) = $292 = {{ mcs = { next = 0x0, locked = 0, count = 1 } }, crash> p qnodes | grep 83 [83]: ffffadd6bf7914c0 crash> p qnodes:all | grep ffffadd6bf7914c0 crash> • Since rq->lock was cleared, no CPU could notify CPU 83. This issue has occurred multiple times, but the root cause remains unclear. We suspect that CPU 83 may have failed to enqueue itself, potentially due to a failure in the xchg_tail atomic operation. It has been noted that the _relaxed version is used in xchx_tail, and we are uncertain whether this could lead to visibility issues—for example, if CPU 83 modifies lock->tail, but other CPUs fail to observe the change. We are also checking if this is related: https://lkml.kernel.org/r/cb83e3e4-9e22-4457-bf61-5614cc4396ad@tu-bs.de
On Wed, Sep 17, 2025 at 06:51:18PM +0800, pengyu wrote: > Yes, this issue occurred on a kunpeng920 96-core machine and only > affected a small number of systems that had been running for over a > year. > > Vmcore Analysis: > • Panic triggered by CPU 83 detecting a hard lockup at > queued_spin_lock_slowpath+0x1d8/0x320. > > • Corresponding code: > arch_mcs_spin_lock_contended(&node->locked); > > • The qspinlock involved was the rq lock, which showed a cleared state: > crash> rq.lock,cpu ffffad96ff2907c0 > lock = { > raw_lock = { > { > val = { > counter = 0 > }, > { > locked = 0 '\000', > pending = 0 '\000' > }, > { > locked_pending = 0, > tail = 0 > } > } > } > }, > cpu = 50, > > • CPU 83’s MCS node remained in a locked=0 state, with no previous > node found in the qnodes list. > crash> p qnodes:83 > per_cpu(qnodes, 83) = $292 = > {{ > mcs = { > next = 0x0, > locked = 0, > count = 1 > } > }, > crash> p qnodes | grep 83 > [83]: ffffadd6bf7914c0 > crash> p qnodes:all | grep ffffadd6bf7914c0 > crash> > > • Since rq->lock was cleared, no CPU could notify CPU 83. > > This issue has occurred multiple times, but the root cause remains > unclear. We suspect that CPU 83 may have failed to enqueue itself, > potentially due to a failure in the xchg_tail atomic operation. Hmm. For the lock word to be clear with a CPU spinning on its MCS node then something has gone quite badly wrong. I think that would mean that: 1. The spinning CPU has updated tail to point to its node (xchg_tail()) 2. The lock-owning CPU then erroneously cleared the tail field (atomic_try_cmpxchg_relaxed()) But for the cmpxchg() to succeed in (2) then the xchg() in (1) must be ordered after it and the lock word wouldn't end up as zero. This is because RmW atomics must be totally ordered for a given memory location and that applies regardless of their memory ordering properties. Of course, there could be _a_ bug here but, given the information you've been able to provide, it's not obviously as "simple" as a missing memory barrier. Have you confirmed that adding memory barriers makes the problem go away? If you're able to check the thread_info (via sp_el0) of CPU83 in your example, it might be interesting to see whether or not the 'cpu' field has been corrupted. For example, if it ends up being read as -1 then we may compute a tail of 0 when enqueuing our MCS node into the lock word. > It has been noted that the _relaxed version is used in xchx_tail, and we > are uncertain whether this could lead to visibility issues—for example, > if CPU 83 modifies lock->tail, but other CPUs fail to observe the > change. > > We are also checking if this is related: > https://lkml.kernel.org/r/cb83e3e4-9e22-4457-bf61-5614cc4396ad@tu-bs.de Hopefully I (or somebody from Arm) can provide an update soon on this topic but I wouldn't necessarily expect it to help you with this new case. Will
On 2025/9/17 19:37, Will Deacon wrote: > On Wed, Sep 17, 2025 at 06:51:18PM +0800, pengyu wrote: >> Yes, this issue occurred on a kunpeng920 96-core machine and only >> affected a small number of systems that had been running for over a >> year. >> >> Vmcore Analysis: >> • Panic triggered by CPU 83 detecting a hard lockup at >> queued_spin_lock_slowpath+0x1d8/0x320. >> >> • Corresponding code: >> arch_mcs_spin_lock_contended(&node->locked); >> >> • The qspinlock involved was the rq lock, which showed a cleared state: >> crash> rq.lock,cpu ffffad96ff2907c0 >> lock = { >> raw_lock = { >> { >> val = { >> counter = 0 >> }, >> { >> locked = 0 '\000', >> pending = 0 '\000' >> }, >> { >> locked_pending = 0, >> tail = 0 >> } >> } >> } >> }, >> cpu = 50, >> >> • CPU 83’s MCS node remained in a locked=0 state, with no previous >> node found in the qnodes list. >> crash> p qnodes:83 >> per_cpu(qnodes, 83) = $292 = >> {{ >> mcs = { >> next = 0x0, >> locked = 0, >> count = 1 >> } >> }, >> crash> p qnodes | grep 83 >> [83]: ffffadd6bf7914c0 >> crash> p qnodes:all | grep ffffadd6bf7914c0 >> crash> >> >> • Since rq->lock was cleared, no CPU could notify CPU 83. >> >> This issue has occurred multiple times, but the root cause remains >> unclear. We suspect that CPU 83 may have failed to enqueue itself, >> potentially due to a failure in the xchg_tail atomic operation. > > Hmm. For the lock word to be clear with a CPU spinning on its MCS node > then something has gone quite badly wrong. I think that would mean that: > > 1. The spinning CPU has updated tail to point to its node (xchg_tail()) > 2. The lock-owning CPU then erroneously cleared the tail field > (atomic_try_cmpxchg_relaxed()) > > But for the cmpxchg() to succeed in (2) then the xchg() in (1) must be > ordered after it and the lock word wouldn't end up as zero. This is > because RmW atomics must be totally ordered for a given memory location > and that applies regardless of their memory ordering properties. > > Of course, there could be _a_ bug here but, given the information you've > been able to provide, it's not obviously as "simple" as a missing memory > barrier. Have you confirmed that adding memory barriers makes the problem > go away? > Could this mean that even with xchg's relaxed version, cmpxchg would be impossible to succeed after xchg? We're unsure of the exact cause, only speculating that memory barriers might control CPU Store Buffer flushing so that other CPUs can see the modifications immediately,but our understanding of this is quite limited. Moreover, each test cycle is lengthy, making it difficult to confirm whether adding memory barriers resolves the issue. > If you're able to check the thread_info (via sp_el0) of CPU83 in your > example, it might be interesting to see whether or not the 'cpu' field > has been corrupted. For example, if it ends up being read as -1 then we > may compute a tail of 0 when enqueuing our MCS node into the lock word. > I checked the code for xchg_tail: lsr w1, w2, #16 add x7, x19, #0x2 swph w1, w0, [x7] x19 is the address of the rq lock, and tail is stored in w2. The dumped registers are: x19: ffffad96ff2907c0 x2 : 0000000001500000 So CPU 83 should calculate tail as 0x1500000, not 0. >> It has been noted that the _relaxed version is used in xchx_tail, and we >> are uncertain whether this could lead to visibility issues—for example, >> if CPU 83 modifies lock->tail, but other CPUs fail to observe the >> change. >> >> We are also checking if this is related: >> https://lkml.kernel.org/r/cb83e3e4-9e22-4457-bf61-5614cc4396ad@tu-bs.de > > Hopefully I (or somebody from Arm) can provide an update soon on this > topic but I wouldn't necessarily expect it to help you with this new > case. > > Will We also have another similar issue: • Watchdog detected hard LOCKUP on cpu 18 • pc : queued_spin_lock_slowpath+0x2ac It was stuck in a loop at : next = smp_cond_load_relaxed(&node->next, (VAL)); The read mcs node->next is 0: crash> p qnodes:18 per_cpu(qnodes, 18) = $1 = {{ mcs = { next = 0x0, locked = 0, count = 1 } }, This seems consistent with the phenomenon described in (6b), but we found that lock->val read by the following code was incorrectly cleared: val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK)); According to the assembly code, lock->val is loaded into w5: ldar w5, [x19] mov w0, w5 and w1, w5, #0xffff The value of register x5 is 0: x5 : 0000000000000000 If it were just reordering in xchg_tail, I don't think tail would be erroneously cleared. Might there be other possibilities for this issue? -- Thanks, Yu Peng
On 9/15/25 11:39 PM, pengyu wrote: > From: Yu Peng <pengyu@kylinos.cn> > > A hardlock detected on arm64: rq->lock was released, but a CPU > blocked at mcs_node->locked and timed out. > > We found xchg_tail and atomic_try_cmpxchg_relaxed used _relaxed > versions without memory barriers. Suspected insufficient coherence > guarantees on some arm64 microarchitectures, potentially leading to > the following issues occurred: > > CPU0: CPU1: > // Set tail to CPU0 > old = xchg_tail(lock, tail); > > //CPU0 read tail is itself > if ((val & _Q_TAIL_MASK) == tail) > // CPU1 exchanges the tail > old = xchg_tail(lock, tail) > //assuming CPU0 not see tail change > atomic_try_cmpxchg_relaxed( > &lock->val, &val, _Q_LOCKED_VAL) > //released without notifying CPU1 > goto release; > //hardlock detected > arch_mcs_spin_lock_contended( > &node->locked) > > Therefore, xchg_tail and atomic_try_cmpxchg using _mb to replace _relaxed. > > Signed-off-by: pengyu <pengyu@kylinos.cn> The qspinlock code had been enabled for arm64 for quite a long time. This is the first time that we got report like this. How reproducible is this hangup problem? What arm64 architecture has this problem? It can be a hardware bug. Anyway, changing a relaxed version of atomic op to a fully barrier version can be expensive on arm64 in general. We need more information to ensure that we are doing the right thing. Cheers, Longman
© 2016 - 2025 Red Hat, Inc.