[PATCH] locking/qspinlock: use xchg with _mb in slowpath for arm64
Posted by pengyu 2 weeks, 2 days ago
From: Yu Peng <pengyu@kylinos.cn>

A hard lockup was detected on arm64: rq->lock had been released, but a
CPU remained blocked on mcs_node->locked and timed out.

We found that xchg_tail() and atomic_try_cmpxchg_relaxed() use the
_relaxed variants, i.e. no memory barriers. We suspect insufficient
ordering guarantees on some arm64 microarchitectures, potentially
leading to the following sequence:

CPU0:                                           CPU1:
// Set tail to CPU0
old = xchg_tail(lock, tail);

//CPU0 read tail is itself
if ((val & _Q_TAIL_MASK) == tail)
                                                // CPU1 exchanges the tail
                                                old = xchg_tail(lock, tail)
//assuming CPU0 not see tail change
atomic_try_cmpxchg_relaxed(
	  &lock->val, &val, _Q_LOCKED_VAL)
//released without notifying CPU1
goto release;
                                                //hardlock detected
                                                arch_mcs_spin_lock_contended(
                                                      &node->locked)

Therefore, use the fully ordered (_mb) variants instead of _relaxed in
xchg_tail() and the slowpath atomic_try_cmpxchg().

Signed-off-by: pengyu <pengyu@kylinos.cn>
---
 kernel/locking/qspinlock.c | 4 ++++
 kernel/locking/qspinlock.h | 5 +++++
 2 files changed, 9 insertions(+)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index af8d122bb649..5e0c489d0d81 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -350,7 +350,11 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 *       PENDING will make the uncontended transition fail.
 	 */
 	if ((val & _Q_TAIL_MASK) == tail) {
+#if defined(CONFIG_ARM64)
+		if (atomic_try_cmpxchg(&lock->val, &val, _Q_LOCKED_VAL))
+#else
 		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
+#endif
 			goto release; /* No contention */
 	}

diff --git a/kernel/locking/qspinlock.h b/kernel/locking/qspinlock.h
index d69958a844f7..c89a6877b96b 100644
--- a/kernel/locking/qspinlock.h
+++ b/kernel/locking/qspinlock.h
@@ -113,12 +113,17 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
  */
 static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 {
+#if defined(CONFIG_ARM64)
+	return (u32)xchg(&lock->tail,
+				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+#else
 	/*
 	 * We can use relaxed semantics since the caller ensures that the
 	 * MCS node is properly initialized before updating the tail.
 	 */
 	return (u32)xchg_relaxed(&lock->tail,
 				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+#endif
 }

 #else /* _Q_PENDING_BITS == 8 */

2.25.1
Re: [PATCH] locking/qspinlock: use xchg with _mb in slowpath for arm64
Posted by Peter Zijlstra 2 weeks, 1 day ago
On Tue, Sep 16, 2025 at 11:39:03AM +0800, pengyu wrote:
> From: Yu Peng <pengyu@kylinos.cn>
> 
> A hard lockup was detected on arm64: rq->lock had been released, but a
> CPU remained blocked on mcs_node->locked and timed out.
> 
> We found that xchg_tail() and atomic_try_cmpxchg_relaxed() use the
> _relaxed variants, i.e. no memory barriers. We suspect insufficient
> ordering guarantees on some arm64 microarchitectures, potentially
> leading to the following sequence:
> 
> CPU0:                                           CPU1:
> // Set tail to CPU0
> old = xchg_tail(lock, tail);
> 
> //CPU0 read tail is itself
> if ((val & _Q_TAIL_MASK) == tail)
>                                                 // CPU1 exchanges the tail
>                                                 old = xchg_tail(lock, tail)
> //assuming CPU0 not see tail change
> atomic_try_cmpxchg_relaxed(
> 	  &lock->val, &val, _Q_LOCKED_VAL)
> //released without notifying CPU1
> goto release;
>                                                 //hardlock detected
>                                                 arch_mcs_spin_lock_contended(
>                                                       &node->locked)
> 
> Therefore, use the fully ordered (_mb) variants instead of _relaxed in
> xchg_tail() and the slowpath atomic_try_cmpxchg().

Yeah, no. We do not apply patches based on suspicion. And we most
certainly do not sprinkle #ifdef ARM64 in generic code.

There is this thread:

  https://lkml.kernel.org/r/cb83e3e4-9e22-4457-bf61-5614cc4396ad@tu-bs.de

Which is also concerned with xchg_tail(). Reading back, I'm not sure
we've ever heard back from ARM on whether that extra ;si was correct or
not, Will?

Anyway, as Waiman already asked, please state your exact ARM64
microarch.

Barring the ;si, the above thread suggests that they can prove the code
correct with the below change, does that resolve your problem?

Other than that, I'm going to have to leave this to Will and co.

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index af8d122bb649..249231460ea0 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -261,14 +261,8 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		goto release;
 
 	/*
-	 * Ensure that the initialisation of @node is complete before we
-	 * publish the updated tail via xchg_tail() and potentially link
-	 * @node into the waitqueue via WRITE_ONCE(prev->next, node) below.
-	 */
-	smp_wmb();
-
-	/*
-	 * Publish the updated tail.
+	 * Publish the updated tail. This is a RELEASE barrier and ensures the
+	 * @node is complete before the link becomes visible.
 	 * We have already touched the queueing cacheline; don't bother with
 	 * pending stuff.
 	 *
diff --git a/kernel/locking/qspinlock.h b/kernel/locking/qspinlock.h
index d69958a844f7..bb9a006e76eb 100644
--- a/kernel/locking/qspinlock.h
+++ b/kernel/locking/qspinlock.h
@@ -117,7 +117,7 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 	 * We can use relaxed semantics since the caller ensures that the
 	 * MCS node is properly initialized before updating the tail.
 	 */
-	return (u32)xchg_relaxed(&lock->tail,
+	return (u32)xchg_release(&lock->tail,
 				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
 }
 
@@ -167,7 +167,7 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 		 * the MCS node is properly initialized before updating the
 		 * tail.
 		 */
-	} while (!atomic_try_cmpxchg_relaxed(&lock->val, &old, new));
+	} while (!atomic_try_cmpxchg_release(&lock->val, &old, new));
 
 	return old;
 }
Re: [PATCH] locking/qspinlock: use xchg with _mb in slowpath for arm64
Posted by Waiman Long 2 weeks, 1 day ago
On 9/16/25 10:10 AM, Peter Zijlstra wrote:
> On Tue, Sep 16, 2025 at 11:39:03AM +0800, pengyu wrote:
>> From: Yu Peng <pengyu@kylinos.cn>
>>
>> A hard lockup was detected on arm64: rq->lock had been released, but a
>> CPU remained blocked on mcs_node->locked and timed out.
>>
>> We found that xchg_tail() and atomic_try_cmpxchg_relaxed() use the
>> _relaxed variants, i.e. no memory barriers. We suspect insufficient
>> ordering guarantees on some arm64 microarchitectures, potentially
>> leading to the following sequence:
>>
>> CPU0:                                           CPU1:
>> // Set tail to CPU0
>> old = xchg_tail(lock, tail);
>>
>> //CPU0 read tail is itself
>> if ((val & _Q_TAIL_MASK) == tail)
>>                                                  // CPU1 exchanges the tail
>>                                                  old = xchg_tail(lock, tail)
>> //assuming CPU0 not see tail change
>> atomic_try_cmpxchg_relaxed(
>> 	  &lock->val, &val, _Q_LOCKED_VAL)
>> //released without notifying CPU1
>> goto release;
>>                                                  //hardlock detected
>>                                                  arch_mcs_spin_lock_contended(
>>                                                        &node->locked)
>>
>> Therefore, use the fully ordered (_mb) variants instead of _relaxed in
>> xchg_tail() and the slowpath atomic_try_cmpxchg().
> Yeah, no. We do not apply patches based on suspicion. And we most
> certainly do not sprinkle #ifdef ARM64 in generic code.
>
> There is this thread:
>
>    https://lkml.kernel.org/r/cb83e3e4-9e22-4457-bf61-5614cc4396ad@tu-bs.de

Ah, I was not cc'ed on this email thread. That is why I was not aware of 
this discussion about xchg_tail(). It is an interesting read.

Anyway, this particular problem may be about the clarity of the arm64
memory model and whether any microarchs strictly follow it or not.

Cheers,
Longman
Re: [PATCH] locking/qspinlock: use xchg with _mb in slowpath for arm64
Posted by Will Deacon 2 weeks, 1 day ago
On Tue, Sep 16, 2025 at 04:10:32PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 16, 2025 at 11:39:03AM +0800, pengyu wrote:
> > From: Yu Peng <pengyu@kylinos.cn>
> > 
> > A hard lockup was detected on arm64: rq->lock had been released, but a
> > CPU remained blocked on mcs_node->locked and timed out.
> >
> > We found that xchg_tail() and atomic_try_cmpxchg_relaxed() use the
> > _relaxed variants, i.e. no memory barriers. We suspect insufficient
> > ordering guarantees on some arm64 microarchitectures, potentially
> > leading to the following sequence:
> > 
> > CPU0:                                           CPU1:
> > // Set tail to CPU0
> > old = xchg_tail(lock, tail);
> > 
> > //CPU0 read tail is itself
> > if ((val & _Q_TAIL_MASK) == tail)
> >                                                 // CPU1 exchanges the tail
> >                                                 old = xchg_tail(lock, tail)
> > //assuming CPU0 not see tail change
> > atomic_try_cmpxchg_relaxed(
> > 	  &lock->val, &val, _Q_LOCKED_VAL)
> > //released without notifying CPU1
> > goto release;
> >                                                 //hardlock detected
> >                                                 arch_mcs_spin_lock_contended(
> >                                                       &node->locked)
> > 
> > Therefore, use the fully ordered (_mb) variants instead of _relaxed in
> > xchg_tail() and the slowpath atomic_try_cmpxchg().
> 
> Yeah, no. We do not apply patches based on suspicion. And we most
> certainly do not sprinkle #ifdef ARM64 in generic code.

Absolutely.

> There is this thread:
> 
>   https://lkml.kernel.org/r/cb83e3e4-9e22-4457-bf61-5614cc4396ad@tu-bs.de
> 
> Which is also concerned with xchg_tail(). Reading back, I'm not sure
> we've ever heard back from ARM on whether that extra ;si was correct or
> not, Will?

It's still under discussion with the Arm architects but it was _very_
close to concluding last time we met and I wouldn't worry about it for
the purposes of this report.

> Anyway, as Waiman already asked, please state your exact ARM64
> microarch.
> 
> Barring the ;si, the above thread suggests that they can prove the code
> correct with the below change, does that resolve your problem?
> 
> Other than that, I'm going to have to leave this to Will and co.

I'll take a look but it's light on details.

Will
Re: [PATCH] locking/qspinlock: use xchg with _mb in slowpath for arm64
Posted by pengyu 2 weeks, 1 day ago
On 9/17/25 00:00, Will Deacon wrote:
> On Tue, Sep 16, 2025 at 04:10:32PM +0200, Peter Zijlstra wrote:
>> On Tue, Sep 16, 2025 at 11:39:03AM +0800, pengyu wrote:
>>> From: Yu Peng <pengyu@kylinos.cn>
>>>
>>> A hard lockup was detected on arm64: rq->lock had been released, but a
>>> CPU remained blocked on mcs_node->locked and timed out.
>>>
>>> We found that xchg_tail() and atomic_try_cmpxchg_relaxed() use the
>>> _relaxed variants, i.e. no memory barriers. We suspect insufficient
>>> ordering guarantees on some arm64 microarchitectures, potentially
>>> leading to the following sequence:
>>>
>>> CPU0:                                           CPU1:
>>> // Set tail to CPU0
>>> old = xchg_tail(lock, tail);
>>>
>>> //CPU0 read tail is itself
>>> if ((val & _Q_TAIL_MASK) == tail)
>>>                                                  // CPU1 exchanges the tail
>>>                                                  old = xchg_tail(lock, tail)
>>> //assuming CPU0 not see tail change
>>> atomic_try_cmpxchg_relaxed(
>>> 	  &lock->val, &val, _Q_LOCKED_VAL)
>>> //released without notifying CPU1
>>> goto release;
>>>                                                  //hardlock detected
>>>                                                  arch_mcs_spin_lock_contended(
>>>                                                        &node->locked)
>>>
>>> Therefore, use the fully ordered (_mb) variants instead of _relaxed in
>>> xchg_tail() and the slowpath atomic_try_cmpxchg().
>>
>> Yeah, no. We do not apply patches based on suspicion. And we most
>> certainly do not sprinkle #ifdef ARM64 in generic code.
> 
> Absolutely.
> 
>> There is this thread:
>>
>>    https://lkml.kernel.org/r/cb83e3e4-9e22-4457-bf61-5614cc4396ad@tu-bs.de
>>
>> Which is also concerned with xchg_tail(). Reading back, I'm not sure
>> we've ever heard back from ARM on whether that extra ;si was correct or
>> not, Will?
> 
> It's still under discussion with the Arm architects but it was _very_
> close to concluding last time we met and I wouldn't worry about it for
> the purposes of this report.
> 
>> Anyway, as Waiman already asked, please state your exact ARM64
>> microarch.
>>
>> Barring the ;si, the above thread suggests that they can prove the code
>> correct with the below change, does that resolve your problem?
>>
>> Other than that, I'm going to have to leave this to Will and co.
> 
> I'll take a look but it's light on details.
> 
> Will


Yes, this issue occurred on a kunpeng920 96-core machine and only
affected a small number of systems that had been running for over a
year.

Vmcore Analysis:
• Panic triggered by CPU 83 detecting a hard lockup at
     queued_spin_lock_slowpath+0x1d8/0x320.

• Corresponding code:
     arch_mcs_spin_lock_contended(&node->locked);

• The qspinlock involved was the rq lock, which showed a cleared state:
     crash> rq.lock,cpu ffffad96ff2907c0
       lock = {
         raw_lock = {
           {
             val = {
               counter = 0
             },
             {
               locked = 0 '\000',
               pending = 0 '\000'
             },
             {
               locked_pending = 0,
               tail = 0
             }
           }
         }
       },
       cpu = 50,

• CPU 83’s MCS node remained in a locked=0 state, with no previous
node found in the qnodes list.
     crash> p qnodes:83
     per_cpu(qnodes, 83) = $292 =
      {{
         mcs = {
           next = 0x0,
           locked = 0,
           count = 1
         }
       },
     crash> p qnodes | grep 83
       [83]: ffffadd6bf7914c0
     crash> p qnodes:all | grep ffffadd6bf7914c0
     crash>

• Since rq->lock was cleared, no CPU could notify CPU 83.

This issue has occurred multiple times, but the root cause remains
unclear. We suspect that CPU 83 may have failed to enqueue itself,
potentially due to a failure in the xchg_tail atomic operation.
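
For reference, the enqueue path we are looking at is roughly the
following (abridged and paraphrased from queued_spin_lock_slowpath(),
not the exact source):

	node->locked = 0;
	node->next = NULL;

	/* Publish the updated tail. */
	old = xchg_tail(lock, tail);
	next = NULL;

	/* If there was a previous node, link behind it and wait. */
	if (old & _Q_TAIL_MASK) {
		prev = decode_tail(old);

		/* Link @node into the waitqueue. */
		WRITE_ONCE(prev->next, node);

		/* CPU 83 was spinning here with ->locked still 0. */
		arch_mcs_spin_lock_contended(&node->locked);
	}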

It has been noted that the _relaxed version is used in xchg_tail, and we
are uncertain whether this could lead to visibility issues, for example
if CPU 83 modifies lock->tail but other CPUs fail to observe the change.

We are also checking if this is related:
     https://lkml.kernel.org/r/cb83e3e4-9e22-4457-bf61-5614cc4396ad@tu-bs.de
Re: [PATCH] locking/qspinlock: use xchg with _mb in slowpath for arm64
Posted by Will Deacon 2 weeks, 1 day ago
On Wed, Sep 17, 2025 at 06:51:18PM +0800, pengyu wrote:
> Yes, this issue occurred on a kunpeng920 96-core machine and only
> affected a small number of systems that had been running for over a
> year.
> 
> Vmcore Analysis:
> • Panic triggered by CPU 83 detecting a hard lockup at
>     queued_spin_lock_slowpath+0x1d8/0x320.
> 
> • Corresponding code:
>     arch_mcs_spin_lock_contended(&node->locked);
> 
> • The qspinlock involved was the rq lock, which showed a cleared state:
>     crash> rq.lock,cpu ffffad96ff2907c0
>       lock = {
>         raw_lock = {
>           {
>             val = {
>               counter = 0
>             },
>             {
>               locked = 0 '\000',
>               pending = 0 '\000'
>             },
>             {
>               locked_pending = 0,
>               tail = 0
>             }
>           }
>         }
>       },
>       cpu = 50,
> 
> • CPU 83’s MCS node remained in a locked=0 state, with no previous
> node found in the qnodes list.
>     crash> p qnodes:83
>     per_cpu(qnodes, 83) = $292 =
>      {{
>         mcs = {
>           next = 0x0,
>           locked = 0,
>           count = 1
>         }
>       },
>     crash> p qnodes | grep 83
>       [83]: ffffadd6bf7914c0
>     crash> p qnodes:all | grep ffffadd6bf7914c0
>     crash>
> 
> • Since rq->lock was cleared, no CPU could notify CPU 83.
> 
> This issue has occurred multiple times, but the root cause remains
> unclear. We suspect that CPU 83 may have failed to enqueue itself,
> potentially due to a failure in the xchg_tail atomic operation.

Hmm. For the lock word to be clear with a CPU spinning on its MCS node
then something has gone quite badly wrong. I think that would mean that:

  1. The spinning CPU has updated tail to point to its node (xchg_tail())
  2. The lock-owning CPU then erroneously cleared the tail field
     (atomic_try_cmpxchg_relaxed())

But for the cmpxchg() to succeed in (2) then the xchg() in (1) must be
ordered after it and the lock word wouldn't end up as zero. This is
because RmW atomics must be totally ordered for a given memory location
and that applies regardless of their memory ordering properties.
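
To illustrate the point with a deliberately simplified userspace model
(single word only; the real code mixes a 16-bit xchg with a 32-bit
cmpxchg, which this sketch does not capture):

/* Two relaxed RMWs on the same location: a CAS that expects only "its
 * own" tail (the uncontended release) and an exchange that publishes a
 * new tail (xchg_tail). Coherence puts them in one total order on that
 * location, so the CAS succeeds iff the exchange has not happened yet,
 * and the word is never left at 0 once the exchange has run.
 * Build with: cc -O2 -pthread
 */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define TAIL0 0x00a80000u	/* releasing CPU's own tail (arbitrary) */
#define TAIL1 0x01500000u	/* queueing CPU's tail (arbitrary)      */

static atomic_uint lockval;
static unsigned int xchg_old;
static bool cas_ok;

static void *queuer(void *arg)		/* models xchg_tail() */
{
	xchg_old = atomic_exchange_explicit(&lockval, TAIL1,
					    memory_order_relaxed);
	return NULL;
}

static void *releaser(void *arg)	/* models atomic_try_cmpxchg_relaxed() */
{
	unsigned int expected = TAIL0;

	cas_ok = atomic_compare_exchange_strong_explicit(&lockval,
				&expected, 0,
				memory_order_relaxed, memory_order_relaxed);
	return NULL;
}

int main(void)
{
	for (int i = 0; i < 1000000; i++) {
		pthread_t a, b;

		atomic_store_explicit(&lockval, TAIL0, memory_order_relaxed);
		pthread_create(&a, NULL, queuer, NULL);
		pthread_create(&b, NULL, releaser, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);

		/* CAS first => exchange reads 0; exchange first => CAS fails. */
		assert(cas_ok == (xchg_old == 0));
		/* Either way the published tail survives. */
		assert(atomic_load_explicit(&lockval,
					    memory_order_relaxed) == TAIL1);
	}
	puts("no forbidden outcome observed");
	return 0;
}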

Of course, there could be _a_ bug here but, given the information you've
been able to provide, it's not obviously as "simple" as a missing memory
barrier. Have you confirmed that adding memory barriers makes the problem
go away?

If you're able to check the thread_info (via sp_el0) of CPU83 in your
example, it might be interesting to see whether or not the 'cpu' field
has been corrupted. For example, if it ends up being read as -1 then we
may compute a tail of 0 when enqueuing our MCS node into the lock word.
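
For reference, the tail is computed along these lines (encode_tail() in
the qspinlock code), so cpu == -1 with idx == 0 does indeed encode to 0:

static inline __pure u32 encode_tail(int cpu, int idx)
{
	u32 tail;

	tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
	tail |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */

	return tail;
}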

> It has been noted that the _relaxed version is used in xchg_tail, and we
> are uncertain whether this could lead to visibility issues, for example
> if CPU 83 modifies lock->tail but other CPUs fail to observe the change.
> 
> We are also checking if this is related:
>     https://lkml.kernel.org/r/cb83e3e4-9e22-4457-bf61-5614cc4396ad@tu-bs.de

Hopefully I (or somebody from Arm) can provide an update soon on this
topic but I wouldn't necessarily expect it to help you with this new
case.

Will
Re: [PATCH] locking/qspinlock: use xchg with _mb in slowpath for arm64
Posted by pengyu 1 week, 6 days ago
On 2025/9/17 19:37, Will Deacon wrote:
> On Wed, Sep 17, 2025 at 06:51:18PM +0800, pengyu wrote:
>> Yes, this issue occurred on a kunpeng920 96-core machine and only
>> affected a small number of systems that had been running for over a
>> year.
>>
>> Vmcore Analysis:
>> • Panic triggered by CPU 83 detecting a hard lockup at
>>      queued_spin_lock_slowpath+0x1d8/0x320.
>>
>> • Corresponding code:
>>      arch_mcs_spin_lock_contended(&node->locked);
>>
>> • The qspinlock involved was the rq lock, which showed a cleared state:
>>      crash> rq.lock,cpu ffffad96ff2907c0
>>        lock = {
>>          raw_lock = {
>>            {
>>              val = {
>>                counter = 0
>>              },
>>              {
>>                locked = 0 '\000',
>>                pending = 0 '\000'
>>              },
>>              {
>>                locked_pending = 0,
>>                tail = 0
>>              }
>>            }
>>          }
>>        },
>>        cpu = 50,
>>
>> • CPU 83’s MCS node remained in a locked=0 state, with no previous
>> node found in the qnodes list.
>>      crash> p qnodes:83
>>      per_cpu(qnodes, 83) = $292 =
>>       {{
>>          mcs = {
>>            next = 0x0,
>>            locked = 0,
>>            count = 1
>>          }
>>        },
>>      crash> p qnodes | grep 83
>>        [83]: ffffadd6bf7914c0
>>      crash> p qnodes:all | grep ffffadd6bf7914c0
>>      crash>
>>
>> • Since rq->lock was cleared, no CPU could notify CPU 83.
>>
>> This issue has occurred multiple times, but the root cause remains
>> unclear. We suspect that CPU 83 may have failed to enqueue itself,
>> potentially due to a failure in the xchg_tail atomic operation.
> 
> Hmm. For the lock word to be clear with a CPU spinning on its MCS node
> then something has gone quite badly wrong. I think that would mean that:
> 
>    1. The spinning CPU has updated tail to point to its node (xchg_tail())
>    2. The lock-owning CPU then erroneously cleared the tail field
>       (atomic_try_cmpxchg_relaxed())
> 
> But for the cmpxchg() to succeed in (2) then the xchg() in (1) must be
> ordered after it and the lock word wouldn't end up as zero. This is
> because RmW atomics must be totally ordered for a given memory location
> and that applies regardless of their memory ordering properties.
> 
> Of course, there could be _a_ bug here but, given the information you've
> been able to provide, it's not obviously as "simple" as a missing memory
> barrier. Have you confirmed that adding memory barriers makes the problem
> go away?
> 

Could this mean that, even with the relaxed version of xchg, the cmpxchg
could not succeed after the xchg?

We're unsure of the exact cause; we were only speculating that memory
barriers might force the CPU store buffer to drain so that other CPUs
see the modifications sooner, but our understanding of this is quite
limited.

Moreover, each test cycle is lengthy, making it difficult to confirm
whether adding memory barriers resolves the issue.

> If you're able to check the thread_info (via sp_el0) of CPU83 in your
> example, it might be interesting to see whether or not the 'cpu' field
> has been corrupted. For example, if it ends up being read as -1 then we
> may compute a tail of 0 when enqueuing our MCS node into the lock word.
> 

I checked the code for xchg_tail:
   lsr     w1, w2, #16
   add     x7, x19, #0x2
   swph    w1, w0, [x7]

x19 is the address of the rq lock, and tail is stored in w2. The dumped
registers are:
   x19: ffffad96ff2907c0
   x2 : 0000000001500000

So CPU 83 should calculate tail as 0x1500000, not 0.
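
(For what it's worth, with _Q_PENDING_BITS == 8 that value decodes as
(83 + 1) << 18 == 0x1500000, i.e. tail CPU 83 with tail idx 0, which
matches count = 1 in the qnode dump above.)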

>> It has been noted that the _relaxed version is used in xchg_tail, and we
>> are uncertain whether this could lead to visibility issues, for example
>> if CPU 83 modifies lock->tail but other CPUs fail to observe the change.
>>
>> We are also checking if this is related:
>>      https://lkml.kernel.org/r/cb83e3e4-9e22-4457-bf61-5614cc4396ad@tu-bs.de
> 
> Hopefully I (or somebody from Arm) can provide an update soon on this
> topic but I wouldn't necessarily expect it to help you with this new
> case.
> 
> Will

We also have another similar issue:
• Watchdog detected hard LOCKUP on cpu 18
• pc : queued_spin_lock_slowpath+0x2ac

It was stuck in a loop at:
     next = smp_cond_load_relaxed(&node->next, (VAL));

The mcs node->next it read back is 0:
crash> p qnodes:18
per_cpu(qnodes, 18) = $1 =
  {{
     mcs = {
       next = 0x0,
       locked = 0,
       count = 1
     }
   },

This seems consistent with the phenomenon described in (6b), but we
found that lock->val read by the following code was incorrectly cleared:

  val = atomic_cond_read_acquire(&lock->val, !(VAL &
_Q_LOCKED_PENDING_MASK));

According to the assembly code, lock->val is loaded into w5:
   ldar    w5, [x19]
   mov     w0, w5
   and     w1, w5, #0xffff
The value of register x5 is 0:
   x5 : 0000000000000000
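
For context, the head-of-waitqueue path in question is roughly
(abridged, not the exact source):

	/* Head of the queue: wait for owner and pending to go away. */
	val = atomic_cond_read_acquire(&lock->val,
				       !(VAL & _Q_LOCKED_PENDING_MASK));

	/* If nobody is queued behind us, try to grab the lock and leave. */
	if ((val & _Q_TAIL_MASK) == tail) {
		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
			goto release;	/* No contention */
	}

	/* Otherwise take the lock and pass the MCS lock to the next waiter. */
	set_locked(lock);
	if (!next)
		next = smp_cond_load_relaxed(&node->next, (VAL));	/* CPU 18 stuck here */
	arch_mcs_spin_unlock_contended(&next->locked);

With lock->val reading back as 0, the tail comparison fails, so this
path takes set_locked() and then waits for a ->next that never arrives.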

If it were just reordering in xchg_tail, I don't think tail would be
erroneously cleared. Might there be other possibilities for this issue?

-- 
Thanks,
Yu Peng
Re: [PATCH] locking/qspinlock: use xchg with _mb in slowpath for arm64
Posted by Waiman Long 2 weeks, 1 day ago
On 9/15/25 11:39 PM, pengyu wrote:
> From: Yu Peng <pengyu@kylinos.cn>
>
> A hard lockup was detected on arm64: rq->lock had been released, but a
> CPU remained blocked on mcs_node->locked and timed out.
>
> We found that xchg_tail() and atomic_try_cmpxchg_relaxed() use the
> _relaxed variants, i.e. no memory barriers. We suspect insufficient
> ordering guarantees on some arm64 microarchitectures, potentially
> leading to the following sequence:
>
> CPU0:                                           CPU1:
> // Set tail to CPU0
> old = xchg_tail(lock, tail);
>
> //CPU0 read tail is itself
> if ((val & _Q_TAIL_MASK) == tail)
>                                                  // CPU1 exchanges the tail
>                                                  old = xchg_tail(lock, tail)
> //assuming CPU0 not see tail change
> atomic_try_cmpxchg_relaxed(
> 	  &lock->val, &val, _Q_LOCKED_VAL)
> //released without notifying CPU1
> goto release;
>                                                  //hardlock detected
>                                                  arch_mcs_spin_lock_contended(
>                                                        &node->locked)
>
> Therefore, use the fully ordered (_mb) variants instead of _relaxed in
> xchg_tail() and the slowpath atomic_try_cmpxchg().
>
> Signed-off-by: pengyu <pengyu@kylinos.cn>

The qspinlock code has been enabled for arm64 for quite a long time.
This is the first time that we have gotten a report like this. How
reproducible is this hang?

Which arm64 microarchitecture has this problem? It could be a hardware
bug.

Anyway, changing a relaxed atomic op to a fully ordered version can be
expensive on arm64 in general. We need more information to make sure
that we are doing the right thing.

Cheers,
Longman