[PATCH] locking/pvqspinlock: Use try_cmpxchg() in pv_unhash

Posted by Bibo Mao 1 year, 1 month ago
We ported pv spinlock to an old Linux kernel on the LoongArch platform, and
there is an error with some stress tests. In short, the error report looks
like this:
 kernel BUG at kernel/locking/qspinlock_paravirt.h:261!
 Oops - BUG[#1]:
 CPU: 1 PID: 6613 Comm: pidof Not tainted 4.19.190+ #43
 Hardware name: Loongson KVM, BIOS 0.0.0 02/06/2015
    ra: 9000000000509cfc do_task_stat+0x29c/0xaf0
   ERA: 9000000000291308 __pv_queued_spin_unlock_slowpath+0xf8/0x100
  CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
  PRMD: 00000000 (PPLV0 -PIE -PWE)
         ...
 Call Trace:
 [<9000000000291308>] __pv_queued_spin_unlock_slowpath+0xf8/0x100
 [<9000000000509cf8>] do_task_stat+0x298/0xaf0
 [<9000000000502570>] proc_single_show+0x60/0xe0

The problem is that memory accesses can be reordered on the LoongArch
platform, and there is contention between pv_unhash() and pv_hash().

CPU0 pv_unhash:                                CPU1 pv_hash:

  for_each_hash_entry(he, offset, hash) {        for_each_hash_entry(he, offset, hash) {
    if (READ_ONCE(he->lock) == lock) {             struct qspinlock *old = NULL;
      node = READ_ONCE(he->node);
      WRITE_ONCE(he->lock, NULL);

On the LoongArch platform, where memory accesses may be reordered, the
execution order may end up like this:
>      WRITE_ONCE(he->lock, NULL);
                                                    if (try_cmpxchg(&he->lock, &old, lock)) {
                                                      WRITE_ONCE(he->node, node);
                                                      return &he->lock;

CPU1's pv_hash() is executing, observes that he->lock has been set to NULL,
and writes he->node with the node of the new lock.
>      node = READ_ONCE(he->node);
READ_ONCE(he->node) on CPU0 will then return the node of the new lock rather
than the node of its own lock.
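
For reference, the pv_hash() side of this race looks roughly like the
following in current mainline kernel/locking/qspinlock_paravirt.h (abridged
here, lock event accounting omitted):

	static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
	{
		unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
		struct pv_hash_entry *he;

		for_each_hash_entry(he, offset, hash) {
			struct qspinlock *old = NULL;

			/* A slot is free while he->lock is NULL; claim it atomically. */
			if (try_cmpxchg(&he->lock, &old, lock)) {
				WRITE_ONCE(he->node, node);
				return &he->lock;
			}
		}

		/* Every blocked lock consumes at most one entry, so this must not happen. */
		BUG();
	}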

Here the READ_ONCE()/WRITE_ONCE() pair is replaced with try_cmpxchg().

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
---
 kernel/locking/qspinlock_paravirt.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index dc1cb90e3644..cc4eabce092d 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -240,9 +240,10 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
 	struct pv_node *node;
 
 	for_each_hash_entry(he, offset, hash) {
-		if (READ_ONCE(he->lock) == lock) {
+		struct qspinlock *old = lock;
+
+		if (try_cmpxchg(&he->lock, &old, NULL)) {
 			node = READ_ONCE(he->node);
-			WRITE_ONCE(he->lock, NULL);
 			return node;
 		}
 	}

base-commit: 48f506ad0b683d3e7e794efa60c5785c4fdc86fa
-- 
2.39.3
Re: [PATCH] locking/pvqspinlock: Use try_cmpxchg() in pv_unhash
Posted by Waiman Long 1 year, 1 month ago
On 12/23/24 2:47 AM, Bibo Mao wrote:
> We ported pv spinlock to an old Linux kernel on the LoongArch platform, and
What old kernel are you using? A race condition like that should be hard
to reproduce. Did you hit this bug only once?
> there is an error with some stress tests. In short, the error report looks
> like this:
>   kernel BUG at kernel/locking/qspinlock_paravirt.h:261!
>   Oops - BUG[#1]:
>   CPU: 1 PID: 6613 Comm: pidof Not tainted 4.19.190+ #43
>   Hardware name: Loongson KVM, BIOS 0.0.0 02/06/2015
>      ra: 9000000000509cfc do_task_stat+0x29c/0xaf0
>     ERA: 9000000000291308 __pv_queued_spin_unlock_slowpath+0xf8/0x100
>    CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
>    PRMD: 00000000 (PPLV0 -PIE -PWE)
>           ...
>   Call Trace:
>   [<9000000000291308>] __pv_queued_spin_unlock_slowpath+0xf8/0x100
>   [<9000000000509cf8>] do_task_stat+0x298/0xaf0
>   [<9000000000502570>] proc_single_show+0x60/0xe0
>
> The problem is that memory accesses can be reordered on the LoongArch
> platform, and there is contention between pv_unhash() and pv_hash().
>
> CPU0 pv_unhash:                                CPU1 pv_hash:
>
>    for_each_hash_entry(he, offset, hash) {        for_each_hash_entry(he, offset, hash) {
>      if (READ_ONCE(he->lock) == lock) {             struct qspinlock *old = NULL;
>        node = READ_ONCE(he->node);
>        WRITE_ONCE(he->lock, NULL);
>
> On the LoongArch platform, where memory accesses may be reordered, the
> execution order may end up like this:
>>       WRITE_ONCE(he->lock, NULL);
>                                                      if (try_cmpxchg(&he->lock, &old, lock)) {
>                                                        WRITE_ONCE(he->node, node);
>                                                        return &he->lock;
>
> CPU1's pv_hash() is executing, observes that he->lock has been set to NULL,
> and writes he->node with the node of the new lock.
>>       node = READ_ONCE(he->node);
> READ_ONCE(he->node) on CPU0 will then return the node of the new lock rather
> than the node of its own lock.
The pv_hash_entry structure is supposed to be cacheline aligned. The
use of READ_ONCE()/WRITE_ONCE() will ensure that the compiler won't rearrange
the ordering. If the CPU can rearrange read/write ordering within the same
cacheline like that, it may be some advanced optimization technique that
I don't quite understand. Can you check whether the buffer returned by
alloc_large_system_hash() is really properly aligned?
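
If it helps, here is a hypothetical debug aid (not part of the patch) for
that check; pv_hash_check_alignment() is a made-up helper that assumes it
sits next to __pv_init_lock_hash() in kernel/locking/qspinlock_paravirt.h so
that it can see the static pv_lock_hash pointer:

	/* Warn if the table from alloc_large_system_hash() is not cacheline aligned. */
	static void __init pv_hash_check_alignment(void)
	{
		WARN_ONCE(!IS_ALIGNED((unsigned long)pv_lock_hash, SMP_CACHE_BYTES),
			  "pv_lock_hash %px not aligned to %d-byte cache lines\n",
			  pv_lock_hash, SMP_CACHE_BYTES);
	}
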
>
> Here the READ_ONCE()/WRITE_ONCE() pair is replaced with try_cmpxchg().
>
> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
> ---
>   kernel/locking/qspinlock_paravirt.h | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> index dc1cb90e3644..cc4eabce092d 100644
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -240,9 +240,10 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
>   	struct pv_node *node;
>   
>   	for_each_hash_entry(he, offset, hash) {
> -		if (READ_ONCE(he->lock) == lock) {
> +		struct qspinlock *old = lock;
> +
> +		if (try_cmpxchg(&he->lock, &old, NULL)) {
>   			node = READ_ONCE(he->node);
> -			WRITE_ONCE(he->lock, NULL);
>   			return node;
>   		}
>   	}

Anyway, this change isn't quite right, as Akira has pointed out.

Cheers,
Longman

>
> base-commit: 48f506ad0b683d3e7e794efa60c5785c4fdc86fa

Re: [PATCH] locking/pvqspinlock: Use try_cmpxchg() in pv_unhash
Posted by bibo mao 1 year, 1 month ago

On 2024/12/26 12:04 AM, Waiman Long wrote:
> On 12/23/24 2:47 AM, Bibo Mao wrote:
>> We ported pv spinlock to an old Linux kernel on the LoongArch platform, and
> What old kernel are you using? A race condition like that should be hard
> to reproduce. Did you hit this bug only once?
>> there is an error with some stress tests. In short, the error report looks
>> like this:
>>   kernel BUG at kernel/locking/qspinlock_paravirt.h:261!
>>   Oops - BUG[#1]:
>>   CPU: 1 PID: 6613 Comm: pidof Not tainted 4.19.190+ #43
>>   Hardware name: Loongson KVM, BIOS 0.0.0 02/06/2015
>>      ra: 9000000000509cfc do_task_stat+0x29c/0xaf0
>>     ERA: 9000000000291308 __pv_queued_spin_unlock_slowpath+0xf8/0x100
>>    CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
>>    PRMD: 00000000 (PPLV0 -PIE -PWE)
>>           ...
>>   Call Trace:
>>   [<9000000000291308>] __pv_queued_spin_unlock_slowpath+0xf8/0x100
>>   [<9000000000509cf8>] do_task_stat+0x298/0xaf0
>>   [<9000000000502570>] proc_single_show+0x60/0xe0
>>
>> The problem is that memory accesses can be reordered on the LoongArch
>> platform, and there is contention between pv_unhash() and pv_hash().
>>
>> CPU0 pv_unhash:                                CPU1 pv_hash:
>>
>>    for_each_hash_entry(he, offset, hash) {        for_each_hash_entry(he, offset, hash) {
>>      if (READ_ONCE(he->lock) == lock) {             struct qspinlock *old = NULL;
>>        node = READ_ONCE(he->node);
>>        WRITE_ONCE(he->lock, NULL);
>>
>> On the LoongArch platform, where memory accesses may be reordered, the
>> execution order may end up like this:
>>>       WRITE_ONCE(he->lock, NULL);
>>                                                      if (try_cmpxchg(&he->lock, &old, lock)) {
>>                                                        WRITE_ONCE(he->node, node);
>>                                                        return &he->lock;
>>
>> CPU1's pv_hash() is executing, observes that he->lock has been set to NULL,
>> and writes he->node with the node of the new lock.
>>>       node = READ_ONCE(he->node);
>> READ_ONCE(he->node) on CPU0 will then return the node of the new lock rather
>> than the node of its own lock.
> The pv_hash_entry structure is supposed to be cacheline aligned. The
> use of READ_ONCE()/WRITE_ONCE() will ensure that the compiler won't rearrange
> the ordering. If the CPU can rearrange read/write ordering within the same
> cacheline like that, it may be some advanced optimization technique that
> I don't quite understand. Can you check whether the buffer returned by
> alloc_large_system_hash() is really properly aligned?
Yes, you are right. After communicating with the HW guys: a WRITE_ONCE will
not be reordered ahead of an earlier READ_ONCE; a WRITE_ONCE can only be
reordered with another WRITE_ONCE.

And the buffer allocated by alloc_large_system_hash() is cacheline aligned as
well, so the problem may be specific to LoongArch VMs; I am investigating the
detailed reason.

>>
>> Here the READ_ONCE()/WRITE_ONCE() pair is replaced with try_cmpxchg().
>>
>> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
>> ---
>>   kernel/locking/qspinlock_paravirt.h | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
>> index dc1cb90e3644..cc4eabce092d 100644
>> --- a/kernel/locking/qspinlock_paravirt.h
>> +++ b/kernel/locking/qspinlock_paravirt.h
>> @@ -240,9 +240,10 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
>>       struct pv_node *node;
>>       for_each_hash_entry(he, offset, hash) {
>> -        if (READ_ONCE(he->lock) == lock) {
>> +        struct qspinlock *old = lock;
>> +
>> +        if (try_cmpxchg(&he->lock, &old, NULL)) {
>>               node = READ_ONCE(he->node);
>> -            WRITE_ONCE(he->lock, NULL);
>>               return node;
>>           }
>>       }
> 
> Anyway, this change isn't quite right, as Akira has pointed out.
This change is an abuse of try_cmpxchg(); the original code may have no
problem. I am testing again.

Sorry for the noise, and happy new year holiday :)

Regards
Bibo Mao

> 
> Cheers,
> Longman
> 
>>
>> base-commit: 48f506ad0b683d3e7e794efa60c5785c4fdc86fa

Re: [PATCH] locking/pvqspinlock: Use try_cmpxchg() in pv_unhash
Posted by Akira Yokosawa 1 year, 1 month ago
Hi,

[With my LKMM reviewer hat on]

On Mon, 23 Dec 2024 15:47:04 +0800, Bibo Mao wrote:
> We ported pv spinlock to an old Linux kernel on the LoongArch platform, and
> there is an error with some stress tests. In short, the error report looks
> like this:
>  kernel BUG at kernel/locking/qspinlock_paravirt.h:261!
>  Oops - BUG[#1]:
>  CPU: 1 PID: 6613 Comm: pidof Not tainted 4.19.190+ #43
>  Hardware name: Loongson KVM, BIOS 0.0.0 02/06/2015
>     ra: 9000000000509cfc do_task_stat+0x29c/0xaf0
>    ERA: 9000000000291308 __pv_queued_spin_unlock_slowpath+0xf8/0x100
>   CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
>   PRMD: 00000000 (PPLV0 -PIE -PWE)
>          ...
>  Call Trace:
>  [<9000000000291308>] __pv_queued_spin_unlock_slowpath+0xf8/0x100
>  [<9000000000509cf8>] do_task_stat+0x298/0xaf0
>  [<9000000000502570>] proc_single_show+0x60/0xe0
> 
> The problem is that memory accesses can be reordered on the LoongArch
> platform, and there is contention between pv_unhash() and pv_hash().
> 
> CPU0 pv_unhash:                                CPU1 pv_hash:
> 
>   for_each_hash_entry(he, offset, hash) {        for_each_hash_entry(he, offset, hash) {
>     if (READ_ONCE(he->lock) == lock) {             struct qspinlock *old = NULL;
>       node = READ_ONCE(he->node);
>       WRITE_ONCE(he->lock, NULL);
> 
> On the LoongArch platform, where memory accesses may be reordered, the
> execution order may end up like this:
>>      WRITE_ONCE(he->lock, NULL);
>                                                     if (try_cmpxchg(&he->lock, &old, lock)) {
>                                                       WRITE_ONCE(he->node, node);
>                                                       return &he->lock;
> 
> CPU1's pv_hash() is executing, observes that he->lock has been set to NULL,
> and writes he->node with the node of the new lock.
>>      node = READ_ONCE(he->node);
> READ_ONCE(he->node) on CPU0 will then return the node of the new lock rather
> than the node of its own lock.
> 
> Here the READ_ONCE()/WRITE_ONCE() pair is replaced with try_cmpxchg().
> 
> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
> ---
>  kernel/locking/qspinlock_paravirt.h | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> index dc1cb90e3644..cc4eabce092d 100644
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -240,9 +240,10 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
>  	struct pv_node *node;
>  
>  	for_each_hash_entry(he, offset, hash) {
> -		if (READ_ONCE(he->lock) == lock) {
> +		struct qspinlock *old = lock;
> +
> +		if (try_cmpxchg(&he->lock, &old, NULL)) {
>  			node = READ_ONCE(he->node);
> -			WRITE_ONCE(he->lock, NULL);
>  			return node;
>  		}
>  	}

But this change might delay the load of he->node until *after* he->lock has
returned to NULL.

Let's get back to the current code (with labels added to the ONCE accesses):

	for_each_hash_entry(he, offset, hash) {
		if (READ_ONCE(he->lock) == lock) {   /* A */
			node = READ_ONCE(he->node);  /* B */
			WRITE_ONCE(he->lock, NULL);  /* C */
			return node;
		}
	}

It looks to me you want to guarantee the ordering of A -> B -> C.

Your change effectively provides ordering of [A C] -> B.
[A C] is done in an atomic RMW op.

For the ordering of A -> B -> C, I'd change the code to

	for_each_hash_entry(he, offset, hash) {
		if (smp_load_acquire(&he->lock) == lock) {   /* A */
			node = READ_ONCE(he->node);          /* B */
			smp_store_release(&he->lock, NULL);  /* C */
			return node;
		}
	}

Note: A -> B is load-to-load ordering, and it is not provided by
control dependency.  Upgrading A to ACQUIRE is a lightweight option.
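
If it helps, the A -> B concern can also be phrased as an LKMM litmus test
(a hypothetical sketch, not from the kernel tree; lock/node stand in for
he->lock/he->node and "2" for the next lock hashed into the slot).  With the
plain READ_ONCE()/WRITE_ONCE() accesses below, herd7 should report the
"exists" state as allowed, i.e. LKMM permits the reordering; upgrading A to
smp_load_acquire() and C to smp_store_release() should turn it into "Never":

	C pv-unhash-race

	(*
	 * P0 models pv_unhash(): A = check he->lock, B = read he->node,
	 * C = clear he->lock.  P1 models pv_hash() claiming the freed slot
	 * for a new lock (2) and publishing its node.  The exists clause asks
	 * whether P0 can read P1's new node, i.e. the reported corruption.
	 *)

	{
		lock=1;
		node=1;
	}

	P0(int *lock, int *node)
	{
		int r0;
		int r1;

		r0 = READ_ONCE(*lock);
		if (r0 == 1) {
			r1 = READ_ONCE(*node);
			WRITE_ONCE(*lock, 0);
		}
	}

	P1(int *lock, int *node)
	{
		int r2;

		r2 = cmpxchg(lock, 0, 2);
		if (r2 == 0)
			WRITE_ONCE(*node, 2);
	}

	exists (0:r1=2 /\ 1:r2=0)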

I'm expecting Will and/or Boqun to chime in after the holiday break.

        Thanks, Akira

> 
> base-commit: 48f506ad0b683d3e7e794efa60c5785c4fdc86fa
> -- 
> 2.39.3
>
Re: [PATCH] locking/pvqspinlock: Use try_cmpxchg() in pv_unhash
Posted by Waiman Long 1 year, 1 month ago
On 12/24/24 10:15 PM, Akira Yokosawa wrote:
> Hi,
>
> [With my LKMM reviewer hat on]
>
> On Mon, 23 Dec 2024 15:47:04 +0800, Bibo Mao wrote:
>> We ported pv spinlock to an old Linux kernel on the LoongArch platform, and
>> there is an error with some stress tests. In short, the error report looks
>> like this:
>>   kernel BUG at kernel/locking/qspinlock_paravirt.h:261!
>>   Oops - BUG[#1]:
>>   CPU: 1 PID: 6613 Comm: pidof Not tainted 4.19.190+ #43
>>   Hardware name: Loongson KVM, BIOS 0.0.0 02/06/2015
>>      ra: 9000000000509cfc do_task_stat+0x29c/0xaf0
>>     ERA: 9000000000291308 __pv_queued_spin_unlock_slowpath+0xf8/0x100
>>    CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
>>    PRMD: 00000000 (PPLV0 -PIE -PWE)
>>           ...
>>   Call Trace:
>>   [<9000000000291308>] __pv_queued_spin_unlock_slowpath+0xf8/0x100
>>   [<9000000000509cf8>] do_task_stat+0x298/0xaf0
>>   [<9000000000502570>] proc_single_show+0x60/0xe0
>>
>> The problem is that memory accesses can be reordered on the LoongArch
>> platform, and there is contention between pv_unhash() and pv_hash().
>>
>> CPU0 pv_unhash:                                CPU1 pv_hash:
>>
>>    for_each_hash_entry(he, offset, hash) {        for_each_hash_entry(he, offset, hash) {
>>      if (READ_ONCE(he->lock) == lock) {             struct qspinlock *old = NULL;
>>        node = READ_ONCE(he->node);
>>        WRITE_ONCE(he->lock, NULL);
>>
>> On the LoongArch platform, where memory accesses may be reordered, the
>> execution order may end up like this:
>>>       WRITE_ONCE(he->lock, NULL);
>>                                                      if (try_cmpxchg(&he->lock, &old, lock)) {
>>                                                        WRITE_ONCE(he->node, node);
>>                                                        return &he->lock;
>>
>> CPU1's pv_hash() is executing, observes that he->lock has been set to NULL,
>> and writes he->node with the node of the new lock.
>>>       node = READ_ONCE(he->node);
>> READ_ONCE(he->node) on CPU0 will then return the node of the new lock rather
>> than the node of its own lock.
>>
>> Here the READ_ONCE()/WRITE_ONCE() pair is replaced with try_cmpxchg().
>>
>> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
>> ---
>>   kernel/locking/qspinlock_paravirt.h | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
>> index dc1cb90e3644..cc4eabce092d 100644
>> --- a/kernel/locking/qspinlock_paravirt.h
>> +++ b/kernel/locking/qspinlock_paravirt.h
>> @@ -240,9 +240,10 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
>>   	struct pv_node *node;
>>   
>>   	for_each_hash_entry(he, offset, hash) {
>> -		if (READ_ONCE(he->lock) == lock) {
>> +		struct qspinlock *old = lock;
>> +
>> +		if (try_cmpxchg(&he->lock, &old, NULL)) {
>>   			node = READ_ONCE(he->node);
>> -			WRITE_ONCE(he->lock, NULL);
>>   			return node;
>>   		}
>>   	}
> But this change might delay the load of he->node until *after* he->lock has
> returned to NULL.
>
> Let's get back to the current code (with labels added to the ONCE accesses):
>
> 	for_each_hash_entry(he, offset, hash) {
> 		if (READ_ONCE(he->lock) == lock) {   /* A */
> 			node = READ_ONCE(he->node);  /* B */
> 			WRITE_ONCE(he->lock, NULL);  /* C */
> 			return node;
> 		}
> 	}
>
> It looks to me you want to guarantee the ordering of A -> B -> C.
>
> Your change effectively provides ordering of [A C] -> B.
> [A C] is done in an atomic RMW op.
>
> For the ordering of A -> B -> C, I'd change the code to
>
> 	for_each_hash_entry(he, offset, hash) {
> 		if (smp_load_acquire(&he->lock) == lock) {   /* A */
> 			node = READ_ONCE(he->node);          /* B */
> 			smp_store_release(&he->lock, NULL);  /* C */
> 			return node;
> 		}
> 	}
>
> Note: A -> B is load-to-load ordering, and it is not provided by
> control dependency.  Upgrading A to ACQUIRE is a lightweight option.

If the Loongson CPU, like PowerPC, does not provide load-to-load ordering
across a control dependency, the proper way is to provide its own version of
the smp_acquire__after_ctrl_dep() helper and use it to provide the acquire
barrier, instead of using smp_load_acquire() here.
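
An untested sketch of what that could look like on the unhash side (the
generic smp_acquire__after_ctrl_dep() falls back to smp_rmb(); Akira's
release is kept for C):

	static struct pv_node *pv_unhash(struct qspinlock *lock)
	{
		unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
		struct pv_hash_entry *he;
		struct pv_node *node;

		for_each_hash_entry(he, offset, hash) {
			if (READ_ONCE(he->lock) == lock) {		/* A */
				/* A -> B is load-to-load; the ctrl dep alone is not enough */
				smp_acquire__after_ctrl_dep();
				node = READ_ONCE(he->node);		/* B */
				smp_store_release(&he->lock, NULL);	/* C */
				return node;
			}
		}

		/* The lock must be hashed; not finding it is a bug. */
		BUG();
	}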

Cheers,
Longman