[PATCH] locking/osq_lock: Avoid false sharing in optimistic_spin_node
Posted by Zeng Heng 1 year, 12 months ago
Using the UnixBench test suite, we find with the perf tool that osq_lock()
causes extremely high overhead in the File Copy items:

Overhead  Shared Object            Symbol
  94.25%  [kernel]                 [k] osq_lock
   0.74%  [kernel]                 [k] rwsem_spin_on_owner
   0.32%  [kernel]                 [k] filemap_get_read_batch

We analyzed this and made the following observations:

In the prologue of osq_lock(), the `cpu` member of the per-CPU struct
optimistic_spin_node is set to the local CPU id; after that, its value
never actually changes. In effect, the `cpu` member can be treated as a
constant.

Meanwhile, the other members of the per-CPU struct (next, prev and
locked) are frequently modified by osq_lock() and osq_unlock(), which are
called by rwsem, mutex and so on. Those writes invalidate the cached copy
of the `cpu` member on other CPUs.

Therefore, pad the structure so that the `cpu` member sits in its own
cache line. This avoids cache misses when the next CPU spins and checks
the other node's `cpu` member via vcpu_is_preempted().
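
For reference, the spin path in question looks roughly like the following
(a simplified paraphrase of the wait loop in kernel/locking/osq_lock.c,
not the exact code):

	/*
	 * Each spinner waits on its own node->locked, but on every
	 * iteration it also reads the previous node's cpu member via
	 * vcpu_is_preempted(). Any store to next/prev/locked sharing
	 * that cache line forces a cache miss on this read.
	 */
	while (!READ_ONCE(node->locked)) {
		if (need_resched() ||
		    vcpu_is_preempted(node_cpu(node->prev)))
			goto unqueue;	/* give up and unwind the queue */
		cpu_relax();
	}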

Below are the UnixBench full-core test results based on v6.6:
Machine: Intel(R) Xeon(R) Gold 6248 CPU, 40 cores, 80 threads
The command "./Run -c 80 -i 3" was run over 20 times and the average taken.

System Benchmarks Index Values           Without Patch   With Patch     Diff
Dhrystone 2 using register variables         185518.38    185329.56   -0.10%
Double-Precision Whetstone                    79330.46     79268.22   -0.08%
Execl Throughput                               9725.14     10390.18    6.84%
File Copy 1024 bufsize 2000 maxblocks          1658.42      2035.55   22.74%
File Copy 256 bufsize 500 maxblocks            1086.54      1316.96   21.21%
File Copy 4096 bufsize 8000 maxblocks          3610.42      4152.79   15.02%
Pipe Throughput                               69325.18     69913.85    0.85%
Pipe-based Context Switching                  14026.32     14703.07    4.82%
Process Creation                               8329.94      8355.31    0.30%
Shell Scripts (1 concurrent)                  38942.41     41518.39    6.61%
Shell Scripts (8 concurrent)                  37762.35     40224.49    6.52%
System Call Overhead                           4064.44      4004.45   -1.48%
                                                                    ========
System Benchmarks Index Score                 13634.17     14560.71    6.80%

Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 include/linux/osq_lock.h  | 6 +++++-
 kernel/locking/osq_lock.c | 8 +++++++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 5581dbd3bd34..f33f47ec0c90 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -9,7 +9,11 @@
 struct optimistic_spin_node {
 	struct optimistic_spin_node *next, *prev;
 	int locked; /* 1 if lock acquired */
-	int cpu; /* encoded CPU # + 1 value */
+	/*
+	 * Stores an encoded CPU # + 1 value.
+	 * Only read by other cpus, so split into different cache lines.
+	 */
+	int cpu ____cacheline_aligned;
 };
 
 struct optimistic_spin_queue {
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d5610ad52b92..17618d62343f 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -96,7 +96,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 
 	node->locked = 0;
 	node->next = NULL;
-	node->cpu = curr;
+	/*
+	 * Once the cpu member has been initialized, it never actually
+	 * changes again. Skipping the redundant store avoids invalidating
+	 * the cache line that other spinning CPUs read it from.
+	 */
+	if (node->cpu != curr)
+		node->cpu = curr;
 
 	/*
 	 * We need both ACQUIRE (pairs with corresponding RELEASE in
-- 
2.25.1
Re: [PATCH] locking/osq_lock: Avoid false sharing in optimistic_spin_node
Posted by kernel test robot 1 year, 12 months ago
Hi Zeng,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/locking/core]
[also build test ERROR on linus/master v6.7-rc6 next-20231220]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Zeng-Heng/locking-osq_lock-Avoid-false-sharing-in-optimistic_spin_node/20231220-150010
base:   tip/locking/core
patch link:    https://lore.kernel.org/r/20231220070204.2662730-1-zengheng4%40huawei.com
patch subject: [PATCH] locking/osq_lock: Avoid false sharing in optimistic_spin_node
config: arm-defconfig (https://download.01.org/0day-ci/archive/20231221/202312210155.Wc2HUK8C-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project.git f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231221/202312210155.Wc2HUK8C-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312210155.Wc2HUK8C-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from arch/arm/kernel/asm-offsets.c:11:
   In file included from include/linux/sched.h:15:
   In file included from include/linux/sem.h:5:
   In file included from include/uapi/linux/sem.h:5:
   In file included from include/linux/ipc.h:7:
   In file included from include/linux/rhashtable-types.h:14:
   In file included from include/linux/mutex.h:20:
>> include/linux/osq_lock.h:16:9: error: expected ';' at end of declaration list
           int cpu ____cacheline_aligned;
                  ^
                  ;

vim +16 include/linux/osq_lock.h

     4	
     5	/*
     6	 * An MCS like lock especially tailored for optimistic spinning for sleeping
     7	 * lock implementations (mutex, rwsem, etc).
     8	 */
     9	struct optimistic_spin_node {
    10		struct optimistic_spin_node *next, *prev;
    11		int locked; /* 1 if lock acquired */
    12		/*
    13		 * Stores an encoded CPU # + 1 value.
    14		 * Only read by other cpus, so split into different cache lines.
    15		 */
  > 16		int cpu ____cacheline_aligned;
    17	};
    18	
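
For reference (this explanation is not part of the report above, so treat
it as an assumption): ____cacheline_aligned is provided by
<linux/cache.h>, roughly as:

	#ifndef ____cacheline_aligned
	#define ____cacheline_aligned __attribute__((__aligned__(SMP_CACHE_BYTES)))
	#endif

so the build error suggests osq_lock.h would likely need an explicit
#include <linux/cache.h> on configurations where that header is not
already pulled in indirectly.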

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [PATCH] locking/osq_lock: Avoid false sharing in optimistic_spin_node
Posted by Waiman Long 1 year, 12 months ago
On 12/20/23 02:02, Zeng Heng wrote:
> Using the UnixBench test suite, we clearly find that osq_lock() cause
> extremely high overheads with perf tool in the File Copy items:
>
> Overhead  Shared Object            Symbol
>    94.25%  [kernel]                 [k] osq_lock
>     0.74%  [kernel]                 [k] rwsem_spin_on_owner
>     0.32%  [kernel]                 [k] filemap_get_read_batch
>
> In response to this, we conducted an analysis and made some gains:
>
> In the prologue of osq_lock(), it set `cpu` member of percpu struct
> optimistic_spin_node with the local cpu id, after that the value of the
> percpu struct would never change in fact. Based on that, we can regard
> the `cpu` member as a constant variable.
>
> In the meanwhile, other members of the percpu struct like next, prev and
> locked are frequently modified by osq_lock() and osq_unlock() which are
> called by rwsem, mutex and so on. However, that would invalidate the cache
> of the cpu member on other CPUs.
>
> Therefore, we can place padding here and split them into different cache
> lines to avoid cache misses when the next CPU is spinning to check other
> node's cpu member by vcpu_is_preempted().
>
> Here provide the UnixBench full-core test result based on v6.6 as below:
> Machine Intel(R) Xeon(R) Gold 6248 CPU, 40 cores, 80 threads
> Run the command of "./Run -c 80 -i 3" over 20 times and take the average.
>
> System Benchmarks Index Values           Without Patch   With Patch     Diff
> Dhrystone 2 using register variables         185518.38    185329.56   -0.10%
> Double-Precision Whetstone                    79330.46     79268.22   -0.08%
> Execl Throughput                               9725.14     10390.18    6.84%
> File Copy 1024 bufsize 2000 maxblocks          1658.42      2035.55   22.74%
> File Copy 256 bufsize 500 maxblocks            1086.54      1316.96   21.21%
> File Copy 4096 bufsize 8000 maxblocks          3610.42      4152.79   15.02%
> Pipe Throughput                               69325.18     69913.85    0.85%
> Pipe-based Context Switching                  14026.32     14703.07    4.82%
> Process Creation                               8329.94      8355.31    0.30%
> Shell Scripts (1 concurrent)                  38942.41     41518.39    6.61%
> Shell Scripts (8 concurrent)                  37762.35     40224.49    6.52%
> System Call Overhead                           4064.44      4004.45   -1.48%
>                                                                      ========
> System Benchmarks Index Score                 13634.17     14560.71    6.80%
>
> Signed-off-by: Zeng Heng <zengheng4@huawei.com>
> ---
>   include/linux/osq_lock.h  | 6 +++++-
>   kernel/locking/osq_lock.c | 8 +++++++-
>   2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
> index 5581dbd3bd34..f33f47ec0c90 100644
> --- a/include/linux/osq_lock.h
> +++ b/include/linux/osq_lock.h
> @@ -9,7 +9,11 @@
>   struct optimistic_spin_node {
>   	struct optimistic_spin_node *next, *prev;
>   	int locked; /* 1 if lock acquired */
> -	int cpu; /* encoded CPU # + 1 value */
> +	/*
> +	 * Stores an encoded CPU # + 1 value.
> +	 * Only read by other cpus, so split into different cache lines.
> +	 */
> +	int cpu ____cacheline_aligned;
>   };
>   
>   struct optimistic_spin_queue {
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d5610ad52b92..17618d62343f 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -96,7 +96,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>   
>   	node->locked = 0;
>   	node->next = NULL;
> -	node->cpu = curr;
> +	/*
> +	 * After this cpu member is initialized for the first time, it
> +	 * would no longer change in fact. That could avoid cache misses
> +	 * when spin and access the cpu member by other CPUs.
> +	 */
> +	if (node->cpu != curr)
> +		node->cpu = curr;
>   
>   	/*
>   	 * We need both ACQUIRE (pairs with corresponding RELEASE in

The contention on prev->cpu is due to the vcpu_is_preempted() call in 
osq_lock(). Could you try the attached patch to see if that helps?
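
(For reference, node_cpu() in kernel/locking/osq_lock.c decodes the
stored "CPU # + 1" value, roughly:

	static inline int node_cpu(struct optimistic_spin_node *node)
	{
		return node->cpu - 1;	/* decode the encoded CPU # + 1 */
	}

so every vcpu_is_preempted(node_cpu(node->prev)) call loads prev->cpu and,
with it, whatever else shares that cache line.)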

Thanks,
Longman

Re: [PATCH] locking/osq_lock: Avoid false sharing in optimistic_spin_node
Posted by Zeng Heng 1 year, 12 months ago
On 2023/12/21 0:58, Waiman Long wrote:
>
> On 12/20/23 02:02, Zeng Heng wrote:
>> Using the UnixBench test suite, we clearly find that osq_lock() cause
>> extremely high overheads with perf tool in the File Copy items:
>>
>> Overhead  Shared Object            Symbol
>>    94.25%  [kernel]                 [k] osq_lock
>>     0.74%  [kernel]                 [k] rwsem_spin_on_owner
>>     0.32%  [kernel]                 [k] filemap_get_read_batch
>>
>> In response to this, we conducted an analysis and made some gains:
>>
>> In the prologue of osq_lock(), it set `cpu` member of percpu struct
>> optimistic_spin_node with the local cpu id, after that the value of the
>> percpu struct would never change in fact. Based on that, we can regard
>> the `cpu` member as a constant variable.
>>
>> In the meanwhile, other members of the percpu struct like next, prev and
>> locked are frequently modified by osq_lock() and osq_unlock() which are
>> called by rwsem, mutex and so on. However, that would invalidate the 
>> cache
>> of the cpu member on other CPUs.
>>
>> Therefore, we can place padding here and split them into different cache
>> lines to avoid cache misses when the next CPU is spinning to check other
>> node's cpu member by vcpu_is_preempted().
>>
>> Here provide the UnixBench full-core test result based on v6.6 as below:
>> Machine Intel(R) Xeon(R) Gold 6248 CPU, 40 cores, 80 threads
>> Run the command of "./Run -c 80 -i 3" over 20 times and take the 
>> average.
>>
>> System Benchmarks Index Values           Without Patch   With Patch     Diff
>> Dhrystone 2 using register variables         185518.38    185329.56   -0.10%
>> Double-Precision Whetstone                    79330.46     79268.22   -0.08%
>> Execl Throughput                               9725.14     10390.18    6.84%
>> File Copy 1024 bufsize 2000 maxblocks          1658.42      2035.55   22.74%
>> File Copy 256 bufsize 500 maxblocks            1086.54      1316.96   21.21%
>> File Copy 4096 bufsize 8000 maxblocks          3610.42      4152.79   15.02%
>> Pipe Throughput                               69325.18     69913.85    0.85%
>> Pipe-based Context Switching                  14026.32     14703.07    4.82%
>> Process Creation                               8329.94      8355.31    0.30%
>> Shell Scripts (1 concurrent)                  38942.41     41518.39    6.61%
>> Shell Scripts (8 concurrent)                  37762.35     40224.49    6.52%
>> System Call Overhead                           4064.44      4004.45   -1.48%
>>                                                                     ========
>> System Benchmarks Index Score                 13634.17     14560.71    6.80%
>>
>> Signed-off-by: Zeng Heng <zengheng4@huawei.com>
>> ---
>>   include/linux/osq_lock.h  | 6 +++++-
>>   kernel/locking/osq_lock.c | 8 +++++++-
>>   2 files changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
>> index 5581dbd3bd34..f33f47ec0c90 100644
>> --- a/include/linux/osq_lock.h
>> +++ b/include/linux/osq_lock.h
>> @@ -9,7 +9,11 @@
>>   struct optimistic_spin_node {
>>       struct optimistic_spin_node *next, *prev;
>>       int locked; /* 1 if lock acquired */
>> -    int cpu; /* encoded CPU # + 1 value */
>> +    /*
>> +     * Stores an encoded CPU # + 1 value.
>> +     * Only read by other cpus, so split into different cache lines.
>> +     */
>> +    int cpu ____cacheline_aligned;
>>   };
>>     struct optimistic_spin_queue {
>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>> index d5610ad52b92..17618d62343f 100644
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -96,7 +96,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>>         node->locked = 0;
>>       node->next = NULL;
>> -    node->cpu = curr;
>> +    /*
>> +     * After this cpu member is initialized for the first time, it
>> +     * would no longer change in fact. That could avoid cache misses
>> +     * when spin and access the cpu member by other CPUs.
>> +     */
>> +    if (node->cpu != curr)
>> +        node->cpu = curr;
>>         /*
>>        * We need both ACQUIRE (pairs with corresponding RELEASE in
>
> The contention on prev->cpu is due to the vcpu_is_preempted() call in 
> osq_lock(). Could you try the attached patch to see if that helps?
>
> Thanks,
> Longman
>
Hi Waiman,

I applied the patch you sent and rebooted before each UnixBench test.

These are the results averaged over 18 runs, for your information:

Dhrystone 2 using register variables        181176.4667
Double-Precision Whetstone                   77940.39444
Execl Throughput                             10137.1
File Copy 1024 bufsize 2000 maxblocks         1966.116667
File Copy 256 bufsize 500 maxblocks           1293.883333
File Copy 4096 bufsize 8000 maxblocks         4046.594444
Pipe Throughput                              67955.76111
Pipe-based Context Switching                 13385.50556
Process Creation                              8466.627778
Shell Scripts (1 concurrent)                 40251.48889
Shell Scripts (8 concurrent)                 39259.65556
System Call Overhead                          4060.766667
System Benchmarks Index Score                14251.95

B.R.,

Zeng Heng