[v2] locking/osq_lock: Avoid false sharing in optimistic_spin_node

[PATCH v2] locking/osq_lock: Avoid false sharing in optimistic_spin_node

Posted by Zeng Heng 2 years, 1 month ago

Using the UnixBench test suite, we clearly find that osq_lock() cause
extremely high overheads with perf tool in the File Copy items:

Overhead  Shared Object            Symbol
  94.25%  [kernel]                 [k] osq_lock
   0.74%  [kernel]                 [k] rwsem_spin_on_owner
   0.32%  [kernel]                 [k] filemap_get_read_batch

In response to this, we conducted an analysis and made some gains:

In the prologue of osq_lock(), it set `cpu` member of percpu struct
optimistic_spin_node with the local cpu id, after that the value of the
percpu struct would never change in fact. Based on that, we can regard
the `cpu` member as a constant variable.

In the meanwhile, other members of the percpu struct like next, prev and
locked are frequently modified by osq_lock() and osq_unlock() which are
called by rwsem, mutex and so on. However, that would invalidate the cache
of the cpu member on other CPUs.

Therefore, we can place padding here and split them into different cache
lines to avoid cache misses when the next CPU is spinning to check other
node's cpu member by vcpu_is_preempted().

Here provide the UnixBench full-core test result based on v6.6 as below:
Machine Intel(R) Xeon(R) Gold 6248 CPU, 40 cores, 80 threads
Run the command of "./Run -c 80 -i 3" over 20 times and take the average.

System Benchmarks Index Values           Without Patch   With Patch     Diff
Dhrystone 2 using register variables         185518.38    185329.56   -0.10%
Double-Precision Whetstone                    79330.46     79268.22   -0.08%
Execl Throughput                               9725.14     10390.18    6.84%
File Copy 1024 bufsize 2000 maxblocks          1658.42      2035.55   22.74%
File Copy 256 bufsize 500 maxblocks            1086.54      1316.96   21.21%
File Copy 4096 bufsize 8000 maxblocks          3610.42      4152.79   15.02%
Pipe Throughput                               69325.18     69913.85    0.85%
Pipe-based Context Switching                  14026.32     14703.07    4.82%
Process Creation                               8329.94      8355.31    0.30%
Shell Scripts (1 concurrent)                  38942.41     41518.39    6.61%
Shell Scripts (8 concurrent)                  37762.35     40224.49    6.52%
System Call Overhead                           4064.44      4004.45   -1.48%
                                                                    ========
System Benchmarks Index Score                 13634.17     14560.71    6.80%

Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
v1->v2: fix compile issue

 include/linux/osq_lock.h  | 10 +++++++++-
 kernel/locking/osq_lock.c |  8 +++++++-
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 5581dbd3bd34..1883c31bf536 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -2,6 +2,8 @@
 #ifndef __LINUX_OSQ_LOCK_H
 #define __LINUX_OSQ_LOCK_H

+#include <linux/cache.h>
+
 /*
  * An MCS like lock especially tailored for optimistic spinning for sleeping
  * lock implementations (mutex, rwsem, etc).
@@ -9,7 +11,13 @@
 struct optimistic_spin_node {
 	struct optimistic_spin_node *next, *prev;
 	int locked; /* 1 if lock acquired */
-	int cpu; /* encoded CPU # + 1 value */
+
+	CACHELINE_PADDING(_pad1_);
+	/*
+	 * Stores an encoded CPU # + 1 value.
+	 * Only read by other cpus, so split into different cache lines.
+	 */
+	int cpu;
 };

 struct optimistic_spin_queue {
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d5610ad52b92..17618d62343f 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -96,7 +96,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)

 	node->locked = 0;
 	node->next = NULL;
-	node->cpu = curr;
+	/*
+	 * After this cpu member is initialized for the first time, it
+	 * would no longer change in fact. That could avoid cache misses
+	 * when spin and access the cpu member by other CPUs.
+	 */
+	if (node->cpu != curr)
+		node->cpu = curr;

 	/*
 	 * We need both ACQUIRE (pairs with corresponding RELEASE in
--
2.25.1

Re: [PATCH v2] locking/osq_lock: Avoid false sharing in optimistic_spin_node

Posted by kernel test robot 2 years, 1 month ago


Hello,

kernel test robot noticed a 15.3% improvement of fxmark.ssd_xfs_DWOM_72_bufferedio.works/sec on:


commit: 511ac0a137f4211f42aa2ba168e50550b703bb7c ("[PATCH v2] locking/osq_lock: Avoid false sharing in optimistic_spin_node")
url: https://github.com/intel-lab-lkp/linux/commits/Zeng-Heng/locking-osq_lock-Avoid-false-sharing-in-optimistic_spin_node/20231222-200921
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git a51749ab34d9e5dec548fe38ede7e01e8bb26454
patch link: https://lore.kernel.org/all/20231222121040.2635879-1-zengheng4@huawei.com/
patch subject: [PATCH v2] locking/osq_lock: Avoid false sharing in optimistic_spin_node

testcase: fxmark
test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
parameters:

	disk: 1SSD
	media: ssd
	test: DWOM
	fstype: xfs
	directio: bufferedio
	cpufreq_governor: performance






Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240105/202401051804.77722270-oliver.sang@intel.com

=========================================================================================
compiler/cpufreq_governor/directio/disk/fstype/kconfig/media/rootfs/tbox_group/test/testcase:
  gcc-12/performance/bufferedio/1SSD/xfs/x86_64-rhel-8.3/ssd/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp5/DWOM/fxmark

commit: 
  a51749ab34 ("locking/mutex: Document that mutex_unlock() is non-atomic")
  511ac0a137 ("locking/osq_lock: Avoid false sharing in optimistic_spin_node")

a51749ab34d9e5de 511ac0a137f4211f42aa2ba168e 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
     19727 ±  6%     -13.5%      17073 ±  6%  vmstat.system.cs
     19254 ±  7%     -13.4%      16678 ±  7%  perf-stat.i.context-switches
     19499 ±  6%     -13.6%      16851 ±  6%  perf-stat.ps.context-switches
    105.12 ± 13%    +115.9%     227.00 ±  9%  perf-c2c.DRAM.local
      5028 ±  2%      -9.1%       4571 ±  2%  perf-c2c.DRAM.remote
      3526 ±  3%     -13.8%       3040 ±  3%  perf-c2c.HITM.remote
  77565522 ±  2%     +11.2%   86238276 ±  3%  fxmark.ssd_xfs_DWOM_18_bufferedio.works
   1551311 ±  2%     +11.2%    1724766 ±  3%  fxmark.ssd_xfs_DWOM_18_bufferedio.works/sec
  61955316           +12.3%   69564863        fxmark.ssd_xfs_DWOM_36_bufferedio.works
   1239107           +12.3%    1391298        fxmark.ssd_xfs_DWOM_36_bufferedio.works/sec
      2.21 ±  4%      +9.2%       2.41 ±  2%  fxmark.ssd_xfs_DWOM_4_bufferedio.user_sec
      1.10 ±  4%      +9.3%       1.20 ±  2%  fxmark.ssd_xfs_DWOM_4_bufferedio.user_util
  46595523           +16.2%   54146121 ±  2%  fxmark.ssd_xfs_DWOM_54_bufferedio.works
    931911           +16.2%    1082933 ±  2%  fxmark.ssd_xfs_DWOM_54_bufferedio.works/sec
  44518641           +15.3%   51332276        fxmark.ssd_xfs_DWOM_72_bufferedio.works
    890372           +15.3%    1026646        fxmark.ssd_xfs_DWOM_72_bufferedio.works/sec




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

RE: [PATCH v2] locking/osq_lock: Avoid false sharing in optimistic_spin_node

Posted by David Laight 2 years, 1 month ago

From: Zeng Heng
> Sent: 22 December 2023 12:11
> 
> Using the UnixBench test suite, we clearly find that osq_lock() cause
> extremely high overheads with perf tool in the File Copy items:
> 
> Overhead  Shared Object            Symbol
>   94.25%  [kernel]                 [k] osq_lock
>    0.74%  [kernel]                 [k] rwsem_spin_on_owner
>    0.32%  [kernel]                 [k] filemap_get_read_batch
> 
> In response to this, we conducted an analysis and made some gains:
> 
> In the prologue of osq_lock(), it set `cpu` member of percpu struct
> optimistic_spin_node with the local cpu id, after that the value of the
> percpu struct would never change in fact. Based on that, we can regard
> the `cpu` member as a constant variable.
> 
...
> @@ -9,7 +11,13 @@
>  struct optimistic_spin_node {
>  	struct optimistic_spin_node *next, *prev;
>  	int locked; /* 1 if lock acquired */
> -	int cpu; /* encoded CPU # + 1 value */
> +
> +	CACHELINE_PADDING(_pad1_);
> +	/*
> +	 * Stores an encoded CPU # + 1 value.
> +	 * Only read by other cpus, so split into different cache lines.
> +	 */
> +	int cpu;
>  };

Isn't this structure embedded in every mutex and rwsem (etc)?
So that is a significant bloat especially on systems with
large cache lines.

Did you try just moving the initialisation of the per-cpu 'node'
below the first fast-path (uncontended) test in osq_lock()?

OTOH if you really have multiple cpu spinning on the same rwsem
perhaps the test and/or filemap code are really at fault!

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Re: [PATCH v2] locking/osq_lock: Avoid false sharing in optimistic_spin_node

Posted by Zeng Heng 2 years, 1 month ago

在 2023/12/22 20:40, David Laight 写道:
> From: Zeng Heng
>> Sent: 22 December 2023 12:11
>>
>> Using the UnixBench test suite, we clearly find that osq_lock() cause
>> extremely high overheads with perf tool in the File Copy items:
>>
>> Overhead  Shared Object            Symbol
>>    94.25%  [kernel]                 [k] osq_lock
>>     0.74%  [kernel]                 [k] rwsem_spin_on_owner
>>     0.32%  [kernel]                 [k] filemap_get_read_batch
>>
>> In response to this, we conducted an analysis and made some gains:
>>
>> In the prologue of osq_lock(), it set `cpu` member of percpu struct
>> optimistic_spin_node with the local cpu id, after that the value of the
>> percpu struct would never change in fact. Based on that, we can regard
>> the `cpu` member as a constant variable.
>>
> ...
>> @@ -9,7 +11,13 @@
>>   struct optimistic_spin_node {
>>   	struct optimistic_spin_node *next, *prev;
>>   	int locked; /* 1 if lock acquired */
>> -	int cpu; /* encoded CPU # + 1 value */
>> +
>> +	CACHELINE_PADDING(_pad1_);
>> +	/*
>> +	 * Stores an encoded CPU # + 1 value.
>> +	 * Only read by other cpus, so split into different cache lines.
>> +	 */
>> +	int cpu;
>>   };
> Isn't this structure embedded in every mutex and rwsem (etc)?
> So that is a significant bloat especially on systems with
> large cache lines.
>
> Did you try just moving the initialisation of the per-cpu 'node'
> below the first fast-path (uncontended) test in osq_lock()?
>
> OTOH if you really have multiple cpu spinning on the same rwsem
> perhaps the test and/or filemap code are really at fault!
>
> 	David

Hi，

The File Copy items of UnixBench testsuite are using 1 read file and 1 
write file

for file read/write/copy test. In multi-parallel scenario, that would 
lead to high

file lock contention.

That is just a performance test suite and has nothing to do with whether 
the user

program design is correct or not.


B.R.,

Zeng Heng

RE: [PATCH v2] locking/osq_lock: Avoid false sharing in optimistic_spin_node

Posted by David Laight 2 years, 1 month ago

From: Zeng Heng
> Sent: 23 December 2023 08:55
> 
> 在 2023/12/22 20:40, David Laight 写道:
> > From: Zeng Heng
> >> Sent: 22 December 2023 12:11
> >>
> >> Using the UnixBench test suite, we clearly find that osq_lock() cause
> >> extremely high overheads with perf tool in the File Copy items:
> >>
> >> Overhead  Shared Object            Symbol
> >>    94.25%  [kernel]                 [k] osq_lock
> >>     0.74%  [kernel]                 [k] rwsem_spin_on_owner
> >>     0.32%  [kernel]                 [k] filemap_get_read_batch
> >>
> >> In response to this, we conducted an analysis and made some gains:
> >>
> >> In the prologue of osq_lock(), it set `cpu` member of percpu struct
> >> optimistic_spin_node with the local cpu id, after that the value of the
> >> percpu struct would never change in fact. Based on that, we can regard
> >> the `cpu` member as a constant variable.
> >>
> > ...
> >> @@ -9,7 +11,13 @@
> >>   struct optimistic_spin_node {
> >>   	struct optimistic_spin_node *next, *prev;
> >>   	int locked; /* 1 if lock acquired */
> >> -	int cpu; /* encoded CPU # + 1 value */
> >> +
> >> +	CACHELINE_PADDING(_pad1_);
> >> +	/*
> >> +	 * Stores an encoded CPU # + 1 value.
> >> +	 * Only read by other cpus, so split into different cache lines.
> >> +	 */
> >> +	int cpu;
> >>   };
> > Isn't this structure embedded in every mutex and rwsem (etc)?
> > So that is a significant bloat especially on systems with
> > large cache lines.

This code is making my head hurt :-)
The 'spin_node' does only exist per-cpu.

> > Did you try just moving the initialisation of the per-cpu 'node'
> > below the first fast-path (uncontended) test in osq_lock()?

Reading more closely they do need to be valid before the fast-path
cmpxchg.
But I suspect the 'cache line dirty' could be done conditionally
or in the unlock/fail path.

I think the unlock fast-path always has node->next == NULL and it is
set to NULL in the slow path.
The lock-fail path calls osq_wait_next() - which also NULLs it.
So maybe it is always NULL on entry anyway?

node->locked is set by the slow-path lock code.
So could be cleared when checked or any time before the unlock returns.
Possibly unconditionally in the unlock slow path and conditionally
in the unlock fast path?

I think that would mean the assignment to node in osq_lock() could
be moved below the first xchg() (provided 'node' can be initialised).

I also wonder what the performance difference is between
smp_processor_id() and this_cpu_ptr(&osq_node)?

Bloating all the mutex/rwsem by 4 bytes (on 64bit) and changing
lock->tail to 'struct optimistic_spin_node *' (and moving it's
definition into the .c file) may well improve performance?


> >
> > OTOH if you really have multiple cpu spinning on the same rwsem
> > perhaps the test and/or filemap code are really at fault!
> >
> > 	David
> 
> Hi，
> 
> The File Copy items of UnixBench testsuite are using 1 read file and 1
> write file
> 
> for file read/write/copy test. In multi-parallel scenario, that would
> lead to high file lock contention.
> 
> That is just a performance test suite and has nothing to do with whether
> the user program design is correct or not.

But it might be stressing some code paths that don't usually happen.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)