[v2] locking/osq_lock: Update osq_lock to dynamic

[PATCH v2 0/3] locking/osq_lock: Update osq_lock to dynamic
Posted by yongli-oc 1 month, 4 weeks ago
    This patch series changes the osq lock to 2 bytes when
    CONFIG_LOCK_SPIN_ON_OWNER=y, fixes some coding problems, and
    adds more comments.
    Since the 2-byte and the 4-byte osq lock both access the same
    cacheline with LOCK# asserted, the speed should be essentially
    the same.
    To compare the performance of the two kinds of osq lock,
    I used locktorture and set the CPU affinity of each mutex_lock
    writer kthread. Each result is the average of 9 test runs.

    locktorture, CPU affinity set  AMD EPYC 7551 32-core, 2 sockets
    Writers 6.6.28  6.6.28-osq-dynamic disable 6.6.28-osq-dynamic enable
    stress  4-byte          2-byte tail               2-byte tail    
         Average CV      Average   CV  Improve    Average   CV   Improve
     1  21047265 3.48%  21331993 12.16%  1.35%   21359519  6.76%   1.48%
     2  39186677 5.66%  40348197  6.44%  2.96%   39387961  4.18%   0.51%
     4  43467264 3.63%  44133849  4.95%  1.53%   38961218  7.01% -10.37%
     8  43780445 6.67%  48887433  3.31% 11.66%   41725007  5.29%  -4.69%
    16  41407176 4.19%  51042178  3.45% 23.27%   71381112  2.75%  72.39%
    32  46000746 6.63%  50060246 14.19%  8.82%   79361487  2.29%  72.52%
    48  44235011 5.20%  44988160  7.22%  1.70%   79779501  4.88%  80.35%
    64  59054128 4.02%  62233006  2.00%  5.38%  112695286  7.42%  90.83%

    With 1, 2, or 4 threads, the 2-byte osq lock performs nearly the
    same as the 4-byte lock. With 8, 16, or 32 threads it performs
    better than the 4-byte lock; with more threads, the performance
    again tends to be the same. With dynamic switching turned on, the
    2-byte lock shows a small degradation at 4 and 8 threads. This may
    be because lock contention accumulates but never satisfies the
    switching conditions. With 16 threads or more, the improvement
    ranges from about 72 to 91 percent.
    
    v1:
    The dynamic numa-aware osq_lock supports NUMA architectures and is
    based on the kernel in-box osq_lock.
    After it is enabled by echo 1 > /proc/zx_numa_lock/dynamic_enable,
    the patch keeps checking how many processes are waiting for an osq
    lock. If more than a threshold are waiting, the patch stops these
    processes, switches to the numa-aware osq lock, then restarts them.
    A cleanup work queue turns the numa-aware osq lock back into
    osq_lock once all nodes are unlocked, and all the numa-aware lock
    memory returns to the pre-allocated Linux kernel memory cache.
    
    The struct optimistic_spin_queue of the dynamic numa-aware lock is
    also 4 bytes, the same as the in-box osq lock. If dynamic switching
    is enabled, it is accessed as three members through a union. The
    tail is tail16, 2 bytes, which supports 65534 CPU cores. The other
    two members hold the lock switch state and the numa memory index,
    1 byte each.
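
    The layout described above can be pictured with a small user-space
    sketch. It is not the code from the patch: only tail16 is named in
    this letter, so the union name and the members whole, state and
    index are hypothetical, and uint32_t stands in for the kernel's
    own 4-byte type.

```c
#include <assert.h>
#include <stdint.h>

/*
 * User-space sketch of the 4-byte lock word described above.
 * Only tail16 is named in the text; "whole", "state" and "index"
 * are hypothetical names chosen for this illustration.
 */
union osq_sketch {
	uint32_t whole;			/* 4-byte view of the whole lock word */
	struct {
		uint16_t tail16;	/* 2-byte queue tail */
		uint8_t state;		/* lock switch state */
		uint8_t index;		/* numa memory index */
	};
};

/* The union must not grow beyond the 4 bytes of the in-box lock. */
_Static_assert(sizeof(union osq_sketch) == 4, "osq_sketch must be 4 bytes");

/* Clear all three members at once through the 4-byte view. */
static inline void osq_sketch_init(union osq_sketch *lock)
{
	assert(lock);
	lock->whole = 0;
}
```

    Keeping the three members inside one 4-byte word means code that
    only knows the 4-byte view still sees an object of the same size.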
    
    A serial number is added to struct optimistic_spin_node to track
    how many processes are waiting for an osq lock. Each time a process
    applies for an osq lock, the serial increases by 1.
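
    One way a per-node serial could yield a waiter count is sketched
    below. This is an assumption made for illustration, not the logic
    of the patch: the helper names, the 32-bit width, and the idea of
    subtracting the head's serial from the tail's serial are all
    hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* A newly queued node takes the previous tail's serial plus one. */
static inline uint32_t next_serial(uint32_t prev_tail_serial)
{
	return prev_tail_serial + 1;
}

/*
 * Number of nodes queued behind the head. Unsigned subtraction keeps
 * the result correct when the serial wraps around.
 */
static inline uint32_t waiters_between(uint32_t head_serial,
				       uint32_t tail_serial)
{
	return (uint32_t)(tail_serial - head_serial);
}

/* Hypothetical check: are more waiters queued than the threshold? */
static inline int should_switch(uint32_t head_serial, uint32_t tail_serial,
				uint32_t threshold)
{
	assert(threshold > 0);
	return waiters_between(head_serial, tail_serial) > threshold;
}
```

    With this scheme, a head serial of 10 and a tail serial of 20 means
    10 nodes are queued behind the head, so a threshold of 8 would be
    exceeded.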
    
    We have evaluated the performance of the dynamic numa-aware osq
    lock with perf, locktorture, unixbench, fxmark, etc.
    fxmark: Filesystem Multicore Scalability Benchmark
    https://github.com/sslab-gatech/fxmark
    
    The following results were measured on a Zhaoxin KH40000 32-core
    processor, a Zhaoxin KH40000 32+32-core two-socket system, and an
    AMD EPYC 7551 32-core two-socket system. Since I am not very
    familiar with AMD CPUs, the code to support AMD CPUs is a sample
    only.
    
    Each number under Average is the average of five test runs.
    CV is the coefficient of variation (standard deviation divided by
    the mean).
    The kernel source is 6.6.28 stable, compiled with the default
    configuration.
    6.6.28-osq-dynamic is kernel 6.6.28 with the patch applied and
    dynamic switching enabled.
    The OS is Ubuntu 22.04.02 LTS, gcc version 9.5.0.
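
    The Improve column in the tables is not defined in this letter. It
    appears to be the relative change of the patched-kernel average
    against the baseline average, which reproduces the posted figures.
    A minimal sketch under that assumption (the function name is
    hypothetical):

```c
#include <assert.h>

/*
 * Relative change of the patched-kernel average against the baseline
 * average, in percent. Positive means the patched kernel scored higher.
 */
static inline double improve_pct(double baseline, double patched)
{
	assert(baseline > 0.0);
	return (patched - baseline) / baseline * 100.0;
}
```

    For example, the Zhaoxin epoll ADD averages 25620 and 64609 give
    about 152.18%, and the Zhaoxin Process Creation averages 2325.12
    and 2178.16 give about -6.32%, matching the tables.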
    
    perf bench     Zhaoxin KH40000 32 cores
    kernel     6.6.28          6.6.28-osq-dynamic
    epoll  Average  CV       Average  CV    Improve
    ADD     25620  0.78%     64609  2.55%  152.18%
    WAIT     7134  1.77%     11098  0.52%   55.56%
    
    locktorture               Zhaoxin KH40000 32 cores
    kernel                 6.6.28            6.6.28-osq-dynamic
    lock torture        Average   CV     Average      CV    Improve
    mutex_lock Writes   7433503  1.59%   17979058   1.90%   141.87%
    
    unixbench             Zhaoxin KH40000 32+32 cores, run on ssd
    64 copies                         6.6.28           6.6.28-osq-dynamic
    System Benchmarks Partial   Average    CV     Average   CV     Improve
         Execl Throughput       1460.18   1.18%   1865.22  0.25%    27.74%
    File Copy 1024 bufsize 200   549.94   0.62%   1221.32  6.71%   122.08%
    File Copy 256 bufsize 500    339.62   2.20%    896.58  6.57%   164.00%
    File Copy 4096 bufsize 800  1173.68   1.88%   2089.7   5.20%    78.05%
         Pipe Throughput       52122.26   0.18%  53842.72  0.15%     3.30%
    Pipe-based Context Switchi 18340.38   0.92%  19874.66  0.80%     8.37%
         Process Creation       2325.12   0.18%   2178.16  0.21%    -6.32%
    Shell Scripts (1 concurren  7414.32   0.29%   8458.5   0.10%    14.08%
    Shell Scripts (16 concurre                                    
    Shell Scripts (8 concurren  7156.48   0.10%   8132.42  0.14%    13.64%
       System Call Overhead     1476.9    0.14%   1574.32  0.09%     6.60%
    System Benchmarks Index Sc  2982.64   0.33%   4008.66  0.94%    34.40%
    
    fxmark                Zhaoxin KH40000 32 cores, run on ssd (ssd, ext4)
    parallel cores              32                           24
              6.6.28  vs   6.6.28-osq-dynamic  6.6.28 vs 6.6.28-osq-dynamic
    item  Improve      Average,CV:Average,CV        Improve      CV:CV
    DWAL  -0.17% (  455895, 0.14%:  455115, 0.37%)   -0.07% ( 0.42%: 0.44%)
    DWOL   1.10% (32166648, 2.64%:32521877, 2.06%)   -0.68% ( 2.54%: 3.04%)
    DWOM  51.63% (  496955, 4.34%:  753509, 8.32%)   45.93% ( 3.14%: 2.57%)
    DWSL   1.67% (   20229, 2.34%:   20566, 3.18%)   -1.74% ( 1.96%: 2.66%)
    MWRL  71.00% (  348097, 0.92%:  595241, 1.26%)   65.95% ( 0.65%: 2.27%)
    MWRM  63.06% (    6750, 3.33%:   11007, 4.31%)   60.18% ( 5.67%: 4.81%)
    MWCL  16.99% (  149628, 1.66%:  175054, 0.82%)   16.96% ( 2.57%: 0.51%)
    MWCM  80.97% (    9448, 4.66%:   17098, 0.96%)   73.79% ( 5.37%: 1.79%)
    MWUM  37.73% (   16858, 3.13%:   23220, 3.42%)   31.16% ( 3.59%: 1.62%)
    MWUL  12.83% (   45275, 3.90%:   51083, 3.25%)   19.94% ( 4.19%: 1.98%)
    DWTL  41.44% (   85255, 5.01%:  120583, 9.83%)   45.07% ( 6.42%: 6.11%)
    MRPL  -2.63% (11448731, 1.91%:11147179, 4.18%)   -0.56% ( 1.65%: 3.33%)
    MRPM   0.29% ( 5423233, 1.77%: 5438929, 2.59%)  -10.54% (15.85%:16.42%)
    MRPH  -0.49% (  688629, 2.84%:  685266, 2.88%)  -18.99% (15.00%:24.98%)
    MRDM   8.42% ( 3662627, 0.76%: 3971133, 0.45%)    4.53% ( 1.77%: 1.72%)
    MRDL   6.25% (  530518, 2.75%:  563671, 5.33%)   12.43% (25.88%:26.91%)
    DRBH   7.16% (  388144, 7.88%:  415933,17.87%)  -20.61% (29.12%:21.91%)
    DRBM  -4.34% (  381710, 5.51%:  365159, 3.15%)  -16.93% (27.15%:29.85%)
    DRBL  -0.17% (46227341, 2.50%:46147935, 2.89%)   -4.03% ( 4.01%: 5.30%)
    
    fxmark      Zhaoxin KH40000 32 cores, run on ssd (ssd, ext4)
    parallel cores      2                        1         
    	6.6.28 vs 6.6.28-osq-dynamic  6.6.28 vs 6.6.28-osq-dynamic
    item     Improve    CV:CV          Improve    CV: CV
    DWAL     1.78%  (0.31%:0.20%)      6.36%  (2.52%: 0.67%)
    DWOL     2.46%  (2.26%:2.53%)      1.83%  (2.69%: 3.07%)
    DWOM     2.70%  (2.58%:3.12%)      2.22%  (2.67%: 3.79%)
    DWSL     3.28%  (2.90%:3.38%)      4.41%  (1.32%: 1.36%)
    MWRL    -0.76%  (1.46%:1.94%)     -0.82%  (2.04%: 2.32%)
    MWRM     1.94%  (4.38%:0.89%)     -2.05%  (4.07%: 5.16%)
    MWCL    -0.07%  (1.36%:3.84%)     -2.17%  (1.58%: 3.04%)
    MWCM     1.85%  (2.95%:4.68%)      0.28%  (0.45%: 2.48%)
    MWUM    -2.85%  (1.48%:2.01%)     -3.06%  (1.47%: 1.97%)
    MWUL    -1.46%  (0.58%:2.27%)     -2.98%  (0.71%: 2.11%)
    DWTL     0.40%  (3.89%:4.35%)     -2.68%  (4.04%: 3.15%)
    MRPL     3.11%  (1.38%:0.35%)     -4.81%  (0.32%:16.52%)
    MRPM     2.99%  (0.29%:1.19%)      3.50%  (0.56%: 0.78%)
    MRPH     3.01%  (1.10%:1.42%)      5.06%  (1.18%: 1.73%)
    MRDM    -1.67%  (4.59%:5.58%)     -3.30%  (0.23%: 8.01%)
    MRDL     1.94%  (1.56%:4.39%)     -0.55%  (0.88%: 9.57%)
    DRBH     7.24%  (7.07%:7.10%)      3.36%  (3.30%: 2.95%)
    DRBM     4.40%  (5.11%:0.74%)     -2.55%  (0.46%: 3.28%)
    DRBL     5.50%  (5.58%:0.30%)     -1.00%  (0.71%: 5.21%)
    
    (Some tests show a loss of more than 10%, but their CV is also
    more than 10%, so those results are not stable.)
    
    perf bench    AMD EPYC 7551 32-core, 2 sockets
    kernel     6.6.28          6.6.28-osq-dynamic
    epoll  Average  CV       Average  CV    Improve
    ADD     15258  2.30%      62160  2.40%  307.38%
    WAIT     3861  4.20%       6990 16.77%   81.03%
    
    locktorture           AMD EPYC 7551 32-core, 2 sockets
    kernel                 6.6.28            6.6.28-osq-dynamic
    lock torture        Average   CV     Average      CV    Improve
    mutex_lock Writes  10435064  3.14%   22627890   4.92%   116.84%
    
    unixbench     AMD EPYC 7551 32-core, 2 sockets. run on ramdisk
    64 copies                         6.6.28        6.6.28-osq-dynamic
    System Benchmarks Partial   Average   CV      Average   CV   Improve
         Execl Throughput       2677.18  0.90%    3451.76  0.22%  28.93%
    File Copy 1024 bufsize 200   815.2   0.59%    1999.54  0.36% 145.28%
    File Copy 256 bufsize 500    504.6   0.69%    1359.6   0.49% 169.44%
    File Copy 4096 bufsize 800  1842.76  1.24%    3236.48  1.40%  75.63%
         Pipe Throughput       57748.74  0.01%   57539.6   0.03%  -0.36%
    Pipe-based Context Switchi 20882.18  0.57%   20525.38  0.57%  -1.71%
         Process Creation       4523.98  0.20%    4784.98  0.10%   5.77%
    Shell Scripts (1 concurren 13136.54  0.06%   15883.6   0.35%  20.91%
    Shell Scripts (16 concurre                                   
    Shell Scripts (8 concurren 12883.82  0.14%   15640.32  0.20%  21.40%
       System Call Overhead     3533.74  0.04%    3544.16  0.02%   0.29%
    System Benchmarks Index Sc  4809.38  0.23%    6575.44  0.14%  36.72%
    
    fxmark  AMD EPYC 7551 32-core, 2 sockets. run on ramdisk (mem,tmpfs)
    parallel cores         64                                32
                6.6.28 vs 6.6.28-osq-dynamic   6.6.28 vs 6.6.28-osq-dynamic
    item  Improve   Average,   CV :  Average,   CV    Improve  CV  :   CV
    DWAL  -0.22% ( 24091112, 0.31%: 24038426, 0.52%)  -0.26% (0.10%: 0.12%)
    DWOL   2.21% ( 86569869, 0.36%: 88479947, 0.27%)   1.99% (0.41%: 0.29%)
    DWOM 210.41% (   425986, 0.77%:  1322320, 0.28%) 128.86% (0.59%: 0.46%)
    DWSL   1.27% ( 70260252, 0.39%: 71149334, 0.37%)   1.19% (0.31%: 0.22%)
    MWRL   0.85% (   489865, 0.22%:   494045, 0.25%)   2.29% (0.12%: 0.33%)
    MWRM  96.28% (   149042, 0.45%:   292540, 3.55%)  60.10% (2.49%: 0.38%)
    MWCL  -5.44% (   772582, 2.92%:   730585, 0.80%)   0.32% (2.41%: 2.56%)
    MWCM  53.89% (   153857, 1.92%:   236774, 0.46%)  23.84% (0.72%: 0.50%)
    MWUM  88.20% (   214551, 3.90%:   403790, 0.41%)  62.81% (0.80%: 1.12%)
    MWUL  -8.26% (   970810, 1.63%:   890615, 1.63%)  -6.73% (3.01%: 1.61%)
    DWTL   5.90% (  5522297, 0.49%:  5847951, 0.18%)   5.03% (0.44%: 0.08%)
    MRPL  -1.10% ( 39707577, 0.07%: 39268812, 0.03%)  -1.30% (0.18%: 0.07%)
    MRPM  -0.63% ( 16446350, 0.47%: 16341936, 0.40%)   0.45% (0.15%: 0.45%)
    MRPH  -0.03% (  3805484, 0.50%:  3804248, 0.12%)   3.02% (1.54%: 0.36%)
    MRDM  49.41% ( 20178742, 1.89%: 30148449, 1.01%)  17.58% (1.19%: 0.85%)
    MRDL  -1.95% (227253170, 0.48%:222825409, 1.34%)  -1.80% (0.32%: 0.54%)
    DRBH   6.01% (  1045587, 1.91%:  1108467, 0.64%)   0.12% (0.13%: 0.30%)
    DRBM   0.65% (117702744, 0.31%:118473408, 0.87%)   1.12% (0.25%: 1.18%)
    DRBL   0.93% (121770444, 0.42%:122905957, 0.25%)   1.59% (0.31%: 0.40%)
    
    fxmark  AMD EPYC 7551 32-core, 2 sockets. run on ramdisk (mem,tmpfs)
    parallel cores      2                        1         
    	6.6.28 vs 6.6.28-osq-dynamic  6.6.28 vs 6.6.28-osq-dynamic
    item     Improve    CV :  CV           Improve    CV :  CV
    DWAL    -0.74%   (0.33%:  0.19%)      -1.02%   (0.19%:  0.34%)
    DWOL     1.50%   (0.36%:  0.44%)       1.89%   (0.30%:  0.36%)
    DWOM    -2.00%   (0.73%:  0.38%)       2.43%   (0.35%:  0.29%)
    DWSL     1.03%   (0.34%:  0.54%)       1.18%   (0.46%:  0.61%)
    MWRL     0.93%   (0.39%:  0.18%)       2.25%   (1.28%:  1.78%)
    MWRM    -0.30%   (0.60%:  0.47%)       0.17%   (0.58%:  0.47%)
    MWCL    -1.28%   (0.41%:  0.66%)      -0.38%   (0.19%:  0.44%)
    MWCM    -1.23%   (0.36%:  0.23%)      -1.42%   (0.41%:  0.54%)
    MWUM    -2.28%   (0.57%:  0.75%)      -1.11%   (0.82%:  0.21%)
    MWUL    -1.87%   (0.64%:  0.50%)      -1.75%   (0.58%:  0.65%)
    DWTL     0.36%   (0.09%:  0.12%)       0.19%   (0.09%:  0.09%)
    MRPL    -1.45%   (0.37%:  0.31%)      -1.35%   (0.12%:  0.54%)
    MRPM    -0.58%   (0.30%:  0.11%)      -1.04%   (0.18%:  0.31%)
    MRPH     0.79%   (3.92%:  0.48%)      -0.53%   (0.68%:  0.33%)
    MRDM    -0.55%   (0.93%:  0.44%)      -0.13%   (0.43%:  0.67%)
    MRDL    -0.11%   (0.56%:  0.19%)       0.68%   (0.71%:  0.49%)
    DRBH     0.09%   (1.31%:  0.87%)       2.75%   (0.68%:  0.45%)
    DRBM     1.09%   (0.19%:  1.05%)       1.60%   (0.15%:  0.72%)
    DRBL     3.26%   (1.00%:  0.56%)       2.34%   (0.36%:  0.23%)

    The test results show that under heavy contention the dynamic
    numa-aware lock performs better than the in-box osq_lock. When
    few processes contend for a lock, the performance is nearly the
    same as the in-box osq_lock.

    ---
    Changes since v1 (based on Longman's review):
    #1 Change the bisection of the v1 patches.
    #2 Modify some code, such as definitions, special values in macros,
       and cpu_relax().
    #3 Add some comments.


yongli-oc (3):
  locking/osq_lock: The Kconfig for dynamic numa-aware osq lock.
  locking/osq_lock: Define osq by union to support dynamic numa-aware
    lock.
  locking/osq_lock: Turn from 2-byte osq_lock/unlock to numa
    lock/unlock.

 include/linux/osq_lock.h     |  33 ++-
 kernel/Kconfig.numalocks     |  17 ++
 kernel/locking/Makefile      |   3 +
 kernel/locking/numa.h        |  90 ++++++
 kernel/locking/numa_osq.h    |  29 ++
 kernel/locking/x_osq_lock.c  | 371 ++++++++++++++++++++++++
 kernel/locking/zx_numa.c     | 540 +++++++++++++++++++++++++++++++++++
 kernel/locking/zx_numa_osq.c | 497 ++++++++++++++++++++++++++++++++
 lib/Kconfig.debug            |   1 +
 9 files changed, 1580 insertions(+), 1 deletion(-)
 create mode 100644 kernel/Kconfig.numalocks
 create mode 100644 kernel/locking/numa.h
 create mode 100644 kernel/locking/numa_osq.h
 create mode 100644 kernel/locking/x_osq_lock.c
 create mode 100644 kernel/locking/zx_numa.c
 create mode 100644 kernel/locking/zx_numa_osq.c

-- 
2.34.1