This patch series changes the osq lock tail to 2 bytes when
CONFIG_LOCK_SPIN_ON_OWNER=y, fixes some coding problems, and adds more
comments.

Since the 2-byte and the 4-byte osq lock both access the same cacheline
with LOCK# asserted, the speed should be essentially the same. To compare
the performance of the two kinds of osq lock, I used locktorture and set
the CPU affinity of each mutex_lock writer kthread. Each result is the
average of 9 test runs.

locktorture, CPU affinity set
AMD EPYC 7551 32-core, 2 sockets

Writers       6.6.28          6.6.28-osq-dynamic disable   6.6.28-osq-dynamic enable
stress        4-byte          2-byte tail                  2-byte tail
            Average     CV     Average     CV  Improve      Average     CV  Improve
 1         21047265  3.48%    21331993 12.16%    1.35%     21359519  6.76%    1.48%
 2         39186677  5.66%    40348197  6.44%    2.96%     39387961  4.18%    0.51%
 4         43467264  3.63%    44133849  4.95%    1.53%     38961218  7.01%  -10.37%
 8         43780445  6.67%    48887433  3.31%   11.66%     41725007  5.29%   -4.69%
16         41407176  4.19%    51042178  3.45%   23.27%     71381112  2.75%   72.39%
32         46000746  6.63%    50060246 14.19%    8.82%     79361487  2.29%   72.52%
48         44235011  5.20%    44988160  7.22%    1.70%     79779501  4.88%   80.35%
64         59054128  4.02%    62233006  2.00%    5.38%    112695286  7.42%   90.83%

With 1, 2 or 4 threads, the 2-byte osq lock performs nearly the same as
the 4-byte lock. With 8, 16 or 32 threads it performs better than the
4-byte lock; with more threads, the performance tends to be the same
again.
If dynamic switching is turned on, the 2-byte lock shows a small
degradation with 4 and 8 threads. This may be related to the overhead of
accumulating lock contention while the switching conditions are never
satisfied. With 16 threads or more, the improvement ranges from about
72 to 91 percent.

v1:
The dynamic numa-aware osq_lock supports NUMA architectures on top of the
kernel in-box osq_lock.
After it is enabled by echo 1 > /proc/zx_numa_lock/dynamic_enable, the
patch keeps checking how many processes are waiting for an osq lock. If
the number exceeds a threshold, the patch stops these processes, switches
to the numa-aware osq lock, and then restarts them.
A cleanup work queue turns the numa-aware osq lock back into osq_lock
once all nodes are unlocked, and all the numa-aware lock memory returns
to the pre-allocated Linux kernel memory cache.

The struct optimistic_spin_queue of the dynamic numa-aware lock is also
4 bytes, the same as the in-box osq lock. If dynamic switching is enabled,
it is accessed as three members through a union. The tail is tail16,
2 bytes, which supports 65534 CPU cores. The other two members hold the
lock switch state and the numa memory index, 1 byte each. A serial is
added to struct optimistic_spin_node to track how many processes are
waiting for an osq lock. Each time a process applies for an osq lock, the
serial is incremented by 1.

We have done some performance evaluation of the dynamic numa-aware osq
lock with perf, locktorture, unixbench, fxmark, etc.
fxmark: Filesystem Multicore Scalability Benchmark
https://github.com/sslab-gatech/fxmark

The following results were measured on a Zhaoxin KH40000 32-core
processor, a Zhaoxin KH40000 32+32-core two-socket system, and an
AMD EPYC 7551 32-core two-socket system. Since I do not know AMD CPUs
well, the code to support AMD CPUs is a sample only.

The number under Average is the average of five test runs. CV is the
Coefficient of Variation.
The kernel source is 6.6.28 stable, compiled with the default
configuration. 6.6.28-osq-dynamic is kernel 6.6.28 with the patch applied
and dynamic switching enabled.
The OS is Ubuntu 22.04.02 LTS, gcc version 9.5.0.
perf bench
Zhaoxin KH40000 32 cores
kernel        6.6.28           6.6.28-osq-dynamic
epoll      Average    CV      Average    CV    Improve
ADD          25620  0.78%       64609  2.55%   152.18%
WAIT          7134  1.77%       11098  0.52%    55.56%

locktorture
Zhaoxin KH40000 32 cores
kernel                  6.6.28           6.6.28-osq-dynamic
lock torture         Average    CV      Average    CV    Improve
mutex_lock Writes    7433503  1.59%    17979058  1.90%   141.87%

unixbench
Zhaoxin KH40000 32+32 cores, run on ssd
64 copies                       6.6.28           6.6.28-osq-dynamic
System Benchmarks Partial     Average    CV      Average    CV    Improve
Execl Throughput              1460.18  1.18%     1865.22  0.25%    27.74%
File Copy 1024 bufsize 200     549.94  0.62%     1221.32  6.71%   122.08%
File Copy 256 bufsize 500      339.62  2.20%      896.58  6.57%   164.00%
File Copy 4096 bufsize 800    1173.68  1.88%      2089.7  5.20%    78.05%
Pipe Throughput              52122.26  0.18%    53842.72  0.15%     3.30%
Pipe-based Context Switchi   18340.38  0.92%    19874.66  0.80%     8.37%
Process Creation              2325.12  0.18%     2178.16  0.21%    -6.32%
Shell Scripts (1 concurren    7414.32  0.29%      8458.5  0.10%    14.08%
Shell Scripts (16 concurre
Shell Scripts (8 concurren    7156.48  0.10%     8132.42  0.14%    13.64%
System Call Overhead           1476.9  0.14%     1574.32  0.09%     6.60%
System Benchmarks Index Sc    2982.64  0.33%     4008.66  0.94%    34.40%

fxmark
Zhaoxin KH40000 32 cores, run on ssd (ssd, ext4)
parallel cores           32                                24
          6.6.28 vs 6.6.28-osq-dynamic            6.6.28 vs 6.6.28-osq-dynamic
item  Improve  ( Average,   CV: Average,   CV)    Improve  (   CV:   CV)
DWAL   -0.17%  (  455895, 0.14%:  455115, 0.37%)   -0.07%  ( 0.42%: 0.44%)
DWOL    1.10%  (32166648, 2.64%:32521877, 2.06%)   -0.68%  ( 2.54%: 3.04%)
DWOM   51.63%  (  496955, 4.34%:  753509, 8.32%)   45.93%  ( 3.14%: 2.57%)
DWSL    1.67%  (   20229, 2.34%:   20566, 3.18%)   -1.74%  ( 1.96%: 2.66%)
MWRL   71.00%  (  348097, 0.92%:  595241, 1.26%)   65.95%  ( 0.65%: 2.27%)
MWRM   63.06%  (    6750, 3.33%:   11007, 4.31%)   60.18%  ( 5.67%: 4.81%)
MWCL   16.99%  (  149628, 1.66%:  175054, 0.82%)   16.96%  ( 2.57%: 0.51%)
MWCM   80.97%  (    9448, 4.66%:   17098, 0.96%)   73.79%  ( 5.37%: 1.79%)
MWUM   37.73%  (   16858, 3.13%:   23220, 3.42%)   31.16%  ( 3.59%: 1.62%)
MWUL   12.83%  (   45275, 3.90%:   51083, 3.25%)   19.94%  ( 4.19%: 1.98%)
DWTL   41.44%  (   85255, 5.01%:  120583, 9.83%)   45.07%  ( 6.42%: 6.11%)
MRPL   -2.63%  (11448731, 1.91%:11147179, 4.18%)   -0.56%  ( 1.65%: 3.33%)
MRPM    0.29%  ( 5423233, 1.77%: 5438929, 2.59%)  -10.54%  (15.85%:16.42%)
MRPH   -0.49%  (  688629, 2.84%:  685266, 2.88%)  -18.99%  (15.00%:24.98%)
MRDM    8.42%  ( 3662627, 0.76%: 3971133, 0.45%)    4.53%  ( 1.77%: 1.72%)
MRDL    6.25%  (  530518, 2.75%:  563671, 5.33%)   12.43%  (25.88%:26.91%)
DRBH    7.16%  (  388144, 7.88%:  415933,17.87%)  -20.61%  (29.12%:21.91%)
DRBM   -4.34%  (  381710, 5.51%:  365159, 3.15%)  -16.93%  (27.15%:29.85%)
DRBL   -0.17%  (46227341, 2.50%:46147935, 2.89%)   -4.03%  ( 4.01%: 5.30%)

fxmark
Zhaoxin KH40000 32 cores, run on ssd (ssd, ext4)
parallel cores        2                          1
      6.6.28 vs 6.6.28-osq-dynamic   6.6.28 vs 6.6.28-osq-dynamic
item  Improve  (  CV :  CV )         Improve  (  CV :   CV )
DWAL    1.78%  (0.31%:0.20%)           6.36%  (2.52%: 0.67%)
DWOL    2.46%  (2.26%:2.53%)           1.83%  (2.69%: 3.07%)
DWOM    2.70%  (2.58%:3.12%)           2.22%  (2.67%: 3.79%)
DWSL    3.28%  (2.90%:3.38%)           4.41%  (1.32%: 1.36%)
MWRL   -0.76%  (1.46%:1.94%)          -0.82%  (2.04%: 2.32%)
MWRM    1.94%  (4.38%:0.89%)          -2.05%  (4.07%: 5.16%)
MWCL   -0.07%  (1.36%:3.84%)          -2.17%  (1.58%: 3.04%)
MWCM    1.85%  (2.95%:4.68%)           0.28%  (0.45%: 2.48%)
MWUM   -2.85%  (1.48%:2.01%)          -3.06%  (1.47%: 1.97%)
MWUL   -1.46%  (0.58%:2.27%)          -2.98%  (0.71%: 2.11%)
DWTL    0.40%  (3.89%:4.35%)          -2.68%  (4.04%: 3.15%)
MRPL    3.11%  (1.38%:0.35%)          -4.81%  (0.32%:16.52%)
MRPM    2.99%  (0.29%:1.19%)           3.50%  (0.56%: 0.78%)
MRPH    3.01%  (1.10%:1.42%)           5.06%  (1.18%: 1.73%)
MRDM   -1.67%  (4.59%:5.58%)          -3.30%  (0.23%: 8.01%)
MRDL    1.94%  (1.56%:4.39%)          -0.55%  (0.88%: 9.57%)
DRBH    7.24%  (7.07%:7.10%)           3.36%  (3.30%: 2.95%)
DRBM    4.40%  (5.11%:0.74%)          -2.55%  (0.46%: 3.28%)
DRBL    5.50%  (5.58%:0.30%)          -1.00%  (0.71%: 5.21%)

(Some tests show more than 10% loss, but their CV is also more than 10%,
so those results are not stable.)

perf bench
AMD EPYC 7551 32-core, 2 sockets
kernel        6.6.28           6.6.28-osq-dynamic
epoll      Average    CV      Average    CV    Improve
ADD          15258  2.30%       62160  2.40%   307.38%
WAIT          3861  4.20%        6990 16.77%    81.03%

locktorture
AMD EPYC 7551 32-core, 2 sockets
kernel                  6.6.28           6.6.28-osq-dynamic
lock torture         Average    CV      Average    CV    Improve
mutex_lock Writes   10435064  3.14%    22627890  4.92%   116.84%

unixbench
AMD EPYC 7551 32-core, 2 sockets, run on ramdisk
64 copies                       6.6.28           6.6.28-osq-dynamic
System Benchmarks Partial     Average    CV      Average    CV    Improve
Execl Throughput              2677.18  0.90%     3451.76  0.22%    28.93%
File Copy 1024 bufsize 200      815.2  0.59%     1999.54  0.36%   145.28%
File Copy 256 bufsize 500       504.6  0.69%      1359.6  0.49%   169.44%
File Copy 4096 bufsize 800    1842.76  1.24%     3236.48  1.40%    75.63%
Pipe Throughput              57748.74  0.01%     57539.6  0.03%    -0.36%
Pipe-based Context Switchi   20882.18  0.57%    20525.38  0.57%    -1.71%
Process Creation              4523.98  0.20%     4784.98  0.10%     5.77%
Shell Scripts (1 concurren   13136.54  0.06%     15883.6  0.35%    20.91%
Shell Scripts (16 concurre
Shell Scripts (8 concurren   12883.82  0.14%    15640.32  0.20%    21.40%
System Call Overhead          3533.74  0.04%     3544.16  0.02%     0.29%
System Benchmarks Index Sc    4809.38  0.23%     6575.44  0.14%    36.72%

fxmark
AMD EPYC 7551 32-core, 2 sockets, run on ramdisk (mem, tmpfs)
parallel cores            64                                 32
           6.6.28 vs 6.6.28-osq-dynamic             6.6.28 vs 6.6.28-osq-dynamic
item  Improve  (  Average,   CV:  Average,   CV)    Improve  (  CV :   CV )
DWAL   -0.22%  ( 24091112, 0.31%: 24038426, 0.52%)   -0.26%  (0.10%: 0.12%)
DWOL    2.21%  ( 86569869, 0.36%: 88479947, 0.27%)    1.99%  (0.41%: 0.29%)
DWOM  210.41%  (   425986, 0.77%:  1322320, 0.28%)  128.86%  (0.59%: 0.46%)
DWSL    1.27%  ( 70260252, 0.39%: 71149334, 0.37%)    1.19%  (0.31%: 0.22%)
MWRL    0.85%  (   489865, 0.22%:   494045, 0.25%)    2.29%  (0.12%: 0.33%)
MWRM   96.28%  (   149042, 0.45%:   292540, 3.55%)   60.10%  (2.49%: 0.38%)
MWCL   -5.44%  (   772582, 2.92%:   730585, 0.80%)    0.32%  (2.41%: 2.56%)
MWCM   53.89%  (   153857, 1.92%:   236774, 0.46%)   23.84%  (0.72%: 0.50%)
MWUM   88.20%  (   214551, 3.90%:   403790, 0.41%)   62.81%  (0.80%: 1.12%)
MWUL   -8.26%  (   970810, 1.63%:   890615, 1.63%)   -6.73%  (3.01%: 1.61%)
DWTL    5.90%  (  5522297, 0.49%:  5847951, 0.18%)    5.03%  (0.44%: 0.08%)
MRPL   -1.10%  ( 39707577, 0.07%: 39268812, 0.03%)   -1.30%  (0.18%: 0.07%)
MRPM   -0.63%  ( 16446350, 0.47%: 16341936, 0.40%)    0.45%  (0.15%: 0.45%)
MRPH   -0.03%  (  3805484, 0.50%:  3804248, 0.12%)    3.02%  (1.54%: 0.36%)
MRDM   49.41%  ( 20178742, 1.89%: 30148449, 1.01%)   17.58%  (1.19%: 0.85%)
MRDL   -1.95%  (227253170, 0.48%:222825409, 1.34%)   -1.80%  (0.32%: 0.54%)
DRBH    6.01%  (  1045587, 1.91%:  1108467, 0.64%)    0.12%  (0.13%: 0.30%)
DRBM    0.65%  (117702744, 0.31%:118473408, 0.87%)    1.12%  (0.25%: 1.18%)
DRBL    0.93%  (121770444, 0.42%:122905957, 0.25%)    1.59%  (0.31%: 0.40%)

fxmark
AMD EPYC 7551 32-core, 2 sockets, run on ramdisk (mem, tmpfs)
parallel cores        2                           1
      6.6.28 vs 6.6.28-osq-dynamic    6.6.28 vs 6.6.28-osq-dynamic
item  Improve  (  CV :   CV )         Improve  (  CV :   CV )
DWAL   -0.74%  (0.33%: 0.19%)          -1.02%  (0.19%: 0.34%)
DWOL    1.50%  (0.36%: 0.44%)           1.89%  (0.30%: 0.36%)
DWOM   -2.00%  (0.73%: 0.38%)           2.43%  (0.35%: 0.29%)
DWSL    1.03%  (0.34%: 0.54%)           1.18%  (0.46%: 0.61%)
MWRL    0.93%  (0.39%: 0.18%)           2.25%  (1.28%: 1.78%)
MWRM   -0.30%  (0.60%: 0.47%)           0.17%  (0.58%: 0.47%)
MWCL   -1.28%  (0.41%: 0.66%)          -0.38%  (0.19%: 0.44%)
MWCM   -1.23%  (0.36%: 0.23%)          -1.42%  (0.41%: 0.54%)
MWUM   -2.28%  (0.57%: 0.75%)          -1.11%  (0.82%: 0.21%)
MWUL   -1.87%  (0.64%: 0.50%)          -1.75%  (0.58%: 0.65%)
DWTL    0.36%  (0.09%: 0.12%)           0.19%  (0.09%: 0.09%)
MRPL   -1.45%  (0.37%: 0.31%)          -1.35%  (0.12%: 0.54%)
MRPM   -0.58%  (0.30%: 0.11%)          -1.04%  (0.18%: 0.31%)
MRPH    0.79%  (3.92%: 0.48%)          -0.53%  (0.68%: 0.33%)
MRDM   -0.55%  (0.93%: 0.44%)          -0.13%  (0.43%: 0.67%)
MRDL   -0.11%  (0.56%: 0.19%)           0.68%  (0.71%: 0.49%)
DRBH    0.09%  (1.31%: 0.87%)           2.75%  (0.68%: 0.45%)
DRBM    1.09%  (0.19%: 1.05%)           1.60%  (0.15%: 0.72%)
DRBL    3.26%  (1.00%: 0.56%)           2.34%  (0.36%: 0.23%)

From the test results, under heavy contention the dynamic numa-aware lock
performs better than the in-box osq_lock. If not too many processes apply
for a lock, the performance is nearly the same as the in-box osq_lock.

---
Changes since v1 (based on Longman's reviews)
#1 Change the bisection of the v1 patches.
#2 Modify some code, such as the definitions, special values in macros,
   and cpu_relax().
#3 Add some comments.

yongli-oc (3):
  locking/osq_lock: The Kconfig for dynamic numa-aware osq lock.
  locking/osq_lock: Define osq by union to support dynamic numa-aware
    lock.
  locking/osq_lock: Turn from 2-byte osq_lock/unlock to numa
    lock/unlock.

 include/linux/osq_lock.h     |  33 ++-
 kernel/Kconfig.numalocks     |  17 ++
 kernel/locking/Makefile      |   3 +
 kernel/locking/numa.h        |  90 ++++++
 kernel/locking/numa_osq.h    |  29 ++
 kernel/locking/x_osq_lock.c  | 371 ++++++++++++++++++++++++
 kernel/locking/zx_numa.c     | 540 +++++++++++++++++++++++++++++++++++
 kernel/locking/zx_numa_osq.c | 497 ++++++++++++++++++++++++++++++++
 lib/Kconfig.debug            |   1 +
 9 files changed, 1580 insertions(+), 1 deletion(-)
 create mode 100644 kernel/Kconfig.numalocks
 create mode 100644 kernel/locking/numa.h
 create mode 100644 kernel/locking/numa_osq.h
 create mode 100644 kernel/locking/x_osq_lock.c
 create mode 100644 kernel/locking/zx_numa.c
 create mode 100644 kernel/locking/zx_numa_osq.c

-- 
2.34.1