Hi,

this is a follow-up on
https://lore.kernel.org/ZwVOMgBMxrw7BU9A@jlelli-thinkpadt14gen4.remote.csb
and adds support for a task-local futex_hash_bucket. It can be created via
prctl(). The last patch in the series enables it once the first thread is
created.

I've been watching how this auto-create behaves: so far dpkg creates threads
and uses the local hashmap. systemd-journal on the other hand forks a thread
from time to time and I haven't seen it using the hashmap. Need to do
more testing.

v1…v2 https://lore.kernel.org/all/20241026224306.982896-1-bigeasy@linutronix.de/:
- Moved to struct signal_struct and is used process wide.
- Automatically allocated once the first thread is created.

Sebastian
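A minimal userspace sketch of the intended opt-in usage. The command and
argument names (PR_FUTEX_HASH, FUTEX_HASH_SET_SLOTS, FUTEX_HASH_GET_SLOTS)
and the fallback values below are placeholders and may not match what the
series actually defines:

#include <stdio.h>
#include <sys/prctl.h>

/* Placeholder values; the real definitions come from the series' uapi header. */
#ifndef PR_FUTEX_HASH
#define PR_FUTEX_HASH		78
#define FUTEX_HASH_SET_SLOTS	1
#define FUTEX_HASH_GET_SLOTS	2
#endif

int main(void)
{
	/*
	 * Request a process-local futex hash with 16 slots (the hash index is
	 * masked, so a power of two is assumed here).
	 */
	if (prctl(PR_FUTEX_HASH, FUTEX_HASH_SET_SLOTS, 16, 0, 0))
		perror("PR_FUTEX_HASH");

	/* Read back how many slots are in use; 0 would mean the global hash. */
	printf("slots: %d\n", (int)prctl(PR_FUTEX_HASH, FUTEX_HASH_GET_SLOTS, 0, 0, 0));
	return 0;
}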
On 2024-10-28 13:13:54 [+0100], To linux-kernel@vger.kernel.org wrote:
> Need to do
> more testing.
So there is "perf bench futex hash". On a 256 CPU NUMA box:
perf bench futex hash -t 240 -m -s -b $hb
and hb 2 … 131072 (moved the allocation to kvmalloc) I get the following
(averaged over three runs)
buckets op/sec
2 9158.33
4 21665.66 + ~136%
8 44686.66 + ~106
16 84144.33 + ~ 88
32 139998.33 + ~ 66
64 279957.0 + ~ 99
128 509533.0 + ~100
256 1019846.0 + ~100
512 1634940.0 + ~ 60
1024 1834859.33 + ~ 12
1868129.33 (global hash, 65536 slots)
2048 1912071.33 + ~ 4
4096 1918686.66 + ~ 0
8192 1922285.66 + ~ 0
16384 1923017.0 + ~ 0
32768 1923319.0 + ~ 0
65536 1932906.0 + ~ 0
131072 2042571.33 + ~ 5
Doubling the hash size almost doubles the ops/sec up to 256 slots.
Beyond 2048 slots the increase is almost noise (except for the last
entry).
Pinning the bench to individual CPUs belonging to a NUMA node and
running the same test with 110 threads only (avg over 5 runs):
ops/sec global ops/sec local
node 0 2278572.2 2534827.4
node 1 2229838.6 2437498.8
node 0+1 2542602.4 2535749.8
<--->
RAW numbers:
futex hash table entries: 65536 (order: 10, 4194304 bytes, vmalloc hugepage)
Run summary [PID 4541]: 240 threads, each operating on 1024 [private] futexes for 10 secs.
Averaged 1883542 operations/sec (+- 0,28%), total secs = 10
Averaged 1864680 operations/sec (+- 0,31%), total secs = 10
Averaged 1856166 operations/sec (+- 0,32%), total secs = 10
1868129.3333333333
====
Run summary [PID 6247]: 240 threads, hash slots: 2 each operating on 1024 [private] futexes for 10 secs.
Averaged 9219 operations/sec (+- 0,19%), total secs = 10
Averaged 9185 operations/sec (+- 0,18%), total secs = 10
Averaged 9071 operations/sec (+- 0,20%), total secs = 10
9158.333333333334
Run summary [PID 6970]: 240 threads, hash slots: 4 each operating on 1024 [private] futexes for 10 secs.
Averaged 16911 operations/sec (+- 0,29%), total secs = 10
Averaged 24145 operations/sec (+- 0,17%), total secs = 10
Averaged 23941 operations/sec (+- 0,17%), total secs = 10
21665.666666666668
Run summary [PID 7693]: 240 threads, hash slots: 8 each operating on 1024 [private] futexes for 10 secs.
Averaged 45376 operations/sec (+- 0,25%), total secs = 10
Averaged 44587 operations/sec (+- 0,17%), total secs = 10
Averaged 44097 operations/sec (+- 0,26%), total secs = 10
44686.666666666664
Run summary [PID 8416]: 240 threads, hash slots: 16 each operating on 1024 [private] futexes for 10 secs.
Averaged 84547 operations/sec (+- 0,25%), total secs = 10
Averaged 84672 operations/sec (+- 0,18%), total secs = 10
Averaged 83214 operations/sec (+- 0,26%), total secs = 10
84144.33333333333
Run summary [PID 9139]: 240 threads, hash slots: 32 each operating on 1024 [private] futexes for 10 secs.
Averaged 163342 operations/sec (+- 0,55%), total secs = 10
Averaged 127630 operations/sec (+- 0,28%), total secs = 10
Averaged 129023 operations/sec (+- 0,27%), total secs = 10
139998.33333333334
Run summary [PID 9862]: 240 threads, hash slots: 64 each operating on 1024 [private] futexes for 10 secs.
Averaged 279627 operations/sec (+- 0,29%), total secs = 10
Averaged 279572 operations/sec (+- 0,21%), total secs = 10
Averaged 280672 operations/sec (+- 0,26%), total secs = 10
279957.0
Run summary [PID 10585]: 240 threads, hash slots: 128 each operating on 1024 [private] futexes for 10 secs.
Averaged 508759 operations/sec (+- 0,21%), total secs = 10
Averaged 511253 operations/sec (+- 0,22%), total secs = 10
Averaged 508587 operations/sec (+- 0,26%), total secs = 10
509533.0
Run summary [PID 11308]: 240 threads, hash slots: 256 each operating on 1024 [private] futexes for 10 secs.
Averaged 1023552 operations/sec (+- 0,10%), total secs = 10
Averaged 1034426 operations/sec (+- 0,11%), total secs = 10
Averaged 1001560 operations/sec (+- 0,10%), total secs = 10
1019846.0
Run summary [PID 12031]: 240 threads, hash slots: 512 each operating on 1024 [private] futexes for 10 secs.
Averaged 1636187 operations/sec (+- 0,22%), total secs = 10
Averaged 1607427 operations/sec (+- 0,23%), total secs = 10
Averaged 1661206 operations/sec (+- 0,24%), total secs = 10
1634940.0
Run summary [PID 12756]: 240 threads, hash slots: 1024 each operating on 1024 [private] futexes for 10 secs.
Averaged 1833474 operations/sec (+- 0,24%), total secs = 10
Averaged 1835817 operations/sec (+- 0,24%), total secs = 10
Averaged 1835287 operations/sec (+- 0,25%), total secs = 10
1834859.3333333333
Run summary [PID 13479]: 240 threads, hash slots: 2048 each operating on 1024 [private] futexes for 10 secs.
Averaged 1915836 operations/sec (+- 0,29%), total secs = 10
Averaged 1907866 operations/sec (+- 0,28%), total secs = 10
Averaged 1912512 operations/sec (+- 0,29%), total secs = 10
1912071.3333333333
Run summary [PID 14202]: 240 threads, hash slots: 4096 each operating on 1024 [private] futexes for 10 secs.
Averaged 1916947 operations/sec (+- 0,27%), total secs = 10
Averaged 1918102 operations/sec (+- 0,28%), total secs = 10
Averaged 1921011 operations/sec (+- 0,29%), total secs = 10
1918686.6666666667
Run summary [PID 14925]: 240 threads, hash slots: 8192 each operating on 1024 [private] futexes for 10 secs.
Averaged 1916001 operations/sec (+- 0,27%), total secs = 10
Averaged 1923156 operations/sec (+- 0,27%), total secs = 10
Averaged 1927700 operations/sec (+- 0,27%), total secs = 10
1922285.6666666667
Run summary [PID 15648]: 240 threads, hash slots: 16384 each operating on 1024 [private] futexes for 10 secs.
Averaged 1928497 operations/sec (+- 0,28%), total secs = 10
Averaged 1916906 operations/sec (+- 0,27%), total secs = 10
Averaged 1923648 operations/sec (+- 0,26%), total secs = 10
1923017.0
Run summary [PID 16371]: 240 threads, hash slots: 32768 each operating on 1024 [private] futexes for 10 secs.
Averaged 1920425 operations/sec (+- 0,27%), total secs = 10
Averaged 1923449 operations/sec (+- 0,27%), total secs = 10
Averaged 1926083 operations/sec (+- 0,29%), total secs = 10
1923319.0
Run summary [PID 17094]: 240 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 1927007 operations/sec (+- 0,28%), total secs = 10
Averaged 1935182 operations/sec (+- 0,28%), total secs = 10
Averaged 1936529 operations/sec (+- 0,28%), total secs = 10
1932906.0
Run summary [PID 17817]: 240 threads, hash slots: 131072 each operating on 1024 [private] futexes for 10 secs.
Averaged 2033664 operations/sec (+- 0,32%), total secs = 10
Averaged 2060081 operations/sec (+- 0,33%), total secs = 10
Averaged 2033969 operations/sec (+- 0,32%), total secs = 10
2042571.3333333333
----
bigeasy@z3:~$ taskset -pc $$; ./run-numa.sh
pid 7679's current affinity list: 64-127,192-255
====
# Running 'futex/hash' benchmark:
Run summary [PID 23094]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2180419 operations/sec (+- 0,77%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 23205]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2258612 operations/sec (+- 0,87%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 23317]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2245819 operations/sec (+- 0,80%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 23428]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2231469 operations/sec (+- 0,81%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 23539]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2232874 operations/sec (+- 0,78%), total secs = 10
====
# Running 'futex/hash' benchmark:
Run summary [PID 23650]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2469636 operations/sec (+- 0,92%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 23761]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2432942 operations/sec (+- 0,91%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 23872]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2411433 operations/sec (+- 0,90%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 23983]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2438380 operations/sec (+- 0,94%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 24094]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2435103 operations/sec (+- 0,94%), total secs = 10
====
bigeasy@z3:~$ taskset -pc $$; ./run-numa.sh
pid 9731's current affinity list: 0-63,128-191
====
# Running 'futex/hash' benchmark:
Run summary [PID 24207]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2206612 operations/sec (+- 0,75%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 24318]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2321819 operations/sec (+- 0,85%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 24429]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2238386 operations/sec (+- 0,77%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 24541]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2325869 operations/sec (+- 0,85%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 24652]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2300175 operations/sec (+- 0,82%), total secs = 10
====
# Running 'futex/hash' benchmark:
Run summary [PID 24763]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2530561 operations/sec (+- 0,96%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 24874]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2573315 operations/sec (+- 1,03%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 24985]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2517479 operations/sec (+- 0,99%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 25096]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2554631 operations/sec (+- 1,01%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 25207]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2498151 operations/sec (+- 0,94%), total secs = 10
====
bigeasy@z3:~$ taskset -pc $$; ./run-numa.sh
pid 10975's current affinity list: 0-255
====
# Running 'futex/hash' benchmark:
Run summary [PID 25324]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2561817 operations/sec (+- 0,14%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 25435]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2539522 operations/sec (+- 0,11%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 25546]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2532349 operations/sec (+- 0,11%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 25657]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2539481 operations/sec (+- 0,11%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 25768]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2539843 operations/sec (+- 0,13%), total secs = 10
====
# Running 'futex/hash' benchmark:
Run summary [PID 25879]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2540858 operations/sec (+- 0,50%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 25990]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2550342 operations/sec (+- 0,48%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 26101]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2522785 operations/sec (+- 0,48%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 26212]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2528686 operations/sec (+- 0,49%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 26323]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2536078 operations/sec (+- 0,48%), total secs = 10
====
Sebastian
On 10/31/24 11:56 AM, Sebastian Andrzej Siewior wrote:
> On 2024-10-28 13:13:54 [+0100], To linux-kernel@vger.kernel.org wrote:
>> Need to do
>> more testing.
> So there is "perf bench futex hash". On a 256 CPU NUMA box:
>  perf bench futex hash -t 240 -m -s -b $hb
> and hb 2 … 131072 (moved the allocation to kvmalloc) I get the following
> (averaged over three runs)
>
> buckets op/sec
> 2 9158.33
> 4 21665.66 + ~136%
> 8 44686.66 + ~106
> 16 84144.33 + ~ 88
> 32 139998.33 + ~ 66
> 64 279957.0 + ~ 99
> 128 509533.0 + ~100
> 256 1019846.0 + ~100
> 512 1634940.0 + ~ 60
> 1024 1834859.33 + ~ 12
> 1868129.33 (global hash, 65536 slots)
> 2048 1912071.33 + ~ 4
> 4096 1918686.66 + ~ 0
> 8192 1922285.66 + ~ 0
> 16384 1923017.0 + ~ 0
> 32768 1923319.0 + ~ 0
> 65536 1932906.0 + ~ 0
> 131072 2042571.33 + ~ 5
>
> Doubling the hash size almost doubles the ops/sec up to 256 slots.
> Beyond 2048 slots the increase is almost noise (except for the last
> entry).

Looking at the performance data, we should probably use the global hash map
to maximize throughput if latency isn't important.

AFAICT, the reason why patch 4 creates a local hash map when the first thread
is created is to avoid a race of the same futex being hashed on both the
local and the global hash maps. Correct me if my understanding is incorrect.
So all the multithreaded processes will have to use local hash maps for their
private futexes even if they don't care about latency. Maybe we should limit
the auto local hash map creation to only RT processes where latency is
important.

To avoid the race, we could add a flag to indicate if a private futex hashing
operation had ever been done before and prevent the creation of a local hash
map after that.

My 2 cents.

Cheers,
Longman
On 2024-10-31 16:56:43 [+0100], To linux-kernel@vger.kernel.org wrote:
> Pinning the bench to individual CPUs belonging to a NUMA node and
> running the same test with 110 threads only (avg over 5 runs):
> ops/sec global ops/sec local
> node 0 2278572.2 2534827.4
> node 1 2229838.6 2437498.8
> node 0+1 2542602.4 2535749.8
Running on node 1, with variable slot size:
hash slots ops/sec
2 43292.2
4 81829.2
8 156903.4
16 297063.6
32 554229.4
64 962158.4
128 1615859.6
256 2106941.4
512 2269494.8
1024 2328782.6
2048 2342981.6
4096 2337705.2
8192 2334141.4
16384 2334237.6
32768 2339262.2
65536 2438800.4
Sebastian
On 2024-10-31 18:47:40 [+0100], To linux-kernel@vger.kernel.org wrote:
> On 2024-10-31 16:56:43 [+0100], To linux-kernel@vger.kernel.org wrote:

Since all of this can be scripted and I can have one kernel with …, I hooked
in various hash algorithms to see where we get to. 240 threads, same box.

| buckets | jhash2 (regular) | jhash2 (addr+offs) | xxhash | hash_long | crc32c | crc32 | siphash | hsiphash |
| 2 | 9,172.4 | 9,175.8 | 9,116.4 | 9,497.2 | 9,317.6 | 9,564.0 | 9,091.8 | 9,217.8 |
| 4 | 23,370.8 | 22,611.0 | 20,917.2 | 17,780.6 | 18,185.6 | 17,305.4 | 20,415.0 | 20,885.4 |
| 8 | 44,378.2 | 44,898.4 | 44,713.8 | 42,943.8 | 45,151.8 | 45,149.6 | 44,601.4 | 44,739.4 |
| 16 | 84,567.2 | 84,190.0 | 84,645.2 | 84,737.4 | 86,970.2 | 85,036.8 | 83,142.0 | 85,485.0 |
| 32 | 131,059.2 | 127,895.4 | 127,953.8 | 126,631.2 | 132,293.0 | 125,622.2 | 127,038.4 | 126,322.8 |
| 64 | 285,339.0 | 284,488.8 | 288,109.2 | 268,630.4 | 289,783.8 | 285,281.0 | 285,111.2 | 288,104.4 |
| 128 | 510,550.0 | 515,596.6 | 526,738.0 | 557,349.6 | 508,871.6 | 524,447.0 | 512,482.8 | 513,963.0 |
| 256 | 1,038,348.6 | 1,034,837.4 | 1,042,341.4 | 1,060,650.4 | 1,039,328.6 | 1,098,865.8 | 1,042,759.4 | 1,026,998.6 |
| 512 | 1,626,287.8 | 1,640,112.0 | 1,622,828.8 | 1,637,973.4 | 1,677,108.6 | 1,707,027.2 | 1,588,240.6 | 1,628,800.8 |
| 1024 | 1,827,878.6 | 1,849,074.4 | 1,836,483.8 | 1,776,366.4 | 1,884,670.8 | 1,842,734.2 | 1,765,815.0 | 1,822,137.8 |
| 2048 | 1,905,406.4 | 1,928,399.2 | 1,903,506.0 | 1,822,750.8 | 1,946,141.6 | 1,907,584.6 | 1,830,906.8 | 1,887,678.2 |
| 4096 | 1,912,522.6 | 1,929,667.4 | 1,907,121.6 | 1,847,231.6 | 1,949,908.0 | 1,927,728.6 | 1,834,648.0 | 1,893,792.2 |
| 8192 | 1,912,352.6 | 1,935,078.4 | 1,915,500.4 | 1,853,232.2 | 1,973,339.2 | 1,958,150.4 | 1,840,190.8 | 1,896,981.6 |
| 16384 | 1,917,836.8 | 1,941,917.0 | 1,910,106.0 | 1,863,751.4 | 1,955,101.4 | 1,947,673.2 | 1,836,488.2 | 1,898,002.0 |
| 32768 | 1,919,074.6 | 1,937,200.2 | 1,914,704.8 | 1,872,348.0 | 1,974,182.2 | 1,959,147.2 | 1,837,694.6 | 1,896,566.6 |
| 65536 | 1,930,988.0 | 1,959,076.0 | 1,926,927.6 | 1,873,267.6 | 1,914,420.8 | 1,951,292.4 | 1,849,658.6 | 1,910,334.6 |
| 131072 | 2,023,509.4 | 2,050,380.4 | 2,037,104.6 | 1,990,559.6 | 2,003,758.4 | 1,978,931.2 | 1,946,145.2 | 2,007,205.6 |

Intel(R) Xeon(R) CPU E7-8890 v3, 144 CPUs, 4 nodes.

Test using 140 threads, 0 buckets means global hash:

| buckets | ops/sec |
| 0 | 2,644,742.8 |
| 2 | 21,750.2 |
| 4 | 37,537.2 |
| 8 | 69,950.6 |
| 16 | 127,722.0 |
| 32 | 225,479.2 |
| 64 | 401,335.6 |
| 128 | 753,714.8 |
| 256 | 1,376,116.0 |
| 512 | 2,008,764.2 |
| 1024 | 2,386,441.2 |
| 2048 | 2,564,764.0 |
| 4096 | 2,851,801.2 |
| 8192 | 2,862,999.6 |
| 16384 | 2,521,325.0 |
| 32768 | 2,421,839.2 |
| 65536 | 2,483,676.0 |
| 131072 | 2,733,504.2 |

Binding the test to an individual NUMA node, 34 threads:

| buckets | node 0 | node 1 | node 2 | node 3 |
| 0 | 4,149,878.4 | 4,149,079.8 | 4,148,085.2 | 4,149,420.6 |
| 2 | 194,714.4 | 197,382.8 | 191,967.0 | 193,510.6 |
| 4 | 363,778.6 | 360,700.2 | 364,293.6 | 361,830.2 |
| 8 | 681,770.4 | 673,973.0 | 658,601.6 | 662,212.0 |
| 16 | 1,201,256.4 | 1,177,681.0 | 1,195,749.4 | 1,181,200.2 |
| 32 | 2,002,673.2 | 1,989,139.0 | 1,988,264.4 | 1,981,004.8 |
| 64 | 2,963,416.0 | 2,962,292.0 | 2,957,491.6 | 2,964,479.6 |
| 128 | 3,499,580.0 | 3,495,971.2 | 3,495,537.6 | 3,499,902.8 |
| 256 | 3,713,251.2 | 3,711,806.4 | 3,716,935.4 | 3,715,458.2 |
| 512 | 3,800,606.4 | 3,801,960.4 | 3,813,903.4 | 3,809,076.6 |
| 1024 | 3,840,679.0 | 3,839,486.4 | 3,841,558.6 | 3,838,641.4 |
| 2048 | 3,867,732.8 | 3,866,216.2 | 3,858,603.4 | 3,848,031.6 |
| 4096 | 3,806,776.8 | 3,819,237.8 | 3,813,381.4 | 3,800,440.2 |
| 8192 | 3,815,358.4 | 3,806,204.2 | 3,804,171.2 | 3,795,476.2 |
| 16384 | 3,865,728.6 | 3,883,038.4 | 3,871,992.0 | 3,857,763.4 |
| 32768 | 4,017,227.0 | 4,025,249.8 | 4,022,779.4 | 4,009,740.8 |
| 65536 | 4,188,410.0 | 4,186,900.8 | 4,195,128.4 | 4,190,580.8 |
| 131072 | 4,334,937.0 | 4,335,978.8 | 4,327,250.2 | 4,332,567.8 |

140 threads, all nodes, for the algorithms test:

| buckets | jhash2 (regular) | jhash2 (addr+offs) | xxhash | hash_long | crc32c | crc32 | siphash | hsiphash |
| 2 | 21,346.0 | 21,321.8 | 20,598.4 | 23,403.0 | 23,336.6 | 21,232.8 | 21,011.4 | 20,661.0 |
| 4 | 38,220.0 | 37,712.0 | 37,421.6 | 39,206.4 | 39,086.2 | 40,098.2 | 37,144.2 | 37,209.8 |
| 8 | 68,470.8 | 68,994.4 | 69,373.6 | 73,973.0 | 70,306.8 | 70,396.0 | 68,950.8 | 69,366.6 |
| 16 | 126,612.2 | 127,433.2 | 128,121.2 | 133,981.8 | 127,268.0 | 130,204.4 | 126,594.4 | 127,812.8 |
| 32 | 224,943.0 | 224,695.2 | 222,879.6 | 227,023.8 | 220,036.4 | 217,311.2 | 224,100.0 | 223,442.8 |
| 64 | 406,235.6 | 399,020.2 | 407,580.6 | 413,988.6 | 404,817.4 | 394,156.0 | 411,282.8 | 389,992.6 |
| 128 | 758,259.0 | 759,423.2 | 755,778.8 | 774,913.8 | 765,497.8 | 763,987.8 | 748,676.8 | 749,303.6 |
| 256 | 1,381,720.6 | 1,380,707.6 | 1,372,685.0 | 1,357,849.0 | 1,331,275.2 | 1,430,867.4 | 1,377,411.6 | 1,374,432.2 |
| 512 | 2,001,912.4 | 2,011,120.8 | 1,993,617.8 | 2,331,041.0 | 2,097,737.0 | 2,079,965.6 | 1,971,513.8 | 1,989,508.6 |
| 1024 | 2,378,279.6 | 2,412,139.6 | 2,371,655.4 | 2,650,416.8 | 2,477,507.8 | 2,456,023.8 | 2,309,010.4 | 2,353,854.2 |
| 2048 | 2,560,923.0 | 2,604,756.2 | 2,544,586.6 | 2,658,535.8 | 2,631,261.0 | 2,628,532.0 | 2,459,461.2 | 2,523,348.0 |
| 4096 | 2,855,199.2 | 2,942,364.8 | 2,822,369.8 | 2,998,159.4 | 2,936,124.2 | 2,919,140.6 | 2,694,488.8 | 2,794,201.4 |
| 8192 | 2,868,792.8 | 2,953,256.8 | 2,834,506.0 | 2,993,257.8 | 2,924,754.2 | 2,941,119.0 | 2,705,526.4 | 2,806,921.2 |
| 16384 | 2,527,784.0 | 2,595,100.2 | 2,498,789.8 | 2,610,646.8 | 2,540,535.4 | 2,550,376.0 | 2,398,098.4 | 2,475,184.4 |
| 32768 | 2,427,199.8 | 2,492,474.2 | 2,408,768.4 | 2,486,733.6 | 2,381,828.0 | 2,425,293.0 | 2,312,774.0 | 2,384,687.6 |
| 65536 | 2,489,441.8 | 2,554,741.4 | 2,465,692.0 | 2,666,031.8 | 2,419,651.8 | 2,515,099.8 | 2,368,451.8 | 2,438,185.6 |
| 131072 | 2,745,458.4 | 2,820,823.0 | 2,720,660.6 | 3,282,233.0 | 2,625,217.6 | 2,466,424.0 | 2,597,005.2 | 2,680,356.4 |

And now something smaller, Intel(R) Xeon(R) CPU E5-2650 0, 32 CPUs in total.
28 threads used for the test:

| buckets | ops/sec |
| 0 | 2,344,905.8 |
| 2 | 91,881.2 |
| 4 | 168,243.0 |
| 8 | 310,982.2 |
| 16 | 550,534.4 |
| 32 | 884,066.0 |
| 64 | 1,475,389.4 |
| 128 | 1,949,364.6 |
| 256 | 2,142,025.8 |
| 512 | 2,234,222.2 |
| 1024 | 2,267,931.8 |
| 2048 | 2,287,753.4 |
| 4096 | 2,315,330.4 |
| 8192 | 2,337,878.2 |
| 16384 | 2,444,502.2 |

14 threads limited to a node:

| buckets | node 0 | node 1 |
| 0 | 2,761,709.8 | 2,765,630.0 |
| 2 | 397,527.8 | 397,126.8 |
| 4 | 718,205.0 | 719,615.2 |
| 8 | 1,350,627.4 | 1,305,201.4 |
| 16 | 1,992,643.4 | 1,989,499.2 |
| 32 | 2,365,813.6 | 2,357,618.6 |
| 64 | 2,554,185.8 | 2,555,256.8 |
| 128 | 2,646,479.0 | 2,654,572.6 |
| 256 | 2,679,394.4 | 2,698,002.4 |
| 512 | 2,713,385.6 | 2,723,413.6 |
| 1024 | 2,719,330.6 | 2,733,464.6 |
| 2048 | 2,730,376.6 | 2,738,581.6 |
| 4096 | 2,704,520.6 | 2,720,546.4 |
| 8192 | 2,773,213.4 | 2,782,565.6 |
| 16384 | 2,863,843.2 | 2,858,963.2 |

And now the algorithms, 28 threads:

| buckets | jhash2 (regular) | jhash2 (addr+offs) | xxhash | hash_long | crc32c | crc32 | siphash | hsiphash |
| 2 | 92,557.8 | 92,815.2 | 93,172.2 | 103,097.6 | 97,403.2 | 92,629.6 | 94,030.8 | 91,847.2 |
| 4 | 165,385.2 | 167,200.0 | 168,681.2 | 177,600.2 | 172,851.2 | 173,423.6 | 167,814.6 | 168,136.8 |
| 8 | 319,044.0 | 317,291.6 | 318,322.0 | 342,179.4 | 318,252.6 | 323,456.6 | 319,079.6 | 317,106.2 |
| 16 | 555,103.6 | 556,075.0 | 563,529.0 | 595,052.8 | 537,199.2 | 557,180.4 | 554,498.8 | 550,170.4 |
| 32 | 896,751.8 | 908,569.4 | 908,687.4 | 852,593.2 | 892,222.6 | 919,105.0 | 874,487.8 | 920,554.6 |
| 64 | 1,488,013.0 | 1,500,952.6 | 1,467,258.8 | 1,528,428.2 | 1,530,458.6 | 1,526,439.6 | 1,459,185.2 | 1,480,434.0 |
| 128 | 1,944,216.0 | 1,974,618.6 | 1,927,277.6 | 1,748,598.4 | 1,989,212.0 | 1,975,526.2 | 1,839,080.4 | 1,903,844.4 |
| 256 | 2,142,823.0 | 2,185,436.6 | 2,126,787.8 | 2,194,752.2 | 2,189,521.2 | 2,164,454.2 | 1,987,121.0 | 2,081,487.0 |
| 512 | 2,232,887.4 | 2,279,553.4 | 2,215,265.8 | 2,274,402.6 | 2,278,595.6 | 2,262,156.4 | 2,047,572.8 | 2,169,430.8 |
| 1024 | 2,269,308.2 | 2,312,200.0 | 2,250,841.0 | 2,278,423.2 | 2,328,832.6 | 2,288,490.0 | 2,075,494.2 | 2,190,907.8 |
| 2048 | 2,281,539.0 | 2,336,340.6 | 2,255,446.8 | 2,221,195.4 | 2,374,069.2 | 2,330,833.2 | 2,083,151.4 | 2,196,610.0 |
| 4096 | 2,315,628.8 | 2,367,224.6 | 2,284,841.4 | 2,397,385.8 | 2,373,043.2 | 2,394,276.6 | 2,104,235.0 | 2,233,600.0 |
| 8192 | 2,341,296.8 | 2,401,307.8 | 2,320,777.4 | 2,336,329.0 | 2,331,216.4 | 2,391,361.6 | 2,122,129.4 | 2,250,452.6 |
| 16384 | 2,435,181.6 | 2,509,588.2 | 2,422,407.8 | 2,378,702.8 | 2,514,325.8 | 2,552,565.8 | 2,200,619.0 | 2,350,706.0 |

Sebastian
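For reference, a rough sketch of the kind of hook these numbers compare:
picking the bucket index for a private futex key with jhash2 (what futex.c
uses today) versus hash_long(). Illustration only, not the actual patch; the
alternative-hash wiring is made up, the key fields follow the upstream code.

#include <linux/futex.h>	/* union futex_key */
#include <linux/jhash.h>
#include <linux/hash.h>
#include <linux/stddef.h>

/* Map a private futex key to a bucket index, masked to the table size. */
static unsigned int bucket_idx_jhash2(union futex_key *key, unsigned int mask)
{
	/* Same hashing as the current futex_hash(): ptr + word + offset. */
	u32 hash = jhash2((u32 *)key,
			  offsetof(typeof(*key), both.offset) / 4,
			  key->both.offset);
	return hash & mask;
}

static unsigned int bucket_idx_hash_long(union futex_key *key, unsigned int mask)
{
	/* Cheaper to compute, but mixes the bits less thoroughly than jhash2. */
	return hash_long(key->both.word + key->both.offset, 32) & mask;
}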
Hi Sebastian,

On 28/10/24 13:13, Sebastian Andrzej Siewior wrote:
> Hi,
>
> this is a follow-up on
> https://lore.kernel.org/ZwVOMgBMxrw7BU9A@jlelli-thinkpadt14gen4.remote.csb

Thank you so much for working on this!

> and adds support for a task-local futex_hash_bucket. It can be created via
> prctl(). The last patch in the series enables it once the first thread is
> created.
>
> I've been watching how this auto-create behaves: so far dpkg creates threads
> and uses the local hashmap. systemd-journal on the other hand forks a thread
> from time to time and I haven't seen it using the hashmap. Need to do
> more testing.

I ported it to one of our kernels with the intent of asking perf folks to
have a go at it (after some manual smoke testing maybe). It will take a
couple of weeks or so to get numbers back.

Do you need specific additional info to be collected while running? I saw
your reply about usage. If you want to agree on what to collect, feel free
to send out the debug patch I guess you used for that.

Of course I'm also going to play with it myself and holler if I find any
issue.

Best,
Juri
On 2024-10-29 12:10:25 [+0100], Juri Lelli wrote:
> Hi Sebastian,
Hi Juri,
> > I've been watching how this auto-create behaves: so far dpkg creates threads
> > and uses the local hashmap. systemd-journal on the other hand forks a thread
> > from time to time and I haven't seen it using the hashmap. Need to do
> > more testing.
>
> I ported it to one of our kernels with the intent of asking perf folks
> to have a go at it (after some manual smoke testing maybe). It will
> take a couple of weeks or so to get numbers back.
Thanks.
> Do you need specific additional info to possibly be collected while
> running? I saw your reply about usage. If you want to agree on what to
> collect feel free to send out the debug patch I guess you used for that.
If you run specific locking test cases, you could try setting the number of
slots upfront (instead of relying on the default of 4) and see how this
affects the performance. Also, there is a cap at 16; you might want to
raise this to 1024, try some higher numbers and see how this affects
performance. The prctl() interface should make it easy to set/get the values.
The default of 4 might be too conservative.
That would give an idea of what a sane default value and upper limit might be.
The hunk attached (against the to-be-posted v3) adds counters to see how
many auto-allocated slots were used vs not used. In my tests the number
of unused hash buckets was very small, so I don't think it matters.
> Best,
> Juri
Sebastian
---------------------->8---------------------
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 3b8c8975cd493..aa2a0d059b1a8 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -248,6 +248,7 @@ struct signal_struct {
* and may have inconsistent
* permissions.
*/
+ unsigned int futex_hash_used;
unsigned int futex_hash_mask;
struct futex_hash_bucket *futex_hash_bucket;
} __randomize_layout;
diff --git a/kernel/fork.c b/kernel/fork.c
index e792a43934363..341331778032a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -945,10 +945,19 @@ static void mmdrop_async(struct mm_struct *mm)
}
}
+extern atomic64_t futex_hash_stats_used;
+extern atomic64_t futex_hash_stats_unused;
+
static inline void free_signal_struct(struct signal_struct *sig)
{
taskstats_tgid_free(sig);
sched_autogroup_exit(sig);
+ if (sig->futex_hash_bucket) {
+ if (sig->futex_hash_used)
+ atomic64_inc(&futex_hash_stats_used);
+ else
+ atomic64_inc(&futex_hash_stats_unused);
+ }
kfree(sig->futex_hash_bucket);
/*
* __mmdrop is not safe to call from softirq context on x86 due to
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index b48abf2e97c25..04a597736cb00 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -40,6 +40,7 @@
#include <linux/fault-inject.h>
#include <linux/slab.h>
#include <linux/prctl.h>
+#include <linux/proc_fs.h>
#include "futex.h"
#include "../locking/rtmutex_common.h"
@@ -132,8 +133,10 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
key->both.offset);
fhb = current->signal->futex_hash_bucket;
- if (fhb && futex_key_is_private(key))
+ if (fhb && futex_key_is_private(key)) {
+ current->signal->futex_hash_used = 1;
return &fhb[hash & current->signal->futex_hash_mask];
+ }
return &futex_queues[hash & (futex_hashsize - 1)];
}
@@ -1202,8 +1205,13 @@ static int futex_hash_allocate(unsigned int hash_slots)
return 0;
}
+atomic64_t futex_hash_stats_used;
+atomic64_t futex_hash_stats_unused;
+atomic64_t futex_hash_stats_auto_create;
+
int futex_hash_allocate_default(void)
{
+ atomic64_inc(&futex_hash_stats_auto_create);
return futex_hash_allocate(0);
}
@@ -1235,6 +1243,19 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3,
return ret;
}
+static int proc_show_futex_stats(struct seq_file *seq, void *offset)
+{
+ long fh_used, fh_unused, fh_auto_create;
+
+ fh_used = atomic64_read(&futex_hash_stats_used);
+ fh_unused = atomic64_read(&futex_hash_stats_unused);
+ fh_auto_create = atomic64_read(&futex_hash_stats_auto_create);
+
+ seq_printf(seq, "used: %ld unused: %ld auto: %ld\n",
+ fh_used, fh_unused, fh_auto_create);
+ return 0;
+}
+
static int __init futex_init(void)
{
unsigned int futex_shift;
@@ -1255,6 +1276,7 @@ static int __init futex_init(void)
for (i = 0; i < futex_hashsize; i++)
futex_hash_bucket_init(&futex_queues[i]);
+ proc_create_single("futex_stats", 0, NULL, proc_show_futex_stats);
return 0;
}
core_initcall(futex_init);
On 2024-10-28 13:13:54 [+0100], To linux-kernel@vger.kernel.org wrote:
> from time to time and I haven't seen it using the hashmap. Need to do
> more testing.

Booted gnome and did a few things:
- Total allocations: 632
- Tasks which never used their allocated futex hash: 2
  (gpg-agent and systemd-journal)
- Tasks which did not terminate within the measurement: 85
  (this includes gpg-agent and systemd-journal)
- Top 5 users of the private hash:
  - firefox-esr-3786 used 215985
  - gnome-software-2343 used 121296
  - chromium-3369 used 65796
  - chromium-3209 used 34639
  - Isolated used 34211

This looks like we could attach the private futex hashmap directly on fork
instead of delaying it to the first usage.

Side note: if someone is waiting for a thread to exit via pthread_join(),
glibc uses futex() with op 0x109 here. I would have expected a private flag.

Sebastian
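For reference, 0x109 decodes (per the uapi <linux/futex.h> flag values) to
FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME, i.e. indeed without
FUTEX_PRIVATE_FLAG. A quick check:

#include <stdio.h>
#include <linux/futex.h>

int main(void)
{
	unsigned int op = 0x109;

	/* 0x109 = 0x100 | 0x9 = FUTEX_CLOCK_REALTIME | FUTEX_WAIT_BITSET,
	 * with FUTEX_PRIVATE_FLAG (0x80) not set. */
	printf("cmd=%u (FUTEX_WAIT_BITSET=%d) realtime=%d private=%d\n",
	       op & FUTEX_CMD_MASK, FUTEX_WAIT_BITSET,
	       !!(op & FUTEX_CLOCK_REALTIME), !!(op & FUTEX_PRIVATE_FLAG));
	return 0;
}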