TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
unbound workqueues using the default affinity (WQ_AFFN_CACHE) collapse
to a single worker pool, causing heavy spinlock (pool->lock) contention.
Create a new affinity (WQ_AFFN_CACHE_SHARD) that caps each pool at
wq_cache_shard_size CPUs (default 8).
Problem
=======
Some modern systems have many CPUs sharing one LLC. Here are some examples I have
access to:
* NVIDIA Grace CPU: 72 real CPUs per LLC
* Intel(R) Xeon(R) Gold 6450C: 59 SMT threads per LLC
* Intel(R) Xeon(R) Platinum 8321HC: 51 SMT threads per LLC
On these systems, the default unbound workqueue uses the WQ_AFFN_CACHE
affinity, which results in just a single pool for the whole system (when
all the CPUs share the same LLC as the systems above).
This causes contention on pool->lock, potentially hurting IO
performance (btrfs, writeback, etc.).
When profiling an IO-intensive use case at Meta, I found significant
contention in __queue_work(), making pool->lock one of the top 5
contended locks.
Additionally, Chuck Lever recently reported this problem:
"For example, on a 12-core system with a single shared L3 cache running
NFS over RDMA with 12 fio jobs, perf shows approximately 39% of CPU
cycles spent in native_queued_spin_lock_slowpath, nearly all from
__queue_work() contending on the single pool lock.
On such systems WQ_AFFN_CACHE, WQ_AFFN_SMT, and WQ_AFFN_NUMA
scopes all collapse to a single pod."
Link: https://lore.kernel.org/all/20260203143744.16578-1-cel@kernel.org/
Solution
========
Tejun suggested solving this problem by creating an intermediate
affinity level (aka cache_shard), which would shard the WQ_AFFN_CACHE
using a heuristic, avoiding collapsing all those affinity levels to
a single pod.
This series implements that: it adds an intermediate sharded cache
affinity and makes it the default.
Micro benchmark
===============
To test its benefit, I created a microbenchmark (part of this series)
that enqueues work (queue_work) in a loop and reports the latency.
Benchmark on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread):
cpu 3248519 items/sec p50=10944 p90=11488 p95=11648 ns
smt 3362119 items/sec p50=10945 p90=11520 p95=11712 ns
cache_shard 3629098 items/sec p50=6080 p90=8896 p95=9728 ns (NEW) **
cache 708168 items/sec p50=44000 p90=47104 p95=47904 ns
numa 710559 items/sec p50=44096 p90=47265 p95=48064 ns
system 718370 items/sec p50=43104 p90=46432 p95=47264 ns
Same benchmark on the Intel 8321HC:
cpu 2831751 items/sec p50=3909 p90=9222 p95=11580 ns
smt 2810699 items/sec p50=2229 p90=4928 p95=5979 ns
cache_shard 1861028 items/sec p50=4874 p90=8423 p95=9415 ns (NEW)
cache 591001 items/sec p50=24901 p90=29865 p95=31169 ns
numa 590431 items/sec p50=24901 p90=29819 p95=31133 ns
system 591912 items/sec p50=25049 p90=29916 p95=31219 ns
(** It is still unclear why cache_shard is "better" than SMT on
Grace/ARM. The result is consistently reproducible, though; still
investigating.)
Block benchmark
===============
Host: Intel(R) Xeon(R) D-2191A CPU @ 1.60GHz (16 Cores - 32 SMT)
In order to stress the workqueue, I am running fio on a dm-crypt device.
1) Create a plain dm-crypt device on top of NVMe
* cryptsetup creates an encrypted block device (/dev/mapper/crypt_nvme) on top
of a raw NVMe drive. All I/O to this device goes through kcryptd — dm-crypt's
workqueue that handles AES encryption/decryption of every data block.
# cryptsetup open --type plain -c aes-xts-plain64 -s 256 /dev/nvme0n1 crypt_nvme -d -
2) Run fio
* fio hammers the encrypted device with 36 threads (one per CPU), each doing
128-deep 4K _buffered_ I/O for 10 seconds. This generates massive workqueue
pressure — every I/O completion triggers a kcryptd work item to encrypt or
decrypt data.
# fio --filename=/dev/mapper/crypt_nvme \
      --name=crypt_bench --rw=randread \
      --ioengine=io_uring --direct=0 \
      --bs=4k --iodepth=128 \
      --numjobs=$(nproc) --runtime=10 \
      --time_based --group_reporting
(--rw is switched among randread/randwrite/randrw for the workloads
in the tables below.)
Running this for ~3 hours:
┌────────────┬────────────────────────┬────────────────────────┬───────────┬────────┬─────────────────┐
│ Workload │ Avg cache │ Avg cache_shard │ Avg delta │ Stddev │ 2-sigma range │
├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
│ randread │ 389 MiB/s (99.6k IOPS) │ 413 MiB/s (106k IOPS) │ +5.9% │ 3.3% │ -0.7% to +12.5% │
├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
│ randwrite │ 622 MiB/s (159k IOPS) │ 614 MiB/s (157k IOPS) │ -1.3% │ 0.9% │ -3.1% to +0.5% │
├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
│ randrw │ 240 MiB/s (61.4k IOPS) │ 250 MiB/s (64.1k IOPS) │ +4.3% │ 3.4% │ -2.5% to +11.1% │
└────────────┴────────────────────────┴────────────────────────┴───────────┴────────┴─────────────────┘
The same benchmark with buffered IO:
┌───────────┬────────────────────────┬────────────────────────┬───────────┬────────┬────────────────┐
│ Workload │ Avg cache │ Avg cache_shard │ Avg delta │ Stddev │ 2-sigma range │
├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
│ randread │ 559 MiB/s (143k IOPS) │ 577 MiB/s (148k IOPS) │ +3.1% │ 1.3% │ +0.5% to +5.7% │
├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
│ randwrite │ 437 MiB/s (112k IOPS) │ 431 MiB/s (110k IOPS) │ -1.5% │ 1.0% │ -3.5% to +0.5% │
├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
│ randrw │ 272 MiB/s (69.7k IOPS) │ 273 MiB/s (69.8k IOPS) │ +0.1% │ 1.5% │ -2.9% to +3.1% │
└───────────┴────────────────────────┴────────────────────────┴───────────┴────────┴────────────────┘
(The randwrite regression appears to be within the noise.)
Patchset organization
=====================
This series adds a new WQ_AFFN_CACHE_SHARD affinity scope that
subdivides each LLC into groups of at most wq_cache_shard_size CPUs
(default 8, tunable via boot parameter), providing an intermediate
option between per-LLC and per-SMT-core granularity.
Besides the feature itself, this patchset also prepares the code for
the new cache_shard affinity and adds a stress-test module for
workqueues. The final patches then make the new sharded cache affinity
the default.
On systems with 8 or fewer CPUs per LLC, CACHE_SHARD produces a single
shard covering the entire LLC, making it functionally identical to the
previous CACHE default. The sharding only activates when an LLC has more
than 8 CPUs.
---
Breno Leitao (5):
workqueue: fix parse_affn_scope() prefix matching bug
workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
workqueue: add test_workqueue benchmark module
tools/workqueue: add CACHE_SHARD support to wq_dump.py
include/linux/workqueue.h | 1 +
kernel/workqueue.c | 72 ++++++++++--
lib/Kconfig.debug | 10 ++
lib/Makefile | 1 +
lib/test_workqueue.c | 275 +++++++++++++++++++++++++++++++++++++++++++++
tools/workqueue/wq_dump.py | 3 +-
6 files changed, 352 insertions(+), 10 deletions(-)
---
base-commit: b29fb8829bff243512bb8c8908fd39406f9fd4c3
change-id: 20260309-workqueue_sharded-2327956e889b
Best regards,
--
Breno Leitao <leitao@debian.org>
Hello,

Applied 1/5. Some comments on the rest:

- The sharding currently splits on CPU boundary, which can split SMT
  siblings across different pods. The worse performance on Intel
  compared to SMT scope may be indicating exactly this - HT siblings
  ending up in different pods. It'd be better to shard on core boundary
  so that SMT siblings always stay together.

- How was the default shard size of 8 picked? There's a tradeoff
  between the number of kworkers created and locality. Can you also
  report the number of kworkers for each configuration? And is there
  data on different shard sizes? It'd be useful to see how the numbers
  change across e.g. 4, 8, 16, 32.

- Can you also test on AMD machines? Their CCD topology (16 or 32
  threads per LLC) would be a good data point.

Thanks.

--
tejun
Hello Tejun,

On Fri, Mar 13, 2026 at 07:57:20AM -1000, Tejun Heo wrote:
> Hello,
>
> Applied 1/5. Some comments on the rest:
>
> - The sharding currently splits on CPU boundary, which can split SMT
>   siblings across different pods. The worse performance on Intel
>   compared to SMT scope may be indicating exactly this - HT siblings
>   ending up in different pods. It'd be better to shard on core
>   boundary so that SMT siblings always stay together.

Thank you for the insight. I'll modify the sharding to operate at the
core boundary rather than at the SMT/thread level to ensure sibling CPUs
remain in the same pod.

> - How was the default shard size of 8 picked? There's a tradeoff
>   between the number of kworkers created and locality. Can you also
>   report the number of kworkers for each configuration? And is there
>   data on different shard sizes? It'd be useful to see how the numbers
>   change across e.g. 4, 8, 16, 32.

The choice of 8 as the default shard size was somewhat arbitrary – it
was selected primarily to generate initial data points.

I'll run tests with different shard sizes and report the results.

I'm currently working on finding a suitable workload with minimal
noise. Testing on real NVMe devices shows significant jitter that makes
analysis difficult. I've also been experimenting with nullblk, but
haven't had much success yet.

If you have any suggestions for a reliable workload or benchmark, I'd
appreciate your input.

> - Can you also test on AMD machines? Their CCD topology (16 or 32
>   threads per LLC) would be a good data point.

Absolutely, I'll test on AMD machines as well.

Thanks,
--breno
On 3/17/26 7:32 AM, Breno Leitao wrote:
> Hello Tejun,
>
> On Fri, Mar 13, 2026 at 07:57:20AM -1000, Tejun Heo wrote:
>> Hello,
>>
>> Applied 1/5. Some comments on the rest:
>>
>> - The sharding currently splits on CPU boundary, which can split SMT
>>   siblings across different pods. The worse performance on Intel
>>   compared to SMT scope may be indicating exactly this - HT siblings
>>   ending up in different pods. It'd be better to shard on core
>>   boundary so that SMT siblings always stay together.
>
> Thank you for the insight. I'll modify the sharding to operate at the
> core boundary rather than at the SMT/thread level to ensure sibling
> CPUs remain in the same pod.
>
>> - How was the default shard size of 8 picked? There's a tradeoff
>>   between the number of kworkers created and locality. Can you also
>>   report the number of kworkers for each configuration? And is there
>>   data on different shard sizes? It'd be useful to see how the
>>   numbers change across e.g. 4, 8, 16, 32.
>
> The choice of 8 as the default shard size was somewhat arbitrary – it
> was selected primarily to generate initial data points.

Perhaps instead of basing the sharding on a particular number of CPUs
per shard, why not cap the total number of shards? IIUC that is the
main concern about ballooning the number of kworker threads.

> I'll run tests with different shard sizes and report the results.
>
> I'm currently working on finding a suitable workload with minimal
> noise. Testing on real NVMe devices shows significant jitter that
> makes analysis difficult. I've also been experimenting with nullblk,
> but haven't had much success yet.
>
> If you have any suggestions for a reliable workload or benchmark, I'd
> appreciate your input.
>
>> - Can you also test on AMD machines? Their CCD topology (16 or 32
>>   threads per LLC) would be a good data point.
>
> Absolutely, I'll test on AMD machines as well.
>
> Thanks,
> --breno

--
Chuck Lever
On Tue, Mar 17, 2026 at 09:58:54AM -0400, Chuck Lever wrote:
> On 3/17/26 7:32 AM, Breno Leitao wrote:
> >> - How was the default shard size of 8 picked? There's a tradeoff
> >> between the number of kworkers created and locality. Can you also
> >> report the number of kworkers for each configuration? And is there
> >> data on different shard sizes? It'd be useful to see how the
> >> numbers change across e.g. 4, 8, 16, 32.
> >
> > The choice of 8 as the default shard size was somewhat arbitrary –
> > it was selected primarily to generate initial data points.
>
> Perhaps instead of basing the sharding on a particular number of CPUs
> per shard, why not cap the total number of shards? IIUC that is the
> main concern about ballooning the number of kworker threads.

That's a great suggestion. I'll send a v2 that implements this
approach, where the parameter specifies the number of shards rather
than the number of CPUs per shard.

Thanks for the feedback,
--breno
On Wed, Mar 18, 2026 at 10:51:15AM -0700, Breno Leitao wrote:
> On Tue, Mar 17, 2026 at 09:58:54AM -0400, Chuck Lever wrote:
> > On 3/17/26 7:32 AM, Breno Leitao wrote:
> > >> - How was the default shard size of 8 picked? There's a tradeoff
> > >> between the number of kworkers created and locality. Can you
> > >> also report the number of kworkers for each configuration? And
> > >> is there data on different shard sizes? It'd be useful to see
> > >> how the numbers change across e.g. 4, 8, 16, 32.
> > >
> > > The choice of 8 as the default shard size was somewhat arbitrary
> > > – it was selected primarily to generate initial data points.
> >
> > Perhaps instead of basing the sharding on a particular number of
> > CPUs per shard, why not cap the total number of shards? IIUC that
> > is the main concern about ballooning the number of kworker threads.
>
> That's a great suggestion. I'll send a v2 that implements this
> approach, where the parameter specifies the number of shards rather
> than the number of CPUs per shard.

Would it make sense though? It feels really odd to define the maximum
number of shards when contention is primarily a function of the number
of CPUs banging on the same pool. Why would 32-CPU and 512-CPU systems
have the same number of shards?

Thanks.

--
tejun
On Wed, Mar 18, 2026 at 01:00:07PM -1000, Tejun Heo wrote:
> On Wed, Mar 18, 2026 at 10:51:15AM -0700, Breno Leitao wrote:
> > On Tue, Mar 17, 2026 at 09:58:54AM -0400, Chuck Lever wrote:
> > > On 3/17/26 7:32 AM, Breno Leitao wrote:
> > > >> - How was the default shard size of 8 picked? There's a tradeoff
> > > >> between the number of kworkers created and locality. Can you also
> > > >> report the number of kworkers for each configuration? And is there
> > > >> data on different shard sizes? It'd be useful to see how the numbers
> > > >> change across e.g. 4, 8, 16, 32.
> > > >
> > > > The choice of 8 as the default shard size was somewhat arbitrary – it was
> > > > selected primarily to generate initial data points.
> > >
> > > Perhaps instead of basing the sharding on a particular number of CPUs
> > > per shard, why not cap the total number of shards? IIUC that is the main
> > > concern about ballooning the number of kworker threads.
> >
> > That's a great suggestion. I'll send a v2 that implements this approach,
> > where the parameter specifies the number of shards rather than the number
> > of CPUs per shard.
>
> Would it make sense though? It feels really odd to define the maximum
> number of shards when contention is primarily a function of the number
> of CPUs banging on the same pool. Why would 32-CPU and 512-CPU systems
> have the same number of shards?
The trade-off is that specifying the maximum number of shards makes it
clearer how many ways each LLC is being split, which may be easier to
reason about, but as you point out, it scales contention relief less
directly with the CPU count.
I've collected some numbers with sharding per LLC, and I will switch
back to the original approach to gather comparison data.
Current change:
https://github.com/leitao/linux/commit/bedaf9ebe9594320976dcbf0cb507ecf083097c0
Workload:
========
I've finally found a workload that exercises the workqueue sufficiently,
which allows me to obtain stable benchmark results.
This is what I am doing:
- Sets up a local loopback NFS environment backed by an 8 GB tmpfs
(/tmp/nfsexport → /mnt/nfs)
- Iterates over six fio I/O engines: sync, psync, vsync, pvsync, pvsync2,
libaio
- For each engine, runs a 200-job, 512-byte block size fio benchmark (writes
then reads)
- Tests each workload under both cache and cache_shard workqueue affinity
scopes via /sys/module/workqueue/parameters/default_affinity_scope
- Prints a summary table with aggregate bandwidth (MB) per scope and the
percentage delta to show whether cache_shard helps or hurts
- Restores the affinity scope back to cache when done
The test I am running can be found at
https://github.com/leitao/debug/blob/main/workqueue_performance/test_affinity.sh
Hosts:
======
* ARM (NVIDIA Grace - Neoverse V2 - single L3 domain: CPUs 0-71)
# cat /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort -u
0-71
* AMD (AMD EPYC 9D64 88-Core Processor - 11 L3 domains, 8 cores / 16 vCPUs each)
# cat /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort -u
0-7,88-95
16-23,104-111
24-31,112-119
32-39,120-127
40-47,128-135
48-55,136-143
56-63,144-151
64-71,152-159
72-79,160-167
80-87,168-175
8-15,96-103
Results
=======
TL;DR:
* ARM (single L3, 72 CPUs): cache_shard consistently improves write
throughput by +6 to +12% across all shard counts (2-32), with
the peak at 2 shards. Read impact is minimal (noise-level).
Shard=1 confirms no effect as expected.
* AMD (11 L3 domains, 16 CPUs each): cache_shard shows no meaningful
benefit at 1-4 shards (all within noise/stddev). At 8 shards it
regresses by ~4% for both reads and writes, likely due to loss of
data locality when sharding already-small 16-CPU cache domains
further.
benchmark Data:
===============
ARM:
┌────────┬───────────────────┬──────────────┬──────────────────┬─────────────┐
│ Shards │ Write Delta (avg) │ Write stddev │ Read Delta (avg) │ Read stddev │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 1 │ -0.2% │ ±1.0% │ +1.2% │ ±1.7% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 2 │ +12.5% │ ±1.3% │ -0.3% │ ±0.9% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 4 │ +8.7% │ ±0.9% │ +1.8% │ ±1.5% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 8 │ +11.4% │ ±1.8% │ +3.1% │ ±1.5% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 16 │ +7.8% │ ±1.3% │ +1.6% │ ±1.0% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 32 │ +6.1% │ ±0.6% │ +0.3% │ ±1.5% │
└────────┴───────────────────┴──────────────┴──────────────────┴─────────────┘
AMD:
┌────────┬───────────────────┬──────────────┬──────────────────┬─────────────┐
│ Shards │ Write Delta (avg) │ Write stddev │ Read Delta (avg) │ Read stddev │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 1 │ -0.2% │ ±1.2% │ +0.1% │ ±1.0% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 2 │ +0.7% │ ±1.4% │ +0.5% │ ±1.1% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 4 │ +0.8% │ ±1.1% │ +1.3% │ ±1.2% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 8 │ -4.0% │ ±1.3% │ -4.5% │ ±0.9% │
└────────┴───────────────────┴──────────────┴──────────────────┴─────────────┘
Microbenchmark result
=====================
I've also run the micro-benchmark from this patchset; here is the
results comparison:
* AMD (11 L3 domains, 16 CPUs each): cache_shard delivers +45-55%
throughput and 36-44% lower latency at 2-8 shards. The sweet spot is
4 shards (+55%, p50 cut nearly in half). Shard=1 confirms no effect.
Even though this machine already has multiple L3 domains, each 16-CPU
domain still has enough contention to benefit from further splitting
(at least for this microbenchmark/stress test).
* ARM (single L3, 72 CPUs): The gains are dramatic — 2x at 2 shards,
3.2x at 4 shards, and 4.4x at 8 shards. At 8 shards, cache_shard
(3.2M items/s) nearly matches cpu scope performance (3.7M), with p50
latency dropping from 43.5 us to 6.9 us.
The single monolithic L3 makes cache scope degenerate to a single
contended pool, so sharding has a massive effect.
AMD
┌────────┬─────────────────┬───────────────────────┬─────────────────┬───────────┬───────────┬───────────────────┐
│ Shards │ cache (items/s) │ cache_shard (items/s) │ Throughput gain │ cache p50 │ shard p50 │ Latency reduction │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 1 │ 2,660,103 │ 2,667,740 │ +0.3% │ 27.5 us │ 27.5 us │ 0% │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 2 │ 2,619,884 │ 3,788,454 │ +44.6% │ 28.0 us │ 17.8 us │ -36% │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 4 │ 2,506,185 │ 3,891,064 │ +55.3% │ 29.3 us │ 16.5 us │ -44% │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 8 │ 2,628,321 │ 4,015,312 │ +52.8% │ 27.9 us │ 16.4 us │ -41% │
└────────┴─────────────────┴───────────────────────┴─────────────────┴───────────┴───────────┴───────────────────┘
Reference scopes (stable across shard counts): cpu ~6.2M items/s, smt ~4.0M, numa/system ~422K.
ARM
┌────────┬─────────────────┬───────────────────────┬─────────────────┬───────────┬───────────┬───────────────────┐
│ Shards │ cache (items/s) │ cache_shard (items/s) │ Throughput gain │ cache p50 │ shard p50 │ Latency reduction │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 2 │ 725,999 │ 1,516,967 │ +109% │ 43.8 us │ 19.6 us │ -55% │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 4 │ 729,615 │ 2,347,335 │ +222% │ 43.6 us │ 11.0 us │ -75% │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 8 │ 731,517 │ 3,230,168 │ +342% │ 43.5 us │ 6.9 us │ -84% │
└────────┴─────────────────┴───────────────────────┴─────────────────┴───────────┴───────────┴───────────────────┘
Next Steps:
* Revert the code to sharding by CPU count (instead of by shard count) and
report it again.
* Are there any other tests that would help?