TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
unbound workqueues using the default affinity (WQ_AFFN_CACHE) collapse
to a single worker pool, causing heavy spinlock (pool->lock) contention.
Create a new affinity (WQ_AFFN_CACHE_SHARD) that caps each pool at
wq_cache_shard_size CPUs (default 8).
Changes from RFC:
* wq_cache_shard_size is in terms of cores (not vCPUs). So,
  wq_cache_shard_size=8 means the pool gets 8 cores plus their SMT
  siblings, i.e. 16 threads/CPUs with SMT=2
* Got more data:
  - AMD EPYC: All means are within ~1 stdev of zero; the deltas are
    indistinguishable from noise, and shard scoping has no measurable
    effect regardless of shard size. This is expected, since this
    machine already has 11 L3 domains and pool->lock contention was
    not a problem to begin with.
  - ARM: A strong, consistent signal. At shard sizes 8 and 16 the mean
    write improvement is ~7% with a relatively tight stdev (~1-2%),
    meaning the gain is real and reproducible across all IO engines.
    Even shard size 4 shows a solid +3.5% with the tightest stdev
    (0.97%).
    Reads: Small shard sizes (2, 4) show a slight regression of
    ~1.3-1.7% (low stdev, so consistent). Larger shard sizes (8, 16)
    flip to a modest +1.4% gain, though shard_size=8 reads have high
    variance (stdev 2.79%) driven by a single outlier that appears to
    be noise.
  - Sweet spot: Shard sizes 8 to 16 offer the best overall profile:
    the highest write gain (6.95%) with the lowest write stdev (1.18%),
    plus a consistent read gain (1.42%, stdev 0.70%), and no impact on
    AMD/x86.
* ARM (NVIDIA Grace - Neoverse V2 - single L3 domain: CPUs 0-71)
┌────────────┬────────────┬─────────────┬───────────┬────────────┐
│ Shard Size │ Write Mean │ Write StDev │ Read Mean │ Read StDev │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 2 │ +0.75% │ 1.32% │ -1.28% │ 0.45% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 4 │ +3.45% │ 0.97% │ -1.73% │ 0.52% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 8 │ +6.72% │ 1.97% │ +1.38% │ 2.79% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 16 │ +6.95% │ 1.18% │ +1.42% │ 0.70% │
└────────────┴────────────┴─────────────┴───────────┴────────────┘
* x86 (AMD EPYC 9D64 88-Core Processor - 11 L3 domains, 8 Cores / 16 vCPUs each)
┌────────────┬────────────┬─────────────┬───────────┬────────────┐
│ Shard Size │ Write Mean │ Write StDev │ Read Mean │ Read StDev │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 2 │ +3.22% │ 1.90% │ -0.08% │ 0.72% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 4 │ +0.92% │ 1.59% │ +0.67% │ 2.33% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 8 │ +1.75% │ 1.47% │ -0.42% │ 0.72% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 16 │ +1.22% │ 1.72% │ +0.43% │ 1.32% │
└────────────┴────────────┴─────────────┴───────────┴────────────┘
---
Changes in v3:
- Precomputed the shards to avoid exponential time when creating the
pool. (Tejun)
- Added documentation about the new cache sharding affinity.
- Fixed a use-after-free on module unload (in the selftest)
- Link to v2: https://patch.msgid.link/20260320-workqueue_sharded-v2-0-8372930931af@debian.org
Changes in v2:
- wq_cache_shard_size is in terms of cores (not vCPU)
- Link to v1: https://patch.msgid.link/20260312-workqueue_sharded-v1-0-2c43a7b861d0@debian.org
---
Breno Leitao (6):
workqueue: fix typo in WQ_AFFN_SMT comment
workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
tools/workqueue: add CACHE_SHARD support to wq_dump.py
workqueue: add test_workqueue benchmark module
docs: workqueue: document WQ_AFFN_CACHE_SHARD affinity scope
Documentation/admin-guide/kernel-parameters.txt | 3 +-
Documentation/core-api/workqueue.rst | 14 +-
include/linux/workqueue.h | 3 +-
kernel/workqueue.c | 185 ++++++++++++++-
lib/Kconfig.debug | 10 +
lib/Makefile | 1 +
lib/test_workqueue.c | 294 ++++++++++++++++++++++++
tools/workqueue/wq_dump.py | 3 +-
8 files changed, 505 insertions(+), 8 deletions(-)
---
base-commit: 0e4f8f1a3d081e834be5fd0a62bdb2554fadd307
change-id: 20260309-workqueue_sharded-2327956e889b
Best regards,
--
Breno Leitao <leitao@debian.org>