TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
unbound workqueues using the default affinity (WQ_AFFN_CACHE) collapse
to a single worker pool, causing heavy spinlock (pool->lock) contention.
Create a new affinity (WQ_AFFN_CACHE_SHARD) that caps each pool at
wq_cache_shard_size CPUs (default 8).
Changes from RFC:
* wq_cache_shard_size is in terms of cores (not vCPUs). So,
  wq_cache_shard_size=8 means the pool will have 8 cores and their SMT
  siblings, i.e., 16 threads/CPUs when SMT is enabled (2 threads per core)
* Got more data:
  - AMD EPYC: All means are within ~1 stdev of zero; the deltas are
    indistinguishable from noise, and shard scoping has no measurable
    effect regardless of shard size. This is expected, since the AMD EPYC
    already has 11 L3 domains, so pool->lock contention was not a problem
    to begin with.
  - ARM: A strong, consistent signal. At shard sizes 8 and 16 the mean
    write improvement is ~7% with a relatively tight stdev (~1-2%),
    meaning the gain is real and reproducible across all IO engines.
    Even shard size 4 shows a solid +3.5% with the tightest stdev
    (0.97%).
    Reads: Small shard sizes (2, 4) show a slight regression of
    ~1.3-1.7% (low stdev, so consistent). Larger shard sizes (8, 16)
    flip to a modest +1.4% gain, though shard_size=8 reads have high
    variance (stdev 2.79%), driven by a single outlier that appears to
    be noise.
  - Sweet spot: Shard sizes 8 to 16 offer the best overall profile:
    the highest write gain (+6.95%) with the lowest write stdev (1.18%),
    plus a consistent read gain (+1.42%, stdev 0.70%), and no impact on
    AMD EPYC.
* ARM (NVIDIA Grace - Neoverse V2 - single L3 domain: CPUs 0-71)
┌────────────┬────────────┬─────────────┬───────────┬────────────┐
│ Shard Size │ Write Mean │ Write StDev │ Read Mean │ Read StDev │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 2 │ +0.75% │ 1.32% │ -1.28% │ 0.45% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 4 │ +3.45% │ 0.97% │ -1.73% │ 0.52% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 8 │ +6.72% │ 1.97% │ +1.38% │ 2.79% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 16 │ +6.95% │ 1.18% │ +1.42% │ 0.70% │
└────────────┴────────────┴─────────────┴───────────┴────────────┘
* AMD (EPYC 9D64 88-Core Processor - 11 L3 domains, 8 cores / 16 vCPUs each)
┌────────────┬────────────┬─────────────┬───────────┬────────────┐
│ Shard Size │ Write Mean │ Write StDev │ Read Mean │ Read StDev │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 2 │ +3.22% │ 1.90% │ -0.08% │ 0.72% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 4 │ +0.92% │ 1.59% │ +0.67% │ 2.33% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 8 │ +1.75% │ 1.47% │ -0.42% │ 0.72% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 16 │ +1.22% │ 1.72% │ +0.43% │ 1.32% │
└────────────┴────────────┴─────────────┴───────────┴────────────┘
---
Changes in v2:
- wq_cache_shard_size is in terms of cores (not vCPUs)
- Link to v1: https://patch.msgid.link/20260312-workqueue_sharded-v1-0-2c43a7b861d0@debian.org
---
Breno Leitao (5):
workqueue: fix typo in WQ_AFFN_SMT comment
workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
tools/workqueue: add CACHE_SHARD support to wq_dump.py
workqueue: add test_workqueue benchmark module
include/linux/workqueue.h | 3 +-
kernel/workqueue.c | 110 +++++++++++++++++-
lib/Kconfig.debug | 10 ++
lib/Makefile | 1 +
lib/test_workqueue.c | 277 +++++++++++++++++++++++++++++++++++++++++++++
tools/workqueue/wq_dump.py | 3 +-
6 files changed, 401 insertions(+), 3 deletions(-)
---
base-commit: 1adb306427e971ccac25b19410c9f068b92bd583
change-id: 20260309-workqueue_sharded-2327956e889b
Best regards,
--
Breno Leitao <leitao@debian.org>
On Fri, Mar 20, 2026, at 1:56 PM, Breno Leitao wrote:
> TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
> unbound workqueues using the default affinity (WQ_AFFN_CACHE) collapse
> to a single worker pool, causing heavy spinlock (pool->lock) contention.
> Create a new affinity (WQ_AFFN_CACHE_SHARD) that caps each pool at
> wq_cache_shard_size CPUs (default 8).
>
> Changes from RFC:
>
> * wq_cache_shard_size is in terms of cores (not vCPU). So,
> wq_cache_shard_size=8 means the pool will have 8 cores and their siblings,
> like 16 threads/CPUs if SMT=1

My concern about the "cores per shard" approach is that it
improves the default situation for moderately-sized machines
little or not at all.

A machine with one L3 and 10 cores will go from 1 UNBOUND
pool to only 2. For virtual machines commonly deployed as
cloud instances, which are 2, 4, or 8 core systems (up to
16 threads), there will still be significant contention for
UNBOUND workers.

IOW, if you want good scaling, human intervention (via a
boot command-line option) is still needed.

--
Chuck Lever
Hello Chuck,

On Mon, Mar 23, 2026 at 10:11:07AM -0400, Chuck Lever wrote:
> On Fri, Mar 20, 2026, at 1:56 PM, Breno Leitao wrote:
> > TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
> > unbound workqueues using the default affinity (WQ_AFFN_CACHE) collapse
> > to a single worker pool, causing heavy spinlock (pool->lock) contention.
> > Create a new affinity (WQ_AFFN_CACHE_SHARD) that caps each pool at
> > wq_cache_shard_size CPUs (default 8).
> >
> > Changes from RFC:
> >
> > * wq_cache_shard_size is in terms of cores (not vCPU). So,
> > wq_cache_shard_size=8 means the pool will have 8 cores and their siblings,
> > like 16 threads/CPUs if SMT=1
>
> My concern about the "cores per shard" approach is that it
> improves the default situation for moderately-sized machines
> little or not at all.
>
> A machine with one L3 and 10 cores will go from 1 UNBOUND
> pool to only 2. For virtual machines commonly deployed as
> cloud instances, which are 2, 4, or 8 core systems (up to
> 16 threads) there will still be significant contention for
> UNBOUND workers.

Could you clarify your concern? Are you suggesting the default value of
wq_cache_shard_size=8 is too high, or that the cores-per-shard approach
fundamentally doesn't scale well for moderately-sized systems?

Any approach, whether sharding by cores or by LLC, ultimately relies on
heuristics that may need tuning for specific workloads. The key
difference is where we draw the line. The current default of 8 cores
prevents the worst-case scenario: severe lock contention on large
systems with 16+ CPUs all hammering a single unbound workqueue.

For smaller systems (2-4 CPUs), contention is usually negligible
regardless of the approach. My perf lock contention measurements
consistently show minimal contention in that range.

> IOW, if you want good scaling, human intervention (via a
> boot command-line option) is still needed.

I am not convinced. The wq_cache_shard_size approach creates multiple
pools on large systems while leaving small systems (<8 cores) unchanged.
This eliminates the pathological lock contention we're observing on
high-core-count machines without impacting smaller deployments.

In contrast, splitting pools per LLC would force fragmentation even on
systems that aren't experiencing contention, increasing the need for
manual tuning across a wider range of configurations.

Thanks for the review,
--breno
On 3/23/26 11:10 AM, Breno Leitao wrote:
> Hello Chuck,
>
> On Mon, Mar 23, 2026 at 10:11:07AM -0400, Chuck Lever wrote:
>> On Fri, Mar 20, 2026, at 1:56 PM, Breno Leitao wrote:
>>> TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
>>> unbound workqueues using the default affinity (WQ_AFFN_CACHE) collapse
>>> to a single worker pool, causing heavy spinlock (pool->lock) contention.
>>> Create a new affinity (WQ_AFFN_CACHE_SHARD) that caps each pool at
>>> wq_cache_shard_size CPUs (default 8).
>>>
>>> Changes from RFC:
>>>
>>> * wq_cache_shard_size is in terms of cores (not vCPU). So,
>>> wq_cache_shard_size=8 means the pool will have 8 cores and their siblings,
>>> like 16 threads/CPUs if SMT=1
>>
>> My concern about the "cores per shard" approach is that it
>> improves the default situation for moderately-sized machines
>> little or not at all.
>>
>> A machine with one L3 and 10 cores will go from 1 UNBOUND
>> pool to only 2. For virtual machines commonly deployed as
>> cloud instances, which are 2, 4, or 8 core systems (up to
>> 16 threads) there will still be significant contention for
>> UNBOUND workers.
>
> Could you clarify your concern? Are you suggesting the default value of
> wq_cache_shard_size=8 is too high, or that the cores-per-shard approach
> fundamentally doesn't scale well for moderately-sized systems?
>
> Any approach—whether sharding by cores or by LLC—ultimately relies on
> heuristics that may need tuning for specific workloads. The key difference
> is where we draw the line. The current default of 8 cores prevents the
> worst-case scenario: severe lock contention on large systems with 16+ CPUs
> all hammering a single unbound workqueue.

An 8-core machine with 16 threads can handle quite a bit of I/O, but
with the proposed scheme it will still have a single UNBOUND pool. For
NFS workloads I commonly benchmark, splitting the UNBOUND pool on such
systems is a very clear win.

> For smaller systems (2-4 CPUs), contention is usually negligible
> regardless of the approach. My perf lock contention measurements
> consistently show minimal contention in that range.
>
>> IOW, if you want good scaling, human intervention (via a
>> boot command-line option) is still needed.
>
> I am not convinced. The wq_cache_shard_size approach creates multiple
> pools on large systems while leaving small systems (<8 cores) unchanged.

This is exactly my concern. Smaller systems /do/ experience measurable
contention in this area. I don't object to your series at all, it's
clean and well-motivated; but the cores-per-shard approach doesn't scale
down to very commonly deployed machine sizes.

We might also argue that the NFS client and other subsystems that make
significant use of UNBOUND workqueues in their I/O paths might be well
advised to modify their approach. (net/sunrpc/sched.c, hint hint)

> This eliminates the pathological lock contention we're observing on
> high-core-count machines without impacting smaller deployments.
>
> In contrast, splitting pools per LLC would force fragmentation even on
> systems that aren't experiencing contention, increasing the need for
> manual tuning across a wider range of configurations.

I claim that smaller deployments also need help. Further, I don't see
how UNBOUND pool fragmentation is a problem on such systems that needs
to be addressed (IMHO).

--
Chuck Lever
Hello Chuck,

On Mon, Mar 23, 2026 at 11:28:49AM -0400, Chuck Lever wrote:
> On 3/23/26 11:10 AM, Breno Leitao wrote:
> >
> > I am not convinced. The wq_cache_shard_size approach creates multiple
> > pools on large systems while leaving small systems (<8 cores) unchanged.
>
> This is exactly my concern. Smaller systems /do/ experience measurable
> contention in this area. I don't object to your series at all, it's
> clean and well-motivated; but the cores-per-shard approach doesn't scale
> down to very commonly deployed machine sizes.

I don't see why the cores-per-shard approach wouldn't scale down
effectively.

The sharding mechanism itself is independent of whether we use
cores-per-shard or shards-per-LLC as the allocation strategy, correct?

Regardless of the approach, we retain full control over the granularity
of the shards.

> We might also argue that the NFS client and other subsystems that make
> significant use of UNBOUND workqueues in their I/O paths might be well
> advised to modify their approach. (net/sunrpc/sched.c, hint hint)
>
> > This eliminates the pathological lock contention we're observing on
> > high-core-count machines without impacting smaller deployments.
>
> > In contrast, splitting pools per LLC would force fragmentation even on
> > systems that aren't experiencing contention, increasing the need for
> > manual tuning across a wider range of configurations.
>
> I claim that smaller deployments also need help. Further, I don't see
> how UNBOUND pool fragmentation is a problem on such systems that needs
> to be addressed (IMHO).

Are you suggesting we should reduce the default value to something like
wq_cache_shard_size=2 instead of wq_cache_shard_size=8?

Thanks for the feedback,
--breno
On 3/23/26 12:26 PM, Breno Leitao wrote:
> Hello Chuck,
>
> On Mon, Mar 23, 2026 at 11:28:49AM -0400, Chuck Lever wrote:
>> On 3/23/26 11:10 AM, Breno Leitao wrote:
>>>
>>> I am not convinced. The wq_cache_shard_size approach creates multiple
>>> pools on large systems while leaving small systems (<8 cores) unchanged.
>>
>> This is exactly my concern. Smaller systems /do/ experience measurable
>> contention in this area. I don't object to your series at all, it's
>> clean and well-motivated; but the cores-per-shard approach doesn't scale
>> down to very commonly deployed machine sizes.
>
> I don't see why the cores-per-shard approach wouldn't scale down
> effectively.

Sharding the UNBOUND pool is fine. But with a fixed cores-per-shard
ratio of 8, it doesn't scale down to smaller systems.

> The sharding mechanism itself is independent of whether we use
> cores-per-shard or shards-per-LLC as the allocation strategy, correct?
>
> Regardless of the approach, we retain full control over the granularity
> of the shards.
>
>> We might also argue that the NFS client and other subsystems that make
>> significant use of UNBOUND workqueues in their I/O paths might be well
>> advised to modify their approach. (net/sunrpc/sched.c, hint hint)
>>
>>> This eliminates the pathological lock contention we're observing on
>>> high-core-count machines without impacting smaller deployments.
>>
>>> In contrast, splitting pools per LLC would force fragmentation even on
>>> systems that aren't experiencing contention, increasing the need for
>>> manual tuning across a wider range of configurations.
>>
>> I claim that smaller deployments also need help. Further, I don't see
>> how UNBOUND pool fragmentation is a problem on such systems that needs
>> to be addressed (IMHO).
>
> Are you suggesting we should reduce the default value to something like
> wq_cache_shard_size=2 instead of wq_cache_shard_size=8?

A shard size of 2 clearly won't scale properly to hundreds of cores.
A varying default cores-per-shard ratio would help scaling in both
directions, without having to manually tune.

--
Chuck Lever
Hello,

On Mon, Mar 23, 2026 at 02:04:57PM -0400, Chuck Lever wrote:
> > I don't see why the cores-per-shard approach wouldn't scale down
> > effectively.
>
> Sharding the UNBOUND pool is fine. But with a fixed cores-per-shard
> ratio of 8, it doesn't scale down to smaller systems.

You aren't making a lot of sense. Contention is primarily a function of
the number of CPUs competing, not the inverse of how many cores are in
the LLC.

> A shard size of 2 clearly won't scale properly to hundreds of cores. A
> varying default cores-per-shard ratio would help scaling in both
> directions, without having to manually tune.

If your workload is bottlenecked on the pool lock on small machines,
the right course of action is either making the offending workqueue
per-cpu or configuring the unbound workqueue for that specific use
case. That's why it's programmatically configurable in the first place.

Thanks.

--
tejun