TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
unbound workqueues using the default affinity (WQ_AFFN_CACHE) collapse
to a single worker pool, causing heavy spinlock (pool->lock) contention.
Create a new affinity (WQ_AFFN_CACHE_SHARD) that caps each pool at
wq_cache_shard_size CPUs (default 8).
Changes from RFC:
* wq_cache_shard_size is in terms of cores (not vCPUs). So,
  wq_cache_shard_size=8 means the pool will have 8 cores and their SMT
  siblings, i.e., 16 threads/CPUs when SMT is enabled (2 threads per core)
* Got more data:
  - AMD EPYC: All means are within ~1 stdev of zero; the deltas are
    indistinguishable from noise, and shard scoping has no measurable
    effect regardless of shard size. This is expected, since the AMD EPYC
    already has 11 L3 domains, so pool->lock contention was not a problem
    to begin with.
  - ARM: A strong, consistent signal. At shard sizes 8 and 16 the mean
    write improvement is ~7% with a relatively tight stdev (~1-2%),
    meaning the gain is real and reproducible across all IO engines.
    Even shard size 4 shows a solid +3.5% with the tightest stdev
    (0.97%).
    Reads: Small shard sizes (2, 4) show a slight regression of
    ~1.3-1.7% (low stdev, so consistent). Larger shard sizes (8, 16)
    flip to a modest +1.4% gain, though shard_size=8 reads have high
    variance (stdev 2.79%), driven by a single outlier that appears to
    be noise.
  - Sweet spot: Shard sizes 8 to 16 offer the best overall profile:
    the highest write gain (+6.95%) with the lowest write stdev (1.18%),
    plus a consistent read gain (+1.42%, stdev 0.70%), and no impact on
    AMD EPYC.
* ARM (NVIDIA Grace - Neoverse V2 - single L3 domain: CPUs 0-71)
┌────────────┬────────────┬─────────────┬───────────┬────────────┐
│ Shard Size │ Write Mean │ Write StDev │ Read Mean │ Read StDev │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 2 │ +0.75% │ 1.32% │ -1.28% │ 0.45% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 4 │ +3.45% │ 0.97% │ -1.73% │ 0.52% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 8 │ +6.72% │ 1.97% │ +1.38% │ 2.79% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 16 │ +6.95% │ 1.18% │ +1.42% │ 0.70% │
└────────────┴────────────┴─────────────┴───────────┴────────────┘
* AMD (EPYC 9D64 88-Core Processor - 11 L3 domains, 8 cores / 16 vCPUs each)
┌────────────┬────────────┬─────────────┬───────────┬────────────┐
│ Shard Size │ Write Mean │ Write StDev │ Read Mean │ Read StDev │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 2 │ +3.22% │ 1.90% │ -0.08% │ 0.72% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 4 │ +0.92% │ 1.59% │ +0.67% │ 2.33% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 8 │ +1.75% │ 1.47% │ -0.42% │ 0.72% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 16 │ +1.22% │ 1.72% │ +0.43% │ 1.32% │
└────────────┴────────────┴─────────────┴───────────┴────────────┘
---
Changes in v2:
- wq_cache_shard_size is in terms of cores (not vCPUs)
- Link to v1: https://patch.msgid.link/20260312-workqueue_sharded-v1-0-2c43a7b861d0@debian.org
---
Breno Leitao (5):
workqueue: fix typo in WQ_AFFN_SMT comment
workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
tools/workqueue: add CACHE_SHARD support to wq_dump.py
workqueue: add test_workqueue benchmark module
include/linux/workqueue.h | 3 +-
kernel/workqueue.c | 110 +++++++++++++++++-
lib/Kconfig.debug | 10 ++
lib/Makefile | 1 +
lib/test_workqueue.c | 277 +++++++++++++++++++++++++++++++++++++++++++++
tools/workqueue/wq_dump.py | 3 +-
6 files changed, 401 insertions(+), 3 deletions(-)
---
base-commit: 1adb306427e971ccac25b19410c9f068b92bd583
change-id: 20260309-workqueue_sharded-2327956e889b
Best regards,
--
Breno Leitao <leitao@debian.org>
On Fri, Mar 20, 2026, at 1:56 PM, Breno Leitao wrote:
> TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
> unbound workqueues using the default affinity (WQ_AFFN_CACHE) collapse
> to a single worker pool, causing heavy spinlock (pool->lock) contention.
> Create a new affinity (WQ_AFFN_CACHE_SHARD) that caps each pool at
> wq_cache_shard_size CPUs (default 8).
>
> Changes from RFC:
>
> * wq_cache_shard_size is in terms of cores (not vCPU). So,
> wq_cache_shard_size=8 means the pool will have 8 cores and their siblings,
> like 16 threads/CPUs if SMT=1

My concern about the "cores per shard" approach is that it
improves the default situation for moderately-sized machines
little or not at all.

A machine with one L3 and 10 cores will go from 1 UNBOUND
pool to only 2. For virtual machines commonly deployed as
cloud instances, which are 2, 4, or 8 core systems (up to
16 threads), there will still be significant contention for
UNBOUND workers.

IOW, if you want good scaling, human intervention (via a
boot command-line option) is still needed.

--
Chuck Lever
Hello Chuck,

On Mon, Mar 23, 2026 at 10:11:07AM -0400, Chuck Lever wrote:
> On Fri, Mar 20, 2026, at 1:56 PM, Breno Leitao wrote:
> > TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
> > unbound workqueues using the default affinity (WQ_AFFN_CACHE) collapse
> > to a single worker pool, causing heavy spinlock (pool->lock) contention.
> > Create a new affinity (WQ_AFFN_CACHE_SHARD) that caps each pool at
> > wq_cache_shard_size CPUs (default 8).
> >
> > Changes from RFC:
> >
> > * wq_cache_shard_size is in terms of cores (not vCPU). So,
> > wq_cache_shard_size=8 means the pool will have 8 cores and their siblings,
> > like 16 threads/CPUs if SMT=1
>
> My concern about the "cores per shard" approach is that it
> improves the default situation for moderately-sized machines
> little or not at all.
>
> A machine with one L3 and 10 cores will go from 1 UNBOUND
> pool to only 2. For virtual machines commonly deployed as
> cloud instances, which are 2, 4, or 8 core systems (up to
> 16 threads) there will still be significant contention for
> UNBOUND workers.

Could you clarify your concern? Are you suggesting the default value of
wq_cache_shard_size=8 is too high, or that the cores-per-shard approach
fundamentally doesn't scale well for moderately-sized systems?

Any approach, whether sharding by cores or by LLC, ultimately relies on
heuristics that may need tuning for specific workloads. The key
difference is where we draw the line. The current default of 8 cores
prevents the worst-case scenario: severe lock contention on large
systems with 16+ CPUs all hammering a single unbound workqueue.

For smaller systems (2-4 CPUs), contention is usually negligible
regardless of the approach. My perf lock contention measurements
consistently show minimal contention in that range.

> IOW, if you want good scaling, human intervention (via a
> boot command-line option) is still needed.

I am not convinced. The wq_cache_shard_size approach creates multiple
pools on large systems while leaving small systems (<8 cores) unchanged.
This eliminates the pathological lock contention we're observing on
high-core-count machines without impacting smaller deployments.

In contrast, splitting pools per LLC would force fragmentation even on
systems that aren't experiencing contention, increasing the need for
manual tuning across a wider range of configurations.

Thanks for the review,
--breno
On 3/23/26 11:10 AM, Breno Leitao wrote:
> Hello Chuck,
>
> On Mon, Mar 23, 2026 at 10:11:07AM -0400, Chuck Lever wrote:
>> On Fri, Mar 20, 2026, at 1:56 PM, Breno Leitao wrote:
>>> TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
>>> unbound workqueues using the default affinity (WQ_AFFN_CACHE) collapse
>>> to a single worker pool, causing heavy spinlock (pool->lock) contention.
>>> Create a new affinity (WQ_AFFN_CACHE_SHARD) that caps each pool at
>>> wq_cache_shard_size CPUs (default 8).
>>>
>>> Changes from RFC:
>>>
>>> * wq_cache_shard_size is in terms of cores (not vCPU). So,
>>> wq_cache_shard_size=8 means the pool will have 8 cores and their siblings,
>>> like 16 threads/CPUs if SMT=1
>>
>> My concern about the "cores per shard" approach is that it
>> improves the default situation for moderately-sized machines
>> little or not at all.
>>
>> A machine with one L3 and 10 cores will go from 1 UNBOUND
>> pool to only 2. For virtual machines commonly deployed as
>> cloud instances, which are 2, 4, or 8 core systems (up to
>> 16 threads) there will still be significant contention for
>> UNBOUND workers.
>
> Could you clarify your concern? Are you suggesting the default value of
> wq_cache_shard_size=8 is too high, or that the cores-per-shard approach
> fundamentally doesn't scale well for moderately-sized systems?
>
> Any approach—whether sharding by cores or by LLC—ultimately relies on
> heuristics that may need tuning for specific workloads. The key difference
> is where we draw the line. The current default of 8 cores prevents the
> worst-case scenario: severe lock contention on large systems with 16+ CPUs
> all hammering a single unbound workqueue.

An 8-core machine with 16 threads can handle quite a bit of I/O, but
with the proposed scheme it will still have a single UNBOUND pool. For
NFS workloads I commonly benchmark, splitting the UNBOUND pool on such
systems is a very clear win.

> For smaller systems (2-4 CPUs), contention is usually negligible
> regardless of the approach. My perf lock contention measurements
> consistently show minimal contention in that range.
>
>> IOW, if you want good scaling, human intervention (via a
>> boot command-line option) is still needed.
>
> I am not convinced. The wq_cache_shard_size approach creates multiple
> pools on large systems while leaving small systems (<8 cores) unchanged.

This is exactly my concern. Smaller systems /do/ experience measurable
contention in this area. I don't object to your series at all, it's
clean and well-motivated; but the cores-per-shard approach doesn't scale
down to very commonly deployed machine sizes.

We might also argue that the NFS client and other subsystems that make
significant use of UNBOUND workqueues in their I/O paths might be well
advised to modify their approach. (net/sunrpc/sched.c, hint hint)

> This eliminates the pathological lock contention we're observing on
> high-core-count machines without impacting smaller deployments.
>
> In contrast, splitting pools per LLC would force fragmentation even on
> systems that aren't experiencing contention, increasing the need for
> manual tuning across a wider range of configurations.

I claim that smaller deployments also need help. Further, I don't see
how UNBOUND pool fragmentation is a problem on such systems that needs
to be addressed (IMHO).

--
Chuck Lever
Hello Chuck,

On Mon, Mar 23, 2026 at 11:28:49AM -0400, Chuck Lever wrote:
> On 3/23/26 11:10 AM, Breno Leitao wrote:
> >
> > I am not convinced. The wq_cache_shard_size approach creates multiple
> > pools on large systems while leaving small systems (<8 cores) unchanged.
>
> This is exactly my concern. Smaller systems /do/ experience measurable
> contention in this area. I don't object to your series at all, it's
> clean and well-motivated; but the cores-per-shard approach doesn't scale
> down to very commonly deployed machine sizes.

I don't see why the cores-per-shard approach wouldn't scale down
effectively.

The sharding mechanism itself is independent of whether we use
cores-per-shard or shards-per-LLC as the allocation strategy, correct?

Regardless of the approach, we retain full control over the granularity
of the shards.

> We might also argue that the NFS client and other subsystems that make
> significant use of UNBOUND workqueues in their I/O paths might be well
> advised to modify their approach. (net/sunrpc/sched.c, hint hint)
>
> > This eliminates the pathological lock contention we're observing on
> > high-core-count machines without impacting smaller deployments.
>
> > In contrast, splitting pools per LLC would force fragmentation even on
> > systems that aren't experiencing contention, increasing the need for
> > manual tuning across a wider range of configurations.
>
> I claim that smaller deployments also need help. Further, I don't see
> how UNBOUND pool fragmentation is a problem on such systems that needs
> to be addressed (IMHO).

Are you suggesting we should reduce the default value to something like
wq_cache_shard_size=2 instead of wq_cache_shard_size=8?

Thanks for the feedback,
--breno
On 3/23/26 12:26 PM, Breno Leitao wrote:
> Hello Chuck,
>
> On Mon, Mar 23, 2026 at 11:28:49AM -0400, Chuck Lever wrote:
>> On 3/23/26 11:10 AM, Breno Leitao wrote:
>>>
>>> I am not convinced. The wq_cache_shard_size approach creates multiple
>>> pools on large systems while leaving small systems (<8 cores) unchanged.
>>
>> This is exactly my concern. Smaller systems /do/ experience measurable
>> contention in this area. I don't object to your series at all, it's
>> clean and well-motivated; but the cores-per-shard approach doesn't scale
>> down to very commonly deployed machine sizes.
>
> I don't see why the cores-per-shard approach wouldn't scale down
> effectively.

Sharding the UNBOUND pool is fine. But with a fixed cores-per-shard
ratio of 8, it doesn't scale down to smaller systems.

> The sharding mechanism itself is independent of whether we use
> cores-per-shard or shards-per-LLC as the allocation strategy, correct?
>
> Regardless of the approach, we retain full control over the granularity
> of the shards.
>
>> We might also argue that the NFS client and other subsystems that make
>> significant use of UNBOUND workqueues in their I/O paths might be well
>> advised to modify their approach. (net/sunrpc/sched.c, hint hint)
>>
>>> This eliminates the pathological lock contention we're observing on
>>> high-core-count machines without impacting smaller deployments.
>>
>>> In contrast, splitting pools per LLC would force fragmentation even on
>>> systems that aren't experiencing contention, increasing the need for
>>> manual tuning across a wider range of configurations.
>>
>> I claim that smaller deployments also need help. Further, I don't see
>> how UNBOUND pool fragmentation is a problem on such systems that needs
>> to be addressed (IMHO).
>
> Are you suggesting we should reduce the default value to something like
> wq_cache_shard_size=2 instead of wq_cache_shard_size=8?

A shard size of 2 clearly won't scale properly to hundreds of cores.
A varying default cores-per-shard ratio would help scaling in both
directions, without having to manually tune.

--
Chuck Lever
Hello,

On Mon, Mar 23, 2026 at 02:04:57PM -0400, Chuck Lever wrote:
> > I don't see why the cores-per-shard approach wouldn't scale down
> > effectively.
>
> Sharding the UNBOUND pool is fine. But with a fixed cores-per-shard
> ratio of 8, it doesn't scale down to smaller systems.

You aren't making a lot of sense. Contention is primarily a function of
the number of CPUs competing, not the inverse of how many cores are in
the LLC.

> A shard size of 2 clearly won't scale properly to hundreds of cores. A
> varying default cores-per-shard ratio would help scaling in both
> directions, without having to manually tune.

If your workload is bottlenecked on the pool lock on small machines,
the right course of action is either making the offending workqueue
per-cpu or configuring the unbound workqueue for that specific use
case. That's why it's programmatically configurable in the first place.

Thanks.

--
tejun