[PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
Posted by Breno Leitao 3 weeks, 4 days ago
TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
unbound workqueues using the default affinity (WQ_AFFN_CACHE) collapse
to a single worker pool, causing heavy spinlock (pool->lock) contention.
Create a new affinity (WQ_AFFN_CACHE_SHARD) that caps each pool at
wq_cache_shard_size CPUs (default 8).

Problem
=======

Some modern systems have many CPUs sharing one LLC. Here are some examples I have
access to:

 * NVIDIA Grace CPU: 72 real CPUs per LLC
 * Intel(R) Xeon(R) Gold 6450C: 59 SMT threads per LLC
 * Intel(R) Xeon(R) Platinum 8321HC: 51 SMT threads per LLC

On these systems, the default unbound workqueue affinity (WQ_AFFN_CACHE)
results in just a single worker pool for the whole system whenever all
CPUs share the same LLC, as on the machines above.

This causes contention on pool->lock, potentially affecting IO
performance (btrfs, writeback, etc.).

When profiling an IO-intensive usercache workload at Meta, I found
significant contention in __queue_work(), making pool->lock one of the
top 5 contended locks.

Additionally, Chuck Lever recently reported this problem:

	"For example, on a 12-core system with a single shared L3 cache running
	NFS over RDMA with 12 fio jobs, perf shows approximately 39% of CPU
	cycles spent in native_queued_spin_lock_slowpath, nearly all from
	__queue_work() contending on the single pool lock.

	On such systems WQ_AFFN_CACHE, WQ_AFFN_SMT, and WQ_AFFN_NUMA
	scopes all collapse to a single pod."

Link: https://lore.kernel.org/all/20260203143744.16578-1-cel@kernel.org/

Solution
========

Tejun suggested solving this by introducing an intermediate affinity
level (aka cache_shard) that shards WQ_AFFN_CACHE using a heuristic, so
these affinity scopes no longer collapse into a single pod.

This series creates that intermediate sharded cache affinity and uses it
as the default.
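As a rough user-space model of the heuristic (the helper name `shard_llc`
and the Python form are illustrative only; the kernel implementation
operates on cpumasks):

```python
# Illustrative model only: split one LLC's CPU list into pods of at most
# shard_size CPUs, mirroring what wq_cache_shard_size=8 would do.
def shard_llc(llc_cpus, shard_size=8):
    return [llc_cpus[i:i + shard_size]
            for i in range(0, len(llc_cpus), shard_size)]

# A 72-CPU LLC (e.g. NVIDIA Grace) becomes nine 8-CPU pods:
pods = shard_llc(list(range(72)))
print(len(pods), pods[0])  # 9 [0, 1, 2, 3, 4, 5, 6, 7]
```

An LLC with 8 or fewer CPUs yields a single shard, keeping behavior
identical to plain WQ_AFFN_CACHE.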

Micro benchmark
===============

To measure its benefit, I created a microbenchmark (part of this series)
that enqueues work items (queue_work()) in a loop and reports throughput
and latency.
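The in-kernel module does its timing around queue_work(); a user-space
Python analogue of the measurement loop (names and structure are mine,
not the module's) looks roughly like:

```python
# User-space analogue of the microbenchmark: time each enqueue call,
# then report throughput and latency percentiles. The no-op callable
# stands in for queue_work(), which only exists in-kernel.
import time

def bench(enqueue, n_items=50_000):
    lat = []
    t0 = time.perf_counter_ns()
    for _ in range(n_items):
        start = time.perf_counter_ns()
        enqueue()  # in the real module: queue_work(wq, &work)
        lat.append(time.perf_counter_ns() - start)
    total_s = (time.perf_counter_ns() - t0) / 1e9
    lat.sort()
    pct = lambda p: lat[int(p / 100 * (len(lat) - 1))]
    return {"items_per_sec": n_items / total_s,
            "p50_ns": pct(50), "p90_ns": pct(90), "p95_ns": pct(95)}

stats = bench(lambda: None, n_items=10_000)
```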

  Benchmark on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread):

    cpu          3248519 items/sec p50=10944    p90=11488    p95=11648 ns
    smt          3362119 items/sec p50=10945    p90=11520    p95=11712 ns
    cache_shard  3629098 items/sec p50=6080     p90=8896     p95=9728 ns (NEW) **
    cache        708168 items/sec  p50=44000    p90=47104    p95=47904 ns
    numa         710559 items/sec  p50=44096    p90=47265    p95=48064 ns
    system       718370 items/sec  p50=43104    p90=46432    p95=47264 ns

The same benchmark on the Intel 8321HC:

    cpu          2831751 items/sec p50=3909     p90=9222     p95=11580 ns
    smt          2810699 items/sec p50=2229     p90=4928     p95=5979 ns
    cache_shard  1861028 items/sec p50=4874     p90=8423     p95=9415 ns (NEW)
    cache        591001 items/sec p50=24901     p90=29865    p95=31169 ns
    numa         590431 items/sec p50=24901     p90=29819    p95=31133 ns
    system       591912 items/sec p50=25049     p90=29916    p95=31219 ns

(** It is still unclear why cache_shard is "better" than SMT on
Grace/ARM. The result is consistently reproducible, though; still
investigating.)

Block benchmark
===============

Host: Intel(R) Xeon(R) D-2191A CPU @ 1.60GHz (16 Cores - 32 SMT)

To stress the workqueue, I run fio on a dm-crypt device.

  1) Create a plain dm-crypt device on top of NVMe
   * cryptsetup creates an encrypted block device (/dev/mapper/crypt_nvme) on top
     of a raw NVMe drive. All I/O to this device goes through kcryptd — dm-crypt's
     workqueue that handles AES encryption/decryption of every data block.

   # cryptsetup open --type plain -c aes-xts-plain64 -s 256 /dev/nvme0n1 crypt_nvme -d -

  2) Run fio
   * fio hammers the encrypted device with 36 threads (one per CPU), each doing
     128-deep 4K _buffered_ I/O for 10 seconds. This generates massive workqueue
     pressure — every I/O completion triggers a kcryptd work item to encrypt or
     decrypt data.

   # fio --filename=/dev/mapper/crypt_nvme \
         --ioengine=io_uring --direct=0 \
         --bs=4k --iodepth=128 \
         --numjobs=$(nproc) --runtime=10 \
         --time_based --group_reporting

Running this for ~3 hours:

  ┌────────────┬────────────────────────┬────────────────────────┬───────────┬────────┬─────────────────┐
  │ Workload   │       Avg cache        │    Avg cache_shard     │ Avg delta │ Stddev │  2-sigma range  │
  ├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
  │ randread   │ 389 MiB/s (99.6k IOPS) │ 413 MiB/s (106k IOPS)  │ +5.9%     │ 3.3%   │ -0.7% to +12.5% │
  ├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
  │ randwrite  │ 622 MiB/s (159k IOPS)  │ 614 MiB/s (157k IOPS)  │ -1.3%     │ 0.9%   │ -3.1% to +0.5%  │
  ├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
  │ randrw     │ 240 MiB/s (61.4k IOPS) │ 250 MiB/s (64.1k IOPS) │ +4.3%     │ 3.4%   │ -2.5% to +11.1% │
  └────────────┴────────────────────────┴────────────────────────┴───────────┴────────┴─────────────────┘

Same results for buffered IO:

  ┌───────────┬────────────────────────┬────────────────────────┬───────────┬────────┬────────────────┐
  │ Workload  │       Avg cache        │    Avg cache_shard     │ Avg delta │ Stddev │ 2-sigma range  │
  ├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
  │ randread  │ 559 MiB/s (143k IOPS)  │ 577 MiB/s (148k IOPS)  │ +3.1%     │ 1.3%   │ +0.5% to +5.7% │
  ├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
  │ randwrite │ 437 MiB/s (112k IOPS)  │ 431 MiB/s (110k IOPS)  │ -1.5%     │ 1.0%   │ -3.5% to +0.5% │
  ├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
  │ randrw    │ 272 MiB/s (69.7k IOPS) │ 273 MiB/s (69.8k IOPS) │ +0.1%     │ 1.5%   │ -2.9% to +3.1% │
  └───────────┴────────────────────────┴────────────────────────┴───────────┴────────┴────────────────┘

(The randwrite delta appears to be within noise.)

Patchset organization
=====================

This series adds a new WQ_AFFN_CACHE_SHARD affinity scope that
subdivides each LLC into groups of at most wq_cache_shard_size CPUs
(default 8, tunable via boot parameter), providing an intermediate
option between per-LLC and per-SMT-core granularity.

Beyond the new scope itself, this patchset prepares the code for the
cache_shard affinity and adds a stress test for workqueues.

It then makes the new sharded cache affinity the default scope.

On systems with 8 or fewer CPUs per LLC, CACHE_SHARD produces a single
shard covering the entire LLC, making it functionally identical to the
previous CACHE default. The sharding only activates when an LLC has more
than 8 CPUs.

---
Breno Leitao (5):
      workqueue: fix parse_affn_scope() prefix matching bug
      workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
      workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
      workqueue: add test_workqueue benchmark module
      tools/workqueue: add CACHE_SHARD support to wq_dump.py

 include/linux/workqueue.h  |   1 +
 kernel/workqueue.c         |  72 ++++++++++--
 lib/Kconfig.debug          |  10 ++
 lib/Makefile               |   1 +
 lib/test_workqueue.c       | 275 +++++++++++++++++++++++++++++++++++++++++++++
 tools/workqueue/wq_dump.py |   3 +-
 6 files changed, 352 insertions(+), 10 deletions(-)
---
base-commit: b29fb8829bff243512bb8c8908fd39406f9fd4c3
change-id: 20260309-workqueue_sharded-2327956e889b

Best regards,
--  
Breno Leitao <leitao@debian.org>

Re: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
Posted by Tejun Heo 3 weeks, 3 days ago
Hello,

Applied 1/5. Some comments on the rest:

- The sharding currently splits on CPU boundary, which can split SMT
  siblings across different pods. The worse performance on Intel compared
  to SMT scope may be indicating exactly this - HT siblings ending up in
  different pods. It'd be better to shard on core boundary so that SMT
  siblings always stay together.

- How was the default shard size of 8 picked? There's a tradeoff between
  the number of kworkers created and locality. Can you also report the
  number of kworkers for each configuration? And is there data on
  different shard sizes? It'd be useful to see how the numbers change
  across e.g. 4, 8, 16, 32.

- Can you also test on AMD machines? Their CCD topology (16 or 32
  threads per LLC) would be a good data point.

Thanks.

--
tejun
Re: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
Posted by Breno Leitao 3 weeks ago
Hello Tejun,

On Fri, Mar 13, 2026 at 07:57:20AM -1000, Tejun Heo wrote:
> Hello,
>
> Applied 1/5. Some comments on the rest:
>
> - The sharding currently splits on CPU boundary, which can split SMT
>   siblings across different pods. The worse performance on Intel compared
>   to SMT scope may be indicating exactly this - HT siblings ending up in
>   different pods. It'd be better to shard on core boundary so that SMT
>   siblings always stay together.

Thank you for the insight. I'll modify the sharding to operate at the
core boundary rather than at the SMT/thread level to ensure sibling CPUs
remain in the same pod.
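As a sketch of what core-boundary sharding could look like (a user-space
model under my own naming; the real change would walk the kernel's SMT
sibling masks, e.g. topology_sibling_cpumask()):

```python
# Illustrative model: pack whole cores (SMT sibling groups) into shards
# of at most shard_size CPUs, so siblings never land in different pods.
def shard_by_core(sibling_groups, shard_size=8):
    shards, cur = [], []
    for core in sibling_groups:
        # start a new shard if adding this whole core would overflow it
        if cur and len(cur) + len(core) > shard_size:
            shards.append(cur)
            cur = []
        cur.extend(core)
    if cur:
        shards.append(cur)
    return shards

# 4 cores x 2 SMT threads, shard_size=4 -> two shards of two whole cores:
cores = [[0, 8], [1, 9], [2, 10], [3, 11]]
print(shard_by_core(cores, shard_size=4))  # [[0, 8, 1, 9], [2, 10, 3, 11]]
```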

> - How was the default shard size of 8 picked? There's a tradeoff
> between the number of kworkers created and locality. Can you also
> report the number of kworkers for each configuration? And is there
> data on different shard sizes? It'd be useful to see how the numbers
> change across e.g. 4, 8, 16, 32.

The choice of 8 as the default shard size was somewhat arbitrary – it was
selected primarily to generate initial data points.

I'll run tests with different shard sizes and report the results.

I'm currently working on finding a suitable workload with minimal noise.
Testing on real NVMe devices shows significant jitter that makes analysis
difficult. I've also been experimenting with nullblk, but haven't had much
success yet.

If you have any suggestions for a reliable workload or benchmark, I'd
appreciate your input.

> - Can you also test on AMD machines? Their CCD topology (16 or 32
> threads per LLC) would be a good data point.

Absolutely, I'll test on AMD machines as well.

Thanks,
--breno
Re: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
Posted by Chuck Lever 2 weeks, 6 days ago
On 3/17/26 7:32 AM, Breno Leitao wrote:
> Hello Tejun,
> 
> On Fri, Mar 13, 2026 at 07:57:20AM -1000, Tejun Heo wrote:
>> Hello,
>>
>> Applied 1/5. Some comments on the rest:
>>
>> - The sharding currently splits on CPU boundary, which can split SMT
>>   siblings across different pods. The worse performance on Intel compared
>>   to SMT scope may be indicating exactly this - HT siblings ending up in
>>   different pods. It'd be better to shard on core boundary so that SMT
>>   siblings always stay together.
> 
> Thank you for the insight. I'll modify the sharding to operate at the
> core boundary rather than at the SMT/thread level to ensure sibling CPUs
> remain in the same pod.
> 
>> - How was the default shard size of 8 picked? There's a tradeoff
>> between the number of kworkers created and locality. Can you also
>> report the number of kworkers for each configuration? And is there
>> data on different shard sizes? It'd be useful to see how the numbers
>> change across e.g. 4, 8, 16, 32.
> 
> The choice of 8 as the default shard size was somewhat arbitrary – it was
> selected primarily to generate initial data points.

Perhaps instead of basing the sharding on a particular number of CPUs
per shard, why not cap the total number of shards? IIUC that is the main
concern about ballooning the number of kworker threads.


> I'll run tests with different shard sizes and report the results.
> 
> I'm currently working on finding a suitable workload with minimal noise.
> Testing on real NVMe devices shows significant jitter that makes analysis
> difficult. I've also been experimenting with nullblk, but haven't had much
> success yet.
> 
> If you have any suggestions for a reliable workload or benchmark, I'd
> appreciate your input.
> 
>> - Can you also test on AMD machines? Their CCD topology (16 or 32
>> threads per LLC) would be a good data point.
> 
> Absolutely, I'll test on AMD machines as well.
> 
> Thanks,
> --breno


-- 
Chuck Lever
Re: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
Posted by Breno Leitao 2 weeks, 5 days ago
On Tue, Mar 17, 2026 at 09:58:54AM -0400, Chuck Lever wrote:
> On 3/17/26 7:32 AM, Breno Leitao wrote:
> >> - How was the default shard size of 8 picked? There's a tradeoff
> >> between the number of kworkers created and locality. Can you also
> >> report the number of kworkers for each configuration? And is there
> >> data on different shard sizes? It'd be useful to see how the numbers
> >> change across e.g. 4, 8, 16, 32.
> >
> > The choice of 8 as the default shard size was somewhat arbitrary – it was
> > selected primarily to generate initial data points.
>
> Perhaps instead of basing the sharding on a particular number of CPUs
> per shard, why not cap the total number of shards? IIUC that is the main
> concern about ballooning the number of kworker threads.

That's a great suggestion. I'll send a v2 that implements this approach,
where the parameter specifies the number of shards rather than the number
of CPUs per shard.
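For reference, under this alternative the per-shard CPU count would be
derived from the cap rather than set directly; a trivial model (the
function name is mine):

```python
# Illustrative only: if the knob caps the number of shards per LLC,
# each shard ends up with roughly llc_ncpus / nr_shards CPUs.
import math

def cpus_per_shard(llc_ncpus, nr_shards):
    return math.ceil(llc_ncpus / nr_shards)

# A 72-CPU LLC capped at 8 shards -> 9 CPUs per shard; a 16-CPU LLC
# under the same cap -> 2 CPUs per shard:
print(cpus_per_shard(72, 8), cpus_per_shard(16, 8))  # 9 2
```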

Thanks for the feedback,
--breno
Re: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
Posted by Tejun Heo 2 weeks, 5 days ago
On Wed, Mar 18, 2026 at 10:51:15AM -0700, Breno Leitao wrote:
> On Tue, Mar 17, 2026 at 09:58:54AM -0400, Chuck Lever wrote:
> > On 3/17/26 7:32 AM, Breno Leitao wrote:
> > >> - How was the default shard size of 8 picked? There's a tradeoff
> > >> between the number of kworkers created and locality. Can you also
> > >> report the number of kworkers for each configuration? And is there
> > >> data on different shard sizes? It'd be useful to see how the numbers
> > >> change across e.g. 4, 8, 16, 32.
> > >
> > > The choice of 8 as the default shard size was somewhat arbitrary – it was
> > > selected primarily to generate initial data points.
> >
> > Perhaps instead of basing the sharding on a particular number of CPUs
> > per shard, why not cap the total number of shards? IIUC that is the main
> > concern about ballooning the number of kworker threads.
> 
> That's a great suggestion. I'll send a v2 that implements this approach,
> where the parameter specifies the number of shards rather than the number
> of CPUs per shard.

Would it make sense, though? It feels really odd to define the maximum number
of shards when contention is primarily a function of the number of CPUs
banging on the same pool lock. Why would 32-CPU and 512-CPU systems have the
same number of shards?

Thanks.

-- 
tejun
Re: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
Posted by Breno Leitao 2 weeks, 4 days ago
On Wed, Mar 18, 2026 at 01:00:07PM -1000, Tejun Heo wrote:
> On Wed, Mar 18, 2026 at 10:51:15AM -0700, Breno Leitao wrote:
> > On Tue, Mar 17, 2026 at 09:58:54AM -0400, Chuck Lever wrote:
> > > On 3/17/26 7:32 AM, Breno Leitao wrote:
> > > >> - How was the default shard size of 8 picked? There's a tradeoff
> > > >> between the number of kworkers created and locality. Can you also
> > > >> report the number of kworkers for each configuration? And is there
> > > >> data on different shard sizes? It'd be useful to see how the numbers
> > > >> change across e.g. 4, 8, 16, 32.
> > > >
> > > > The choice of 8 as the default shard size was somewhat arbitrary – it was
> > > > selected primarily to generate initial data points.
> > >
> > > Perhaps instead of basing the sharding on a particular number of CPUs
> > > per shard, why not cap the total number of shards? IIUC that is the main
> > > concern about ballooning the number of kworker threads.
> >
> > That's a great suggestion. I'll send a v2 that implements this approach,
> > where the parameter specifies the number of shards rather than the number
> > of CPUs per shard.
>
> Would it make sense, though? It feels really odd to define the maximum number
> of shards when contention is primarily a function of the number of CPUs
> banging on the same pool lock. Why would 32-CPU and 512-CPU systems have the
> same number of shards?

The trade-off is that specifying the maximum number of shards makes it
clearer how many times the LLC is being sharded, which might be easier
to reason about, but it will have less impact on contention scaling as
you reported above.

I've collected some numbers with sharding per LLC, and I will switch
back to the original approach to gather comparison data.

Current change:
https://github.com/leitao/linux/commit/bedaf9ebe9594320976dcbf0cb507ecf083097c0


Workload:
========

I've finally found a workload that exercises the workqueue sufficiently,
which allows me to obtain stable benchmark results.

This is what I am doing:

  - Sets up a local loopback NFS environment backed by an 8 GB tmpfs
    (/tmp/nfsexport → /mnt/nfs)
  - Iterates over six fio I/O engines: sync, psync, vsync, pvsync, pvsync2,
    libaio
  - For each engine, runs a 200-job, 512-byte block size fio benchmark (writes
    then reads)
  - Tests each workload under both cache and cache_shard workqueue affinity
    scopes via /sys/module/workqueue/parameters/default_affinity_scope
  - Prints a summary table with aggregate bandwidth (MB) per scope and the
    percentage delta to show whether cache_shard helps or hurts
  - Restores the affinity scope back to cache when done
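The scope-switching step boils down to writing the sysfs knob mentioned
above; a minimal sketch (needs root and a kernel with this series
applied; the helper names are mine, the path comes from the description
above):

```python
# Toggle the default workqueue affinity scope via sysfs. Writing
# requires root; reading works for any user.
SCOPE = "/sys/module/workqueue/parameters/default_affinity_scope"

def set_scope(scope, path=SCOPE):
    with open(path, "w") as f:
        f.write(scope)  # e.g. "cache" or "cache_shard"

def get_scope(path=SCOPE):
    with open(path) as f:
        return f.read().strip()
```

The `path` parameter exists so the helpers can be exercised against a
plain file without touching sysfs.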

The test I am running can be found at
https://github.com/leitao/debug/blob/main/workqueue_performance/test_affinity.sh

Hosts:
======

 * ARM (NVIDIA Grace - Neoverse V2 - single L3 domain: CPUs 0-71)
	# cat /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort -u
	0-71

 * AMD (EPYC 9D64 88-Core Processor - 11 L3 domains, 8 cores / 16 vCPUs each)
	# cat /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort -u
	0-7,88-95
	16-23,104-111
	24-31,112-119
	32-39,120-127
	40-47,128-135
	48-55,136-143
	56-63,144-151
	64-71,152-159
	72-79,160-167
	80-87,168-175
	8-15,96-103


Results
=======


TL;DR:

 * ARM (single L3, 72 CPUs): cache_shard consistently improves write
   throughput by +6 to +12% across all shard counts (2-32), with
   the peak at 2 shards. Read impact is minimal (noise-level).
   Shard=1 confirms no effect as expected.

 * AMD (11 L3 domains, 16 CPUs each): cache_shard shows no meaningful
   benefit at 1-4 shards (all within noise/stddev). At 8 shards it
   regresses by ~4% for both reads and writes, likely due to loss of
   data locality when sharding already-small 16-CPU cache domains
   further.

Benchmark data:
===============

ARM:

  ┌────────┬───────────────────┬──────────────┬──────────────────┬─────────────┐
  │ Shards │ Write Delta (avg) │ Write stddev │ Read Delta (avg) │ Read stddev │
  ├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
  │      1 │             -0.2% │        ±1.0% │            +1.2% │       ±1.7% │
  ├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
  │      2 │            +12.5% │        ±1.3% │            -0.3% │       ±0.9% │
  ├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
  │      4 │             +8.7% │        ±0.9% │            +1.8% │       ±1.5% │
  ├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
  │      8 │            +11.4% │        ±1.8% │            +3.1% │       ±1.5% │
  ├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
  │     16 │             +7.8% │        ±1.3% │            +1.6% │       ±1.0% │
  ├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
  │     32 │             +6.1% │        ±0.6% │            +0.3% │       ±1.5% │
  └────────┴───────────────────┴──────────────┴──────────────────┴─────────────┘

AMD:

  ┌────────┬───────────────────┬──────────────┬──────────────────┬─────────────┐
  │ Shards │ Write Delta (avg) │ Write stddev │ Read Delta (avg) │ Read stddev │
  ├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
  │      1 │             -0.2% │        ±1.2% │            +0.1% │       ±1.0% │
  ├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
  │      2 │             +0.7% │        ±1.4% │            +0.5% │       ±1.1% │
  ├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
  │      4 │             +0.8% │        ±1.1% │            +1.3% │       ±1.2% │
  ├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
  │      8 │             -4.0% │        ±1.3% │            -4.5% │       ±0.9% │
  └────────┴───────────────────┴──────────────┴──────────────────┴─────────────┘


Microbenchmark result
=====================

I've run the micro-benchmark from this patchset as well; here is the
results comparison:

 * AMD (11 L3 domains, 16 CPUs each): cache_shard delivers +45-55%
   throughput and 36-44% lower latency at 2-8 shards. The sweet spot is
   4 shards (+55%, p50 cut nearly in half). Shard=1 confirms no effect.

   Even though the AMD host already has multiple L3 domains, each 16-CPU
   still has enough contention to benefit from further splitting (for
   the sake of this microbenchmark/stress test)

 * ARM (single L3, 72 CPUs): The gains are dramatic — 2x at 2 shards,
   3.2x at 4 shards, and 4.4x at 8 shards. At 8 shards, cache_shard
   (3.2M items/s) nearly matches cpu scope performance (3.7M), with p50
   latency dropping from 43.5 us to 6.9 us.

   The single monolithic L3 makes the cache scope degenerate into a
   single contended pool, so sharding has a massive effect.


AMD

  ┌────────┬─────────────────┬───────────────────────┬─────────────────┬───────────┬───────────┬───────────────────┐
  │ Shards │ cache (items/s) │ cache_shard (items/s) │ Throughput gain │ cache p50 │ shard p50 │ Latency reduction │
  ├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
  │      1 │       2,660,103 │             2,667,740 │           +0.3% │   27.5 us │   27.5 us │                0% │
  ├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
  │      2 │       2,619,884 │             3,788,454 │          +44.6% │   28.0 us │   17.8 us │              -36% │
  ├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
  │      4 │       2,506,185 │             3,891,064 │          +55.3% │   29.3 us │   16.5 us │              -44% │
  ├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
  │      8 │       2,628,321 │             4,015,312 │          +52.8% │   27.9 us │   16.4 us │              -41% │
  └────────┴─────────────────┴───────────────────────┴─────────────────┴───────────┴───────────┴───────────────────┘

  Reference scopes (stable across shard counts): cpu ~6.2M items/s, smt ~4.0M, numa/system ~422K.

ARM

  ┌────────┬─────────────────┬───────────────────────┬─────────────────┬───────────┬───────────┬───────────────────┐
  │ Shards │ cache (items/s) │ cache_shard (items/s) │ Throughput gain │ cache p50 │ shard p50 │ Latency reduction │
  ├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
  │      2 │         725,999 │             1,516,967 │           +109% │   43.8 us │   19.6 us │              -55% │
  ├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
  │      4 │         729,615 │             2,347,335 │           +222% │   43.6 us │   11.0 us │              -75% │
  ├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
  │      8 │         731,517 │             3,230,168 │           +342% │   43.5 us │    6.9 us │              -84% │
  └────────┴─────────────────┴───────────────────────┴─────────────────┴───────────┴───────────┴───────────────────┘


Next steps:
  * Revert the code to sharding by CPU count (instead of by shard count) and
    report the results again.
  * Are there any other tests that would help?