[PATCH v1 0/7] perf bench: Add qspinlock benchmark

Yuzhuo Jing posted 7 patches 2 months, 1 week ago
tools/arch/x86/include/asm/atomic.h           |  14 +
tools/arch/x86/include/asm/cmpxchg.h          | 113 +++++
tools/include/asm-generic/atomic-gcc.h        |  47 ++
tools/include/asm/barrier.h                   |  58 +++
tools/include/linux/atomic.h                  |  27 ++
tools/include/linux/compiler_types.h          |  30 ++
tools/include/linux/percpu-simulate.h         | 128 ++++++
tools/include/linux/prefetch.h                |  41 ++
tools/perf/bench/Build                        |   2 +
tools/perf/bench/bench.h                      |   1 +
.../perf/bench/include/mcs_spinlock-private.h | 115 +++++
tools/perf/bench/include/mcs_spinlock.h       |  19 +
tools/perf/bench/include/qspinlock-private.h  | 204 +++++++++
tools/perf/bench/include/qspinlock.h          | 153 +++++++
tools/perf/bench/include/qspinlock_types.h    |  98 +++++
tools/perf/bench/qspinlock.c                  | 411 ++++++++++++++++++
tools/perf/bench/sync.c                       | 329 ++++++++++++++
tools/perf/builtin-bench.c                    |   7 +
tools/perf/check-headers.sh                   |  32 ++
19 files changed, 1829 insertions(+)
create mode 100644 tools/include/linux/percpu-simulate.h
create mode 100644 tools/include/linux/prefetch.h
create mode 100644 tools/perf/bench/include/mcs_spinlock-private.h
create mode 100644 tools/perf/bench/include/mcs_spinlock.h
create mode 100644 tools/perf/bench/include/qspinlock-private.h
create mode 100644 tools/perf/bench/include/qspinlock.h
create mode 100644 tools/perf/bench/include/qspinlock_types.h
create mode 100644 tools/perf/bench/qspinlock.c
create mode 100644 tools/perf/bench/sync.c
[PATCH v1 0/7] perf bench: Add qspinlock benchmark
Posted by Yuzhuo Jing 2 months, 1 week ago
As an effort to improve the perf bench subcommand, this patch series
adds a benchmark for the kernel's queued spinlock implementation.

This series imports necessary kernel definitions such as atomics,
introduces a userspace per-cpu adapter, and imports the qspinlock
implementation from the kernel tree into the tools tree, with minimal
adaptations.

This subcommand enables convenient commands to investigate the
performance of kernel lock implementations, such as using sampling:

    perf record -- ./perf bench sync qspinlock -t5
    perf report

Yuzhuo Jing (7):
  tools: Import cmpxchg and xchg functions
  tools: Import smp_cond_load and atomic_cond_read
  tools: Partial import of prefetch.h
  tools: Implement userspace per-cpu
  perf bench: Import qspinlock from kernel
  perf bench: Add 'bench sync qspinlock' subcommand
  perf bench sync: Add latency histogram functionality

-- 
2.50.1.487.gc89ff58d15-goog
Re: [PATCH v1 0/7] perf bench: Add qspinlock benchmark
Posted by Namhyung Kim 2 months ago
Hello,

On Mon, Jul 28, 2025 at 07:26:33PM -0700, Yuzhuo Jing wrote:
> As an effort to improve the perf bench subcommand, this patch series
> adds a benchmark for the kernel's queued spinlock implementation.
> 
> This series imports necessary kernel definitions such as atomics,
> introduces a userspace per-cpu adapter, and imports the qspinlock
> implementation from the kernel tree into the tools tree, with minimal
> adaptations.

But I'm curious how you handled the differences between kernel and user
space. For example, kernel spinlocks normally imply no preemption, but we
cannot guarantee that in userspace.

> 
> This subcommand enables convenient commands to investigate the
> performance of kernel lock implementations, such as using sampling:
> 
>     perf record -- ./perf bench sync qspinlock -t5
>     perf report

It'd be nice if you could share some example output from the change.

Thanks,
Namhyung

Re: [PATCH v1 0/7] perf bench: Add qspinlock benchmark
Posted by Mark Rutland 2 months ago
On Mon, Jul 28, 2025 at 07:26:33PM -0700, Yuzhuo Jing wrote:
> As an effort to improve the perf bench subcommand, this patch series
> adds a benchmark for the kernel's queued spinlock implementation.
> 
> This series imports necessary kernel definitions such as atomics,
> introduces a userspace per-cpu adapter, and imports the qspinlock
> implementation from the kernel tree into the tools tree, with minimal
> adaptations.

Who is this intended to be useful for, and when would they use this?

This doesn't serve as a benchmark of the host kernel, since it tests
whatever stale copy of the qspinlock code was built into the perf
binary.

I can understand that being able to test the code in userspace may be
helpful when making some changes, but why does this need to be built
into the perf tool?

Mark.

Re: [PATCH v1 0/7] perf bench: Add qspinlock benchmark
Posted by Peter Zijlstra 2 weeks, 5 days ago
On Mon, Aug 04, 2025 at 03:28:12PM +0100, Mark Rutland wrote:
> On Mon, Jul 28, 2025 at 07:26:33PM -0700, Yuzhuo Jing wrote:
> > As an effort to improve the perf bench subcommand, this patch series
> > adds a benchmark for the kernel's queued spinlock implementation.
> > 
> > This series imports necessary kernel definitions such as atomics,
> > introduces a userspace per-cpu adapter, and imports the qspinlock
> > implementation from the kernel tree into the tools tree, with minimal
> > adaptations.
> 
> Who is this intended to be useful for, and when would they use this?
> 
> This doesn't serve as a benchmark of the host kernel, since it tests
> whatever stale copy of the qspinlock code was built into the perf
> binary.
> 
> I can understand that being able to test the code in userspace may be
> helpful when making some changes, but why does this need to be built
> into the perf tool?

Right, I think most of us already have a userspace version of it. I have
a thingy that has TAS, TICKET and QSPINLOCK wrapped in a perf self
monitor that I can run on various x86_64 to see how it behaves.

IIRC it also has a pile of 'raw' atomic ops to see the contention
behaviour. This shows that e.g. XADD is *waay* nicer than a CMPXCHG loop
when heavily contended.
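
To make the XADD versus CMPXCHG-loop comparison concrete, here is a minimal userspace sketch using C11 atomics. It is not Peter's tool and not part of the posted series; `run_contended`, `inc_fetch_add`, `inc_cas_loop` and the other names are hypothetical. The instruction selection mentioned in the comments is an assumption about typical x86-64 compilers, which may also emit LOCK ADD when the fetched value is unused.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS 64

static atomic_ulong counter;

struct worker_arg {
	unsigned long iters;	/* increments per thread */
	int use_cas;		/* 0: fetch-add, 1: compare-and-swap loop */
};

/* One atomic read-modify-write; typically LOCK XADD or LOCK ADD on x86-64. */
static void inc_fetch_add(void)
{
	atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}

/* Compare-and-swap retry loop; typically LOCK CMPXCHG on x86-64. */
static void inc_cas_loop(void)
{
	unsigned long old = atomic_load_explicit(&counter, memory_order_relaxed);

	/* On failure 'old' is reloaded with the current value, so just retry. */
	while (!atomic_compare_exchange_weak_explicit(&counter, &old, old + 1,
						      memory_order_relaxed,
						      memory_order_relaxed))
		;
}

static void *worker(void *p)
{
	const struct worker_arg *arg = p;

	for (unsigned long i = 0; i < arg->iters; i++) {
		if (arg->use_cas)
			inc_cas_loop();
		else
			inc_fetch_add();
	}
	return NULL;
}

/* Runs nthreads workers on one shared counter; returns its final value, or 0 on error. */
static unsigned long run_contended(int nthreads, unsigned long iters, int use_cas)
{
	pthread_t tids[MAX_THREADS];
	struct worker_arg arg = { .iters = iters, .use_cas = use_cas };
	int started = 0;

	if (nthreads < 1 || nthreads > MAX_THREADS)
		return 0;

	atomic_store(&counter, 0);
	for (; started < nthreads; started++) {
		if (pthread_create(&tids[started], NULL, worker, &arg))
			break;
	}
	for (int i = 0; i < started; i++)
		pthread_join(tids[i], NULL);

	return started == nthreads ? atomic_load(&counter) : 0;
}
```

Both variants produce the same final count; the difference is that a failed compare-and-swap has to retry, while the fetch-add always completes in one operation. To compare them, one could call `run_contended()` once per variant and time each call, or run each variant under `perf stat`. The sketch itself makes no claim about the resulting numbers.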

Anyway, that lives as a random tar file on a random machine in my house,
I'm not sure it makes much sense to stick that in perf as such. Rather
specific.
Re: [PATCH v1 0/7] perf bench: Add qspinlock benchmark
Posted by Ian Rogers 2 weeks, 5 days ago
On Tue, Sep 16, 2025 at 7:18 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Aug 04, 2025 at 03:28:12PM +0100, Mark Rutland wrote:
> > On Mon, Jul 28, 2025 at 07:26:33PM -0700, Yuzhuo Jing wrote:
> > > As an effort to improve the perf bench subcommand, this patch series
> > > adds a benchmark for the kernel's queued spinlock implementation.
> > >
> > > This series imports necessary kernel definitions such as atomics,
> > > introduces a userspace per-cpu adapter, and imports the qspinlock
> > > implementation from the kernel tree into the tools tree, with minimal
> > > adaptations.
> >
> > Who is this intended to be useful for, and when would they use this?
> >
> > This doesn't serve as a benchmark of the host kernel, since it tests
> > whatever stale copy of the qspinlock code was built into the perf
> > binary.
> >
> > I can understand that being able to test the code in userspace may be
> > helpful when making some changes, but why does this need to be built
> > into the perf tool?
>
> Right, I think most of us already have a userspace version of it. I have
> a thingy that has TAS, TICKET and QSPINLOCK wrapped in a perf self
> monitor that I can run on various x86_64 to see how it behaves.
>
> IIRC it also has a pile of 'raw' atomic ops to see the contention
> behaviour. This shows that e.g. XADD is *waay* nicer than a CMPXCHG loop
> when heavily contended.
>
> Anyway, that lives as a random tar file on a random machine in my house,
> I'm not sure it makes much sense to stick that in perf as such. Rather
> specific.

The intent was that the benchmark wouldn't carry stale copies of files:
they would be kept in sync with the kernel, the same way we keep other
files in perf in sync with those in the kernel.

The inspiration for adding a benchmark this way comes from the
existing perf bench memcpy benchmark. The reason to care is that, as
with memcpy, there are subtle effects from things like RISC-V's
non-temporal atomics (ARM near-far atomics) and the size of CPU cores.
In general, queued spinlock is preferred in the kernel, but a benchmark
of queued spinlock and ticket spinlock may reveal that ticket spinlock
would be a better default for certain configurations.
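
For readers unfamiliar with the ticket spinlock mentioned here, a minimal userspace sketch in C11 atomics follows. It is neither the kernel's generic ticket lock nor code from this series; `ticket_lock`, `ticket_lock_acquire`, `ticket_lock_release` and `run_ticket_demo` are hypothetical names. The `sched_yield()` in the wait loop is an assumption made for userspace, where the lock holder can be preempted; kernel spinlocks do not need it.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS 64

struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently allowed in */
};

static struct ticket_lock demo_lock;	/* zero-initialized: unlocked */
static unsigned long demo_counter;	/* protected by demo_lock */

static void ticket_lock_acquire(struct ticket_lock *l)
{
	/* Take a ticket with a single fetch-add; waiters are served in FIFO order. */
	unsigned int me = atomic_fetch_add_explicit(&l->next, 1,
						    memory_order_relaxed);

	while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
		sched_yield();	/* userspace only: let a preempted holder run */
}

static void ticket_lock_release(struct ticket_lock *l)
{
	/* Only the holder writes 'owner', so a plain load plus store suffices. */
	unsigned int cur = atomic_load_explicit(&l->owner, memory_order_relaxed);

	atomic_store_explicit(&l->owner, cur + 1, memory_order_release);
}

static void *ticket_worker(void *p)
{
	unsigned long iters = *(const unsigned long *)p;

	for (unsigned long i = 0; i < iters; i++) {
		ticket_lock_acquire(&demo_lock);
		demo_counter++;		/* plain increment, guarded by the lock */
		ticket_lock_release(&demo_lock);
	}
	return NULL;
}

/* Returns the final counter value, or 0 on error. */
static unsigned long run_ticket_demo(int nthreads, unsigned long iters)
{
	pthread_t tids[MAX_THREADS];
	int started = 0;

	if (nthreads < 1 || nthreads > MAX_THREADS)
		return 0;

	demo_counter = 0;
	for (; started < nthreads; started++) {
		if (pthread_create(&tids[started], NULL, ticket_worker, &iters))
			break;
	}
	for (int i = 0; i < started; i++)
		pthread_join(tids[i], NULL);

	return started == nthreads ? demo_counter : 0;
}
```

All waiters in this sketch spin on the same `owner` word. A queued spinlock instead gives each waiter its own queue node to spin on, which is the trade-off a ticket versus qspinlock benchmark would be measuring.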

Does it make sense to have this in perf? It makes it easier to tune
the implementations, keep code in sync with the kernel, etc. Does it
make sense for perf to have a memcpy benchmark? Maybe not, in these days
of a more reliable rep movsb. Anyway, in general the bar to
getting things into perf bench hasn't been hugely high and I don't see
disagreement that on some occasions a benchmark like this is useful.
As someone who cares about this kind of performance tuning, I care
about having the benchmark.

Thanks,
Ian
Re: [PATCH v1 0/7] perf bench: Add qspinlock benchmark
Posted by Peter Zijlstra 2 weeks, 4 days ago
On Tue, Sep 16, 2025 at 10:00:13AM -0700, Ian Rogers wrote:

> The inspiration for adding a benchmark this way comes from the
> existing perf bench memcpy benchmark. The reason to care is that, as
> with memcpy, there are subtle effects from things like RISC-V's
> non-temporal atomics (ARM near-far atomics) and the size of CPU cores.

But the patch as proposed was very much x86 only. No RISC-V or ARM64
support.

> In general, queued spinlock is preferred in the kernel, but a benchmark
> of queued spinlock and ticket spinlock may reveal that ticket spinlock
> would be a better default for certain configurations.

And didn't do ticket, even though we have a generic implementation in
the kernel too (IIRC I have a few patches for that as well.. someday I
might have time).

And yeah, ticket is very good and hard to beat for 'smaller' systems.

There is a reason for commit: a8ad07e5240c :-)

> Does it make sense to have this in perf? It makes it easier to tune
> the implementations, keep code in sync with the kernel, etc. Does it
> make sense for perf to have a memcpy benchmark? Maybe not, in these days
> of a more reliable rep movsb. Anyway, in general the bar to
> getting things into perf bench hasn't been hugely high and I don't see
> disagreement that on some occasions a benchmark like this is useful.
> As someone who cares about this kind of performance tuning, I care
> about having the benchmark.

Yeah, not convinced we should stuff all that in perf. But also, the
benchmark doesn't actually seem to do what you say you wanted, so meh.