Hello,
This series adds a new subcommand 'contention' ('con' for short) to perf lock.
The new subcommand handles the new lock:contention_{begin,end}
tracepoints and shows the lock type and caller address, like below:
$ perf lock contention
contended total wait max wait avg wait type caller
238 1.41 ms 29.20 us 5.94 us spinlock update_blocked_averages+0x4c
1 902.08 us 902.08 us 902.08 us rwsem:R do_user_addr_fault+0x1dd
81 330.30 us 17.24 us 4.08 us spinlock _nohz_idle_balance+0x172
2 89.54 us 61.26 us 44.77 us spinlock do_anonymous_page+0x16d
24 78.36 us 12.27 us 3.27 us mutex pipe_read+0x56
2 71.58 us 59.56 us 35.79 us spinlock __handle_mm_fault+0x6aa
6 25.68 us 6.89 us 4.28 us spinlock do_idle+0x28d
1 18.46 us 18.46 us 18.46 us rtmutex exec_fw_cmd+0x21b
3 15.25 us 6.26 us 5.08 us spinlock tick_do_update_jiffies64+0x2c
...
where rwsem:R stands for read access (down_read) on a rw-semaphore.
The other lock access types are self-explanatory. Optimistic spinning
on mutexes is not detected yet. This is just base work for lock
contention analysis; more will come later.
You can get this from 'perf/lock-subcmd-v1' branch on
git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
Thanks,
Namhyung
Namhyung Kim (6):
perf lock: Fix a copy-n-paste bug
perf lock: Add flags field in the lock_stat
perf lock: Add lock aggregation enum
perf lock: Add 'contention' subcommand
perf lock: Add -k and -F options to 'contention' subcommand
perf lock: Support -t option for 'contention' subcommand
tools/perf/Documentation/perf-lock.txt | 23 +-
tools/perf/builtin-lock.c | 403 ++++++++++++++++++++++---
2 files changed, 391 insertions(+), 35 deletions(-)
base-commit: 41d0914d861e82b4eeec16dd0f6109c311cf4c7e
--
2.37.0.170.g444d1eabd0-goog
* Namhyung Kim <namhyung@kernel.org> wrote:
> [...]
Wouldn't it also be useful to display a lock contention percentage value,
the ratio of fastpath vs. contended/wait events?
That's usually the first-approximation metric to see how contended
different locks are, and the average wait time quantifies it.
Thanks,
Ingo
Hi Ingo,
On Wed, Aug 3, 2022 at 2:51 AM Ingo Molnar <mingo@kernel.org> wrote:
>
> [...]
>
> Wouldn't it also be useful to display a lock contention percentage value,
> the ratio of fastpath vs. contended/wait events?
>
> That's usually the first-approximation metric to see how contended
> different locks are, and the average wait time quantifies it.
Yeah, that'd be nice to have. But it requires some action in the fast path
which I don't want because I'd like to use this in production. So these
new tracepoints were added only in the slow path.
Instead, I think we can display the ratio of (total) contended time vs.
wall clock time. What do you think?
Thanks,
Namhyung
* Namhyung Kim <namhyung@kernel.org> wrote:
> [...]
>
> Yeah, that'd be nice to have. But it requires some action in the fast
> path which I don't want because I'd like to use this in production. So
> these new tracepoints were added only in the slow path.
Yeah. Might make sense to re-measure the impact of possibly doing that
though: most of the locking fast path is out of line already and could be
instrumented, with only a handful of inlined primitives -
CONFIG_UNINLINE_SPIN_UNLOCK in particular.
How many additional inlined NOP sequences does this add in a defconfig
kernel? How much is the bloat, and would it be acceptable for production
kernels?
The other question is to keep tracing overhead low in production systems.
For that we'd have to implement some concept of 'sampling tracepoints',
which generate only one event for every 128 fast path invocations or so,
but stay out of the way & don't slow down the system otherwise.
OTOH frequently used locking fastpaths are measured via regular PMU
sampling based profiling already.
> Instead, I think we can display the ratio of (total) contended time vs.
> wall clock time. What do you think?
That looks useful too - but also the time spent waiting/spinning in a
thread vs. the time spent actually running and executing real stuff.
That ratio could easily get over 100%, for wait-dominated workloads - so
ordering by that ratio would highlight the tasks that make the least amount
of real progress. Measuring the ratio based only on wall clock time would
hide this aspect.
Thanks,
Ingo
On Thu, Aug 4, 2022 at 1:37 AM Ingo Molnar <mingo@kernel.org> wrote:
>
> [...]
> >
> > Yeah, that'd be nice to have. But it requires some action in the fast
> > path which I don't want because I'd like to use this in production. So
> > these new tracepoints were added only in the slow path.
>
> Yeah. Might make sense to re-measure the impact of possibly doing that
> though: most of the locking fast path is out of line already and could be
> instrumented, with only a handful of inlined primitives -
> CONFIG_UNINLINE_SPIN_UNLOCK in particular.
I'll take a look and measure overhead with tracepoints in the fast path.
>
> How many additional inlined NOP sequences does this add in a defconfig
> kernel? How much is the bloat, and would it be acceptable for production
> kernels?
Last time I measured it, the tracepoints added around 1.3KB of code, but
they are all in the slow path. So it'd be OK for production.
>
> The other question is to keep tracing overhead low in production systems.
>
> For that we'd have to implement some concept of 'sampling tracepoints',
> which generate only one event for every 128 fast path invocations or so,
> but stay out of the way & don't slow down the system otherwise.
Yeah, it looks like we can do it with perf tools using a custom sample
period like the `-c 128` option. But I'm not sure how it could keep the
slow path in sync; IOW the fast path and the slow path would fire separately.
>
> OTOH frequently used locking fastpaths are measured via regular PMU
> sampling based profiling already.
Sure. Also note that the recent off-cpu profiling (perf record --off-cpu)
would be helpful to see the impact of the sleeping locks.
>
> > Instead, I think we can display the ratio of (total) contended time vs.
> > wall clock time. What do you think?
>
> That looks useful too - but also the time spent waiting/spinning in a
> thread vs. the time spent actually running and executing real stuff.
>
> That ratio could easily get over 100%, for wait-dominated workloads - so
> ordering by that ratio would highlight the tasks that make the least amount
> of real progress. Measuring the ratio based only on wall clock time would
> hide this aspect.
Are you talking about per-thread stats? It has the -t/--threads option, but
that doesn't work with this BPF program yet. Will work on it.
By default it shows per-lock (well.. actually per-callstack) contention stats.
We use caller symbols because there are no symbols for locks in general.
It'd be nice if we could merge the stats for the same lock (class) from
different call stacks.
Thanks,
Namhyung