The existing contention_begin/contention_end tracepoints fire on the
waiter side. The lock holder's identity and stack can be captured at
contention_begin time (e.g. perf lock contention --lock-owner), but
this reflects the holder's state when a waiter arrives, not when the
lock is actually released.

This series adds a contended_release tracepoint that fires on the
holder side when a lock with waiters is released. This provides:

- Hold time estimation: when the holder's own acquisition was
  contended, its contention_end (acquisition) and contended_release
  can be correlated to measure how long the lock was held under
  contention.

- The holder's stack at release time, which may differ from what perf
  lock contention --lock-owner captures if the holder does significant
  work between the waiter's arrival and the unlock.

Note: for reader/writer locks, the tracepoint fires for every reader
releasing while a writer is waiting, not only for the last reader.

v3 -> v4:
- Fix spurious events in __percpu_up_read(): guard with
  rcuwait_active(&sem->writer) to avoid tracing during the RCU grace
  period after a writer releases (Sashiko).
- Fix possible use-after-free in semaphore up(): move
  trace_contended_release() inside the sem->lock critical section
  (Sashiko).
- Fix build failure with CONFIG_PARAVIRT_SPINLOCKS=y: introduce
  queued_spin_release() as the arch-overridable unlock primitive, so
  queued_spin_unlock() can be a generic tracing wrapper. Convert x86
  (paravirt) and MIPS overrides (Sashiko).
- Add EXPORT_TRACEPOINT_SYMBOL_GPL(contended_release) for module
  support (Sashiko).
- Split spinning locks patch: factor out queued_spin_release() as a
  separate preparatory commit (Sashiko).
- Make read unlock tracepoint behavior consistent across all
  reader/writer lock types: fire for every reader releasing while a
  writer is waiting (rwsem, rwbase_rt were previously last-reader
  only).
v2 -> v3:
- Added new patch: extend contended_release tracepoint to queued
  spinlocks and queued rwlocks (marked as RFC, requesting feedback).
  This is prompted by Matthew Wilcox's suggestion to try to come up
  with generic instrumentation, instead of instrumenting each
  "special" lock manually. See [1] for the discussion.
- Reworked tracepoint placement to fire before the lock is released
  and before the waiter is woken where possible, for consistency with
  spinning locks where there is no explicit wake (inspired by Usama
  Arif's suggestion).
- Remove unnecessary linux/sched.h include from trace/events/lock.h.

RFC -> v2:
- Add trace_contended_release_enabled() guard before waiter checks
  that exist only for the tracepoint (Steven Rostedt).
- Rename __percpu_up_read_slowpath() to __percpu_up_read() (Peter
  Zijlstra).
- Add extern for __percpu_up_read() (Peter Zijlstra).
- Squashed tracepoint introduction and usage commits (Masami
  Hiramatsu).

v3: https://lore.kernel.org/all/cover.1773858853.git.d@ilvokhin.com/
v2: https://lore.kernel.org/all/cover.1773164180.git.d@ilvokhin.com/
RFC: https://lore.kernel.org/all/cover.1772642407.git.d@ilvokhin.com/

[1]: https://lore.kernel.org/all/aa7G1nD7Rd9F4eBH@casper.infradead.org/

Dmitry Ilvokhin (5):
  tracing/lock: Remove unnecessary linux/sched.h include
  locking/percpu-rwsem: Extract __percpu_up_read()
  locking: Add contended_release tracepoint to sleepable locks
  locking: Factor out queued_spin_release()
  locking: Add contended_release tracepoint to spinning locks

 arch/mips/include/asm/spinlock.h         |  6 +--
 arch/x86/include/asm/paravirt-spinlock.h |  6 +--
 include/asm-generic/qrwlock.h            | 48 ++++++++++++++++++++----
 include/asm-generic/qspinlock.h          | 33 ++++++++++++++--
 include/linux/percpu-rwsem.h             | 15 ++------
 include/trace/events/lock.h              | 18 ++++++++-
 kernel/locking/mutex.c                   |  4 ++
 kernel/locking/percpu-rwsem.c            | 29 ++++++++++++++
 kernel/locking/qrwlock.c                 | 16 ++++++++
 kernel/locking/qspinlock.c               |  8 ++++
 kernel/locking/rtmutex.c                 |  1 +
 kernel/locking/rwbase_rt.c               |  6 +++
 kernel/locking/rwsem.c                   | 10 ++++-
 kernel/locking/semaphore.c               |  4 ++
 14 files changed, 172 insertions(+), 32 deletions(-)

-- 
2.52.0
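The semaphore fix in the v3 -> v4 changelog (firing trace_contended_release() inside the sem->lock critical section) can be illustrated with a hedged, self-contained sketch. This is not the kernel code: the struct fields, the tracing stubs, and the wake step are simplified stand-ins for struct semaphore, the static-key guard, and __up().

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for the kernel's struct semaphore. */
struct semaphore {
	unsigned int count;
	int waiters; /* stand-in for a non-empty wait_list */
};

/* Stubs modeling the static-key guard and the tracepoint itself. */
static bool trace_contended_release_enabled(void) { return true; }
static void trace_contended_release(struct semaphore *sem)
{
	printf("contended_release: sem=%p\n", (void *)sem);
}

void up(struct semaphore *sem)
{
	/* raw_spin_lock_irqsave(&sem->lock, flags) in the real code. */
	if (sem->waiters == 0) {
		sem->count++;
	} else {
		/*
		 * Fire while sem->lock is still held: once a waiter is
		 * woken it may free the semaphore, so dereferencing sem
		 * after the unlock would be a use-after-free.
		 */
		if (trace_contended_release_enabled())
			trace_contended_release(sem);
		/* __up(sem) would wake the first waiter here. */
		sem->waiters--;
	}
	/* raw_spin_unlock_irqrestore(&sem->lock, flags) in the real code. */
}
```

The guard mirrors the RFC -> v2 change: the waiter check and the trace call exist only for the tracepoint, so they are skipped entirely when the tracepoint is disabled.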
On Thu, 26 Mar 2026 15:09:59 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:

> The existing contention_begin/contention_end tracepoints fire on the
> waiter side. The lock holder's identity and stack can be captured at
> contention_begin time (e.g. perf lock contention --lock-owner), but
> this reflects the holder's state when a waiter arrives, not when the
> lock is actually released.
>
> This series adds a contended_release tracepoint that fires on the
> holder side when a lock with waiters is released. This provides:
>
> - Hold time estimation: when the holder's own acquisition was
>   contended, its contention_end (acquisition) and contended_release
>   can be correlated to measure how long the lock was held under
>   contention.
>
> - The holder's stack at release time, which may differ from what perf
>   lock contention --lock-owner captures if the holder does significant
>   work between the waiter's arrival and the unlock.
>
> Note: for reader/writer locks, the tracepoint fires for every reader
> releasing while a writer is waiting, not only for the last reader.

Would it be better to reorder the patches? It would help with git
bisectability as well. Move the refactoring work in patches 4 and 5
(excluding adding the tracepoints, of course) earlier, and then add all
the tracepoints in the same commit at the end? It would also help in the
future with git blame to see where all the tracepoints were added.
On Tue, Mar 31, 2026 at 03:27:03AM -0700, Usama Arif wrote:
> On Thu, 26 Mar 2026 15:09:59 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
>
> > The existing contention_begin/contention_end tracepoints fire on the
> > waiter side. The lock holder's identity and stack can be captured at
> > contention_begin time (e.g. perf lock contention --lock-owner), but
> > this reflects the holder's state when a waiter arrives, not when the
> > lock is actually released.
> >
> > This series adds a contended_release tracepoint that fires on the
> > holder side when a lock with waiters is released. This provides:
> >
> > - Hold time estimation: when the holder's own acquisition was
> >   contended, its contention_end (acquisition) and contended_release
> >   can be correlated to measure how long the lock was held under
> >   contention.
> >
> > - The holder's stack at release time, which may differ from what perf
> >   lock contention --lock-owner captures if the holder does significant
> >   work between the waiter's arrival and the unlock.
> >
> > Note: for reader/writer locks, the tracepoint fires for every reader
> > releasing while a writer is waiting, not only for the last reader.
>
> Would it be better to reorder the patches? It would help with git
> bisectability as well. Move the refactoring work in patches 4 and 5
> (excluding adding the tracepoints, of course) earlier, and then add
> all the tracepoints in the same commit at the end? It would also help
> in the future with git blame to see where all the tracepoints were
> added.

Thanks for the suggestion, Usama. I'd prefer to keep the current
ordering: each refactoring commit is immediately followed by the commit
that uses it. For example, queued_spin_release() is factored out right
before the commit that adds the tracepoint to spinning locks. This
makes the motivation for the refactoring clear and should also ease
review, since the context is still fresh.

Bisectability should be fine as-is: each commit compiles and works
independently, since the refactoring patches do not introduce
behavioral changes on their own.
On Thu, Mar 26, 2026 at 03:09:59PM +0000, Dmitry Ilvokhin wrote:
> The existing contention_begin/contention_end tracepoints fire on the
> waiter side. The lock holder's identity and stack can be captured at
> contention_begin time (e.g. perf lock contention --lock-owner), but
> this reflects the holder's state when a waiter arrives, not when the
> lock is actually released.
>
> This series adds a contended_release tracepoint that fires on the
> holder side when a lock with waiters is released. This provides:
>
> - Hold time estimation: when the holder's own acquisition was
>   contended, its contention_end (acquisition) and contended_release
>   can be correlated to measure how long the lock was held under
>   contention.
>
> - The holder's stack at release time, which may differ from what perf
>   lock contention --lock-owner captures if the holder does significant
>   work between the waiter's arrival and the unlock.

As someone who's not an expert in this area (so please use short words
to explain it to me), why do we want to know how long this holder took
to release the lock from when it became contended?

I understand why we want to know how long any given waiter had to wait
to gain the lock (but we already have tracepoints which show that).

I also don't understand why we want to know the holder's stack at
release time. The stack at contention-begin time will include the point
at which the lock was acquired, which should be correlated with where
the lock was released.

Perhaps examples might help me understand why we want this?
On Thu, Mar 26, 2026 at 03:55:21PM +0000, Matthew Wilcox wrote:
> On Thu, Mar 26, 2026 at 03:09:59PM +0000, Dmitry Ilvokhin wrote:
> > The existing contention_begin/contention_end tracepoints fire on the
> > waiter side. The lock holder's identity and stack can be captured at
> > contention_begin time (e.g. perf lock contention --lock-owner), but
> > this reflects the holder's state when a waiter arrives, not when the
> > lock is actually released.
> >
> > This series adds a contended_release tracepoint that fires on the
> > holder side when a lock with waiters is released. This provides:
> >
> > - Hold time estimation: when the holder's own acquisition was
> >   contended, its contention_end (acquisition) and contended_release
> >   can be correlated to measure how long the lock was held under
> >   contention.
> >
> > - The holder's stack at release time, which may differ from what
> >   perf lock contention --lock-owner captures if the holder does
> >   significant work between the waiter's arrival and the unlock.
>
> As someone who's not an expert in this area (so please use short words
> to explain it to me), why do we want to know how long this holder took
> to release the lock from when it became contended?
>
> I understand why we want to know how long any given waiter had to wait
> to gain the lock (but we already have tracepoints which show that).

I think the simplest way to think about it is the following: waiter
time is the symptom, while holder time is the cause.

The waiter-side contention_begin/contention_end tells us how long a
waiter waited, but that time can span multiple holders. If a waiter
waited 10 ms, we cannot tell whether one holder held the lock for 10 ms
or five holders held it for 2 ms each. These need different treatments:
the first means shrink the critical section, the second means reduce
lock frequency or split the lock. Today we cannot distinguish between
these cases from waiter-side data alone.

> I also don't understand why we want to know the holder's stack at
> release time. The stack at contention-begin time will include the
> point at which the lock was acquired, which should be correlated with
> where the lock was released.
>
> Perhaps examples might help me understand why we want this?

The holder's stack lets us understand exactly who waiters were waiting
for to release the lock. The stack at contention_begin time does not
always include the holder's stack: the --lock-owner feature works by
reading the owner field from the lock struct, but it only supports
mutex and rwsem. For spinlocks, queued rwlocks, semaphores, and several
others, the waiter has no visibility into the holder whatsoever.

contended_release fires in the holder's context, so we get the holder's
stack at release time. For spinlocks, this is the only way to get any
holder-side information.

The original motivation was zone lock contention (a spinlock) in Meta
production workloads. We could see waiters were blocked, but had no way
to identify the holders or what they were doing.
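The correlation described above reduces to simple arithmetic over trace timestamps. A hedged, self-contained illustration (not kernel or perf code; the record layout and helper name are made up for the example): a task's own contention_end marks the moment it acquired the lock after waiting, and its contended_release on the same lock marks the release, so the difference is the hold time under contention.

```c
#include <assert.h>

/* Hypothetical trace record: timestamp, contended lock address, task. */
struct lock_event {
	unsigned long long ts_ns;
	unsigned long lock_addr;
	int pid;
};

/*
 * Hold time under contention for one acquisition: the span from the
 * holder's own contention_end (it finished waiting and now owns the
 * lock) to its contended_release on the same lock. Returns 0 when the
 * two events cannot belong to the same acquisition.
 */
static unsigned long long hold_time_ns(const struct lock_event *acq,
				       const struct lock_event *rel)
{
	if (acq->lock_addr != rel->lock_addr || acq->pid != rel->pid ||
	    rel->ts_ns < acq->ts_ns)
		return 0;
	return rel->ts_ns - acq->ts_ns;
}
```

With such records, one 10 ms hold and five 2 ms holds produce the same total waiter delay but very different per-holder hold times, which is exactly the distinction waiter-side data alone cannot make.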
On Thu, 26 Mar 2026 15:55:21 +0000 Matthew Wilcox <willy@infradead.org> wrote:

> > - The holder's stack at release time, which may differ from what
> >   perf lock contention --lock-owner captures if the holder does
> >   significant work between the waiter's arrival and the unlock.
>
> As someone who's not an expert in this area (so please use short words
> to explain it to me), why do we want to know how long this holder took
> to release the lock from when it became contended?
>
> I understand why we want to know how long any given waiter had to wait
> to gain the lock (but we already have tracepoints which show that).
>
> I also don't understand why we want to know the holder's stack at
> release time. The stack at contention-begin time will include the
> point at which the lock was acquired, which should be correlated with
> where the lock was released.
>
> Perhaps examples might help me understand why we want this?

Dmitry could give his own rationale for this, but I have my own use
case: finding out how long the critical section is. If a lock is highly
contended by many tasks, you could get a high contention time simply
because other tasks are causing the delay for the waiter. Seeing the
release time and location would also let you know how long the critical
section was held, and whether the length of the critical section is
causing the contention.

Having a stack trace of the release would differentiate the path that
released the lock, as there can be many places that release them.
Although, I have to admit, I'm not sure there are many different places
locks are released, especially now that we have guard(), which will
make all the releases in a function happen at the same location.

-- Steve