Zone lock contention can significantly impact allocation and
reclaim latency, as it is a central synchronization point in
the page allocator and reclaim paths. Improved visibility into
its behavior is therefore important for diagnosing performance
issues in memory-intensive workloads.
On some production workloads at Meta, we have observed noticeable
zone lock contention. Deeper analysis of lock holders and waiters
is currently difficult with existing instrumentation.
While generic lock contention_begin/contention_end tracepoints
cover the slow path, they do not provide sufficient visibility
into lock hold times. In particular, the lack of a release-side
event makes it difficult to identify long lock holders and
correlate them with waiters. As a result, distinguishing between
short bursts of contention and pathological long hold times
requires additional instrumentation.
This patch series adds dedicated tracepoint instrumentation to
zone lock, following the existing mmap_lock tracing model.
The goal is to enable detailed holder/waiter analysis and lock
hold time measurements without affecting the fast path when
tracing is disabled.
The series is structured as follows:
1. Introduce zone lock wrappers.
2. Mechanically convert zone lock users to the wrappers.
3. Convert compaction to use the wrappers (requires minor
restructuring of compact_lock_irqsave()).
4. Rename zone->lock to zone->_lock.
5. Add zone lock tracepoints.
The tracepoints are added via lightweight inline helpers in the
wrappers. When tracing is disabled, the fast path remains
unchanged.
Changes in v4:
- Fix build errors in kernel/power/snapshot.c and mm/shuffle.c.
- Remove EXPORT_SYMBOL().
- Rename header zone_lock.h to mmzone_lock.h.
- Properly indent __acquires() annotation.
Changes in v3:
- Split compact_lock_irqsave() into compact_zone_lock_irqsave() and
compact_lruvec_lock_irqsave().
- Rename zone->lock to zone->_lock.
Changes in v2:
- Move mechanical changes from mm/compaction.c to a different commit.
- Remove compact_do_zone_trylock() and compact_do_raw_trylock_irqsave().
v1: https://lore.kernel.org/all/cover.1770821420.git.d@ilvokhin.com/
v2: https://lore.kernel.org/all/cover.1772030186.git.d@ilvokhin.com/
v3: https://lore.kernel.org/all/cover.1772129168.git.d@ilvokhin.com/
Dmitry Ilvokhin (5):
mm: introduce zone lock wrappers
mm: convert zone lock users to wrappers
mm: convert compaction to zone lock wrappers
mm: rename zone->lock to zone->_lock
mm: add tracepoints for zone lock
MAINTAINERS | 2 +
include/linux/mmzone.h | 7 ++-
include/linux/mmzone_lock.h | 100 +++++++++++++++++++++++++++++++
include/trace/events/zone_lock.h | 64 ++++++++++++++++++++
kernel/power/snapshot.c | 5 +-
mm/Makefile | 2 +-
mm/compaction.c | 58 +++++++++++-------
mm/internal.h | 2 +-
mm/memory_hotplug.c | 9 +--
mm/mm_init.c | 3 +-
mm/page_alloc.c | 89 +++++++++++++--------------
mm/page_isolation.c | 23 +++----
mm/page_owner.c | 2 +-
mm/page_reporting.c | 13 ++--
mm/show_mem.c | 5 +-
mm/shuffle.c | 9 +--
mm/vmscan.c | 5 +-
mm/vmstat.c | 9 +--
mm/zone_lock.c | 28 +++++++++
19 files changed, 330 insertions(+), 105 deletions(-)
create mode 100644 include/linux/mmzone_lock.h
create mode 100644 include/trace/events/zone_lock.h
create mode 100644 mm/zone_lock.c
--
2.47.3
On Fri, Feb 27, 2026 at 04:00:22PM +0000, Dmitry Ilvokhin wrote:
> Zone lock contention can significantly impact allocation and
> reclaim latency, as it is a central synchronization point in
> the page allocator and reclaim paths. Improved visibility into
> its behavior is therefore important for diagnosing performance
> issues in memory-intensive workloads.
>
> On some production workloads at Meta, we have observed noticeable
> zone lock contention. Deeper analysis of lock holders and waiters
> is currently difficult with existing instrumentation.
>
> While generic lock contention_begin/contention_end tracepoints
> cover the slow path, they do not provide sufficient visibility
> into lock hold times. In particular, the lack of a release-side
> event makes it difficult to identify long lock holders and
> correlate them with waiters. As a result, distinguishing between
> short bursts of contention and pathological long hold times
> requires additional instrumentation.
>
> This patch series adds dedicated tracepoint instrumentation to
> zone lock, following the existing mmap_lock tracing model.

I don't like this at all.  We have CONFIG_LOCK_STAT.  That should be
improved instead of coming up with one-offs for every single lock
that someone deems "special".
On Mon, Mar 09, 2026 at 01:10:46PM +0000, Matthew Wilcox wrote:
> On Fri, Feb 27, 2026 at 04:00:22PM +0000, Dmitry Ilvokhin wrote:
> > Zone lock contention can significantly impact allocation and
> > reclaim latency, as it is a central synchronization point in
> > the page allocator and reclaim paths. Improved visibility into
> > its behavior is therefore important for diagnosing performance
> > issues in memory-intensive workloads.
> >
> > On some production workloads at Meta, we have observed noticeable
> > zone lock contention. Deeper analysis of lock holders and waiters
> > is currently difficult with existing instrumentation.
> >
> > While generic lock contention_begin/contention_end tracepoints
> > cover the slow path, they do not provide sufficient visibility
> > into lock hold times. In particular, the lack of a release-side
> > event makes it difficult to identify long lock holders and
> > correlate them with waiters. As a result, distinguishing between
> > short bursts of contention and pathological long hold times
> > requires additional instrumentation.
> >
> > This patch series adds dedicated tracepoint instrumentation to
> > zone lock, following the existing mmap_lock tracing model.
>
> I don't like this at all.  We have CONFIG_LOCK_STAT.  That should be
> improved instead of coming up with one-offs for every single lock
> that someone deems "special".

Thanks for the feedback, Matthew.

CONFIG_LOCK_STAT provides useful statistics, but it is primarily a
debug facility and is generally too heavyweight for production
environments.

The motivation for this series was to provide lightweight observability
for the zone lock in production workloads.

I agree that improving generic lock instrumentation would be
preferable. I did consider whether something similar could be done
generically for spinlocks, but the unlock path there is typically just
a single atomic store, so adding generic lightweight instrumentation
without affecting the fast path is difficult.

In parallel, I've been experimenting with improving observability for
sleepable locks by adding a contended_release tracepoint, which would
allow correlating lock holders and waiters in a more generic way. I've
posted an RFC here:

https://lore.kernel.org/all/cover.1772642407.git.d@ilvokhin.com/

I'd appreciate feedback on whether that direction makes sense for
improving the generic lock tracing infrastructure.
On Mon, Mar 09, 2026 at 02:21:42PM +0000, Dmitry Ilvokhin wrote:
> On Mon, Mar 09, 2026 at 01:10:46PM +0000, Matthew Wilcox wrote:
> > On Fri, Feb 27, 2026 at 04:00:22PM +0000, Dmitry Ilvokhin wrote:
> > > This patch series adds dedicated tracepoint instrumentation to
> > > zone lock, following the existing mmap_lock tracing model.
> >
> > I don't like this at all.  We have CONFIG_LOCK_STAT.  That should be
> > improved instead of coming up with one-offs for every single lock
> > that someone deems "special".
>
> Thanks for the feedback, Matthew.
>
> CONFIG_LOCK_STAT provides useful statistics, but it is primarily a
> debug facility and is generally too heavyweight for production
> environments.

Yes, agreed.  I think that is what needs to change.

> The motivation for this series was to provide lightweight observability
> for the zone lock in production workloads.

I read that.  But first it was the mmap lock.  Now it's the zone lock.
Which lock will be next?  This is too heavyweight a procedure to follow
for each lock of interest.

> I agree that improving generic lock instrumentation would be
> preferable. I did consider whether something similar could be done
> generically for spinlocks, but the unlock path there is typically just
> a single atomic store, so adding generic lightweight instrumentation
> without affecting the fast path is difficult.

This is why we have tracepoint_enabled() and friends.  But ...
LOCK_STAT doesn't affect the unlock path at all.  It only changes the
acquire side to call lock_acquired() (and lock_contended() if the
trylock failed).

> In parallel, I've been experimenting with improving observability for
> sleepable locks by adding a contended_release tracepoint, which would
> allow correlating lock holders and waiters in a more generic way. I've
> posted an RFC here:
>
> https://lore.kernel.org/all/cover.1772642407.git.d@ilvokhin.com/
>
> I'd appreciate feedback on whether that direction makes sense for
> improving the generic lock tracing infrastructure.

It seems fine to me, but I don't have the depth in the locking code to
give it the thorough review it deserves.
On Mon, 9 Mar 2026 14:47:29 +0000 Matthew Wilcox <willy@infradead.org>
wrote:

> > CONFIG_LOCK_STAT provides useful statistics, but it is primarily a
> > debug facility and is generally too heavyweight for production
> > environments.
>
> Yes, agreed.  I think that is what needs to change.

The biggest issue with making a generic lightweight LOCK_STAT is that
locks are extremely optimized. Any addition of generic lock encoding
will cause a noticeable overhead when compiled in, even when disabled.

Most of the lock code is inlined for the fast path. Now if we want to
add lock stats, we would need to add hooks into those inlined paths.
This will undoubtedly increase the size of text, which will have an
impact on I$.

The other issue is the data we store for the lock. A lock is usually
just a word (or long) in size, embedded in a structure. LOCKDEP and
LOCK_STAT add a key per lock. This increases the data size of the
kernel.

LOCK_STAT was designed on top of LOCKDEP. Perhaps a lighter version of
LOCK_STAT could be designed without LOCKDEP, but it would be a project
on its own.

-- Steve
On Mon, Mar 09, 2026 at 03:13:17PM -0400, Steven Rostedt wrote:
> The biggest issue with making a generic light weight LOCK_STAT is that
> locks are extremely optimized. Any addition of generic lock encoding will
> cause a noticeable overhead when compiled in, even when disabled.
I'm not sure that's true. Taking the current Debian kernel config
leads to a "call" instruction to acquire a spinlock:
void __insert_inode_hash(struct inode *inode, unsigned long hashval)
{
struct hlist_head *b = inode_hashtable + hash(inode->i_sb, hashval);
spin_lock(&inode_hash_lock);
spin_lock(&inode->i_lock);
hlist_add_head_rcu(&inode->i_hash, b);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_hash_lock);
}
compiles to:
[...]
280: 23 35 00 00 00 00 and 0x0(%rip),%esi # 286 <__insert_inode_hash+0x56>
282: R_X86_64_PC32 .data..ro_after_init+0x10
286: 48 8d 2c f0 lea (%rax,%rsi,8),%rbp
28a: e8 00 00 00 00 call 28f <__insert_inode_hash+0x5f>
28b: R_X86_64_PLT32 _raw_spin_lock-0x4
28f: 4c 89 e7 mov %r12,%rdi
292: e8 00 00 00 00 call 297 <__insert_inode_hash+0x67>
293: R_X86_64_PLT32 _raw_spin_lock-0x4
[...]
Debian doesn't do anything too weird here:
#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_LOCK_DEBUGGING_SUPPORT=y
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
# CONFIG_DEBUG_RWSEMS is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
# CONFIG_SCF_TORTURE_TEST is not set
# CONFIG_CSD_LOCK_WAIT_DEBUG is not set
(The spinlock code is too complex for me to follow what config options
influence whether it's a function call; you probably have enough of it
in your head that you'd know)
> The other issue is the data we store for the lock. A lock is usually just a
> word (or long) in size, embedded in a structure. LOCKDEP and LOCK_STAT adds
> a key per lock. This increases the data size of the kernel.
It does, but perhaps for a light weight lockstat, we could do better
than that. For example it could use the return address to look up
which lock is being accessed rather than embedding a key in each lock.
On Mon, 9 Mar 2026 20:45:31 +0000
Matthew Wilcox <willy@infradead.org> wrote:
> On Mon, Mar 09, 2026 at 03:13:17PM -0400, Steven Rostedt wrote:
> > The biggest issue with making a generic lightweight LOCK_STAT is that
> > locks are extremely optimized. Any addition of generic lock encoding will
> > cause a noticeable overhead when compiled in, even when disabled.
>
> I'm not sure that's true. Taking the current Debian kernel config
> leads to a "call" instruction to acquire a spinlock:
>
> void __insert_inode_hash(struct inode *inode, unsigned long hashval)
> {
> struct hlist_head *b = inode_hashtable + hash(inode->i_sb, hashval);
>
> spin_lock(&inode_hash_lock);
> spin_lock(&inode->i_lock);
> hlist_add_head_rcu(&inode->i_hash, b);
> spin_unlock(&inode->i_lock);
> spin_unlock(&inode_hash_lock);
> }
>
> compiles to:
>
> [...]
> 280: 23 35 00 00 00 00 and 0x0(%rip),%esi # 286 <__insert_inode_hash+0x56>
> 282: R_X86_64_PC32 .data..ro_after_init+0x10
> 286: 48 8d 2c f0 lea (%rax,%rsi,8),%rbp
> 28a: e8 00 00 00 00 call 28f <__insert_inode_hash+0x5f>
> 28b: R_X86_64_PLT32 _raw_spin_lock-0x4
> 28f: 4c 89 e7 mov %r12,%rdi
> 292: e8 00 00 00 00 call 297 <__insert_inode_hash+0x67>
> 293: R_X86_64_PLT32 _raw_spin_lock-0x4
> [...]
Ah, you're correct. Looks like it's an arch specific thing. I was going
back to my memory from around 2006, but it appears that only a few archs
inline spinlocks anymore. Thomas made it a bit easier to see what does and
does not do that (in 2009).
6beb000923882 ("locking: Make inlining decision Kconfig based")
So, perhaps adding code to the spinlocks will not be as much of a hit on I$.
> (The spinlock code is too complex for me to follow what config options
> influence whether it's a function call; you probably have enough of it
> in your head that you'd know)
Yeah, I feel like I'm always relearning the code every time I have to jump
in and understand it again.
>
> > The other issue is the data we store for the lock. A lock is usually just a
> > word (or long) in size, embedded in a structure. LOCKDEP and LOCK_STAT adds
> > a key per lock. This increases the data size of the kernel.
>
> It does, but perhaps for a light weight lockstat, we could do better
> than that. For example it could use the return address to look up
> which lock is being accessed rather than embedding a key in each lock.
Right.
-- Steve
Thanks for the discussion and for the feedback on generic lock
instrumentation.
Following up on the earlier points about lightweight LOCK_STAT and
tracepoints.
While the thread has focused on spin_lock() calls, the real gap is on
the unlock path: there is currently no release-side tracepoint to
correlate holders and waiters or measure contended hold times.
A possible generic solution is a trace_contended_release() for spin
locks, for example:
	if (trace_contended_release_enabled() &&
	    atomic_read(&lock->val) & ~_Q_LOCKED_MASK)
		trace_contended_release(lock);
This might work on x86, but could increase code size and regress
performance on arches where spin_unlock() is inlined, such as arm64
under !PREEMPTION.
So even a generic release-side tracepoint has nontrivial downsides. A
lightweight LOCK_STAT fares no better: a disabled tracepoint is about
as lightweight as instrumentation gets, and the more statistics we
collect, the less lightweight the mechanism becomes.

The zone lock wrappers provide a practical, production-safe way to
observe this specific high-contention lock if the generic approach
proves impractical.
On Mon, Mar 16, 2026 at 05:40:50PM +0000, Dmitry Ilvokhin wrote:

[...]

> A possible generic solution is a trace_contended_release() for spin
> locks, for example:
>
> 	if (trace_contended_release_enabled() &&
> 	    atomic_read(&lock->val) & ~_Q_LOCKED_MASK)
> 		trace_contended_release(lock);
>
> This might work on x86, but could increase code size and regress
> performance on arches where spin_unlock() is inlined, such as arm64
> under !PREEMPTION.

I took a stab at this idea and submitted an RFC [1].

The implementation builds on Matthew's earlier observation that
_raw_spin_unlock() is not inlined in most configurations. In those
cases, when the tracepoint is disabled, this adds a single NOP on the
fast path, with the conditional check staying out of line. The measured
text size increase in this configuration is +983 bytes.

For configurations where _raw_spin_unlock() is inlined, the
instrumentation does increase code size more noticeably (+71 KB in my
measurements), since the check and out-of-line call are replicated at
each call site.

This provides a generic release-side signal for contended locks,
allowing correlation of lock holders with waiters and measurement of
contended hold times.

This RFC addresses the same visibility gap without introducing per-lock
instrumentation.

If this tradeoff is acceptable, this could be a generic alternative to
lock-specific tracepoints.

[1]: https://lore.kernel.org/all/51aad0415b78c5a39f2029722118fa01eac77538.1773858853.git.d@ilvokhin.com
On Thu, 19 Mar 2026 13:22:54 +0000 Dmitry Ilvokhin <d@ilvokhin.com>
wrote:

> On Mon, Mar 16, 2026 at 05:40:50PM +0000, Dmitry Ilvokhin wrote:
>
> [...]
>
> > A possible generic solution is a trace_contended_release() for spin
> > locks, for example:
> >
> > 	if (trace_contended_release_enabled() &&
> > 	    atomic_read(&lock->val) & ~_Q_LOCKED_MASK)
> > 		trace_contended_release(lock);
> >
> > This might work on x86, but could increase code size and regress
> > performance on arches where spin_unlock() is inlined, such as arm64
> > under !PREEMPTION.
>
> I took a stab at this idea and submitted an RFC [1].
>
> The implementation builds on Matthew's earlier observation that
> _raw_spin_unlock() is not inlined in most configurations. In those
> cases, when the tracepoint is disabled, this adds a single NOP on the
> fast path, with the conditional check staying out of line. The
> measured text size increase in this configuration is +983 bytes.
>
> For configurations where _raw_spin_unlock() is inlined, the
> instrumentation does increase code size more noticeably (+71 KB in my
> measurements), since the check and out-of-line call are replicated at
> each call site.
>
> This provides a generic release-side signal for contended locks,
> allowing correlation of lock holders with waiters and measurement of
> contended hold times.
>
> This RFC addresses the same visibility gap without introducing
> per-lock instrumentation.
>
> If this tradeoff is acceptable, this could be a generic alternative to
> lock-specific tracepoints.
>
> [1]: https://lore.kernel.org/all/51aad0415b78c5a39f2029722118fa01eac77538.1773858853.git.d@ilvokhin.com

That submission has met a disappointing response.

How should I proceed with this series "mm: zone lock tracepoint
instrumentation"?  It's not urgent so I'm inclined to put this on hold
while you pursue "locking: Add contended_release tracepoint to spinning
locks"?

Please send that v2 sometime and hopefully Steven can help push it
along?
On Tue, Mar 24, 2026 at 04:39:18PM -0700, Andrew Morton wrote:
> On Thu, 19 Mar 2026 13:22:54 +0000 Dmitry Ilvokhin <d@ilvokhin.com>
> wrote:
>
> > On Mon, Mar 16, 2026 at 05:40:50PM +0000, Dmitry Ilvokhin wrote:
> >
> > [...]
> >
> > > A possible generic solution is a trace_contended_release() for
> > > spin locks, for example:
> > >
> > > 	if (trace_contended_release_enabled() &&
> > > 	    atomic_read(&lock->val) & ~_Q_LOCKED_MASK)
> > > 		trace_contended_release(lock);
> > >
> > > This might work on x86, but could increase code size and regress
> > > performance on arches where spin_unlock() is inlined, such as
> > > arm64 under !PREEMPTION.
> >
> > I took a stab at this idea and submitted an RFC [1].
> >
> > The implementation builds on Matthew's earlier observation that
> > _raw_spin_unlock() is not inlined in most configurations. In those
> > cases, when the tracepoint is disabled, this adds a single NOP on
> > the fast path, with the conditional check staying out of line. The
> > measured text size increase in this configuration is +983 bytes.
> >
> > For configurations where _raw_spin_unlock() is inlined, the
> > instrumentation does increase code size more noticeably (+71 KB in
> > my measurements), since the check and out-of-line call are
> > replicated at each call site.
> >
> > This provides a generic release-side signal for contended locks,
> > allowing correlation of lock holders with waiters and measurement
> > of contended hold times.
> >
> > This RFC addresses the same visibility gap without introducing
> > per-lock instrumentation.
> >
> > If this tradeoff is acceptable, this could be a generic alternative
> > to lock-specific tracepoints.
> >
> > [1]: https://lore.kernel.org/all/51aad0415b78c5a39f2029722118fa01eac77538.1773858853.git.d@ilvokhin.com
>
> That submission has met a disappointing response.
>
> How should I proceed with this series "mm: zone lock tracepoint
> instrumentation"?  It's not urgent so I'm inclined to put this on hold
> while you pursue "locking: Add contended_release tracepoint to
> spinning locks"?

Thanks for the follow-up, Andrew.

My current plan is to focus on the "locking: Add contended_release
tracepoint to spinning locks" work and drive it to a clear conclusion:
either by getting feedback that it's not a good direction, or by
getting it into mainline.

In the meantime, it seems reasonable to drop the "mm: zone lock
tracepoint instrumentation" patchset from mm-new to avoid confusion
until the direction is clearer. I can revisit and respin it if the more
generic locking approach doesn't pan out.

> Please send that v2 sometime and hopefully Steven can help push it
> along?

I'll send the next version of the generic locking series soon. Any help
in pushing it along would be appreciated.
On Wed, 25 Mar 2026 12:14:35 +0000 Dmitry Ilvokhin <d@ilvokhin.com>
wrote:

> > Please send that v2 sometime and hopefully Steven can help push it
> > along?
>
> I'll send the next version of the generic locking series soon. Any
> help in pushing it along would be appreciated.

I'll see what I can do when I see v2!

-- Steve