Zone lock contention can significantly impact allocation and
reclaim latency, as it is a central synchronization point in
the page allocator and reclaim paths. Improved visibility into
its behavior is therefore important for diagnosing performance
issues in memory-intensive workloads.
On some production workloads at Meta, we have observed noticeable
zone lock contention. Deeper analysis of lock holders and waiters
is currently difficult with existing instrumentation.
While generic lock contention_begin/contention_end tracepoints
cover the slow path, they do not provide sufficient visibility
into lock hold times. In particular, the lack of a release-side
event makes it difficult to identify long lock holders and
correlate them with waiters. As a result, distinguishing between
short bursts of contention and pathological long hold times
requires additional instrumentation.
This patch series adds dedicated tracepoint instrumentation to
zone lock, following the existing mmap_lock tracing model.
The goal is to enable detailed holder/waiter analysis and lock
hold time measurements without affecting the fast path when
tracing is disabled.
The series is structured as follows:
1. Introduce zone lock wrappers.
2. Mechanically convert zone lock users to the wrappers.
3. Convert compaction to use the wrappers (requires minor
restructuring of compact_lock_irqsave()).
4. Rename zone->lock to zone->_lock.
5. Add zone lock tracepoints.
The tracepoints are added via lightweight inline helpers in the
wrappers. When tracing is disabled, the fast path remains
unchanged.
Changes in v4:
- Fix build errors in kernel/power/snapshot.c and mm/shuffle.c.
- Remove EXPORT_SYMBOL().
- Rename header zone_lock.h to mmzone_lock.h.
- Properly indent __acquires() annotation.
Changes in v3:
- Split compact_lock_irqsave() into compact_zone_lock_irqsave() and
compact_lruvec_lock_irqsave().
- Rename zone->lock to zone->_lock.
Changes in v2:
- Move mechanical changes from mm/compaction.c to a different commit.
- Remove compact_do_zone_trylock() and compact_do_raw_trylock_irqsave().
v1: https://lore.kernel.org/all/cover.1770821420.git.d@ilvokhin.com/
v2: https://lore.kernel.org/all/cover.1772030186.git.d@ilvokhin.com/
v3: https://lore.kernel.org/all/cover.1772129168.git.d@ilvokhin.com/
Dmitry Ilvokhin (5):
mm: introduce zone lock wrappers
mm: convert zone lock users to wrappers
mm: convert compaction to zone lock wrappers
mm: rename zone->lock to zone->_lock
mm: add tracepoints for zone lock
MAINTAINERS | 2 +
include/linux/mmzone.h | 7 ++-
include/linux/mmzone_lock.h | 100 +++++++++++++++++++++++++++++++
include/trace/events/zone_lock.h | 64 ++++++++++++++++++++
kernel/power/snapshot.c | 5 +-
mm/Makefile | 2 +-
mm/compaction.c | 58 +++++++++++-------
mm/internal.h | 2 +-
mm/memory_hotplug.c | 9 +--
mm/mm_init.c | 3 +-
mm/page_alloc.c | 89 +++++++++++++--------------
mm/page_isolation.c | 23 +++----
mm/page_owner.c | 2 +-
mm/page_reporting.c | 13 ++--
mm/show_mem.c | 5 +-
mm/shuffle.c | 9 +--
mm/vmscan.c | 5 +-
mm/vmstat.c | 9 +--
mm/zone_lock.c | 28 +++++++++
19 files changed, 330 insertions(+), 105 deletions(-)
create mode 100644 include/linux/mmzone_lock.h
create mode 100644 include/trace/events/zone_lock.h
create mode 100644 mm/zone_lock.c
--
2.47.3
On Fri, Feb 27, 2026 at 04:00:22PM +0000, Dmitry Ilvokhin wrote:
> Zone lock contention can significantly impact allocation and
> reclaim latency, as it is a central synchronization point in
> the page allocator and reclaim paths. Improved visibility into
> its behavior is therefore important for diagnosing performance
> issues in memory-intensive workloads.
>
> On some production workloads at Meta, we have observed noticeable
> zone lock contention. Deeper analysis of lock holders and waiters
> is currently difficult with existing instrumentation.
>
> While generic lock contention_begin/contention_end tracepoints
> cover the slow path, they do not provide sufficient visibility
> into lock hold times. In particular, the lack of a release-side
> event makes it difficult to identify long lock holders and
> correlate them with waiters. As a result, distinguishing between
> short bursts of contention and pathological long hold times
> requires additional instrumentation.
>
> This patch series adds dedicated tracepoint instrumentation to
> zone lock, following the existing mmap_lock tracing model.

I don't like this at all.  We have CONFIG_LOCK_STAT.  That should be
improved instead of coming up with one-offs for every single lock
that someone deems "special".
On Mon, Mar 09, 2026 at 01:10:46PM +0000, Matthew Wilcox wrote:
> On Fri, Feb 27, 2026 at 04:00:22PM +0000, Dmitry Ilvokhin wrote:
> > Zone lock contention can significantly impact allocation and
> > reclaim latency, as it is a central synchronization point in
> > the page allocator and reclaim paths. Improved visibility into
> > its behavior is therefore important for diagnosing performance
> > issues in memory-intensive workloads.
> >
> > On some production workloads at Meta, we have observed noticeable
> > zone lock contention. Deeper analysis of lock holders and waiters
> > is currently difficult with existing instrumentation.
> >
> > While generic lock contention_begin/contention_end tracepoints
> > cover the slow path, they do not provide sufficient visibility
> > into lock hold times. In particular, the lack of a release-side
> > event makes it difficult to identify long lock holders and
> > correlate them with waiters. As a result, distinguishing between
> > short bursts of contention and pathological long hold times
> > requires additional instrumentation.
> >
> > This patch series adds dedicated tracepoint instrumentation to
> > zone lock, following the existing mmap_lock tracing model.
>
> I don't like this at all.  We have CONFIG_LOCK_STAT.  That should be
> improved instead of coming up with one-offs for every single lock
> that someone deems "special".

Thanks for the feedback, Matthew.

CONFIG_LOCK_STAT provides useful statistics, but it is primarily a
debug facility and is generally too heavyweight for production
environments.

The motivation for this series was to provide lightweight observability
for the zone lock in production workloads.

I agree that improving generic lock instrumentation would be
preferable. I did consider whether something similar could be done
generically for spinlocks, but the unlock path there is typically just
a single atomic store, so adding generic lightweight instrumentation
without affecting the fast path is difficult.

In parallel, I've been experimenting with improving observability for
sleepable locks by adding a contended_release tracepoint, which would
allow correlating lock holders and waiters in a more generic way. I've
posted an RFC here:

https://lore.kernel.org/all/cover.1772642407.git.d@ilvokhin.com/

I'd appreciate feedback on whether that direction makes sense for
improving the generic lock tracing infrastructure.
On Mon, Mar 09, 2026 at 02:21:42PM +0000, Dmitry Ilvokhin wrote:
> On Mon, Mar 09, 2026 at 01:10:46PM +0000, Matthew Wilcox wrote:
> > On Fri, Feb 27, 2026 at 04:00:22PM +0000, Dmitry Ilvokhin wrote:
> > > This patch series adds dedicated tracepoint instrumentation to
> > > zone lock, following the existing mmap_lock tracing model.
> >
> > I don't like this at all.  We have CONFIG_LOCK_STAT.  That should be
> > improved instead of coming up with one-offs for every single lock
> > that someone deems "special".
>
> Thanks for the feedback, Matthew.
>
> CONFIG_LOCK_STAT provides useful statistics, but it is primarily a
> debug facility and is generally too heavyweight for production
> environments.

Yes, agreed.  I think that is what needs to change.

> The motivation for this series was to provide lightweight observability
> for the zone lock in production workloads.

I read that.  But first it was the mmap lock.  Now it's the zone lock.
Which lock will be next?  This is too heavyweight a procedure to follow
for each lock of interest.

> I agree that improving generic lock instrumentation would be
> preferable. I did consider whether something similar could be done
> generically for spinlocks, but the unlock path there is typically just
> a single atomic store, so adding generic lightweight instrumentation
> without affecting the fast path is difficult.

This is why we have tracepoint_enabled() and friends.  But ...
LOCK_STAT doesn't affect the unlock path at all.  It only changes the
acquire side to call lock_acquired() (and lock_contended() if the
trylock failed).

> In parallel, I've been experimenting with improving observability for
> sleepable locks by adding a contended_release tracepoint, which would
> allow correlating lock holders and waiters in a more generic way. I've
> posted an RFC here:
>
> https://lore.kernel.org/all/cover.1772642407.git.d@ilvokhin.com/
>
> I'd appreciate feedback on whether that direction makes sense for
> improving the generic lock tracing infrastructure.

It seems fine to me, but I don't have the depth in the locking code to
give it the thorough review it deserves.
On Mon, 9 Mar 2026 14:47:29 +0000 Matthew Wilcox <willy@infradead.org>
wrote:

> > CONFIG_LOCK_STAT provides useful statistics, but it is primarily a
> > debug facility and is generally too heavyweight for production
> > environments.
>
> Yes, agreed.  I think that is what needs to change.

The biggest issue with making a generic lightweight LOCK_STAT is that
locks are extremely optimized. Any addition of generic lock encoding
will cause a noticeable overhead when compiled in, even when disabled.

Most of the lock code is inlined for the fast path. Now if we want to
add lock stats, we would need to add hooks into those inlined paths.
This will undoubtedly increase the size of text, which will have an
impact on I$.

The other issue is the data we store for the lock. A lock is usually
just a word (or long) in size, embedded in a structure. LOCKDEP and
LOCK_STAT add a key per lock. This increases the data size of the
kernel.

LOCK_STAT was designed on top of LOCKDEP. Perhaps a lighter version of
LOCK_STAT could be designed without LOCKDEP, but it would be a project
on its own.

-- Steve
On Mon, Mar 09, 2026 at 03:13:17PM -0400, Steven Rostedt wrote:
> The biggest issue with making a generic light weight LOCK_STAT is that
> locks are extremely optimized. Any addition of generic lock encoding will
> cause a noticeable overhead when compiled in, even when disabled.
I'm not sure that's true. Taking the current Debian kernel config
leads to a "call" instruction to acquire a spinlock:
void __insert_inode_hash(struct inode *inode, unsigned long hashval)
{
struct hlist_head *b = inode_hashtable + hash(inode->i_sb, hashval);
spin_lock(&inode_hash_lock);
spin_lock(&inode->i_lock);
hlist_add_head_rcu(&inode->i_hash, b);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_hash_lock);
}
compiles to:
[...]
280: 23 35 00 00 00 00 and 0x0(%rip),%esi # 286 <__insert_inode_hash+0x56>
282: R_X86_64_PC32 .data..ro_after_init+0x10
286: 48 8d 2c f0 lea (%rax,%rsi,8),%rbp
28a: e8 00 00 00 00 call 28f <__insert_inode_hash+0x5f>
28b: R_X86_64_PLT32 _raw_spin_lock-0x4
28f: 4c 89 e7 mov %r12,%rdi
292: e8 00 00 00 00 call 297 <__insert_inode_hash+0x67>
293: R_X86_64_PLT32 _raw_spin_lock-0x4
[...]
Debian doesn't do anything too weird here:
#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_LOCK_DEBUGGING_SUPPORT=y
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
# CONFIG_DEBUG_RWSEMS is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
# CONFIG_SCF_TORTURE_TEST is not set
# CONFIG_CSD_LOCK_WAIT_DEBUG is not set
(The spinlock code is too complex for me to follow what config options
influence whether it's a function call; you probably have enough of it
in your head that you'd know)
> The other issue is the data we store for the lock. A lock is usually just a
> word (or long) in size, embedded in a structure. LOCKDEP and LOCK_STAT adds
> a key per lock. This increases the data size of the kernel.
It does, but perhaps for a light weight lockstat, we could do better
than that. For example it could use the return address to look up
which lock is being accessed rather than embedding a key in each lock.
On Mon, 9 Mar 2026 20:45:31 +0000
Matthew Wilcox <willy@infradead.org> wrote:
> On Mon, Mar 09, 2026 at 03:13:17PM -0400, Steven Rostedt wrote:
> > The biggest issue with making a generic lightweight LOCK_STAT is that
> > locks are extremely optimized. Any addition of generic lock encoding will
> > cause a noticeable overhead when compiled in, even when disabled.
>
> I'm not sure that's true. Taking the current Debian kernel config
> leads to a "call" instruction to acquire a spinlock:
>
> void __insert_inode_hash(struct inode *inode, unsigned long hashval)
> {
> struct hlist_head *b = inode_hashtable + hash(inode->i_sb, hashval);
>
> spin_lock(&inode_hash_lock);
> spin_lock(&inode->i_lock);
> hlist_add_head_rcu(&inode->i_hash, b);
> spin_unlock(&inode->i_lock);
> spin_unlock(&inode_hash_lock);
> }
>
> compiles to:
>
> [...]
> 280: 23 35 00 00 00 00 and 0x0(%rip),%esi # 286 <__insert_inode_hash+0x56>
> 282: R_X86_64_PC32 .data..ro_after_init+0x10
> 286: 48 8d 2c f0 lea (%rax,%rsi,8),%rbp
> 28a: e8 00 00 00 00 call 28f <__insert_inode_hash+0x5f>
> 28b: R_X86_64_PLT32 _raw_spin_lock-0x4
> 28f: 4c 89 e7 mov %r12,%rdi
> 292: e8 00 00 00 00 call 297 <__insert_inode_hash+0x67>
> 293: R_X86_64_PLT32 _raw_spin_lock-0x4
> [...]
Ah, you're correct. Looks like it's an arch specific thing. I was going
back to my memory from around 2006, but it appears that only a few archs
inline spinlocks anymore. Thomas made it a bit easier to see what does and
does not do that (in 2009).
6beb000923882 ("locking: Make inlining decision Kconfig based")
So, perhaps adding code to the spinlocks will not be as much of a hit on I$.
> (The spinlock code is too complex for me to follow what config options
> influence whether it's a function call; you probably have enough of it
> in your head that you'd know)
Yeah, I feel like I'm always relearning the code every time I have to jump
in and understand it again.
>
> > The other issue is the data we store for the lock. A lock is usually just a
> > word (or long) in size, embedded in a structure. LOCKDEP and LOCK_STAT adds
> > a key per lock. This increases the data size of the kernel.
>
> It does, but perhaps for a light weight lockstat, we could do better
> than that. For example it could use the return address to look up
> which lock is being accessed rather than embedding a key in each lock.
Right.
-- Steve
Thanks for the discussion and for the feedback on generic lock
instrumentation.
Following up on the earlier points about lightweight LOCK_STAT and
tracepoints.
While the thread has focused on spin_lock() calls, the real gap is on
the unlock path: there is currently no release-side tracepoint to
correlate holders and waiters or measure contended hold times.
A possible generic solution is a trace_contended_release() for spin
locks, for example:
	if (trace_contended_release_enabled() &&
	    atomic_read(&lock->val) & ~_Q_LOCKED_MASK)
		trace_contended_release(lock);
This might work on x86, but could increase code size and regress
performance on arches where spin_unlock() is inlined, such as arm64
under !PREEMPTION.
So even a generic release-side tracepoint has nontrivial downsides. A
lightweight LOCK_STAT fares no better: a disabled tracepoint is about
as lightweight as instrumentation gets, and the more statistics we
collect, the less lightweight the mechanism becomes.

The zone lock wrappers provide a practical, production-safe way to
observe this specific high-contention lock if the generic approach
proves impractical.
On Mon, Mar 16, 2026 at 05:40:50PM +0000, Dmitry Ilvokhin wrote:

[...]

> A possible generic solution is a trace_contended_release() for spin
> locks, for example:
>
> 	if (trace_contended_release_enabled() &&
> 	    atomic_read(&lock->val) & ~_Q_LOCKED_MASK)
> 		trace_contended_release(lock);
>
> This might work on x86, but could increase code size and regress
> performance on arches where spin_unlock() is inlined, such as arm64
> under !PREEMPTION.

I took a stab at this idea and submitted an RFC [1].

The implementation builds on Matthew's earlier observation that
_raw_spin_unlock() is not inlined in most configurations. In those
cases, when the tracepoint is disabled, this adds a single NOP on the
fast path, with the conditional check staying out of line. The measured
text size increase in this configuration is +983 bytes.

For configurations where _raw_spin_unlock() is inlined, the
instrumentation does increase code size more noticeably (+71 KB in my
measurements), since the check and out-of-line call are replicated at
each call site.

This provides a generic release-side signal for contended locks,
allowing correlation of lock holders with waiters and measurement of
contended hold times.

This RFC addresses the same visibility gap without introducing per-lock
instrumentation.

If this tradeoff is acceptable, this could be a generic alternative to
lock-specific tracepoints.

[1]: https://lore.kernel.org/all/51aad0415b78c5a39f2029722118fa01eac77538.1773858853.git.d@ilvokhin.com
On Thu, 19 Mar 2026 13:22:54 +0000 Dmitry Ilvokhin <d@ilvokhin.com>
wrote:

> On Mon, Mar 16, 2026 at 05:40:50PM +0000, Dmitry Ilvokhin wrote:
>
> [...]
>
> > A possible generic solution is a trace_contended_release() for spin
> > locks, for example:
> >
> > 	if (trace_contended_release_enabled() &&
> > 	    atomic_read(&lock->val) & ~_Q_LOCKED_MASK)
> > 		trace_contended_release(lock);
> >
> > This might work on x86, but could increase code size and regress
> > performance on arches where spin_unlock() is inlined, such as arm64
> > under !PREEMPTION.
>
> I took a stab at this idea and submitted an RFC [1].
>
> The implementation builds on Matthew's earlier observation that
> _raw_spin_unlock() is not inlined in most configurations. In those
> cases, when the tracepoint is disabled, this adds a single NOP on the
> fast path, with the conditional check staying out of line. The
> measured text size increase in this configuration is +983 bytes.
>
> For configurations where _raw_spin_unlock() is inlined, the
> instrumentation does increase code size more noticeably (+71 KB in my
> measurements), since the check and out-of-line call are replicated at
> each call site.
>
> This provides a generic release-side signal for contended locks,
> allowing correlation of lock holders with waiters and measurement of
> contended hold times.
>
> This RFC addresses the same visibility gap without introducing
> per-lock instrumentation.
>
> If this tradeoff is acceptable, this could be a generic alternative to
> lock-specific tracepoints.
>
> [1]: https://lore.kernel.org/all/51aad0415b78c5a39f2029722118fa01eac77538.1773858853.git.d@ilvokhin.com

That submission has met a disappointing response.

How should I proceed with this series "mm: zone lock tracepoint
instrumentation"?  It's not urgent so I'm inclined to put this on hold
while you pursue "locking: Add contended_release tracepoint to spinning
locks"?

Please send that v2 sometime and hopefully Steven can help push it
along?
On Tue, Mar 24, 2026 at 04:39:18PM -0700, Andrew Morton wrote:
> On Thu, 19 Mar 2026 13:22:54 +0000 Dmitry Ilvokhin <d@ilvokhin.com>
> wrote:
>
> > On Mon, Mar 16, 2026 at 05:40:50PM +0000, Dmitry Ilvokhin wrote:
> >
> > [...]
> >
> > > A possible generic solution is a trace_contended_release() for
> > > spin locks, for example:
> > >
> > > 	if (trace_contended_release_enabled() &&
> > > 	    atomic_read(&lock->val) & ~_Q_LOCKED_MASK)
> > > 		trace_contended_release(lock);
> > >
> > > This might work on x86, but could increase code size and regress
> > > performance on arches where spin_unlock() is inlined, such as
> > > arm64 under !PREEMPTION.
> >
> > I took a stab at this idea and submitted an RFC [1].
> >
> > The implementation builds on Matthew's earlier observation that
> > _raw_spin_unlock() is not inlined in most configurations. In those
> > cases, when the tracepoint is disabled, this adds a single NOP on
> > the fast path, with the conditional check staying out of line. The
> > measured text size increase in this configuration is +983 bytes.
> >
> > For configurations where _raw_spin_unlock() is inlined, the
> > instrumentation does increase code size more noticeably (+71 KB in
> > my measurements), since the check and out-of-line call are
> > replicated at each call site.
> >
> > This provides a generic release-side signal for contended locks,
> > allowing correlation of lock holders with waiters and measurement
> > of contended hold times.
> >
> > This RFC addresses the same visibility gap without introducing
> > per-lock instrumentation.
> >
> > If this tradeoff is acceptable, this could be a generic alternative
> > to lock-specific tracepoints.
> >
> > [1]: https://lore.kernel.org/all/51aad0415b78c5a39f2029722118fa01eac77538.1773858853.git.d@ilvokhin.com
>
> That submission has met a disappointing response.
>
> How should I proceed with this series "mm: zone lock tracepoint
> instrumentation"?  It's not urgent so I'm inclined to put this on hold
> while you pursue "locking: Add contended_release tracepoint to
> spinning locks"?

Thanks for the follow-up, Andrew.

My current plan is to focus on the "locking: Add contended_release
tracepoint to spinning locks" work and drive it to a clear conclusion:
either by getting feedback that it's not a good direction, or by
getting it into mainline.

In the meantime, it seems reasonable to drop the "mm: zone lock
tracepoint instrumentation" patchset from mm-new to avoid confusion
until the direction is clearer. I can revisit and respin it if the more
generic locking approach doesn't pan out.

> Please send that v2 sometime and hopefully Steven can help push it
> along?

I'll send the next version of the generic locking series soon. Any help
in pushing it along would be appreciated.
On Wed, 25 Mar 2026 12:14:35 +0000 Dmitry Ilvokhin <d@ilvokhin.com>
wrote:

> > Please send that v2 sometime and hopefully Steven can help push it
> > along?
>
> I'll send the next version of the generic locking series soon. Any
> help in pushing it along would be appreciated.

I'll see what I can do when I see v2!

-- Steve