Recently, we resolved a latency spike issue caused by concurrently running
bpftrace processes. The root cause was high contention on the ftrace_lock
due to optimistic spinning. We can optimize this by disabling optimistic
spinning for ftrace_lock.

While semaphores may present similar challenges, I'm not currently aware of
specific instances that exhibit this exact issue. Should we encounter
problematic semaphores in production workloads, we can address them at that
time.

PATCH #1: introduce slow_mutex_[un]lock to disable optimistic spinning
PATCH #2: add variant for rtmutex
PATCH #3: disable optimistic spinning for ftrace_lock

v1->v2:
- add slow_mutex_[un]lock (Steven)
- add variant for rtmutex (Waiman)
- revise commit log for clarity and accuracy (Waiman, Peter)
- note that semaphores may present similar challenges (David)

RFC v1: https://lore.kernel.org/bpf/20260304074650.58165-1-laoar.shao@gmail.com/

Yafang Shao (3):
  locking/mutex: Add slow path variants for lock/unlock
  locking/rtmutex: Add slow path variants for lock/unlock
  ftrace: Disable optimistic spinning for ftrace_lock

 include/linux/mutex.h        |   4 ++
 include/linux/rtmutex.h      |   3 +
 kernel/locking/mutex.c       |  41 +++++++++++---
 kernel/locking/rtmutex.c     |  37 +++++++-----
 kernel/locking/rtmutex_api.c |  47 +++++++++++++---
 kernel/trace/ftrace.c        | 106 +++++++++++++++++------------------
 6 files changed, 157 insertions(+), 81 deletions(-)

--
2.47.3
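A rough sketch of what the PATCH #1 interface could look like, going only by
the names in the cover letter; the no-spin slowpath helper here is an
assumption, not the actual patch:

	/*
	 * Sketch: acquire a mutex but always sleep on contention instead
	 * of optimistically spinning on a running owner.
	 * __mutex_trylock_fast() is the existing uncontended cmpxchg fast
	 * path; __mutex_lock_slowpath_no_spin() is a hypothetical slowpath
	 * that skips mutex_optimistic_spin().
	 */
	void slow_mutex_lock(struct mutex *lock)
	{
		might_sleep();

		if (!__mutex_trylock_fast(lock))
			__mutex_lock_slowpath_no_spin(lock);
	}

	void slow_mutex_unlock(struct mutex *lock)
	{
		/* In this sketch unlock is unchanged; spinning only affects
		 * how waiters acquire the lock, not how it is released. */
		mutex_unlock(lock);
	}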
On Wed, Mar 11, 2026 at 07:52:47PM +0800, Yafang Shao wrote:
> Recently, we resolved a latency spike issue caused by concurrently running
> bpftrace processes. The root cause was high contention on the ftrace_lock
> due to optimistic spinning. We can optimize this by disabling optimistic
> spinning for ftrace_lock.
>
> While semaphores may present similar challenges, I'm not currently aware of
> specific instances that exhibit this exact issue. Should we encounter
> problematic semaphores in production workloads, we can address them at that
> time.
>
> PATCH #1: introduce slow_mutex_[un]lock to disable optimistic spinning
> PATCH #2: add variant for rtmutex
> PATCH #3: disable optimistic spinning for ftrace_lock
>

So I really utterly hate this.
On Wed, 11 Mar 2026 12:54:26 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Mar 11, 2026 at 07:52:47PM +0800, Yafang Shao wrote:
> > Recently, we resolved a latency spike issue caused by concurrently running
> > bpftrace processes. The root cause was high contention on the ftrace_lock
> > due to optimistic spinning. We can optimize this by disabling optimistic
> > spinning for ftrace_lock.
> >
> > While semaphores may present similar challenges, I'm not currently aware of
> > specific instances that exhibit this exact issue. Should we encounter
> > problematic semaphores in production workloads, we can address them at that
> > time.
> >
> > PATCH #1: introduce slow_mutex_[un]lock to disable optimistic spinning
> > PATCH #2: add variant for rtmutex
> > PATCH #3: disable optimistic spinning for ftrace_lock
> >
>
> So I really utterly hate this.
Yep...
Adding the extra parameter is likely to have a measurable impact
on everything else.
The problematic path is obvious:

	find_kallsyms_symbol+142
	module_address_lookup+104
	kallsyms_lookup_buildid+203
	kallsyms_lookup+20
	print_rec+64
	t_show+67
	seq_read_iter+709
	seq_read+165
	vfs_read+165
	ksys_read+103
	__x64_sys_read+25
	do_syscall_64+56
	entry_SYSCALL_64_after_hwframe+100
The code needs to drop the ftrace_lock across t_show.
Although there is a bigger issue of why on earth the code is reading the
list of filter functions at all - never mind all the time.
I'll do it by hand when debugging, but I'd have thought anything using bpf
will know exactly where to add its hooks.
David
On Wed, Mar 11, 2026 at 8:53 PM David Laight <david.laight.linux@gmail.com> wrote:
>
> On Wed, 11 Mar 2026 12:54:26 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Wed, Mar 11, 2026 at 07:52:47PM +0800, Yafang Shao wrote:
> > > Recently, we resolved a latency spike issue caused by concurrently running
> > > bpftrace processes. The root cause was high contention on the ftrace_lock
> > > due to optimistic spinning. We can optimize this by disabling optimistic
> > > spinning for ftrace_lock.
> > >
> > > While semaphores may present similar challenges, I'm not currently aware of
> > > specific instances that exhibit this exact issue. Should we encounter
> > > problematic semaphores in production workloads, we can address them at that
> > > time.
> > >
> > > PATCH #1: introduce slow_mutex_[un]lock to disable optimistic spinning
> > > PATCH #2: add variant for rtmutex
> > > PATCH #3: disable optimistic spinning for ftrace_lock
> > >
> >
> > So I really utterly hate this.
>
> Yep...
> Adding the extra parameter is likely to have a measurable impact
> on everything else.
>
> The problematic path is obvious:
>
> 	find_kallsyms_symbol+142
> 	module_address_lookup+104
> 	kallsyms_lookup_buildid+203
> 	kallsyms_lookup+20
> 	print_rec+64
> 	t_show+67
> 	seq_read_iter+709
> 	seq_read+165
> 	vfs_read+165
> 	ksys_read+103
> 	__x64_sys_read+25
> 	do_syscall_64+56
> 	entry_SYSCALL_64_after_hwframe+100
>
> The code needs to drop the ftrace_lock across t_show.

It's unclear whether we can safely release ftrace_lock within t_show;
doing so would probably necessitate a major redesign of the current
implementation.

> Although there is a bigger issue of why on earth the code is reading the
> list of filter functions at all - never mind all the time.

bpftrace reads the complete list of available functions into
userspace, then performs matching against the target function to
determine if it is traceable.

> I'll do it by hand when debugging, but I'd have thought anything using bpf
> will know exactly where to add its hooks.

--
Regards
Yafang
Hi,

On Wed, 11 Mar 2026 21:40:32 +0800
Yafang Shao <laoar.shao@gmail.com> wrote:

> > Although there is a bigger issue of why on earth the code is reading the
> > list of filter functions at all - never mind all the time.
>
> bpftrace reads the complete list of available functions into
> userspace, then performs matching against the target function to
> determine if it is traceable.

What about changing the bpftrace userspace tool to cache the list of
available functions? (or just add an option to pass an available function
list?) Then you can just copy the function list somewhere and use it.

Of course we can do the same thing in the kernel, but I don't think
there is any reason to do it in the kernel instead of user space.

Thank you,

> > I'll do it by hand when debugging, but I'd have thought anything using bpf
> > will know exactly where to add its hooks.
>
> --
> Regards
> Yafang

--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
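A minimal sketch of the cache Masami suggests, keyed on the kernel boot id
so a list from a previous boot is never reused. The cache location is an
assumption, and the list can still go stale within a boot if modules are
loaded or unloaded:

	#include <stdio.h>
	#include <string.h>

	#define AVAIL "/sys/kernel/tracing/available_filter_functions"
	#define CACHE "/var/cache/bpftrace/functions"	/* assumed path */
	#define STAMP CACHE ".bootid"

	/* Read the first line of @path into @buf; 0 on success. */
	static int read_line(const char *path, char *buf, size_t len)
	{
		FILE *f = fopen(path, "r");

		if (!f)
			return -1;
		if (!fgets(buf, len, f)) {
			fclose(f);
			return -1;
		}
		fclose(f);
		return 0;
	}

	/* Copy @src to @dst; 0 on success. */
	static int copy_file(const char *src, const char *dst)
	{
		char buf[4096];
		size_t n;
		FILE *in = fopen(src, "r");
		FILE *out = in ? fopen(dst, "w") : NULL;

		if (!out) {
			if (in)
				fclose(in);
			return -1;
		}
		while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
			fwrite(buf, 1, n, out);
		fclose(in);
		fclose(out);
		return 0;
	}

	/* Returns the path callers should read the function list from. */
	const char *function_list(void)
	{
		char boot[64], cached[64] = "";
		FILE *f;

		if (read_line("/proc/sys/kernel/random/boot_id",
			      boot, sizeof(boot)))
			return AVAIL;	/* no boot id: don't cache */
		if (!read_line(STAMP, cached, sizeof(cached)) &&
		    !strcmp(boot, cached))
			return CACHE;	/* cache is from this boot */

		if (copy_file(AVAIL, CACHE))	/* the one expensive read */
			return AVAIL;	/* e.g. cache dir missing */
		f = fopen(STAMP, "w");
		if (f) {
			fputs(boot, f);
			fclose(f);
		}
		return CACHE;
	}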
On Thu, Mar 12, 2026 at 4:06 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> Hi,
>
> On Wed, 11 Mar 2026 21:40:32 +0800
> Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > > Although there is a bigger issue of why on earth the code is reading the
> > > list of filter functions at all - never mind all the time.
> >
> > bpftrace reads the complete list of available functions into
> > userspace, then performs matching against the target function to
> > determine if it is traceable.
>
> What about changing the bpftrace userspace tool to cache the list of
> available functions? (or just add an option to pass an available function
> list?) Then you can just copy the function list somewhere and use it.

Thanks for the suggestion. Steven also mentioned this optimization, and
I believe it could indeed help address the performance issue.

> Of course we can do the same thing in the kernel, but I don't think
> there is any reason to do it in the kernel instead of user space.

Implementing this in the kernel provides a more generic solution,
instead of patching each tool individually (bpftrace, perf, trace-cmd,
etc.).

--
Regards
Yafang
On Wed, 11 Mar 2026 21:40:32 +0800
Yafang Shao <laoar.shao@gmail.com> wrote:

> > The code needs to drop the ftrace_lock across t_show.
>
> It's unclear whether we can safely release ftrace_lock within t_show;
> doing so would probably necessitate a major redesign of the current
> implementation.

The issue isn't t_show, it's the calls between t_start and t_next and
subsequent t_next calls, which need to keep a consistent state. t_show
just happens to be called in between them.

> > Although there is a bigger issue of why on earth the code is reading the
> > list of filter functions at all - never mind all the time.
>
> bpftrace reads the complete list of available functions into
> userspace, then performs matching against the target function to
> determine if it is traceable.

Could it parse it in smaller bits? That is, the lock is held only during an
individual read system call. If it reads the available_filter_functions
file via smaller buffers, it would not hold the lock for as long.

-- Steve
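The constraint Steven describes comes from the shape of the seq_file
iterator: ftrace_lock is taken in t_start() and only dropped in t_stop(),
so every t_next() and t_show() within one read() runs under it. A
simplified sketch of that shape; t_find(), t_next_rec() and the print_rec()
signature stand in for the real record walking in kernel/trace/ftrace.c:

	static void *t_start(struct seq_file *m, loff_t *pos)
	{
		struct ftrace_iterator *iter = m->private;

		mutex_lock(&ftrace_lock);	/* held for this entire read() */
		return t_find(iter, *pos);	/* walk to the record at *pos */
	}

	static void *t_next(struct seq_file *m, void *v, loff_t *pos)
	{
		(*pos)++;
		/* Advancing assumes the record list hasn't changed since
		 * t_start(), which is why the lock must stay held. */
		return t_next_rec(m->private);
	}

	static int t_show(struct seq_file *m, void *v)
	{
		/* The expensive part: a kallsyms lookup per record,
		 * performed while ftrace_lock is still held from
		 * t_start() above. */
		return print_rec(m, v);
	}

	static void t_stop(struct seq_file *m, void *p)
	{
		mutex_unlock(&ftrace_lock);	/* end of this read() */
	}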
On Wed, 11 Mar 2026 13:07:43 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 11 Mar 2026 21:40:32 +0800
> Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > > The code needs to drop the ftrace_lock across t_show.
> >
> > It's unclear whether we can safely release ftrace_lock within t_show;
> > doing so would probably necessitate a major redesign of the current
> > implementation.
>
> The issue isn't t_show, it's the calls between t_start and t_next and
> subsequent t_next calls, which need to keep a consistent state. t_show
> just happens to be called in between them.
>
> > > Although there is a bigger issue of why on earth the code is reading the
> > > list of filter functions at all - never mind all the time.
> >
> > bpftrace reads the complete list of available functions into
> > userspace, then performs matching against the target function to
> > determine if it is traceable.
>
> Could it parse it in smaller bits? That is, the lock is held only during an
> individual read system call. If it reads the available_filter_functions
> file via smaller buffers, it would not hold the lock for as long.

But the expensive part is probably looking up the symbol name.
Shorter reads would hand the lock off to the other process, but overall
the lock would still be held for the same length of time.

How does the code work out where to start from for each read system call?
Couldn't you (effectively) do one symbol at a time the same way?

Another option would be to put a 'generation number' on the list of
functions (after all it doesn't change that often).
Then you can release the lock, generate the data, re-acquire the lock and
check the generation number hasn't changed. If it hasn't changed you can
carry on processing the list using the same pointer.
If the generation number has changed terminate the read and worry about
locating the correct start position for the next read.

	David

> -- Steve
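A sketch of the generation-number scheme David describes; ftrace_gen is a
hypothetical counter that every writer would bump under ftrace_lock when
the record list changes, and show_rec_unlocked() is an assumed helper, not
existing kernel code:

	/* Hypothetical: incremented under ftrace_lock whenever the
	 * dyn_ftrace records are modified (module load/unload, etc.). */
	static unsigned long ftrace_gen;

	static int show_rec_unlocked(struct seq_file *m, struct dyn_ftrace *rec)
	{
		unsigned long gen = ftrace_gen;
		unsigned long ip = rec->ip;	/* copy out; rec may go away */
		char sym[KSYM_SYMBOL_LEN];

		mutex_unlock(&ftrace_lock);
		/* The expensive symbol lookup now runs without the lock. */
		if (!kallsyms_lookup(ip, NULL, NULL, NULL, sym))
			snprintf(sym, sizeof(sym), "0x%lx", ip);
		seq_printf(m, "%s\n", sym);
		mutex_lock(&ftrace_lock);

		if (gen != ftrace_gen)
			return -EAGAIN;	/* list changed while unlocked:
					 * end this read; the next read()
					 * must re-locate its position */
		return 0;
	}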
On Wed, 11 Mar 2026 13:07:43 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> > bpftrace reads the complete list of available functions into
> > userspace, then performs matching against the target function to
> > determine if it is traceable.
>
> Could it parse it in smaller bits? That is, the lock is held only during an
> individual read system call. If it reads the available_filter_functions
> file via smaller buffers, it would not hold the lock for as long.
Hmm, I guess this wouldn't help much. I ran:
trace-cmd record -e sys_enter_read -e sys_exit_read -F cat /sys/kernel/tracing/available_filter_functions > /dev/null
And trace-cmd report shows:
[..]
cat-1208 [001] ..... 142.025582: sys_enter_read: fd: 0x00000003, buf: 0x7fa9daf21000, count: 0x00040000
cat-1208 [001] ..... 142.025995: sys_exit_read: 0xfee
cat-1208 [001] ..... 142.026000: sys_enter_read: fd: 0x00000003, buf: 0x7fa9daf21000, count: 0x00040000
cat-1208 [001] ..... 142.026392: sys_exit_read: 0xff8
cat-1208 [001] ..... 142.026396: sys_enter_read: fd: 0x00000003, buf: 0x7fa9daf21000, count: 0x00040000
cat-1208 [001] ..... 142.026766: sys_exit_read: 0xfed
cat-1208 [001] ..... 142.026770: sys_enter_read: fd: 0x00000003, buf: 0x7fa9daf21000, count: 0x00040000
cat-1208 [001] ..... 142.027113: sys_exit_read: 0xfe0
cat-1208 [001] ..... 142.027117: sys_enter_read: fd: 0x00000003, buf: 0x7fa9daf21000, count: 0x00040000
cat-1208 [001] ..... 142.027502: sys_exit_read: 0xfec
cat-1208 [001] ..... 142.027506: sys_enter_read: fd: 0x00000003, buf: 0x7fa9daf21000, count: 0x00040000
[..]
which shows that even though the read buffer size is 0x40000, the size read
is just 0xff8. So the buffer read and returned is never more than a page.
Unless you are running on powerpc, where the page is likely 64K.
-- Steve
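The page-sized ceiling Steve observes matches how seq_file buffers its
output: seq_read_iter() starts with a one-page buffer and each read()
returns at most one buffer-full, no matter how large the user buffer is.
Roughly, from fs/seq_file.c (simplified from memory):

	/* first call: the output buffer is sized at a single page */
	if (!m->buf) {
		m->buf = seq_buf_alloc(m->size = PAGE_SIZE);
		if (!m->buf)
			goto Enomem;
	}

The buffer only grows when a single record cannot fit, which is why a
0x40000-byte read still comes back with just under 0x1000 bytes.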
On Wed, Mar 11, 2026 at 7:54 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Mar 11, 2026 at 07:52:47PM +0800, Yafang Shao wrote:
> > Recently, we resolved a latency spike issue caused by concurrently running
> > bpftrace processes. The root cause was high contention on the ftrace_lock
> > due to optimistic spinning. We can optimize this by disabling optimistic
> > spinning for ftrace_lock.
> >
> > While semaphores may present similar challenges, I'm not currently aware of
> > specific instances that exhibit this exact issue. Should we encounter
> > problematic semaphores in production workloads, we can address them at that
> > time.
> >
> > PATCH #1: introduce slow_mutex_[un]lock to disable optimistic spinning
> > PATCH #2: add variant for rtmutex
> > PATCH #3: disable optimistic spinning for ftrace_lock
> >
>
> So I really utterly hate this.

Do you have any other suggestions for optimizing this?

--
Regards
Yafang