hi,
while poking the multi-tracing interface I ended up with just one
ftrace_ops object to attach all trampolines.

This change allows future code to use fewer direct API calls during
the attachment changes, so in effect speeding up the attachment.

However having just a single ftrace_ops object removes the direct_call
field from ftrace_ops, which is needed by arm, so I'm not sure it's
the right path forward.

Mark, Florent,
any idea how hard it would be for arm to get rid of the direct_call field?

thoughts?

thanks,
jirka

---
Jiri Olsa (10):
      ftrace: Make alloc_and_copy_ftrace_hash direct friendly
      ftrace: Add register_ftrace_direct_hash function
      ftrace: Add unregister_ftrace_direct_hash function
      ftrace: Add modify_ftrace_direct_hash function
      ftrace: Export some of hash related functions
      ftrace: Use direct hash interface in direct functions
      bpf: Add trampoline ip hash table
      ftrace: Factor ftrace_ops ops_func interface
      bpf: Remove ftrace_ops from bpf_trampoline object
      Revert "ftrace: Store direct called addresses in their ops"

 include/linux/bpf.h           |   8 +-
 include/linux/ftrace.h        |  51 ++++++++++---
 kernel/bpf/trampoline.c       |  94 +++++++++++++-----------
 kernel/trace/ftrace.c         | 481 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------------------
 kernel/trace/trace.h          |   8 --
 kernel/trace/trace_selftest.c |   5 +-
 6 files changed, 395 insertions(+), 252 deletions(-)
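[For context: the per-ops direct interface the series consolidates is
roughly the following today. This is a sketch of current mainline usage
with invented variable names (tr_ops, trampoline, func_ip); exact
signatures may differ across kernel versions:]

	/* today: one ftrace_ops per BPF trampoline, attached per-ops */
	ftrace_set_filter_ip(&tr_ops, func_ip, 0, 0);
	register_ftrace_direct(&tr_ops, (unsigned long)trampoline);

	/* ... and later torn down per-ops as well */
	unregister_ftrace_direct(&tr_ops, (unsigned long)trampoline,
				 false /* free_filters */);

[So attaching N functions through N trampolines pays the registration
and synchronization cost N times, which is what sharing one ftrace_ops
and one hash is meant to avoid.]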
Hi Jiri,

[adding some powerpc and riscv folk, see below]

On Tue, Jul 29, 2025 at 12:28:03PM +0200, Jiri Olsa wrote:
> hi,
> while poking the multi-tracing interface I ended up with just one
> ftrace_ops object to attach all trampolines.
>
> This change allows future code to use fewer direct API calls during
> the attachment changes, so in effect speeding up the attachment.

How important is that, and what sort of speedup does this result in? I
ask due to potential performance hits noted below, and I'm lacking
context as to why we want to do this in the first place -- what is this
intended to enable/improve?

> However having just a single ftrace_ops object removes the direct_call
> field from ftrace_ops, which is needed by arm, so I'm not sure it's
> the right path forward.

It's also needed by powerpc and riscv since commits:

  a52f6043a2238d65 ("powerpc/ftrace: Add support for DYNAMIC_FTRACE_WITH_DIRECT_CALLS")
  b21cdb9523e5561b ("riscv: ftrace: support direct call using call_ops")

> Mark, Florent,
> any idea how hard it would be for arm to get rid of the direct_call field?

For architectures which follow the arm64 style of implementation, it's
pretty hard to get rid of it without introducing a performance hit to
the call and/or a hit to attachment/detachment/modification. It would
also end up being a fair amount more complicated.

There's some historical rationale at:

  https://lore.kernel.org/lkml/ZfBbxPDd0rz6FN2T@FVFF77S0Q05N/

... but the gist is that for several reasons we want the ops pointer at
the callsite, and for atomic modification of it to switch everything
dependent on that ops atomically, as this keeps the call logic and
attachment/detachment/modification logic simple and pretty fast.

If we remove the direct_call pointer from the ftrace_ops, then IIUC our
options include:

* Point the callsite pointer at some intermediate structure which points
  to the ops (e.g. the dyn_ftrace for the callsite). That introduces an
  additional dependent load per call that needs the ops, and introduces
  potential incoherency with other fields in that structure, requiring
  more synchronization overhead for attachment/detachment/modification.

* Point the callsite pointer at a trampoline which can generate the ops
  pointer. This requires that every ops has a trampoline even for
  non-direct usage, which then requires more memory / I$, has more
  potential failure points, and is generally more complicated. The
  performance here will vary by architecture and platform; on some this
  might be faster, on some it might be slower.

  Note that we probably still need to bounce through an intermediary
  trampoline here to actually load from the callsite pointer and
  indirectly branch to it.

... but I'm not really keen on either unless we really have to remove
the ftrace_ops::direct_call field, since they come with a substantial
jump in complexity.

Mark.
On Tue, Jul 29, 2025 at 06:57:40PM +0100, Mark Rutland wrote:
> Hi Jiri,
>
> [adding some powerpc and riscv folk, see below]
>
> On Tue, Jul 29, 2025 at 12:28:03PM +0200, Jiri Olsa wrote:
> > hi,
> > while poking the multi-tracing interface I ended up with just one
> > ftrace_ops object to attach all trampolines.
> >
> > This change allows future code to use fewer direct API calls during
> > the attachment changes, so in effect speeding up the attachment.
>
> How important is that, and what sort of speedup does this result in? I
> ask due to potential performance hits noted below, and I'm lacking
> context as to why we want to do this in the first place -- what is this
> intended to enable/improve?

so it's all work at PoC stage, the idea is to be able to attach many
(like 20,30,40k) functions to their trampolines quickly, which at the
moment is slow because all the involved interfaces work with just a
single function/trampoline relation

there's ongoing development by Menglong [1] to organize such attachment
for multiple functions and trampolines, but still at the end we have to
use the ftrace direct interface to do the attachment for each involved
ftrace_ops

so at the point of attachment it helps to have as few ftrace_ops objects
as possible, in my test code I ended up with just a single ftrace_ops
object and I see attachment time for 20k functions to be around 3 seconds

IIUC Menglong's change needs 12 ftrace_ops objects, so we need to do
around 12 ftrace_ops direct calls .. so probably not that bad, but still
it would be faster with just a single ftrace_ops involved

[1] https://lore.kernel.org/bpf/20250703121521.1874196-1-dongml2@chinatelecom.cn/

> > However having just a single ftrace_ops object removes the direct_call
> > field from ftrace_ops, which is needed by arm, so I'm not sure it's
> > the right path forward.
>
> It's also needed by powerpc and riscv since commits:
>
>   a52f6043a2238d65 ("powerpc/ftrace: Add support for DYNAMIC_FTRACE_WITH_DIRECT_CALLS")
>   b21cdb9523e5561b ("riscv: ftrace: support direct call using call_ops")
>
> > Mark, Florent,
> > any idea how hard it would be for arm to get rid of the direct_call field?
>
> For architectures which follow the arm64 style of implementation, it's
> pretty hard to get rid of it without introducing a performance hit to
> the call and/or a hit to attachment/detachment/modification. It would
> also end up being a fair amount more complicated.
>
> There's some historical rationale at:
>
>   https://lore.kernel.org/lkml/ZfBbxPDd0rz6FN2T@FVFF77S0Q05N/
>
> ... but the gist is that for several reasons we want the ops pointer at
> the callsite, and for atomic modification of it to switch everything
> dependent on that ops atomically, as this keeps the call logic and
> attachment/detachment/modification logic simple and pretty fast.
>
> If we remove the direct_call pointer from the ftrace_ops, then IIUC our
> options include:
>
> * Point the callsite pointer at some intermediate structure which points
>   to the ops (e.g. the dyn_ftrace for the callsite). That introduces an
>   additional dependent load per call that needs the ops, and introduces
>   potential incoherency with other fields in that structure, requiring
>   more synchronization overhead for attachment/detachment/modification.
>
> * Point the callsite pointer at a trampoline which can generate the ops
>   pointer. This requires that every ops has a trampoline even for
>   non-direct usage, which then requires more memory / I$, has more
>   potential failure points, and is generally more complicated. The
>   performance here will vary by architecture and platform; on some this
>   might be faster, on some it might be slower.
>
>   Note that we probably still need to bounce through an intermediary
>   trampoline here to actually load from the callsite pointer and
>   indirectly branch to it.
>
> ... but I'm not really keen on either unless we really have to remove
> the ftrace_ops::direct_call field, since they come with a substantial
> jump in complexity.

ok, that sounds bad.. thanks for the details

Steven, please correct me if/when I'm wrong ;-)

IIUC on x86_64, IF there's just a single ftrace_ops defined for the
function, it will bypass the ftrace trampoline and call the function's
direct trampoline directly, like:

  <foo>:
    call direct_trampoline
    ...

IF there are other ftrace_ops 'users' on the same function, we execute
each of them like:

  <foo>:
    call ftrace_trampoline
      call ftrace_ops_1->func
      call ftrace_ops_2->func
      ...

with our direct ftrace_ops->func currently using ftrace_ops->direct_call
to return the direct trampoline for the function:

-static void call_direct_funcs(unsigned long ip, unsigned long pip,
-			      struct ftrace_ops *ops, struct ftrace_regs *fregs)
-{
-	unsigned long addr = READ_ONCE(ops->direct_call);
-
-	if (!addr)
-		return;
-
-	arch_ftrace_set_direct_caller(fregs, addr);
-}

in the new changes it will do a hash lookup (based on ip) for the direct
trampoline we want to execute:

+static void call_direct_funcs_hash(unsigned long ip, unsigned long pip,
+				   struct ftrace_ops *ops, struct ftrace_regs *fregs)
+{
+	unsigned long addr;
+
+	addr = ftrace_find_rec_direct(ip);
+	if (!addr)
+		return;
+
+	arch_ftrace_set_direct_caller(fregs, addr);
+}

still this is the slow path for the case where multiple ftrace_ops
objects use the same function.. for the fast path we have the direct
attachment as described above

sorry I probably forgot/missed discussion on this, but doing the fast
path like in x86_64 is not an option on arm, right?

thanks,
jirka
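[For reference, the hash lookup mentioned above, ftrace_find_rec_direct(),
is roughly the following walk of the global direct_functions hash, keyed
by callsite ip; sketched from kernel/trace/ftrace.c, details vary by
kernel version:]

	unsigned long ftrace_find_rec_direct(unsigned long ip)
	{
		struct ftrace_func_entry *entry;

		/* direct_functions maps callsite ip -> direct trampoline */
		entry = __ftrace_lookup_ip(direct_functions, ip);
		if (!entry)
			return 0;

		return entry->direct;
	}

[i.e. the per-call cost of the hash path is a hash bucket walk plus the
dependent loads, versus a single READ_ONCE(ops->direct_call) today.]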
On Wed, Jul 30, 2025 at 01:19:51PM +0200, Jiri Olsa wrote:
> On Tue, Jul 29, 2025 at 06:57:40PM +0100, Mark Rutland wrote:
> > On Tue, Jul 29, 2025 at 12:28:03PM +0200, Jiri Olsa wrote:
> > > while poking the multi-tracing interface I ended up with just one
> > > ftrace_ops object to attach all trampolines.
> > [...]
> > How important is that, and what sort of speedup does this result in? I
> > ask due to potential performance hits noted below, and I'm lacking
> > context as to why we want to do this in the first place -- what is this
> > intended to enable/improve?
>
> so it's all work at PoC stage, the idea is to be able to attach many
> (like 20,30,40k) functions to their trampolines quickly, which at the
> moment is slow because all the involved interfaces work with just a
> single function/trampoline relation

Do you know which aspect of that is slow? e.g. is that because you have
to update each ftrace_ops independently, and pay the synchronization
overhead per-ops?

I ask because it might be possible to do some more batching there, at
least for architectures like arm64 that use the CALL_OPS approach.

> there's ongoing development by Menglong [1] to organize such attachment
> for multiple functions and trampolines, but still at the end we have to
> use the ftrace direct interface to do the attachment for each involved
> ftrace_ops
>
> so at the point of attachment it helps to have as few ftrace_ops objects
> as possible, in my test code I ended up with just a single ftrace_ops
> object and I see attachment time for 20k functions to be around 3 seconds
>
> IIUC Menglong's change needs 12 ftrace_ops objects, so we need to do
> around 12 ftrace_ops direct calls .. so probably not that bad, but still
> it would be faster with just a single ftrace_ops involved
>
> [1] https://lore.kernel.org/bpf/20250703121521.1874196-1-dongml2@chinatelecom.cn/

[...]

> > ... but I'm not really keen on either unless we really have to remove
> > the ftrace_ops::direct_call field, since they come with a substantial
> > jump in complexity.
>
> ok, that sounds bad.. thanks for the details
>
> Steven, please correct me if/when I'm wrong ;-)
>
> IIUC on x86_64, IF there's just a single ftrace_ops defined for the
> function, it will bypass the ftrace trampoline and call the function's
> direct trampoline directly, like:
>
>   <foo>:
>     call direct_trampoline
>     ...

More details at the end of this reply; arm64 can sometimes do this, but
not always, and even when there's a single ftrace_ops we may need to
bounce through a common trampoline (which can still be cheap).

> IF there are other ftrace_ops 'users' on the same function, we execute
> each of them like:
>
>   <foo>:
>     call ftrace_trampoline
>       call ftrace_ops_1->func
>       call ftrace_ops_2->func
>       ...

More details at the end of this reply; arm64 does essentially the same
thing via the ftrace_list_ops and ftrace_ops_list_func().

> with our direct ftrace_ops->func currently using ftrace_ops->direct_call
> to return the direct trampoline for the function:
>
> [...]

More details at the end of this reply; at present, when an instrumented
function has a single ops, arm64 can call ops->direct_call directly from
the common trampoline, and only needs to fall back to
call_direct_funcs() when there are multiple ops.

> in the new changes it will do a hash lookup (based on ip) for the direct
> trampoline we want to execute:
>
> [...]
>
> still this is the slow path for the case where multiple ftrace_ops
> objects use the same function.. for the fast path we have the direct
> attachment as described above
>
> sorry I probably forgot/missed discussion on this, but doing the fast
> path like in x86_64 is not an option on arm, right?

On arm64 we have a fast path, BUT branch range limitations mean that we
cannot always branch directly from the instrumented function to the
direct func with a single branch instruction. We use ops->direct_call to
handle that case within a common trampoline, which is significantly
cheaper than iterating over the ops and/or looking up the direct func
from a hash.
With CALL_OPS, we place a pointer to the ops immediately before the
instrumented function, and have the instrumented function branch to a
common trampoline which can load that pointer (and can then branch to
any direct func as necessary).

The instrumented function looks like:

	# Aligned to 8 bytes
	func - 8:
		< pointer to ops >
	func:
		BTI		// optional
		MOV	X9, LR	// save original return address
		NOP		// patched to `BL ftrace_caller`
	func_body:

... and then in ftrace_caller we can recover the 'ops' pointer with:

	BIC	<tmp>, LR, 0x7		// align down (skips BTI)
	LDR	<ops>, [<tmp>, #-16]	// load ops pointer

	LDR	<direct>, [<ops>, #FTRACE_OPS_DIRECT_CALL]	// load ops->direct_call
	CBNZ	<direct>, ftrace_caller_direct			// if !NULL, make direct call

	< fall through to non-direct func case here >

Having the ops (and ops->direct_call) means that getting to the direct
func is significantly cheaper than having to look up the direct func via
the hash.

Where an instrumented function has a single ops, this can get to the
direct func with a low constant overhead, significantly cheaper than
looking up the direct func via the hash.

Where an instrumented function has multiple ops, the ops pointer is
pointed at a common ftrace_list_ops, where ftrace_ops_list_func()
iterates over all the other relevant ops.

Mark.
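[A rough C restatement of the arm64 fast path above -- a hypothetical
helper for illustration only; the real code is assembly in
arch/arm64/kernel/entry-ftrace.S:]

	/* Illustration only: what ftrace_caller computes to find the
	 * direct trampoline for the current callsite. */
	static unsigned long callsite_direct_call(unsigned long lr)
	{
		struct ftrace_ops *ops;

		/* align LR down to 8 bytes (skips the optional BTI),
		 * then load the ops pointer placed at func - 8 */
		ops = *(struct ftrace_ops **)((lr & ~7UL) - 16);

		/* one dependent load gives the direct trampoline;
		 * zero means "no direct call, fall through" */
		return READ_ONCE(ops->direct_call);
	}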
On Fri, Aug 01, 2025 at 10:49:56AM +0100, Mark Rutland wrote:
> On Wed, Jul 30, 2025 at 01:19:51PM +0200, Jiri Olsa wrote:
> > so it's all work at PoC stage, the idea is to be able to attach many
> > (like 20,30,40k) functions to their trampolines quickly, which at the
> > moment is slow because all the involved interfaces work with just a
> > single function/trampoline relation
>
> Do you know which aspect of that is slow? e.g. is that because you have
> to update each ftrace_ops independently, and pay the synchronization
> overhead per-ops?
>
> I ask because it might be possible to do some more batching there, at
> least for architectures like arm64 that use the CALL_OPS approach.

IIRC it's the rcu sync in register_ftrace_direct and ftrace_shutdown,
I'll try to profile that case again, there might have been changes
since the last time we did that

[...]

> On arm64 we have a fast path, BUT branch range limitations mean that we
> cannot always branch directly from the instrumented function to the
> direct func with a single branch instruction. We use ops->direct_call to
> handle that case within a common trampoline, which is significantly
> cheaper than iterating over the ops and/or looking up the direct func
> from a hash.
>
> With CALL_OPS, we place a pointer to the ops immediately before the
> instrumented function, and have the instrumented function branch to a
> common trampoline which can load that pointer (and can then branch to
> any direct func as necessary).
>
> The instrumented function looks like:
>
> 	# Aligned to 8 bytes
> 	func - 8:
> 		< pointer to ops >

stupid question.. so there's a ftrace_ops pointer stored for each
function at 'func - 8'? why not store the func's direct trampoline
address in there?

> 	func:
> 		BTI		// optional
> 		MOV	X9, LR	// save original return address
> 		NOP		// patched to `BL ftrace_caller`
> 	func_body:
>
> ... and then in ftrace_caller we can recover the 'ops' pointer with:
>
> 	BIC	<tmp>, LR, 0x7		// align down (skips BTI)
> 	LDR	<ops>, [<tmp>, #-16]	// load ops pointer
>
> 	LDR	<direct>, [<ops>, #FTRACE_OPS_DIRECT_CALL]	// load ops->direct_call
> 	CBNZ	<direct>, ftrace_caller_direct			// if !NULL, make direct call
>
> 	< fall through to non-direct func case here >
>
> Having the ops (and ops->direct_call) means that getting to the direct
> func is significantly cheaper than having to look up the direct func via
> the hash.
>
> Where an instrumented function has a single ops, this can get to the
> direct func with a low constant overhead, significantly cheaper than
> looking up the direct func via the hash.
>
> Where an instrumented function has multiple ops, the ops pointer is
> pointed at a common ftrace_list_ops, where ftrace_ops_list_func()
> iterates over all the other relevant ops.

thanks for all the details, I'll check if both the new change and
ops->direct_call could live together for x86 and other archs, but it
will probably complicate things a lot more

jirka
On Sat, Aug 02, 2025 at 11:26:46PM +0200, Jiri Olsa wrote:
> On Fri, Aug 01, 2025 at 10:49:56AM +0100, Mark Rutland wrote:
> > Do you know which aspect of that is slow? e.g. is that because you have
> > to update each ftrace_ops independently, and pay the synchronization
> > overhead per-ops?
> >
> > I ask because it might be possible to do some more batching there, at
> > least for architectures like arm64 that use the CALL_OPS approach.
>
> IIRC it's the rcu sync in register_ftrace_direct and ftrace_shutdown,
> I'll try to profile that case again, there might have been changes
> since the last time we did that

Do you mean synchronize_rcu_tasks()?

The call in register_ftrace_direct() was removed in commit:

  33f137143e651321 ("ftrace: Use asynchronous grace period for register_ftrace_direct()")

... but in ftrace_shutdown() we still have a call to synchronize_rcu_tasks(),
and to synchronize_rcu_tasks_rude().

The call to synchronize_rcu_tasks() is still necessary, but we might be
able to batch that better with API changes.

I think we might be able to remove the call to
synchronize_rcu_tasks_rude() on architectures with ARCH_WANTS_NO_INSTR,
since there shouldn't be any instrumentable functions called with RCU
not watching. That'd need to be checked.

[...]

> > With CALL_OPS, we place a pointer to the ops immediately before the
> > instrumented function, and have the instrumented function branch to a
> > common trampoline which can load that pointer (and can then branch to
> > any direct func as necessary).
> >
> > The instrumented function looks like:
> >
> > 	# Aligned to 8 bytes
> > 	func - 8:
> > 		< pointer to ops >
>
> stupid question.. so there's a ftrace_ops pointer stored for each
> function at 'func - 8'? why not store the func's direct trampoline
> address in there?

One reason is that today we don't have trampolines for all ops. Since
branch range limitations can require bouncing through the common
trampoline, it's simpler/better to bounce from that to the regular call
than to bounce from that to a trampoline which makes the regular call.

We *could* consider adding trampolines, but that comes with a jump in
complexity that we originally tried to avoid, and a potential
performance hit for regular ftrace calls. IIUC that will require similar
synchronization to what we have today, so it's not clearly a win
generally.

I'd like to better understand what the real bottleneck is; AFAICT it's
the tasks-rcu synchronization, and sharing the hash means that you only
need to do that once. I think that it should be possible to share that
synchronization across multiple ops updates with some API changes (e.g.
something like the batching of text_poke on x86). If we can do that, it
might benefit other users too (e.g. live-patching), even if trampolines
aren't being used, and would keep the arch bits simple/maintainable.

[...]

> thanks for all the details, I'll check if both the new change and
> ops->direct_call could live together for x86 and other archs, but it
> will probably complicate things a lot more

Thanks; please let me know if there are any challenges there!

Mark.
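[To make the batching idea concrete, one hypothetical shape for such an
API -- the names and signature below are invented for illustration, no
such interface exists today -- would pay the tasks-RCU grace period once
per batch instead of once per ops:]

	/* Hypothetical sketch only. The point is that the expensive
	 * synchronize_rcu_tasks() is paid once for the whole batch
	 * rather than once per ftrace_ops. */
	int unregister_ftrace_direct_batch(struct ftrace_ops **ops, int nr_ops)
	{
		int i, err;

		for (i = 0; i < nr_ops; i++) {
			/* hypothetical helper: patch out callsites for
			 * this ops without waiting for a grace period */
			err = __unregister_ftrace_direct_nosync(ops[i]);
			if (err)
				return err;
		}

		/* one grace period covers all callsites patched above */
		synchronize_rcu_tasks();

		for (i = 0; i < nr_ops; i++)
			__ftrace_direct_free(ops[i]);	/* hypothetical */

		return 0;
	}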
On Wed, Aug 06, 2025 at 11:20:08AM +0100, Mark Rutland wrote:
> On Sat, Aug 02, 2025 at 11:26:46PM +0200, Jiri Olsa wrote:
> > IIRC it's the rcu sync in register_ftrace_direct and ftrace_shutdown,
> > I'll try to profile that case again, there might have been changes
> > since the last time we did that
>
> Do you mean synchronize_rcu_tasks()?
>
> The call in register_ftrace_direct() was removed in commit:
>
>   33f137143e651321 ("ftrace: Use asynchronous grace period for register_ftrace_direct()")
>
> ... but in ftrace_shutdown() we still have a call to synchronize_rcu_tasks(),
> and to synchronize_rcu_tasks_rude().
>
> The call to synchronize_rcu_tasks() is still necessary, but we might be
> able to batch that better with API changes.

[...]

> I'd like to better understand what the real bottleneck is; AFAICT it's
> the tasks-rcu synchronization, and sharing the hash means that you only
> need to do that once. I think that it should be possible to share that
> synchronization across multiple ops updates with some API changes (e.g.
> something like the batching of text_poke on x86).

yea, so rcu does not seem to be the cause anymore (IIRC that was the
case some time ago), it looks like now the time is spent in the ftrace
internals that iterate and update call sites

the test was a loop of attach/detach of an fentry program:

    31.48%  test_progs       [kernel.kallsyms]  [k] ftrace_replace_code
    10.98%  test_progs       [kernel.kallsyms]  [k] __ftrace_hash_update_ipmodify
     6.41%  test_progs       [kernel.kallsyms]  [k] __ftrace_hash_rec_update
     4.69%  test_progs       [kernel.kallsyms]  [k] ftrace_check_record
     4.59%  test_progs       [kernel.kallsyms]  [k] ftrace_lookup_ip
     3.65%  swapper          [kernel.kallsyms]  [k] acpi_os_read_port
     3.40%  test_progs       [kernel.kallsyms]  [k] srso_alias_return_thunk
     2.97%  test_progs       [kernel.kallsyms]  [k] srso_alias_safe_ret
     2.67%  test_progs       [kernel.kallsyms]  [k] ftrace_rec_iter_record
     2.05%  test_progs       [kernel.kallsyms]  [k] ftrace_test_record
     1.83%  test_progs       [kernel.kallsyms]  [k] ftrace_rec_iter_next
     1.76%  test_progs       [kernel.kallsyms]  [k] smp_call_function_many_cond
     1.05%  rcu_tasks_kthre  [kernel.kallsyms]  [k] rcu_tasks_pertask
     0.70%  test_progs       [kernel.kallsyms]  [k] btf_find_by_name_kind
     0.61%  swapper          [kernel.kallsyms]  [k] srso_alias_safe_ret
     0.55%  swapper          [kernel.kallsyms]  [k] io_idle

so by sharing the hash we do that (iterate and update functions) just once

jirka

---
[full perf call graph trimmed; the expanded breakdown shows every hot
ftrace symbol above splitting across the same two call chains, the
attach and detach sides of the fentry test loop:]

  attach side:
    ftrace_startup
    register_ftrace_function_nolock
    register_ftrace_direct
    bpf_trampoline_update
    __bpf_trampoline_link_prog
    bpf_trampoline_link_prog
    bpf_tracing_prog_attach
    bpf_raw_tp_link_attach
    __sys_bpf / syscall

  detach side:
    ftrace_shutdown.part.0
    unregister_ftrace_function
    unregister_ftrace_direct
    bpf_trampoline_update
    bpf_trampoline_unlink_prog
    bpf_tracing_link_release
    bpf_link_free / bpf_link_release
    __fput / __x64_sys_close

  the smp_call_function_many_cond samples come from
  smp_text_poke_batch_finish called via ftrace_modify_all_code (the
  text-poke IPIs), and the rcu_tasks_pertask samples from the tasks-RCU
  grace period kthread
On Wed, 30 Jul 2025 13:19:51 +0200
Jiri Olsa <olsajiri@gmail.com> wrote:

> so it's all work at the PoC stage, the idea is to be able to attach many
> (like 20,30,40k) functions to their trampolines quickly, which at the
> moment is slow because all the involved interfaces work with just a single
> function/trampoline relation

Sounds like you are reinventing the ftrace mechanism itself. I warned
against this when I first introduced direct trampolines, which were
purposely designed to handle a few functions, not thousands. But, oh well.

> Steven, please correct me if/when I'm wrong ;-)
>
> IIUC on x86_64, IF there's just a single ftrace_ops defined for the function,
> it will bypass the ftrace trampoline and call the function's direct
> trampoline directly, like:
>
> <foo>:
>   call direct_trampoline
>   ...

Yes.

And it will also do the same for normal ftrace functions. If you have:

  struct ftrace_ops {
          .func = myfunc;
  };

It will create a trampoline that has:

  <tramp>
          ...
          call myfunc
          ...
          ret

On x86, I believe the ftrace_ops for myfunc is added to the trampoline,
whereas on arm it's part of the function header. Modifying it requires
converting to the list operation (which ignores the ops parameter); then
the ops at the function gets changed before it is switched over to the
new function.

And if it is the only ops attached to a function foo, the function foo
would have:

  <foo>
          call tramp
          ...

But what's nice about this is that if you have 12 different ftrace_ops that
each attach to 1000 different functions, but no two ftrace_ops attach to
the same function, they all do the above. No hash needed!

>
> IF there are other ftrace_ops 'users' on the same function, we execute
> each of them like:
>
> <foo>:
>   call ftrace_trampoline
>     call ftrace_ops_1->func
>     call ftrace_ops_2->func
>     ...
>
> with our direct ftrace_ops->func currently using ftrace_ops->direct_call
> to return the direct trampoline for the function:
>
> -static void call_direct_funcs(unsigned long ip, unsigned long pip,
> -			      struct ftrace_ops *ops, struct ftrace_regs *fregs)
> -{
> -	unsigned long addr = READ_ONCE(ops->direct_call);
> -
> -	if (!addr)
> -		return;
> -
> -	arch_ftrace_set_direct_caller(fregs, addr);
> -}
>
> in the new changes it will do a hash lookup (based on ip) for the direct
> trampoline we want to execute:
>
> +static void call_direct_funcs_hash(unsigned long ip, unsigned long pip,
> +				   struct ftrace_ops *ops, struct ftrace_regs *fregs)
> +{
> +	unsigned long addr;
> +
> +	addr = ftrace_find_rec_direct(ip);
> +	if (!addr)
> +		return;
> +
> +	arch_ftrace_set_direct_caller(fregs, addr);
> +}

I think the above will work.

>
> still this is the slow path for the case where multiple ftrace_ops objects
> use the same function.. for the fast path we have the direct attachment as
> described above
>
> sorry I probably forgot/missed discussion on this, but doing the fast path
> like in x86_64 is not an option on arm, right?

That's a question for Mark, right?

-- Steve
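To make the "no hash needed" case above concrete, here is a minimal module
sketch of a lone ftrace_ops attached to a single function; when no other ops
traces that function, ftrace patches the callsite to call a trampoline
dedicated to this ops, so the callback runs without any runtime hash lookup.
The callback signature and the registration calls are the existing kernel
API; the module wrapper and the traced function name ("kernel_clone") are
illustrative assumptions, not taken from this thread.

	#include <linux/ftrace.h>
	#include <linux/module.h>
	#include <linux/string.h>

	/* Reached via the per-ops trampoline when this is the only
	 * ops attached to the filtered function. */
	static void my_callback(unsigned long ip, unsigned long parent_ip,
				struct ftrace_ops *ops, struct ftrace_regs *fregs)
	{
		/* no hash walk happened to get here */
	}

	static struct ftrace_ops my_ops = {
		.func = my_callback,
	};

	/* illustrative target; any traceable function entry works */
	static char func_name[] = "kernel_clone";

	static int __init my_init(void)
	{
		int err;

		/* restrict the ops to a single function, then register */
		err = ftrace_set_filter(&my_ops, func_name,
					strlen(func_name), 0);
		if (err)
			return err;

		return register_ftrace_function(&my_ops);
	}

	static void __exit my_exit(void)
	{
		unregister_ftrace_function(&my_ops);
	}

	module_init(my_init);
	module_exit(my_exit);
	MODULE_LICENSE("GPL");

With twelve such ops on disjoint sets of functions, every callsite still gets
the direct call into its own trampoline; that is the property a single shared
ops has to preserve for the direct case discussed below.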
On Wed, Jul 30, 2025 at 09:56:41AM -0400, Steven Rostedt wrote:
> On Wed, 30 Jul 2025 13:19:51 +0200
> Jiri Olsa <olsajiri@gmail.com> wrote:

[...]

> > still this is the slow path for the case where multiple ftrace_ops objects
> > use the same function.. for the fast path we have the direct attachment as
> > described above
> >
> > sorry I probably forgot/missed discussion on this, but doing the fast path
> > like in x86_64 is not an option on arm, right?
>
> That's a question for Mark, right?

yes, thanks for the other details

jirka
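For contrast with the per-function fast path, the following sketch shows the
shared-ops slow path being discussed: one ftrace_ops covers every trampolined
function and resolves the direct trampoline by ip at call time.
call_direct_funcs_hash() is quoted from the series; the shared_direct_ops
wiring and the attach helper built on the existing ftrace_set_filter_ip() are
assumptions for illustration (the series' own registration entry points may
look different).

	/* From the series: resolve the direct trampoline by ip. */
	static void call_direct_funcs_hash(unsigned long ip, unsigned long pip,
					   struct ftrace_ops *ops,
					   struct ftrace_regs *fregs)
	{
		unsigned long addr;

		addr = ftrace_find_rec_direct(ip);
		if (!addr)
			return;

		arch_ftrace_set_direct_caller(fregs, addr);
	}

	/* Illustrative: a single ops shared by all BPF trampolines. */
	static struct ftrace_ops shared_direct_ops = {
		.func = call_direct_funcs_hash,
	};

	/* Hypothetical attach step: instead of registering a whole new
	 * ftrace_ops per trampoline, extend the shared ops' filter with
	 * one more ip; the ip -> trampoline mapping must also be added
	 * to the direct hash that ftrace_find_rec_direct() consults.
	 * Arguments 0, 0 mean: add this ip, keep the existing filter. */
	static int attach_trampoline(unsigned long ip)
	{
		return ftrace_set_filter_ip(&shared_direct_ops, ip, 0, 0);
	}

The trade-off is the one debated above: attaching tens of thousands of
functions becomes filter-hash updates on a single ops rather than one
register_ftrace_direct() call per trampoline, while any callsite that falls
back to the list operation pays an extra ftrace_find_rec_direct() lookup on
every call.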