[PATCH bpf-next 00/25] bpf: tracing multi-link support
Posted by Menglong Dong 6 months, 3 weeks ago
After four months, I have finally finished the coding and testing of this
series. This is my first time writing such a complex series, and it was
hard :/ Anyway, I finished it.
(I'm scared :/)

For now, a BPF program of type BPF_PROG_TYPE_TRACING cannot be attached
to multiple hooks, so we have to create a separate BPF program for each
kernel function that we want to trace, even though all the programs have
the same (or similar) logic. This consumes extra memory and makes program
loading slow when there are plenty of kernel functions to trace.

In this series, we add support for attaching a tracing BPF program to
multiple hooks, similar to BPF_TRACE_KPROBE_MULTI.
Generally speaking, this series can be divided into 5 parts:

1. Add per-function metadata storage support.
2. Add bpf global trampoline support for x86_64.
3. Add bpf global trampoline link support.
4. Add tracing multi-link support.
5. Compatibility between tracing and tracing_multi.

per-function metadata storage
-----------------------------
The per-function metadata storage is the basis of the bpf global
trampoline. It has two modes: function padding mode and hash table mode.
When CONFIG_FUNCTION_METADATA_PADDING is enabled, the function padding
mode is used; it has higher performance with almost no overhead. In this
mode, we allocate a metadata array and store the index of the metadata
in the function padding.

The function padding can increase the text size, so it may not always be
enabled. When CONFIG_FUNCTION_METADATA_PADDING is not enabled, we fall
back to the hash table mode and store the function metadata in an hlist.
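
As a rough illustration, the padding-mode lookup boils down to something
like the sketch below. The helper name, the padding layout and the -4
offset are assumptions for illustration; the real code lives in
kernel/trace/kfunc_md.c.

/*
 * Illustrative sketch of the function padding mode, not the actual
 * implementation. Assumption: an index into the metadata array has
 * been patched into the padding bytes right before the function entry.
 */
struct kfunc_md {
	unsigned long	func;		/* address of the traced function */
	/* bpf progs, cookies, nr_args, ... */
};

static struct kfunc_md *kfunc_md_array;	/* grown on demand */

static struct kfunc_md *kfunc_md_find_padding(unsigned long ip)
{
	u32 index = *(u32 *)(ip - 4);	/* offset is an assumption */

	return &kfunc_md_array[index];
}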

The release of the metadata is a big difficulty in the function padding
mode. In this mode, we use a metadata array to store the metadata, and the
array needs to be enlarged when it is full. So we need to control not only
the release of the metadata, but also the release of the metadata array.
Per-cpu refs and RCU are used in these situations. The release of the
metadata is much easier in the hash table mode: we just need RCU to
release the metadata.
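
For the array release, enlarging the array can follow the usual RCU
replace-and-free pattern; below is a minimal sketch only, with
hypothetical kfunc_md_cnt / kfunc_md_lock globals (the real code
additionally uses a per-cpu ref to protect the entries themselves).

/*
 * Sketch only: enlarge the metadata array under RCU. Readers access
 * kfunc_md_array inside rcu_read_lock(), so the old array can only be
 * freed after a grace period. kfunc_md_cnt and kfunc_md_lock are
 * hypothetical names used for illustration.
 */
static int kfunc_md_array_grow(void)
{
	struct kfunc_md *old, *new;
	u32 new_cnt = kfunc_md_cnt * 2;

	new = kvzalloc(new_cnt * sizeof(*new), GFP_KERNEL);
	if (!new)
		return -ENOMEM;

	old = rcu_dereference_protected(kfunc_md_array,
					lockdep_is_held(&kfunc_md_lock));
	memcpy(new, old, kfunc_md_cnt * sizeof(*new));
	rcu_assign_pointer(kfunc_md_array, new);
	kfunc_md_cnt = new_cnt;

	synchronize_rcu();	/* wait for readers of the old array */
	kvfree(old);
	return 0;
}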

For now, the function padding mode is supported on x86_64 and arm64, but
the function metadata is only used by the bpf global trampoline on x86_64.
So maybe we don't need to support the function padding mode on arm64 in
this series?

The performance comparison between the function padding mode and the hash
table mode can be found in the following commit logs. As Alexei pointed
out in [1], we can fall back to the hash table mode when function padding
is not supported.

bpf global trampoline
---------------------
The bpf global trampoline is similar to the regular bpf trampoline. The
bpf trampoline stores the bpf progs and some metadata directly in the
trampoline instructions. The bpf global trampoline, however, gets the
metadata from the function metadata with kfunc_md_get_noref(). This makes
the bpf global trampoline more flexible, and a single instance can be used
for all kernel functions.

The bpf global trampoline is designed to implement the tracing multi-link
for FENTRY, FEXIT and MODIFY_RETURN. Sleepable bpf progs are not
supported, as we call __bpf_prog_enter_recur() and __bpf_prog_exit_recur()
directly in the bpf global trampoline; this can be optimized later. We
make __bpf_prog_{enter,exit}_recur() global so they can be used in the
global trampoline.
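
Stripped of the register save/restore and of the origin-call handling for
FEXIT/MODIFY_RETURN, the FENTRY path of the global caller is conceptually
something like the sketch below (the kfunc_md field names are assumptions;
__bpf_prog_enter_recur()/__bpf_prog_exit_recur() are the helpers mentioned
above):

/*
 * Conceptual sketch of the FENTRY path of bpf_global_caller, not the
 * real asm implementation. "ip" is the traced function's address
 * recovered from the stack, "args" points at the saved argument regs.
 */
static void gtramp_run_fentry_sketch(unsigned long ip, u64 *args)
{
	struct bpf_tramp_run_ctx run_ctx = {};
	struct kfunc_md *md;
	int i;

	md = kfunc_md_get_noref(ip);		/* per-function metadata */
	if (!md)
		return;

	for (i = 0; i < md->nr_progs; i++) {	/* field names assumed */
		struct bpf_prog *prog = md->progs[i].prog;
		u64 start;

		run_ctx.bpf_cookie = md->progs[i].cookie;
		start = __bpf_prog_enter_recur(prog, &run_ctx);
		if (start)			/* 0 means recursion, skip */
			prog->bpf_func(args, prog->insnsi);
		__bpf_prog_exit_recur(prog, start, &run_ctx);
	}
}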

The overhead of the bpf global trampoline is slightly higher than that of
the function trampoline, as we need to prepare everything we need on the
stack. For example, we store the address of the traced function on the
stack for bpf_get_func_ip(), even if it is not used.

As Steven mentioned in [1], we need to mitigate Spectre, as we use an
indirect call in the bpf global trampoline. I haven't fully understood
Spectre yet, and I make the indirect call with CALL_NOSPEC just like other
code does. Is that enough to prevent Spectre?

bpf global trampoline link
--------------------------
We reuse part of the code in [2] to implement the tracing multi-link. The
struct bpf_gtramp_link is introduced for the bpf global trampoline link.
Similar to the bpf trampoline link, the bpf global trampoline link has
bpf_gtrampoline_link_prog() and bpf_gtrampoline_unlink_prog() to link and
unlink the bpf progs.

The "entries" in the bpf_gtramp_link is a array of struct
bpf_gtramp_link_entry, which contain all the information of the functions
that we trace, such as the address, the number of args, the cookie and so
on.
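
Roughly, the new link structures look like the sketch below; only the
fields named in this cover letter are certain, the rest is illustrative:

/* Sketch of the global trampoline link structures. */
struct bpf_gtramp_link_entry {
	void		*addr;		/* address of the traced function */
	int		nr_args;	/* number of function arguments */
	u64		cookie;		/* bpf cookie for this target */
	/* btf, module reference, ... */
};

struct bpf_gtramp_link {
	struct bpf_link			link;
	struct bpf_gtramp_link_entry	*entries;
	u32				entry_cnt;	/* name assumed */
};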

The bpf global trampoline is much simpler than the bpf trampoline, and we
introduce the new struct bpf_global_trampoline for it. The "image" field
is a pointer to bpf_global_caller. We implement the global trampoline on
top of direct ftrace, and the "fops" field exists for this purpose. This
means bpf2bpf is not supported by the tracing multi-link.

When we link a bpf prog, we add it to all the target functions'
kfunc_md. Then we get all the function addresses that have bpf progs with
kfunc_md_bpf_ips(), and reset the ftrace filter of the fops to them.
Direct ftrace doesn't support resetting the filter functions yet, so we
introduce reset_ftrace_direct_ips() to do this.
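
Put together, linking a prog is conceptually the following sketch;
kfunc_md_add_prog() and the exact signatures of kfunc_md_bpf_ips() /
reset_ftrace_direct_ips() are assumptions for illustration:

/*
 * Conceptual sketch of bpf_gtrampoline_link_prog(): publish the prog in
 * every target's kfunc_md, then refresh the direct ftrace filter so the
 * global caller is attached to exactly the functions that have progs.
 */
static int gtramp_link_prog_sketch(struct bpf_gtramp_link *link)
{
	unsigned long *ips;
	int i, cnt, err;

	for (i = 0; i < link->entry_cnt; i++) {
		err = kfunc_md_add_prog(link->entries[i].addr,	/* assumed */
					link->link.prog,
					link->entries[i].cookie);
		if (err)
			return err;
	}

	/* every function address that currently has bpf progs attached */
	cnt = kfunc_md_bpf_ips(&ips);
	if (cnt < 0)
		return cnt;

	err = reset_ftrace_direct_ips(&gtramp_fops, ips, cnt);
	kfree(ips);
	return err;
}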

We use a global lock to protect the global trampoline, and we need to hold
it when we link or unlink a bpf prog. The global lock is a rw_semaphore.
In fact, a mutex would be enough here; the rw_semaphore is needed by the
later patches that make tracing_multi compatible with tracing.

tracing multi-link
------------------
Most of the code of this part comes from the series [2].

In the 10th patch, we add support to record the indexes of the accessed
function args of the target for tracing programs. Meanwhile, we add the
function btf_check_func_part_match() to compare the accessed function args
of two function prototypes. This function is used in a later commit.

In the 11th patch, we refactor the struct modules_array into ptr_array, as
we need similar functionality to hold the target btfs, target programs and
kernel modules that we reference in the following commits.

In the 15th patch, we implement the multi-link support for tracing, and
the following new attach types are added:

  BPF_TRACE_FENTRY_MULTI
  BPF_TRACE_FEXIT_MULTI
  BPF_MODIFY_RETURN_MULTI

We introduce the struct bpf_tracing_multi_link for this purpose, which
can hold all the kernel modules, target bpf programs (for attaching to bpf
programs) or target btfs (for attaching to kernel functions) that we
reference.

During loading, the first target is used for verification by the verifier,
and during attaching, we check the consistency of all the targets against
the first target.
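
On the BPF side, a tracing_multi program then looks roughly like the
sketch below. The "fentry.multi/" section prefix and the wildcard are
assumptions modeled on the kprobe.multi convention; see the libbpf patches
for the actual one.

// SPDX-License-Identifier: GPL-2.0
/* Sketch of a FENTRY_MULTI program; the section name is an assumption. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

__u64 hits;

SEC("fentry.multi/bpf_fentry_test*")
int BPF_PROG(count_hits)
{
	__sync_fetch_and_add(&hits, 1);
	return 0;
}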

Compatibility between tracing and tracing_multi
------------------------------------------------
tracing_multi is not compatible with tracing without the 16th-18th
patches. For example, attaching FENTRY_MULTI to a function on which FENTRY
already exists will fail, and attaching FENTRY will also fail if
FENTRY_MULTI exists.

Generally speaking, we replace the global trampoline with the bpf
trampoline if both of them exist on the same function. The function
replace_ftrace_direct() is added for this; it replaces the direct
ftrace_ops with another one for all the filter functions in the source
ftrace_ops.

The most difficult part is the synchronization between
bpf_gtrampoline_link_prog and bpf_trampoline_link_prog, and we use a
rw_semaphore here, which is quite ugly. We hold the write lock in
bpf_gtrampoline_link_prog and the read lock in bpf_trampoline_link_prog.

We introduce the function bpf_gtrampoline_link_tramp() to make
bpf_gtramp_link fit the bpf_trampoline; it is called from
bpf_gtrampoline_link_prog(). If the bpf_trampoline of the function exists
in the kfunc_md, or we find it with bpf_trampoline_lookup_exist(), it
means that we need to do the fitting. The fitting is simple: we create a
bpf_shim_tramp_link for our prog and link it to the bpf_trampoline with
__bpf_trampoline_link_prog().

The bpf_trampoline_link_prog() case is a little more complex. We create a
bpf_shim_tramp_link for each bpf prog in the kfunc_md and add them to the
bpf_trampoline before we call __bpf_trampoline_link_prog() in
bpf_gtrampoline_override(). We fall back in
bpf_gtrampoline_override_finish() if an error is returned by
__bpf_trampoline_link_prog().

Another solution is to fit into the existing trampoline. For example, we
can add the bpf prog to the kfunc_md if a tracing_multi bpf prog is
already attached to the target function when we attach a tracing bpf prog.
And we can also add the tracing_multi prog to the trampoline if a tracing
prog already exists on the target function. I think this would make the
compatibility much easier.

The code in this part is quite ugly and messy, and I think it would be a
relief to split it out into another series :/

Performance comparison
----------------------
We have implemented the performance tests in the selftests, in
test_tracing_multi_bench(), and you can get the results by running:

  ./test_progs -t tracing_multi_bench -v | grep time

In this testcase, bpf_fentry_test1() is called 10000000 times, and the
time consumed is returned and printed. The following cases are
considered:

- nop: nothing is attached to bpf_fentry_test1()
- fentry: an empty FENTRY bpf program is attached to bpf_fentry_test1()
- fentry_multi_single: an empty FENTRY_MULTI bpf program is attached to
  bpf_fentry_test1().
  We alias it as "fm_single" in the following.
- fentry_multi_all: an empty FENTRY_MULTI bpf program is attached to all
  the kernel functions.
  We alias it as "fm_all" in the following.
- kprobe_multi_single: an empty KPROBE_MULTI bpf program is attached to
  bpf_fentry_test1().
  We alias it as "km_single" in the following.
- kprobe_multi_all: an empty KPROBE_MULTI bpf program is attached to all
  the kernel functions.
  We alias it as "km_all" in the following.

The "fentry_multi_all" is used to test the performance of the hash table
mode of the function metadata.

Different kconfigs can affect the performance; the base kconfig we use
comes from Debian 12.

no-mitigate + hash table mode
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
nop     | fentry    | fm_single | fm_all    | km_single | km_all
9.014ms | 162.378ms | 180.511ms | 446.286ms | 220.634ms | 1465.133ms
9.038ms | 161.600ms | 178.757ms | 445.807ms | 220.656ms | 1463.714ms
9.048ms | 161.435ms | 180.510ms | 452.530ms | 220.943ms | 1487.494ms
9.030ms | 161.585ms | 178.699ms | 448.167ms | 220.107ms | 1463.785ms
9.056ms | 161.530ms | 178.947ms | 445.609ms | 221.026ms | 1560.584ms

Mitigations are enabled by default in the kernel, and we can disable them
with the "mitigations=off" cmdline for the testing.

The count of the kernel functions that we trace in the fentry_multi_all
and kprobe_multi_all testcases is 43871, which can be obtained by running:

$ ./test_progs -t trace_bench -v | grep before
attach 43871 functions before testings
attach 43871 functions before testings

However, we use hlist_add_tail_rcu() in the kfunc_md, and bpf_fentry_test1
is the 39425th function in
/sys/kernel/debug/tracing/available_filter_functions, which means
bpf_fentry_test1 sits near the tail of its hlist bucket. So we can treat
the total number of traced functions as 79k (39425 * 2) to make the hash
table lookup comparison fair enough.
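
For reference, the hash table mode lookup described here is a fixed-size
hlist walk along these lines (a sketch; the bucket count of 1024 is
discussed further down in the thread):

/*
 * Sketch of the hash-table-mode lookup. With a fixed bucket count and
 * ~79k traced functions, each lookup walks a long chain, which is what
 * the fm_all numbers below reflect.
 */
#define KFUNC_MD_HASH_BITS	10	/* 1024 buckets */

static struct hlist_head kfunc_md_table[1 << KFUNC_MD_HASH_BITS];

static struct kfunc_md *kfunc_md_find_hash(unsigned long ip)
{
	struct hlist_head *head;
	struct kfunc_md *md;

	head = &kfunc_md_table[hash_ptr((void *)ip, KFUNC_MD_HASH_BITS)];
	hlist_for_each_entry_rcu(md, head, hash)	/* member name assumed */
		if (md->func == ip)
			return md;
	return NULL;
}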

Note that the performance of fentry can vary significantly with different
kconfigs. When the kernel is compiled with the x86_64 "tinyconfig", the
fentry case takes about 80ms, and fentry_multi about 95ms.

mitigate + hash table mode
^^^^^^^^^^^^^^^^^^^^^^^^^^^
nop      | fentry    | fm_single | fm_all     | km_single  | km_all
37.348ms | 753.621ms | 844.110ms | 1033.151ms | 1033.814ms | 2403.759ms
37.439ms | 753.894ms | 843.922ms | 1033.182ms | 1034.066ms | 2364.879ms
37.420ms | 754.802ms | 844.430ms | 1046.192ms | 1035.357ms | 2368.233ms
37.436ms | 754.051ms | 844.831ms | 1035.204ms | 1034.431ms | 2252.827ms
37.425ms | 753.802ms | 844.714ms | 1106.462ms | 1034.119ms | 2252.217ms

The extra overhead of fentry_multi over fentry is much higher with
mitigations enabled, and I think that is because we make one more function
call in fentry_multi, namely kfunc_md_get_noref(). What's more, the
mitigation of the indirect call in the bpf global trampoline also
increases the overhead. We can still do some optimizations in the bpf
global trampoline.

I think this is what Alexei meant in [1]: the performance suffers from
the mitigations anyway, so the indirect call doesn't matter in this case.

no-mitigate + function padding mode
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
nop     | fentry    | fm_single | fm_all    | km_single | km_all
9.320ms | 166.454ms | 184.094ms | 193.884ms | 227.320ms | 1441.462ms
9.326ms | 166.651ms | 183.954ms | 193.912ms | 227.503ms | 1544.634ms
9.313ms | 170.501ms | 183.985ms | 191.738ms | 227.801ms | 1441.284ms
9.311ms | 166.957ms | 182.086ms | 192.063ms | 410.411ms | 1489.665ms
9.329ms | 166.332ms | 182.196ms | 194.154ms | 227.443ms | 1511.272ms

The overhead of fentry_multi_all is a little higher than that of
fentry_multi_single. Maybe it is because the function
ktime_get_boottime_ns(), which is used in bpf_testmod_bench_run(), is also
traced? I haven't figured it out yet, but it doesn't matter :/

The overhead of fentry_multi_single is a little higher than that of
fentry, which makes sense, as we do more things in the bpf global
trampoline. In comparison, the bpf trampoline is simpler.

mitigate + function padding mode
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
nop      | fentry    | fm_single | fm_all    | km_single  | km_all
37.340ms | 754.659ms | 840.295ms | 849.632ms | 1043.997ms | 2180.874ms
37.543ms | 753.809ms | 840.535ms | 849.746ms | 1034.481ms | 2355.085ms
37.442ms | 753.803ms | 840.797ms | 850.012ms | 1034.462ms | 2276.567ms
37.501ms | 753.931ms | 840.789ms | 850.938ms | 1035.594ms | 2218.350ms
37.700ms | 753.714ms | 840.875ms | 851.614ms | 1034.465ms | 2329.307ms

conclusion
^^^^^^^^^^
The performance of fentry_multi is close to fentry in the function padding
mode. In the hash table mode, the time consumed by fentry_multi can be
276% of fentry's when the number of traced functions is 79k.
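(Derived from the first rows of the tables above: 446.286ms / 162.378ms is
about 2.75 in hash table mode, versus 193.884ms / 166.454ms, about 1.16,
in function padding mode.)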

Link: https://lore.kernel.org/all/20250303132837.498938-1-dongml2@chinatelecom.cn/ [1]
Link: https://lore.kernel.org/bpf/20240311093526.1010158-1-dongmenglong.8@bytedance.com/ [2]
Menglong Dong (25):
  add per-function metadata storage support
  x86: implement per-function metadata storage for x86
  arm64: implement per-function metadata storage for arm64
  bpf: make kfunc_md support global trampoline link
  x86,bpf: add bpf_global_caller for global trampoline
  ftrace: factor out ftrace_direct_update from register_ftrace_direct
  ftrace: add reset_ftrace_direct_ips
  bpf: introduce bpf_gtramp_link
  bpf: tracing: add support to record and check the accessed args
  bpf: refactor the modules_array to ptr_array
  bpf: verifier: add btf to the function args of bpf_check_attach_target
  bpf: verifier: move btf_id_deny to bpf_check_attach_target
  x86,bpf: factor out __arch_get_bpf_regs_nr
  bpf: tracing: add multi-link support
  ftrace: factor out __unregister_ftrace_direct
  ftrace: supporting replace direct ftrace_ops
  bpf: make trampoline compatible with global trampoline
  libbpf: don't free btf if tracing_multi progs existing
  libbpf: support tracing_multi
  libbpf: add btf type hash lookup support
  libbpf: add skip_invalid and attach_tracing for tracing_multi
  selftests/bpf: use the glob_match() from libbpf in test_progs.c
  selftests/bpf: add get_ksyms and get_addrs to test_progs.c
  selftests/bpf: add testcases for multi-link of tracing
  selftests/bpf: add performance bench test for trace prog

 arch/arm64/Kconfig                            |  21 +
 arch/arm64/Makefile                           |  23 +-
 arch/arm64/include/asm/ftrace.h               |  34 +
 arch/arm64/kernel/ftrace.c                    |  13 +-
 arch/x86/Kconfig                              |  30 +
 arch/x86/include/asm/alternative.h            |   2 +
 arch/x86/include/asm/ftrace.h                 |  47 ++
 arch/x86/kernel/asm-offsets.c                 |  15 +
 arch/x86/kernel/callthunks.c                  |   2 +-
 arch/x86/kernel/ftrace.c                      |  26 +
 arch/x86/kernel/ftrace_64.S                   | 231 ++++++
 arch/x86/net/bpf_jit_comp.c                   |  36 +-
 include/linux/bpf.h                           |  67 ++
 include/linux/bpf_types.h                     |   1 +
 include/linux/bpf_verifier.h                  |   1 +
 include/linux/ftrace.h                        |  15 +
 include/linux/kfunc_md.h                      |  63 ++
 include/uapi/linux/bpf.h                      |  10 +
 kernel/bpf/btf.c                              | 113 ++-
 kernel/bpf/syscall.c                          | 409 ++++++++++-
 kernel/bpf/trampoline.c                       | 476 +++++++++++-
 kernel/bpf/verifier.c                         | 161 ++--
 kernel/trace/Makefile                         |   1 +
 kernel/trace/bpf_trace.c                      |  48 +-
 kernel/trace/ftrace.c                         | 288 ++++++--
 kernel/trace/kfunc_md.c                       | 689 ++++++++++++++++++
 net/bpf/test_run.c                            |   3 +
 net/core/bpf_sk_storage.c                     |   2 +
 tools/bpf/bpftool/common.c                    |   3 +
 tools/include/uapi/linux/bpf.h                |  10 +
 tools/lib/bpf/bpf.c                           |  10 +
 tools/lib/bpf/bpf.h                           |   6 +
 tools/lib/bpf/btf.c                           | 102 +++
 tools/lib/bpf/btf.h                           |   6 +
 tools/lib/bpf/libbpf.c                        | 288 +++++++-
 tools/lib/bpf/libbpf.h                        |  25 +
 tools/lib/bpf/libbpf.map                      |   5 +
 tools/testing/selftests/bpf/Makefile          |   2 +-
 .../selftests/bpf/prog_tests/fentry_fexit.c   |  22 +-
 .../selftests/bpf/prog_tests/fentry_test.c    |  79 +-
 .../selftests/bpf/prog_tests/fexit_test.c     |  79 +-
 .../bpf/prog_tests/kprobe_multi_test.c        | 220 +-----
 .../selftests/bpf/prog_tests/modify_return.c  |  60 ++
 .../selftests/bpf/prog_tests/trace_bench.c    | 149 ++++
 .../bpf/prog_tests/tracing_multi_link.c       | 276 +++++++
 .../selftests/bpf/progs/fentry_empty.c        |  13 +
 .../selftests/bpf/progs/fentry_multi_empty.c  |  13 +
 .../testing/selftests/bpf/progs/trace_bench.c |  21 +
 .../bpf/progs/tracing_multi_override.c        |  28 +
 .../selftests/bpf/progs/tracing_multi_test.c  | 181 +++++
 .../selftests/bpf/test_kmods/bpf_testmod.c    |  40 +
 tools/testing/selftests/bpf/test_progs.c      | 349 ++++++++-
 tools/testing/selftests/bpf/test_progs.h      |   5 +
 53 files changed, 4334 insertions(+), 485 deletions(-)
 create mode 100644 include/linux/kfunc_md.h
 create mode 100644 kernel/trace/kfunc_md.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/trace_bench.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tracing_multi_link.c
 create mode 100644 tools/testing/selftests/bpf/progs/fentry_empty.c
 create mode 100644 tools/testing/selftests/bpf/progs/fentry_multi_empty.c
 create mode 100644 tools/testing/selftests/bpf/progs/trace_bench.c
 create mode 100644 tools/testing/selftests/bpf/progs/tracing_multi_override.c
 create mode 100644 tools/testing/selftests/bpf/progs/tracing_multi_test.c

-- 
2.39.5
Re: [PATCH bpf-next 00/25] bpf: tracing multi-link support
Posted by Alexei Starovoitov 6 months, 1 week ago
On Tue, May 27, 2025 at 8:49 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
>
> 1. Add per-function metadata storage support.
> 2. Add bpf global trampoline support for x86_64.
> 3. Add bpf global trampoline link support.
> 4. Add tracing multi-link support.
> 5. Compatibility between tracing and tracing_multi.

...

> ... and I think it will be a
> liberation to split it out to another series :/

There are lots of interesting ideas here and you know
already what the next step should be...
Split it into small chunks.
As presented it's hard to review and even if maintainers take on
that challenge the set is unlandable, since it spans various
subsystems.

In a small reviewable patch set we can argue about
approach A vs B while the current set has too many angles
to argue about.
Like the new concept of global trampoline.
It's nice to write bpf_global_caller() in asm
compared to arch_prepare_bpf_trampoline() that emits asm
on the fly, but it seems the only thing where it truly
needs asm is register save/restore. The rest can be done in C.
I suspect the whole gtramp can be written in C.
There is an attribute(interrupt) that all compilers support...
or use no attributes and inline asm for regs save/restore ?
or attribute(naked) and more inline asm ?

> no-mitigate + hash table mode
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> nop     | fentry    | fm_single | fm_all    | km_single | km_all
> 9.014ms | 162.378ms | 180.511ms | 446.286ms | 220.634ms | 1465.133ms
> 9.038ms | 161.600ms | 178.757ms | 445.807ms | 220.656ms | 1463.714ms
> 9.048ms | 161.435ms | 180.510ms | 452.530ms | 220.943ms | 1487.494ms
> 9.030ms | 161.585ms | 178.699ms | 448.167ms | 220.107ms | 1463.785ms
> 9.056ms | 161.530ms | 178.947ms | 445.609ms | 221.026ms | 1560.584ms

...

> no-mitigate + function padding mode
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> nop     | fentry    | fm_single | fm_all    | km_single | km_all
> 9.320ms | 166.454ms | 184.094ms | 193.884ms | 227.320ms | 1441.462ms
> 9.326ms | 166.651ms | 183.954ms | 193.912ms | 227.503ms | 1544.634ms
> 9.313ms | 170.501ms | 183.985ms | 191.738ms | 227.801ms | 1441.284ms
> 9.311ms | 166.957ms | 182.086ms | 192.063ms | 410.411ms | 1489.665ms
> 9.329ms | 166.332ms | 182.196ms | 194.154ms | 227.443ms | 1511.272ms
>
> The overhead of fentry_multi_all is a little higher than the
> fentry_multi_single. Maybe it is because the function
> ktime_get_boottime_ns(), which is used in bpf_testmod_bench_run(), is also
> traced? I haven't figured it out yet, but it doesn't matter :/

I think it matters a lot.
Looking at patch 25, fm_all (in addition to fm_single) is only
supposed to trigger from ktime_get_boottime,
but for hash table mode the difference is huge.
10M bpf_fentry_test1() calls are supposed to dominate 2 calls
to ktime_get and whatever else is called there,
but this is not what the numbers tell.

Same discrepancy with kprobe_multi. 7x difference has to be understood,
since it's a sign that the benchmark is not really measuring
what it is supposed to measure. Which casts doubts on all numbers.

Another part is how come fentry is 20x slower than nop.
We don't see that in the existing benches. That's another red flag.

You need to rethink the benchmarking strategy. The bench itself
should be spotless. Don't invent new stuff. Add to the existing benches.
They already measure nop, fentry, kprobe, kprobe-multi.

Then only introduce a global trampoline with a simple hash tab.
Compare against current numbers for fentry.
fm_single has to be within couple percents of fentry.
Then make fm_all attach to everything except funcs that bench trigger calls.
fm_all has to be exactly equal to fm_single.
If the difference is 2.5x like here (180 for fm_single vs 446 for fm_all)
something is wrong. Investigate it and don't proceed without full
understanding.

And only then introduce the 5 byte special insn that indexes into
an array for fast access to metadata.
Your numbers are a bit suspicious, but they show that fm_single
with hash tab is the same speed as the special kfunc_md_arch_support().
Which is expected.
With fm_all that triggers small set of kernel function
in a tight benchmark loop the performance of hashtab vs special
should _also_ be the same, because hashtab will perform O(1) lookup
that is hot in the cache (or hashtab has bad collisions and should be fixed).
fm_all should have the same speed as fm_single too,
because bench will only attach to things outside of the tight bench loop.
So attaching to thousands of kernel functions that are not being
triggered by the benchmark should not affect results.

The performance advantage of special kfunc_md_arch_support()
can probably only be seen in production when fentry.multi attaches
to thousands of kernel functions and random functions are called.
Then hash tab cache misses will be noticeable vs direct access.
There will be cache misses in both cases, but significantly more misses
for hash tab. Only then we can decide where special stuff is truly necessary.
So patches 2 and 3 are really last. After everything had already landed.
Re: [PATCH bpf-next 00/25] bpf: tracing multi-link support
Posted by Menglong Dong 6 months, 1 week ago
On 6/11/25 11:32, Alexei Starovoitov wrote:
> On Tue, May 27, 2025 at 8:49 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
>>
>> 1. Add per-function metadata storage support.
>> 2. Add bpf global trampoline support for x86_64.
>> 3. Add bpf global trampoline link support.
>> 4. Add tracing multi-link support.
>> 5. Compatibility between tracing and tracing_multi.
>
> ...
>
>> ... and I think it will be a
>> liberation to split it out to another series :/
>
> There are lots of interesting ideas here and you know
> already what the next step should be...
> Split it into small chunks.
> As presented it's hard to review and even if maintainers take on
> that challenge the set is unlandable, since it spans various
> subsystems.
>
> In a small reviewable patch set we can argue about
> approach A vs B while the current set has too many angles
> to argue about.


Hi, Alexei.


You are right. In the very beginning, I planned to make the kernel function
metadata the first series. However, it's hard to judge whether the function
metadata is useful without its usage in the BPF tracing multi-link. So I
kneaded them together in this series.


The features in this series can be split into 4 parts:
* kernel function metadata
* BPF global trampoline
* tracing multi-link support
* gtramp work together with trampoline


I was planning to split the 4th part out of this series. And now, I'm
wondering if we should split it in the following way:

* series 1: kernel function metadata
* series 2: BPF global trampoline + tracing multi-link support
* series 3: gtramp work together with trampoline

>
> Like the new concept of global trampoline.
> It's nice to write bpf_global_caller() in asm
> compared to arch_prepare_bpf_trampoline() that emits asm
> on the fly, but it seems the only thing where it truly
> needs asm is register save/restore. The rest can be done in C.


We also need to get the function ip from the stack and do the origin
call with asm.


>
> I suspect the whole gtramp can be written in C.
> There is an attribute(interrupt) that all compilers support...
> or use no attributes and inline asm for regs save/restore ?
> or attribute(naked) and more inline asm ?


That's a nice idea, which would make bpf_global_caller() much easier.
I believe it's worth a try.


>
>> no-mitigate + hash table mode
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> nop     | fentry    | fm_single | fm_all    | km_single | km_all
>> 9.014ms | 162.378ms | 180.511ms | 446.286ms | 220.634ms | 1465.133ms
>> 9.038ms | 161.600ms | 178.757ms | 445.807ms | 220.656ms | 1463.714ms
>> 9.048ms | 161.435ms | 180.510ms | 452.530ms | 220.943ms | 1487.494ms
>> 9.030ms | 161.585ms | 178.699ms | 448.167ms | 220.107ms | 1463.785ms
>> 9.056ms | 161.530ms | 178.947ms | 445.609ms | 221.026ms | 1560.584ms
>
> ...
>
>> no-mitigate + function padding mode
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> nop     | fentry    | fm_single | fm_all    | km_single | km_all
>> 9.320ms | 166.454ms | 184.094ms | 193.884ms | 227.320ms | 1441.462ms
>> 9.326ms | 166.651ms | 183.954ms | 193.912ms | 227.503ms | 1544.634ms
>> 9.313ms | 170.501ms | 183.985ms | 191.738ms | 227.801ms | 1441.284ms
>> 9.311ms | 166.957ms | 182.086ms | 192.063ms | 410.411ms | 1489.665ms
>> 9.329ms | 166.332ms | 182.196ms | 194.154ms | 227.443ms | 1511.272ms
>>
>> The overhead of fentry_multi_all is a little higher than the
>> fentry_multi_single. Maybe it is because the function
>> ktime_get_boottime_ns(), which is used in bpf_testmod_bench_run(), is also
>> traced? I haven't figured it out yet, but it doesn't matter :/
>
> I think it matters a lot.
> Looking at patch 25 the fm_all (in addition to fm_single) only
> suppose to trigger from ktime_get_boottime,
> but for hash table mode the difference is huge.
> 10M bpf_fentry_test1() calls are supposed to dominate 2 calls
> to ktime_get and whatever else is called there,
> but this is not what numbers tell.
>
> Same discrepancy with kprobe_multi. 7x difference has to be understood,
> since it's a sign that the benchmark is not really measuring
> what it is supposed to measure. Which casts doubts on all numbers.


I think there is some misunderstanding here. In the hash table mode, we
trace all the kernel functions for fm_all and km_all. Compared to
fm_single and km_single, the overhead of fm_all and km_all suffers from
the hash lookup, as we traced 40k+ functions in these cases.


The overhead of kprobe_multi has a linear relation with the total number
of kernel functions in fprobe, so the 7x difference is reasonable. The
same goes for fentry_multi in hash table mode.


NOTE: The hash table lookup is not O(1) if the number of functions that
we trace is more than 1k. According to my research, the loop count needed
to find bpf_fentry_test1() with hlist_for_each_entry() is about 35 when
the number of functions in the hash table is 47k.

BTW, the bucket count of the hash table that we use is 1024.


The CPU I used for the testing is:
AMD Ryzen 9 7940HX with Radeon Graphics


>
> Another part is how come fentry is 20x slower than nop.
> We don't see it in the existing bench-es. That's another red flag.


I think this has a strong relation to the Kconfig I use. When I do the
testing with "make tinyconfig" as the base, fentry is ~9x slower than
nop. I did this test with the Kconfig of Debian 12 (6.1 kernel), and I
think there is more overhead in rcu_read_lock, migrate_disable, etc., with
this Kconfig.


>
> You need to rethink benchmarking strategy. The bench itself
> should be spotless. Don't invent new stuff. Add to existing benchs.
> They already measure nop, fentry, kprobe, kprobe-multi.


Great! It seems that I did so much useless work on the bench testing :/


>
> Then only introduce a global trampoline with a simple hash tab.
> Compare against current numbers for fentry.
> fm_single has to be within couple percents of fentry.
> Then make fm_all attach to everything except funcs that bench trigger calls.
> fm_all has to be exactly equal to fm_single.
> If the difference is 2.5x like here (180 for fm_single vs 446 for fm_all)
> something is wrong. Investigate it and don't proceed without full
> understanding.


Emm... Like I explained above, the 2.5x difference is reasonable, and
this is exactly the reason why we need the function-padding-based
metadata, which can free fentry_multi and kprobe_multi (in the future)
from the overhead of the hash lookup.


>
> And only then introduce 5 byte special insn that indices into
> an array for fast access to metadata.
> Your numbers are a bit suspicious, but they show that fm_single
> with hash tab is the same speed as the special kfunc_md_arch_support().
> Which is expected.
> With fm_all that triggers small set of kernel function
> in a tight benchmark loop the performance of hashtab vs special
> should _also_ be the same, because hashtab will perform O(1) lookup
> that is hot in the cache (or hashtab has bad collisions and should be fixed).


I think this is the problem. The number of kernel functions is much larger
than the array length, which makes the hash lookup not O(1) anymore.

Sorry, I wanted to show the performance of the function-padding-based
metadata, so I made the number of traced kernel functions huge, which is
~47k.


When the function number is less than 2k, the performance of fm_single
and fm_all doesn't differ much, according to my previous testing :/


>
> fm_all should have the same speed as fm_single too,
> because bench will only attach to things outside of the tight bench loop.
> So attaching to thousands of kernel functions that are not being
> triggered by the benchmark should not affect results.


There are 47k kernel functions in this testing :/


> The performance advantage of special kfunc_md_arch_support()
> can probably only be seen in production when fentry.multi attaches
> to thousands of kernel functions and random functions are called.
> Then hash tab cache misses will be noticeable vs direct access.
> There will be cache misses in both cases, but significantly more misses
> for hash tab. Only then we can decide where special stuff is truly necessary.
> So patches 2 and 3 are really last. After everything had already landed.


Emm... The cache miss is something I didn't expect. The only thing I was
concerned about before was the overhead of the hash lookup. It's good to
know that this also helps with cache misses!


BTW, should I still split the function-padding-based metadata out into
the last series?


Thanks!
Menglong Dong
Re: [PATCH bpf-next 00/25] bpf: tracing multi-link support
Posted by Alexei Starovoitov 6 months, 1 week ago
On Wed, Jun 11, 2025 at 5:59 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
>
> On 6/11/25 11:32, Alexei Starovoitov wrote:
> > On Tue, May 27, 2025 at 8:49 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
> >>
> >> 1. Add per-function metadata storage support.
> >> 2. Add bpf global trampoline support for x86_64.
> >> 3. Add bpf global trampoline link support.
> >> 4. Add tracing multi-link support.
> >> 5. Compatibility between tracing and tracing_multi.
> >
> > ...
> >
> >> ... and I think it will be a
> >> liberation to split it out to another series :/
> >
> > There are lots of interesting ideas here and you know
> > already what the next step should be...
> > Split it into small chunks.
> > As presented it's hard to review and even if maintainers take on
> > that challenge the set is unlandable, since it spans various
> > subsystems.
> >
> > In a small reviewable patch set we can argue about
> > approach A vs B while the current set has too many angles
> > to argue about.
>
>
> Hi, Alexei.
>
>
> You are right. In the very beginning, I planned to make the kernel function
> metadata to be the first series. However, it's hard to judge if the function
> metadata is useful without the usage of the BPF tracing multi-link. So I
> kneaded them together in this series.
>
>
> The features in this series can be split into 4 part:
> * kernel function metadata
> * BPF global trampoline
> * tracing multi-link support
> * gtramp work together with trampoline
>
>
> I was planning to split out the 4th part out of this series. And now, I'm
> not sure if we should split it in the following way:
>
> * series 1: kernel function metadata
> * series 2: BPF global trampoline + tracing multi-link support
> * series 3: gtramp work together with trampoline

Neither. First thing is to understand benchmark numbers.
We're not there yet.

> >
> > Like the new concept of global trampoline.
> > It's nice to write bpf_global_caller() in asm
> > compared to arch_prepare_bpf_trampoline() that emits asm
> > on the fly, but it seems the only thing where it truly
> > needs asm is register save/restore. The rest can be done in C.
>
>
> We also need to get the function ip from the stack and do the origin
> call with asm.
>
>
> >
> > I suspect the whole gtramp can be written in C.
> > There is an attribute(interrupt) that all compilers support...
> > or use no attributes and inline asm for regs save/restore ?
> > or attribute(naked) and more inline asm ?
>
>
> That's a nice shot, which will make the bpf_global_caller() much easier.
> I believe it worth a try.
>
>
> >
> >> no-mitigate + hash table mode
> >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >> nop     | fentry    | fm_single | fm_all    | km_single | km_all
> >> 9.014ms | 162.378ms | 180.511ms | 446.286ms | 220.634ms | 1465.133ms
> >> 9.038ms | 161.600ms | 178.757ms | 445.807ms | 220.656ms | 1463.714ms
> >> 9.048ms | 161.435ms | 180.510ms | 452.530ms | 220.943ms | 1487.494ms
> >> 9.030ms | 161.585ms | 178.699ms | 448.167ms | 220.107ms | 1463.785ms
> >> 9.056ms | 161.530ms | 178.947ms | 445.609ms | 221.026ms | 1560.584ms
> >
> > ...
> >
> >> no-mitigate + function padding mode
> >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >> nop     | fentry    | fm_single | fm_all    | km_single | km_all
> >> 9.320ms | 166.454ms | 184.094ms | 193.884ms | 227.320ms | 1441.462ms
> >> 9.326ms | 166.651ms | 183.954ms | 193.912ms | 227.503ms | 1544.634ms
> >> 9.313ms | 170.501ms | 183.985ms | 191.738ms | 227.801ms | 1441.284ms
> >> 9.311ms | 166.957ms | 182.086ms | 192.063ms | 410.411ms | 1489.665ms
> >> 9.329ms | 166.332ms | 182.196ms | 194.154ms | 227.443ms | 1511.272ms
> >>
> >> The overhead of fentry_multi_all is a little higher than the
> >> fentry_multi_single. Maybe it is because the function
> >> ktime_get_boottime_ns(), which is used in bpf_testmod_bench_run(), is also
> >> traced? I haven't figured it out yet, but it doesn't matter :/
> >
> > I think it matters a lot.
> > Looking at patch 25 the fm_all (in addition to fm_single) only
> > suppose to trigger from ktime_get_boottime,
> > but for hash table mode the difference is huge.
> > 10M bpf_fentry_test1() calls are supposed to dominate 2 calls
> > to ktime_get and whatever else is called there,
> > but this is not what numbers tell.
> >
> > Same discrepancy with kprobe_multi. 7x difference has to be understood,
> > since it's a sign that the benchmark is not really measuring
> > what it is supposed to measure. Which casts doubts on all numbers.
>
>
> I think there is some misunderstand here. In the hash table mode, we trace
> all the kernel function for fm_all and km_all. Compared to fm_single and
> km_single, the overhead of fm_all and km_all suffer from the hash lookup,
> as we traced 40k+ functions in these case.
>
>
> The overhead of kprobe_multi has a linear relation with the total kernel
> function number in fprobe, so the 7x difference is reasonable. The same
> to fentry_multi in hash table mode.

No, it's not. More below...

> NOTE: The hash table lookup is not O(1) if the function number that we
> traced more than 1k. According to my research, the loop count that we use
> to find bpf_fentry_test1() with hlist_for_each_entry() is about 35 when
> the functions number in the hash table is 47k.
>
> BTW, the array length of the hash table that we use is 1024.

and that's the bug.
You added 47k functions to a htab with 1k buckets and
argue that its performance is slow?!
That's a pointless baseline.
Use rhashtable or size up the buckets to match the number of functions
being traced, so that hash lookup is O(1).

>
> The CPU I used for the testing is:
> AMD Ryzen 9 7940HX with Radeon Graphics
>
>
> >
> > Another part is how come fentry is 20x slower than nop.
> > We don't see it in the existing bench-es. That's another red flag.
>
>
> I think this has a strong relation with the Kconfig I use. When I do the
> testing with "make tinyconfig" as the base, the fentry is ~9x slower than
> nop. I do this test with the Kconfig of debian12 (6.1 kernel), and I think
> there is more overhead to rcu_read_lock, migrate_disable, etc, in this
> Kconfig.

It shouldn't make any difference if the hashtable is properly used.

>
>
> >
> > You need to rethink benchmarking strategy. The bench itself
> > should be spotless. Don't invent new stuff. Add to existing benchs.
> > They already measure nop, fentry, kprobe, kprobe-multi.
>
>
> Great! It seems that I did so many useless works on the bench testing :/
>
>
> >
> > Then only introduce a global trampoline with a simple hash tab.
> > Compare against current numbers for fentry.
> > fm_single has to be within couple percents of fentry.
> > Then make fm_all attach to everything except funcs that bench trigger calls.
> > fm_all has to be exactly equal to fm_single.
> > If the difference is 2.5x like here (180 for fm_single vs 446 for fm_all)
> > something is wrong. Investigate it and don't proceed without full
> > understanding.
>
>
> Emm......Like what I explain above, the 2.5X difference is reasonable, and
> this is exact the reason why we need the function padding based metadata,
> which is able to make fentry_multi and kprobe_multi(in the feature) out of
> overhead of the hash lookup.

Absolutely not. It only points to an implementation issue with the hashtab.

>
> >
> > And only then introduce 5 byte special insn that indices into
> > an array for fast access to metadata.
> > Your numbers are a bit suspicious, but they show that fm_single
> > with hash tab is the same speed as the special kfunc_md_arch_support().
> > Which is expected.
> > With fm_all that triggers small set of kernel function
> > in a tight benchmark loop the performance of hashtab vs special
> > should _also_ be the same, because hashtab will perform O(1) lookup
> > that is hot in the cache (or hashtab has bad collisions and should be fixed).
>
>
> I think this is the problem. The kernel function number is much more than
> the array length, which makes the hash lookup not O(1) anymore.
>
> Sorry that I wanted to show the performance of function padding based
> metadata, and made the kernel function number that we traced huge, which
> is ~47k.
>
>
> When the function number less than 2k, the performance of fm_single and
> fm_all don't have much difference, according to my previous testing :/

Sigh. You should have said at the beginning that your hashtab
is fixed size. All the comparisons and reasoning are bogus.

>
> >
> > fm_all should have the same speed as fm_single too,
> > because bench will only attach to things outside of the tight bench loop.
> > So attaching to thousands of kernel functions that are not being
> > triggered by the benchmark should not affect results.
>
>
> This is 47k kernel functions in this testing :/
>
>
> > The performance advantage of special kfunc_md_arch_support()
> > can probably only be seen in production when fentry.multi attaches
> > to thousands of kernel functions and random functions are called.
> > Then hash tab cache misses will be noticeable vs direct access.
> > There will be cache misses in both cases, but significantly more misses
> > for hash tab. Only then we can decide where special stuff is truly necessary.
> > So patches 2 and 3 are really last. After everything had already landed.
>
>
> Emm......The cache miss is something I didn't expect. The only thing I
> concerned before is just the overhead of the hash lookup. To my utter
> astonishment, this actually helps with cache misses as well!
>
>
> BTW, should I still split out the function padding based metadata in
> the last series?

No. First make sure fm_single and fm_all have the same performance
with the hashtable and demonstrate that with the existing selftests/bpf/benchs/
Re: [PATCH bpf-next 00/25] bpf: tracing multi-link support
Posted by Menglong Dong 6 months, 1 week ago
On Thu, Jun 12, 2025 at 12:11 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jun 11, 2025 at 5:59 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> >
> > On 6/11/25 11:32, Alexei Starovoitov wrote:
> > > On Tue, May 27, 2025 at 8:49 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > >>
> > >> 1. Add per-function metadata storage support.
> > >> 2. Add bpf global trampoline support for x86_64.
> > >> 3. Add bpf global trampoline link support.
> > >> 4. Add tracing multi-link support.
> > >> 5. Compatibility between tracing and tracing_multi.
> > >
> > > ...
> > >
> > >> ... and I think it will be a
> > >> liberation to split it out to another series :/
> > >
> > > There are lots of interesting ideas here and you know
> > > already what the next step should be...
> > > Split it into small chunks.
> > > As presented it's hard to review and even if maintainers take on
> > > that challenge the set is unlandable, since it spans various
> > > subsystems.
> > >
> > > In a small reviewable patch set we can argue about
> > > approach A vs B while the current set has too many angles
> > > to argue about.
> >
> >
> > Hi, Alexei.
> >
> >
> > You are right. In the very beginning, I planned to make the kernel function
> > metadata to be the first series. However, it's hard to judge if the function
> > metadata is useful without the usage of the BPF tracing multi-link. So I
> > kneaded them together in this series.
> >
> >
> > The features in this series can be split into 4 part:
> > * kernel function metadata
> > * BPF global trampoline
> > * tracing multi-link support
> > * gtramp work together with trampoline
> >
> >
> > I was planning to split out the 4th part out of this series. And now, I'm
> > not sure if we should split it in the following way:
> >
> > * series 1: kernel function metadata
> > * series 2: BPF global trampoline + tracing multi-link support
> > * series 3: gtramp work together with trampoline
>
> Neither. First thing is to understand benchmark numbers.
> We're not there yet.
>
> > >
> > > Like the new concept of global trampoline.
> > > It's nice to write bpf_global_caller() in asm
> > > compared to arch_prepare_bpf_trampoline() that emits asm
> > > on the fly, but it seems the only thing where it truly
> > > needs asm is register save/restore. The rest can be done in C.
> >
> >
> > We also need to get the function ip from the stack and do the origin
> > call with asm.
> >
> >
> > >
> > > I suspect the whole gtramp can be written in C.
> > > There is an attribute(interrupt) that all compilers support...
> > > or use no attributes and inline asm for regs save/restore ?
> > > or attribute(naked) and more inline asm ?
> >
> >
> > That's a nice shot, which will make the bpf_global_caller() much easier.
> > I believe it worth a try.
> >
> >
> > >
> > >> no-mitigate + hash table mode
> > >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >> nop     | fentry    | fm_single | fm_all    | km_single | km_all
> > >> 9.014ms | 162.378ms | 180.511ms | 446.286ms | 220.634ms | 1465.133ms
> > >> 9.038ms | 161.600ms | 178.757ms | 445.807ms | 220.656ms | 1463.714ms
> > >> 9.048ms | 161.435ms | 180.510ms | 452.530ms | 220.943ms | 1487.494ms
> > >> 9.030ms | 161.585ms | 178.699ms | 448.167ms | 220.107ms | 1463.785ms
> > >> 9.056ms | 161.530ms | 178.947ms | 445.609ms | 221.026ms | 1560.584ms
> > >
> > > ...
> > >
> > >> no-mitigate + function padding mode
> > >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >> nop     | fentry    | fm_single | fm_all    | km_single | km_all
> > >> 9.320ms | 166.454ms | 184.094ms | 193.884ms | 227.320ms | 1441.462ms
> > >> 9.326ms | 166.651ms | 183.954ms | 193.912ms | 227.503ms | 1544.634ms
> > >> 9.313ms | 170.501ms | 183.985ms | 191.738ms | 227.801ms | 1441.284ms
> > >> 9.311ms | 166.957ms | 182.086ms | 192.063ms | 410.411ms | 1489.665ms
> > >> 9.329ms | 166.332ms | 182.196ms | 194.154ms | 227.443ms | 1511.272ms
> > >>
> > >> The overhead of fentry_multi_all is a little higher than the
> > >> fentry_multi_single. Maybe it is because the function
> > >> ktime_get_boottime_ns(), which is used in bpf_testmod_bench_run(), is also
> > >> traced? I haven't figured it out yet, but it doesn't matter :/
> > >
> > > I think it matters a lot.
> > > Looking at patch 25 the fm_all (in addition to fm_single) only
> > > suppose to trigger from ktime_get_boottime,
> > > but for hash table mode the difference is huge.
> > > 10M bpf_fentry_test1() calls are supposed to dominate 2 calls
> > > to ktime_get and whatever else is called there,
> > > but this is not what numbers tell.
> > >
> > > Same discrepancy with kprobe_multi. 7x difference has to be understood,
> > > since it's a sign that the benchmark is not really measuring
> > > what it is supposed to measure. Which casts doubts on all numbers.
> >
> >
> > I think there is some misunderstand here. In the hash table mode, we trace
> > all the kernel function for fm_all and km_all. Compared to fm_single and
> > km_single, the overhead of fm_all and km_all suffer from the hash lookup,
> > as we traced 40k+ functions in these case.
> >
> >
> > The overhead of kprobe_multi has a linear relation with the total kernel
> > function number in fprobe, so the 7x difference is reasonable. The same
> > to fentry_multi in hash table mode.
>
> No, it's not. More below...
>
> > NOTE: The hash table lookup is not O(1) if the function number that we
> > traced more than 1k. According to my research, the loop count that we use
> > to find bpf_fentry_test1() with hlist_for_each_entry() is about 35 when
> > the functions number in the hash table is 47k.
> >
> > BTW, the array length of the hash table that we use is 1024.
>
> and that's the bug.
> You added 47k functions to a htab with 1k bucket and
> argue it's performance is slow?!
> That's a pointless baseline.
> Use rhashtable or size up buckets to match the number of functions
> being traced, so that hash lookup is O(1).

Hi Alexei, thank you for your explanation, and now I realize the
problem is my hash table :/

My hash table was modeled on ftrace and fprobe, whose
max bucket count is 1024.

It's interesting to make the hash table O(1) by using rhashtable
or sizing up the buckets, as you said. I suspect we don't even
need the function padding part if the hash table is random
enough.

I'll redesign the hash table part, and do the testing with the existing
bench to make fm_single the same as fm_all, which should hold in
theory.

Thanks!
Menglong Dong

>
> >
> > The CPU I used for the testing is:
> > AMD Ryzen 9 7940HX with Radeon Graphics
> >
> >
> > >
> > > Another part is how come fentry is 20x slower than nop.
> > > We don't see it in the existing bench-es. That's another red flag.
> >
> >
> > I think this has a strong relation with the Kconfig I use. When I do the
> > testing with "make tinyconfig" as the base, the fentry is ~9x slower than
> > nop. I do this test with the Kconfig of debian12 (6.1 kernel), and I think
> > there is more overhead to rcu_read_lock, migrate_disable, etc, in this
> > Kconfig.
>
> It shouldn't make any difference if hashtable is properly used.
>
> >
> >
> > >
> > > You need to rethink benchmarking strategy. The bench itself
> > > should be spotless. Don't invent new stuff. Add to existing benchs.
> > > They already measure nop, fentry, kprobe, kprobe-multi.
> >
> >
> > Great! It seems that I did so many useless works on the bench testing :/
> >
> >
> > >
> > > Then only introduce a global trampoline with a simple hash tab.
> > > Compare against current numbers for fentry.
> > > fm_single has to be within couple percents of fentry.
> > > Then make fm_all attach to everything except funcs that bench trigger calls.
> > > fm_all has to be exactly equal to fm_single.
> > > If the difference is 2.5x like here (180 for fm_single vs 446 for fm_all)
> > > something is wrong. Investigate it and don't proceed without full
> > > understanding.
> >
> >
> > Emm......Like what I explain above, the 2.5X difference is reasonable, and
> > this is exact the reason why we need the function padding based metadata,
> > which is able to make fentry_multi and kprobe_multi(in the feature) out of
> > overhead of the hash lookup.
>
> Absolutely not. It only points into an implementation issue with hashtab.
>
> >
> > >
> > > And only then introduce 5 byte special insn that indices into
> > > an array for fast access to metadata.
> > > Your numbers are a bit suspicious, but they show that fm_single
> > > with hash tab is the same speed as the special kfunc_md_arch_support().
> > > Which is expected.
> > > With fm_all that triggers small set of kernel function
> > > in a tight benchmark loop the performance of hashtab vs special
> > > should _also_ be the same, because hashtab will perform O(1) lookup
> > > that is hot in the cache (or hashtab has bad collisions and should be fixed).
> >
> >
> > I think this is the problem. The kernel function number is much more than
> > the array length, which makes the hash lookup not O(1) anymore.
> >
> > Sorry that I wanted to show the performance of function padding based
> > metadata, and made the kernel function number that we traced huge, which
> > is ~47k.
> >
> >
> > When the function number less than 2k, the performance of fm_single and
> > fm_all don't have much difference, according to my previous testing :/
>
> Sigh. You should have said that in the beginning that your hashtab
> is fixed size. All the comparisons and reasons are bogus.
>
> >
> > >
> > > fm_all should have the same speed as fm_single too,
> > > because bench will only attach to things outside of the tight bench loop.
> > > So attaching to thousands of kernel functions that are not being
> > > triggered by the benchmark should not affect results.
> >
> >
> > This is 47k kernel functions in this testing :/
> >
> >
> > > The performance advantage of special kfunc_md_arch_support()
> > > can probably only be seen in production when fentry.multi attaches
> > > to thousands of kernel functions and random functions are called.
> > > Then hash tab cache misses will be noticeable vs direct access.
> > > There will be cache misses in both cases, but significantly more misses
> > > for hash tab. Only then we can decide where special stuff is truly necessary.
> > > So patches 2 and 3 are really last. After everything had already landed.
> >
> >
> > Emm......The cache misses are something I didn't expect. The only thing I
> > was concerned about before was the overhead of the hash lookup. To my
> > surprise, this actually helps with cache misses as well!
> >
> >
> > BTW, should I still split out the function padding based metadata in
> > the last series?
>
> No. First make sure fm_single and fm_all have the same performance
> with the hashtable and demonstrate that with the existing selftests/bpf/benchs/
Re: [PATCH bpf-next 00/25] bpf: tracing multi-link support
Posted by Alexei Starovoitov 6 months, 1 week ago
On Wed, Jun 11, 2025 at 5:07 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
>
> Hi Alexei, thank you for your explanation, and now I realize the
> problem is my hash table :/
>
> My hash table was modeled on ftrace and fprobe, whose
> max bucket count is 1024.
>
> It's interesting to make the hash table O(1) by using rhashtable
> or by sizing up the buckets, as you said. I suspect we may not even
> need the function padding part if the hash table is random
> enough.

I suggest starting with rhashtable. It's used in many
performance critical places, and when rhashtable_params are
constant the compiler optimizes everything nicely.
lookup is lockless and only needs RCU, so safe to use
from fentry_multi.
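
Something along these lines, as a rough sketch (keying on the function
entry ip; the struct layout and names are illustrative only, not what
the series necessarily ends up with):

#include <linux/rhashtable.h>

struct kfunc_md {
	unsigned long		func;	/* key: function entry ip */
	struct rhash_head	node;
	/* bpf progs, flags, nr_args, ... */
};

static const struct rhashtable_params kfunc_md_ht_params = {
	.key_len		= sizeof(unsigned long),
	.key_offset		= offsetof(struct kfunc_md, func),
	.head_offset		= offsetof(struct kfunc_md, node),
	.automatic_shrinking	= true,
};

static struct rhashtable kfunc_md_ht;	/* rhashtable_init() at attach time */

/* lockless lookup; safe under rcu_read_lock() from the global trampoline */
static struct kfunc_md *kfunc_md_lookup(unsigned long ip)
{
	return rhashtable_lookup_fast(&kfunc_md_ht, &ip, kfunc_md_ht_params);
}

Since kfunc_md_ht_params is a compile-time constant, the hash and compare
paths get specialized by the compiler.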
Re: [PATCH bpf-next 00/25] bpf: tracing multi-link support
Posted by Menglong Dong 6 months, 1 week ago
On Thu, Jun 12, 2025 at 8:58 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jun 11, 2025 at 5:07 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
> >
> > Hi Alexei, thank you for your explanation, and now I realize the
> > problem is my hash table :/
> >
> > My hash table was modeled on ftrace and fprobe, whose
> > max bucket count is 1024.
> >
> > It's interesting to make the hash table O(1) by using rhashtable
> > or by sizing up the buckets, as you said. I suspect we may not even
> > need the function padding part if the hash table is random
> > enough.
>
> I suggest starting with rhashtable. It's used in many
> performance critical places, and when rhashtable_params are
> constant the compiler optimizes everything nicely.
> lookup is lockless and only needs RCU, so safe to use
> from fentry_multi.

Hi, Alexei. Sorry to bother you. I have implemented the hash table
with the rhashtable and did the bench testing with the existing
framework.

You said that "fm_single has to be within couple percents of fentry"
before, and I think it's a little difficult if we use the hashtable without
the function padding mode.

The extra overhead of the global trampoline comes from the following:
1. an additional hash lookup. The rhashtable lookup is O(1), but the hash
key computation and the extra memory reads still have a slight overhead.
2. an additional function call to kfunc_md_get_noref() in the asm, which is
used to get the function metadata. We can inline it in the asm in the
function padding mode, but that's hard if we are using the rhashtable
(see the sketch after this list).
3. extra logic in the global trampoline. For example, we save and
restore reg1 to reg6 on the stack for the function args even if
the function we attached to doesn't have any args.
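
For point 2, a simplified sketch of why the padding mode can be inlined:
there the lookup collapses to a couple of loads, because a metadata index
is written into the padding bytes in front of the function entry at attach
time. The offset, struct layout and array name below are illustrative only,
not necessarily what the patches use:

#define KFUNC_MD_PAD_OFFSET	8	/* assumed padding layout */

struct kfunc_md {
	/* bpf progs, flags, nr_args, ... */
	unsigned long func;
};

extern struct kfunc_md kfunc_md_array[];

static __always_inline struct kfunc_md *kfunc_md_get_noref(unsigned long ip)
{
	/* read the index stored in the function padding, then index the array */
	u32 idx = *(u32 *)(ip - KFUNC_MD_PAD_OFFSET);

	return &kfunc_md_array[idx];
}

With the rhashtable, the same helper has to go through the hash computation
plus a bucket walk, so it stays an out-of-line call from the trampoline.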

The following is the bench result in rhashtable mode; the performance
of fentry_multi is about 77.7% of fentry:

  usermode-count :  893.357 ± 0.566M/s
  kernel-count   :  421.290 ± 0.159M/s
  syscall-count  :   21.018 ± 0.165M/s
  fentry         :  100.742 ± 0.065M/s
  fexit          :   51.283 ± 0.784M/s
  fmodret        :   55.410 ± 0.026M/s
  fentry-multi   :   78.237 ± 0.117M/s
  fentry-multi-all:   80.090 ± 0.049M/s
  rawtp          :  161.496 ± 0.197M/s
  tp             :   70.021 ± 0.015M/s
  kprobe         :   54.693 ± 0.013M/s
  kprobe-multi   :   51.481 ± 0.023M/s
  kretprobe      :   22.504 ± 0.011M/s
  kretprobe-multi:   27.221 ± 0.037M/s

(It's weird that the performance of fentry-multi-all is a little higher
than fentry-multi, but I'm sure that the bpf prog is attached to all the
kernel functions in the fentry-multi-all testcase.)

The overhead of part 1 can be eliminated by using the
function padding mode, and the following is the bench result:

  usermode-count :  895.874 ± 2.472M/s
  kernel-count   :  423.882 ± 0.342M/s
  syscall-count  :   20.480 ± 0.009M/s
  fentry         :  105.191 ± 0.275M/s
  fexit          :   52.430 ± 0.050M/s
  fmodret        :   56.130 ± 0.062M/s
  fentry-multi   :   88.114 ± 0.108M/s
  fentry-multi-all:   86.988 ± 0.024M/s
  rawtp          :  145.488 ± 0.043M/s
  tp             :   73.386 ± 0.095M/s
  kprobe         :   55.294 ± 0.046M/s
  kprobe-multi   :   50.457 ± 0.075M/s
  kretprobe      :   22.414 ± 0.020M/s
  kretprobe-multi:   27.205 ± 0.044M/s

The performance of fentry_multi is now 83.7% of fentry. As the next
step, we inline the function metadata lookup in the asm, and
the performance of fentry_multi becomes 89.7% of fentry, which is
close to "within couple percents of fentry":

  usermode-count :  886.836 ± 0.300M/s
  kernel-count   :  419.962 ± 1.252M/s
  syscall-count  :   20.715 ± 0.022M/s
  fentry         :  102.783 ± 0.166M/s
  fexit          :   52.502 ± 0.014M/s
  fmodret        :   55.822 ± 0.038M/s
  fentry-multi   :   92.201 ± 0.027M/s
  fentry-multi-all:   89.831 ± 0.057M/s
  rawtp          :  158.337 ± 4.918M/s
  tp             :   72.883 ± 0.041M/s
  kprobe         :   54.963 ± 0.013M/s
  kprobe-multi   :   50.069 ± 0.079M/s
  kretprobe      :   22.260 ± 0.012M/s
  kretprobe-multi:   27.211 ± 0.011M/s

For the overhead of part 3, I'm thinking of introducing a
dynamic global trampoline. We create different global trampolines
for functions that have different features, and the features
can be:
* the function argument count
* whether bpf_get_func_ip() is ever called in the bpf progs
* whether FEXIT and MODIFY_RETURN progs exist
* etc.

Then, we can generate a global trampoline for the function with the
minimum number of instructions (a rough sketch follows). According to
my estimation, the performance of fentry_multi should be above 95% of
fentry with function padding, inlined function metadata and a dynamic
global trampoline.
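
Roughly, the idea is to key the trampoline variant on a small per-function
feature description and generate (or reuse) the matching variant at attach
time. The flag names and the helper below are made up for illustration,
not what a future version of the series would necessarily use:

#include <linux/types.h>

enum kfunc_tramp_feature {
	KFUNC_TRAMP_F_FUNC_IP	= 1 << 0,	/* some prog calls bpf_get_func_ip() */
	KFUNC_TRAMP_F_RET_PATH	= 1 << 1,	/* FEXIT/MODIFY_RETURN progs attached */
};

struct kfunc_tramp_key {
	u8	nr_args;	/* how many argument registers to save/restore */
	u8	features;	/* KFUNC_TRAMP_F_* bits */
};

/*
 * Sketch: return the global trampoline variant for this key, generating
 * it on demand, so that e.g. a 0-arg fentry-only function doesn't pay
 * for saving six registers or for the return-path logic.
 */
void *kfunc_get_global_tramp(struct kfunc_tramp_key key);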

In fact, I implemented the first version of this series with the dynamic
global trampoline. However, that made the series very complex,
so I think it's not a good idea to include it in this series.

All in all, the performance of fentry_multi can't be within a couple of
percent of fentry if we use the rhashtable only, according to my testing,
and I'm not sure if I should go ahead :/

BTW, the Kconfig I used in the testing comes from "make tinyconfig", and
I enabled some configs so that tools/testing/selftests/bpf could be compiled
successfully. I would appreciate it if someone could offer a better,
authoritative Kconfig for the testing :/

Thanks, have a nice weekend!
Menglong Dong
Re: [PATCH bpf-next 00/25] bpf: tracing multi-link support
Posted by Menglong Dong 6 months, 1 week ago
On Thu, Jun 12, 2025 at 8:58 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jun 11, 2025 at 5:07 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
> >
> > Hi Alexei, thank you for your explanation, and now I realize the
> > problem is my hash table :/
> >
> > My hash table was modeled on ftrace and fprobe, whose
> > max bucket count is 1024.
> >
> > It's interesting to make the hash table O(1) by using rhashtable
> > or by sizing up the buckets, as you said. I suspect we may not even
> > need the function padding part if the hash table is random
> > enough.
>
> I suggest starting with rhashtable. It's used in many
> performance critical places, and when rhashtable_params are
> constant the compiler optimizes everything nicely.
> lookup is lockless and only needs RCU, so safe to use
> from fentry_multi.

Thanks for the advice! rhashtable is a nice choice for fentry_multi,
and I'll redesign the function metadata with it.

Thanks!
Menglong Dong
Re: [PATCH bpf-next 00/25] bpf: tracing multi-link support
Posted by Steven Rostedt 6 months, 3 weeks ago
On Wed, 28 May 2025 11:46:47 +0800
Menglong Dong <menglong8.dong@gmail.com> wrote:

> After four months, I finally finished the coding and testing of this series.
> This is my first time writing such a complex series, and it's so hard :/
> Anyway, I finished it.
> (I'm scared :/)
> 

Note, sending out a complex series like this at the start of the merge
window is not good timing.

Most kernel maintainers will not be able to even look at this until the
merge window is closed (in two weeks).

That includes myself.

-- Steve
Re: [PATCH bpf-next 00/25] bpf: tracing multi-link support
Posted by Menglong Dong 6 months, 3 weeks ago
On Wed, May 28, 2025 at 9:50 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 28 May 2025 11:46:47 +0800
> Menglong Dong <menglong8.dong@gmail.com> wrote:
>
> > After four months, I finally finished the coding and testing of this series.
> > This is my first time writing such a complex series, and it's so hard :/
> > Anyway, I finished it.
> > (I'm scared :/)
> >
>
> Note, sending out a complex series like this at the start of the merge
> window is not good timing.
>
> Most kernel maintainers will not be able to even look at this until the
> merge window is closed (in two weeks).

Hi Steven, thank you for letting me know.

(It seems that I need to spend some time learning more about the merge
window and related processes :/)

I plan to resend this patch series in two weeks, which also gives me a
chance to further improve it to make it less shit.

Thanks!
Menglong Dong

>
> That includes myself.
>
> -- Steve