x86 tail-call fentry patching mirrors CALL text pokes to the tail-call
landing slot.
The helper that locates that mirrored slot assumes an ENDBR-prefixed
landing, which works on IBT JITs but fails on non-IBT JITs where the
landing starts directly with the 5-byte patch slot.
As a result, the regular entry gets patched but the tail-call landing
remains NOP5, so fentry never fires for tail-called programs on non-IBT
kernels.
Anchor the lookup on the landing address, verify the short-jump layout
first, and only check ENDBR when one is actually emitted.
Signed-off-by: Takeru Hayasaka <hayatake396@gmail.com>
---
arch/x86/net/bpf_jit_comp.c | 47 ++++++++++++++++++++++++++++++++++---
1 file changed, 44 insertions(+), 3 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index e9b78040d703..fe5fd37f65d8 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -325,8 +325,10 @@ struct jit_context {
 /* Number of bytes emit_patch() needs to generate instructions */
 #define X86_PATCH_SIZE		5
 
+/* Number of bytes used by the short jump that skips the tail-call hook. */
+#define X86_TAIL_CALL_SKIP_JMP_SIZE	2
 /* Number of bytes that will be skipped on tailcall */
-#define X86_TAIL_CALL_OFFSET	(12 + ENDBR_INSN_SIZE)
+#define X86_TAIL_CALL_OFFSET	(12 + X86_TAIL_CALL_SKIP_JMP_SIZE + ENDBR_INSN_SIZE)
static void push_r9(u8 **pprog)
{
@@ -545,8 +547,15 @@ static void emit_prologue(u8 **pprog, u8 *ip, u32 stack_depth, bool ebpf_from_cb
 		EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
 	}
 
+	if (!is_subprog) {
+		/* Normal entry skips the tail-call-only trampoline hook. */
+		EMIT2(0xEB, ENDBR_INSN_SIZE + X86_PATCH_SIZE);
+	}
+
 	/* X86_TAIL_CALL_OFFSET is here */
 	EMIT_ENDBR();
+	if (!is_subprog)
+		emit_nops(&prog, X86_PATCH_SIZE);
 
 	/* sub rsp, rounded_stack_depth */
 	if (stack_depth)
@@ -632,12 +641,33 @@ static int __bpf_arch_text_poke(void *ip, enum bpf_text_poke_type old_t,
 	return ret;
 }
 
+static void *bpf_tail_call_fentry_ip(void *ip)
+{
+	u8 *tail_ip = ip + X86_TAIL_CALL_OFFSET;
+	u8 *landing = tail_ip - ENDBR_INSN_SIZE;
+
+	/* ip points at the regular fentry slot after the entry ENDBR. */
+	if (landing[-X86_TAIL_CALL_SKIP_JMP_SIZE] != 0xEB ||
+	    landing[-X86_TAIL_CALL_SKIP_JMP_SIZE + 1] !=
+	    ENDBR_INSN_SIZE + X86_PATCH_SIZE)
+		return NULL;
+
+	if (ENDBR_INSN_SIZE && !is_endbr((u32 *)landing))
+		return NULL;
+
+	return tail_ip;
+}
+
 int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type old_t,
 		       enum bpf_text_poke_type new_t, void *old_addr,
 		       void *new_addr)
 {
+	void *tail_ip = NULL;
+	bool is_bpf_text = is_bpf_text_address((long)ip);
+	int ret, tail_ret;
+
 	if (!is_kernel_text((long)ip) &&
-	    !is_bpf_text_address((long)ip))
+	    !is_bpf_text)
 		/* BPF poking in modules is not supported */
 		return -EINVAL;
@@ -648,7 +678,18 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type old_t,
 	if (is_endbr(ip))
 		ip += ENDBR_INSN_SIZE;
 
-	return __bpf_arch_text_poke(ip, old_t, new_t, old_addr, new_addr);
+	if (is_bpf_text && (old_t == BPF_MOD_CALL || new_t == BPF_MOD_CALL))
+		tail_ip = bpf_tail_call_fentry_ip(ip);
+
+	ret = __bpf_arch_text_poke(ip, old_t, new_t, old_addr, new_addr);
+	if (ret < 0 || !tail_ip)
+		return ret;
+
+	tail_ret = __bpf_arch_text_poke(tail_ip, old_t, new_t, old_addr, new_addr);
+	if (tail_ret < 0)
+		return tail_ret;
+
+	return 0;
 }
#define EMIT_LFENCE() EMIT3(0x0F, 0xAE, 0xE8)
--
2.43.0
On Fri, Mar 27, 2026 at 7:16 AM Takeru Hayasaka <hayatake396@gmail.com> wrote:
>
> x86 tail-call fentry patching mirrors CALL text pokes to the tail-call
> landing slot.
>
> The helper that locates that mirrored slot assumes an ENDBR-prefixed
> landing, which works on IBT JITs but fails on non-IBT JITs where the
> landing starts directly with the 5-byte patch slot.

tailcalls are deprecated. We should go the other way and disable them
in the ibt jit instead. The less interaction between fentry and
tailcall the better.

pw-bot: cr
Hi Alexei

Thanks, and Sorry, I sent an older changelog from while I was still
iterating on this, and it described the issue incorrectly.

My changelog made this sound like an IBT/non-IBT-specific issue, but
that was wrong. On current kernels, fentry on tail-called programs is
not supported in either case. Only the regular fentry patch site is
patched; there is no tail-call landing patching in either case, so
disabling IBT does not make it work.

What this series was trying to do was add support for fentry on
tail-called x86 programs. The non-IBT part was only about a bug in my
initial implementation of that support, not the underlying motivation.

The motivation is observability of existing tailcall-heavy BPF/XDP
programs, where tail-called leaf programs are currently a blind spot for
fentry-based debugging.

If supporting fentry on tail-called programs is still not something
you'd want upstream, I understand. If I resend this, I'll fix the
changelog/cover letter to describe it correctly.
On Fri, Mar 27, 2026 at 8:12 AM Takeru Hayasaka <hayatake396@gmail.com> wrote:
>
> Hi Alexei
>
> Thanks, and Sorry, I sent an older changelog from while I was still
> iterating on this, and it described the issue incorrectly.
>
> My changelog made this sound like an IBT/non-IBT-specific issue, but
> that was wrong. On current kernels, fentry on tail-called programs is
> not supported in either case. Only the regular fentry patch site is
> patched; there is no tail-call landing patching in either case, so
> disabling IBT does not make it work.
>
> What this series was trying to do was add support for fentry on
> tail-called x86 programs. The non-IBT part was only about a bug in my
> initial implementation of that support, not the underlying motivation.
>
> The motivation is observability of existing tailcall-heavy BPF/XDP
> programs, where tail-called leaf programs are currently a blind spot for
> fentry-based debugging.

I get that, but I'd rather not open this can of worms.
We had enough headaches when tailcalls, fentry, subprogs are combined.
Like this set:
https://lore.kernel.org/all/20230912150442.2009-1-hffilwlqm@gmail.com/
and the followups.
Understood. I was a bit surprised to read that this area ended up taking
months of follow-up work....

One thing I am still trying to understand is what the preferred
debuggability/observability direction would be for existing
tailcall-heavy BPF/XDP deployments.

Tail calls are already used in practice as a program decomposition
mechanism, especially in XDP pipelines, and that leaves tail-called leaf
programs harder to observe today.

If fentry on tail-called programs is not something you'd want upstream,
is there another direction you would recommend for improving
observability/debuggability of such existing deployments?

On Sat, Mar 28, 2026 at 0:21 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Mar 27, 2026 at 8:12 AM Takeru Hayasaka <hayatake396@gmail.com> wrote:
> >
> > Hi Alexei
> >
> > Thanks, and Sorry, I sent an older changelog from while I was still
> > iterating on this, and it described the issue incorrectly.
> >
> > My changelog made this sound like an IBT/non-IBT-specific issue, but
> > that was wrong. On current kernels, fentry on tail-called programs is
> > not supported in either case. Only the regular fentry patch site is
> > patched; there is no tail-call landing patching in either case, so
> > disabling IBT does not make it work.
> >
> > What this series was trying to do was add support for fentry on
> > tail-called x86 programs. The non-IBT part was only about a bug in my
> > initial implementation of that support, not the underlying motivation.
> >
> > The motivation is observability of existing tailcall-heavy BPF/XDP
> > programs, where tail-called leaf programs are currently a blind spot for
> > fentry-based debugging.
>
> I get that, but I'd rather not open this can of worms.
> We had enough headaches when tailcalls, fentry, subprogs are combined.
> Like this set:
> https://lore.kernel.org/all/20230912150442.2009-1-hffilwlqm@gmail.com/
> and the followups.
On Fri, Mar 27, 2026 at 8:45 AM Takeru Hayasaka <hayatake396@gmail.com> wrote:
>
> Understood. I was a bit surprised to read that this area ended up taking
> months of follow-up work....
>
> One thing I am still trying to understand is what the preferred
> debuggability/observability direction would be for existing
> tailcall-heavy BPF/XDP deployments.
>
> Tail calls are already used in practice as a program decomposition
> mechanism, especially in XDP pipelines, and that leaves tail-called leaf
> programs harder to observe today.
>
> If fentry on tail-called programs is not something you'd want upstream,
> is there another direction you would recommend for improving
> observability/debuggability of such existing deployments?

You don't need fentry to debug. perf works just fine on all bpf progs
whether tailcall or not.

Also pls don't top post.
Sorry about the top-posting.

That makes sense, thanks. I agree perf can provide visibility into which
BPF programs are running, including tail-called ones.

What I am still unsure about is packet-level / structured-data
observability. My use case is closer to xdpdump-style debugging, where I
want to inspect packet-related context from specific XDP leaf programs
in a live pipeline.

That feels harder to express with perf alone, so I am trying to
understand what the preferred direction would be for that kind of use
case in tailcall-heavy XDP deployments.
On Fri, Mar 27, 2026 at 9:06 AM Takeru Hayasaka <hayatake396@gmail.com> wrote:
>
> Sorry about the top-posting.

yet you're still top posting :(

> That makes sense, thanks. I agree perf can provide visibility into which
> BPF programs are running, including tail-called ones.
>
> What I am still unsure about is packet-level / structured-data
> observability. My use case is closer to xdpdump-style debugging, where I
> want to inspect packet-related context from specific XDP leaf programs
> in a live pipeline.

see how cilium did it. with pwru tool, etc.
> yet you're still top posting :(

Sorry about that. I misunderstood what top posting meant and ended up
replying in the wrong style. I had not understood that it referred to
quoting in that way, and I am embarrassed that I got it wrong....

> see how cilium did it. with pwru tool, etc.

Thank you for the suggestion.
As for pwru, I had thought it was not able to capture packet data such as pcap,
and understood it more as a tool to trace where a specific packet
enters the processing path and how it is handled.

For example, in an environment where systems are already
interconnected and running, I sometimes want to capture the actual
packets being sent for real processing.
On the other hand, if the goal is simply to observe processing safely
in a development environment, I think tools such as ipftrace2 or pwru
can be very useful.
On 28/3/26 00:30, Takeru Hayasaka wrote:
>> see how cilium did it. with pwru tool, etc.
>
> Thank you for the suggestion.
> As for pwru, I had thought it was not able to capture packet data such as pcap,
> and understood it more as a tool to trace where a specific packet
> enters the processing path and how it is handled.
>
> For example, in an environment where systems are already
> interconnected and running, I sometimes want to capture the actual
> packets being sent for real processing.
> On the other hand, if the goal is simply to observe processing safely
> in a development environment, I think tools such as ipftrace2 or pwru
> can be very useful.
>

Sounds like you are developing/maintaining an XDP project.

If so, and the kernel carries the patches in
https://lore.kernel.org/all/20230912150442.2009-1-hffilwlqm@gmail.com/,
recommend modifying the XDP project using dispatcher like libxdp [1].
Then, you are able to trace the subprogs which aim to run tail calls;
meanwhile, you are able to filter packets using pcap-filter, and to
output packets using bpf_xdp_output() helper.

[1] https://github.com/xdp-project/xdp-tools/blob/main/lib/libxdp/xdp-dispatcher.c.in

Thanks,
Leon
> Sounds like you are developing/maintaining an XDP project.
>
> If so, and the kernel carries the patches in
> https://lore.kernel.org/all/20230912150442.2009-1-hffilwlqm@gmail.com/,
> recommend modifying the XDP project using dispatcher like libxdp [1].
> Then, you are able to trace the subprogs which aim to run tail calls;
> meanwhile, you are able to filter packets using pcap-filter, and to
> output packets using bpf_xdp_output() helper.
>
> [1]
> https://github.com/xdp-project/xdp-tools/blob/main/lib/libxdp/xdp-dispatcher.c.in

Thank you very much for your wonderful comment, Leon.
This was the first time I learned that such a mechanism exists.

It is a very interesting ecosystem.
If I understand correctly, the idea is to invoke a component that dumps
pcap data as one of the tail-called components, right?

Thank you very much for sharing this idea with me.
If I have a chance to write a new XDP program in the future, I would
definitely like to try it.

On the other hand, I feel that it is somewhat difficult to apply this
idea directly to existing codebases, or to cases where the code is
written in Go using something like cilium/ebpf.
Also, when it comes to code running in production environments, making
changes itself can be difficult.

For that reason, I prototyped a tool like this.
It is something like a middle ground between xdpdump and xdpcap.
I built it so that only packets matched by cbpf are sent up through
perf, and while testing it, I noticed that it does not work well for
targets invoked via tail call.
This is what motivated me to send the patch.

https://github.com/takehaya/xdp-ninja

Once again, thank you for sharing the idea.
Takeru
On 31/3/26 00:46, Takeru Hayasaka wrote:
>> Sounds like you are developing/maintaining an XDP project.
>>
>> If so, and the kernel carries the patches in
>> https://lore.kernel.org/all/20230912150442.2009-1-hffilwlqm@gmail.com/,
>> recommend modifying the XDP project using dispatcher like libxdp [1].
>> Then, you are able to trace the subprogs which aim to run tail calls;
>> meanwhile, you are able to filter packets using pcap-filter, and to
>> output packets using bpf_xdp_output() helper.
>>
>> [1]
>> https://github.com/xdp-project/xdp-tools/blob/main/lib/libxdp/xdp-dispatcher.c.in
>
> Thank you very much for your wonderful comment, Leon.
> This was the first time I learned that such a mechanism exists.
>
> It is a very interesting ecosystem.
> If I understand correctly, the idea is to invoke a component that
> dumps pcap data as one of the tail-called components, right?
It is similar to xdp-ninja/xdp-dump.
However, this idea has one more step forward: it is to trace the
subprogs instead of only the main prog.
For example,
__noinline int subprog0(struct xdp_md *xdp) { bpf_tail_call_static(xdp, &m, 0); }
__noinline int subprog1(struct xdp_md *xdp) { bpf_tail_call_static(xdp, &m, 1); }
__noinline int subprog2(struct xdp_md *xdp) { bpf_tail_call_static(xdp, &m, 2); }

SEC("xdp") int main(struct xdp_md *xdp)
{
	subprog0(xdp);
	subprog1(xdp);
	subprog2(xdp);
	return XDP_PASS;
}
All of them, subprog{0,1,2} and main, will be traced.
In this idea, it is to inject pcap-filter expression, the cbpf, using
elibpcap [1], and to output packets like your xdp-ninja.
It works well during the time I maintained an XDP project.
[1] https://github.com/jschwinger233/elibpcap
> Thank you very much for sharing this idea with me.
> If I have a chance to write a new XDP program in the future, I would
> definitely like to try it.
>
> On the other hand, I feel that it is somewhat difficult to apply this
> idea directly to existing codebases, or to cases where the code is
> written in Go using something like cilium/ebpf.
> Also, when it comes to code running in production environments, making
> changes itself can be difficult.
Correct. If you cannot modify the code, and the tail calls are not made
from inner subprogs, the aforementioned idea cannot help trace the tail
callees.
>
> For that reason, I prototyped a tool like this.
> It is something like a middle ground between xdpdump and xdpcap.
> I built it so that only packets matched by cbpf are sent up through
> perf, and while testing it, I noticed that it does not work well for
> targets invoked via tail call.
> This is what motivated me to send the patch.
>
I had a similar idea years ago, a more generic tracer for tail calls.
However, given Alexei's concern, I won't post it.
> https://github.com/takehaya/xdp-ninja
>
It looks wonderful.
I developed a similar tool, bpfsnoop [1], to trace BPF progs/subprogs
and kernel functions with filtering packets/arguments and outputting
packets/arguments info. However, it lacks the ability of outputting
packets to pcap file.
[1] https://github.com/bpfsnoop/bpfsnoop
Thanks,
Leon
> Once again, thank you for sharing the idea.
> Takeru
> It is similar to xdp-ninja/xdp-dump.
>
> However, this idea has one more step forward: it is to trace the
> subprogs instead of only the main prog.
>
> For example,
>
> __noinline int subprog0(struct xdp_md *xdp) { bpf_tail_call_static(xdp, &m, 0); }
> __noinline int subprog1(struct xdp_md *xdp) { bpf_tail_call_static(xdp, &m, 1); }
> __noinline int subprog2(struct xdp_md *xdp) { bpf_tail_call_static(xdp, &m, 2); }
>
> SEC("xdp") int main(struct xdp_md *xdp)
> {
> 	subprog0(xdp);
> 	subprog1(xdp);
> 	subprog2(xdp);
> 	return XDP_PASS;
> }
>
> All of them, subprog{0,1,2} and main, will be traced.
>
> In this idea, it is to inject pcap-filter expression, the cbpf, using
> elibpcap [1], and to output packets like your xdp-ninja.
>
> It works well during the time I maintained an XDP project.
>
> [1] https://github.com/jschwinger233/elibpcap
Thank you very much for your kind reply.
elibpcap is a very interesting idea as well.
I had also considered whether making a function __noinline might leave
its prologue intact, so that it could be hooked via trampoline. I
actually tried that idea myself, but could not get it working. I have
not investigated it in enough detail, but I suspect it might already be
expanded by the JIT, which could make that approach difficult.
That is why I thought it was excellent that you realized it by taking
the approach of inserting the logic there in the first place, rather
than trying to hook it afterward.
I had also been thinking about supporting use cases where packets need
to be captured only after some packet-related decision has already
been made in the program. For example, there are cases where the value
to match depends on runtime state, such as a Session ID that changes
whenever a tunnel is established. I think this kind of approach is
very helpful for such cases.
So, I was very happy to learn that someone else had been thinking
about a similar technique.
> It looks wonderful.
>
> I developed a similar tool, bpfsnoop [1], to trace BPF progs/subprogs
> and kernel functions with filtering packets/arguments and outputting
> packets/arguments info. However, it lacks the ability of outputting
> packets to pcap file.
>
> [1] https://github.com/bpfsnoop/bpfsnoop
Also, bpfsnoop looks like a very nice tool. I starred it on GitHub :)
Thank you very much again. I am very grateful that you shared such an
excellent idea with me.
Thanks,
Takeru