x86 tail-call fentry patching mirrors CALL text pokes to the tail-call
landing slot.
The helper that locates that mirrored slot assumes an ENDBR-prefixed
landing, which works on IBT JITs but fails on non-IBT JITs where the
landing starts directly with the 5-byte patch slot.
As a result, the regular entry gets patched but the tail-call landing
remains NOP5, so fentry never fires for tail-called programs on non-IBT
kernels.
Anchor the lookup on the landing address, verify the short-jump layout
first, and only check ENDBR when one is actually emitted.
Signed-off-by: Takeru Hayasaka <hayatake396@gmail.com>
---
arch/x86/net/bpf_jit_comp.c | 47 ++++++++++++++++++++++++++++++++++---
1 file changed, 44 insertions(+), 3 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index e9b78040d703..fe5fd37f65d8 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -325,8 +325,10 @@ struct jit_context {
 /* Number of bytes emit_patch() needs to generate instructions */
 #define X86_PATCH_SIZE		5
 
+/* Number of bytes used by the short jump that skips the tail-call hook. */
+#define X86_TAIL_CALL_SKIP_JMP_SIZE	2
 /* Number of bytes that will be skipped on tailcall */
-#define X86_TAIL_CALL_OFFSET	(12 + ENDBR_INSN_SIZE)
+#define X86_TAIL_CALL_OFFSET	(12 + X86_TAIL_CALL_SKIP_JMP_SIZE + ENDBR_INSN_SIZE)
static void push_r9(u8 **pprog)
{
@@ -545,8 +547,15 @@ static void emit_prologue(u8 **pprog, u8 *ip, u32 stack_depth, bool ebpf_from_cb
 		EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
 	}
 
+	if (!is_subprog) {
+		/* Normal entry skips the tail-call-only trampoline hook. */
+		EMIT2(0xEB, ENDBR_INSN_SIZE + X86_PATCH_SIZE);
+	}
+
 	/* X86_TAIL_CALL_OFFSET is here */
 	EMIT_ENDBR();
+	if (!is_subprog)
+		emit_nops(&prog, X86_PATCH_SIZE);
 
 	/* sub rsp, rounded_stack_depth */
 	if (stack_depth)
@@ -632,12 +641,33 @@ static int __bpf_arch_text_poke(void *ip, enum bpf_text_poke_type old_t,
 	return ret;
 }
 
+static void *bpf_tail_call_fentry_ip(void *ip)
+{
+	u8 *tail_ip = ip + X86_TAIL_CALL_OFFSET;
+	u8 *landing = tail_ip - ENDBR_INSN_SIZE;
+
+	/* ip points at the regular fentry slot after the entry ENDBR. */
+	if (landing[-X86_TAIL_CALL_SKIP_JMP_SIZE] != 0xEB ||
+	    landing[-X86_TAIL_CALL_SKIP_JMP_SIZE + 1] !=
+	    ENDBR_INSN_SIZE + X86_PATCH_SIZE)
+		return NULL;
+
+	if (ENDBR_INSN_SIZE && !is_endbr((u32 *)landing))
+		return NULL;
+
+	return tail_ip;
+}
+
 int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type old_t,
 		       enum bpf_text_poke_type new_t, void *old_addr,
 		       void *new_addr)
 {
+	void *tail_ip = NULL;
+	bool is_bpf_text = is_bpf_text_address((long)ip);
+	int ret, tail_ret;
+
 	if (!is_kernel_text((long)ip) &&
-	    !is_bpf_text_address((long)ip))
+	    !is_bpf_text)
 		/* BPF poking in modules is not supported */
 		return -EINVAL;
@@ -648,7 +678,18 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type old_t,
 	if (is_endbr(ip))
 		ip += ENDBR_INSN_SIZE;
 
-	return __bpf_arch_text_poke(ip, old_t, new_t, old_addr, new_addr);
+	if (is_bpf_text && (old_t == BPF_MOD_CALL || new_t == BPF_MOD_CALL))
+		tail_ip = bpf_tail_call_fentry_ip(ip);
+
+	ret = __bpf_arch_text_poke(ip, old_t, new_t, old_addr, new_addr);
+	if (ret < 0 || !tail_ip)
+		return ret;
+
+	tail_ret = __bpf_arch_text_poke(tail_ip, old_t, new_t, old_addr, new_addr);
+	if (tail_ret < 0)
+		return tail_ret;
+
+	return 0;
 }
#define EMIT_LFENCE() EMIT3(0x0F, 0xAE, 0xE8)
--
2.43.0
On Fri, Mar 27, 2026 at 7:16 AM Takeru Hayasaka <hayatake396@gmail.com> wrote:
>
> x86 tail-call fentry patching mirrors CALL text pokes to the tail-call
> landing slot.
>
> The helper that locates that mirrored slot assumes an ENDBR-prefixed
> landing, which works on IBT JITs but fails on non-IBT JITs where the
> landing starts directly with the 5-byte patch slot.

tailcalls are deprecated. We should go the other way and disable them
in the ibt jit instead. The less interaction between fentry and
tailcall the better.

pw-bot: cr
Hi Alexei

Thanks, and Sorry, I sent an older changelog from while I was still
iterating on this, and it described the issue incorrectly.

My changelog made this sound like an IBT/non-IBT-specific issue, but
that was wrong. On current kernels, fentry on tail-called programs is
not supported in either case. Only the regular fentry patch site is
patched; there is no tail-call landing patching in either case, so
disabling IBT does not make it work.

What this series was trying to do was add support for fentry on
tail-called x86 programs. The non-IBT part was only about a bug in my
initial implementation of that support, not the underlying motivation.

The motivation is observability of existing tailcall-heavy BPF/XDP
programs, where tail-called leaf programs are currently a blind spot for
fentry-based debugging.

If supporting fentry on tail-called programs is still not something
you'd want upstream, I understand. If I resend this, I'll fix the
changelog/cover letter to describe it correctly.
On Fri, Mar 27, 2026 at 8:12 AM Takeru Hayasaka <hayatake396@gmail.com> wrote:
>
> Hi Alexei
>
> Thanks, and Sorry, I sent an older changelog from while I was still
> iterating on this, and it described the issue incorrectly.
>
> My changelog made this sound like an IBT/non-IBT-specific issue, but
> that was wrong. On current kernels, fentry on tail-called programs is
> not supported in either case. Only the regular fentry patch site is
> patched; there is no tail-call landing patching in either case, so
> disabling IBT does not make it work.
>
> What this series was trying to do was add support for fentry on
> tail-called x86 programs. The non-IBT part was only about a bug in my
> initial implementation of that support, not the underlying motivation.
>
> The motivation is observability of existing tailcall-heavy BPF/XDP
> programs, where tail-called leaf programs are currently a blind spot for
> fentry-based debugging.

I get that, but I'd rather not open this can of worms.
We had enough headaches when tailcalls, fentry, subprogs are combined.
Like this set:
https://lore.kernel.org/all/20230912150442.2009-1-hffilwlqm@gmail.com/
and the followups.
Understood. I was a bit surprised to read that this area ended up taking
months of follow-up work....

One thing I am still trying to understand is what the preferred
debuggability/observability direction would be for existing
tailcall-heavy BPF/XDP deployments.

Tail calls are already used in practice as a program decomposition
mechanism, especially in XDP pipelines, and that leaves tail-called leaf
programs harder to observe today.

If fentry on tail-called programs is not something you'd want upstream,
is there another direction you would recommend for improving
observability/debuggability of such existing deployments?

On Sat, Mar 28, 2026 at 0:21 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Mar 27, 2026 at 8:12 AM Takeru Hayasaka <hayatake396@gmail.com> wrote:
> >
> > Hi Alexei
> >
> > Thanks, and Sorry, I sent an older changelog from while I was still
> > iterating on this, and it described the issue incorrectly.
> >
> > My changelog made this sound like an IBT/non-IBT-specific issue, but
> > that was wrong. On current kernels, fentry on tail-called programs is
> > not supported in either case. Only the regular fentry patch site is
> > patched; there is no tail-call landing patching in either case, so
> > disabling IBT does not make it work.
> >
> > What this series was trying to do was add support for fentry on
> > tail-called x86 programs. The non-IBT part was only about a bug in my
> > initial implementation of that support, not the underlying motivation.
> >
> > The motivation is observability of existing tailcall-heavy BPF/XDP
> > programs, where tail-called leaf programs are currently a blind spot for
> > fentry-based debugging.
>
> I get that, but I'd rather not open this can of worms.
> We had enough headaches when tailcalls, fentry, subprogs are combined.
> Like this set:
> https://lore.kernel.org/all/20230912150442.2009-1-hffilwlqm@gmail.com/
> and the followups.
On Fri, Mar 27, 2026 at 8:45 AM Takeru Hayasaka <hayatake396@gmail.com> wrote:
>
> Understood. I was a bit surprised to read that this area ended up taking
> months of follow-up work....
>
> One thing I am still trying to understand is what the preferred
> debuggability/observability direction would be for existing
> tailcall-heavy BPF/XDP deployments.
>
> Tail calls are already used in practice as a program decomposition
> mechanism, especially in XDP pipelines, and that leaves tail-called leaf
> programs harder to observe today.
>
> If fentry on tail-called programs is not something you'd want upstream,
> is there another direction you would recommend for improving
> observability/debuggability of such existing deployments?

You don't need fentry to debug. perf works just fine on all bpf progs
whether tailcall or not.

Also pls don't top post.
Sorry about the top-posting.

That makes sense, thanks. I agree perf can provide visibility into which
BPF programs are running, including tail-called ones.

What I am still unsure about is packet-level / structured-data
observability. My use case is closer to xdpdump-style debugging, where I
want to inspect packet-related context from specific XDP leaf programs
in a live pipeline.

That feels harder to express with perf alone, so I am trying to
understand what the preferred direction would be for that kind of use
case in tailcall-heavy XDP deployments.
On Fri, Mar 27, 2026 at 9:06 AM Takeru Hayasaka <hayatake396@gmail.com> wrote:
>
> Sorry about the top-posting.

yet you're still top posting :(

> That makes sense, thanks. I agree perf can provide visibility into which
> BPF programs are running, including tail-called ones.
>
> What I am still unsure about is packet-level / structured-data
> observability. My use case is closer to xdpdump-style debugging, where I
> want to inspect packet-related context from specific XDP leaf programs
> in a live pipeline.

see how cilium did it. with pwru tool, etc.
> yet you're still top posting :(

Sorry about that. I misunderstood what top posting meant and ended up
replying in the wrong style. I had not understood that it referred to
quoting in that way, and I am embarrassed that I got it wrong....

> see how cilium did it. with pwru tool, etc.

Thank you for the suggestion.
As for pwru, I had thought it was not able to capture packet data such as pcap,
and understood it more as a tool to trace where a specific packet
enters the processing path and how it is handled.

For example, in an environment where systems are already
interconnected and running, I sometimes want to capture the actual
packets being sent for real processing.
On the other hand, if the goal is simply to observe processing safely
in a development environment, I think tools such as ipftrace2 or pwru
can be very useful.
On 28/3/26 00:30, Takeru Hayasaka wrote:
>> see how cilium did it. with pwru tool, etc.
>
> Thank you for the suggestion.
> As for pwru, I had thought it was not able to capture packet data such as pcap,
> and understood it more as a tool to trace where a specific packet
> enters the processing path and how it is handled.
>
> For example, in an environment where systems are already
> interconnected and running, I sometimes want to capture the actual
> packets being sent for real processing.
> On the other hand, if the goal is simply to observe processing safely
> in a development environment, I think tools such as ipftrace2 or pwru
> can be very useful.
>

Sounds like you are developing/maintaining an XDP project.

If so, and the kernel carries the patches in
https://lore.kernel.org/all/20230912150442.2009-1-hffilwlqm@gmail.com/,
recommend modifying the XDP project using dispatcher like libxdp [1].
Then, you are able to trace the subprogs which aim to run tail calls;
meanwhile, you are able to filter packets using pcap-filter, and to
output packets using bpf_xdp_output() helper.

[1] https://github.com/xdp-project/xdp-tools/blob/main/lib/libxdp/xdp-dispatcher.c.in

Thanks,
Leon
> Sounds like you are developing/maintaining an XDP project.
>
> If so, and the kernel carries the patches in
> https://lore.kernel.org/all/20230912150442.2009-1-hffilwlqm@gmail.com/,
> recommend modifying the XDP project using dispatcher like libxdp [1].
> Then, you are able to trace the subprogs which aim to run tail calls;
> meanwhile, you are able to filter packets using pcap-filter, and to
> output packets using bpf_xdp_output() helper.
>
> [1]
> https://github.com/xdp-project/xdp-tools/blob/main/lib/libxdp/xdp-dispatcher.c.in

Thank you very much for your wonderful comment, Leon.
This was the first time I learned that such a mechanism exists.

It is a very interesting ecosystem.
If I understand correctly, the idea is to invoke a component that dumps
pcap data as one of the tail-called components, right?

Thank you very much for sharing this idea with me.
If I have a chance to write a new XDP program in the future, I would
definitely like to try it.

On the other hand, I feel that it is somewhat difficult to apply this
idea directly to existing codebases, or to cases where the code is
written in Go using something like cilium/ebpf.
Also, when it comes to code running in production environments, making
changes itself can be difficult.

For that reason, I prototyped a tool like this.
It is something like a middle ground between xdpdump and xdpcap.
I built it so that only packets matched by cbpf are sent up through
perf, and while testing it, I noticed that it does not work well for
targets invoked via tail call.
This is what motivated me to send the patch.

https://github.com/takehaya/xdp-ninja

Once again, thank you for sharing the idea.
Takeru
On 31/3/26 00:46, Takeru Hayasaka wrote:
>> Sounds like you are developing/maintaining an XDP project.
>>
>> If so, and the kernel carries the patches in
>> https://lore.kernel.org/all/20230912150442.2009-1-hffilwlqm@gmail.com/,
>> recommend modifying the XDP project using dispatcher like libxdp [1].
>> Then, you are able to trace the subprogs which aim to run tail calls;
>> meanwhile, you are able to filter packets using pcap-filter, and to
>> output packets using bpf_xdp_output() helper.
>>
>> [1]
>> https://github.com/xdp-project/xdp-tools/blob/main/lib/libxdp/xdp-dispatcher.c.in
>
> Thank you very much for your wonderful comment, Leon.
> This was the first time I learned that such a mechanism exists.
>
> It is a very interesting ecosystem.
> If I understand correctly, the idea is to invoke a component that
> dumps pcap data as one of the tail-called components, right?
It is similar to xdp-ninja/xdp-dump.
However, this idea has one more step forward: it is to trace the
subprogs instead of only the main prog.
For example,
__noinline int subprog0(struct xdp_md *xdp) { bpf_tail_call_static(xdp, &m, 0); }
__noinline int subprog1(struct xdp_md *xdp) { bpf_tail_call_static(xdp, &m, 1); }
__noinline int subprog2(struct xdp_md *xdp) { bpf_tail_call_static(xdp, &m, 2); }

SEC("xdp") int main(struct xdp_md *xdp)
{
	subprog0(xdp);
	subprog1(xdp);
	subprog2(xdp);
	return XDP_PASS;
}
All of them, subprog{0,1,2} and main, will be traced.
In this idea, it is to inject pcap-filter expression, the cbpf, using
elibpcap [1], and to output packets like your xdp-ninja.
It works well during the time I maintained an XDP project.
[1] https://github.com/jschwinger233/elibpcap
> Thank you very much for sharing this idea with me.
> If I have a chance to write a new XDP program in the future, I would
> definitely like to try it.
>
> On the other hand, I feel that it is somewhat difficult to apply this
> idea directly to existing codebases, or to cases where the code is
> written in Go using something like cilium/ebpf.
> Also, when it comes to code running in production environments, making
> changes itself can be difficult.
Correct. If you cannot modify the code, and the tail calls are not made
from inner subprogs, the aforementioned idea cannot help trace the tail
callees.
>
> For that reason, I prototyped a tool like this.
> It is something like a middle ground between xdpdump and xdpcap.
> I built it so that only packets matched by cbpf are sent up through
> perf, and while testing it, I noticed that it does not work well for
> targets invoked via tail call.
> This is what motivated me to send the patch.
>
I had a similar idea years ago, a more generic tracer for tail calls.
However, given Alexei's concern, I won't post it.
> https://github.com/takehaya/xdp-ninja
>
It looks wonderful.
I developed a similar tool, bpfsnoop [1], to trace BPF progs/subprogs
and kernel functions with filtering packets/arguments and outputting
packets/arguments info. However, it lacks the ability of outputting
packets to pcap file.
[1] https://github.com/bpfsnoop/bpfsnoop
Thanks,
Leon
> Once again, thank you for sharing the idea.
> Takeru
> It is similar to xdp-ninja/xdp-dump.
>
> However, this idea has one more step forward: it is to trace the
> subprogs instead of only the main prog.
>
> For example,
>
> __noinline int subprog0(struct xdp_md *xdp) { bpf_tail_call_static(xdp, &m, 0); }
> __noinline int subprog1(struct xdp_md *xdp) { bpf_tail_call_static(xdp, &m, 1); }
> __noinline int subprog2(struct xdp_md *xdp) { bpf_tail_call_static(xdp, &m, 2); }
>
> SEC("xdp") int main(struct xdp_md *xdp)
> {
> 	subprog0(xdp);
> 	subprog1(xdp);
> 	subprog2(xdp);
> 	return XDP_PASS;
> }
>
> All of them, subprog{0,1,2} and main, will be traced.
>
> In this idea, it is to inject pcap-filter expression, the cbpf, using
> elibpcap [1], and to output packets like your xdp-ninja.
>
> It works well during the time I maintained an XDP project.
>
> [1] https://github.com/jschwinger233/elibpcap
Thank you very much for your kind reply.
elibpcap is a very interesting idea as well.
I had also considered whether making a function __noinline might leave
its prologue intact, so that it could be hooked via trampoline. I
actually tried that idea myself, but could not get it working. I have
not investigated it in enough detail, but I suspect it might already be
expanded by the JIT, which could make that approach difficult.
That is why I thought it was excellent that you realized it by taking
the approach of inserting the logic there in the first place, rather
than trying to hook it afterward.
I had also been thinking about supporting use cases where packets need
to be captured only after some packet-related decision has already
been made in the program. For example, there are cases where the value
to match depends on runtime state, such as a Session ID that changes
whenever a tunnel is established. I think this kind of approach is
very helpful for such cases.
So, I was very happy to learn that someone else had been thinking
about a similar technique.
> It looks wonderful.
>
> I developed a similar tool, bpfsnoop [1], to trace BPF progs/subprogs
> and kernel functions with filtering packets/arguments and outputting
> packets/arguments info. However, it lacks the ability of outputting
> packets to pcap file.
>
> [1] https://github.com/bpfsnoop/bpfsnoop
Also, bpfsnoop looks like a very nice tool. I starred it on GitHub :)
Thank you very much again. I am very grateful that you shared such an
excellent idea with me.
Thanks,
Takeru