The grace period used internally within tracepoint.c:release_probes()
uses call_rcu() to batch waiting for quiescence of old probe arrays,
rather than using tracepoint_synchronize_unregister(), which blocks
while waiting for quiescence.

Now that faultable syscall tracepoints dereference their probe arrays
under rcu_read_lock_trace() rather than with preemption disabled, a
single classic RCU grace period is not sufficient, which causes
use-after-free issues reproduced with syzkaller.

Fix this by introducing the following call_rcu chaining:

  call_rcu() -> rcu_free_old_probes() -> call_rcu_tasks_trace()
             -> rcu_tasks_trace_free_old_probes()

just like it was done when SRCU was used.
Reported-by: syzbot+b390c8062d8387b6272a@syzkaller.appspotmail.com
Fixes: a363d27cdbc2 ("tracing: Allow system call tracepoints to handle page faults")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Jordan Rife <jrife@google.com>
---
kernel/tracepoint.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 6474e2cf22c9..33f6fa94d383 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -101,11 +101,16 @@ static inline void *allocate_probes(int count)
 	return p == NULL ? NULL : p->probes;
 }
 
-static void rcu_free_old_probes(struct rcu_head *head)
+static void rcu_tasks_trace_free_old_probes(struct rcu_head *head)
 {
 	kfree(container_of(head, struct tp_probes, rcu));
 }
 
+static void rcu_free_old_probes(struct rcu_head *head)
+{
+	call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes);
+}
+
 static inline void release_probes(struct tracepoint_func *old)
 {
 	if (old) {
--
2.39.5
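For context, the call_rcu() at the head of this chain is queued from
release_probes(), which the patch leaves unchanged. A minimal sketch of
that caller, paraphrased from kernel/tracepoint.c (treat the details as
approximate):

	/* The old probe array is embedded in a struct tp_probes; freeing
	 * it is deferred by queuing its rcu_head on a classic RCU grace
	 * period, whose callback (rcu_free_old_probes) now re-queues it
	 * on an RCU Tasks Trace grace period before the final kfree().
	 */
	static inline void release_probes(struct tracepoint_func *old)
	{
		if (old) {
			struct tp_probes *tp_probes = container_of(old,
				struct tp_probes, probes[0]);

			call_rcu(&tp_probes->rcu, rcu_free_old_probes);
		}
	}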
I assume this patch isn't meant to fix the related issues with freeing
BPF programs/links with call_rcu?

On the BPF side I think there needs to be some smarter handling of when
to use call_rcu or call_rcu_tasks_trace to free links/programs, based
on whether or not the program type can be executed in this context.
Right now call_rcu_tasks_trace is used if the program is sleepable, but
that isn't necessarily the case here. Off the top of my head this would
be BPF_PROG_TYPE_RAW_TRACEPOINT and BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
but may extend to BPF_PROG_TYPE_TRACEPOINT? I'll let some of the BPF
folks chime in here, as I'm not entirely sure.

-Jordan
On 2024-10-22 12:14, Jordan Rife wrote:
> I assume this patch isn't meant to fix the related issues with freeing
> BPF programs/links with call_rcu?

No, indeed. I notice that bpf_link_free() uses a prog->sleepable flag
to choose between:

	if (sleepable)
		call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
	else
		call_rcu(&link->rcu, bpf_link_defer_dealloc_rcu_gp);

But the faultable syscall tracepoint series does not require syscall
programs to be sleepable. So some changes may be needed on the ebpf
side there.

> On the BPF side I think there needs to be some smarter handling of
> when to use call_rcu or call_rcu_tasks_trace to free links/programs
> based on whether or not the program type can be executed in this
> context. Right now call_rcu_tasks_trace is used if the program is
> sleepable, but that isn't necessarily the case here. Off the top of my
> head this would be BPF_PROG_TYPE_RAW_TRACEPOINT and
> BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, but may extend to
> BPF_PROG_TYPE_TRACEPOINT? I'll let some of the BPF folks chime in
> here, as I'm not entirely sure.

A big hammer solution would be to make all grace periods waited for
after a bpf tracepoint probe unregister chain call_rcu and
call_rcu_tasks_trace.

Else, if we properly tag all programs attached to syscall tracepoints
as sleepable, then keeping the call_rcu_tasks_trace() only for those
would work.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Tue, Oct 22, 2024 at 10:55 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2024-10-22 12:14, Jordan Rife wrote:
> > I assume this patch isn't meant to fix the related issues with freeing
> > BPF programs/links with call_rcu?
>
> No, indeed. I notice that bpf_link_free() uses a prog->sleepable flag
> to choose between:
>
>	if (sleepable)
>		call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
>	else
>		call_rcu(&link->rcu, bpf_link_defer_dealloc_rcu_gp);
>
> But the faultable syscall tracepoint series does not require syscall
> programs to be sleepable. So some changes may be needed on the ebpf
> side there.

Your fix now adds a chain of call_rcu -> call_rcu_tasks_trace -> kfree,
which should work regardless of sleepable/non-sleepable. For the BPF
side, yes, we do different things depending on prog->sleepable (adding
an extra call_rcu_tasks_trace for sleepable, while still keeping
call_rcu in the chain), so the BPF side should be good, I think.

> > On the BPF side I think there needs to be some smarter handling of
> > when to use call_rcu or call_rcu_tasks_trace to free links/programs
> > based on whether or not the program type can be executed in this
> > context. [...]

From the BPF standpoint, as of right now, neither RAW_TRACEPOINT nor
TRACEPOINT programs are sleepable, so a single RCU grace period is
fine. But even if they were (and we'll allow that later on), we handle
sleepable programs with the same call_rcu_tasks_trace -> call_rcu
chain.

That's just to say that I don't think that we need any BPF-specific
fix beyond what Mathieu is doing in this patch, so:

Acked-by: Andrii Nakryiko <andrii@kernel.org>

> A big hammer solution would be to make all grace periods waited for
> after a bpf tracepoint probe unregister chain call_rcu and
> call_rcu_tasks_trace.
>
> Else, if we properly tag all programs attached to syscall tracepoints
> as sleepable, then keeping the call_rcu_tasks_trace() only for those
> would work.
On 2024-10-22 15:53, Andrii Nakryiko wrote:
> On Tue, Oct 22, 2024 at 10:55 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
[...]
> From the BPF standpoint, as of right now, neither RAW_TRACEPOINT nor
> TRACEPOINT programs are sleepable, so a single RCU grace period is
> fine. But even if they were (and we'll allow that later on), we handle
> sleepable programs with the same call_rcu_tasks_trace -> call_rcu
> chain.

Good points. In this commit:

  commit 4aadde89d8 ("tracing/bpf: disable preemption in syscall probe")

I took care to disable preemption around use of the bpf program
attached to a syscall tracepoint, which makes this change a no-op from
the tracers' perspective. It's only when you decide to remove this
preempt-off section and allow syscall tracepoints to sleep in bpf that
you'll need to tweak that.

> That's just to say that I don't think that we need any BPF-specific
> fix beyond what Mathieu is doing in this patch, so:
>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Tue, 22 Oct 2024 16:04:49 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> > That's just to say that I don't think that we need any BPF-specific
> > fix beyond what Mathieu is doing in this patch, so:
> >
> > Acked-by: Andrii Nakryiko <andrii@kernel.org>
>
> Thanks!

Does this mean I can pull this patch as is? I don't usually take RFCs.

-- Steve
On 2024-10-22 20:20, Steven Rostedt wrote:
> Does this mean I can pull this patch as is? I don't usually take RFCs.

It's only compile-tested on my end, so we should at least get a
Tested-by before I feel confident posting this as a non-RFC patch.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
> so we should at least get a Tested-by

I could apply this and run the syz repro script on my end if that
helps, since I already have everything set up.
Mathieu's patch alone does not seem to be enough to prevent the
use-after-free issue reported by syzbot.

Link: https://lore.kernel.org/bpf/67121037.050a0220.10f4f4.000f.GAE@google.com/T/#u

I reran the repro script with his patch applied to my tree and was
still able to get the same KASAN crash to occur.

In this case, when bpf_link_free is invoked it kicks off three
instances of call_rcu*.

bpf_link_free()
  ops->release()
    bpf_raw_tp_link_release()
      bpf_probe_unregister()
        tracepoint_probe_unregister()
          tracepoint_remove_func()
            release_probes()
              call_rcu()      [1]
  bpf_prog_put()
    __bpf_prog_put()
      bpf_prog_put_deferred()
        __bpf_prog_put_noref()
          call_rcu()          [2]
  call_rcu()                  [3]

With Mathieu's patch, [1] is chained with call_rcu_tasks_trace(),
making the grace period sufficiently long to safely free the probe
itself. The callback for [2] and [3] may be invoked before the
call_rcu_tasks_trace() grace period has elapsed, leading to the link or
program itself being freed while still in use. I was able to prevent
any crashes with the patch below, which also chains
call_rcu_tasks_trace() and call_rcu() at [2] and [3].

---
 kernel/bpf/syscall.c | 24 ++++++++++--------------
 1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 59de664e580d..5290eccb465e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2200,6 +2200,14 @@ static void __bpf_prog_put_rcu(struct rcu_head *rcu)
 	bpf_prog_free(aux->prog);
 }
 
+static void __bpf_prog_put_tasks_trace_rcu(struct rcu_head *rcu)
+{
+	if (rcu_trace_implies_rcu_gp())
+		__bpf_prog_put_rcu(rcu);
+	else
+		call_rcu(rcu, __bpf_prog_put_rcu);
+}
+
 static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred)
 {
 	bpf_prog_kallsyms_del_all(prog);
@@ -2212,10 +2220,7 @@ static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred)
 		btf_put(prog->aux->attach_btf);
 
 	if (deferred) {
-		if (prog->sleepable)
-			call_rcu_tasks_trace(&prog->aux->rcu, __bpf_prog_put_rcu);
-		else
-			call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
+		call_rcu_tasks_trace(&prog->aux->rcu, __bpf_prog_put_tasks_trace_rcu);
 	} else {
 		__bpf_prog_put_rcu(&prog->aux->rcu);
 	}
@@ -2996,24 +3001,15 @@ static void bpf_link_defer_dealloc_mult_rcu_gp(struct rcu_head *rcu)
 static void bpf_link_free(struct bpf_link *link)
 {
 	const struct bpf_link_ops *ops = link->ops;
-	bool sleepable = false;
 
 	bpf_link_free_id(link->id);
 	if (link->prog) {
-		sleepable = link->prog->sleepable;
 		/* detach BPF program, clean up used resources */
 		ops->release(link);
 		bpf_prog_put(link->prog);
 	}
 	if (ops->dealloc_deferred) {
-		/* schedule BPF link deallocation; if underlying BPF program
-		 * is sleepable, we need to first wait for RCU tasks trace
-		 * sync, then go through "classic" RCU grace period
-		 */
-		if (sleepable)
-			call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
-		else
-			call_rcu(&link->rcu, bpf_link_defer_dealloc_rcu_gp);
+		call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
 	} else if (ops->dealloc)
 		ops->dealloc(link);
 }
-- 
2.47.0.105.g07ac214952-goog
On 2024-10-23 10:56, Jordan Rife wrote:
> Mathieu's patch alone does not seem to be enough to prevent the
> use-after-free issue reported by syzbot.
>
> Link: https://lore.kernel.org/bpf/67121037.050a0220.10f4f4.000f.GAE@google.com/T/#u
>
> I reran the repro script with his patch applied to my tree and was
> still able to get the same KASAN crash to occur.
[...]
> With Mathieu's patch, [1] is chained with call_rcu_tasks_trace(),
> making the grace period sufficiently long to safely free the probe
> itself. The callback for [2] and [3] may be invoked before the
> call_rcu_tasks_trace() grace period has elapsed, leading to the link
> or program itself being freed while still in use. I was able to
> prevent any crashes with the patch below, which also chains
> call_rcu_tasks_trace() and call_rcu() at [2] and [3].

Right, so removal of the tracepoint probe is done by
tracepoint_probe_unregister by effectively removing the probe function
from the array. The read-side counterpart of that is in __DO_TRACE(),
where the rcu dereference is protected by rcu_read_lock_trace for
syscall tracepoints now.

We cannot expect that surrounding the ebpf probe execution with preempt
disable like so:

#define __BPF_DECLARE_TRACE_SYSCALL(call, proto, args)			\
static notrace void							\
__bpf_trace_##call(void *__data, proto)					\
{									\
	might_fault();							\
	preempt_disable_notrace();					\
	CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args)); \
	preempt_enable_notrace();					\
}

is sufficient to delay reclaim with call_rcu() after a tracepoint
unregister, because the preempt disable does not include the rcu
dereference done by the tracepoint in its critical section.

So relying on a call_rcu() to delay reclaim of the bpf objects after
unregistering their associated tracepoint is indeed not enough.
Chaining call_rcu with call_rcu_tasks_trace works though.

That question is relevant for ftrace and perf too: are there data
structures that are reclaimed with call_rcu() after being unregistered
from syscall tracepoints?

Thanks Jordan for your thorough analysis,

Mathieu

[...]

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
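To make the window Mathieu describes concrete, here is a hand-written
sketch (simplified pseudo-C, not actual kernel code; names are
illustrative) of why a classic call_rcu() grace period can elapse while
a faultable syscall tracepoint is still using the old probe array:

	/* CPU 0: syscall tracepoint fires (illustrative pseudo-C). */
	rcu_read_lock_trace();			/* permits blocking: does not
						 * hold off a classic RCU GP */
	funcs = rcu_dereference(tp->funcs);	/* old probe array */
	/* ...a probe may fault on user memory here and sleep... */
	preempt_disable_notrace();
	run_bpf_prog(funcs->data);		/* bpf probe, preempt off */
	preempt_enable_notrace();
	rcu_read_unlock_trace();

	/* CPU 1: concurrent unregister. */
	tracepoint_probe_unregister(tp, probe, data);
	call_rcu(&old->rcu, free_old_probes);
	/* The classic grace period can complete while CPU 0 sleeps inside
	 * its rcu_read_lock_trace() section, so free_old_probes() can run
	 * and free funcs while CPU 0 is still using it: use-after-free.
	 */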
On Wed, 23 Oct 2024 11:13:53 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> That question is relevant for ftrace and perf too: are there data
> structures that are reclaimed with call_rcu() after being unregistered
> from syscall tracepoints?

ftrace doesn't assume RCU for when to synchronize with tracepoints
being unregistered. It uses:

  tracepoint_synchronize_unregister()

-- Steve
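For reference, a minimal sketch of that synchronous pattern (the caller
shape here is an assumption sketched from memory;
tracepoint_synchronize_unregister() itself is the real API):

	/* Detach, then block until every tracepoint read-side critical
	 * section that might still see the old probe has finished --
	 * with the faultable-tracepoints series this waits for both the
	 * RCU Tasks Trace and classic RCU grace periods -- and only
	 * then free the probe's private data.
	 */
	tracepoint_probe_unregister(tp, my_probe, my_data);
	tracepoint_synchronize_unregister();
	kfree(my_data);		/* my_probe/my_data are illustrative names */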
On Wed, Oct 23, 2024 at 7:56 AM Jordan Rife <jrife@google.com> wrote:
>
> Mathieu's patch alone does not seem to be enough to prevent the
> use-after-free issue reported by syzbot.
>
> Link: https://lore.kernel.org/bpf/67121037.050a0220.10f4f4.000f.GAE@google.com/T/#u
>
> I reran the repro script with his patch applied to my tree and was
> still able to get the same KASAN crash to occur.
[...]
> 	if (ops->dealloc_deferred) {
> -		/* schedule BPF link deallocation; if underlying BPF program
> -		 * is sleepable, we need to first wait for RCU tasks trace
> -		 * sync, then go through "classic" RCU grace period
> -		 */
> -		if (sleepable)
> -			call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
> -		else
> -			call_rcu(&link->rcu, bpf_link_defer_dealloc_rcu_gp);
> +		call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);

This patch is completely wrong. Looks like Mathieu's patch broke the
bpf program contract somewhere. The tracepoint type bpf programs must
be called with rcu_read_lock held. Looks like it's not happening
anymore.
On 2024-10-23 11:14, Alexei Starovoitov wrote:
> On Wed, Oct 23, 2024 at 7:56 AM Jordan Rife <jrife@google.com> wrote:
[...]
>> I was able to prevent any crashes with the patch below, which also
>> chains call_rcu_tasks_trace() and call_rcu() at [2] and [3].
[...]
> This patch is completely wrong.

Actually I suspect Jordan's patch works.

> Looks like Mathieu's patch broke the bpf program contract somewhere.

My patch series introduced this in the probe:

#define __BPF_DECLARE_TRACE_SYSCALL(call, proto, args)			\
static notrace void							\
__bpf_trace_##call(void *__data, proto)					\
{									\
	might_fault();							\
	preempt_disable_notrace();					\
	CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args)); \
	preempt_enable_notrace();					\
}

to ensure we'd call the bpf program from preempt-off context.

> The tracepoint type bpf programs must be called with rcu_read_lock held.

Calling the bpf program with preempt off is equivalent. __DO_TRACE()
calls the probes with preempt_disable_notrace() in the normal case.

> Looks like it's not happening anymore.

The issue here is not about the context in which the bpf program runs;
that's still preempt off. The problem is about expectations that a
call_rcu grace period is enough to delay reclaim after unregistration
of the tracepoint. Now that __DO_TRACE() uses rcu_read_lock_trace() to
protect the RCU dereference, it's not sufficient anymore, at least for
syscall tracepoints.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Wed, 23 Oct 2024 11:19:40 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> > Looks like Mathieu's patch broke the bpf program contract somewhere.
>
> My patch series introduced this in the probe:
>
> #define __BPF_DECLARE_TRACE_SYSCALL(call, proto, args)			\
> static notrace void							\
> __bpf_trace_##call(void *__data, proto)					\
> {									\
>	might_fault();							\
>	preempt_disable_notrace();					\

Is the problem that we can call this function *after* the prog has been
freed? That is, the preempt_disable_notrace() here is meaningless.

Is there a way to add something here to make sure the program is still
valid? Like set a flag in the link structure?

(I don't know how BPF works well enough to know what is involved here,
so excuse me if this is totally off)

-- Steve

>	CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args)); \
>	preempt_enable_notrace();					\
> }
> I don't have patches in front of me, so I'm guessing a bit.
> So pls remind me where all these patches are?
> What tree/branch are we talking about?

I tested this in linux-next. Here is the patch series.

Link: https://lore.kernel.org/bpf/20241009010718.2050182-1-mathieu.desnoyers@efficios.com/T/#u

> and in addition all tracepoints (both sleepable and not)
> should deref __data under normal rcu_read_lock() before
> passing that pointer into __bpf_trace_##call,
> because that's how bpf link and prog are rcu protected.

I think this is the crux of the issue.

> #define __DO_TRACE_CALL(name, args)					\
>	do {								\
>		struct tracepoint_func *it_func_ptr;			\
>		void *__data;						\
>		it_func_ptr =						\
>			rcu_dereference_raw((&__tracepoint_##name)->funcs); \
>		if (it_func_ptr) {					\
>			__data = (it_func_ptr)->data;			\
>			static_call(tp_func_##name)(__data, args);	\
>		}							\
>	} while (0)
> #else

This code now executes under rcu_read_lock_trace for syscall
tracepoints, where __data gets dereferenced, but call_rcu is used to
free non-sleepable programs/links inside bpf_link_free. Is it viable to
just use rcu_read_lock()/rcu_read_unlock() here to denote an RCU
read-side critical section around where __data is used?

> We're not going to penalize all bpf progs for that.
> This patch is a non-starter.

Fair enough. This could probably be more targeted, using
call_rcu_tasks_trace only for raw tracepoint programs or those attached
to syscall tracepoints instead of doing this chaining for everything.
That is, assuming a change here is even needed at all and that this
issue can't be fixed outside of the BPF stuff. I'm not sure whether or
not that's the case.

-Jordan
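For illustration, Jordan's question amounts to something like the
following sketch (hypothetical, not a tested fix; it reuses the names
from the macro quoted above). Note the tension it exposes: a faultable
probe may sleep, which is not allowed inside a classic RCU read-side
critical section:

#define __DO_TRACE_CALL(name, args)					\
	do {								\
		struct tracepoint_func *it_func_ptr;			\
		void *__data;						\
		/* Hypothetical: a classic RCU read section around the	\
		 * funcs/__data use, so a plain call_rcu() grace	\
		 * period would delay reclaim. This is only valid if	\
		 * the probe cannot sleep, which faultable syscall	\
		 * probes can -- hence the open question.		\
		 */							\
		rcu_read_lock();					\
		it_func_ptr =						\
			rcu_dereference_raw((&__tracepoint_##name)->funcs); \
		if (it_func_ptr) {					\
			__data = (it_func_ptr)->data;			\
			static_call(tp_func_##name)(__data, args);	\
		}							\
		rcu_read_unlock();					\
	} while (0)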
On Wed, Oct 23, 2024 at 7:05 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Is the problem that we can call this function *after* the prog has been
> freed? That is, the preempt_disable_notrace() here is meaningless.

Yes, I think so.

> Is there a way to add something here to make sure the program is still
> valid? Like set a flag in the link structure?

So I think a big assumption right now is that waiting for an RCU grace
period is enough to make sure that the BPF link (and thus its
underlying BPF program) is not referenced from that tracepoint anymore,
and so we can proceed with freeing the memory. Now that some
tracepoints are sleepable and are RCU Tasks Trace protected, this
assumption is wrong.

One solution might be to teach the BPF raw tracepoint link to recognize
sleepable tracepoints, and then go through a call_rcu_tasks_trace ->
call_rcu chain instead of a normal call_rcu. Similarly, for such cases
we'd need to do the same chain for the underlying BPF program, even if
the BPF program itself is not sleepable.

Alternatively, we'd need to add synchronize_rcu() +
synchronize_rcu_tasks_trace() somewhere inside
tracepoint_probe_unregister() or bpf_probe_unregister(), which is just
a thin wrapper around the former. That would make detaching multiple
tracepoints extremely slow (just like the problem we had with kprobe
detachment before multi-kprobes were added).

The fundamental reason for this is how we do lifetime tracking between
the tracepoint object and bpf_link/bpf_prog. The tracepoint doesn't
hold a refcount on the bpf_link. Rather, when the bpf_link's last
refcount drops to zero, we go and do this:

static void bpf_raw_tp_link_release(struct bpf_link *link)
{
	struct bpf_raw_tp_link *raw_tp =
		container_of(link, struct bpf_raw_tp_link, link);

	bpf_probe_unregister(raw_tp->btp, raw_tp);
	bpf_put_raw_tracepoint(raw_tp->btp);
}

And the assumption is that after bpf_probe_unregister() it should be
fine to free the BPF link and program after call_rcu(). So we either
make bpf_probe_unregister() synchronously wait for
synchronize_rcu[_tasks_trace](), or we make sure not to free the
link/prog until call_rcu_tasks_trace -> call_rcu. I'd prefer the latter
(add call_rcu_tasks_trace for sleepable BPF raw tracepoint links).

You guys said you have a reproducer, right? Can you please share the
details (I know it's somewhere on another thread, but let's put all
this in this thread). And as Alexei asked, where are the applied
patches adding faultable tracepoints?

> (I don't know how BPF works well enough to know what is involved here,
> so excuse me if this is totally off)

It's not off, but I don't think we want the tracepoint to hold a
refcount on the bpf_link (and thus the bpf_prog), because that
invalidates how we do detachment.
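For concreteness, a sketch of the synchronous alternative Andrii
mentions (hypothetical; today's bpf_probe_unregister() is just the thin
wrapper he describes, and the waits shown here are exactly the detach
latency he objects to):

/* Hypothetical: make unregistration wait out both grace periods so the
 * caller may free the link/prog immediately afterwards. This would
 * serialize every detach on full grace periods, which is why the
 * chained-callback approach is preferred.
 */
static int bpf_probe_unregister(struct bpf_raw_event_map *btp,
				struct bpf_raw_tp_link *link)
{
	int err;

	err = tracepoint_probe_unregister(btp->tp,
					  (void *)btp->bpf_func, link);
	if (err)
		return err;

	synchronize_rcu_tasks_trace();	/* faultable tracepoint readers */
	synchronize_rcu();		/* classic RCU readers */
	return 0;
}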
> One solution might be to teach the BPF raw tracepoint link to recognize
> sleepable tracepoints, and then go through a call_rcu_tasks_trace ->
> call_rcu chain instead of a normal call_rcu. Similarly, for such cases
> we'd need to do the same chain for the underlying BPF program, even if
> the BPF program itself is not sleepable.

I don't suppose that tracepoints could themselves be marked as
sleepable (e.g. a new 'sleepable' member of `struct tracepoint`), which
could be checked when initializing or freeing the link? Something like
this,

static void bpf_link_defer_bpf_prog_put(struct rcu_head *rcu)
{
	struct bpf_prog_aux *aux = container_of(rcu, struct bpf_prog_aux, rcu);

	bpf_prog_put(aux->prog);
}

/* bpf_link_free is guaranteed to be called from process context */
static void bpf_link_free(struct bpf_link *link)
{
	const struct bpf_link_ops *ops = link->ops;
	bool sleepable = false;

+	if (ops->attachment_is_sleepable)
+		sleepable = ops->attachment_is_sleepable(link);
+
	bpf_link_free_id(link->id);
	if (link->prog) {
-		sleepable = link->prog->sleepable;
+		sleepable = sleepable || link->prog->sleepable;
		/* detach BPF program, clean up used resources */
		ops->release(link);
-		bpf_prog_put(link->prog);
+		if (sleepable)
+			call_rcu_tasks_trace(&link->prog->aux->rcu,
+					     bpf_link_defer_bpf_prog_put);
+		else
+			bpf_prog_put(link->prog);
	}
	if (ops->dealloc_deferred) {
		/* schedule BPF link deallocation; if underlying BPF program
		...
	}
}

static bool bpf_raw_tp_link_attachment_is_sleepable(struct bpf_link *link)
{
	struct bpf_raw_tp_link *raw_tp =
		container_of(link, struct bpf_raw_tp_link, link);

	return raw_tp->btp->tp->sleepable;
}

where if the attachment point of the link is sleepable, as with BPF raw
syscall tracepoints, then wait for the RCU tasks trace grace period to
elapse before freeing up the program and link.

-Jordan
On Fri, Oct 25, 2024 at 8:01 AM Jordan Rife <jrife@google.com> wrote:
>
> I don't suppose that tracepoints could themselves be marked as
> sleepable (e.g. a new 'sleepable' member of `struct tracepoint`), which
> could be checked when initializing or freeing the link?
[...]
> where if the attachment point of the link is sleepable, as with BPF raw
> syscall tracepoints, then wait for the RCU tasks trace grace period to
> elapse before freeing up the program and link.

Yes, that's the direction I'm leaning towards (though the
implementation details would be different; I don't think we need an
attachment_is_sleepable callback). I replied to Mathieu's patch. I'll
try to do fixes for the BPF side next week, hope that works.

> -Jordan
On 2024-10-24 13:12, Andrii Nakryiko wrote:
[...]
> And as Alexei asked, where are the applied
> patches adding faultable tracepoints?

On linux-next next-20241024:

0850e1bc88b1 tracing/bpf: Add might_fault check to syscall probes
cdb537ac4179 tracing/perf: Add might_fault check to syscall probes
a3204c740a59 tracing/ftrace: Add might_fault check to syscall probes
a363d27cdbc2 tracing: Allow system call tracepoints to handle page faults
             <-- this is where bisection points as the first offending commit
4aadde89d81f tracing/bpf: disable preemption in syscall probe
65e7462a16ce tracing/perf: disable preemption in syscall probe
13d750c2c03e tracing/ftrace: disable preemption in syscall probe
0e6caab8db8b tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
> You guys said you have a reproducer, right? Can you please share
> details (I know it's somewhere on another thread, but let's put all
> this in this thread).

For reference, the original syzbot report is here along with links to
artifacts.

Link: https://lore.kernel.org/bpf/67121037.050a0220.10f4f4.000f.GAE@google.com/

syz repro: https://syzkaller.appspot.com/x/repro.syz?x=153ef887980000
disk image: https://storage.googleapis.com/syzbot-assets/cf2ad43c81cc/disk-15e7d45e.raw.xz

The steps I performed to reproduce locally are roughly as follows:

1. Copy the syz repro script to a file, repro.syz.txt
2. Download the disk image, disk.img
3. Build syzkaller (https://github.com/google/syzkaller)
4. Start up QEMU using disk.img:

   qemu-system-x86_64 -m 2G -smp 2,sockets=2,cores=1 \
     -drive file=./disk.raw,format=raw \
     -net nic,model=e1000 \
     -net user,host=10.0.2.10,hostfwd=tcp::10022-:22 \
     -enable-kvm -nographic

5. SCP syzkaller/bin/linux_amd64/syz-execprog and
   syzkaller/bin/linux_amd64/syz-executor to root@127.0.0.1:/root/
6. SCP repro.syz.txt to root@127.0.0.1:/root/
7. Run './syz-execprog -repeat=0 -procs=5 ./repro.syz.txt' over SSH on
   root@127.0.0.1

This typically crashes things within 20 seconds or so on my machine.

-Jordan
On Wed, Oct 23, 2024 at 8:21 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2024-10-23 11:14, Alexei Starovoitov wrote:
> > On Wed, Oct 23, 2024 at 7:56 AM Jordan Rife <jrife@google.com> wrote:
[...]
> > This patch is completely wrong.
>
> Actually I suspect Jordan's patch works.

We're not going to penalize all bpf progs for that. This patch is a
non-starter.

> > Looks like Mathieu's patch broke the bpf program contract somewhere.
>
> My patch series introduced this in the probe:
[...]
> The issue here is not about the context in which the bpf program runs;
> that's still preempt off. The problem is about expectations that a
> call_rcu grace period is enough to delay reclaim after unregistration
> of the tracepoint. Now that __DO_TRACE() uses rcu_read_lock_trace() to
> protect the RCU dereference, it's not sufficient anymore, at least for
> syscall tracepoints.

I don't see how preempt_disable vs preempt_disable_notrace and
preemption in general are relevant here. The prog dereference needs to
happen under rcu_read_lock. But since you've switched to
rcu_read_lock_trace, this part:

> >> bpf_link_free()
> >>   ops->release()
> >>     bpf_raw_tp_link_release()
> >>       bpf_probe_unregister()
> >>         tracepoint_probe_unregister()
> >>           tracepoint_remove_func()
> >>             release_probes()
> >>               call_rcu()   [1]

is probably incorrect. It should be call_rcu_tasks_trace?

And in addition, all tracepoints (both sleepable and not) should deref
__data under a normal rcu_read_lock() before passing that pointer into
__bpf_trace_##call, because that's how bpf link and prog are rcu
protected.

I don't have patches in front of me, so I'm guessing a bit. So pls
remind me where all these patches are? What tree/branch are we talking
about?