Yi and syzbot managed to hang the task within task_work_run().

The problem is the call chain

  task_work_run() -> __fput() -> perf_release() ->
  perf_event_release_kernel() -> _free_event() ->
  perf_pending_task_sync() -> task_work_cancel() failed ->
  rcuwait_wait_event().

Once task_work_run() is running, the list of callbacks is removed from
the task_struct and from this point on task_work_cancel() can't remove
any pending, not yet started work items.
The proposed solution is to remove one item at a time.
Oleg pointed out that this might be problematic if one closes 2.000.000
files at once. While testing this scenario by opening that many files
followed by exit() to ensure that all files are closed at once, I did
not observe anything outside of noise.
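For reference, a minimal user-space sketch of such a test (the file count,
the path and the rlimit handling here are illustrative assumptions, not the
exact test that was run):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
	/* Needs a high enough fd limit, typically run as root. */
	struct rlimit rl = { .rlim_cur = 2100000, .rlim_max = 2100000 };
	long i, nr_files = 2000000;

	if (setrlimit(RLIMIT_NOFILE, &rl))
		perror("setrlimit");

	for (i = 0; i < nr_files; i++) {
		if (open("/dev/null", O_RDONLY) < 0) {
			fprintf(stderr, "stopped after %ld fds\n", i);
			break;
		}
	}
	/*
	 * exit() closes all of them in one go; the deferred fput work is
	 * then consumed by task_work_run() on the way out.
	 */
	exit(0);
}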
Consume only one work item at a time. Forward the next work item under
the lock to ensure it is not canceled during the replacement.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Reported-by: "Yi Lai" <yi1.lai@linux.intel.com>
Closes: https://lore.kernel.org/all/Zx9Losv4YcJowaP%2F@ly-workstation/
Reported-by: syzbot+3c4321e10eea460eb606@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/673adf75.050a0220.87769.0024.GAE@google.com/
Fixes: c5d93d23a2601 ("perf: Enqueue SIGTRAP always via task_work.")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/task_work.c | 35 ++++++++++++++++++-----------------
1 file changed, 18 insertions(+), 17 deletions(-)
diff --git a/kernel/task_work.c b/kernel/task_work.c
index d1efec571a4a4..7d6c71e9a00a9 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -194,7 +194,7 @@ bool task_work_cancel(struct task_struct *task, struct callback_head *cb)
 void task_work_run(void)
 {
 	struct task_struct *task = current;
-	struct callback_head *work, *head, *next;
+	struct callback_head *work, *head;
 
 	for (;;) {
 		/*
@@ -202,17 +202,7 @@ void task_work_run(void)
 		 * work_exited unless the list is empty.
 		 */
 		work = READ_ONCE(task->task_works);
-		do {
-			head = NULL;
-			if (!work) {
-				if (task->flags & PF_EXITING)
-					head = &work_exited;
-				else
-					break;
-			}
-		} while (!try_cmpxchg(&task->task_works, &work, head));
-
-		if (!work)
+		if (!work && !(task->flags & PF_EXITING))
 			break;
 		/*
 		 * Synchronize with task_work_cancel_match(). It can not remove
@@ -220,13 +210,24 @@ void task_work_run(void)
 		 * But it can remove another entry from the ->next list.
 		 */
 		raw_spin_lock_irq(&task->pi_lock);
+		do {
+			head = NULL;
+			if (work) {
+				head = READ_ONCE(work->next);
+			} else {
+				if (task->flags & PF_EXITING)
+					head = &work_exited;
+				else
+					break;
+			}
+		} while (!try_cmpxchg(&task->task_works, &work, head));
 		raw_spin_unlock_irq(&task->pi_lock);
 
-		do {
-			next = work->next;
-			work->func(work);
-			work = next;
+		if (!work)
+			break;
+		work->func(work);
+
+		if (head)
 			cond_resched();
-		} while (work);
 	}
 }
--
2.47.2
On 02/21, Sebastian Andrzej Siewior wrote:
>
> Yi and syzbot managed to hang the task within task_run().
>
> The problem is
>   task_work_run() -> __fput() -> perf_release() ->
>   perf_event_release_kernel() -> _free_event() ->
>   perf_pending_task_sync() -> task_work_cancel() failed ->
>   rcuwait_wait_event().
>
> Once task_work_run() is running, the list of callbacks removed from the
> task_struct and from this point on task_work_cancel() can't remove any
> pending and not yet started work items.

But can this patch really solve the problem?

Suppose we have two tasks, T1 and T2.

T1 does fd = perf_event_open(pid => T2->pid);
T2 does fd = perf_event_open(pid => T1->pid);

Now, both T1 and T2 do close(fd), call task_work_run(), dequeue the
____fput work, and finally call __fput(). Now suppose that both perf
events fire before T1/T2 call perf_event_release_kernel/_free_event.

Now, T1 and T2 will hang forever in perf_pending_task_sync() waiting
for each other. task_work_cancel(current) can't succeed with or without
this patch.

No?

Oleg.
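For illustration only, a rough user-space sketch of that T1/T2 scenario.
This is not a reliable reproducer: the hang additionally needs both events
to fire and queue their perf task work before the close, and monitoring
another task needs the usual perf permissions; the attr values below are
assumptions.

#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static int open_sigtrap_event(pid_t pid)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_CPU_CLOCK;
	attr.sample_period = 1000;
	attr.sigtrap = 1;		/* SIGTRAP delivery goes through task_work */
	attr.remove_on_exec = 1;	/* required in combination with sigtrap */

	return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}

int main(void)
{
	pid_t child = fork();
	int fd;

	if (child == 0)
		fd = open_sigtrap_event(getppid());	/* T2 watches T1 */
	else
		fd = open_sigtrap_event(child);		/* T1 watches T2 */

	if (fd < 0)
		perror("perf_event_open");

	/* ... let both events fire and queue their perf task work ... */

	close(fd);	/* __fput() -> ... -> perf_pending_task_sync() may now
			 * end up waiting for the other task's pending work */
	return 0;
}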
Le Wed, Feb 26, 2025 at 01:50:48PM +0100, Oleg Nesterov a écrit :
> On 02/21, Sebastian Andrzej Siewior wrote:
> >
> > Yi and syzbot managed to hang the task within task_run().
> >
> > The problem is
> >   task_work_run() -> __fput() -> perf_release() ->
> >   perf_event_release_kernel() -> _free_event() ->
> >   perf_pending_task_sync() -> task_work_cancel() failed ->
> >   rcuwait_wait_event().
> >
> > Once task_work_run() is running, the list of callbacks removed from the
> > task_struct and from this point on task_work_cancel() can't remove any
> > pending and not yet started work items.
>
> But can this patch really solve the problem?
>
> Suppose we have two tasks, T1 and T2.
>
> T1 does fd = perf_event_open(pid => T2->pid);
> T2 does fd = perf_event_open(pid => T1->pid);
>
> Now, both T1 and T2 do close(fd), call task_work_run(), dequeue the
> ____fput work, and finally call __fput(). Now suppose that both perf
> events fire before T1/T2 call perf_event_release_kernel/_free_event.
>
> Now, T1 and T2 will hang forever in perf_pending_task_sync() waiting
> for each other. task_work_cancel(current) can't succeed with or without
> this patch.
>
> No?

Duh!

So indeed, the wait/wake based solution is too fragile. Are we back to
the old serialized workqueue days flavour of deadlocks with task work?

Anyway the perf_pending_task()'s put_event() based solution thing should
fix that scenario too.

Thanks.
Well... I won't really argue because I can't suggest a better fix at
least right now. Most probably never.
However, let me say that this patch doesn't make me happy ;) See below.
On 02/21, Sebastian Andrzej Siewior wrote:
>
> Oleg pointed out that this might be problematic if one closes 2.000.000
> files at once. While testing this scenario by opening that many files
> following by exit() to ensure that all files are closed at once, I did
> not observe anything outside of noise.
and this probably means that we can revert c82199061009 ("task_work: remove
fifo ordering guarantee") and restore the fifo ordering which IMO makes much
more sense.
But:
> Fixes: c5d93d23a2601 ("perf: Enqueue SIGTRAP always via task_work.")
Yes. So, to fix this specific problem in perf this patch changes task_work.c
And after this change we can never enforce a "clear" ordering, fifo or even lifo.
The ordering is simply "unpredictable/random".
I'll try to find and read the previous discussions tomorrow, but iirc Frederic
had another solution?
Oleg.
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -194,7 +194,7 @@ bool task_work_cancel(struct task_struct *task, struct callback_head *cb)
> void task_work_run(void)
> {
> struct task_struct *task = current;
> - struct callback_head *work, *head, *next;
> + struct callback_head *work, *head;
>
> for (;;) {
> /*
> @@ -202,17 +202,7 @@ void task_work_run(void)
> * work_exited unless the list is empty.
> */
> work = READ_ONCE(task->task_works);
> - do {
> - head = NULL;
> - if (!work) {
> - if (task->flags & PF_EXITING)
> - head = &work_exited;
> - else
> - break;
> - }
> - } while (!try_cmpxchg(&task->task_works, &work, head));
> -
> - if (!work)
> + if (!work && !(task->flags & PF_EXITING))
> break;
> /*
> * Synchronize with task_work_cancel_match(). It can not remove
> @@ -220,13 +210,24 @@ void task_work_run(void)
> * But it can remove another entry from the ->next list.
> */
> raw_spin_lock_irq(&task->pi_lock);
> + do {
> + head = NULL;
> + if (work) {
> + head = READ_ONCE(work->next);
> + } else {
> + if (task->flags & PF_EXITING)
> + head = &work_exited;
> + else
> + break;
> + }
> + } while (!try_cmpxchg(&task->task_works, &work, head));
> raw_spin_unlock_irq(&task->pi_lock);
>
> - do {
> - next = work->next;
> - work->func(work);
> - work = next;
> + if (!work)
> + break;
> + work->func(work);
> +
> + if (head)
> cond_resched();
> - } while (work);
> }
> }
> --
> 2.47.2
>
On 2025-02-23 23:40:15 [+0100], Oleg Nesterov wrote:
> Well... I won't really argue because I can't suggest a better fix at
> least right now. Most probably never.
>
> However, let me say that this patch doesn't make me happy ;) See below.
>
> On 02/21, Sebastian Andrzej Siewior wrote:
> >
> > Oleg pointed out that this might be problematic if one closes 2.000.000
> > files at once. While testing this scenario by opening that many files
> > following by exit() to ensure that all files are closed at once, I did
> > not observe anything outside of noise.
>
> and this probably means that we can revert c82199061009 ("task_work: remove
> fifo ordering guarantee") and restore the fifo ordering which IMO makes much
> more sense.
So the assumption is that turning the order around would fix the problem,
because the cancel callback would run first, followed by the clean up.
> But:
>
> > Fixes: c5d93d23a2601 ("perf: Enqueue SIGTRAP always via task_work.")
>
> Yes. So, to fix this specific problem in perf this patch changes task_work.c
>
> And after this change we can never enforce a "clear" ordering, fifo or even lifo.
> The ordering is simply "unpredictable/random".
>
> I'll try to find and read the previous discussions tomorrow, but iirc Frederic
> had another solution?
Two at least:
- having a pointer to the next item
- avoiding the wait in the task_work callback. I think this is the
unfortunate part. I think he had something but was very unhappy with
it.
> Oleg.
Sebastian
On 02/26, Sebastian Andrzej Siewior wrote:
>
> On 2025-02-23 23:40:15 [+0100], Oleg Nesterov wrote:
> > Well... I won't really argue because I can't suggest a better fix at
> > least right now. Most probably never.
> >
> > However, let me say that this patch doesn't make me happy ;) See below.
> >
> > On 02/21, Sebastian Andrzej Siewior wrote:
> > >
> > > Oleg pointed out that this might be problematic if one closes 2.000.000
> > > files at once. While testing this scenario by opening that many files
> > > following by exit() to ensure that all files are closed at once, I did
> > > not observe anything outside of noise.
> >
> > and this probably means that we can revert c82199061009 ("task_work: remove
> > fifo ordering guarantee") and restore the fifo ordering which IMO makes much
> > more sense.
>
> So assume that turning around will fix the problem because the cancel
> callback is run first followed by the clean up.
Not really, they can run in any order, so fifo can't really help.
But this doesn't matter, please see another email:
https://lore.kernel.org/all/20250226125048.GC8995@redhat.com/
Oleg.
On 2025-02-26 15:29:15 [+0100], Oleg Nesterov wrote:
>
> Not really, they can run in any order, so fifo can't really help.
>
> But this doesn't matter, please see another email:
> https://lore.kernel.org/all/20250226125048.GC8995@redhat.com/

Yes, I just got to it. So it looks like that wait part from task_work
should go… Let me check Frederic's patch in this thread…

> Oleg.

Sebastian
Le Sun, Feb 23, 2025 at 11:40:15PM +0100, Oleg Nesterov a écrit :
> Well... I won't really argue because I can't suggest a better fix at
> least right now. Most probably never.
>
> However, let me say that this patch doesn't make me happy ;) See below.
>
> On 02/21, Sebastian Andrzej Siewior wrote:
> >
> > Oleg pointed out that this might be problematic if one closes 2.000.000
> > files at once. While testing this scenario by opening that many files
> > following by exit() to ensure that all files are closed at once, I did
> > not observe anything outside of noise.
>
> and this probably means that we can revert c82199061009 ("task_work: remove
> fifo ordering guarantee") and restore the fifo ordering which IMO makes much
> more sense.
>
> But:
>
> > Fixes: c5d93d23a2601 ("perf: Enqueue SIGTRAP always via task_work.")
>
> Yes. So, to fix this specific problem in perf this patch changes task_work.c
>
> And after this change we can never enforce a "clear" ordering, fifo or even lifo.
> The ordering is simply "unpredictable/random".
>
> I'll try to find and read the previous discussions tomorrow, but iirc Frederic
> had another solution?
This:
https://lore.kernel.org/all/Zx-B0wK3xqRQsCOS@localhost.localdomain/
Though I'm not entirely happy with it either.
Thanks.
On 02/25, Frederic Weisbecker wrote:
>
> Le Sun, Feb 23, 2025 at 11:40:15PM +0100, Oleg Nesterov a écrit :
> >
> > I'll try to find and read the previous discussions tomorrow, but iirc Frederic
> > had another solution?
>
> This:
>
> https://lore.kernel.org/all/Zx-B0wK3xqRQsCOS@localhost.localdomain/
>
> Though I'm not entirely happy with it either.
Yes, thanks...
Can we do something else and avoid this rcuwait_wait_event() altogether?
To simplify the discussion, suppose we add a global XXX_LOCK. Just in
case, of course we shouldn't do this ;) But lets suppose we do.
Now, can _something_ like the (incomplete, ugly as hell) patch below work?
Oleg.
---
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5304,12 +5304,12 @@ static void perf_pending_task_sync(struct perf_event *event)
 		return;
 	}
 
-	/*
-	 * All accesses related to the event are within the same RCU section in
-	 * perf_pending_task(). The RCU grace period before the event is freed
-	 * will make sure all those accesses are complete by then.
-	 */
-	rcuwait_wait_event(&event->pending_work_wait, !event->pending_work, TASK_UNINTERRUPTIBLE);
+	spin_lock(XXX_LOCK);
+	if (event->pending_work) {
+		local_dec(&event->ctx->nr_no_switch_fast);
+		event->pending_work = -1;
+	}
+	spin_unlock(XXX_LOCK);
 }
 
 static void _free_event(struct perf_event *event)
@@ -5369,7 +5369,15 @@ static void _free_event(struct perf_event *event)
 	exclusive_event_destroy(event);
 	module_put(event->pmu->module);
 
-	call_rcu(&event->rcu_head, free_event_rcu);
+	bool free = true;
+	spin_lock(XXX_LOCK)
+	if (event->pending_work == -1) {
+		event->pending_work = -2;
+		free = false;
+	}
+	spin_unlock(XXX_LOCK);
+	if (free)
+		call_rcu(&event->rcu_head, free_event_rcu);
 }
 
 /*
@@ -6981,7 +6989,14 @@ static void perf_pending_task(struct callback_head *head)
 {
 	struct perf_event *event = container_of(head, struct perf_event, pending_task);
 	int rctx;
+	bool free = false;
 
+	spin_lock(XXX_LOCK);
+	if ((int)event->pending_work < 0) {
+		free = event->pending_work == -2u;
+		event->pending_work = 0;
+		goto unlock;
+	}
 	/*
 	 * All accesses to the event must belong to the same implicit RCU read-side
 	 * critical section as the ->pending_work reset. See comment in
@@ -7004,6 +7019,12 @@ static void perf_pending_task(struct callback_head *head)
 
 	if (rctx >= 0)
 		perf_swevent_put_recursion_context(rctx);
+
+unlock:
+	spin_unlock(XXX_LOCK);
+
+	if (free)
+		call_rcu(&event->rcu_head, free_event_rcu);
 }
 
 #ifdef CONFIG_GUEST_PERF_EVENTS
Le Tue, Feb 25, 2025 at 05:35:50PM +0100, Oleg Nesterov a écrit :
> On 02/25, Frederic Weisbecker wrote:
> >
> > Le Sun, Feb 23, 2025 at 11:40:15PM +0100, Oleg Nesterov a écrit :
> > >
> > > I'll try to find and read the previous discussions tomorrow, but iirc Frederic
> > > had another solution?
> >
> > This:
> >
> > https://lore.kernel.org/all/Zx-B0wK3xqRQsCOS@localhost.localdomain/
> >
> > Though I'm not entirely happy with it either.
>
> Yes, thanks...
>
> Can we do something else and avoid this rcuwait_wait_event() altogether?
>
> To simplify the discussion, suppose we add a global XXX_LOCK. Just in
> case, of course we shouldn't do this ;) But lets suppose we do.
>
> Now, can _something_ like the (incomplete, ugly as hell) patch below work?
>
> Oleg.
> ---
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5304,12 +5304,12 @@ static void perf_pending_task_sync(struct perf_event *event)
> return;
> }
>
> - /*
> - * All accesses related to the event are within the same RCU section in
> - * perf_pending_task(). The RCU grace period before the event is freed
> - * will make sure all those accesses are complete by then.
> - */
> - rcuwait_wait_event(&event->pending_work_wait, !event->pending_work, TASK_UNINTERRUPTIBLE);
> + spin_lock(XXX_LOCK);
> + if (event->pending_work) {
> + local_dec(&event->ctx->nr_no_switch_fast);
> + event->pending_work = -1;
> + }
> + spin_unlock(XXX_LOCK);
> }
>
> static void _free_event(struct perf_event *event)
> @@ -5369,7 +5369,15 @@ static void _free_event(struct perf_event *event)
> exclusive_event_destroy(event);
> module_put(event->pmu->module);
>
> - call_rcu(&event->rcu_head, free_event_rcu);
> + bool free = true;
> + spin_lock(XXX_LOCK)
> + if (event->pending_work == -1) {
> + event->pending_work = -2;
> + free = false;
> + }
> + spin_unlock(XXX_LOCK);
> + if (free)
> + call_rcu(&event->rcu_head, free_event_rcu);
> }
>
> /*
> @@ -6981,7 +6989,14 @@ static void perf_pending_task(struct callback_head *head)
> {
> struct perf_event *event = container_of(head, struct perf_event, pending_task);
> int rctx;
> + bool free = false;
>
> + spin_lock(XXX_LOCK);
> + if ((int)event->pending_work < 0) {
> + free = event->pending_work == -2u;
> + event->pending_work = 0;
> + goto unlock;
> + }
> /*
> * All accesses to the event must belong to the same implicit RCU read-side
> * critical section as the ->pending_work reset. See comment in
> @@ -7004,6 +7019,12 @@ static void perf_pending_task(struct callback_head *head)
>
> if (rctx >= 0)
> perf_swevent_put_recursion_context(rctx);
> +
> +unlock:
> + spin_unlock(XXX_LOCK);
> +
> + if (free)
> + call_rcu(&event->rcu_head, free_event_rcu);
> }
>
> #ifdef CONFIG_GUEST_PERF_EVENTS
>
Heh, I suggested something similar also:
https://lore.kernel.org/lkml/ZyJUzhzHGDu5CLdi@localhost.localdomain/
The main point being that the event has its refcount incremented when the
task work is queued and put from the task work handler. This restores the
initial behaviour but without the bug where the event was freed before the
task work was reached.
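The shape of that scheme, in a generic and purely illustrative user-space
form (the function names mirror the patch below, but the code itself is just
a sketch of the idea: take a reference when the deferred work is queued,
drop it in the handler):

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct event {
	atomic_long refcount;
	/* ... payload ... */
};

static void put_event(struct event *e)
{
	/* Free on the last reference, wherever that drop happens. */
	if (atomic_fetch_sub(&e->refcount, 1) == 1) {
		printf("freeing event\n");
		free(e);
	}
}

/* Queue side: pin the object for as long as the deferred work is pending. */
static void queue_pending_work(struct event *e)
{
	atomic_fetch_add(&e->refcount, 1);
	/* ... a task_work_add()-like enqueue would go here ... */
}

/* Handler side: do the work, then drop the queue-time reference. */
static void pending_work_handler(struct event *e)
{
	/* ... deliver the signal, etc. ... */
	put_event(e);
}

int main(void)
{
	struct event *e = malloc(sizeof(*e));

	atomic_init(&e->refcount, 1);	/* creation reference */
	queue_pending_work(e);
	put_event(e);			/* "close()": object survives, work still queued */
	pending_work_handler(e);	/* deferred work runs later and frees it */
	return 0;
}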
My remaining worry was that exec'ing tasks with suid might get an irrelevant
signal from the unprivileged past but surely we can work around that. An
untested, complete version would look like:
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8333f132f4a9..854486d78e1e 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -831,7 +831,6 @@ struct perf_event {
 	struct irq_work			pending_disable_irq;
 	struct callback_head		pending_task;
 	unsigned int			pending_work;
-	struct rcuwait			pending_work_wait;
 
 	atomic_t			event_limit;
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index bcb09e011e9e..2cf7fa688dd9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5288,35 +5288,10 @@ static bool exclusive_event_installable(struct perf_event *event,
 static void perf_addr_filters_splice(struct perf_event *event,
 				     struct list_head *head);
 
-static void perf_pending_task_sync(struct perf_event *event)
-{
-	struct callback_head *head = &event->pending_task;
-
-	if (!event->pending_work)
-		return;
-	/*
-	 * If the task is queued to the current task's queue, we
-	 * obviously can't wait for it to complete. Simply cancel it.
-	 */
-	if (task_work_cancel(current, head)) {
-		event->pending_work = 0;
-		local_dec(&event->ctx->nr_no_switch_fast);
-		return;
-	}
-
-	/*
-	 * All accesses related to the event are within the same RCU section in
-	 * perf_pending_task(). The RCU grace period before the event is freed
-	 * will make sure all those accesses are complete by then.
-	 */
-	rcuwait_wait_event(&event->pending_work_wait, !event->pending_work, TASK_UNINTERRUPTIBLE);
-}
-
 static void _free_event(struct perf_event *event)
 {
 	irq_work_sync(&event->pending_irq);
 	irq_work_sync(&event->pending_disable_irq);
-	perf_pending_task_sync(event);
 
 	unaccount_event(event);
 
@@ -5441,10 +5416,17 @@ static void perf_remove_from_owner(struct perf_event *event)
 
 static void put_event(struct perf_event *event)
 {
+	struct perf_event *parent;
+
 	if (!atomic_long_dec_and_test(&event->refcount))
 		return;
 
+	parent = event->parent;
 	_free_event(event);
+
+	/* Matches the refcount bump in inherit_event() */
+	if (parent)
+		put_event(parent);
 }
 
 /*
@@ -5528,11 +5510,6 @@ int perf_event_release_kernel(struct perf_event *event)
 		if (tmp == child) {
 			perf_remove_from_context(child, DETACH_GROUP);
 			list_move(&child->child_list, &free_list);
-			/*
-			 * This matches the refcount bump in inherit_event();
-			 * this can't be the last reference.
-			 */
-			put_event(event);
 		} else {
 			var = &ctx->refcount;
 		}
@@ -5558,7 +5535,7 @@ int perf_event_release_kernel(struct perf_event *event)
 		void *var = &child->ctx->refcount;
 
 		list_del(&child->child_list);
-		free_event(child);
+		put_event(child);
 
 		/*
 		 * Wake any perf_event_free_task() waiting for this event to be
@@ -6982,12 +6959,6 @@ static void perf_pending_task(struct callback_head *head)
 	struct perf_event *event = container_of(head, struct perf_event, pending_task);
 	int rctx;
 
-	/*
-	 * All accesses to the event must belong to the same implicit RCU read-side
-	 * critical section as the ->pending_work reset. See comment in
-	 * perf_pending_task_sync().
-	 */
-	rcu_read_lock();
 	/*
 	 * If we 'fail' here, that's OK, it means recursion is already disabled
 	 * and we won't recurse 'further'.
@@ -6998,9 +6969,8 @@ static void perf_pending_task(struct callback_head *head)
 		event->pending_work = 0;
 		perf_sigtrap(event);
 		local_dec(&event->ctx->nr_no_switch_fast);
-		rcuwait_wake_up(&event->pending_work_wait);
 	}
-	rcu_read_unlock();
+	put_event(event);
 
 	if (rctx >= 0)
 		perf_swevent_put_recursion_context(rctx);
@@ -9945,6 +9915,7 @@ static int __perf_event_overflow(struct perf_event *event,
 		    !task_work_add(current, &event->pending_task, notify_mode)) {
 			event->pending_work = pending_id;
 			local_inc(&event->ctx->nr_no_switch_fast);
+			WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
 
 			event->pending_addr = 0;
 			if (valid_sample && (data->sample_flags & PERF_SAMPLE_ADDR))
@@ -12266,7 +12237,6 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	init_irq_work(&event->pending_irq, perf_pending_irq);
 	event->pending_disable_irq = IRQ_WORK_INIT_HARD(perf_pending_disable);
 	init_task_work(&event->pending_task, perf_pending_task);
-	rcuwait_init(&event->pending_work_wait);
 
 	mutex_init(&event->mmap_mutex);
 	raw_spin_lock_init(&event->addr_filters.lock);
@@ -13431,8 +13401,7 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
 		 * Kick perf_poll() for is_event_hup();
 		 */
 		perf_event_wakeup(parent_event);
-		free_event(event);
-		put_event(parent_event);
+		put_event(event);
 		return;
 	}
 
@@ -13550,13 +13519,11 @@ static void perf_free_event(struct perf_event *event,
 	list_del_init(&event->child_list);
 	mutex_unlock(&parent->child_mutex);
 
-	put_event(parent);
-
 	raw_spin_lock_irq(&ctx->lock);
 	perf_group_detach(event);
 	list_del_event(event, ctx);
 	raw_spin_unlock_irq(&ctx->lock);
-	free_event(event);
+	put_event(event);
 }
 
 /*
On 02/26, Oleg Nesterov wrote:
>
Hmm. empty email? Let me resend.
On 02/25, Frederic Weisbecker wrote:
>
> Le Tue, Feb 25, 2025 at 05:35:50PM +0100, Oleg Nesterov a écrit :
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -5304,12 +5304,12 @@ static void perf_pending_task_sync(struct perf_event *event)
> > return;
> > }
> >
> > - /*
> > - * All accesses related to the event are within the same RCU section in
> > - * perf_pending_task(). The RCU grace period before the event is freed
> > - * will make sure all those accesses are complete by then.
> > - */
> > - rcuwait_wait_event(&event->pending_work_wait, !event->pending_work, TASK_UNINTERRUPTIBLE);
> > + spin_lock(XXX_LOCK);
> > + if (event->pending_work) {
> > + local_dec(&event->ctx->nr_no_switch_fast);
> > + event->pending_work = -1;
> > + }
> > + spin_unlock(XXX_LOCK);
> > }
> >
> > static void _free_event(struct perf_event *event)
> > @@ -5369,7 +5369,15 @@ static void _free_event(struct perf_event *event)
> > exclusive_event_destroy(event);
> > module_put(event->pmu->module);
> >
> > - call_rcu(&event->rcu_head, free_event_rcu);
> > + bool free = true;
> > + spin_lock(XXX_LOCK)
> > + if (event->pending_work == -1) {
> > + event->pending_work = -2;
> > + free = false;
> > + }
> > + spin_unlock(XXX_LOCK);
> > + if (free)
> > + call_rcu(&event->rcu_head, free_event_rcu);
> > }
> >
> > /*
> > @@ -6981,7 +6989,14 @@ static void perf_pending_task(struct callback_head *head)
> > {
> > struct perf_event *event = container_of(head, struct perf_event, pending_task);
> > int rctx;
> > + bool free = false;
> >
> > + spin_lock(XXX_LOCK);
> > + if ((int)event->pending_work < 0) {
> > + free = event->pending_work == -2u;
> > + event->pending_work = 0;
> > + goto unlock;
> > + }
> > /*
> > * All accesses to the event must belong to the same implicit RCU read-side
> > * critical section as the ->pending_work reset. See comment in
> > @@ -7004,6 +7019,12 @@ static void perf_pending_task(struct callback_head *head)
> >
> > if (rctx >= 0)
> > perf_swevent_put_recursion_context(rctx);
> > +
> > +unlock:
> > + spin_unlock(XXX_LOCK);
> > +
> > + if (free)
> > + call_rcu(&event->rcu_head, free_event_rcu);
> > }
> >
> > #ifdef CONFIG_GUEST_PERF_EVENTS
> >
>
> Heh, I suggested something similar also:
> https://lore.kernel.org/lkml/ZyJUzhzHGDu5CLdi@localhost.localdomain/
;)
I can't comment your patch because I don't understand this code enough.
My patch is more simple, it doesn't play with refcount.
perf_pending_task_sync() sets ->pending_work = -1, after that
perf_pending_task() (which can run in parallel on another CPU) will
only clear ->pending_work and do nothing else.
Then _free_event() rechecks ->pending_work before returning; if it is still
nonzero, then perf_pending_task() is still pending. In this case _free_event()
sets ->pending_work = -2 to offload call_rcu(free_event_rcu) to the pending
perf_pending_task().
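A compressed view of that hand-off, as a sketch only (the enum and its names
are invented here for illustration; the patch itself just reuses the unsigned
->pending_work field with -1/-2 sentinels):

/*
 * Not the kernel's code: ->pending_work acting as a tiny state machine
 * shared by perf_pending_task_sync(), _free_event() and perf_pending_task().
 */
enum pending_work_state {
	PENDING_WORK_IDLE     =  0,	/* no task work queued                          */
	PENDING_WORK_QUEUED   =  1,	/* perf_pending_task() still due, event live    */
	PENDING_WORK_ORPHANED = -1,	/* sync path gave up; callback must only clear  */
	PENDING_WORK_FREE_ME  = -2,	/* _free_event() is done; callback also frees   */
};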
But it is certainly more ugly, and perhaps the very idea is wrong. So I will
be happy if we go with your patch.
Either way, IMO we should try to kill this rcuwait_wait_event() logic. See
another email I sent a minute ago in this thread. Quite possibly I missed
something, but the very idea to wait for another task doesn't look safe
to me.
Thanks!
Oleg.
Le Wed, Feb 26, 2025 at 03:01:15PM +0100, Oleg Nesterov a écrit :
> On 02/25, Frederic Weisbecker wrote:
> >
> > Le Tue, Feb 25, 2025 at 05:35:50PM +0100, Oleg Nesterov a écrit :
> > > --- a/kernel/events/core.c
> > > +++ b/kernel/events/core.c
> > > @@ -5304,12 +5304,12 @@ static void perf_pending_task_sync(struct perf_event *event)
> > > return;
> > > }
> > >
> > > - /*
> > > - * All accesses related to the event are within the same RCU section in
> > > - * perf_pending_task(). The RCU grace period before the event is freed
> > > - * will make sure all those accesses are complete by then.
> > > - */
> > > - rcuwait_wait_event(&event->pending_work_wait, !event->pending_work, TASK_UNINTERRUPTIBLE);
> > > + spin_lock(XXX_LOCK);
> > > + if (event->pending_work) {
> > > + local_dec(&event->ctx->nr_no_switch_fast);
> > > + event->pending_work = -1;
> > > + }
> > > + spin_unlock(XXX_LOCK);
> > > }
> > >
> > > static void _free_event(struct perf_event *event)
> > > @@ -5369,7 +5369,15 @@ static void _free_event(struct perf_event *event)
> > > exclusive_event_destroy(event);
> > > module_put(event->pmu->module);
> > >
> > > - call_rcu(&event->rcu_head, free_event_rcu);
> > > + bool free = true;
> > > + spin_lock(XXX_LOCK)
> > > + if (event->pending_work == -1) {
> > > + event->pending_work = -2;
> > > + free = false;
> > > + }
> > > + spin_unlock(XXX_LOCK);
> > > + if (free)
> > > + call_rcu(&event->rcu_head, free_event_rcu);
> > > }
> > >
> > > /*
> > > @@ -6981,7 +6989,14 @@ static void perf_pending_task(struct callback_head *head)
> > > {
> > > struct perf_event *event = container_of(head, struct perf_event, pending_task);
> > > int rctx;
> > > + bool free = false;
> > >
> > > + spin_lock(XXX_LOCK);
> > > + if ((int)event->pending_work < 0) {
> > > + free = event->pending_work == -2u;
> > > + event->pending_work = 0;
> > > + goto unlock;
> > > + }
> > > /*
> > > * All accesses to the event must belong to the same implicit RCU read-side
> > > * critical section as the ->pending_work reset. See comment in
> > > @@ -7004,6 +7019,12 @@ static void perf_pending_task(struct callback_head *head)
> > >
> > > if (rctx >= 0)
> > > perf_swevent_put_recursion_context(rctx);
> > > +
> > > +unlock:
> > > + spin_unlock(XXX_LOCK);
> > > +
> > > + if (free)
> > > + call_rcu(&event->rcu_head, free_event_rcu);
> > > }
> > >
> > > #ifdef CONFIG_GUEST_PERF_EVENTS
> > >
> >
> > Heh, I suggested something similar also:
> > https://lore.kernel.org/lkml/ZyJUzhzHGDu5CLdi@localhost.localdomain/
>
> ;)
>
> I can't comment your patch because I don't understand this code enough.
>
> My patch is more simple, it doesn't play with refcount.
>
> perf_pending_task_sync() sets ->pending_work = -1, after that
> perf_pending_task() (which can run in parallel on another CPU) will
> only clear ->pending_work and do nothing else.
>
> Then _free_event() rechecks ->pending_work before return, if it is still
> nonzero then perf_pending_task() is still pending. In this case _free_event()
> sets ->pending_work = -2 to offload call_rcu(free_event_rcu) to the pending
> perf_pending_task().
Right it works but it does a parallel implementation of events refcounting.
>
> But it is certainly more ugly, and perhaps the very idea is wrong. So I will
> be happy if we go with your patch.
Ok I'll prepare a changelog and see where it goes.
> Either way, IMO we should try to kill this rcuwait_wait_event() logic. See
> another email I sent a minute ago in this thread. Quite possibly I missed
> something, but the very idea to wait for another task doesn't look safe
> to me.
You're right it's very fragile...
Thanks.
>
> Thanks!
>
> Oleg.
>
On 02/26, Frederic Weisbecker wrote:
>
> Right it works but it does a parallel implementation of events refcounting.

Yes, yes, and this is the main reason why I think it is ugly.

> > But it is certainly more ugly, and perhaps the very idea is wrong. So I will
> > be happy if we go with your patch.
>
> Ok I'll prepare a changelog and see where it goes.

Great! thanks.

Oleg.