[PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.

Posted by Sebastian Andrzej Siewior 1 year, 10 months ago
A signal is delivered by raising irq_work(), which works from any context
including NMI. The irq_work can be delayed if the architecture does not
provide an interrupt vector. In order not to lose a signal, the signal
is injected via task_work during event_sched_out().

Instead of going via irq_work, the signal can be added directly via
task_work. The signal is sent to current and can be enqueued on its
return path to userland instead of triggering irq_work. A dummy IRQ is
required in the NMI case to ensure the task_work is handled before
returning to userland; irq_work is used for this. An alternative would
be to just raise an interrupt like arch_send_call_function_single_ipi().

During testing with `remove_on_exec' it became visible that the event
can be enqueued via NMI during execve(). The task_work must not be kept
around because free_event() will complain later. Also the new task will
not have a signal handler installed.

Queue the signal via task_work. Remove perf_event::pending_sigtrap and
use perf_event::pending_work instead. Raise irq_work in the NMI case for
a dummy interrupt. Remove the task_work if the event is freed.

Tested-by: Marco Elver <elver@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/perf_event.h |  3 +-
 kernel/events/core.c       | 58 ++++++++++++++++++++++----------------
 2 files changed, 34 insertions(+), 27 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d2a15c0c6f8a9..24ac6765146c7 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -781,7 +781,6 @@ struct perf_event {
 	unsigned int			pending_wakeup;
 	unsigned int			pending_kill;
 	unsigned int			pending_disable;
-	unsigned int			pending_sigtrap;
 	unsigned long			pending_addr;	/* SIGTRAP */
 	struct irq_work			pending_irq;
 	struct callback_head		pending_task;
@@ -959,7 +958,7 @@ struct perf_event_context {
 	struct rcu_head			rcu_head;
 
 	/*
-	 * Sum (event->pending_sigtrap + event->pending_work)
+	 * Sum (event->pending_work)
 	 *
 	 * The SIGTRAP is targeted at ctx->task, as such it won't do changing
 	 * that until the signal is delivered.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c7a0274c662c8..e0b2da8de485f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2283,21 +2283,6 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
 		state = PERF_EVENT_STATE_OFF;
 	}
 
-	if (event->pending_sigtrap) {
-		bool dec = true;
-
-		event->pending_sigtrap = 0;
-		if (state != PERF_EVENT_STATE_OFF &&
-		    !event->pending_work) {
-			event->pending_work = 1;
-			dec = false;
-			WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
-			task_work_add(current, &event->pending_task, TWA_RESUME);
-		}
-		if (dec)
-			local_dec(&event->ctx->nr_pending);
-	}
-
 	perf_event_set_state(event, state);
 
 	if (!is_software_event(event))
@@ -6741,11 +6726,6 @@ static void __perf_pending_irq(struct perf_event *event)
 	 * Yay, we hit home and are in the context of the event.
 	 */
 	if (cpu == smp_processor_id()) {
-		if (event->pending_sigtrap) {
-			event->pending_sigtrap = 0;
-			perf_sigtrap(event);
-			local_dec(&event->ctx->nr_pending);
-		}
 		if (event->pending_disable) {
 			event->pending_disable = 0;
 			perf_event_disable_local(event);
@@ -9592,14 +9572,23 @@ static int __perf_event_overflow(struct perf_event *event,
 
 		if (regs)
 			pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
-		if (!event->pending_sigtrap) {
-			event->pending_sigtrap = pending_id;
+		if (!event->pending_work) {
+			event->pending_work = pending_id;
 			local_inc(&event->ctx->nr_pending);
-			irq_work_queue(&event->pending_irq);
+			WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
+			task_work_add(current, &event->pending_task, TWA_RESUME);
+			/*
+			 * The NMI path returns directly to userland. The
+			 * irq_work is raised as a dummy interrupt to ensure
+			 * regular return path to user is taken and task_work
+			 * is processed.
+			 */
+			if (in_nmi())
+				irq_work_queue(&event->pending_irq);
 		} else if (event->attr.exclude_kernel && valid_sample) {
 			/*
 			 * Should not be able to return to user space without
-			 * consuming pending_sigtrap; with exceptions:
+			 * consuming pending_work; with exceptions:
 			 *
 			 *  1. Where !exclude_kernel, events can overflow again
 			 *     in the kernel without returning to user space.
@@ -9609,7 +9598,7 @@ static int __perf_event_overflow(struct perf_event *event,
 			 *     To approximate progress (with false negatives),
 			 *     check 32-bit hash of the current IP.
 			 */
-			WARN_ON_ONCE(event->pending_sigtrap != pending_id);
+			WARN_ON_ONCE(event->pending_work != pending_id);
 		}
 
 		event->pending_addr = 0;
@@ -13049,6 +13038,13 @@ static void sync_child_event(struct perf_event *child_event)
 		     &parent_event->child_total_time_running);
 }
 
+static bool task_work_cb_match(struct callback_head *cb, void *data)
+{
+	struct perf_event *event = container_of(cb, struct perf_event, pending_task);
+
+	return event == data;
+}
+
 static void
 perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
 {
@@ -13088,6 +13084,18 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
 		 * Kick perf_poll() for is_event_hup();
 		 */
 		perf_event_wakeup(parent_event);
+		/*
+		 * Cancel pending task_work and update counters if it has not
+		 * yet been delivered to userland. free_event() expects the
+		 * reference counter at one and keeping the event around until
+		 * the task returns to userland can be unexpected if there is
+		 * no signal handler registered.
+		 */
+		if (event->pending_work &&
+		    task_work_cancel_match(current, task_work_cb_match, event)) {
+			put_event(event);
+			local_dec(&event->ctx->nr_pending);
+		}
 		free_event(event);
 		put_event(parent_event);
 		return;
-- 
2.43.0
Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
Posted by Frederic Weisbecker 1 year, 10 months ago
On Fri, Mar 22, 2024 at 07:48:22AM +0100, Sebastian Andrzej Siewior wrote:
> A signal is delivered by raising irq_work() which works from any context
> including NMI. irq_work() can be delayed if the architecture does not
> provide an interrupt vector. In order not to lose a signal, the signal
> is injected via task_work during event_sched_out().
> 
> Instead going via irq_work, the signal could be added directly via
> task_work. The signal is sent to current and can be enqueued on its
> return path to userland instead of triggering irq_work. A dummy IRQ is
> required in the NMI case to ensure the task_work is handled before
> returning to user land. For this irq_work is used. An alternative would
> be just raising an interrupt like arch_send_call_function_single_ipi().
> 
> During testing with `remove_on_exec' it become visible that the event
> can be enqueued via NMI during execve(). The task_work must not be kept
> because free_event() will complain later. Also the new task will not
> have a sighandler installed.
> 
> Queue signal via task_work. Remove perf_event::pending_sigtrap and
> and use perf_event::pending_work instead. Raise irq_work in the NMI case
> for a dummy interrupt. Remove the task_work if the event is freed.
> 
> Tested-by: Marco Elver <elver@google.com>
> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

It clashes a bit with a series I have posted. Let's see the details:

> ---
>  include/linux/perf_event.h |  3 +-
>  kernel/events/core.c       | 58 ++++++++++++++++++++++----------------
>  2 files changed, 34 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index d2a15c0c6f8a9..24ac6765146c7 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -781,7 +781,6 @@ struct perf_event {
>  	unsigned int			pending_wakeup;
>  	unsigned int			pending_kill;
>  	unsigned int			pending_disable;
> -	unsigned int			pending_sigtrap;
>  	unsigned long			pending_addr;	/* SIGTRAP */
>  	struct irq_work			pending_irq;
>  	struct callback_head		pending_task;
> @@ -959,7 +958,7 @@ struct perf_event_context {
>  	struct rcu_head			rcu_head;
>  
>  	/*
> -	 * Sum (event->pending_sigtrap + event->pending_work)
> +	 * Sum (event->pending_work + event->pending_work)
>  	 *
>  	 * The SIGTRAP is targeted at ctx->task, as such it won't do changing
>  	 * that until the signal is delivered.
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index c7a0274c662c8..e0b2da8de485f 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2283,21 +2283,6 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
>  		state = PERF_EVENT_STATE_OFF;
>  	}
>  
> -	if (event->pending_sigtrap) {
> -		bool dec = true;
> -
> -		event->pending_sigtrap = 0;
> -		if (state != PERF_EVENT_STATE_OFF &&
> -		    !event->pending_work) {
> -			event->pending_work = 1;
> -			dec = false;
> -			WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> -			task_work_add(current, &event->pending_task, TWA_RESUME);
> -		}
> -		if (dec)
> -			local_dec(&event->ctx->nr_pending);
> -	}
> -
>  	perf_event_set_state(event, state);
>  
>  	if (!is_software_event(event))
> @@ -6741,11 +6726,6 @@ static void __perf_pending_irq(struct perf_event *event)
>  	 * Yay, we hit home and are in the context of the event.
>  	 */
>  	if (cpu == smp_processor_id()) {
> -		if (event->pending_sigtrap) {
> -			event->pending_sigtrap = 0;
> -			perf_sigtrap(event);
> -			local_dec(&event->ctx->nr_pending);
> -		}
>  		if (event->pending_disable) {
>  			event->pending_disable = 0;
>  			perf_event_disable_local(event);
> @@ -9592,14 +9572,23 @@ static int __perf_event_overflow(struct perf_event *event,
>  
>  		if (regs)
>  			pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
> -		if (!event->pending_sigtrap) {
> -			event->pending_sigtrap = pending_id;
> +		if (!event->pending_work) {
> +			event->pending_work = pending_id;
>  			local_inc(&event->ctx->nr_pending);
> -			irq_work_queue(&event->pending_irq);
> +			WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> +			task_work_add(current, &event->pending_task, TWA_RESUME);

If the overflow happens between exit_task_work() and perf_event_exit_task(),
you're leaking the event. (This was there before this patch).
See:
	https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m5e6c8ebbef04ab9a1d7f05340cd3e2716a9a8c39

> +			/*
> +			 * The NMI path returns directly to userland. The
> +			 * irq_work is raised as a dummy interrupt to ensure
> +			 * regular return path to user is taken and task_work
> +			 * is processed.
> +			 */
> +			if (in_nmi())
> +				irq_work_queue(&event->pending_irq);
>  		} else if (event->attr.exclude_kernel && valid_sample) {
>  			/*
>  			 * Should not be able to return to user space without
> -			 * consuming pending_sigtrap; with exceptions:
> +			 * consuming pending_work; with exceptions:
>  			 *
>  			 *  1. Where !exclude_kernel, events can overflow again
>  			 *     in the kernel without returning to user space.
> @@ -9609,7 +9598,7 @@ static int __perf_event_overflow(struct perf_event *event,
>  			 *     To approximate progress (with false negatives),
>  			 *     check 32-bit hash of the current IP.
>  			 */
> -			WARN_ON_ONCE(event->pending_sigtrap != pending_id);
> +			WARN_ON_ONCE(event->pending_work != pending_id);
>  		}
>  
>  		event->pending_addr = 0;
> @@ -13049,6 +13038,13 @@ static void sync_child_event(struct perf_event *child_event)
>  		     &parent_event->child_total_time_running);
>  }
>  
> +static bool task_work_cb_match(struct callback_head *cb, void *data)
> +{
> +	struct perf_event *event = container_of(cb, struct perf_event, pending_task);
> +
> +	return event == data;
> +}

I suggest we introduce a proper API to cancel an actual callback head, see:

https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#mbfac417463018394f9d80c68c7f2cafe9d066a4b
https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m0a347249a462523358724085f2489ce9ed91e640
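
Something along these lines, as an untested sketch layered on the current
task_work_cancel_match() (the helper name is made up here, the linked
patches have the real interface):

static bool task_work_cb_head_match(struct callback_head *cb, void *data)
{
	return cb == data;
}

struct callback_head *task_work_cancel_head(struct task_struct *task,
					    struct callback_head *head)
{
	/* Cancel the exact callback_head that was queued earlier. */
	return task_work_cancel_match(task, task_work_cb_head_match, head);
}

perf could then cancel exactly &event->pending_task without carrying its
own task_work_cb_match() helper.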

> +
>  static void
>  perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
>  {
> @@ -13088,6 +13084,18 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
>  		 * Kick perf_poll() for is_event_hup();
>  		 */
>  		perf_event_wakeup(parent_event);
> +		/*
> +		 * Cancel pending task_work and update counters if it has not
> +		 * yet been delivered to userland. free_event() expects the
> +		 * reference counter at one and keeping the event around until
> +		 * the task returns to userland can be a unexpected if there is
> +		 * no signal handler registered.
> +		 */
> +		if (event->pending_work &&
> +		    task_work_cancel_match(current, task_work_cb_match, event)) {
> +			put_event(event);
> +			local_dec(&event->ctx->nr_pending);
> +		}

So exiting task, privileged exec and also exit on exec call into this before
releasing the children.

And parents rely on put_event() from file close + the task work.

But what about remote release of children on file close?
See perf_event_release_kernel() directly calling free_event() on them.

One possible fix is to avoid the reference count game around task work
and flush them on free_event().

See here:

https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50
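
The rough shape of that, as an untested sketch (the pending_work_wq field
and the helper name are made up here, the link above has the real thing):

static void perf_pending_task_sync(struct perf_event *event)
{
	if (!event->pending_work)
		return;

	/*
	 * Still queued on the current task: it cannot run anymore once
	 * we got here, cancelling it is enough.
	 */
	if (task_work_cancel_match(current, task_work_cb_match, event)) {
		event->pending_work = 0;
		local_dec(&event->ctx->nr_pending);
		return;
	}

	/*
	 * Queued on, or already running from, another task: wait for
	 * the task work to finish instead of holding a reference on
	 * the event for it.
	 */
	wait_event(event->pending_work_wq, !event->pending_work);
}

free_event() would call this early, and perf_pending_task() would clear
event->pending_work and do wake_up(&event->pending_work_wq) once the
SIGTRAP has been delivered.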

Thanks.

>  		free_event(event);
>  		put_event(parent_event);
>  		return;
> -- 
> 2.43.0
> 
> 
Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
Posted by Sebastian Andrzej Siewior 1 year, 10 months ago
On 2024-04-08 23:29:03 [+0200], Frederic Weisbecker wrote:
> > index c7a0274c662c8..e0b2da8de485f 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -2283,21 +2283,6 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
> >  		state = PERF_EVENT_STATE_OFF;
> >  	}
> >  
> > -	if (event->pending_sigtrap) {
> > -		bool dec = true;
> > -
> > -		event->pending_sigtrap = 0;
> > -		if (state != PERF_EVENT_STATE_OFF &&
> > -		    !event->pending_work) {
> > -			event->pending_work = 1;
> > -			dec = false;
> > -			WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> > -			task_work_add(current, &event->pending_task, TWA_RESUME);
> > -		}
> > -		if (dec)
> > -			local_dec(&event->ctx->nr_pending);
> > -	}
> > -
> >  	perf_event_set_state(event, state);
> >  
> >  	if (!is_software_event(event))
> > @@ -6741,11 +6726,6 @@ static void __perf_pending_irq(struct perf_event *event)
> >  	 * Yay, we hit home and are in the context of the event.
> >  	 */
> >  	if (cpu == smp_processor_id()) {
> > -		if (event->pending_sigtrap) {
> > -			event->pending_sigtrap = 0;
> > -			perf_sigtrap(event);
> > -			local_dec(&event->ctx->nr_pending);
> > -		}
> >  		if (event->pending_disable) {
> >  			event->pending_disable = 0;
> >  			perf_event_disable_local(event);
> > @@ -9592,14 +9572,23 @@ static int __perf_event_overflow(struct perf_event *event,
> >  
> >  		if (regs)
> >  			pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
> > -		if (!event->pending_sigtrap) {
> > -			event->pending_sigtrap = pending_id;
> > +		if (!event->pending_work) {
> > +			event->pending_work = pending_id;
> >  			local_inc(&event->ctx->nr_pending);
> > -			irq_work_queue(&event->pending_irq);
> > +			WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> > +			task_work_add(current, &event->pending_task, TWA_RESUME);
> 
> If the overflow happens between exit_task_work() and perf_event_exit_task(),
> you're leaking the event. (This was there before this patch).
> See:
> 	https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m5e6c8ebbef04ab9a1d7f05340cd3e2716a9a8c39

Okay.

> > +			/*
> > +			 * The NMI path returns directly to userland. The
> > +			 * irq_work is raised as a dummy interrupt to ensure
> > +			 * regular return path to user is taken and task_work
> > +			 * is processed.
> > +			 */
> > +			if (in_nmi())
> > +				irq_work_queue(&event->pending_irq);
> >  		} else if (event->attr.exclude_kernel && valid_sample) {
> >  			/*
> >  			 * Should not be able to return to user space without
> > -			 * consuming pending_sigtrap; with exceptions:
> > +			 * consuming pending_work; with exceptions:
> >  			 *
> >  			 *  1. Where !exclude_kernel, events can overflow again
> >  			 *     in the kernel without returning to user space.
> > @@ -9609,7 +9598,7 @@ static int __perf_event_overflow(struct perf_event *event,
> >  			 *     To approximate progress (with false negatives),
> >  			 *     check 32-bit hash of the current IP.
> >  			 */
> > -			WARN_ON_ONCE(event->pending_sigtrap != pending_id);
> > +			WARN_ON_ONCE(event->pending_work != pending_id);
> >  		}
> >  
> >  		event->pending_addr = 0;
> > @@ -13049,6 +13038,13 @@ static void sync_child_event(struct perf_event *child_event)
> >  		     &parent_event->child_total_time_running);
> >  }
> >  
> > +static bool task_work_cb_match(struct callback_head *cb, void *data)
> > +{
> > +	struct perf_event *event = container_of(cb, struct perf_event, pending_task);
> > +
> > +	return event == data;
> > +}
> 
> I suggest we introduce a proper API to cancel an actual callback head, see:
> 
> https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#mbfac417463018394f9d80c68c7f2cafe9d066a4b
> https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m0a347249a462523358724085f2489ce9ed91e640

This rework would work.

> >  static void
> >  perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> >  {
> > @@ -13088,6 +13084,18 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> >  		 * Kick perf_poll() for is_event_hup();
> >  		 */
> >  		perf_event_wakeup(parent_event);
> > +		/*
> > +		 * Cancel pending task_work and update counters if it has not
> > +		 * yet been delivered to userland. free_event() expects the
> > +		 * reference counter at one and keeping the event around until
> > +		 * the task returns to userland can be a unexpected if there is
> > +		 * no signal handler registered.
> > +		 */
> > +		if (event->pending_work &&
> > +		    task_work_cancel_match(current, task_work_cb_match, event)) {
> > +			put_event(event);
> > +			local_dec(&event->ctx->nr_pending);
> > +		}
> 
> So exiting task, privileged exec and also exit on exec call into this before
> releasing the children.
> 
> And parents rely on put_event() from file close + the task work.
> 
> But what about remote release of children on file close?
> See perf_event_release_kernel() directly calling free_event() on them.

Interesting things you are presenting. I had events popping up at random
even after the task had decided that it won't go back to userland to
handle them, so freeing the event looked like the only option…

> One possible fix is to avoid the reference count game around task work
> and flush them on free_event().
> 
> See here:
> 
> https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50

That wake_up() within the preempt_disable() section breaks on RT.

How do we go on from here?

> Thanks.

Sebastian
Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
Posted by Frederic Weisbecker 1 year, 10 months ago
On Tue, Apr 09, 2024 at 10:57:32AM +0200, Sebastian Andrzej Siewior wrote:
> > >  static void
> > >  perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> > >  {
> > > @@ -13088,6 +13084,18 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> > >  		 * Kick perf_poll() for is_event_hup();
> > >  		 */
> > >  		perf_event_wakeup(parent_event);
> > > +		/*
> > > +		 * Cancel pending task_work and update counters if it has not
> > > +		 * yet been delivered to userland. free_event() expects the
> > > +		 * reference counter at one and keeping the event around until
> > > +		 * the task returns to userland can be a unexpected if there is
> > > +		 * no signal handler registered.
> > > +		 */
> > > +		if (event->pending_work &&
> > > +		    task_work_cancel_match(current, task_work_cb_match, event)) {
> > > +			put_event(event);
> > > +			local_dec(&event->ctx->nr_pending);
> > > +		}
> > 
> > So exiting task, privileged exec and also exit on exec call into this before
> > releasing the children.
> > 
> > And parents rely on put_event() from file close + the task work.
> > 
> > But what about remote release of children on file close?
> > See perf_event_release_kernel() directly calling free_event() on them.
> 
> Interesting things you are presenting. I had events popping up at random
> even after the task decided that it won't go back to userland to handle
> it so letting it free looked like the only option…
> 
> > One possible fix is to avoid the reference count game around task work
> > and flush them on free_event().
> > 
> > See here:
> > 
> > https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50
> 
> That wake_up() within preempt_disable() section breaks on RT.

Ah, but the wake-up still wants to go inside recursion protection somehow or
it could generate a task_work loop again due to tracepoint events...

Although... the wake up occurs only when the event is dead after all...

> How do we go on from here?

I'd tend to think you need my patchset first because the problems it
fixes were not easily visible as long as there was an irq work to take
care of things most of the time. But once you rely on task_work only then
these become a real problem. Especially the sync against perf_release().

Thanks.

> 
> > Thanks.
> 
> Sebastian
Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
Posted by Sebastian Andrzej Siewior 1 year, 10 months ago
On 2024-04-09 14:36:51 [+0200], Frederic Weisbecker wrote:
> > That wake_up() within preempt_disable() section breaks on RT.
> 
> Ah, but the wake-up still wants to go inside recursion protection somehow or
> it could generate task_work loop again due to tracepoint events...

okay.

> Although... the wake up occurs only when the event is dead after all...

corner case or not, it has to work, right?

> > How do we go on from here?
> 
> I'd tend to think you need my patchset first because the problems it
> fixes were not easily visible as long as there was an irq work to take
> care of things most of the time. But once you rely on task_work only then
> these become a real problem. Especially the sync against perf_release().

I don't mind rebasing on top of your series. But defaulting to task_work
is not an option then?

RT-wise the irq_work is not handled in hardirq context because of the
locks it acquires; it is handled in a thread instead. Depending on its
priority, the task (receiving the event) may run before the irq_work
thread.
Therefore the task_work looked neat because the event would be handled
_before_ the task returned to userland.

Couldn't we either flush _or_ remove the task_work in perf_release()?

> Thanks.
Sebastian
Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
Posted by Frederic Weisbecker 1 year, 10 months ago
On Tue, Apr 09, 2024 at 03:47:29PM +0200, Sebastian Andrzej Siewior wrote:
> On 2024-04-09 14:36:51 [+0200], Frederic Weisbecker wrote:
> > > That wake_up() within preempt_disable() section breaks on RT.
> > 
> > Ah, but the wake-up still wants to go inside recursion protection somehow or
> > it could generate task_work loop again due to tracepoint events...
> 
> okay.
> 
> > Although... the wake up occurs only when the event is dead after all...
> 
> corner case or not, it has to work, right?

Yep.

> 
> > > How do we go on from here?
> > 
> > I'd tend to think you need my patchset first because the problems it
> > fixes were not easily visible as long as there was an irq work to take
> > care of things most of the time. But once you rely on task_work only then
> > these become a real problem. Especially the sync against perf_release().
> 
> I don't mind rebasing on top of your series. But defaulting to task_work
> is not an option then?
> 
> RT wise the irq_work is not handled in hardirq because of locks it
> acquires and is handled instead in a thread. Depending on the priority
> the task (receiving the event) it may run before the irq_work-thread.
> Therefore the task_work looked neat because the event would be handled
> _before_ the task returned to userland.

I see.
 
> Couldn't we either flush _or_ remove the task_work in perf_release()?

Right, so the problem in perf_release() is that we may be dealing with task
works of tasks other than current. In that case, task_work_cancel() is fine
if it succeeds. But if it fails, you don't have the guarantee that the task work
isn't concurrently running or about to run. And you have no way to know about
that. So then you need some sort of flushing indeed.

Thanks.

> > Thanks.
> Sebastian
Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
Posted by Sebastian Andrzej Siewior 1 year, 10 months ago
On 2024-04-10 13:37:05 [+0200], Frederic Weisbecker wrote:
> > Couldn't we either flush _or_ remove the task_work in perf_release()?
> 
> Right so the problem in perf_release() is that we may be dealing with task works
> of other tasks than current. In that case, task_work_cancel() is fine if it
> successes. But if it fails, you don't have the guarantee that the task work
> isn't concurrently running or about to run. And you have no way to know about
> that. So then you need some sort of flushing indeed.

Since perf_release() is preemptible, a wait/sleep for completion would be
best (instead of flushing).

> Thanks.
> 
> > > Thanks.

Sebastian
Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
Posted by Frederic Weisbecker 1 year, 10 months ago
On Wed, Apr 10, 2024 at 03:47:02PM +0200, Sebastian Andrzej Siewior wrote:
> On 2024-04-10 13:37:05 [+0200], Frederic Weisbecker wrote:
> > > Couldn't we either flush _or_ remove the task_work in perf_release()?
> > 
> > Right so the problem in perf_release() is that we may be dealing with task works
> > of other tasks than current. In that case, task_work_cancel() is fine if it
> > successes. But if it fails, you don't have the guarantee that the task work
> > isn't concurrently running or about to run. And you have no way to know about
> > that. So then you need some sort of flushing indeed.
> 
> Since perf_release() preemptible, a wait/sleep for completion would be
> best (instead of flushing).

Like this then?

https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50

> > Thanks.
> > 
> > > > Thanks.
> 
> Sebastian
Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
Posted by Sebastian Andrzej Siewior 1 year, 10 months ago
On 2024-04-10 16:00:17 [+0200], Frederic Weisbecker wrote:
> Le Wed, Apr 10, 2024 at 03:47:02PM +0200, Sebastian Andrzej Siewior a écrit :
> > On 2024-04-10 13:37:05 [+0200], Frederic Weisbecker wrote:
> > > > Couldn't we either flush _or_ remove the task_work in perf_release()?
> > > 
> > > Right so the problem in perf_release() is that we may be dealing with task works
> > > of other tasks than current. In that case, task_work_cancel() is fine if it
> > > successes. But if it fails, you don't have the guarantee that the task work
> > > isn't concurrently running or about to run. And you have no way to know about
> > > that. So then you need some sort of flushing indeed.
> > 
> > Since perf_release() preemptible, a wait/sleep for completion would be
> > best (instead of flushing).
> 
> Like this then?
> 
> https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50

Kind of, yes. Do we have more than one waiter? If not, maybe that
rcuwait would work then.
Otherwise (>1 waiter) we did establish that we may need a per-task
counter for recursion handling, so preempt-disable shouldn't be a problem
then. The pending_work_wq must not be used outside of task context
(meaning no hardirq or anything like that).
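
With a single waiter it could be as simple as this (untested, the rcuwait
field name is made up; rcuwait_init() would happen at event allocation):

	/* struct perf_event would grow: struct rcuwait pending_work_wait; */

	/* waiter side, e.g. from free_event(): */
	rcuwait_wait_event(&event->pending_work_wait,
			   !READ_ONCE(event->pending_work),
			   TASK_UNINTERRUPTIBLE);

	/* task_work side, once perf_sigtrap() has been delivered: */
	WRITE_ONCE(event->pending_work, 0);
	local_dec(&event->ctx->nr_pending);
	rcuwait_wake_up(&event->pending_work_wait);

The wait side needs task context, which fits the "task context only"
constraint above.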

Sebastian
Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
Posted by Frederic Weisbecker 1 year, 10 months ago
On Wed, Apr 10, 2024 at 04:06:33PM +0200, Sebastian Andrzej Siewior wrote:
> On 2024-04-10 16:00:17 [+0200], Frederic Weisbecker wrote:
> > Le Wed, Apr 10, 2024 at 03:47:02PM +0200, Sebastian Andrzej Siewior a écrit :
> > > On 2024-04-10 13:37:05 [+0200], Frederic Weisbecker wrote:
> > > > > Couldn't we either flush _or_ remove the task_work in perf_release()?
> > > > 
> > > > Right so the problem in perf_release() is that we may be dealing with task works
> > > > of other tasks than current. In that case, task_work_cancel() is fine if it
> > > > successes. But if it fails, you don't have the guarantee that the task work
> > > > isn't concurrently running or about to run. And you have no way to know about
> > > > that. So then you need some sort of flushing indeed.
> > > 
> > > Since perf_release() preemptible, a wait/sleep for completion would be
> > > best (instead of flushing).
> > 
> > Like this then?
> > 
> > https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50
> 
> Kind of, yes. Do we have more than one waiter? If not, maybe that
> rcuwait would work then.

Indeed there is only one waiter so that should work. Would
that be something you can call while preemption is disabled?

Thanks.

> Otherwise (>1 waiter) we did establish that we may need a per-task
> counter for recursion handling so preempt-disable shouldn't be a problem
> then. The pending_work_wq must not be used outside of task context (means
> no hardirq or something like that).
> 
> Sebastian
Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
Posted by Sebastian Andrzej Siewior 1 year, 10 months ago
On 2024-04-10 16:42:56 [+0200], Frederic Weisbecker wrote:
> > > Like this then?
> > > 
> > > https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50
> > 
> > Kind of, yes. Do we have more than one waiter? If not, maybe that
> > rcuwait would work then.
> 
> Indeed there is only one waiter so that should work. Would
> that be something you can call while preemption is disabled?

rcuwait_wake_up() only does a wake_up_process(), which is fine.
wake_up() does spin_lock_irqsave(), which is a no-go.

On the other hand that preempt-disable needs to go anyway due to
perf_sigtrap(). But a slim wake is a slim wake ;)

> Thanks.

Sebastian
Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
Posted by Frederic Weisbecker 1 year, 10 months ago
On Wed, Apr 10, 2024 at 04:48:21PM +0200, Sebastian Andrzej Siewior wrote:
> On 2024-04-10 16:42:56 [+0200], Frederic Weisbecker wrote:
> > > > Like this then?
> > > > 
> > > > https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50
> > > 
> > > Kind of, yes. Do we have more than one waiter? If not, maybe that
> > > rcuwait would work then.
> > 
> > Indeed there is only one waiter so that should work. Would
> > that be something you can call while preemption is disabled?
> 
> rcuwait_wake_up() does only wake_up_process() which is fine.
> wake_up() does spin_lock_irqsave() which is a no.

Duh!

> On the other hand that preempt-disable needs to go anyway due to
> perf_sigtrap(). But a slim wake is a slim wake ;)

Sure thing :)

> > Thanks.
> 
> Sebastian