[PATCH v4] signal: break out of wait loops on kthread_stop()

Jason A. Donenfeld posted 1 patch 3 years, 9 months ago
There is a newer version of this series
kernel/kthread.c | 1 +
1 file changed, 1 insertion(+)
[PATCH v4] signal: break out of wait loops on kthread_stop()
Posted by Jason A. Donenfeld 3 years, 9 months ago
I was recently surprised to learn that msleep_interruptible(),
wait_for_completion_interruptible_timeout(), and related functions
simply hung when I called kthread_stop() on kthreads using them. The
solution to fixing the case with msleep_interruptible() was more simply
to move to schedule_timeout_interruptible(). Why?

The reason is that msleep_interruptible(), and many functions just like
it, has a loop like this:

        while (timeout && !signal_pending(current))
                timeout = schedule_timeout_interruptible(timeout);

The call to kthread_stop() woke up the thread, so schedule_timeout_
interruptible() returned early, but because signal_pending() returned
true, it went back into another timeout, which was never woken up.

This wait loop pattern is common to various pieces of code, and I
suspect that the subtle misuse in a kthread that caused a deadlock in
the code I looked at last week is also found elsewhere.

So this commit causes signal_pending() to return true when
kthread_stop() is called, by setting TIF_NOTIFY_SIGNAL.

The same also probably applies to the similar kthread_park()
functionality, but that can be addressed later, as its semantics are
slightly different.

Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
Changes v3->v4:
- Don't address park() for now.
- Don't bother clearing the flag, since the task is about to be freed
  anyway.

 kernel/kthread.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 3c677918d8f2..8888987f2b25 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -704,6 +704,7 @@ int kthread_stop(struct task_struct *k)
 	kthread = to_kthread(k);
 	set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
 	kthread_unpark(k);
+	test_and_set_tsk_thread_flag(k, TIF_NOTIFY_SIGNAL);
 	wake_up_process(k);
 	wait_for_completion(&kthread->exited);
 	ret = kthread->result;
-- 
2.35.1
Re: [PATCH v4] signal: break out of wait loops on kthread_stop()
Posted by Eric W. Biederman 3 years, 9 months ago
"Jason A. Donenfeld" <Jason@zx2c4.com> writes:

> I was recently surprised to learn that msleep_interruptible(),
> wait_for_completion_interruptible_timeout(), and related functions
> simply hung when I called kthread_stop() on kthreads using them. The
> solution to fixing the case with msleep_interruptible() was more simply
> to move to schedule_timeout_interruptible(). Why?
>
> The reason is that msleep_interruptible(), and many functions just like
> it, has a loop like this:
>
>         while (timeout && !signal_pending(current))
>                 timeout = schedule_timeout_interruptible(timeout);
>
> The call to kthread_stop() woke up the thread, so schedule_timeout_
> interruptible() returned early, but because signal_pending() returned
> true, it went back into another timeout, which was never woken up.
>
> This wait loop pattern is common to various pieces of code, and I
> suspect that the subtle misuse in a kthread that caused a deadlock in
> the code I looked at last week is also found elsewhere.
>
> So this commit causes signal_pending() to return true when
> kthread_stop() is called, by setting TIF_NOTIFY_SIGNAL.
>
> The same also probably applies to the similar kthread_park()
> functionality, but that can be addressed later, as its semantics are
> slightly different.
>
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
> Changes v3->v4:
> - Don't address park() for now.
> - Don't bother clearing the flag, since the task is about to be freed
>   anyway.
>
>  kernel/kthread.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 3c677918d8f2..8888987f2b25 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -704,6 +704,7 @@ int kthread_stop(struct task_struct *k)
>  	kthread = to_kthread(k);
>  	set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
>  	kthread_unpark(k);
> +	test_and_set_tsk_thread_flag(k, TIF_NOTIFY_SIGNAL);
>  	wake_up_process(k);
>  	wait_for_completion(&kthread->exited);
>  	ret = kthread->result;

Minor it.  Unless I have missed something that should just be
set_tsk_thread_flag.  You aren't using the return value so I don't
think there is any point in testing the previous state of
TIF_NOTIFY_SIGNAL.

Eric
[PATCH v5] signal: break out of wait loops on kthread_stop()
Posted by Jason A. Donenfeld 3 years, 9 months ago
I was recently surprised to learn that msleep_interruptible(),
wait_for_completion_interruptible_timeout(), and related functions
simply hung when I called kthread_stop() on kthreads using them. The
solution to fixing the case with msleep_interruptible() was more simply
to move to schedule_timeout_interruptible(). Why?

The reason is that msleep_interruptible(), and many functions just like
it, has a loop like this:

        while (timeout && !signal_pending(current))
                timeout = schedule_timeout_interruptible(timeout);

The call to kthread_stop() woke up the thread, so schedule_timeout_
interruptible() returned early, but because signal_pending() returned
true, it went back into another timeout, which was never woken up.

This wait loop pattern is common to various pieces of code, and I
suspect that the subtle misuse in a kthread that caused a deadlock in
the code I looked at last week is also found elsewhere.

So this commit causes signal_pending() to return true when
kthread_stop() is called, by setting TIF_NOTIFY_SIGNAL.

The same also probably applies to the similar kthread_park()
functionality, but that can be addressed later, as its semantics are
slightly different.

Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
Changes v4->v5:
- Use set_tsk_thread_flag instead of test_and_set_tsk_thread_flag. Must
  have been a copy and paste mistarget.
Changes v3->v4:
- Don't address park() for now.
- Don't bother clearing the flag, since the task is about to be freed
  anyway.

 kernel/kthread.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 3c677918d8f2..7243a010f433 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -704,6 +704,7 @@ int kthread_stop(struct task_struct *k)
 	kthread = to_kthread(k);
 	set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
 	kthread_unpark(k);
+	set_tsk_thread_flag(k, TIF_NOTIFY_SIGNAL);
 	wake_up_process(k);
 	wait_for_completion(&kthread->exited);
 	ret = kthread->result;
-- 
2.35.1
Re: [PATCH v5] signal: break out of wait loops on kthread_stop()
Posted by Eric W. Biederman 3 years, 9 months ago
"Jason A. Donenfeld" <Jason@zx2c4.com> writes:

> I was recently surprised to learn that msleep_interruptible(),
> wait_for_completion_interruptible_timeout(), and related functions
> simply hung when I called kthread_stop() on kthreads using them. The
> solution to fixing the case with msleep_interruptible() was more simply
> to move to schedule_timeout_interruptible(). Why?
>
> The reason is that msleep_interruptible(), and many functions just like
> it, has a loop like this:
>
>         while (timeout && !signal_pending(current))
>                 timeout = schedule_timeout_interruptible(timeout);
>
> The call to kthread_stop() woke up the thread, so schedule_timeout_
> interruptible() returned early, but because signal_pending() returned
> true, it went back into another timeout, which was never woken up.
>
> This wait loop pattern is common to various pieces of code, and I
> suspect that the subtle misuse in a kthread that caused a deadlock in
> the code I looked at last week is also found elsewhere.
>
> So this commit causes signal_pending() to return true when
> kthread_stop() is called, by setting TIF_NOTIFY_SIGNAL.
>
> The same also probably applies to the similar kthread_park()
> functionality, but that can be addressed later, as its semantics are
> slightly different.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

Do I need to pick this up and put it on a topic branch?
Or does this patch have another patch to get merged?


Eric

> Cc: Eric W. Biederman <ebiederm@xmission.com>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
> Changes v4->v5:
> - Use set_tsk_thread_flag instead of test_and_set_tsk_thread_flag. Must
>   have been a copy and paste mistarget.
> Changes v3->v4:
> - Don't address park() for now.
> - Don't bother clearing the flag, since the task is about to be freed
>   anyway.
>
>  kernel/kthread.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 3c677918d8f2..7243a010f433 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -704,6 +704,7 @@ int kthread_stop(struct task_struct *k)
>  	kthread = to_kthread(k);
>  	set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
>  	kthread_unpark(k);
> +	set_tsk_thread_flag(k, TIF_NOTIFY_SIGNAL);
>  	wake_up_process(k);
>  	wait_for_completion(&kthread->exited);
>  	ret = kthread->result;
Re: [PATCH v5] signal: break out of wait loops on kthread_stop()
Posted by Jason A. Donenfeld 3 years, 9 months ago
Hi Eric,

On Mon, Jul 11, 2022 at 07:00:25PM -0500, Eric W. Biederman wrote:
> "Jason A. Donenfeld" <Jason@zx2c4.com> writes:
> 
> > I was recently surprised to learn that msleep_interruptible(),
> > wait_for_completion_interruptible_timeout(), and related functions
> > simply hung when I called kthread_stop() on kthreads using them. The
> > solution to fixing the case with msleep_interruptible() was more simply
> > to move to schedule_timeout_interruptible(). Why?
> >
> > The reason is that msleep_interruptible(), and many functions just like
> > it, has a loop like this:
> >
> >         while (timeout && !signal_pending(current))
> >                 timeout = schedule_timeout_interruptible(timeout);
> >
> > The call to kthread_stop() woke up the thread, so schedule_timeout_
> > interruptible() returned early, but because signal_pending() returned
> > true, it went back into another timeout, which was never woken up.
> >
> > This wait loop pattern is common to various pieces of code, and I
> > suspect that the subtle misuse in a kthread that caused a deadlock in
> > the code I looked at last week is also found elsewhere.
> >
> > So this commit causes signal_pending() to return true when
> > kthread_stop() is called, by setting TIF_NOTIFY_SIGNAL.
> >
> > The same also probably applies to the similar kthread_park()
> > functionality, but that can be addressed later, as its semantics are
> > slightly different.
> 
> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> Do I need to pick this up and put it on a topic branch?
> Or does this patch have another patch to get merged?

I think it'd make sense to put this in a topic branch.

I discovered this in the process of developing [1] (which could use an
Ack for the wake_up_state EXPORT_SYMBOL, by the way). That's marked
for stable@ because the breakage there is kind of bad -- people can't put
their laptops to sleep right now. After both this patch and that one
land, I may then revisit the ath9k one and make changes for non-stable@.
That is, of course, if [1] even lands; right now I fear it might be
relegated to scheduler bikeshed purgatory and I don't have the time
right now to deal with that.

Longer term, this patch here will let me rework [1] to get rid of the
set_notify_signal() trick and use a proper condition variable wait with
`wait_for_condition_interruptible_timeout`, since this patch makes that
function work right for both contexts in which hwrng devices are called
(kthread and process). But that will mean reworking the hwrng API, which
means writing patches for every single hwrng driver, so that's work for
another time.

In the mean time, [1] is a good way forward (provided it gets acked).
And then this patch puts things in a good position to improve down the
line. I did a bunch of testing of this patch when trying out different
candidates for [1] before settling on [1] as a good-enough intermediate
solution. 

So, anyway, feel free to put this in a topic branch.

Jason

[1] https://lore.kernel.org/lkml/20220629114240.946411-1-Jason@zx2c4.com/